'''Performance''' refers to the speed with which programs are executed in a particular language implementation. While a language such as APL cannot inherently be fast or slow, it is often described as being well suited to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance. Currently developed array-family implementations that advertise high performance include [[Dyalog APL]], [[J]], [[K]] (both Kx and Shakti), and [[Q]], while research projects focused primarily on performance include [[APEX]], [[Co-dfns]], [[SaC]], [[Futhark]], and [[TAIL]].


While dynamically-typed interpreted languages are typically considered to be slow (that is, by nature they lead implementations to run slowly), APL code which uses primarily flat arrays has been described as an excellent fit for modern hardware,<ref>Martin Thompson. "Rectangles All The Way Down" ([https://www.dyalog.com/uploads/conference/dyalog18/presentations/U12_Rectangles_All_The_Way_Down.pdf slides], [https://dyalog.tv/Dyalog18/?v=mK2WUDIY4hk video]) at [[Dyalog '18]].</ref> and [[Dyalog APL]] can in some cases perform better than straightforward [[wikipedia:C (programming language)|C]] implementations.<ref>Matthew Maycock. [https://ummaycoc.github.io/wc.apl/ Beating C with Dyalog APL: wc]. 2019-10.</ref><ref name="advantage">[[Marshall Lochbaum]]. "The Interpretive Advantage" ([https://www.dyalog.com/user-meetings/uploads/conference/dyalog18/presentations/D15_The_Interpretive_Advantage.zip slides (0.5 MB)], [https://dyalog.tv/Dyalog18/?v=-6no6N3i9Tg video]) at [[Dyalog '18]].</ref> Taking advantage of a high-performance implementation often requires writing in a flatter style, with few or no [[box]]es or [[nested]] arrays, and compiled or GPU-based APLs may not fully support nested arrays.


== Performant implementation ==
Even the first APL implementation, [[APL\360]], was considered fast for an interpreted language: [[Larry Breed]] said it executed programs "often one-tenth to one-fifth as fast as compiled code". He attributed its high performance to fast array operations, with development guided by analysis of user code, and its low system overhead to a well-implemented supervisor with complete control over system resources.<ref>[[Larry Breed]]. [https://dl.acm.org/doi/10.1145/2402536.2402581 "The Implementation of APL\360"]. 1967-08.</ref> Performance of system operations remained a point of focus for various dialects in the [[time-sharing]] era, but in modern times resources such as files are simply accessed through the host operating system.
 
=== Internal datatypes ===
{{Main|Internal datatypes}}
Most APLs expose only a small number of scalar types to the user: one or two [[numeric]] types (such as double-precision real or [[complex]] numbers), and a single [[character]] type. However, for performance reasons these types can be implemented internally using various subset types. For example, [[APL\360#Internal types|APL\360 uses]] numeric arrays of 1-bit [[Boolean]]s, 4-byte integers, or 8-byte floating point numbers, but converts between them transparently so that from the user's perspective all numbers behave like 8-byte floats (as this type contains the others). In [[Dyalog APL]] this hierarchy is [[Dyalog APL#Internal types|significantly expanded]], adding 1-byte and 2-byte integers as well as 16-byte [[complex]] numbers containing the other types (however, Dyalog also allows the user access to [[decimal float]]s if requested, which breaks the strict hierarchy).
 
When working with large arrays, an implementation can dynamically choose the types of arrays as execution progresses. For some operations it is advantageous to force an array into the smallest possible type, a procedure known as "squeezing". The ability to dynamically change array type can be a practical advantage of interpreted array languages over statically typed compiled languages, since the interpreter is sometimes able to choose a smaller type than the compiler. This may be because the programmer chooses a suboptimal type, or because the interpreter can take advantage of situations where an array could possibly require a larger type but doesn't in a particular run of the program. With an implementation using [[vector instruction]]s, a smaller internal type can directly translate to faster execution because a vector register (and hence a vector operation) can fit more elements when they are smaller.<ref name="advantage"/>
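In [[Dyalog APL]] the current internal type of an array can be inspected with monadic <code>⎕DR</code>, making the transparent conversions visible. The session sketch below uses Dyalog-specific type codes; exactly when an interpreter squeezes an array is an implementation detail that may vary between versions:
<syntaxhighlight lang=apl>
      ⎕DR 1 0 1 1          ⍝ 11: 1-bit Boolean
11
      ⎕DR 1 2 3            ⍝ 83: 1-byte integer
83
      ⎕DR 1000 2000 3000   ⍝ 163: 2-byte integer
163
      ⎕DR 0.5 2 3          ⍝ 645: 8-byte float
645
</syntaxhighlight>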


=== Fast array operations ===
{{Main|APL primitive performance}}
Most of the effort in optimizing mainstream APL implementations is focused on optimizing particular array operations.<ref>[[Morten Kromberg]] and [[Roger Hui]]. "D11: Primitive Performance" ([https://www.dyalog.com/uploads/conference/dyalog13/presentations/D11_Primitive_Performance.pps slides (1.3 MB)], [https://www.dyalog.com/uploads/conference/dyalog13/presentations/D11_Primitive_Performance materials (1.4 MB)], [https://dyalog.tv/Dyalog13/?v=84t87EO5ZEE video]) at [[Dyalog '13]].</ref>
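The effect of such special-casing is easy to observe: a reduction with a primitive operand is handled by dedicated code, while an equivalent user-defined operand falls back to a general loop. A sketch using the <code>cmpx</code> timing utility from Dyalog's <code>dfns</code> workspace (actual timings depend on hardware and interpreter version):
<syntaxhighlight lang=apl>
      'cmpx' ⎕CY 'dfns'      ⍝ copy the timing utility
      v ← ?1e6⍴100           ⍝ a million small integers
      cmpx '+/v' '{⍺+⍵}/v'   ⍝ special-cased +/ is typically far faster
</syntaxhighlight>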


==== Implementing APL with APL ====
{{Main|Magic function}}
The technique of implementing APL primitives using other primitives, or even simpler cases of the same primitive, can be advantageous for performance in addition to being easier for the implementer.<ref>[[Roger Hui]]. [http://www.dyalog.com/blog/2015/06/in-praise-of-magic-functions-part-one/ "In Praise of Magic Functions: Part I"]. [[Dyalog Ltd.|Dyalog]] blog. 2015-06-22.</ref> Even when a primitive does not use APL directly, reasoning in APL can lead to faster implementation techniques.<ref>[[Marshall Lochbaum]]. [https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/ "Expanding Bits in Shrinking Time"]. [[Dyalog Ltd.|Dyalog]] blog. 2018-06-11.</ref>
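For example, [[Unique]] on a vector can be modelled with [[Index Of]]: an element belongs to the result exactly when it is its own first occurrence. This is an illustrative model in the spirit of magic functions, not any particular interpreter's actual definition:
<syntaxhighlight lang=apl>
      Nub ← {((⍵⍳⍵)=⍳≢⍵)⌿⍵}   ⍝ keep elements that are their own first occurrence
      Nub 'mississippi'
misp
</syntaxhighlight>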
 
=== Alternate array representations ===
Internally, APL arrays are usually stored as two lists in memory. The first is a list of the shape (although some implementations also include the "stride"<ref>Nick Nickolov. "Compiling APL to JavaScript". ''Vector'' Volume 26.</ref>). The second is the ravel of elements in the array. Nested arrays consist of pointers to arrays which may be distributed across memory, so their use can lead to very inefficient memory read patterns, in contrast to flat arrays, which are stored as a contiguous block.
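These two lists are exactly what [[Shape]] (<code>⍴</code>) and [[Ravel]] (<code>,</code>) return, so the flat layout can be pictured from within the language:
<syntaxhighlight lang=apl>
      m ← 2 3⍴'abcdef'
      ⍴m     ⍝ the shape list
2 3
      ,m     ⍝ the ravel: one contiguous block in row-major order
abcdef
</syntaxhighlight>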
 
=== Operation merging and dynamic compilation ===
=== Ahead-of-time compilation ===
* [https://snakeisland.com/ Snake Island Research]
== Performant usage ==
For the user, there are a few strategies to consider for reasonable performance.
=== Changing representation ===
While an APL user cannot change the way the language stores their arrays, a common optimization strategy is to improve the layout of data within arrays in a program. This is typically done to reduce the use of [[nested]] arrays with many leaves in favor of one or a few flat arrays. The most obvious such improvement is simply to change a nested array in which all child arrays have the same shape into a higher-rank array (the [[Mix]] of the nested array); the [[Rank operator]] can make working with such arrays easier. [[Roger Hui]] has advocated the use of [[inverted table]]s to store database-like tables, consisting of a [[matrix]] where [[element]]s in each column share a single type but different columns may have different types.<ref>[[Roger Hui]]. "Inverted Tables" ([https://www.dyalog.com/uploads/conference/dyalog18/presentations/D14_Inverted_Tables.zip slides (0.9 MB)], [https://dyalog.tv/Dyalog18/?v=IOWDkqKbMwk video]) at [[Dyalog '18]].</ref> [[Bob Smith]], before the introduction of [[Nested array model|nested]] APLs, suggested using a [[Boolean]] partition vector (like the one used by [[Partitioned Enclose]]) to encode vectors of vectors in flat arrays,<ref>[[Bob Smith]]. [https://doi.org/10.1145/800136.804488 "A programming technique for non-rectangular data"] (included in [http://www.sudleyplace.com/APL/boolean.pdf Boolean functions (pdf)]) at [[APL79]].</ref> and [[Aaron Hsu]] has developed techniques for working with [[wikipedia:Tree (data structure)|trees]] using flat depth, parent, or sibling vectors.<ref>[[Aaron Hsu]]. "High-performance Tree Wrangling, the APL Way" ([https://www.dyalog.com/uploads/conference/dyalog18/presentations/U19_Tree_Wrangling_the_APL_Way.pdf slides (0.3 MB)], [https://dyalog.tv/Dyalog18/?v=hzPd3umu78g video]) at [[Dyalog '18]].</ref>
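A minimal sketch of these representations, with invented data (Dyalog APL, default <code>⎕ML</code>):
<syntaxhighlight lang=apl>
      ↑'abc' 'def'                   ⍝ Mix: a nested vector becomes a flat 2×3 matrix
abc
def
      names ← 'Ann' 'Bob' 'Carol'    ⍝ inverted table: one array per column
      ages  ← 35 28 52
      (⊂ages>30)/¨names ages         ⍝ rows with age over 30: ('Ann' 'Carol')(35 52)
      1 0 0 1 0 1 0 0⊂'abcdefgh'     ⍝ Partitioned Enclose: 'abc' 'de' 'fgh'
</syntaxhighlight>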


== References ==
<references />
{{APL performance}}[[Category:Performance| ]]
