Performance

Performance refers to the speed with which programs are executed in a particular language implementation. While a language such as APL cannot inherently be fast or slow, it is often described as well suited to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance. Currently developed array-family implementations that advertise high performance include Dyalog APL, J, K (both Kx and Shakti), and Q, while research projects focused primarily on performance include APEX, Co-dfns, SaC, Futhark, and TAIL.

While dynamically typed interpreted languages are typically considered slow (that is, by nature they lead implementations to run slowly), APL code that primarily uses flat arrays has been described as an excellent fit for modern hardware,[1] and Dyalog APL can in some cases perform better than straightforward C implementations.[2][3] Taking advantage of a high-performance implementation often requires writing in a flatter style, with few or no boxes or nested arrays, and compiled or GPU-based APLs may not fully support nested arrays.

Performant implementation

Even the first APL implementation, APL\360, was considered fast for an interpreted language: Larry Breed said it executed programs "often one-tenth to one-fifth as fast as compiled code". He attributed its high performance to fast array operations with development guided by analysis of user code, and its low system overhead to a well-implemented supervisor with complete control over system resources.[4] Performance of system operations remained a point of focus for various dialects in the time-sharing era, but in modern times resources such as files are simply accessed through the host operating system.

Internal datatypes

Main article: Internal type

Most APLs expose only a small number of scalar types to the user: one or two numeric types (such as double-precision real or complex numbers), and a single character type. However, for performance reasons these types can be implemented internally using various subset types. For example, APL\360 uses numeric arrays of 1-bit Booleans, 4-byte integers, or 8-byte floating point numbers, but converts between them transparently so that from the user's perspective all numbers behave like 8-byte floats (as this type contains the others). In Dyalog APL this hierarchy is significantly expanded, adding 1-byte and 2-byte integers as well as 16-byte complex numbers containing the other types (however, Dyalog also allows the user access to decimal floats if requested, which breaks the strict hierarchy).
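
This hierarchy can be observed directly in Dyalog APL, where the system function ⎕DR reports the internal representation of an array. A minimal sketch (the numeric codes shown are Dyalog-specific):

      ⎕DR 0 1 1 0        ⍝ 1-bit Booleans
11
      ⎕DR 1 2 3          ⍝ 1-byte integers
83
      ⎕DR 100000         ⍝ too large for 2 bytes: 4-byte integers
323
      ⎕DR 1.5            ⍝ 8-byte floats
645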

When working with large arrays, an implementation can dynamically choose the type of arrays as execution progresses. For some operations it is advantageous to force an array to the smallest possible type, a procedure known as "squeezing". The ability to dynamically change array type can be a practical advantage of interpreted array languages over statically typed compiled languages, since the interpreter is sometimes able to choose a smaller type than the compiler. This may be because the programmer chooses a suboptimal type or because the interpreter can take advantage of situations where an array could possibly require a larger type, but doesn't in a particular instance of a program. With an implementation using vector instructions, a smaller internal type can directly translate to faster execution because a vector register (and hence a vector operation) can fit more elements when they are smaller.[3]
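
Continuing the Dyalog ⎕DR sketch above, the internal type of a result depends on the values actually produced rather than on any declared type (exact choices may vary between interpreter versions):

      ⎕DR 2 3 4 > 3      ⍝ comparison results fit 1-bit Booleans
11
      ⎕DR ⍳100           ⍝ small indices fit 1-byte integers
83
      ⎕DR 2*40           ⍝ exceeds 4-byte integer range: stored as 8-byte floats
645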

Fast array operations

Main article: APL primitive performance

Most of the effort in optimizing mainstream APL implementations goes into speeding up particular array operations.[5]

Implementing APL with APL

Main article: Magic function

The technique of implementing APL primitives using other primitives, or even simpler cases of the same primitive, can be advantageous for performance in addition to being easier for the implementer.[6] Even when a primitive does not use APL directly, reasoning in APL can lead to faster implementation techniques.[7]
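
As an illustration (a hypothetical model, not any implementation's actual code), the Where primitive on a Boolean vector can be expressed with Replicate and Iota, assuming index origin 1:

      Where ← {⍵/⍳≢⍵}     ⍝ model of monadic ⍸ on a Boolean vector
      Where 0 1 0 0 1 1
2 5 6
      ⍸ 0 1 0 0 1 1       ⍝ the primitive agrees
2 5 6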

Alternate array representations

Internally, APL arrays are usually stored as two lists in memory: the first holds the shape (although some implementations also include the "stride"[8]), and the second holds the ravel of the array's elements. Because nested arrays consist of pointers to arrays that may be scattered across memory, their use can lead to very inefficient memory read patterns, in contrast to flat arrays, which are stored as a single contiguous block.
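
The two stored lists correspond directly to the primitives Shape and Ravel. For a rank-2 array:

      x ← 2 3⍴10 20 30 40 50 60
      ⍴x         ⍝ the shape list
2 3
      ,x         ⍝ the ravel: elements in row-major order
10 20 30 40 50 60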

Reference counting and data reuse

Because APL's immutable arrays do not permit circular references (although other features like objects might), APL implementations almost universally use reference counting as a memory management technique. In some implementations, such as Dyalog APL, reference counting is supplemented with tracing garbage collection, which is run infrequently to handle circular references.

Because reference counting keeps track of the exact number of references, and not just whether an array is referenced or not, it can be used not only to find when an array can be released (reference count 0), but also to find when it can be reused when passed as an argument (reference count 1).[9] When permitted, reusing arguments can reduce memory usage and improve cache locality. In some cases, it also allows for faster primitive implementations: for example, Reshape can change only an array's shape while leaving its ravel data in place, Take can free trailing major cells from an array while leaving the remainder, and At can modify only part of an array.
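
A sketch of how this plays out, assuming a reference-counting interpreter as described above (the reuse happens inside the interpreter and is not visible in the results):

      v ← 1000000⍴0
      v ← 2 500000⍴v    ⍝ data referenced once: reshape may reuse it, changing only the shape
      w ← v             ⍝ the data now has two references
      u ← 10⍴w          ⍝ w must survive, so the interpreter copies instead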

Operation merging and dynamic compilation

Ahead-of-time compilation

Performant usage

On the user's side, a few strategies can help achieve reasonable performance.

Changing representation

While an APL user cannot change the way the language stores arrays, a common optimization strategy is to improve the layout of data within the arrays in a program. This typically means reducing the use of nested arrays with many leaves in favor of one or a few flat arrays. The most obvious such improvement is simply to change a nested array in which all child arrays have the same shape into a higher-rank array (the Mix of the nested array), as sketched below; the Rank operator can make working with such arrays easier. Roger Hui has advocated the use of inverted tables to store database-like tables: conceptually a matrix in which elements within each column share a single type but different columns may have different types.[10] Bob Smith, before the introduction of nested APLs, suggested using a Boolean partition vector (like the one used by Partitioned Enclose) to encode vectors of vectors in flat arrays,[11] and Aaron Hsu has developed techniques for working with trees using flat depth, parent, or sibling vectors.[12]
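
A minimal sketch of the first technique, flattening a uniform nested vector with Mix (index origin 1 assumed):

      nested ← (1 2 3)(4 5 6)(7 8 9)   ⍝ pointers to three separate arrays
      flat ← ↑nested                    ⍝ Mix: one contiguous rank-2 array
      ⍴flat
3 3
      flat[2;]                          ⍝ indexing (or the Rank operator) recovers rows
4 5 6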

References

  1. Martin Thompson. "Rectangles All The Way Down" (slides, video) at Dyalog '18.
  2. Matthew Maycock. "Beating C with Dyalog APL: wc". 2019-10.
  3. Marshall Lochbaum. "The Interpretive Advantage" (slides (0.5 MB), video) at Dyalog '18.
  4. Larry Breed. "The Implementation of APL\360". 1967-08.
  5. Morten Kromberg and Roger Hui. "D11: Primitive Performance" (slides (1.3 MB), materials (1.4 MB), video) at Dyalog '13.
  6. Roger Hui. "In Praise of Magic Functions: Part I". Dyalog blog. 2015-06-22.
  7. Marshall Lochbaum. "Expanding Bits in Shrinking Time". Dyalog blog. 2018-06-11.
  8. Nick Nickolov. "Compiling APL to JavaScript". Vector, Volume 26.
  9. Brian Cantrill. "A Conversation with Arthur Whitney". 2009.
  10. Roger Hui. "Inverted Tables" (slides (0.9 MB), video) at Dyalog '18.
  11. Bob Smith. "A programming technique for non-rectangular data" (included in Boolean functions (pdf)) at APL79.
  12. Aaron Hsu. "High-performance Tree Wrangling, the APL Way" (slides (0.3 MB), video) at Dyalog '18.
