Performance

From APL Wiki

Performance refers to the speed with which programs are executed in a particular language implementation. While a language such as APL cannot inherently be fast or slow, it is often described as well suited to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance. Currently developed array-family implementations that advertise high performance include Dyalog APL, J, K (both Kx and Shakti), and Q, while research projects focused primarily on performance include APEX, Co-dfns, SaC, Futhark, and TAIL.

While dynamically-typed interpreted languages are typically considered to be slow (that is, by nature they lead implementations to run slowly), APL code which uses primarily flat arrays has been described as an excellent fit for modern hardware,[1] and Dyalog APL can in some cases perform better than straightforward C implementations.[2][3] Taking advantage of a high-performance implementation often requires writing in a flatter style, with few or no boxes or nested arrays, and compiled or GPU-based APLs may not fully support nested arrays.

Arrays and performance

The central concept underlying the high-performance array languages currently in use is to take advantage of the regularity of arrays. Implementations provide fast array algorithms for primitives, or for short combinations of primitives, so that a program can make effective use of computer hardware if these primitives make up the bulk of its computation. Arrays allow for these fast algorithms when they are regular: they have a rectangular shape, which is part of the definition of a multidimensional array, and all their elements can be stored with the same type. The type used should also be supported by the hardware (usually, the CPU's vector processor), which for example makes IEEE 754 floats the standard floating-point format in high-performance APLs.
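As an illustrative sketch (in Python with NumPy standing in for an APL's internals, not any actual implementation), a homogeneous array can be processed by a single typed kernel, while an array of boxed per-element objects forces per-element dispatch:

```python
import numpy as np

# Homogeneous array: every element shares one machine type and the data is
# contiguous, so a primitive like sum can run as one tight typed loop.
homogeneous = np.arange(1_000_000, dtype=np.float64)

# "Boxed"-style array: each element is a separate object behind a pointer,
# so the same primitive must inspect and dispatch on every element.
boxed = homogeneous.astype(object)

print(homogeneous.dtype)   # one type recorded once for the whole array
print(boxed.dtype)         # object: per-element pointers
print(homogeneous.sum())
```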

When APL programs are evaluated as a sequence of array operations that don't interact with each other, the time taken for a program is simply the total of the time taken by each primitive. Two different types of costs contribute to a primitive's performance: "fixed" costs that don't scale with the size of the arguments (but could vary somewhat based on other properties), and "variable" costs that do. These variable costs are usually proportional to the argument or result size, but may be more complicated. The simplest model of array primitive performance that takes this divide into account is that a given primitive has a constant "overhead" time for each evaluation, and then processes its input at a fixed rate, called "throughput". The advantage of array programming is that this throughput can be made much higher than in other paradigms. However, when the arrays used are small, a program's performance will instead be dominated by overhead, which tends to be high even relative to other interpreted languages because most APL interpreters store all data in arrays. Overhead for a primitive might be tens or hundreds of nanoseconds while throughput can be many elements per nanosecond, so that per-element processing begins to be the dominant cost at arrays larger than about a thousand elements.
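This cost model can be sketched in a few lines of Python; the overhead and throughput figures below are illustrative placeholders, not measurements of any particular implementation:

```python
# Minimal cost model for an array primitive: a fixed per-call overhead plus
# a per-element cost (the inverse of throughput). All figures are invented
# for illustration.
def primitive_time_ns(n, overhead_ns=500.0, throughput_elems_per_ns=2.0):
    """Time to apply one primitive to an n-element array, in nanoseconds."""
    return overhead_ns + n / throughput_elems_per_ns

# Crossover: the array size at which the variable cost equals the overhead.
def crossover(overhead_ns=500.0, throughput_elems_per_ns=2.0):
    return overhead_ns * throughput_elems_per_ns

print(primitive_time_ns(10))         # small array: dominated by overhead
print(primitive_time_ns(1_000_000))  # large array: dominated by throughput
print(crossover())                   # about a thousand elements here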

To make array operations fast, arrays are stored in a regular format in memory, so that all data about the format can be kept in the array's "header" and read only before, not during, data processing. This means that elements in the array must have the same storage type for efficient operation. Flat array model languages with or without boxes require that arrays have a homogeneous type (all numbers, all characters, or all boxes); nested APLs do not, but store homogeneous arrays differently and aren't as efficient when processing mixed-type arrays. Fast APLs tend to use several internal types to operate efficiently on subsets of the available numbers like integers or booleans, and may coerce numbers to a common type to ensure that an array is homogeneous in memory.
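NumPy, used here only as a stand-in for an APL interpreter's internal storage, shows the same coercion of mixed numbers to a common type so the array stays homogeneous in memory:

```python
import numpy as np

# A list of integers is stored with an integer type; one float among them
# coerces the whole array to a floating-point type, keeping storage uniform.
a = np.array([1, 2, 3])      # homogeneous integer storage
b = np.array([1, 2, 3.5])    # the single float forces a float array

print(a.dtype.kind)  # 'i': integer
print(b.dtype)       # every element now shares one storage type
print(b.tolist())
```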

In the simplest case, homogeneous array storage allows an implementation to use specialized code to run a primitive on each type. Because this special code is written in a low-level language with complete knowledge of the types in use, it can attain a throughput similar to that of the implementation language (usually a low-level language such as C) for the primitive—although this will tend to leave overall algorithms running slower than that language because in APL they have been split into multiple passes. APL's array operations are also ideal for implementation with SIMD, or "single instruction, multiple data", operations, which perform a single action on several different values at once. In some cases, such as scalar functions, the primitives are SIMD operations; in others such as Reverse, they are easily implemented using SIMD—for Reverse, SIMD selection or "shuffle". While experimental SIMD machines (such as the APL-influenced CDC Star-100) were created as early as the 1960s, SIMD computing first entered the personal computing mainstream in the 1990s and has steadily grown in prominence for high-performance computing since then. In APL, CPU vector instruction sets such as Intel's SSE are the most common way to access SIMD optimization, although Co-dfns instead runs on a GPU to attain much higher throughput at the cost of increased overhead and a restricted set of available algorithms.
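A rough illustration in NumPy: elementwise arithmetic and Reverse are whole-array operations that a SIMD implementation can map onto vector registers. NumPy here models only the array-at-a-time interface, not the actual instruction selection:

```python
import numpy as np

# Scalar functions apply one operation to every element -- the SIMD pattern.
x = np.arange(8, dtype=np.int32)
doubled = x + x        # one vectorized kernel rather than a per-element loop

# Reverse is pure data movement; a SIMD implementation would realize it with
# shuffle/permute instructions over whole registers at a time.
reversed_x = x[::-1]

print(doubled.tolist())
print(reversed_x.tolist())
```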

Not all primitives are equally amenable to SIMD implementation or other high-performance techniques. In order to get the most out of an APL implementation, the programmer must take advantage of the fast operations available, which may require significant testing to determine when primitives perform well and alternative strategies either to work around cases where an operation is slow or to take advantage of information the programmer knows but the language doesn't.

Performant implementation

Even the first APL implementation, APL\360, was considered fast for an interpreted language: Larry Breed said it executed programs "often one-tenth to one-fifth as fast as compiled code". He attributed its high performance to fast array operations with development guided by analysis of user code, and its low system overhead to a well-implemented supervisor with complete control over system resources.[4] Performance of system operations remained a point of focus for various dialects in the time-sharing era, but in modern times resources such as files are simply accessed through the host operating system.

Internal datatypes

Main article: Internal type

Most APLs expose only a small number of scalar types to the user: one or two numeric types (such as double-precision real or complex numbers), and a single character type. However, for performance reasons these types can be implemented internally using various subset types. For example, APL\360 uses numeric arrays of 1-bit Booleans, 4-byte integers, or 8-byte floating-point numbers, but converts between them transparently so that from the user's perspective all numbers behave like 8-byte floats (as this type contains the others). In Dyalog APL this hierarchy is significantly expanded, adding 1-byte and 2-byte integers as well as 16-byte complex numbers containing the other types (however, Dyalog also gives the user access to decimal floats on request, which breaks the strict hierarchy).
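A hierarchy of widening numeric types like the one described above can be sketched with NumPy dtypes of matching widths. Two caveats: these are NumPy's names, not any APL's internal ones, and NumPy stores Booleans one per byte rather than packed one per bit as fast APLs do:

```python
import numpy as np

# Internal numeric types from narrowest to widest; each wider type can
# represent every value of the narrower ones.
hierarchy = [np.bool_, np.int8, np.int16, np.int32, np.float64, np.complex128]

for t in hierarchy:
    d = np.dtype(t)
    print(d.name, d.itemsize, "byte(s) per element")
```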

When working with large arrays, an implementation can dynamically choose the type of arrays as execution progresses. For some operations it is advantageous to force an array to the smallest possible type, a procedure known as "squeezing". The ability to dynamically change array type can be a practical advantage of interpreted array languages over statically typed compiled languages, since the interpreter is sometimes able to choose a smaller type than the compiler. This may be because the programmer chooses a suboptimal type or because the interpreter can take advantage of situations where an array could possibly require a larger type, but doesn't in a particular run of the program. With an implementation using vector instructions, a smaller internal type can directly translate to faster execution because a vector register (and hence a vector operation) can fit more elements when they are smaller.[3]
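A minimal sketch of squeezing for integer arrays, in Python with NumPy; the function name and the restriction to signed integer types are illustrative assumptions, not how any particular APL implements it:

```python
import numpy as np

def squeeze_ints(a):
    """Demote an integer array to the smallest signed type that holds
    its current values exactly (a sketch of "squeezing")."""
    lo, hi = a.min(), a.max()
    for t in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max:
            return a.astype(t)
    return a

small = squeeze_ints(np.array([0, 1, 1, 0], dtype=np.int64))
medium = squeeze_ints(np.array([0, 40000], dtype=np.int64))

print(small.dtype)   # Boolean-valued data fits the narrowest integer type
print(medium.dtype)  # 40000 exceeds int16, so int32 is the smallest fit
```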

Fast array operations

Main article: APL primitive performance

Most of the effort in optimizing mainstream APL implementations is focused on optimizing particular array operations.[5]

Implementing APL with APL

Main article: Magic function

The technique of implementing APL primitives using other primitives, or even simpler cases of the same primitive, can be advantageous for performance in addition to being easier for the implementer.[6] Even when a primitive does not use APL directly, reasoning in APL can lead to faster implementation techniques.[7]

Alternate array representations

Internally, APL arrays are usually stored as two lists in memory. The first is a list of the dimensions, the shape (although it's also possible to store the "stride", enabling different views of the same data[8][9]). The second is the ravel of elements in the array. Because nested arrays consist of pointers to arrays that may be scattered across memory, their use can lead to very inefficient memory read patterns, in contrast to flat arrays, which are stored as a contiguous block.
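NumPy exposes this header-plus-ravel layout directly, including strides, and can be used to illustrate it:

```python
import numpy as np

# An array as a header (shape, strides) plus a flat ravel of elements.
a = np.array([[0, 1, 2],
              [3, 4, 5]])

print(a.shape)             # the shape list kept in the header
print(a.ravel().tolist())  # the ravel: elements in one contiguous block
print(a.strides)           # byte step per axis, also header-only data

# A strided header can describe a different view of the same ravel
# without copying any data:
t = a.T
print(t.base is a)         # the transpose shares a's storage
```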

Reference counting and data reuse

Because APL's immutable arrays do not permit circular references (although other features like objects might), APL implementations almost universally use reference counting as a memory management technique. In some implementations, such as Dyalog APL, reference counting is supplemented with tracing garbage collection, which is run infrequently to handle circular references.

Because reference counting keeps track of the exact number of references, and not just whether an array is referenced or not, it can be used not only to find when an array can be released (reference count 0), but also to find when it can be reused when passed as an argument (reference count 1).[10] When permitted, reusing arguments can reduce memory usage and improve cache locality. In some cases, it also allows for faster primitive implementations: for example, Reshape can change only an array's shape while leaving its ravel data in place, Take can free trailing major cells from an array while leaving the remainder, and At can modify only part of an array.
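CPython also manages memory by reference counting, so its sys.getrefcount can illustrate the idea: the count reports how many owners a value has, and creating an alias raises it, which is exactly what makes in-place reuse unsafe:

```python
import sys

# A freshly created value has a known, minimal reference count; an
# implementation may reuse such a value in place.
data = list(range(10))
before = sys.getrefcount(data)   # references held to `data` right now

alias = data                     # a second owner: in-place reuse is unsafe
after = sys.getrefcount(data)

print(before, after)             # aliasing raises the count by one
```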

Operation merging and dynamic compilation

Ahead-of-time compilation

APL hardware

Main article: APL hardware

APL hardware is hardware that has been designed to natively support APL array operations, and was a topic of some interest in the 1970s and 80s. APL hardware in this sense has not been developed. However, SIMD and vector processors can be used for similar purposes and in some cases are directly inspired by APL. SIMD instructions are now widely available on consumer hardware, having been introduced to Intel's processors beginning in the late 90s.

Performant usage

For the user, there are a few strategies to consider for reasonable performance.

Changing representation

While an APL user cannot change the way the language stores their arrays, a common optimization strategy is to improve the layout of data within arrays in a program. This typically means reducing the use of nested arrays with many leaves in favor of one or a few flat arrays. The most obvious such improvement is simply to change a nested array in which all child arrays have the same shape into a higher-rank array (the Mix of the nested array); the Rank operator can make working with such arrays easier. Roger Hui has advocated the use of inverted tables to store database-like tables: the table is kept as a collection of column arrays, where elements in each column share a single type but different columns may have different types.[11] Bob Smith, before the introduction of nested APLs, suggested using a Boolean partition vector (like the one used by Partitioned Enclose) to encode vectors of vectors in flat arrays,[12] and Aaron Hsu has developed techniques for working with trees using flat depth, parent, or sibling vectors.[13]
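A sketch of the inverted-table idea in Python with NumPy; the column names and data are invented for illustration. Instead of one array of mixed-type records, each field becomes its own flat, homogeneous column:

```python
import numpy as np

# One homogeneous column array per field, rather than an array of records.
names = np.array(["ada", "bob", "eve"])          # character column
ages = np.array([36, 41, 29], dtype=np.int16)    # small-integer column

# Each column is flat and uniformly typed, so whole-column primitives run
# at full array speed, and filters on one column can select from another.
print(ages.mean())
print(names[ages > 30].tolist())
```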

References

  1. Martin Thompson. "Rectangles All The Way Down" (slides, video) at Dyalog '18.
  2. Matthew Maycock. Beating C with Dyalog APL: wc. 2019-10.
  3. Marshall Lochbaum. "The Interpretive Advantage" (slides (0.5 MB), video) at Dyalog '18.
  4. Larry Breed. "The Implementation of APL\360". 1967-08.
  5. Morten Kromberg and Roger Hui. "D11: Primitive Performance" (slides (1.3 MB), materials (1.4 MB), video) at Dyalog '13.
  6. Roger Hui. "In Praise of Magic Functions: Part I". Dyalog blog. 2015-06-22.
  7. Marshall Lochbaum. "Expanding Bits in Shrinking Time". Dyalog blog. 2018-06-11.
  8. NumPy Reference. "ndarray.strides". Accessed 2020-11-09.
  9. Nick Nickolov. "Compiling APL to JavaScript". Vector Journal Volume 26 No. 1. 2013-09. (The strided representation was later removed from ngn/apl.)
  10. Brian Cantrill. "A Conversation with Arthur Whitney". 2009.
  11. Roger Hui. "Inverted Tables" (slides (0.9 MB), video) at Dyalog '18.
  12. Bob Smith. "A programming technique for non-rectangular data" (included in Boolean functions (pdf)) at APL79.
  13. Aaron Hsu. "High-performance Tree Wrangling, the APL Way" (slides (0.3 MB), video) at Dyalog '18.

