Benchmarks of APL vs. Other Languages

This page contains the following items:

  • On timing +/1E6{reshape}1.0
  • Singular value decomposition (APL vs. C)
  • Followup to the above
  • APL vs. assembler floating point

    From: jimw@math.umass.edu (Jim Weigang)
    Newsgroups: comp.lang.apl
    Date: 20 Apr 1994 15:26:26 GMT
    Subject: Benchmarking APL interpreters

    Timing "+/1e6 {reshape} 1.0" is not a reasonable way to benchmark an APL system. It can make an APL interpreter seem much faster than it really is.

    1. Many APL systems (including APL*PLUS/386) don't care how you format numbers on input, so "1.0" is stored as the Boolean value 1. 1e6{reshape}1.0 is a 125K-byte bit vector instead of an 8M-byte vector of reals. Summing a Boolean vector can be done very fast using a 256-byte translate table containing the population count for each possible byte. When C programmers write "1.0", they mean floating point. APLers have to be careful not to compare grapes with watermelons.
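    The translate-table trick can be sketched in Python (standing in for the interpreter's C/assembler inner loop; the table and packing scheme here are illustrative, not APL*PLUS internals):

```python
# Sketch of the byte-popcount trick described above: summing a bit-packed
# Boolean vector via a 256-entry table of per-byte population counts.

# Build the translate table once: POPCOUNT[b] = number of 1-bits in byte b.
POPCOUNT = [bin(b).count("1") for b in range(256)]

def sum_bool_vector(packed: bytes) -> int:
    """Sum a bit-packed Boolean vector one byte (8 elements) at a time."""
    return sum(POPCOUNT[b] for b in packed)

# 1e6 {reshape} 1.0 as a bit vector: a million 1-bits is 125,000 bytes of 0xFF.
bits = b"\xff" * 125_000
print(sum_bool_vector(bits))   # 1000000
```

    Each table lookup disposes of eight vector elements at once, which is why the Boolean case is so much faster than summing a million reals.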

    2. Assume you change 1.0 to 1.1 so you get floating point numbers. Most of the execution time will be spent in +/, so mainly you are evaluating the quality of the plus-reduction loop. This is a common operation and is usually carefully crafted by the interpreter authors. How carefully? Well, the plus-reduction loop for the IBM mainframe-based VSAPL consists of TWO machine instructions! An Add and a "BXH" loop instruction. Pretty hard to beat (although it can be done by unrolling the loop). This benchmark should make VSAPL look about as fast as any compiled language.

    3. You gotta be realistic about the problems with the APL statement. For example, did anyone mention Workspace Full to the C programmers? APL has to be able to materialize that 8 MEGAbyte array or it can't execute the statement.

    Even admitting the memory requirements, the benchmark gives an unrepresentative picture of APL. It is rare that a real-life APL application can achieve this degree of "parallelism" (passing arrays this large to primitives without there also being a lot of other operations on MUCH smaller arrays). And it is rare that an application requires only summation within the loop. There are plenty of simple calculations that force an APL program to loop at a scalar or short-vector level.

    My own rule of thumb, based on benchmarks that unfortunately I have not cataloged, is that well-written (i.e., no unnecessary loops) real-life APL code typically runs about 4 times slower than compiled code, give or take a factor of two. (Assuming the problem can be vectorized to some extent.) Note that this is within the 10:1 ratio that John Nagle was looking for. If the APL code has to loop at a scalar level, it typically runs about 150 times slower (+/- 2x) than compiled code. Not great, but compare that with the figure of >1,000 times slower that Nagle quotes for Smalltalk and Python.

    Jim


    From: jimw@math.umass.edu (Jim Weigang)
    Newsgroups: comp.lang.apl
    Date: 23 Apr 1994 22:20:48 GMT
    Subject: Benchmarking APL interpreters

    In real life, the answer is never a scalar...

    I resurrected one of my nontrivial APL vs. C benchmarks. It's a program that computes the singular value decomposition of a matrix. I translated an Algol program to APL back in 1987, and in 1991 I compiled a C program from "Numerical Recipes in C" in order to test the APL*PLUS #NA facility, which lets you run compiled programs from within APL. I recently reran my timing tests, and here are the results:

              N       C        APL   Ratio
            ----  -------   -------  -----
              2     0.029     0.456   15.9
              5     0.080     1.797   22.3
             10     0.380     7.770   20.4
             15     1.026    15.731   15.3
             20     2.208    28.060   12.7
             30     7.372    73.899   10.0
             40    16.395   130.059    7.9
             50    31.894   226.468    7.1
             60    55.393   348.052    6.3
             70    84.168   520.032    6.2
             80   121.924   684.869    5.6
             90   176.534   941.398    5.3
    

    N: the size of the square input matrix.

    C: execution time (in seconds) for the double-precision version of the NRC "svdcmp" program, compiled using Turbo C version 2.0. I had to add a small amount of code to create the vector-of-pointers structure used by the NRC program to address the matrices. Compiler options unknown, probably the default. Far pointers are used for the data, so the program has to load segment registers when accessing the arrays. The C program does use the math coprocessor, because I have a second version that doesn't use it, and it's about 15x (!) slower. The program was run under APL*PLUS /PC version 11 using #NA. All memory allocation was done in APL, and that time is included in the C total. For the 2x2, 5x5, 10x10 and 15x15 cases the APL setup code accounted for 25%, 18%, 4%, and 1.8% of the total C time.

    APL: timed using APL*PLUS II/386 version 3.3, the '387 version (which assumes a math coprocessor is available). Version 5.2 runs at about the same speed (tiny bit slower). The benchmark was run twice to take advantage of the "partial compilation" that APL*PLUS II does the first time a program is executed.

    Ratio: APL time divided by C time.

    The programs were run on my powerhouse 16 MHz 80386 system with 80387 math coprocessor. For the APL version, #WA was about 5.3e6.

    The APL version was adapted from the Algol program in the 1970 article by Golub and Reinsch in Numerische Mathematik, Bd. 14, pp. 401-420. I originally thought the Algol and NRC versions were the same since all the Algol comments appear in the NRC version (hmmm...), but a close examination showed that the NRC version uses scaling to improve accuracy. I changed the APL program to match the C code, and the timings above are for exactly identical algorithms. The APL version is 143 lines long, excluding comments (and not using diamonds). It uses vector operations, inner products and outer products where possible, but most of the APL primitives executed by the program have scalar arguments and results.

    Note that the speed ratio gets smaller as the matrices get larger. This is because when APL handles larger arrays, the fraction of time spent in interpretive overhead is smaller. The speed ratio between APL and compiled programs is never a single number; it varies depending on the size of the arrays being passed to primitive operations. (The downward turn in the ratio for N=2 is because the APL setup for the C program amounts to 25% of the total time.) I modeled the execution time using a cubic polynomial and extrapolated for some higher values of N. At N=200, 500 and 1000 the APL program should be about 3.9, 3.2 and 3.0 times slower than the C program.
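    The extrapolation can be replayed from the table above. Jim doesn't give his fitting procedure, so the following Python sketch is a guess at it: an ordinary least-squares cubic fit to each column, evaluated at larger N. Expect numbers near, not identical to, the quoted 3.9/3.2/3.0.

```python
# Least-squares cubic fit to the SVD timings, then extrapolation of the
# APL:C ratio to larger N. Pure Python (normal equations + Gaussian
# elimination); the fitted coefficients are an assumption, not Jim's model.

def fit_cubic(xs, ys):
    """Fit y = c0 + c1*x + c2*x^2 + c3*x^3 by least squares."""
    A = [[float(sum(x ** (i + j) for x in xs)) for j in range(4)] for i in range(4)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(4)]
    for col in range(4):                       # elimination with partial pivoting
        piv = max(range(col, 4), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 4):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * 4                           # back substitution
    for r in range(3, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 4))) / A[r][r]
    return coef

def poly(coef, x):
    return sum(c * x ** i for i, c in enumerate(coef))

N   = [2, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90]
Ct  = [0.029, 0.080, 0.380, 1.026, 2.208, 7.372,
       16.395, 31.894, 55.393, 84.168, 121.924, 176.534]
APL = [0.456, 1.797, 7.770, 15.731, 28.060, 73.899,
       130.059, 226.468, 348.052, 520.032, 684.869, 941.398]

c_fit, a_fit = fit_cubic(N, Ct), fit_cubic(N, APL)
for n in (200, 500, 1000):
    print(n, round(poly(a_fit, n) / poly(c_fit, n), 1))
```

    The ratio keeps falling with N because the interpreter's per-primitive overhead grows like the lower-order terms of the polynomial while the arithmetic grows like N cubed.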

    Incidentally, I produced the APL program by first translating the Algol directly to scalar APL and then adapting the code to use vector and matrix operations. I retained the scalar algorithm for testing purposes, and I timed it as well. It runs about 13 times slower than the C version for 2x2 matrices and about 90 times slower for 15x15 and 20x20 matrices. (These ratios are estimates because the scalar version is the original Algol algorithm, which runs in about 77% of the time taken by the NRC algorithm.)

    I also tried timing the APL*PLUS /PC system, version 11 (STSC's 8088-compatible interpreter). It ran about 2.5 times slower than the /386 system. Unlike the /386 interpreter, the /PC system has only a single version for computers with and without math coprocessors, and it doesn't use the coprocessor as smoothly as its big brother. The non-coprocessor version of the /386 system ran between 1.5 and 2 times slower than the coprocessor version.

    After spending a lot of time working on this benchmark, I realized a fatal flaw. I'm comparing a real-mode C program (an 8088 program) with a protected-mode APL interpreter. The C program spends an unknown, but certainly significant, amount of time loading segment registers that would be eliminated if it were a 32-bit program running in protected mode. Also, timing tests with assembler programs converted from 16-bit to 32-bit mode (mostly by changing AX to EAX, etc.) showed that at least some mixes of 32-bit instructions run about 20% faster than 16-bit instructions, even without having to load segment registers. So the C program is at a significant disadvantage, and to obtain a true comparison I should be timing a 32-bit C program. Unfortunately, I don't have a 32-bit C compiler.

    Jim


    From: jimw@math.umass.edu (Jim Weigang)
    Newsgroups: comp.lang.apl
    Date: 27 Apr 1994 03:01:27 GMT
    Subject: Re: Benchmarking APL interpreters

    Bill Chang writes:

    What seems a little puzzling to me, is that the APL running times appear to go up by 5X when N is doubled, whereas the C running times are clearly cubic, i.e. 8X larger when N is doubled. This is kind of hard to justify

    I believe the APL program's lower factor is because interpreter overhead is decreasing (as a fraction of the total time) as N increases. A larger N causes the arguments of the vector operations, inner products, and outer products to be larger, resulting in better efficiency. Plus, with increasing N a greater percentage of the total time is pushed from the operations with scalar arguments into the more efficient array operations.

    Because the two programs perform exactly the same arithmetic, I expect that for a large enough matrix, doubling N would increase the APL program's execution time by the same factor as the C program's.
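    The doubling factors Bill noticed can be read straight off the April 23 table. A quick Python replay (timings copied from that table):

```python
# Growth factor of each program's run time when N doubles, from the SVD
# benchmark table. C approaches the 8x expected of an O(N^3) algorithm;
# APL's factor is smaller because interpreter overhead shrinks relative
# to the work done inside the primitives as N grows.

times = {  # N: (C seconds, APL seconds)
    10: (0.380, 7.770),
    20: (2.208, 28.060),
    40: (16.395, 130.059),
    80: (121.924, 684.869),
}

for n in (10, 20, 40):
    c_factor = times[2 * n][0] / times[n][0]
    apl_factor = times[2 * n][1] / times[n][1]
    print(f"N {n:>2} -> {2*n:>2}: C x{c_factor:.1f}, APL x{apl_factor:.1f}")
```

    Note that the APL factor itself creeps upward with N (roughly 3.6x, 4.6x, 5.3x), consistent with Jim's expectation that it would eventually match the C program's cubic growth.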

    Bill:

    It appears that "idiomatic" APL is within a factor of 5 of "unoptimized" C.

    This benchmark reminded me that quoting a single speed ratio, even for a single program, gives an inadequate description of the real situation. The APL version ranged from 22 times slower to (an estimated) 3 times slower, depending on the size of the arrays. The results an APL user observes in practice depend on the size of the arrays. "Gimme arrays! Gimme a bigga raise!"

    Jim


    From: jimw@math.umass.edu (Jim Weigang)
    Newsgroups: comp.lang.apl
    Date: 22 Oct 1994 15:28:47 GMT
    Subject: APL vs. Assembler benchmark

    Here's another APL vs. a-faster-language benchmark, using a program that I recently translated to assembler for a client. The small (10 line) APL program is inherently iterative--the output from one iteration is the input to the next iteration. Within the loop, all the operations apply to vectors (of shorter and shorter length with each iteration). The operations in the loop are: indexing (4), addition (2), multiplication (2), division (2), drop (2), and take (1), plus the usual loop control operations. The arithmetic is floating point, and the assembler code uses '387 floating point operations (as does the interpreter).

    To produce the assembler routine, an "assembler prototype" was first written in APL; this is a version of the program in which all function arguments are scalars. The prototype was timed along with the original program and the assembler program to give an idea of what you can expect if you completely ignore APL's array handling capabilities. The following table gives the execution times for the programs on my Toshiba 3100SX, 16 MHz 80386SX computer (#WA=3.9e6) using APL*PLUS II/386 version 3.3:

                      ASM
           N       Prototype            APL          ASM
          --    --------------     ------------     -----
           2      .0974 ( 42x)      .0438 (19x)     .0023
           5      .4066 ( 99x)      .0931 (23x)     .0041
          10     1.4178 (151x)      .2006 (21x)     .0094
          20     5.2329 (177x)      .4956 (17x)     .0296
          40    20.2205 (190x)     1.4145 (13x)     .1063
          80    79.2102 (193x)     4.5667 (11x)     .4105
    

    N: The initial size of the arrays handled in the loop, and the number of iterations of the loop.

    ASM Prototype: Execution time for the scalar APL program, in seconds. The "NNx" figure is the prototype execution time divided by the assembler routine's execution time.

    APL: Execution time for the original APL program.

    ASM: Execution time for the assembler code.
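    As a sanity check, the "NNx" columns follow directly from the raw timings; replaying them in Python reproduces the table's rounded ratios:

```python
# Reproduce the ratio columns of the table above from the raw timings.
rows = [  # N: (prototype seconds, APL seconds, assembler seconds)
    (2,  0.0974, 0.0438, 0.0023),
    (5,  0.4066, 0.0931, 0.0041),
    (10, 1.4178, 0.2006, 0.0094),
    (20, 5.2329, 0.4956, 0.0296),
    (40, 20.2205, 1.4145, 0.1063),
    (80, 79.2102, 4.5667, 0.4105),
]
for n, proto, apl, asm in rows:
    print(f"N={n:>2}: prototype {proto / asm:3.0f}x, APL {apl / asm:2.0f}x")
```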

    Both the APL and the assembler program are tightly written--there are no unnecessary operations--but I didn't spend a lot of time trying to squeeze a little more speed out of either version. In short, good production-quality code.

    There is one significant difference between the APL and assembler code: In the setup code before the loop, the APL program performs some calculations on square matrices, but only the lower-triangular portion (and diagonal) is really needed. The assembler program avoids doing the arithmetic on the upper-triangular portion. On the one hand, this makes this benchmark of questionable value since the APL program is doing more work than the assembler program. But on the other hand, this is how triangular procedures are often coded in APL (even though there are better, but more complicated, methods), so maybe the benchmark isn't so meaningless. In any case, the wasted time in the APL program can be measured and subtracted for a more balanced comparison. In the APL version, the setup code accounts for 40% of the total time for N=10, 48% for N=20, 58% for N=40, and 68% for N=80. If the APL code didn't perform these wasted arithmetic operations, the execution time for N=80 would be about

            @ W is fraction of square matrix that's used:
            W{<-}(80{times}81{divide}2){divide}80*2
    
            (4.5667{times}1-.68) + W{times}4.5667{times}.68
    

    or about 3.0 seconds, which is roughly 7 times slower than the assembler code, compared with 11 times slower when the wasted arithmetic is included. (The assembler prototype, by the way, avoids the wasted arithmetic.)
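    Replaying that correction in Python (the APL above uses {<-} for assignment, {times} for multiply, {divide} for divide; the arithmetic comes out at just over 3.0 seconds):

```python
# Mirror of the wasted-arithmetic correction for N=80.
total = 4.5667               # APL time for N=80, from the table
setup_frac = 0.68            # fraction of that time spent in the setup code
W = (80 * 81 / 2) / 80**2    # fraction of the square matrix actually needed

# Keep the non-setup time, and scale the setup time by W.
adjusted = total * (1 - setup_frac) + W * total * setup_frac
print(round(adjusted, 2), "seconds,", round(adjusted / 0.4105, 1),
      "x the 0.4105 s assembler time")
```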

    As seen before in my April 23 benchmark posting (the singular value decomposition example), the ratio between the APL and compiled program improves as the array size increases. This is because the fixed overhead of interpretation and memory management becomes a smaller fraction of the total time as more time is spent processing longer vectors. The anomalous ratio for N=2 is, I believe, because the fixed overhead of calling the assembler routine (via #CALL) and the extensive checking that the assembler routine performs on the inputs have become a significant portion of the total time.

    The ratios for the assembler prototype vividly demonstrate why looping has such a bad reputation in APL. Of course, what's really bad about this version is that it processes scalars throughout. If you unrolled all the loops for a specific case, making it a loopless scalar algorithm, it would still be miserably slow.

    Jim
