Tuesday, February 03, 2009

.NET numeric performance disappointing

For the purpose of implementing a perceptron I recently decided to take a quick peak at the low-level performance of various approaches to numerics in .NET - specifically, how to store a vector.  A common operation (and thus my benchmark) is a dot-product, which requires iterating over all elements of two vectors, and computing the sum of their products.

A vector can be represented as a close-to-the-metal array, or some higher level construct.  List<double> provides resizability, IList<double> provides an interface which standardizes access to array-like data structures, and ReadOnlyCollection<double> provides a read-only view on an underlying IList<double>.

It turns out the performance difference varies rather dramatically:

image

Notice the logarithmic scale!  The numbers are reproducible each run to within 0.01 seconds (with warm caches).

The test consisted of the sum of 10'000'000 dot products of a pair of vectors with each 100 doubles.  Tests consisted simply of two nested for loops with constant bounds.  Tests marked "+member" accessed the array via a shared member variable, others received a reference as a parameter.  Tests marked "+length" didn't use the constant known size of each vector (being 100), but rather the array's .Length property.  Finally, tests marked "+subroutine" used a dot-product subroutine (rather than the inline nested loop).  Ideally, all of these variations would have virtually no performance cost.

It's clear that abstraction imposes a heavy cost in .NET; List<double> is not a viable implementation choice for a computation-bound program in .NET, unfortunately.  Using an array cast as an IList may be handy, but whatever is happening behind the screens destroys performance - it's possible that each array access causes a virtual method dispatch when performed via the interface.

The ReadOnlyCollection has a low impact, considering that it wraps an IList.  However, this collection should essentially be a no-op: all ReadOnlyCollection methods can be implemented by directly accessing the underlying IList and indeed an inlining compiler could completely remove the cost.  ReadOnlyCollection simply removes methods with permit mutating the underlying collection.

If we remove the slower options the differences become more visible.  As shown on a linear scale then:

image

In the 32-bit case, all variations are reasonable close together; clearly accessing static member variables is slightly slower, but the differences aren't large.  Using the array's built-in Length property is actually somewhat faster than using a constant, but the difference is negligible.  C# versions required around 60% more time than the C++ versions.

Unfortunately, the 64-bit compiler is less forgiving; it seems this JIT cannot inline the very same method which the 32-bit JIT CLR will:

static double Dot(this double[] vecA, double[] vecB) {

    double sum = 0.0;

    for (int i = 0; i < SIZE; i++)

        sum += vecA[i] * vecB[i];

    return sum;

}

 

Using the array's .Length property or using a subroutine causes a serious slowdown (suggesting neither is properly being optimized by the JIT).  Using static variables rather than parameters, as in the 32-bit case, causes a slowdown.

In short, if you want to do heavy numerical lifting, .Net isn't the best tool at hand.  If you do want to use .net, the 32-bit version is your best bet; at least in this version you can split a complex algorithm into subroutines without undue penalty.  Should you choose to use ReadOnlyCollections or virtually any high-level functionality, your code will be anywhere between 6 and 60 times slower than the equivalent C++ code.

The benchmarks were executed on a 64-bit Vista machine with .NET framework 3.5 sp1.