How does looping order affect the performance of a 2D array

Question

How does the order of the loops affect performance when iterating over a 2D array?

Like the code below:

int[][] array = new int[100][100];
for(int i = 0 ; i < 100 ; i++ ){
    for(int j = 0 ; j < 100 ; j++ ){
        System.out.printlin(array[j][i]); // column first
        System.out.printlin(array[i][j]); // row first
    }
}

Which one is more efficient?

Answer

It depends on what programming language you are using.

To answer this question, the first thing you need to know is whether your programming language is row-major or column-major. It will decide the way that how does the computer store 2-D array in 1-D array in memory.

Row-major order is used in C)/C++/Objective-C (for C-style arrays), Mathematica, PL/I, Pascal), Speakeasy), SAS, and C#)/CLI/.Net.

Column-major order is used in Fortran, OpenGL and OpenGL ES, MATLAB, GNU Octave, S-Plus R),Julia), Scilab.

A typical alternative is to use Iliffe vectors, which store elements in the same row contiguously (like row-major order), but not the rows themselves. They are used in Java), Scala) and Swift).

But why does this matter? All memory accesses are the same, right?

NO. Because of CPU caches.

A CPU cache is a hardware cache used by the Central Processing Unit (CPU) of a computer to reduce the average time to access data from the main memory. The cache is a smaller, faster memory which stores copies of the data from frequently used main memory locations.

Data from memory gets brought over to the CPU in little chunks (called ‘cache lines’), typically 64 bytes. If you have 4-byte integers, that means you’re getting 16 consecutive integers in a neat little bundle. It’s actually fairly slow to fetch these chunks of memory.

If you access the data in a consecutive memory order, every time you access a variable, it will directly read from CPU cache. 100*100/16 times fetches from memory.

But if you access the data in a nonconsecutive memory order, it will fetch the data from memory to CPU cache. 100*100 times fetches from memory.

It is very obvious that which way is more efficient.