• Status: Solved
• Priority: Medium
• Security: Public
• Views: 722

# Matrix Blocking Code Optimization in C. HELP!!!

I have to optimize this code by improving locality in order to make my program run in less clock cycles overall.  I realize that the cache size and exact standard of clock cycle increase isn't exactly specified yet, but I'm just trying to get started.

int readArray(int* arr, int n, int r, int c)
{
return *(arr + (r * n) + c);
}

void writeArray(int* arr, int n, int r, int c, int value)
{
*(arr + (r * n) + c) = value;
}

void finalMatrix(int n, int* A, int* result)
{
int r, c;
for (r = 0; r < n; r++)
{
for (c = 0; c < n; c++)
{
// compute and write the value for the result array
writeArray( result, n, r, c, ( readArray(A, n, r, c) * readArray(A, n, r, c) ) );`
}
}

}

CODE IDEA: The initial code computes a final matrix whose elements ( row , column ) is the product of an input matrice's (here called "A") similar (row, column) * (column, row).

I realize that I need to use matrix blocking to allow for less cache misses when accessing the elements of A to produce the result matrix.

I will eventually play around with the matrix blocking size, but for starters I wanted to grab like n/4 x n/4 blocks at a time.

obviously the matrix "A" has good spatial locality when it goes in row major order (r,c) but when it goes by column this is terrible.  I was initially thinking of transposing the matrix "A" into another matrix in which I can then use row major order again, but I think just using the same way as (r,c)*(c,r) with smaller matrix blocks will produce efficient results.

---------------------------------------------------------------------
0
Fallacy911
• 5
• 5
3 Solutions

Author Commented:
As an update, I changed the alternate function calls into one line and got a significant upgrade but not enough. I changed writeArray and readArray into just *(result + (r*n) + c ) = .... * ....
Still need to use matrix blocking.

Thanks!
0

Commented:
>>                   writeArray( result, n, r, c, ( readArray(A, n, r, c) * readArray(A, n, r, c) ) );`

I assume that this line is wrong and should have been :

writeArray( result, n, r, c, ( readArray(A, n, r, c) * readArray(A, n, c, r) ) );

? (ie. r,c * c,r)

So, just multiply each value with the value at the transpose position in the matrix (not multiply the entire row with the entire column like in matrix multiplication - just single values).

It can be immediately observed that this will give a symmetric matrix. This means that you have to process only a triangular sub-part of the matrix (above or below the main diagonal).

There's no easy way to improve cache locality for this kind of calculation, because the further away you move from the diagonal, the further apart the values that need to be multiplied are in memory.

You could, as you noted, make a second copy of the matrix in memory - so you have one in row major order, and one in column major order. That is likely to improve cache locality, but might add other overheads that need to be investigated (performance tests are probably the easiest way to do that).

But keep in mind that this will likely only have a positive impact for really big matrices (if any at all). So, it might be more trouble than it's worth.

Similarly, splitting the matrix up in blocks that fit in the cache (cache blocking) might have a positive effect, or they might not. Performance tests will show whether it works for your specific scenario.

Other hints that will likely have an impact on performance :

1) use arr[r][c] to retrieve an item in a 2D array. Using pointer arithmetic like *(arr + (r * n) + c) might be slower, because it might cause more instructions to be generated for it.

2) make use of certain compiler optimization flags (for loop unrolling eg.).

3) don't use function calls in the inner block of a (nested) loop if you want to optimize performance for that loop. Function calls have quite a bit of overhead, and decrease code locality.

4) you could consider making use of specialized (optimized) libraries like BLAS (http://www.netlib.org/blas/), rather than rolling your own.
0

Author Commented:
Infinity, first, youre awesome, you helped me out on my other question too.

Second, thank you, I realized this will be a symetric matrix.
In one instance I did a loop from say "row = 0 to n" and "col = row to n", that way it calculates along the diagonal and right.
____________
|\ xxxxxxxxxx |
|     \ xxxxxxxx|
|         \ xxxxxx|
|             \ xxxx|
|_________\xx|

And then copied those values into the bottom and left portion, but for some reason that ran slower.

I should point out that this IS run on very large matrices.  (Matrices up to 1024x1024)

>>
You could, as you noted, make a second copy of the matrix in memory - so you have one in row major order, and one in column major order.
>>
by making a second copy of the matrix in memory, your referring to actually creating a new array/matrix, (lets call it matrix "B"), traversing the initial "A" matrix in column major order and then adding those values to the new array, correct?  Wouldn't that be just as slow?

What I'm confused about, is that the elements in this matrix (matrix "A") are only accessed ONE TIME each.  The temporal locality is non existent (poor) and I don't exactly see a way of improving this, so I can only improve spatial locality.

However, if every element is only accessed once in the first place, then wouldnt creating this new matrix "B" be just as (or more) inefficient, because we have to access each element in "A" once to create the matrix B in the first place, am I correct?
After creating this matrix, B, we then have great spatial locality in computing row major order * row major order, but getting there is the initial downfall.

>>
Similarly, splitting the matrix up in blocks that fit in the cache (cache blocking) might have a positive effect, or they might not. Performance tests will show whether it works for your specific scenario.
>>
I realize that I can split the initial matrix "A" into blocks starting from the top right and going both right and down (two seperate blocks at a time) and this will be the new two matrices I perform operations on, (Graphical Representation:)
____________
> |       |++|         |
> |___ |++|         |
> |++|                 |
> |++|                 |
> |                      |
> |___________|

but I'm also wondering how to go about coding this without making it obviously tedious and less optimized.
(If you need clarification on what I'm trying to show by this picture, plz let me know)

>>
1) use arr[r][c] to retrieve an item in a 2D array. Using pointer arithmetic like *(arr + (r * n) + c) might be slower, because it might cause more instructions to be generated for it.
>>
Even if the item passed in is a integer pointer, not an array?  Would we have to create a new array, then? Or can the pointer passed in be applied [x][y] offsets to it?  I realize this might be a dumb question.

>>
2) make use of certain compiler optimization flags (for loop unrolling eg.).
>>
I realize there are flags like -O but I'm trying to keep everything for this optimized code personally optimized.  That is, I'm essentially playing compiler myself and omitting these flags

>>
3) don't use function calls in the inner block of a (nested) loop if you want to optimize performance for that loop. Function calls have quite a bit of overhead, and decrease code locality.
>>
Right, I changed these function calls to in-line code and sped up the process about twice as fast, but not fast enough.

Please continue to help me out here! You're extremely helpful so far.

0

Author Commented:
UPDATE.

I realize now what I'm supposed to do, but I would like input and thoughts on how to do it or if it can be optimal.

The operation is obviously the element-wise product of Ar,c * Ac,r.
So my advice was to matrix block the A matrix's transpose and this helps with spatial locality.

Basically, block A and then create B by each A's transpose
Like,
Divide A into
a1,1    a1,2
a2,1    a2,2

where a1,1 and a1,2 have the same number of rows obviously and then a2,1 and a2,2 have the remaining # of rows (Total rows - Rows in B1,1)

THEN THE FINAL B WOULD BE
a1,1     a2,1
a1,2     a2,2

Using this I suppose I can compute Ar,c * B(r,c)  (notice, now optimized since it is r,c not c,r).

But is creating B at all optimal getting to that point?
I'd have to go ahead and concatenate the a1,1 a1,2 etc matrices to create B.  That is still optimal?  I'm having a hard time wrapping my head around that.

Does this look right?  I was basically told to do this and I'm not sure how to implement this without it being completely unoptimal.  I just feel like the necessary checks and creating the matrices involves other functions or loops and would evidently be unoptimal...

Thank you!
0

Author Commented:
TIP:

THIS ONLY HAS TO BE OPTIMIZED FOR THE VALUES: 32, 64, 128, 512, 1024, 1280.

That should help a lot.
0

Commented:
>> And then copied those values into the bottom and left portion, but for some reason that ran slower.

The point of having a triangular matrix, is that you don't really have to copy the data. A triangular matrix is easier to manage if you don't copy in fact :)

Say you have a Matrix type that describes a regular nxn matrix. If you have a specialized DiagonalMatrix type that implements all functionality by taking advantage of the diagonal matrix of the matrix, things can be speeded up quite a bit.
I realize that this would be a lot easier to implement in C++, but you can use the same idea to treat diagonal matrices in a more efficient way in C.

>> I should point out that this IS run on very large matrices.  (Matrices up to 1024x1024)

That's what I assumed, yes :) Otherwise, the kind of optimizations you were referring to wouldn't really make much of a difference.

>> by making a second copy of the matrix in memory, your referring to actually creating a new array/matrix, (lets call it matrix "B"), traversing the initial "A" matrix in column major order and then adding those values to the new array, correct?  Wouldn't that be just as slow?

There's a good chance of that yes. But all depends on how many operations you have to do on those two matrices, and how long you can re-use the second copy (B). As I said in my previous post - I don't think this will make much of a difference. But if you have the time to test all ideas, you could perform some tests to see if it does help.

>> However, if every element is only accessed once in the first place, then wouldnt creating this new matrix "B" be just as (or more) inefficient, because we have to access each element in "A" once to create the matrix B in the first place, am I correct?

Indeed. If you can make use of this copy only once, then there's very little chance that creating the copy will have a positive effect on performance. If you can re-use the copy several times, there's a good chance that after a while things will become more favorable.

>> but I'm also wondering how to go about coding this without making it obviously tedious and less optimized.

Yes, that's indeed a tricky part. A large part of getting this exactly right, is to determine the right size for the blocks. Ideally, you'd fill up the cache with two blocks. But this becomes extremely hard when dealing with multiple platforms. You would have a big advantage if you can stick to optimizing for one specific platform for which you know the characteristics.

Another issue with this type of cache blocking is, that moving the blocks from memory to cache is not optimal. The optimal way of moving a matrix from memory to cache, is by moving it in the order it's stored in memory (row major order most likely), and that's not exactly what you're doing if you move blocks rather than rows.

It is a tricky question that you're asking here, and there's no easy solution for it (otherwise you wouldn't have asked). I would be inclined to start with the more straightforward optimizations (see my first post), and hold off on the cache blocking for now.

>> Even if the item passed in is a integer pointer, not an array?  Would we have to create a new array, then? Or can the pointer passed in be applied [x][y] offsets to it?  I realize this might be a dumb question.

This is far from a dumb question.

(a) a 2D array (int array [N][N];) : accessing items in it can usually be done with one or two instructions (the compiler is well aware of how to best access 2D array items). The downside is that you have a fixed size.

(b) a dynamically allocated 2D array (int** arr;) : this is likely to be the slowest when it comes to accessing items (because of the extra level of pointers). But it is the most flexible.

(c) a pointer to a block of memory that holds NxN ints (int* arr;) : I assume this is what you're using now. This is the middle road between the two above solutions. It's less flexible than (b), but does not have a fixed size like (a). It does not have the extra level of pointers like (b), so it will be faster in accessing items, but it will probably not be as fast as (a), because the compiler is likely to generate more than just 1 or 2 instructions just to calculate the address of the item to retrieve.

So, if you want the fastest approach, go with the 2D array. Otherwise, you're probably better off sticking with the current approach.

>> I realize there are flags like -O but I'm trying to keep everything for this optimized code personally optimized.  That is, I'm essentially playing compiler myself and omitting these flags

For some things, it's best to trust the compiler. The compiler has optimization rules that are specific for the platform you're compiling for.
If you are dealing with just one platform, you can take the time to try to optimize as much as possible yourself, but you're not likely to beat the compiler at this game, unless you plan to mix in assembly with the C code.
For getting the best level of optimization, you want to both make use of the compiler optimizations, as well as your own.

>> Right, I changed these function calls to in-line code and sped up the process about twice as fast, but not fast enough.

It's something at least. If you can keep posting your most recent code here, I might be able to give further ideas.

>> Divide A into
>> a1,1    a1,2
>> a2,1    a2,2

If you are not constrained by a certain layout in memory of A (I assumed you were dealing with row-major order, and were not allowed to change that - but that assumption might have been false), you could indeed split it up in sub-matrices, and store the matrices in consecutive locations in memory.

So, if you have a 4x4 matrix that looks like this :

1  2   3  4
5  6   7  8
9  A  B  C
D  E  F  0

then storing it in row major order would give this layout in memory :

1 2 3 4   5 6 7 8   9 A B C   D E F 0

ie. each row in consecutive locations in memory.

If you are allowed to change this layout, then you could split up the matrix into 4 2x2 submatrices, and store these consecutively in memory (each of them in row major order) :

1 2 5 6    3 4 7 8    9 A D E   B C F 0

There is a huge advantage to this layout in memory, because it's extremely fast to load a submatrix in the cache. Then it's just a matter of loading two of them in the caceh, and multiplying them together.
Note that you do not need to create a second matrix B for this. As long as the size of each submatrix is optimal for the cache size, and you load the correct two in the caceh, this will definitely speed up the calculation.

But as said : this would only be possible if you are allowed to modify the memory layout of the matrix.
0

Author Commented:
Infinity, thanks again.

One road block here is the absence of temporal locality.

I realize that this will produce a symmetric matrix so I wrote this code:

for (r = 0; r < n; r++)    //n is the dimension size (nxn) and always square
{
rTn = r*n;
for(c = r; c < n; c++)
{
int temps = (*((A+rTn) +c)) *  (*(A + (c*n) + r));
*(result + (rTn) + c) = *(result + (c*n) + r) = temps;
}
}
return;

That will go through the upper triangle and store the same result in two places in the "result" matrix.

Why does that run slower than simply going through each r=0 to n and c=0 to n and computing:
*(result + (rTn) + c) = (*((A+rTn) +c)) *  (*(A + (c*n) + r));

I am confused about why that's slower in clock cycles.

Further:
>>
If you are not constrained by a certain layout in memory of A (I assumed you were dealing with row-major order, and were not allowed to change that - but that assumption might have been false), you could indeed split it up in sub-matrices, and store the matrices in consecutive locations in memory.
>>
Well, I'm coding in C, so arrays are stored in row major order.  However, what I am given is a pointer to an initial int and a pointer to a destination int.  I realize that blocking the matrix like that in memory would be optimal, but I'm SO confused at how can you just "store these consecutively in memory"?
To elaborate, since we're just given a pointer to an integer, how do I take the necessary top right and bottom left blocks of:
1  2   3  4
5  6   7  8
9  A  B  C
D  E  F  0
and just "Store them in memory" without becoming less optimal. How can I do that without creating an entirely new array or pointer to an integer, because that will most likely de-optimize this.

I understand that getting those two blocks is what I need to do.  I then need to multiply 3*9, 4*D, 7*A, and 8*E obviously, so having one block as 3 4 7 8 and another as 9 A D E would allow for simple element by element multiplication assuming theyre sufficiently small enough to fit in the cache, but I just don't know how to like "put them in the cache" without being slow about it.

Like, I will have to, at some point, get the offset of A as  r*n + c... no matter what.  And all I see is that "getting" these offsets requires the valuable clock cycles that is slowing this down.

I'm just so freaking confused.

If this is unclear, let me know I will try and be more precise.

Thank you infinity!
0

Commented:
>> Why does that run slower than simply going through each r=0 to n and c=0 to n

Because of this :

>> store the same result in two places in the "result" matrix.

Storing the same result in two locations in the matrix basically thrashes the cache each time.

>> Well, I'm coding in C, so arrays are stored in row major order.

That's correct. But in C, you have complete control over how you store things in memory. If you use a 2D array, it will be stored in row major order, but if you take control of a memory block, you can store everything in there in the order you want. I'll elaborate on this a bit further.

>> And all I see is that "getting" these offsets requires the valuable clock cycles that is slowing this down.

The calculation of the offset obviously takes a bit of time, but not nearly as much as moving data from memory to the cache.
So, you calculate the address, then try to get the value at that address from the cache. If that value is not in the cache, it needs to be loaded from memory, adding an extra delay before the code can continue.
The idea is to have the value already available in the cache at the moment you need to access it.

So, how to get this to work ?

The idea would be to have a kxk 2D matrix consisting of mxm 2D matrices (conceptually).
The mxm matrices would be the blocks, and the kxk matrix would show the position of each block in the big matrix.

So, for example, if we work with the 4x4 matrix mentioned earlier, consisting of 4 2x2 blocks, then the definition of the matrix could be :

int matrix[K][K][M][M];

and you'd access an item at position (r,c) in the original matrix, using matrix[ra][ca][rb][cb] where :

ra = r / 2;
ca = c / 2;
rb = r % 2;
cb = c % 2;

This changed the memory layout of the matrix to have the 2x2 blocks consecutively, and provides a convenient way of accessing the data.

In order to get the cache blocking done correctly, you now need to iterate over the matrix in a different way. ie. :

for (ra = 0; ra < 2; ++ra) {
for (ca = 0; ca < 2; ++ca) {
for (rb = 0; rb < 2; ++rb) {
for (cb = 0; cb < 2; ++cb) {
/* process matrix[ra][ca][rb][cb] */
}
}
}
}

Just to give you a rough idea :)
0

Commented:
Just one extra note to avoid misunderstandings : the 4 nested loops from my previous post were just to illustrate the concept - they are not intended to actually use them like that.

Instead, iterating over the matrix can be done in a sequential fashion by using an int*, and incrementing by 1 repeatedly.
0

Commented:
I recommend a PAQ of the following posts :

http:#34182609 : general overview of possible performance improvements
http:#34186671 : some more specific (implementation) details about the interesting ones
http:#34188097 : some more elaboration on cache blocking

All of them together provide (a lot of) information to answer the initial question, and since the author hasn't come back, I can only assume that it was sufficient.
0

Commented:
This question has been classified as abandoned and is being closed as part of the Cleanup Program.  See my comment at the end of the question for more details.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.