C Code Optimization
Does anyone see a way to optimize this code? I'm supposed to be able to get up to a 5x speedup, but I'm not seeing anything.
```
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
    {
        for (j = 0; j < dim; j++)
        {
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
        }
    }
}
```
That RIDX definition is there so you can simulate a multidimensional array in a flat buffer, since C doesn't give you dynamically sized multidimensional arrays.
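For reference, RIDX is defined like this:
```
#define RIDX(i,j,n) ((i)*(n)+(j))
```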
Any help would be appreciated.
EDIT: Never mind, I googled that code; it's an assignment, correct? I stumbled across this hint: "The above code scans the rows of the source image matrix, copying to the columns of the destination image matrix. Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking."
So what you want to do is move calculations OUTSIDE the for-loop, make the for-loop run fewer times, and only return from the function when it has completed its task.
@spyware - yeah, I was thinking that, but I can't seem to find a way to do that. Any suggestions?
ynori7 wrote: @spyware - yeah, I was thinking that, but I can't seem to find a way to do that. Any suggestions?
Yes, of course. Any calculation made inside a for-loop that doesn't require the i or j variable can be moved outside. Example (pseudo-code):
In this example, assume the variables a and b get their values from another function; they're not static data. Instead of writing:
```
a=10
b=15
for (x = 0; x < 10; x++)
{
    c = a+b+x
}
```
While you should be doing:
```
a=10
b=15
d=a+b
for (x = 0; x < 10; x++)
{
    c = d+x
}
```
This way you will reduce the calculations made by the computer.
EDIT: I think, in your case, you can (also) merge the loops. i=j or something.
```
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j, temp, dim2 = dim-1;
    for (i = 0; i < dim; i++)
    {
        temp = i*dim;
        for (j = 0; j < dim; j++)
        {
            dst[RIDX(dim2-j, i, dim)] = src[temp+j];
        }
    }
}
```
These are the only changes along those lines that I can see to make, but they don't speed it up any.
The best way I can see to speed it up would be to eliminate that inner loop, but I don't know how.
I got roughly a 1.5x speedup on the smallest input (I imagine the factor would be greater for larger input). I used characters instead of pixels for simplicity; the same optimizations should give the same relative speedup. I'm not done, I just thought I'd post it so some of you can review what I've done so far and maybe have some ideas. I'll think about this problem some more; I just have other stuff to do for the rest of the day >.<
```
#define RIDX(i,j,n) ((i)*(n)+(j))
#include <iostream>
#include <cstdlib>   // for malloc
#include <Windows.h> // for QueryPerformanceCounter
typedef unsigned int u_int;
using namespace std;
void naive_rotate(register int dim, char *src, char *dst)
{
int i, j;
register u_int ov1, ov2, ov3;
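// hoisted index arithmetic:
//   ov1 = (dim-1)*dim   start of the last row of dst
//   ov2 = ov1 + i       dst index of element (dim-1, i)
//   ov3 = i*dim         start of row i in src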
ov1 = dim*dim - dim;
for (i = 0; i < dim; i++)
{
ov2 = ov1 +i;
ov3 = i*dim;
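// inner loop unrolled 32x; assumes dim is a multiple of 32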
for (j = 0; j < dim; j+=32){
//dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
//dst[(dim-1-j) * dim + i] = src[(i)*(dim) + j];
dst[ov2 - (j*dim)] = src[ov3+j];
dst[ov2 - ((j+1)*dim)] = src[ov3+j+1];
dst[ov2 - ((j+2)*dim)] = src[ov3+j+2];
dst[ov2 - ((j+3)*dim)] = src[ov3+j+3];
dst[ov2 - ((j+4)*dim)] = src[ov3+j+4];
dst[ov2 - ((j+5)*dim)] = src[ov3+j+5];
dst[ov2 - ((j+6)*dim)] = src[ov3+j+6];
dst[ov2 - ((j+7)*dim)] = src[ov3+j+7];
dst[ov2 - ((j+8)*dim)] = src[ov3+j+8];
dst[ov2 - ((j+9)*dim)] = src[ov3+j+9];
dst[ov2 - ((j+10)*dim)] = src[ov3+j+10];
dst[ov2 - ((j+11)*dim)] = src[ov3+j+11];
dst[ov2 - ((j+12)*dim)] = src[ov3+j+12];
dst[ov2 - ((j+13)*dim)] = src[ov3+j+13];
dst[ov2 - ((j+14)*dim)] = src[ov3+j+14];
dst[ov2 - ((j+15)*dim)] = src[ov3+j+15];
dst[ov2 - ((j+16)*dim)] = src[ov3+j+16];
dst[ov2 - ((j+17)*dim)] = src[ov3+j+17];
dst[ov2 - ((j+18)*dim)] = src[ov3+j+18];
dst[ov2 - ((j+19)*dim)] = src[ov3+j+19];
dst[ov2 - ((j+20)*dim)] = src[ov3+j+20];
dst[ov2 - ((j+21)*dim)] = src[ov3+j+21];
dst[ov2 - ((j+22)*dim)] = src[ov3+j+22];
dst[ov2 - ((j+23)*dim)] = src[ov3+j+23];
dst[ov2 - ((j+24)*dim)] = src[ov3+j+24];
dst[ov2 - ((j+25)*dim)] = src[ov3+j+25];
dst[ov2 - ((j+26)*dim)] = src[ov3+j+26];
dst[ov2 - ((j+27)*dim)] = src[ov3+j+27];
dst[ov2 - ((j+28)*dim)] = src[ov3+j+28];
dst[ov2 - ((j+29)*dim)] = src[ov3+j+29];
dst[ov2 - ((j+30)*dim)] = src[ov3+j+30];
dst[ov2 - ((j+31)*dim)] = src[ov3+j+31];
}
}
}
int main(int argc, char* argv[])
{
char* srcbuf = (char*)malloc(1024);
char* dstbuf = (char*)malloc(1024);
LARGE_INTEGER start, end;
QueryPerformanceCounter(&start);
naive_rotate(32, srcbuf, dstbuf);
QueryPerformanceCounter(&end);
cout << (end.QuadPart - start.QuadPart) << "\n";
system("PAUSE");
return 0;
}
```
ynori7 wrote: Yeah, I already tried loop unrolling, and it didn't gain me anything. I turned on some compiler optimization flags, so I'm already at about a 1.6x speedup. I guess the compiler is doing the loop unrolling itself. Thanks anyway though.
Maybe you're not using a high-resolution timer. The compiler cannot do loop unrolling. With loop unrolling you are making assumptions about loops; in this case you're assuming that 'dim' will be a multiple of 32. How is the compiler supposed to know that? I turned on compiler optimization flags and got a 7x speedup. What optimizations did you do besides limiting execution within loops?
a-hack wrote: What are you guys using to time the code execution?
I'm using a performance counter; you can see it being used in the sample code I provided, had you read it.
For a larger data size I got about a 4x speedup with standard compiler optimizations. I'll look at the assembler tomorrow. It doesn't unroll the loops though, so I'm sure there are more optimizations that can be done.
This is an interesting problem. I want to investigate it more.
Chinchilla3k wrote: Maybe you're not using a high-resolution timer. The compiler cannot do loop unrolling. With loop unrolling you are making assumptions about loops; in this case you're assuming that 'dim' will be a multiple of 32. How is the compiler supposed to know that? I turned on compiler optimization flags and got a 7x speedup. What optimizations did you do besides limiting execution within loops?
I'm using the timer that my TA who is grading the problem will be using; I think he wrote it himself.
I'm using the -O3 optimization flag for gcc. Other than that, the only thing I have is pretty much the same thing you wrote.
Also, the compiler doesn't have to assume that it's a multiple of 32.
```
for (j = 0; j < dim; j += 32)
{}
```
it could use:
```
for (j = 0; j < dim-31; j += 32) // same loop, but it ends earlier
{}
for (; j < dim; j++)             // then it finishes up with a step-by-one loop
{}
```
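Applied to the rotate code, the pattern I mean would look roughly like this (unrolled by 4 just to show the idea, using the same pixel type and RIDX macro as above; the name is just for illustration and I haven't timed this exact version):
```
void rotate_unrolled(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
    {
        /* main loop handles 4 columns of the source row per iteration */
        for (j = 0; j + 3 < dim; j += 4)
        {
            dst[RIDX(dim-1-j,   i, dim)] = src[RIDX(i, j,   dim)];
            dst[RIDX(dim-1-j-1, i, dim)] = src[RIDX(i, j+1, dim)];
            dst[RIDX(dim-1-j-2, i, dim)] = src[RIDX(i, j+2, dim)];
            dst[RIDX(dim-1-j-3, i, dim)] = src[RIDX(i, j+3, dim)];
        }
        /* cleanup loop picks up the leftovers if dim isn't a multiple of 4 */
        for (; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}
```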
EDIT: By the way, my speedup is about 1.5x.
ynori7 wrote: I'm using the timer that my TA who is grading the problem will be using; I think he wrote it himself.
I'm using the -O3 optimization flag for gcc. Other than that, the only thing I have is pretty much the same thing you wrote.
Also, the compiler doesn't have to assume that it's a multiple of 32.
```
for (j = 0; j < dim; j += 32)
{}
```
it could use:
```
for (j = 0; j < dim-31; j += 32) // same loop, but it ends earlier
{}
for (; j < dim; j++)             // then it finishes up with a step-by-one loop
{}
```
EDIT: By the way, my speedup is about 1.5x.
I used -O3 and got a 7x speedup on dim=32 and a 4x speedup on dim=256. Also, that's not loop unrolling; in fact I don't see how it is an optimization.
Consider dim = 32: the first loop ends after one iteration with j = 32, then you go to the next loop and test j < dim, which will always be false. Once the first for-loop fails its test condition, the second one will fail it too, so why do the extra test? The code I have will only test it once.
The idea of loop unrolling is to minimize the number of times you test the loop's condition.
Chinchilla3k wrote: I used -O3 and got a 7x speedup on dim=32 and a 4x speedup on dim=256. Also, that's not loop unrolling; in fact I don't see how it is an optimization.
Consider dim = 32: the first loop ends after one iteration with j = 32, then you go to the next loop and test j < dim, which will always be false. Once the first for-loop fails its test condition, the second one will fail it too, so why do the extra test? The code I have will only test it once.
The idea of loop unrolling is to minimize the number of times you test the loop's condition.
We're using 64, 128, 256, 512, and 1024, and averaging the speedup. Plus, the baseline function we're comparing against is hard-coded on the machine.
And that is loop unrolling. You unroll it up to the point at which you could have problems if the array isn't the size you thought, and then you go by ones for the rest. You have to do this if you don't already know how many elements are in the array; it could be an odd number, and then you'd end up accessing memory outside the bounds of the array.
The code you have works fine for this scenario, but it won't always work.
ynori7 wrote: We're using 64, 128, 256, 512, and 1024, and averaging the speedup. Plus, the baseline function we're comparing against is hard-coded on the machine.
And that is loop unrolling. You unroll it up to the point at which you could have problems if the array isn't the size you thought, and then you go by ones for the rest. You have to do this if you don't already know how many elements are in the array; it could be an odd number, and then you'd end up accessing memory outside the bounds of the array.
The code you have works fine for this scenario, but it won't always work.
You were re-rolling the code I gave you; using one for-loop at increments of 32 would work on all those sizes you are using. You were re-rolling unrolled loops. I see what you mean though, but in this problem we are under the assumption that the sizes are multiples of 32. Well, from what you just told me you are using only multiples of 64, so you can safely use increments of 64.
This actually brings me back to my original question. I provided code that was already unrolled; you gave me an example of how a compiler could "unroll" a "rolled" loop, but my code was already unrolled. How can a compiler unroll something like
```
for (j = 0; j < dim; j++) {}
```
When unrolling you ARE making assumptions. The compiler cannot know what assumptions you are making (when considering the function can take an arbitrary value as input).
I'm curious: which university or post-secondary institution do you study at? And what year are you in?
ynori7 wrote:
Chinchilla3k wrote: I'm curious: which university or post-secondary institution do you study at? And what year are you in?
What difference does that make?
I was just curious. I study at the University of Waterloo. I'm a first year undergrad in mathematics. There is not much need to be defensive.
Why are you leaving unanswered the questions I asked that are relevant to the topic you started? You started a thread asking for help with this, but you don't seem interested in continuing a discussion about the problem.
I'm nowhere near being an expert in optimization; all I've read are a few books and papers on the topic. That's why this problem interested me, and I'll keep reading about it. It's always good to have someone to interact with, especially when they're studying the same topic.
Is this an active assignment? Does your school have rules about asking for help on assignments?
Chinchilla3k wrote: I was just curious. I study at the University of Waterloo. I'm a first year undergrad in mathematics. There is not much need to be defensive.
Why are you leaving unanswered the questions I asked that are relevant to the topic you started? You started a thread asking for help with this, but you don't seem interested in continuing a discussion about the problem.
I'm nowhere near being an expert in optimization; all I've read are a few books and papers on the topic. That's why this problem interested me, and I'll keep reading about it. It's always good to have someone to interact with, especially when they're studying the same topic.
Is this an active assignment? Does your school have rules about asking for help on assignments?
What questions have I left unanswered?
Yeah, it's an active assignment due next Tuesday. They don't mind if people get help as long as they don't just copy someone's solution.
In case you're still interested, I think the best way to optimize it has something to do with the cache. The cache line size on the machines I'm using is 64 bytes or something like that, and when you access data from the array, the cache assumes you're going to want data from the area around it, so it pulls in the entire line. But we're not going through memory linearly, so it's just wasting its effort.
I'm not exactly sure how to solve this, though.
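EDIT: From the hint, I'm guessing this is what the "blocking" technique is for: process the image in small square tiles so the destination writes stay within a few cache lines at a time instead of jumping a whole row between writes. A rough sketch of the idea, using the same pixel type and RIDX macro as above (the function name and the block size of 8 are just guesses I'd have to tune, and it assumes dim is a multiple of the block size, which holds for the sizes we're graded on):
```
#define BLOCK 8  /* tile size; worth experimenting with 8, 16, 32 */

void rotate_blocked(int dim, pixel *src, pixel *dst)
{
    int bi, bj, i, j;
    /* walk the image in BLOCK x BLOCK tiles so the scattered dst writes
       stay inside a small working set of cache lines */
    for (bi = 0; bi < dim; bi += BLOCK)
        for (bj = 0; bj < dim; bj += BLOCK)
            for (i = bi; i < bi + BLOCK; i++)
                for (j = bj; j < bj + BLOCK; j++)
                    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
```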