I am currently trying to learn some good asm optimizing techniques for a graphics program I am running, and was wondering if there was a fast way to do what I'm doing below in ASM. (I'm sure there is)

for(x=0; x < 320; x++)
  for(y=0; y < 200; y++)
  {
    total_distance=0;
    for(ctr=0; ctr < total; ctr++)
    {
      total_distance += (100/distance_formula(x, y, struct[ctr].x, struct[ctr].y));
    }
    if(total_distance > 10)
      put_pixel(x,y,total_distance);
  }

I don't need help with the algorithm; it is sound for quick 2D meta-balls. I just want to know the best way to do this. Basically, I figure there must be some ASM opcodes I don't know about that would do this faster. The put_pixel routine outputs to a virtual screen, so there's room for optimization there; just help me out here.

Below is just the algorithm above in English:

Loop through the entire virtual screen, 64000 bytes (320x200).

For every point on the screen:
Loop from 0 to a set number of vertices.
In the loop, divide 100 by the distance from the current x,y you are at to the vertex in the nested loop you're running, and add it to the total. Then plot the pixel to the virtual screen.

The X and Y variables are integers (even in the structures)

The total_distance variable is a float (so if I don't see you using EAX, I'll be upset!:)

The compilers (especially C/C++) today produce code which is at least 90-95% as fast as
pure assembly language. Graphics code, for example, will only be faster than DirectX when
you are supporting either software rendering/drawing (you'll have to write different code for
most of the common display adapters) or (for 3D rendering) you implement calls to
the 3D rendering engine of the display adapter (which, of course, is also different for different
adapters). As long as you aren't coding for DOS, use DirectX. It is the most portable
method to support graphics, it isn't hard to write, and it's the fastest method (as long
as you don't want to write code for all common 3D/2D graphics adapters).
If you like, I can post the assembly code for your routine (for the 320x200x256 MCGA
/VGA mode) - but you'll have to specify what distance_formula computes. A 32-bit compiler
will replace the (slow) assembler CALL to distance_formula with an inline function, which is
generally faster than the CALL. If you don't specify what distance_formula does, you
can be almost sure that an assembly implementation CALLing this function will be slower
than the output your compiler generates.

I would consider looking into ways to optimize distance_formula when called many times with the same points or similar points.


snoegler:
Yeah, I'm aware that compilers produce very fast ASM code, but only from the C code I give them. I know that the function I gave you is getting compiled as efficiently as possible, but I want to know of a faster way to do it.

For example, if I wanted to set the entire virtual screen to 0 by doing for(i=0; i < 64000; i++) vscr[i] = 0;, the compiler would optimize that as much as possible, but won't be nearly as fast as _fmemset(vscr, 0, 64000);
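That single-call clear looks like this sketch (using plain `memset` for a flat-memory compiler; under 16-bit DOS with far data you'd use `_fmemset` as above - the `vscr` declaration here is just for illustration):

```c
#include <string.h>

#define SCREEN_SIZE 64000U

static unsigned char vscr[SCREEN_SIZE];  /* the 320x200 virtual screen */

/* clear the whole virtual screen in one library call instead of a byte loop */
void clear_screen(void)
{
    memset(vscr, 0, SCREEN_SIZE);
}
```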

Assembler is NOT going to help you speed up the algorithm. Most of the time is spent doing the math. The pixel plotting is trivial. The source code I gave you above will speed it up.

Declare sqr_table as an array of 320 integers (holding the squares of 0-319), and sqrt_table as an array of a floating point type. Note that this table will contain 142,400 (320^2 + 200^2) floats.

The first two statements will set up the necessary tables. This only has to be done ONCE, no matter how many times you repeat the rest of the program. There will be some overhead involved in this, but once they are created you can use the tables over and over.

If memory is a concern, or the overhead involved in making the square root table, you can just use the sqr_table alone, and still notice an increase in speed.

If you want to exclude the square root table, change the line that does the sqrt table lookup to compute the square root directly instead.

I don't know if distance_formula in your code is a function or a macro, but as snoegler said, it is better to do without a function call. This is a cycle eater.

There are a few problems with your answer, but some of it is my fault.

I am using Turbo C++ 3.0 for DOS (i know, i know, shh!), anyway, I already have a 64000 byte array for my virtual screen.

If you can tell me a way to :

1) Allocate enough memory to hold an array of 64000 floats, which is equivalent to making 4 more virtual screens.

2) Allocate enough memory to hold an array of 320 longs
(to hold the square table, because no 16-bit integer is going to hold 320^2).

Your suggestion would increase speed, but HOG memory.

If I could just hog all the memory I wanted, the fastest way to do what you're telling me is to record the distance from every point to every point, in an array that would be equivalent to 64,000 virtual screens. (64,000x64,000 bytes).

My goal is to incorporate the "cycle eating" algorithm and the pixel plotting into one hunk of ASM code.

If I were you, I would stop using floating point and move to 32-bit integer maths (fixed-point).
Of course, you're using BC 3.0, but when you write an (external) assembler module
for your purpose, you can use the 32-bit operations (MOV EAX, ...). They're legal in real mode,
as well as in V86 or protected mode.
Integer maths is of course faster than floating point, and as (almost) all MMX-capable processors
have 2 integer pipelines, this will double the speed again.
There is a very fast integer SQRT algorithm; if you like, I'll post it. (I would do it right now, but
I need to search my hard disk for it.)
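As an illustration of the fixed-point idea (my own sketch, not snoegler's code: 16.16 format, with a 64-bit intermediate for the multiply and divide, which on a 386 you would get from the 32x32->64 MUL and 64/32 DIV instructions):

```c
/* 16.16 fixed-point sketch: 16 integer bits, 16 fractional bits */
typedef long fixed;   /* must be at least 32 bits */

#define INT_TO_FIX(n)  ((fixed)(n) << 16)
#define FIX_TO_INT(f)  ((f) >> 16)

fixed fix_mul(fixed a, fixed b)
{
    /* keep the full 64-bit product, then drop 16 fraction bits */
    return (fixed)(((long long)a * b) >> 16);
}

fixed fix_div(fixed a, fixed b)
{
    /* pre-shift the dividend so the quotient keeps its fraction bits */
    return (fixed)(((long long)a << 16) / b);
}
```

(Turbo C++ 3.0 has no `long long`, so in practice the wide intermediate lives in the asm module; this is just to show the arithmetic.)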

For those features you don't need an external asm file. Just set the compiler to generate 386 code (although still real mode!) and do an 'asm .386' at the start of your 386 code; this allows inline 386 asm.

Is your goal to do it faster, or just to use asm for the sake of using asm?

Doing it with 64000 virtual screens (i.e. figuring the distance from every point on screen to every other point on screen) would be fast, but the overhead to calculate this would probably take 20 minutes on a 300MHz computer. The square root table I proposed would take one array of 142400 (320^2 + 200^2) floats; at 8 bytes each that is approximately 1.1 MB of memory, or about half that with 4-byte floats. You could skip the square root table and end up using only about 1K for the square tables, and still notice quite a gain in speed.

Let xd be a variable containing abs(x-struct[ctr].x)
Let yd be a variable containing abs(y-struct[ctr].y)
numerator is an integer variable containing 100

I am not familiar with floating point opcodes at all. I tried the above and got several overflow errors etc. pertaining to the way you're indexing the square table.

Also "fld total_dist" and "fst total_dist" get similar errors. What do these opcodes do anyhow? :)

You've asked an impossible question, "speed up my code but don't change the algorithm." The overhead of doing a for loop in C/C++ vs. ASM is probably close to 0% with any reasonable compiler. Same with the put_pixel(), it's not in the inner loop, who cares how long it takes!!!

You need to analyze what is going on inside the _INNER_ loop. Just a quick glance shows me that you are performing a function call AND FLOATING POINT DIVISION INSIDE THE LOOP!!!! Stop this; if it must run fast, get that division out of there and do ALL inner-loop math in INTEGER! If you need to total stuff up, do it all and then do the division once for the entire calculation.

It's a common misconception that ASM is always faster than C/C++/Pascal/etc... This is just not so. Algorithm design usually has much more to do with speed than the language itself.

If you're going to tell me that my question is impossible, then don't try to "answer" the question for points by telling me it is impossible. I know what the common misconceptions are with C vs. ASM, and I didn't say "speed up my code but don't change the algorithm"; I said I wasn't looking for help with the actual algorithm, although as scrapdog stated, it looks like there is room for optimization there as well, with some sort of integer square root algorithm I'd never heard of.

I simply wanted to convert the entire loop, put_pixel routine and algorithm into one ASM function, but for some odd reason you guys feel like lecturing me on why I shouldn't ask this question. The question is asked, and the question is worth a lot.

If it makes you feel better if I say that I'm trying to do this program in PURE ASM with a PURE ASM compiler, will that persuade you to answer the question? Scrapdog is the only one who attempted, but for some reason his code isn't working for me. But if this question keeps getting debated instead of answered, Scrapdog gets the points.

Also I have come across a very fast inverse square root algorithm on the net. If you are interested, I will download that and make the necessary changes to it (if I can).

No, I scaled the square root table to 1/4 of the whole screen. I do believe I should use 2. However, maybe I'm wrong. I didn't spend too much time figuring this out so I might have overlooked something.

Does it start to look right if you take out the [hp>>2]<<2, and use, say:
sqrt_table[i] = (int)(1600/sqrt(i));
and
if (total_dist > 160) putpixel(x,y,total_dist>>4);

All this does is decrease the resolution of my algorithm and slow it down with more shifts.

Messiah: are you planning to call your algorithm repeatedly to simulate moving meta-balls? If so, there is definitely room for optimization.

I made an ATTEMPT to convert your algorithm to assembly language. I tried to implement an integer square root. I wouldn't cut and paste this "as is" if I were you... I would look through it and check the logic. I am not all that used to Intel assembler directives, so I might have forgotten something. You might need to add a few "dword ptr"s here and there...

Also, I didn't test it, as this would require me to build an entire program that manipulates the screen.

This algorithm assumes that you have already set up your sqr_tables, arrays of 320 dwords. If you do not wish to set up sqr_tables, remove the references to it and replace it with the appropriate mul instruction (I advise against this however).

Also I didn't set up the variables for you. You will have to do this yourself, depending on the assembler you are using.

Here are the variables:

total dword
total_dist dword
xd dword
centerx array of dwords
centery array of dwords
xctr dword
yctr dword
strctr dword
vscreen array of bytes
sqr_table array of dwords

Here is the algorithm:

pushad
mov esi, 319
xloop: mov xctr, esi
mov edi, 199                ; y runs 199 down to 0
yloop: mov yctr, edi
mov total_dist, 0
mov edi, total
dec edi                     ; ball index runs total-1 down to 0

strloop: mov strctr, edi
mov eax, xctr
sub eax,centerx[edi*4]
jns noabs1
mov eax,centerx[edi*4]
sub eax, xctr
noabs1: mov xd, eax
mov eax, yctr
sub eax, centery[edi*4]
jns noabs2
mov eax, centery[edi*4]
sub eax, yctr
noabs2: mov esi,eax
mov eax, sqr_table[esi*4]   ; yd squared
mov esi,xd
add eax, sqr_table[esi*4]   ; eax = xd^2 + yd^2
test eax, eax
jz nosqrt                   ; distance 0: skip the divide

; integer square root by Newton's method: guess = (guess + n/guess)/2
mov ebp, eax                ; ebp = n
mov ebx, eax                ; ebx = current guess
mov ecx, 0                  ; iteration counter
loop1: xor edx, edx         ; clear the high half before the 32-bit divide
mov eax, ebp
div ebx                     ; eax = n / guess
add eax, ebx
shr eax, 1                  ; eax = improved guess
cmp eax, ebx
jae done                    ; guess stopped shrinking: converged
mov ebx, eax                ; commit the new guess
inc ecx
cmp ecx, 20
jl loop1
done: mov eax, 100
xor edx, edx
div ebx                     ; eax = 100 / sqrt(xd^2 + yd^2)
add total_dist, eax
nosqrt: mov edi, strctr
dec edi
jge strloop

; plot the pixel if total_dist > 10, then close the y and x loops
cmp total_dist, 10
jle nopixel
mov eax, yctr
mov ebx, eax
shl eax, 8                  ; y*256
shl ebx, 6                  ; y*64
add eax, ebx                ; eax = y*320
add eax, xctr
mov ebx, total_dist
mov vscreen[eax], bl
nopixel: mov edi, yctr
dec edi
jge yloop
mov esi, xctr
dec esi
jge xloop
popad

messiah: If you can optimize asm so well, why don't you just generate assembly output of your C program with -S (BCC), optimize the parts you want, and then include them inline with the rest of the C code?

Yes, I am going to be repeating this loop over and over to simulate moving meta balls.

The centerx, centery arrays, I presume those are the individual meta ball points? If so, one thing I don't know how to do is use structures in ASM. My "centerx, centery" variables are:

typedef struct
{
int x, y;
} coord;

coord centerx[meta_balls];

That answer is almost complete; you'll be getting credit, scrapdog.

CenterX and CenterY are arrays of the x and y values of each of the balls. They are arrays of double words (32-bit integers).

Example:

centerx[10] would be equivalent to struct[10].x

If you were to use one array of structures, rather than two arrays (one x and one y), there would be only minor changes to make.

An array of structures would be stored like this:

x1 y1 x2 y2 x3 y3 x4 y4 ... xn yn

where these are all contiguous integers.

Two arrays would be stored like this:

x1 x2 x3 x4 x5 x6 ... xn
y1 y2 y3 y4 y5 y6 ... yn

If you really wanted to use structures in your asm code (I see no advantage one way or the other), you could remove the centerx and centery variables and replace them with one new variable.

Let's call it "ball". Ball is at the memory location where your array of structures starts (it will be a dword variable). In your program, you called this "struct", but I think "ball" is more descriptive. The memory allocated to ball will be the number of total balls * 8 bytes (4 bytes for each coordinate).
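In C, that layout corresponds to the following (my sketch, using 32-bit integers and a hypothetical 16-ball array just to illustrate the byte offsets the asm indexing relies on):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t x;   /* at byte offset 0 within each element */
    int32_t y;   /* at byte offset 4 within each element */
} coord;

/* ball[i].x lives at base + i*8, and ball[i].y at base + i*8 + 4,
   which is exactly what ball[edi*8] and ball+4[edi*8] address in asm */
coord ball[16];
```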

Here is an excerpt of the old code:
----------------------------------

sub eax,centerx[edi*4]
jns noabs1
mov eax,centerx[edi*4]
sub eax, xctr
noabs1: mov xd, eax
mov eax, yctr
sub eax, centery[edi*4]
jns noabs2
mov eax, centery[edi*4]
sub eax, yctr

Change it to:
-------------

sub eax,ball[edi*8]
jns noabs1
mov eax,ball[edi*8]
sub eax, xctr
noabs1: mov xd, eax
mov eax, yctr
sub eax, ball+4[edi*8]
jns noabs2
mov eax, ball+4[edi*8]
sub eax, yctr

Note that I changed all "centerx[edi*4]"s to "ball[edi*8]"
and all "centery[edi*4]"s to "ball+4[edi*8]".

Observe the example that showed how the array is stored (structures vs. two arrays) and look carefully to see why this is the necessary change in code.

I can't promise that this will be faster than your C code. Remember, compilers nowadays are very efficient!! I provided this code for educational purposes...this code may not even be close to optimum, since I am relatively new to 80x86 assembly programming. However, now that you have the assembly language equivalent of the algorithm, you can look it over to see how it works and find any places where there is room for optimization.

If this code DOES work though, there will be quite a gain in performance due to using the integer square root. Implementing an integer square root function in C or pascal is much more difficult, and this is one area where assembly language can be more convenient.

The meat of the integer square root is in the loop1..done section of the code I gave you.
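In C terms, that Newton iteration is equivalent to something like this sketch:

```c
/* integer square root by Newton's method: returns floor(sqrt(n)).
   Start the guess at n and repeat guess = (guess + n/guess)/2 until
   the guess stops shrinking. */
unsigned long isqrt(unsigned long n)
{
    unsigned long guess, next;
    if (n == 0)
        return 0;
    guess = n;                      /* any starting guess >= sqrt(n) works */
    for (;;) {
        next = (guess + n / guess) / 2;
        if (next >= guess)          /* stopped shrinking: converged */
            return guess;
        guess = next;
    }
}
```

For the values this algorithm sees (at most 320^2 + 200^2 = 142400), the loop settles in well under the 20-iteration cap used in the asm.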

If you are going to use the centerx and centery arrays in the assembly program, you also have to use two separate arrays in your c program!! And if you are going to use structs in c, you have to use the implementation of "ball" in the assembly program. Just in case you were confused...

total_distance += (100/distance_formula(x, y, struct[ctr].x, struct[ctr].y));
Apparently total_distance is not a distance. It is the sum of the inverses of distances.
Am I right?

What does the total_distance argument to the put_pixel function mean?
put_pixel(x,y,total_distance);
Does it specify the color of the pixel? What is its biggest possible value?

What are these structs indexed by ctr?
What is the struct table size?

Are all x,y coordinates integer values?

As optimizations, remove the indexed access into the struct table,
replace the "100/" by "1/" and multiply the accumulated sum by 100 once at the end.
If possible, I would try to build a table holding all possible 1/distance_formula values,
using dx,dy as the two-dimensional table indices.

Depending on the number of struct entries and the possible distance values, you may
convert these distance values into integers without losing precision. Integer addition
and operations are always faster.

Apparently for a total_distance smaller than 10 you don't change the pixel value.
It might be possible to define an enclosing box or domain a priori where pixels will not be set, so you can avoid examining those pixels.

For instance, each ctr point has a box. All points beyond its limit would yield a total_distance smaller than 10. You would then compute the union of all these boxes to define a bigger box. You then know that all pixels outside this bigger box will have a total_distance smaller than 10.

If you use a table of all dx,dy values, you can use pointers for each ctr and make them
move through the distance table by a single step. Summing up the values pointed to by each
pointer would give you the total_distance.

Your question is also ambiguous about what you really want. Do you want to optimize your program to make it run faster, or do you simply want to rewrite it in assembly?
You COULD get a much bigger speed increase by optimizing your algorithm than by rewriting your code in assembly.

For now I don't have enough information to go further in answering your question.
For instance, are the ctr points also moving points, implying refreshing your screen after
each drawing?

Scrapdog you didn't catch my suggested optimization.

What you store in your table are 100/sqrt(dx*dx+dy*dy) for every possible dx,dy.
By using the enclosing box merging you could define a smaller box to examine.

To avoid indexed access into a two-dimensional array, which requires address computation, use pointers: one pointer for each vertex.

Thus when you examine pixel x,y, you will have a specific dx,dy for every vertex.
To compute the total_distance you simply add all the values referenced by these pointers.
Updating the pointers is fast because when x is incremented, dx will be incremented or decremented. It thus simply means incrementing or decrementing the pointer address.
Scan-line processing will be fast.
Moving to the next scan line will imply recomputing the pointer locations.

So in summary there are three optimizations:
1. identify the enclosing box,
2. use a table holding 100/sqrt(dx*dx+dy*dy) so that computing total_distance is reduced
to computing a sum,
3. use one pointer per vertex to get its distance value, which you can update by a simple increment or decrement.

Of course in assembly it will be even faster.
But a C program using these optimizations will be faster than an ASM program using your standard algorithm. ;-)

> Scrapdog you didn't catch my suggested optimization.
>
> What you store in your table are 100/sqrt(dx*dx+dy*dy) for every possible dx,dy.
> By using the enclosing box merging you could define a smaller box to examine.

We have already established that using a table of this size is not feasible. Note that 320^2 + 200^2 = 142400 values in the table. Multiply that by 4 bytes for double words, and you have a 569k table.

>To avoid indexed access into a two dimentional array which requires address computation use
>pointers. One pointer for each vertex.

I don't recall using a two dimensional array of any kind in my suggested optimization, other than the one for the virtual screen. It is necessary to use a two dimensional array for the virtual screen, as it is in cartesian coordinates. If you were to use a one dimensional array, say from 0 to 63999, you would have to convert this to an x and y (to determine distance) in the INNER loop, which would slow it down.

Also, storing a pointer for each center (I assume this is what you mean by "vertex") would still require you to retrieve the pointer from memory...there is an arbitrary number of meta-balls, we can't get by using the registers to store the pointers.

>Thus when you examine pixel x,y, you will have a specific dx,dy for every vertex.

Yes. If you look at my algorithm, dx = abs(x-centerx), dy = abs(y-centery).

>To compute the total_distance you simply add all the values referenced by these pointers.
>Updating the pointers is fast because when x is incremented, dx will be incremented or
>decremented. It thus simply means incrementing or decrementing the pointer address.

If you increment dx by an integer, you will lose precision. Floats or fixed point integers must be used in the inner loop to do it this way. Floats/fixed points cost more than integers, especially in an inner loop.

>Scan line processing will be fast.
>Moving to next scan lines will imply recomputing pointer locations.

Recomputing pointer locations is more overhead. I do not see what is wrong with dx = abs(x-centerx). It is one subtraction operation, and an absolute value operation. Using dx = dx + ddx would require the floating point addition PLUS the overhead of recomputing for each scan line.

>So in summary there are three optimizations:
>1. identify enclosing box,

I don't know exactly what you mean by this, but I attempted to scale down the sqrt table by a factor of 1/4 using shifts. The results came out a little different than expected. :)

>2. use table holding 100/sqrt(dx*dx+dy*dy) so that computing total_distance is reduced
> to compute a sum.

We used a table that stores the squares of 0 to 319. This is a relatively small table.

>3. use one pointer per vertex to get its distance value, which you can update by a
>simple increment or decrement.

A pointer to what? An element found in the square root table? This would involve an address calculation, since obviously dx can change by values less than 1 for every change in x. You would have to store it as a "floating point" pointer if you do it this way, and I don't think Intel processors use fuzzy logic :)

>Of course in assembly it will be even faster.

>But a C program using these optimizations will be faster than an ASM program using your standard
>algorithm. ;-)

Again, please read the question history! :)

A lot of your suggestions have already been implemented (or rejected).

Scratch that comment about needing a floating point to increment dx. What I meant is that you need a floating point if you were to use an increment or decrement for the distance formula. Incrementing/decrementing with integers WOULD be OK, but still there is the overhead of maintaining/recalculating/storing the individual pointers. However, if you think this would be an optimization, feel free to make the change to my assembly code to demonstrate it.

The table size is 320x200 which is 64000 cells.
dx and dy values can't be bigger than 320 for x and 200 for y dimensions.

The matrix is symmetric so you only need to store half of the values.
If you use 4 bytes per value this ends up with 32000 *4 ->128K
I agree that this may still be too big.
This is a size/speed compromise and it's up to Messiah to decide.

The speed increase would be very significant in comparison to the initially given algorithm.

Again, I'm not discussing your square table. The pointers I'm talking about are in the table I suggest, which is a two-dimensional array.

You didn't understand the table I'm suggesting.
When computing a distance you have dx and dy as input.
These values are integers according to the definition given by messiah.

The screen is scanned by lines, so x is incremented by one unit while y remains constant. Thus for a given row, dy is constant for all vertices; only dx will change, and it changes by one for each examined pixel. If the pointer in the table points to cell [10,50] for instance, this cell will hold the value resulting from the following computation: 100/sqrt(10*10+50*50).

For this given vertex, when examining the next screen pixel in the scan line, you will have to find the value at [11,50] or [9,50]. See? So if you store values by rows in your table, you only need to increment or decrement the pointer by one.
I agree that when you need to skip to the next scan line you will need to recompute the pointer location according to the dx,dy values. But this can also be optimised and reduced to a simple addition.

Apparently you still don't understand my optimization.

Just a comment:
Why are you wasting time proving who's right? Why not find a good algorithm...
Usually (with some knowledge of mathematical assembler operations in the background -
which apparently everybody here has) it is enough to write down the algorithm as a formula,
try to split it up into simple mathematical assembler operations, like *, /, ..., and then implement
it. There's nothing really hard there... And 2- or 3-clock-cycle optimizations simply *won't* ever
be noticed by anyone, especially when we're talking about 320x200 - 64000 pixels, which leaves
about 73 clock cycles per pixel at 50 frames/sec.
