Solved

ASM Translation

Posted on 1998-10-16
52
687 Views
Last Modified: 2008-02-01
I am currently trying to learn some good asm optimizing techniques for a graphics program I am running, and was wondering if there was a fast way to do what I'm doing below in ASM. (I'm sure there is)

for(x=0; x < 320; x++)
for(y=0; y < 200; y++)
{
total_distance=0;
for(ctr=0; ctr < total; ctr++)
{
total_distance += (100/distance_formula(x, y, struct[ctr].x,
struct[ctr].y));
}
if(total_distance > 10)
put_pixel(x,y,total_distance);
}

I don't need help with the algorithm, the algorithm is sound for quick 2D meta-balls, I just want to know what is the best way to do this is. Basically, I figure there must be some ASM opcodes to do this faster I don't know about. The putpixel routine outputs to a virtual screen, so there's room for optimization there, so just help me out here.

Below is just the algorithm above in english:

Loop through the entire virtual screen 64000 bytes (320x200).

For every point on the screen :
Loop from 0 to a set amount of vertex's
In the loop, divide 100 by the distance from the current x,y you are at, to the vertex in the nested loop you're running. Then plot the pixel to the virtual screen.

The X and Y variables are integers (even in the structures)

The total_distance variable is a float (so if I don't see you using EAX, I'll be upset!:)
0
Comment
Question by:messiah
  • 25
  • 9
  • 5
  • +7
52 Comments
 
LVL 1

Author Comment

by:messiah
ID: 1253610
Edited text of question
0
 
LVL 6

Expert Comment

by:snoegler
ID: 1253611
The compilers (especially C/C++) today produce code which is at least to 90-95% as fast as
pure assembler language. Graphics code, for example, will only be faster as DirectX when
you are supporting either software rendering/drawing (you'll have to write different code for
most of the common display adapters) or (for 3d rendering) you'll need to implement calls to
the 3d rendering engine of the display adapter (which, of course, is also different for different
adapters). As long as you aren't coding for DOS, use DirectX. This is the most portable
method to support graphics, and it isn't hard to write - and: it's the fastest method (as long
as you don't want to write code for all common 3d/2d graphics adapters).
If you like, i can post you the assembler code for your routine (for the 320x200x256 MCGA
/VGA mode) - but you'll have to specify what distance_formula computes. A 32 bit compiler
will replace the (slow) assembler CALL to distance_formula with an inline function - which is
generally faster as the CALL function. If you don't specify what distance_formula does, you
can almost be sure that the assembler implementation CALLing this function will be slower
as the output your compiler generates.
0
 
LVL 84

Expert Comment

by:ozo
ID: 1253612
I would consider looking into ways to optimize distance_formula when called  many times with the same points or similar points.
0
 
LVL 1

Author Comment

by:messiah
ID: 1253613
snoegler:
Yeah, I'm aware that the compilers produce very fast ASM code, but that is out of the code that I'm giving it in C. I know that the function I gave you is getting compiled as efficiently as possible, but I want to know of a faster way to do it.

For example, if I wanted to set the entire virtual screen to 0 by doing for(i=0; i < 64000; i++) vscr[i] = 0;, the compiler would optimize that as much as possible, but won't be nearly as fast as _fmemset(vscr, 0, 64000);  
0
 
LVL 1

Author Comment

by:messiah
ID: 1253614
ozo:
I tried looking up variations of the distance algorithm, but couldn't find any.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253615
You could make tables for the distance formula equation...

You could make a square table for all possible values of x and y, and a second table for square roots:

sqrt_table[sqr_table[abs(x2-x1)] + sqr_table[abs(y2-y1)]]

could be an example of how you reference it.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253616
...or, for your purposes, a 100/sqrt table.  This way you won't have to use that cycle-eating divide:

distance=sqrt_100_table[sqr_table[abs(x2-x1)]+sqr_table[abs(y2-y1)]];
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253617
for (i=0; i < 319; i++) sqr_table[i] = i * i;
for (i=1; i < 319*200; i++) sqrt_table[i] = 100/sqrt(i);

for(x=0; x<320; x++)
  for(y=0; y<200; y++) {
    total_distance=0;
    for(ctr=0; ctr<total; ctr++)
      total_distance += sqrt_table[sqr_table[abs(struct[ctr].x-x)+sqr_table[abs(struct[ctr].y-y)]];
    if (total_distance>10) put_pixel(x,y,total_distance);
 }
}


Assembler is NOT going to help you speed up the algorithm.  Most of the time is spent doing the math.  The pixel plotting is trivial.  The source code I gave you above will speed it up.

Declare sqr_table as an array of integers from 0-319, and sqrt_table as an array of a floating point type.  Note that this table will contain over 141000 floats.

The first two statements will set up the necessary tables.  This only has to be done ONCE, no matter how many times repeat the rest of the program.  There will be some overhead involved in this, but once they are created you can use the tables over and over.

If memory is a concern, or the overhead involved in making the square root table, you can just use the sqr_table alone, and still notice an increase in speed.

If you want to exclude the square root table, change this line:

sqrt_table[sqr_table[abs(struct[ctr].x-x)+sqr_table[abs(struct[ctr].y-y)]];

to this:

(100/sqrt(sqr_table[abs(struct[ctr].x-x)+sqr_table[abs(struct[ctr].y-y)]);

I don't know is distance_formula in your code was a function or a macro, but as snoegler said, it is better to do without a function call.  This is a cycle eater.


0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253618
Oops!

In

for (i=1; i < 319*200; i++) sqrt_table[i] = 100/sqrt(i);

the 319 should be changed to 320.  Also make sure you use the appropriate typecasts where necessary.
0
 
LVL 1

Author Comment

by:messiah
ID: 1253619
There are a few problems with your answer, but some of it is my fault.

I am using Turbo C++ 3.0 for DOS (i know, i know, shh!), anyway, I already have a 64000 byte array for my virtual screen.

If you can tell me a way to :

1) Allocate enough memory to hold an array of 64000 floats, which is equivalent to making 4 more virtual screens.

2) Allocate enough memory to hold an array of 320 longs
(to hold the square table, because no 16bit integer is going to hold 320^2

Your suggestion would increase speed, but HOG memory.

If I could just hog all the memory I wanted, the fastest way to do what you're telling me is to record the distance from every point to every point, in an array that would be equivalent to 64,000 virtual screens. (64,000x64,000 bytes).

My goal is to incorperate the "cycle eating" algorithm, and pixel plotting into one hunk of ASM code.
0
 
LVL 6

Expert Comment

by:snoegler
ID: 1253620
If i was you, i would stop using floating point and move to 32 bit integer maths (fixed-point).
Of course, you're using BC 3.0, but when you are writing an (external) assembler module
for your purpose, you can use the 32 bit operators (MOV EAX,...). They're legal in real mode,
as well as in VM or protected mode.
Integer maths is of course faster as floating point, and as (almost) all MMX capable processors
have 2 integer pipelines, this will double the speed again.
There is a very fast SQRT integer algorithm, if you like, i'll post it. (I would do it right now, but
i need to search my harddisk for it).
0
 
LVL 1

Expert Comment

by:TheMadManiac
ID: 1253621

for those features you don't need an external asm file. Just set the compiler to generate 386 code (altho still real mode!) and do an 'asm .386' at the start of your 386 code.. this allows inline 386 asm.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253622
Is your goal to do it faster, or just to use asm for the sake of using asm?


Doing it with 64000 virtual screens (i.e. figuring distance from every point on screen to every other point on screen) would be fast, but the overhead to calculate this would probably take 20 minutes on a 300Mhz computer.  The square root table I proposed would take one array of 142240 floats (320^2 + 200^2) floats, or 8 bytes * 142240 = approximately 1.1 MB of memory.  You could skip the square root table and end up using only about 1 K for the square tables, and still notice quite a gain in speed.
 
0
 
LVL 1

Expert Comment

by:mlaiosa
ID: 1253623
Show me the assembly code you have so far.  Im *very* good at asm optimization, but not so good at translating from C
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253624
For the total distance += distance line:

Let xd be a variable containing abs(x-struct[ctr].x)  
Let yd be a variable containing abs(y-struct[ctr].y)  
    numerator is an integer variable containing 100
     
                mov esi, xd                
            fld     [dword ptr esi*4+sqr_table]        
                mov esi, yd
                fld     [dword ptr esi*4+sqr_table]
                fadd
            fsqrt
            fild numerator
            fdivr
            fld total_dist
            fadd
            fst total_dist


I don't know much about using floating point instructions, so I could be wrong.  mlaiosa look this over.

scrapdog
0
 
LVL 1

Author Comment

by:messiah
ID: 1253625
I am not familiar with floating point opcodes at all. I tried the above, and got several overflow errors etc. partaining to the way you're indexing the square table.

Also "fld total_dist" and "fst total_dist" get similar errors. What do these opcodes do anyhow? :)
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253626
fld and fst load and store floating point numbers.

Make sure your square table contains 32-bit floats.

0
 
LVL 7

Expert Comment

by:faster
ID: 1253627
This part of C code is too simple, changing it to assembly will hardly have any effect.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253628
That is what I was trying to say in the first place :)
0
 
LVL 6

Expert Comment

by:snoegler
ID: 1253629
Converting to integer maths could possibly get 10-20% out of it.
0
 
LVL 32

Expert Comment

by:jhance
ID: 1253630
You've asked an impossible question, "speed up my code but don't change the algorithm."  The overhead of doing a for loop in C/C++ vs. ASM is probably close to 0% with any reasonable compiler.  Same with the put_pixel(), it's not in the inner loop, who cares how long it takes!!!

You need to analyze what is going in inside the _INNER_ the loop.  Just a quick glance shows me that you are performing a function call AND FLOATING POINT DIVISION INSIDE THE LOOP!!!!  Stop this, if it must run fast get that division out of there and do ALL inner loop math in INTEGER!  If you need to total stuff up, do it all and then do the division once for the entire calculation.  

It's a common misconception that ASM is always faster than C/C++/Pascal/etc...  This is just not so.  Algorithm design usually has much more to do with speed than the language itself.
0
 
LVL 1

Author Comment

by:messiah
ID: 1253631
If you're going to tell me that my question is impossible, then don't try to "answer" the question for points telling me it is impossible. And I know what the common misconceptions are with C vs. ASM, and I didn't say "speed up my code by don't change the algorithm", I said I wasn't looking for help with the actual algorithm, although as scrapdog stated, it looks like there is room for optimzation there as well with some sort of integer square root algorithm I've never heard of.

I simply wanted to convert the whole entire loop, put pixel routine and the algorithm to one ASM function, but for some odd reason, you guys feel like lecturing me on why I shouldn't ask this question, the question is asked, and the question is worth a lot.

If it makes you feel better if I say that I'm trying to do this program in PURE ASM with a PURE ASM compiler, will that persuade you to answer the question? Scrapdog is the only one who attempted, but for some reason his code isn't working for me. But if this question keeps getting debated instead of answered, Scrapdog gets the points.

THANK YOU.
=)
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253632
I could do a better job of writing the code if I had a better idea of its context, i.e. the surrounding source code.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253633
Also I have come across a very fast inverse square root algorithm on the net.  If you are interested, I will download that and make the necessary changes to it (if I can).
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253634
This is not really an answer to your question, but in attempting to optimize your code, I discovered (by accident) something really "neat-o"!

Try this:

int i,x,y,ds,total_dist,hp,xd,yd,ctr;
int sqr_table[320];
int sqrt_table[8901];


for (i=0; i<320; i++) sqr_table[i] = i*i;
for (i=1; i<=8900; i++) sqrt_table = (int)(100/sqrt(i));
for (x=0; x<320; x++)
  for(y=0; y<200; y++) {
    total_dist = 0;
    for (ctr = 0; ctr<total; ctr++) {
      xd=abs(struct[ctr].x-x);
      yd=abs(struct[ctr].y-y);
      hp=sqr_table[xd]+sqr_table[yd];
      total_dist += sqrt_table[hp>>2]<<2;
    }
    if (total_dist > 10) putpixel(x,y,total_dist);
  }


I implemented it in pascal, and converted it to C for you, so there might be some compiler errors.  When I ran it in Pascal, it looked pretty freaky!

scrapdog
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253635
The results look better with 5 or less balls.
0
Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 84

Expert Comment

by:ozo
ID: 1253636
Did you mean sqrt_table[hp>>4]<<2; ?
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253637
No, I scaled the square root table to 1/4 of the whole screen.  I do believe I should use 2.  However, maybe I'm wrong.  I didn't spend too much time figuring this out so I might have overlooked something.

0
 
LVL 84

Expert Comment

by:ozo
ID: 1253638
unless your balls are all at x=160,y=100 it looks ike your sqrt_table index may overflow...
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253639
The sqrt_table will never overflow no matter where the balls are.

Maximum x = 319
Minimum struct.x = 0

Maximum difference = 320

Maximum y = 199
Minimum struct.y = 0

Maximum difference = 200

320/4 = 80 ^ 2 = 6400
200/4 = 50 ^ 2 = 2500

The maximum xdistance^2 is 6400
The maximum ydistance^2 is 2500

The index is the sum of these two, so the index will never exceed 8900.

So the index of sqrt table will never go over 8900.
0
 
LVL 84

Expert Comment

by:ozo
ID: 1253640
so, did you mean to say
    hp=sqr_table[xd>>2]+sqr_table[yd>>2];
    total_dist += sqrt_table[hp]<<2;
0
 
LVL 1

Author Comment

by:messiah
ID: 1253641
uhm :) that algorithm doesn't even resemble a meta ball :)
0
 
LVL 84

Expert Comment

by:ozo
ID: 1253642
Does it start to resemble if you take out the [hp>>2]<<2, and use, say:
sqrt_table = (int)(1600/sqrt(i));
.
if (total_dist > 160) putpixel(x,y,total_dist>>4);
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253643
All this does is decrease the resolution of my algorithm and slow it down with more shifts.

Messiah: are you planning to call your algorithm repeatedly to simulate moving meta-balls?  If so, there is definitely room for optimization if you are going to repeat it
0
 
LVL 5

Accepted Solution

by:
scrapdog earned 400 total points
ID: 1253644
I made an ATTEMPT to convert your algorithm to assembly language.  I tried to implement an integer square root.  I wouldn't cut and paste this "as is" if I were you...I would look through it and look at the logic.  I am not all that used to using Intel assembler directives, so I might have forgot something.  You might need to add a few "dword ptrs" here and there...

Also, I didn't test it, as this would require me to build an entire program that manipulates the screen.

This algorithm assumes that you have already set up your sqr_tables, arrays of 320 dwords.  If you do not wish to set up sqr_tables, remove the references to it and replace it with the appropriate mul instruction (I advise against this however).

Also I didn't set up the variables for you.  You will have to do this yourself, depending on the assembler you are using.


Here are the variables:

total        dword
total_dist   dword
centerx      array of dwords
centery      array of dwords
xctr         dword
yctr         dword
strctr       dword
vscreen      array of bytes
sqr_table    array of dwords


Here is the algorithm:

pushad
mov esi, 319
xloop:  mov xctr, esi
mov edi, 200
yloop:  mov yctr, edi
mov total_dist, 0
mov edi, total

strloop:  mov strctr, edi
mov eax, xctr
sub eax,centerx[edi*4]
jns noabs1
mov eax,centerx[edi*4]
sub eax, xctr
noabs1:   mov xd, eax
mov eax, yctr
sub eax, centery[edi*4]
jns noabs2
mov eax, centery[edi*4]
sub eax, yctr
noabs2:   mov esi,eax
mov eax, sqr_table[esi*4]
mov esi,xd
add eax, sqr_table[esi*4]
shl eax,16
and eax, $7fffffff
je nosqrt

mov ebp, eax
loop1:    mov cx,0
mov ebx,eax
div ebp
add eax, ebx
shr eax,1
cmp eax, ebx
je done
inc cx
cmp cx,20
jl loop1
done:     shr eax,16
mov ebx,100
xchg eax,ebx
div ebx
add total_dist, eax
nosqrt:   mov edi, strctr
dec edi
jge strloop

mov eax, total_dist
cmp eax,10
jl noplot
mov ebx, yctr
shl ebx, 8
mov esi, yctr
shl esi, 6
add esi,ebx
add esi, xctr
mov byte ptr vscreen[esi], al

noplot: mov esi, yctr
dec esi
jge yloop
mov edi, xctr
dec esi
jge xloop

popad


scrapdog
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253645
oops

mov edi, 200   should be  mov edi,199
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253646
oops again

loop1:    mov cx,0
mov ebx,eax

should be

mov cx,0
loop1:  mov ebx, eax

D'oh!!
0
 
LVL 1

Expert Comment

by:newexpert
ID: 1253647
messiah: If you can optimize asm so well why don't you just generate an assembly output of your C program with -S (BCC), optimize the part you want, then include them inline with the rest of C code?
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253648
>Show me the assembly code you have so far.  Im *very* good at
>asm optimization, but not so good at translating from C

That was mlaiosa that said that, not messiah   :)

0
 
LVL 1

Author Comment

by:messiah
ID: 1253649
Yes, I am going to be repeating this loop over and over to simulate moving meta balls.

The centerx, centery arrays, I presume those are the individual meta ball points? If so, one thing I don't know how to do is use structures in ASM. My "centerx, centery" variables are:

typedef struct
{
int x, y;
} coord;

coord centerx[meta_balls];

That answer is almost complete, you'll be getting credit scrapdog.
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253650
CenterX and CenterY are arrays of the x and y values of each of the balls.  The are arrays of double words (32 bit integers).

Example:

centerx[10]  would be equivalent to  struct[10].x

If you were to use one array of structures, rather than two arrays (one x and one y), there would be only minor changes to make.

An array of structures would be stored like this:

x1 y1 x2 y2 x3 y3 x4 y4 ... xn yn

where these are all contiguous integers.

Two arrays would be stored like this:

x1 x2 x3 x4 x5 x6 ... xn
y1 y2 y3 y4 y5 y6 ... yn


If you really wanted to use structures in your asm code (I see no advantage one way or the other), you coud remove the centerx and centery variables and replace it with a one new variable.

Let's call it "ball".   Ball is at the memory location where your array of structures starts (it will be a dword variable).  In your program, you called this "struct", but I think "ball" is more descriptive.  The memory allocated to ball will be the number of total balls * 8 bytes (4 bytes for each coordinate).

Here is an excerpt of the old code:
----------------------------------

sub eax,centerx[edi*4]
jns noabs1
mov eax,centerx[edi*4]
sub eax, xctr
noabs1:   mov xd, eax
mov eax, yctr
sub eax, centery[edi*4]
jns noabs2
mov eax, centery[edi*4]
sub eax, yctr


Change it to:
-------------

sub eax,ball[edi*8]
jns noabs1
mov eax,ball[edi*8]
sub eax, xctr
noabs1:   mov xd, eax
mov eax, yctr
sub eax, ball+4[edi*8]
jns noabs2
mov eax, ball+4[edi*8]
sub eax, yctr

Note that I changed all "centerx[edi*4]"s to "ball[edi*8]"
and all "centery[edi*4]"s to "ball+4[edi*8]".

Observe the example that showed how the array is stored (structures vs. two arrays) and look carefully to see why this is the necessary change in code.

I can't promise that this will be faster than your C code.  Remember, compilers nowadays are very efficient!!  I provided this code for educational purposes...this code may not even be close to optimum, since I am relatively new to 80x86 assembly programming.  However, now that you have the assembly language equivalent of the algorithm, you can look it over to see how it works and find any places where there is room for optimization.

If this code DOES work though, there will be quite a gain in performance due to using the integer square root.  Implementing an integer square root function in C or pascal is much more difficult, and this is one area where assembly language can be more convenient.

The meat of the integer square root is in this part of the code I gave you:

mov ebp, eax
loop1:    mov cx,0
mov ebx,eax
div ebp
add eax, ebx
shr eax,1
cmp eax, ebx
je done
inc cx
cmp cx,20
jl loop1


Hope this helps a bit!!!

Scrapdog
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253651
Note:

One thing I didn't make clear!!

If you are going to use the centerx and centery arrays in the assembly program, you also have to use two separate arrays in your c program!!  And if you are going to use structs in c, you have to use the implementation of "ball" in the assembly program.  Just in case you were confused...
0
 
LVL 1

Expert Comment

by:meessen
ID: 1253652
I don't understand the algorithm.

What is a meta ball ?

What is this distance_formula ?

total_distance += (100/distance_formula(x, y, struct[ctr].x,  struct[ctr].y));
Apparently total_distance is not a distance. It is the sum of the inverse of a distance.
I'm I right ?

What means this total_distance argument to the put_pixel function ?
put_pixel(x,y,total_distance);
Does it specify the color of the pixel ? What is it's biggest possible value ?

What are these structs indexed by ctr ?
What is the struct table size ?

Are all x,y coordinates integer values ?

As optimizations, remove the indexed access into struct table,
replace the "100/" by "1/" and multiply the result only by 100.
I would try if possible to build a table holding all possible 1/distance_formula values.
using dx,dy as the 2 dimentional table entries.

According to the number of struct entries and the possible distance values you may
convert these distance values into integers without loosing precision. Integer addition
and operations is always faster.

Apparently for a total_distance bigger smaller than 10 you don't change the pixel value.
It might be possible to define an enclosing box or domain a priori where pixel will not be set. So you can avoid examining these pixels.

For instance each ctr points has a box. All points beyond the limit would yield a total_distance smaller than 10. You would then compute the union of all these boxes and define a bigger box. You then know that all pixel out of this bigger box will have a total_distance bigger than 10.

If you use a table of all dx,dy values you can use pointers for each crt and make them
move into the distance table by a single step. Summing up the values pointed by each
pointer would give you the total_distance.

Your question is ambigus also on what you really want. DO you want to optimize your program to make it run faster or do you simply want to rewrite it in assembly ?
You COULD get much more speed increase by optimizing your algorithm than rewriting your code in assembly.

For now I don't have enough information to go further in answering your question.
For instance, are the crt points also moving points implying refreshing your screen after
each drawing ?
0
 
LVL 1

Expert Comment

by:meessen
ID: 1253653
I don't understand the algorithm.

What is a meta ball ?

What is this distance_formula ?

total_distance += (100/distance_formula(x, y, struct[ctr].x,  struct[ctr].y));
Apparently total_distance is not a distance. It is the sum of the inverse of a distance.
I'm I right ?

What means this total_distance argument to the put_pixel function ?
put_pixel(x,y,total_distance);
Does it specify the color of the pixel ? What is it's biggest possible value ?

What are these structs indexed by ctr ?
What is the struct table size ?

Are all x,y coordinates integer values ?

As optimizations, remove the indexed access into struct table,
replace the "100/" by "1/" and multiply the result only by 100.
I would try if possible to build a table holding all possible 1/distance_formula values.
using dx,dy as the 2 dimentional table entries.

According to the number of struct entries and the possible distance values you may
convert these distance values into integers without loosing precision. Integer addition
and operations is always faster.

Apparently for a total_distance bigger smaller than 10 you don't change the pixel value.
It might be possible to define an enclosing box or domain a priori where pixel will not be set. So you can avoid examining these pixels.

For instance each ctr points has a box. All points beyond the limit would yield a total_distance smaller than 10. You would then compute the union of all these boxes and define a bigger box. You then know that all pixel out of this bigger box will have a total_distance bigger than 10.

If you use a table of all dx,dy values you can use pointers for each crt and make them
move into the distance table by a single step. Summing up the values pointed by each
pointer would give you the total_distance.

Your question is ambigus also on what you really want. DO you want to optimize your program to make it run faster or do you simply want to rewrite it in assembly ?
You COULD get much more speed increase by optimizing your algorithm than rewriting your code in assembly.

For now I don't have enough information to go further in answering your question.
For instance, are the crt points also moving points implying refreshing your screen after
each drawing ?
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253654
meessen:

Please read the question history.

By the way, using 100 * 1/x is the exact equivalent as 100/x.  Both involve a multiply and a divide.  This is no optimization.

scrappy
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253655
In fact, the 100/x is faster, now that I look at it.

mov ebx,100
xchg eax,ebx
div ebx

(one divide)

as opposed to

mov ebx,1
xchg eax,ebx
div ebx
mul 100

(two multiply operations)
0
 
LVL 1

Author Comment

by:messiah
ID: 1253656
gracias.. this was good information..
0
 
LVL 1

Expert Comment

by:meessen
ID: 1253657
Scrapdog you didn't catch my suggested optimization.

What you store in your table are 100/sqrt(dx*dx+dy*dy) for every possible dx,dy.
By using the enclosing box merging you could define a smaller box to examine.

To avoid indexed access into a two dimentional array which requires address computation use pointers. One pointer for each vertex.

Thus when you examin pixel x,y, you will have a specific dx,dy for every vertex.
To compute the total_distance you simply add all referenced values by these pointers.
Updating the pointers is fast because when xis incremented, dx will be incremented or decremented. It thus simply mean increment or decrement pointer address.
Scan line processing will be fast.
Moving to next scan lines will imply recomputing pointer locations.

So in summary there are three optimizations:
1. identify enclosing box,
2. use table holding 100/sqrt(dx*dx+dy*dy) so that computing total_distance is reduced
   to compute a sum.
3. use one pointer per vertex to get it's distance value and that you can update by a        simple increment or decrement.

Of course in assembly it will be even faster.
But a C porgram using these optimizations will be faster than an ASM program using your standard algorithm. ;-)




0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253658
> Scrapdog you didn't catch my suggested optimization.
>
> What you store in your table are 100/sqrt(dx*dx+dy*dy) for every possible dx,dy.
> By using the enclosing box merging you could define a smaller box to examine.

We have already established that using a table of this size is not feasible.  Note that 320^2 + 200^2 = 142400 values in the table.  Multiply that by 4 bytes for double words, and you have a 569k table.

>To avoid indexed access into a two dimentional array which requires address computation use
>pointers. One pointer for each vertex.

I don't recall using a two dimensional array of any kind in my suggested optimization, other than the one for the virtual screen.  It is necessary to use a two dimensional array for the virtual screen, as it is in cartesian coordinates.  If you were to use a one dimensional array, say from 0 to 63999, you would have to convert this to an x and y (to determine distance) in the INNER loop, which would slow it down.

Also, storing a pointer for each center (I assume this is what you mean by "vertex") would still require you to retrieve the pointer from memory...there is an arbitrary number of meta-balls, we can't get by using the registers to store the pointers.

>Thus when you examin pixel x,y, you will have a specific dx,dy for every vertex.

Yes.  If you look at my algorithm, dx = abs(x-centerx), dy = abs(y-centery).

>To compute the total_distance you simply add all referenced values by these pointers.
>Updating the pointers is fast because when xis incremented, dx will be incremented or
>decremented. It thus simply mean increment or decrement pointer address.

If you increment dx by an integer, you will lose precision.  Floats or fixed point integers must be used in the inner loop to do it this way.  Floats/fixed points cost more than integers, especially in an inner loop.

>Scan line processing will be fast.
>Moving to next scan lines will imply recomputing pointer locations.

Recomputing pointer locations is more overhead.  I do not see what is wrong with dx = abs(x-centerx). It is one subtraction operation, and an absolute value operation.  Using dx = dx + ddx would require the floating point addition PLUS the overhead of recomputing for each scan line.

>So in summary there are three optimizations:
>1. identify enclosing box,

I don't know exactly what you mean by this, but I attempted to scale down the sqrt table by a factor of 1/4 using shifts.  The results came out a little different than expected.  :)

>2. use table holding 100/sqrt(dx*dx+dy*dy) so that computing total_distance is reduced
>   to compute a sum.

We used a table that stores the squares of 0 to 319.  This is a relatively small table.

>3. use one pointer per vertex to get it's distance value and that you can update by a        
>simple increment or decrement.

A pointer to what?  An element found in the square root table?  This would involve an address calculation, since obviously dx can change by values less than 1 for every change in x.  You would have to store it as a "floating point" pointer if you do it this way, and I don't think Intel processors use fuzzy logic :)

>Of course in assembly it will be even faster.

>But a C porgram using these optimizations will be faster than an ASM program using your standard
>algorithm. ;-)

Again, please read the question history! :)

A lot of your suggestions have already been implemented (or rejected).


scrapdog

0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1253659
Scratch that comment about needing a floating point to increment dx.  What I meant is that you need a floating point if you were to use an increment or decrement for the distance formula.  Incrementing/decrementing with integers WOULD be OK, but still there is the overhead of maintaining/recalculating/storing the individual pointers.  However, if you think this would be an optimization, feel free to make the change to my assembly code to demonstrate it.


0
 
LVL 1

Expert Comment

by:meessen
ID: 1253660
As I said you didn't understand my suggestion.

The table size is 320x200 which is 64000 cells.
dx and dy values can't be bigger than 320 for x and 200 for y dimensions.

The matrix is symetric so you only need to store half of the values.
If you use 4 bytes per value this ends up with 32000 *4 ->128K
I agree that this may still be too big.
This is a size/speed compromise and it's up to Messiah to decide.

The speed increase would be very important in comparision to the initially given algorithm.

Again I'm not discussing your square table. Pointers I'm talking about are in the table I suggest which is a two dimensional array.  

You didn't understood the table I'm suggesting.
When computing a distance you have dx and dy as input.
These values are integer according to the definition given by messiah.

The screen is scanned by lines. So x is incremented by one unit while y remains constant .Thus for a given row, dy is contant for all vertexes. Only dx will change. dx will change by one for each examined pixel. If the pointer in the table points to cell [10,50] for instance, this cell will hold the value resulting from the following computation: 100/(10*10+50*50).

For this given vertex, when examining the next screen pixel in a scan line you will have to find the value [11,50] for instance or [9,50]. See ? So if you store values by rows in your table you only need to increment or decrement by one the pointer value.
I agree that when you need to skip to the next scan line you will need to recompute the pointer location according to the dx,dy value. But this can also be otpimised and reduced to a simple addition.

Apparently you still don't understand my optimization.
0
 
LVL 6

Expert Comment

by:snoegler
ID: 1253661
Just a comment:
Why are you wasting time proving who's right? Why not finding some good algorithm ...
Usually (with some knowledge of mathematic assembler operations in the background -
which apparently everybode here has) it is enough to write down the algorithm as formula,
try to split it up into simple mathematic assembler operations, like *,/, ... , and then implement
it. There's nothing real hard there ... And 2 or 3 clock cycles optimizations simply *won't* ever
be realized by anyone, especially when we're talking of 320x200 - (64000 pixels, this makes
about 73 clock cycles per pixel for 50 frames/sec).
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

An Outlet in Cocoa is a persistent reference to a GUI control; it connects a property (a variable) to a control.  For example, it is common to create an Outlet for the text field GUI control and change the text that appears in this field via that Ou…
Windows programmers of the C/C++ variety, how many of you realise that since Window 9x Microsoft has been lying to you about what constitutes Unicode (http://en.wikipedia.org/wiki/Unicode)? They will have you believe that Unicode requires you to use…
The goal of this video is to provide viewers with basic examples to understand opening and writing to files in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to create, access, and change arrays in the C programming language.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now