Optimizing small ASM code

Hi experts.
I'm a complete newbie in assembler for Windows, and these two sources are my first assembler code for Windows.
Whoever can optimize this code will take the points.
I'm using the MASM assembler with the WinASM IDE.

The first code converts an 8-bit grayscale image to a 24-bit grayscale image.
The source image is passed to the ASM procedure as a pointer to a byte array in memory that contains the pixel colors. Each byte represents a single pixel, with its R, G and B values all equal to that byte. The second pixel's color is stored in the second byte, and so on. I call this ASM code from Visual Basic using CallWindowProc.
How this code works: get the first byte of the source byte array and copy it 3 times, into the first, second and third bytes of the destination byte array. Then get the second byte of the source byte array and copy it into the 4th, 5th and 6th bytes of the destination byte array.
Like this:
Source byte array values: 05 10 D2 50 28 75 F2 A8
This procedure fills destination byte array as: 05 05 05 10 10 10 D2 D2 D2 50 50 50 28 28 28 75 75 75 F2 F2 F2 A8 A8 A8
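For reference, the same conversion can be written as a plain C loop. This is only a sketch to check the assembly output against; the function name `gray8_to_gray24` is made up for illustration and is not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand n 8-bit gray pixels into n 24-bit pixels (three equal bytes each).
   dst must have room for 3 * n bytes. Returns the number of pixels converted.
   Only single-byte stores are used, so nothing is written past dst[3*n - 1]. */
static size_t gray8_to_gray24(const uint8_t *src, size_t n, uint8_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        dst[3 * i]     = src[i];   /* first channel byte  */
        dst[3 * i + 1] = src[i];   /* second channel byte */
        dst[3 * i + 2] = src[i];   /* third channel byte  */
    }
    return n;
}
```

Called with the eight sample source bytes above, it fills the destination with the 24 bytes listed above.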
Some comments are in Bulgarian - excuse me for that. I hope that you can understand ASM code without any comments:

; Opcodes:
; ; OpCodes
; 8B54240CB9000000008B7424108BFE8B068BD881E3000000FF23460125FF00000023460225FF00000033C3890746464647413BCA75D98BC2C3

.586
.model flat,stdcall
option casemap:none

.code
Start:

mov ebx,[esp+8] ; Pointer to the first element of the source byte array
mov edx,[esp+12] ; Number of loops (number of bytes in the source byte array)
mov ecx,0 ; Loop counter
mov esi, [esp+16] ; Pointer to the first element of the destination byte array (the fourth parameter, where the data will be stored)
mov edi,esi ; edi will be the counter of the element that must be changed

Circle:
mov eax, [ebx] ; Get 4 bytes (the first through the fourth) from the source byte array into eax
mov [esi],eax ; Put these 4 bytes into the destination byte array at the current offset
mov [esi+1],eax
mov [esi+2],eax
inc ebx ; Go to the next byte in the source byte array
inc esi ; Go to the next 3 bytes in the destination byte array
inc esi
inc esi ; Point to the next index that must be changed
inc ecx ; Increment the counter
cmp ecx,edx ; Have we reached the number of loops yet?
jnz Circle ; Jump to Circle if ecx is not equal to edx

Exit:

mov eax,edx ; *** Return number of loops performed

ret

End Start

The second code does the inverse operation - it converts a 24-bit image to an 8-bit image - every 3 bytes of the 24-bit source byte array are AND-ed with each other and saved into the 8-bit destination byte array.
Whoever optimizes this code will take the points and will be welcome to optimize other code.
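As a reference for this inverse operation, here is a minimal C sketch of the AND-ing described above. The function name `rgb24_and_to_gray8` is hypothetical; the assembly for this second routine is not shown in the post, so this only illustrates the stated behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Collapse n 24-bit pixels into n 8-bit values by AND-ing the three
   channel bytes of each pixel. src holds 3 * n bytes, dst holds n bytes.
   dst may be the same buffer as src: byte i is written only after
   bytes 3i..3i+2 have been read, and 3i >= i. */
static size_t rgb24_and_to_gray8(const uint8_t *src, size_t n, uint8_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[3 * i] & src[3 * i + 1] & src[3 * i + 2]);
    return n;
}
```

Because the write position never runs ahead of the read position, the result can be stored back into the start of the source array to save memory.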

Also, are there any special exceptions or caveats when my source byte array or destination byte array is not 32-bit aligned in memory?

Please don't laugh at my code - it is my first ASM code for Windows. The previous ASM I wrote was in early 1985, for the 6502 processor (ATARI 600XL, APPLE II...).

Thanks in advance.

Dancie

Something like this will run faster:

.586
.model flat,stdcall
option casemap:none

.code

Start:

mov ebx,[esp+8]
mov eax,[esp+12]      ;use eax instead of edx for the count and you save the last instruction
mov esi, [esp+16]
xor ecx,ecx      ;this is the same as mov ecx,0, but it aligns the loop below
            ;on a 4-byte boundary, which increases the speed of the loop;
            ;it also separates the load of esi from the use of the register,
            ;which lets the instructions execute in the same clock on a Pentium
mov edi,esi      ;why is this needed? If you remove it, two bytes should take its
            ;place to keep the loop below aligned.
Circle:
mov edx, [ebx]
inc ebx             ;placed here so that it can pair on a Pentium with the
mov [esi],edx      ;instruction before it; on a P4, add ebx,1 seems to be faster than inc ebx
mov [esi+1],edx
mov [esi+2],edx
inc ecx            ;placed here so that it can pair on a Pentium
add esi,3            ;an add takes one clock cycle, and so does an inc
cmp ecx,eax
jnz Circle

Exit:            ;eax returns the number of loops performed
ret

End Start

END

grg99

Does your code run correctly?

It's doing some mighty dubious loading and storing of 32-bit values to/from successive bytes.

I would do something like this: it does FOUR times the work per loop and only does 1/4 of the memory accesses.

Circle:
mov eax,[esi] ; get FOUR 8-bit values at once
irp R,<b,c,d>
mov e&R&x,eax
endm ; end irp actually

Bump=0
irp R,<a,b,c,d>
mov &R&h,&R&l ; extend value into all three bytes
sal e&R&x,16
or &R&l,&R&h
mov [edi+Bump],e&R&x ; store result
Bump=Bump+3
endm ; end irp actually

lea edi,[edi+3*4]
lea esi,[esi+4]
dec ebp
jnz Circle
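The four-pixels-per-iteration idea can also be sketched in C. This is not a translation of the macro code above: it is one possible variant that reads four source pixels with a single 32-bit load and repacks them into three whole 32-bit words, so each 12-byte group is written with three word stores. It assumes a little-endian CPU (as on x86), and the name `gray8_to_gray24_x4` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand gray pixels four at a time. Pixels a, b, c, d become the
   12 bytes a a a b | b b c c | c d d d, packed as three 32-bit words.
   Assumes little-endian byte order. dst needs 3 * n bytes. */
static void gray8_to_gray24_x4(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t p, w[3];
        memcpy(&p, src + i, 4);                 /* alignment-safe 32-bit load */
        uint32_t a = p & 0xFF;
        uint32_t b = (p >> 8) & 0xFF;
        uint32_t c = (p >> 16) & 0xFF;
        uint32_t d = p >> 24;
        w[0] = a * 0x00010101u | (b << 24);     /* bytes: a a a b */
        w[1] = b * 0x00000101u | c * 0x01010000u; /* bytes: b b c c */
        w[2] = c | d * 0x01010100u;             /* bytes: c d d d */
        memcpy(dst + 3 * i, w, 12);             /* three word stores */
    }
    for (; i < n; i++)                          /* leftover 1..3 pixels */
        dst[3 * i] = dst[3 * i + 1] = dst[3 * i + 2] = src[i];
}
```

Unlike overlapping 4-byte stores at a 3-byte stride, this never writes outside the 3 * n destination bytes, and the stores stay 4-byte aligned whenever dst itself is.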

stefan73

Hi grg99,
> mov [edi+Bump],e&R&x ; store result
> Bump=Bump+3
Are you sure you don't get hefty penalty cycles for data misalignment?

Cheers,

Stefan

grg99

> Bump=Bump+3
>Are you sure you don't get hefty penalty cycles for data misalignment?

No worse than in the original code, which stored a 32-bit register into three overlapping places.

It might help a lot to redefine the bitmap to have 32-bit entries; that's done a lot these days.

There's also probably some MMX instruction that does just these operations, but that may be out of the scope of this query.

___XXX_X_XXX___

ASKER

Hi to all of you experts.
I will try these suggestions and will give points.

Let me explain:
grg99:
"Does your code run correctly?" - Yes, it runs correctly. This code is a kind of very simple compression for when my program needs to send data between client and server. The client sends its desktop screen to the server as an 8-bit grayscale image instead of a 24-bit color image. With the current code written by me it works well. Sorry for the lame method of storing EAX in 4 consecutive bytes in memory - this is because I don't know Assembler well, probably I don't know it at all.

"No worse than in the original code, which stored a 32-bit register into three overlapping places." - Yes, my code is not so good. I can say that it is not good at all - but this is my first code in 32-bit ASM as well. Please, don't blame me - the lame :)

"It might help a lot to redefine the bitmap to have 32-bit entries, that's done a lot these days." - I must use 24 bit because the Windows API graphics functions work with 24-bit bitmaps - 3 consecutive bytes with R, G and B values.

"There's also probably some MMX instruction that does just these operations, but that may be out of the scope of this query" - I don't know anything about MMX instructions (as you can see from my code).

Dancie:
Your code looks close to mine. I see that my code contains unnecessary operations. Code size is not critical; I will be happy with an execution time 5 or 6 ms shorter at a resolution of 1024x768 pixels.

Soon (maybe tomorrow) I will try your suggestions and will post the results here.

grg99

Oh, I forgot, there's certainly a Windows API for doing the 8-->24 mapping, and it is probably optimized better than any of us could ever do.

It may even call on the graphics accelerator chip on your video card to do it really quickly.

grg99

Depending on the speed of the connection, you may want to convert the screen data to GIF or JPEG format. Again, there are APIs for doing just these things.

___XXX_X_XXX___

ASKER

APIs for direct in-memory conversion to GIF/JPEG that will be faster than ASM code? I don't know of such APIs. I want to avoid Windows version dependency and DirectX dependency. Maybe DirectX has some functions that do this, but I don't want to use any DirectX or XP-specific APIs for that. This ASM code will execute about 10-12 times per second (on the client and on the server, with a 320x200 rectangle from the screen and a 100MBit LAN), and I don't want to replace the ASM code with additional API calls.

ASKER CERTIFIED SOLUTION

Dancie

This solution is only available to members.

grg99

Yes, a lot of Windows APIs are well written.
The graphics stuff is pretty awesome.

You're going to have *fewer* dependencies if you call an API to do this conversion,
as there are conversion functions for every supported screen resolution and bit-depth.

Also note that the Windows API is going to recognize any graphics hardware or MMX features and use them if possible.

I understand wanting to write stuff in asm for fun, but that's not always the fastest way to do things.

Regards,

grg99

___XXX_X_XXX___

ASKER

grg99:
It's not for fun ;(
My equivalent VB code does this job (from 1024x768x1 byte color to 1024x768x3 bytes color) in about 500+ ms. My ASM code does it in about 30-35 ms. I need some speed, not fun ;)

By the way, grg99, I can't compile your ASM code
irp R,<b,c,d>
with MASM. Maybe I'm too lame to understand what is wrong. An error is reported
for this code:
.586
.model flat,stdcall
option casemap:none

.code

Start:
irp R,<b,c,d>
ret

End Start

Error is:
D:\My Programs\VB.6\!__COMPUTER_CLUB_MANAGER_2__!\ASM Procedures\ConvertFrom8BitGrayScaleTo24Bit\Untitled1.asm(11) : fatal error A1008: unmatched macro nesting

The red X mark is located at the "End Start" line ?!?!? Maybe I'm too lame to understand what you mean in your code. Sorry :(

grg99: Could you write your idea (I like this -> "mov &R&h,&R&l ; extend value into all three bytes") so that I can compile it with MASM?

Dancie:
Your last (revised) ASM code works great (compared to my ASM code). Some results from GetTickCount here:

Results with my ASM code:
Source byte array size is 50 331 648 bytes -> converted to a byte array of size 150 994 944 bytes
Time reported by GetTickCount (note that the CallWindowProc call is included in the measured time) = 406 ms

Results with Dancie's ASM code after his optimization and revision:
Source byte array size is 50 331 648 bytes -> converted to a byte array of size 150 994 944 bytes
Time reported by GetTickCount (note that the CallWindowProc call is included in the measured time) = 166 ms

The program was running in the VB6 IDE (I think this doesn't matter for the execution of ASM code - it will be the same in the IDE and in the EXE); each of the two algorithms was run several times and the results averaged. Processor is P4 2.4 GHz, 512MB DDR400, OS Windows XP Pro SP1.

Thanks to all of you, experts. Dancie gets the points. As I promised, I will expect you to answer my second question about optimizing ASM code - this time for 24 to 8 bit conversion. I will ask it later (again for 500 points).

Best regards,
Georgi Ganchev

MarkSteward

I seem to be a little late, but I'd like to pick up on the conversion idea. GDI+ is available as a redistributable for Windows 98 and up, and it's extremely easy to use from VB. I'm currently only using it to convert to PNG for a mobile phone, so I haven't stress-tested it, but I reckon it would be a better solution (even at 8 BPP you're sending almost a megabyte per screen over the network, and that limits you to 16fps even with theoretically optimal conditions). And can you really guarantee 100MBps when you can't guarantee Windows version?

Oh, and Windows does use 32-bit internally for most bitmaps, so from an optimisation point of view, that would be better, along with using Windows GDI functions which are heavily optimised for specific combinations of source and destination BPP levels.

Best wishes,
Mark

___XXX_X_XXX___

ASKER

Hi Mark. The client program is installed on many machines (about 600-700+ in my country). I don't want to tell my customers "You must install GDI+ on every machine in order to use this functionality." Of course, maybe GDI+ is easier to use and more powerful, but I must keep as few dependencies as possible.

"And can you really guarantee 100MBps when you can't guarantee Windows version?"
Yes, the environment for this client-server application is a LAN, and it is 100MBit on every computer (my program is for cyber cafe management).

Thanks, Mark, for the idea. When (IF) I move to VB.NET, I will use GDI+ in my new programs for sure.

MarkSteward

Sorry, I misread 1024x768 as your resolution, instead of 320x200, so you could have 12 clients doing 17fps in ideal conditions (well, more like 5 to be realistic). Is this the program at compclub2.hit.bg?
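The frame-rate ceilings quoted in this thread can be reproduced with a few lines of arithmetic. The sketch below assumes uncompressed frames at 1 byte per pixel, no protocol overhead, and 1 Mbit = 2^20 bits (an assumption chosen because it reproduces the figures above; with 10^6 bits the results are slightly lower). The function name `max_fps` is made up for illustration.

```c
/* Upper bound on frames per second per client when `clients` machines
   share one link and each frame is width * height bytes (8 bits per pixel).
   link_mbit is the link speed in Mbit/s, with 1 Mbit taken as 2^20 bits. */
static double max_fps(double link_mbit, int width, int height, int clients)
{
    double bytes_per_sec = link_mbit * 1048576.0 / 8.0;
    double frame_bytes = (double)width * (double)height;
    return bytes_per_sec / frame_bytes / (double)clients;
}
```

Under these assumptions, `max_fps(100, 1024, 768, 1)` comes to about 16.7 and `max_fps(100, 320, 200, 12)` to about 17.1, in line with the 16fps and 17fps figures mentioned in the thread.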

Some more ideas: converting to an RLE using Windows functions will reduce bitmap size, and give you an immediate speed increase, and compressing with a storage format (like the built-in LZ32) will reduce bandwidth again. Bitmaps as a rule compress to 1/10 their size.
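To show why run-length encoding helps on screen captures with large flat areas, here is a minimal C sketch. It uses a simple (count, value) pair format invented for this example; it is NOT the Windows BI_RLE8 bitmap format, and the name `rle_encode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal run-length encoder: emits (count, value) byte pairs with
   count in 1..255. out must have room for 2 * n bytes (worst case,
   when no two neighbouring bytes are equal). Returns bytes written. */
static size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t v = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == v && run < 255)
            run++;                      /* extend the run of equal bytes */
        out[o++] = (uint8_t)run;        /* run length */
        out[o++] = v;                   /* repeated value */
        i += run;
    }
    return o;
}
```

A row of identical gray pixels shrinks to two bytes per 255 pixels, while noisy data can double in size, so the gain depends on the image content.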

Also make sure that you're using as much Windows API as possible, and doing as little manipulation in VB as possible! Are you sending screen captures? If so, I assume you're using BitBlt. Instead of using CreateCompatibleBitmap for the memory DC, use CreateBitmap(lngWidth, lngHeight, 1, 8, NULL), and then Windows will do the copying and conversion in one, faster than copying 24-bit and then converting could ever be. If it's not screen captures, could you tell us what the source of the image is?

Your original code will overrun its destination buffer, by the way: the 32-bit stores in the last iteration write up to three bytes past the end. And the reason grg99's code didn't compile was that you didn't copy it all. grg99's code still has multiple unaligned memory accesses, though, and not much pairing.

I'll post a couple of implementations to compare with win32k!vSrcCopyS8D32 and vSrcCopyS32D8.

Cheers,
Mark

MarkSteward

Also, it shouldn't need to be converted to 24-bit. When you send (or receive) the bitmap, set up the 256-colour palette to be 0x00000000, 0x00010101, etc., and Windows can use it as normal (e.g. blit it into a 24 BPP hDC).
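The gray palette described here is easy to generate. A minimal sketch in C, storing each entry as a 32-bit value 0x00RRGGBB with R = G = B; the function name `make_gray_palette` is made up for illustration, and passing the table to Windows is left out.

```c
#include <stdint.h>

/* Fill a 256-entry gray palette: entry i is 0x00RRGGBB with R = G = B = i,
   giving 0x00000000, 0x00010101, ..., 0x00FFFFFF. Since all three
   channels are equal, the channel order within the entry does not matter. */
static void make_gray_palette(uint32_t pal[256])
{
    for (uint32_t i = 0; i < 256; i++)
        pal[i] = i * 0x00010101u;
}
```

With such a palette attached, the 8-bit pixel values can be used directly as palette indices, so no per-pixel expansion to 24 bits is needed on the receiving side.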

MarkSteward

Or just let Windows deal with the palette, and choose the best colours.

Sorry for so many posts!

Mark

___XXX_X_XXX___

ASKER

Hi Mark.
Yes, my program is at compclub2.hit.bg (compclub2.tripod.com). Here is the code that gets the "screen shot" - sorry, but the remarks are in Bulgarian; I will try to translate them to English after "***"

Also, excuse me for posting this non-ASM code here...

VB6 function:

Public Function Fun_GetScreenBytes(lngLeft As Long, _
lngTop As Long, _
lngWidth As Long, _
lngHeight As Long, _
byteFormat As Byte) As Byte()

On Error Resume Next

' lngLeft and lngTop are the upper left corner of the rectangle to capture
' lngWidth and lngHeight are the dimensions of that rectangle in pixels
' byteFormat - either 1 for uncompressed 24-bit color format or 2 for 8-bit grayscale format
'KPD-Team 2000
'URL: http://www.allapi.net/
'E-Mail: KPDTeam@Allapi.net
'-> Compile this code for better performance
Dim bi24BitInfo As BITMAPINFO
Dim bBytes() As Byte
Dim iDC As Long
Dim iBitmap As Long
Dim byteConvert() As Byte
Dim byteTemp As Byte
Dim lngL As Long
Dim lngTemp As Long

' Zero sizes are not allowed for the captured rectangle
If lngWidth = 0 Then
lngWidth = 1
End If
If lngHeight = 0 Then
lngHeight = 1
End If

' Prepare the header
With bi24BitInfo.bmiHeader
.biBitCount = 24
.biCompression = BI_RGB
.biPlanes = 1
.biSize = Len(bi24BitInfo.bmiHeader)
' The rectangle dimensions are passed in as parameters
.biWidth = lngWidth
.biHeight = lngHeight
End With

' Size the byte array that will hold the RGB color values
' so that it contains enough bytes for the requested rectangle
ReDim bBytes(0 To bi24BitInfo.bmiHeader.biWidth * bi24BitInfo.bmiHeader.biHeight * 3 - 1) As Byte
' Create a device context compatible with the display
iDC = CreateCompatibleDC(0)
' Create the DIB section
iBitmap = CreateDIBSection(iDC, bi24BitInfo, DIB_RGB_COLORS, ByVal 0&, ByVal 0&, ByVal 0&)
' Select iBitmap (the DIB section) into iDC (our display DC)
SelectObject iDC, iBitmap
' Copy the pixels from the screen (plngDisplayDC, which is GetDC(0)) at position lngLeft, lngTop
' into our iDC at position 0,0 with dimensions lngWidth and lngHeight
BitBlt iDC, 0, 0, lngWidth, lngHeight, plngDisplayDC, lngLeft, lngTop, vbSrcCopy
' Fill our byte array with the color values from iDC (the rectangle captured from the screen), i.e. the bytes that describe the color of each pixel
GetDIBits iDC, iBitmap, 0, bi24BitInfo.bmiHeader.biHeight, bBytes(0), bi24BitInfo, DIB_RGB_COLORS
' If the data must be converted in some way (for example to 8-bit)
Select Case byteFormat
Case 1 ' Uncompressed RGB format (24-bit) - no conversion is needed
' Return the display 1:1
Fun_GetScreenBytes = bBytes
Case 2 ' Grayscale format (8-bit) - must convert to 8-bit grayscale
' The 3 color bytes must be combined into one with AND (or with XOR or with OR)
' The new method with Assembler code is used now
' This CallWindowProc API call runs the ASM code stored in the pbyteASM_Convert24BitsTo8BitsGrayScale() byte array
' and passes it the parameters: how many bytes must be converted - CLng(UBound(bBytes) \ 3) - and where to store the result - VarPtr(bBytes(0)) - the result is stored in the source byte array for less memory usage
CallWindowProc ByVal VarPtr(pbyteASM_Convert24BitsTo8BitsGrayScale(0)), 0, 0, CLng(UBound(bBytes) \ 3), VarPtr(bBytes(0))
' This is the VB method that I am not using anymore:
''''' ReDim byteConvert(UBound(bBytes) \ 3) As Byte
''''' For lngL = 0 To UBound(bBytes) Step 3
''''' byteTemp = bBytes(lngL) And bBytes(lngL + 1) And bBytes(lngL + 2)
''''' byteConvert(lngTemp) = byteTemp
''''' lngTemp = lngTemp + 1
''''' Next 'lngL
''''' Fun_GetScreenBytes = byteConvert
ReDim Preserve bBytes(UBound(bBytes) \ 3) As Byte
Fun_GetScreenBytes = bBytes
End Select

' SetDIBitsToDevice puts the colors on screen on the selected DC
' It will not be used here
''''' SetDIBitsToDevice Me.hdc, 0, 0, bi24BitInfo.bmiHeader.biWidth, bi24BitInfo.bmiHeader.biHeight, 0, 0, 0, bi24BitInfo.bmiHeader.biHeight, bBytes(1), bi24BitInfo, DIB_RGB_COLORS
' Free the memory
DeleteDC iDC
DeleteObject iBitmap

End Function