felonious asked:
I need very fast file access... TFileStream too slow

Ok, my problem is that I can't figure out a very fast way to process file information. I've tried using TFileStream... not fast enough. Loading the file into a TMemoryStream... still not fast enough for what I'm doing. I'm working with files that range from a few megs to 50 megs, and the processing can involve a lot of sequential reads of 1 byte at a time.

I need to know:
1) what is the fastest way to get the data from a file into a buffer
2) what is the fastest number of bytes to read in a chunk at a time
3) what is the fastest type of buffer I can use (e.g. TMemoryStream, an array of Bytes, etc.)


I've converted a few of the processing procedures into ASM, so I know those are faster. Is it possible to do file access with ASM in Delphi? Is it worth the effort (how much faster would it be?)...

I know things can be sped up a lot... a comparable program does the same thing mine does in 3-4 seconds, while mine takes 30-34 seconds...

This program basically slides through a file computing CRC32s for chunks of the file about 384000 bytes long. If a CRC32 matches one in a list, the next CRC32 is calculated from the next 384000 bytes.

If the calculated CRC32 doesn't match one in the list, the program reads 1 byte from the file and uses CRC sliding (windowing? I don't know the exact name of this process... it basically takes the first byte of the chunk used to calculate the last CRC32 plus the next byte in the file, and computes the new CRC32 with a lookup table and XORs)...
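The sliding technique described here is usually called a rolling CRC. Since the thread's code is Delphi, here is an illustrative Python sketch of the same idea (all names — `build_tables`, `slide_out`, `mask` — are mine, not from the asker's code); it derives the slide-out table and the init/final mask and checks itself against zlib's CRC-32:

```python
import zlib

CRC_POLY = 0xEDB88320  # reflected CRC-32 polynomial, the one zlib uses

def build_tables(window_len):
    """Byte table, slide-out table, and init/final mask for a rolling
    CRC-32 over windows of window_len bytes."""
    # Standard table-driven CRC-32 byte table
    table = []
    for b in range(256):
        c = b
        for _ in range(8):
            c = (c >> 1) ^ (CRC_POLY if c & 1 else 0)
        table.append(c)

    def feed_zero(c):  # advance the raw CRC state by one zero byte
        return table[c & 0xFF] ^ (c >> 8)

    # slide_out[b]: residual contribution of byte b once it has been
    # pushed window_len + 1 positions back, i.e. out of the window
    slide_out = []
    for b in range(256):
        c = b
        for _ in range(window_len + 1):
            c = feed_zero(c)
        slide_out.append(c)

    # mask converts the raw (init 0) window state into a standard CRC-32
    # (init $FFFFFFFF, final NOT), absorbing both constants at once
    m = 0xFFFFFFFF
    for _ in range(window_len):
        m = feed_zero(m)
    mask = m ^ 0xFFFFFFFF
    return table, slide_out, mask

def rolling_crc32(data, window_len):
    """Yield the CRC-32 of every window_len-byte window of data, doing
    O(1) work per byte instead of recomputing each window from scratch."""
    table, slide_out, mask = build_tables(window_len)
    c = 0
    for i in range(window_len):                  # fill the first window
        c = table[(c ^ data[i]) & 0xFF] ^ (c >> 8)
    yield c ^ mask
    for i in range(window_len, len(data)):       # slide one byte at a time
        c = table[(c ^ data[i]) & 0xFF] ^ (c >> 8) ^ slide_out[data[i - window_len]]
        yield c ^ mask
```

Each slid value equals the full CRC-32 of that window, so only one extra table lookup per byte is needed; this appears to be the same identity that the asker's WindowTable / WindowMask implement.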

Any ideas would help
SOLUTION from robert_marquardt:
I can tell you one thing: writing a lot of asm and using exotic solutions will make your HDD access a little faster, but it is not worth the effort.
My suggestions are as follows:

- work in memory, not from disk
1) manually copy chunks (65535 bytes) from disk to memory (using either an untyped file or a TFileStream)
2) copy from disk to memory using streams (TMemoryStream.CopyFrom(FileStream, FileStream.Size))

You can compare the two and see that there is not much of a difference. The fastest memory access is probably to use a pointer and do some arithmetic on it, but a TMemoryStream should not be much slower than that.

The differences between the methods might be only a few milliseconds.
Remember that most Delphi functions and system API functions are pretty well optimized (this includes disk access) :)

Using memory-mapped files will not guarantee that the whole file is in memory at any given time, but the above methods don't guarantee this either.

Of course, disk access is not necessarily your problem. Are you sure your CRC algorithm is fast enough? You can test this by simply commenting out the CRC code and seeing how long the disk/memory access alone takes ;)
felonious (Asker) replied:
It's very hard to tell if it's my CRC32 algorithm that is slow, because the program acts differently if it finds a valid CRC32. I found that my ASM wasn't faster than just letting Delphi do the optimizations itself. Here is a chunk of the code I use to get a CRC... the CRCTable (array [0..255] of LongInt) is built when the TCRC32 object is created, along with the WindowMask and WindowTable (also array [0..255] of LongInt).

This code is still pretty rough... it works, but with warnings about combining signed and unsigned types... not a huge problem. GetCRC is fed the memory stream containing the file and a block length to process (in this case 384000)... it returns the CRC32, which is compared to a list of values. If it matches, another call to GetCRC is made. If there is no match, NextCRC is called with the same memory stream, the CRC32 that GetCRC returned, and the first byte of the buffer that was used to compute that CRC32. After reading one byte from the memory stream it calculates the new CRC32 and passes it back to the caller, which again compares it to the list of CRC32s it is looking for... no match, call NextCRC again; otherwise GetCRC... etc.

function TCRC32.GetCRC(ms : TMemoryStream; BlockLength : UInt64) : LongInt;
var
  b: Byte;
  CRC: Longint;
  e, i: LongInt;

begin
  CRC := $FFFFFFFF;

  for i := 0 to BlockLength-1 do
  begin
    E := ms.Read(B,1);
    case E of
      1 : CRC := RecountCRC(B, CRC);
      0 : CRC := RecountCRC(0, CRC);
    end;
  end;

  Result := Not CRC;
end;

function TCRC32.RecountCRC(b: byte; CrcOld: Longint): Longint;
begin
  Result := CRCTable[byte(CrcOld xor Longint(b))] xor ((CrcOld shr 8) and $00FFFFFF)
end;

function TCRC32.NextCRC(ms : TMemoryStream; LastCRC : LongInt;
  LastChar : Byte) : LongInt;

var
  b: Byte;

begin
  ms.Read(b,1);

  Result := WindowMask XOR CRCSlideChar(WindowMask XOR LastCRC, b, LastChar,
    WindowTable);
end;

function TCRC32.CRCSlideChar(CRC : LongInt; NewChar, OldChar : Byte;
  Table : TCRCTable): LongInt;
begin
  Result := ((CRC SHR 8) AND $00ffffff) XOR CRCTable[Byte(CRC) XOR NewChar] XOR Table[OldChar];
end;
SOLUTION (members only)

felonious (Asker) replied:
ciuly: ok, well I listed the code for the CRC about 15 minutes before your post.  Please take a look and tell me if I can speed it up.

So you think it might be the search that is slowing the process down? When the program runs, it builds a list of CRC32s to look for in each file I'm dealing with. So basically I need to look for the CRC32 value that Get/NextCRC returns (as a LongInt) in some sort of list of values that could be as small as 20-30 LongInts, up to a large number (20000-30000, maybe more)... what would be the best way to store these values in memory, and what would be the fastest way to search for the value?
robert_marquardt:  I am looking into the memory mapped file routines in JCL but haven't got time to do it right now... must go to work.
Russell Libby replied:

felonious,

I agree with the others that it is most likely your "search" comparison causing this slowdown. What you did not mention, though, is how you currently store / search these values. If you search by iterating the list (for x := 0 to Count-1), realize that with 30,000 values in the list, every failed check iterates the list completely. E.g.:

2,000 searches on 30,000 items where no match is made = 60 Million checks (uggh)

So a sorted list (to use for binary split searching) or a hash table should be used. Personally I would go with the sorted list / split search approach, as a hash table would require a lot of tuning to get an even bucket spread.

- Hash lookup will locate the bucket in one shot, but will then need to perform X comparisons depending on bucket depth.
- A split search on 30000 items would require max 13..14 checks for a complete miss, but also allows you to check min/max values and can easily be implemented on a TList.

Up to you, as examples for either could be provided.
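For comparison, both lookup strategies described above can be sketched in a few lines of Python (illustrative only — the Delphi equivalents are a sorted TList with a split search, or a hash table; the function names are mine):

```python
import bisect

def build_lookups(crc_values):
    """Build both structures: a hash set (average O(1) membership test)
    and a sorted list for a binary 'split' search (O(log n))."""
    return set(crc_values), sorted(crc_values)

def found_in_set(wanted, crc):
    return crc in wanted                      # hashed bucket lookup

def found_in_sorted(wanted, crc):
    i = bisect.bisect_left(wanted, crc)       # binary split search
    return i < len(wanted) and wanted[i] == crc
```

With 30,000 entries either one beats a linear scan by orders of magnitude; a language's built-in hashing also sidesteps the bucket-spread tuning a hand-rolled hash table would need.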

Regards,
Russell


felonious: I know you posted the crc code, I actually commented on it :). What I asked for was the code that deals with / works on / uses the crc, not the code that creates it. I know my english is not that good :) so basically what I would like to see is the code that searches for a given crc, and possibly any other piece of code that uses the crc that you didn't mention and maybe didn't consider time consuming ;)

Anyway, russell has better outlined the need for an improved and optimized search algorithm.

Example using a quick integer list I put together.

The example source shows the difference between iterating a list of 30,000 items 10K times and performing a binary search 10K times on the same list. On my system, the iteration took over 3 seconds (3110 ms) while the bin search took less than 15 ms. That should give you an idea of what we have been trying to explain.

Russell

-----

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, IntList;

type
  TForm1            =  class(TForm)
     Button1:       TButton;
     Button2:       TButton;
     procedure      Button1Click(Sender: TObject);
     procedure      FormCreate(Sender: TObject);
     procedure      Button2Click(Sender: TObject);
  private
     // Private declarations
  public
     // Public declarations
  end;

var
  List:             TIntList;
  Form1:            TForm1;

implementation
{$R *.DFM}

procedure TForm1.Button1Click(Sender: TObject);
var  dwMark:        LongWord;
     dwLoops:       Integer;
     dwIndex:       Integer;
     dwValue:       Integer;
begin

  dwMark:=GetTickCount;

  // Testing for a number
  for dwLoops:=1 to 10000 do
  begin
     dwValue:=Random(1000000);
     for dwIndex:=0 to Pred(List.Count) do
     begin
        if (List[dwIndex] = dwValue) then break;
     end;
  end;

  dwMark:=GetTickCount-dwMark;
  Caption:=Format('%d ms', [dwMark]);

end;

procedure TForm1.Button2Click(Sender: TObject);
var  dwMark:        LongWord;
     dwLoops:       Integer;
     dwIndex:       Integer;
     dwValue:       Integer;
begin

  dwMark:=GetTickCount;

  for dwLoops:=1 to 10000 do
  begin
     dwValue:=Random(1000000);
     if List.Find(dwValue, dwIndex) then
        Assert(dwValue = List[dwIndex], 'Failure in find routine');
  end;

  dwMark:=GetTickCount-dwMark;
  Caption:=Format('%d ms', [dwMark]);

end;

procedure TForm1.FormCreate(Sender: TObject);
var  dwIndex:       Integer;
begin

  // LOCK the list otherwise it will sort on each add!!!!
  List.BeginUpdate;

  try
     // Load the list with 30000 values
     for dwIndex:=1 to 30000 do List.Add(Random(1000000));
  finally
     List.EndUpdate;
  end;

end;

initialization

  // Seed random number generator
  Randomize;

  // Create the list
  List:=TIntList.Create;

finalization

  // Free the list
  FreeAndNil(List);

end.

---- integer list unit ----

unit IntList;
////////////////////////////////////////////////////////////////////////////////
//
//   Unit        :  IntList
//   Author      :  rllibby
//   Date        :  11.01.2005
//   Description :  Implementation for an integer list that supports sorting as
//                  well as binary searching.
//
////////////////////////////////////////////////////////////////////////////////
interface

////////////////////////////////////////////////////////////////////////////////
//   Include Units
////////////////////////////////////////////////////////////////////////////////
uses
  Windows, SysUtils, Classes;

////////////////////////////////////////////////////////////////////////////////
//   TIntList
////////////////////////////////////////////////////////////////////////////////
type
  TIntList          =  class(TObject)
  private
     // Private declarations
     FList:         TList;
     FNeedsSort:    Boolean;
     FLockCount:    Integer;
  protected
     // Protected declarations
     procedure      UpdateList;
     function       GetItem(Index: Integer): Integer;
     function       GetCount: Integer;
  public
     // Public declarations
     constructor    Create;
     destructor     Destroy; override;
     function       Add(Value: Integer): Integer;
     procedure      BeginUpdate;
     procedure      Clear;
     procedure      Delete(Index: Integer);
     procedure      EndUpdate;
     function       Find(Value: Integer): Boolean; overload;
     function       Find(Value: Integer; out Index: Integer): Boolean; overload;
     property       Count: Integer read GetCount;
     property       Items[Index: Integer]: Integer read GetItem; default;
  end;

implementation

function SortList(Item1, Item2: Pointer): Integer;
begin

  // Return the difference between the 2 items
  result:=Integer(Item1)-Integer(Item2);

end;

function TIntList.Find(Value: Integer): Boolean;
var  L, H, I, C:    Integer;
begin

  // Set default result
  result:=False;

  // Check to see if the list needs updating
  if FNeedsSort then
  begin
     // Sort the list
     FList.Sort(SortList);
     // Sort not needed now
     FNeedsSort:=False;
  end;

  // Set hi and lo marks
  L:=0;
  H:=Pred(FList.Count);

  // Check for at least one item in list
  if (H < L) then
     // No items
     result:=False
  else
  begin
     // Quick check for min and max values
     if (Value < Integer(FList[L])) or (Value > Integer(FList[H])) then
        // Not in valid range
        result:=False
     else
     begin
        // Hi lo check
        while (L <= H) do
        begin
           // Get mid point
           I:=(L + H) shr 1;
           // Get difference
           C:=Integer(FList[I])-Value;
           // Check difference
           if (C < 0) then
              // Update the lo mark
              L:=Succ(I)
           else
           begin
              // Update the hi mark
              H:=Pred(I);
              // Check for exact match
              if (C = 0) then
              begin
                 // Success
                 result:=True;
                 // Break
                 break;
              end;
           end;
        end;
     end;
  end;

end;

function TIntList.Find(Value: Integer; out Index: Integer): Boolean;
var  L, H, I, C:    Integer;
begin

  // Set default result
  result:=False;

  // Check to see if the list needs updating
  if FNeedsSort then
  begin
     // Sort the list
     FList.Sort(SortList);
     // Sort not needed now
     FNeedsSort:=False;
  end;

  // Set hi and lo marks
  L:=0;
  H:=Pred(FList.Count);

  // Check for at least one item in list
  if (H < L) then
     // No items
     result:=False
  else
  begin
     // Quick check for min and max values
     if (Value < Integer(FList[L])) or (Value > Integer(FList[H])) then
        // Not in valid range
        result:=False
     else
     begin
        // Hi lo check
        while (L <= H) do
        begin
           // Get mid point
           I:=(L + H) shr 1;
           // Get difference
           C:=Integer(FList[I])-Value;
           // Check difference
           if (C < 0) then
              // Update the lo mark
              L:=Succ(I)
           else
           begin
              // Update the hi mark
              H:=Pred(I);
              // Check for exact match
              if (C = 0) then
              begin
                 // Success
                 result:=True;
                 // Set index
                 L:=I;
              end;
           end;
        end;
        // Update index
        Index:=L;
     end;
  end;

end;

procedure TIntList.EndUpdate;
begin

  // Check count
  if (FLockCount > 0) then
  begin
     // Decrement the update count
     Dec(FLockCount);
     // Check for unlock
     if (FLockCount = 0) and FNeedsSort then
     begin
        // Sort the list
        FList.Sort(SortList);
        // Sort not needed
        FNeedsSort:=False;
     end;
  end
  else
     // Make sure its zero
     FLockCount:=0;

end;

procedure TIntList.BeginUpdate;
begin

  // Increment the update count
  Inc(FLockCount);

end;

function TIntList.GetItem(Index: Integer): Integer;
begin

  // Return the item at the given index
  result:=Integer(FList[Index]);

end;

procedure TIntList.UpdateList;
begin

  // Check locked state
  if (FLockCount > 0) then
     // Mark the list as needing a sort
     FNeedsSort:=True
  else
  begin
     // Sort the list
     FList.Sort(SortList);
     // No longer need a sort
     FNeedsSort:=False;
  end;

end;

function TIntList.Add(Value: Integer): Integer;
begin

  // Add the item to the list
  result:=FList.Add(Pointer(Value));

  // Update the list
  UpdateList;

end;

procedure TIntList.Clear;
begin

  // Clear the list
  FList.Clear;

  // Don't require an update
  FNeedsSort:=False;

end;

procedure TIntList.Delete(Index: Integer);
begin

  // Perform delete (this does not require a re-sort)
  FList.Delete(Index);

end;

function TIntList.GetCount: Integer;
begin

  // Return count from list
  result:=FList.Count;

end;

constructor TIntList.Create;
begin

  // Perform inherited
  inherited Create;

  // Set starting defaults
  FList:=TList.Create;
  FNeedsSort:=False;
  FLockCount:=0;

end;

destructor TIntList.Destroy;
begin

  // Resource protection
  try
     // Free the list
     FreeAndNil(FList);
  finally
     // Perform inherited
     inherited Destroy;
  end;

end;

end.

I still think (not tested, though) that old-style routines are much faster than any kind of TStream...

procedure TForm1.Button1Click(Sender: TObject);
Var
 F: File;
 Buffer: Array[0..8024] of char;
 Abc: Integer; //Actual Bytes Copied
begin
 System.Assign(F,'c:\testdemo.zip');
 System.Reset(F,1);
 While Not EOF(F) do
  Begin
   BlockRead(F,Buffer,SizeOf(Buffer),ABC);
   Memo1.Lines.Add(String(Buffer)); //etc..
   //Buffer contains your data.
  End;
 System.Close(F);
end;

You could increase the buffer; personally I would grab it in 8/16/32k chunks, chuck the contents into another char array or string, examine the data, then grab another chunk, and so on.

I just dumped the contents of the file into the memo to make sure it's working; please remove that line.
The JCL also contains various list implementations. Hash, queue etc.
IMHO Robert hit the nail on the head with his very first post. Use memory-mapped files together with more optimal CRC code. That should be as fast as it can get. No need for assembler here; it won't be measurably faster than Delphi code.
P.S: Here's the CRC routine my "madZip" unit is using:

function UpdateCrc32(crc32: cardinal; const inBuf; inLen: integer) : cardinal;
const crc32Tab : array [0..255] of cardinal =
                 ($00000000, $77073096, $ee0e612c, $990951ba, $076dc419, $706af48f, $e963a535, $9e6495a3,
                  $0edb8832, $79dcb8a4, $e0d5e91e, $97d2d988, $09b64c2b, $7eb17cbd, $e7b82d07, $90bf1d91,
                  $1db71064, $6ab020f2, $f3b97148, $84be41de, $1adad47d, $6ddde4eb, $f4d4b551, $83d385c7,
                  $136c9856, $646ba8c0, $fd62f97a, $8a65c9ec, $14015c4f, $63066cd9, $fa0f3d63, $8d080df5,
                  $3b6e20c8, $4c69105e, $d56041e4, $a2677172, $3c03e4d1, $4b04d447, $d20d85fd, $a50ab56b,
                  $35b5a8fa, $42b2986c, $dbbbc9d6, $acbcf940, $32d86ce3, $45df5c75, $dcd60dcf, $abd13d59,
                  $26d930ac, $51de003a, $c8d75180, $bfd06116, $21b4f4b5, $56b3c423, $cfba9599, $b8bda50f,
                  $2802b89e, $5f058808, $c60cd9b2, $b10be924, $2f6f7c87, $58684c11, $c1611dab, $b6662d3d,
                  $76dc4190, $01db7106, $98d220bc, $efd5102a, $71b18589, $06b6b51f, $9fbfe4a5, $e8b8d433,
                  $7807c9a2, $0f00f934, $9609a88e, $e10e9818, $7f6a0dbb, $086d3d2d, $91646c97, $e6635c01,
                  $6b6b51f4, $1c6c6162, $856530d8, $f262004e, $6c0695ed, $1b01a57b, $8208f4c1, $f50fc457,
                  $65b0d9c6, $12b7e950, $8bbeb8ea, $fcb9887c, $62dd1ddf, $15da2d49, $8cd37cf3, $fbd44c65,
                  $4db26158, $3ab551ce, $a3bc0074, $d4bb30e2, $4adfa541, $3dd895d7, $a4d1c46d, $d3d6f4fb,
                  $4369e96a, $346ed9fc, $ad678846, $da60b8d0, $44042d73, $33031de5, $aa0a4c5f, $dd0d7cc9,
                  $5005713c, $270241aa, $be0b1010, $c90c2086, $5768b525, $206f85b3, $b966d409, $ce61e49f,
                  $5edef90e, $29d9c998, $b0d09822, $c7d7a8b4, $59b33d17, $2eb40d81, $b7bd5c3b, $c0ba6cad,
                  $edb88320, $9abfb3b6, $03b6e20c, $74b1d29a, $ead54739, $9dd277af, $04db2615, $73dc1683,
                  $e3630b12, $94643b84, $0d6d6a3e, $7a6a5aa8, $e40ecf0b, $9309ff9d, $0a00ae27, $7d079eb1,
                  $f00f9344, $8708a3d2, $1e01f268, $6906c2fe, $f762575d, $806567cb, $196c3671, $6e6b06e7,
                  $fed41b76, $89d32be0, $10da7a5a, $67dd4acc, $f9b9df6f, $8ebeeff9, $17b7be43, $60b08ed5,
                  $d6d6a3e8, $a1d1937e, $38d8c2c4, $4fdff252, $d1bb67f1, $a6bc5767, $3fb506dd, $48b2364b,
                  $d80d2bda, $af0a1b4c, $36034af6, $41047a60, $df60efc3, $a867df55, $316e8eef, $4669be79,
                  $cb61b38c, $bc66831a, $256fd2a0, $5268e236, $cc0c7795, $bb0b4703, $220216b9, $5505262f,
                  $c5ba3bbe, $b2bd0b28, $2bb45a92, $5cb36a04, $c2d7ffa7, $b5d0cf31, $2cd99e8b, $5bdeae1d,
                  $9b64c2b0, $ec63f226, $756aa39c, $026d930a, $9c0906a9, $eb0e363f, $72076785, $05005713,
                  $95bf4a82, $e2b87a14, $7bb12bae, $0cb61b38, $92d28e9b, $e5d5be0d, $7cdcefb7, $0bdbdf21,
                  $86d3d2d4, $f1d4e242, $68ddb3f8, $1fda836e, $81be16cd, $f6b9265b, $6fb077e1, $18b74777,
                  $88085ae6, $ff0f6a70, $66063bca, $11010b5c, $8f659eff, $f862ae69, $616bffd3, $166ccf45,
                  $a00ae278, $d70dd2ee, $4e048354, $3903b3c2, $a7672661, $d06016f7, $4969474d, $3e6e77db,
                  $aed16a4a, $d9d65adc, $40df0b66, $37d83bf0, $a9bcae53, $debb9ec5, $47b2cf7f, $30b5ffe9,
                  $bdbdf21c, $cabac28a, $53b39330, $24b4a3a6, $bad03605, $cdd70693, $54de5729, $23d967bf,
                  $b3667a2e, $c4614ab8, $5d681b02, $2a6f2b94, $b40bbe37, $c30c8ea1, $5a05df1b, $2d02ef8d);
var i1 : integer;
    ab : array [0..maxInt - 1] of byte absolute inBuf;
begin
  result := crc32;
  for i1 := 0 to inLen - 1 do
    result := crc32Tab[byte(result xor ab[i1])] xor (result shr 8);
end;
all: ok, well this is probably going to sound really stupid (or maybe just ignorant), but I just wanted to code something really fast to test my CRC sliding (windowing, whatever) to make sure it was working, so... here is how I was searching for the CRCs... again, very simple and fast to implement, and I didn't think at the time that it could be the slowdown, but... what the heck do I know :)

The CRC32s are stored in an "in memory" DB (with other pertinent information about the files, recovery info, etc).
I was then transferring each CRC32 into a string,
i.e.: CRCString := CRCString+':'+IntToStr(CRC32);

The above is only done once for each SET of files, because the CRC32s I need to look for are the same for the entire set of files.

After each CRC32 was returned by GetCRC/NextCRC I would simply do
if Pos(':'+IntToStr(tempCRC32)+':', CRCString) <> 0 then
blah blah etc etc

Probably my slowdown? :) don't grill me too badly <grin>

Anyway, I will try out the things you have all suggested and see how it goes. I still haven't had time to do anything with it, as I had to write a program for my Intro to Java class (don't laugh! I have to take it to get into more advanced classes!)
tobjectpascal: I was wondering about that also... I haven't seen any numbers on which way of accessing files is faster, though, so I tend to use TFileStream now. Can anyone tell me which is faster?
string manipulation is by all means slow :)

go with one of the methods already mentioned.
Madshi: is having the CRC table values as constants faster than generating the table on the fly? (aside from the time needed to generate the table, obviously)

I mean will having them as a const be beneficial as opposed to a variable Array[0..255] of Cardinal/LongInt.

Also, should I be handling these values as Cardinal instead of LongInt?

what kind of buffer is inBuf?

About the asm: I found that out the hard way... my asm wasn't any faster than what Delphi did itself (maybe slower).
SOLUTION (members only)

Russell Libby replied:

IMHO, the real "nail on the head" was seeing the code for the storage and lookup of the CRC values. Using the above info, I updated the sample code listed above to:

1.) Create a lookup string with 30,000 crc values

     dwMark:=GetTickCount;
     szList:=EmptyStr;
     for dwIndex:=1 to 30000 do szList:=szList+':'+IntToStr(Random(1000000));
     dwMark:=GetTickCount-dwMark;
     Caption:=Format('%d ms', [dwMark]);

2.) Perform 10,000 Pos(...) searches against this list; same as the iteration / split search.

var  dwMark:        LongWord;
     dwLoops:       Integer;
     dwIndex:       Integer;
     dwValue:       Integer;
begin

  dwMark:=GetTickCount;

  for dwLoops:=1 to 10000 do
  begin
     dwValue:=Random(1000000);
     dwIndex:=Pos(':'+IntToStr(dwValue)+':', szList);
  end;

  dwMark:=GetTickCount-dwMark;
  Caption:=Format('%d ms', [dwMark]);

end;

The results were as follows:

Loading integer list w/30,000 items = < 15 ms
Loading string w/30,000 items       = 10141 ms (10 seconds)

Searching for a value, performed 10,000 times:

Integer list (binary) split search = ~15 ms
Integer list iteration             = 3110 ms (3 seconds)
String Pos(...)                    = 17953 ms (18 seconds)

These numbers are obviously relative, as I have no idea how many iterations are done per processing of the files. But they do show that a significant amount of time is dedicated just to seeding the string, never mind searching it.

The other comments regarding optimizing are excellent areas for improvement as well. Static tables, mem-mapped files, etc...
I also wanted to mention one thing about streams, TMemoryStream in particular. If you have the physical memory to load the file, then it does provide you with some pretty powerful access:

1.) It performs a single memory allocation, and data is read using one ReadBuffer call (translates to ReadFile at the low level)
2.) Once loaded, you're dealing with the data in memory, which is much faster than disk IO.
3.) You have direct access to the memory through the .Memory property.

It's still slightly slower than using mem-mapped files, but it does compare well against the old-style routines.

Regards,
Russell

Integer list updated to work with cardinals instead, in case you decide to go this route.


unit IntList;
////////////////////////////////////////////////////////////////////////////////
//
//   Unit        :  IntList
//   Author      :  rllibby
//   Date        :  11.01.2005, mods 11.02.2005
//   Description :  Implementation for a cardinal list that supports sorting as
//                  well as binary searching.
//
////////////////////////////////////////////////////////////////////////////////
interface

////////////////////////////////////////////////////////////////////////////////
//   Include Units
////////////////////////////////////////////////////////////////////////////////
uses
  Windows, SysUtils, Classes;

////////////////////////////////////////////////////////////////////////////////
//   TIntList
////////////////////////////////////////////////////////////////////////////////
type
  TIntList          =  class(TObject)
  private
     // Private declarations
     FList:         TList;
     FNeedsSort:    Boolean;
     FLockCount:    Integer;
  protected
     // Protected declarations
     procedure      UpdateList;
     function       GetItem(Index: Integer): Cardinal;
     function       GetCount: Integer;
     function       InternalFind(Value: Cardinal): Integer;
  public
     // Public declarations
     constructor    Create;
     destructor     Destroy; override;
     function       Add(Value: Cardinal): Integer;
     procedure      BeginUpdate;
     procedure      Clear;
     procedure      Delete(Index: Integer);
     procedure      EndUpdate;
     function       Find(Value: Cardinal): Boolean; overload;
     function       Find(Value: Cardinal; out Index: Integer): Boolean; overload;
     property       Count: Integer read GetCount;
     property       Items[Index: Integer]: Cardinal read GetItem; default;
  end;

implementation

function SortList(Item1, Item2: Pointer): Integer;
begin

  // Return the difference between the 2 items
  result:=Cardinal(Item1)-Cardinal(Item2);

end;

function TIntList.InternalFind(Value: Cardinal): Integer;
var  L, H, I, C:    Integer;
     bFound:        Boolean;
begin

  // Check to see if the list needs updating
  if FNeedsSort then
  begin
     // Sort the list
     FList.Sort(SortList);
     // Sort not needed now
     FNeedsSort:=False;
  end;

  // Set hi and lo marks
  L:=0;
  H:=Pred(FList.Count);

  // Check for at least one item in list
  if (H < L) then
     // No items
     result:=(-1)
  else
  begin
     // Quick check for min and max values
     if (Value < Cardinal(FList[L])) or (Value > Cardinal(FList[H])) then
        // Not in valid range
        result:=(-1)
     else
     begin
        // Set default
        bFound:=False;
        // Hi lo check
        while (L <= H) do
        begin
           // Get mid point
           I:=(L + H) shr 1;
           // Get difference
           C:=Cardinal(FList[I])-Value;
           // Check difference
           if (C < 0) then
              // Update the lo mark
              L:=Succ(I)
           else
           begin
              // Update the hi mark
              H:=Pred(I);
              // Check for exact match
              if (C = 0) then
              begin
                 // Found
                 bFound:=True;
                 // Update lo mark
                 L:=I;
              end;
           end;
        end;
        // Update result
        if bFound then
           result:=L
        else
           result:=(-1);
     end;
  end;

end;

function TIntList.Find(Value: Cardinal): Boolean;
begin

  // Check result of internal find call
  result:=(InternalFind(Value) >= 0);

end;

function TIntList.Find(Value: Cardinal; out Index: Integer): Boolean;
begin

  // Set result from internal find
  Index:=InternalFind(Value);

  // Item found?
  result:=(Index >= 0);

end;

procedure TIntList.EndUpdate;
begin

  // Check count
  if (FLockCount > 0) then
  begin
     // Decrement the update count
     Dec(FLockCount);
     // Check for unlock
     if (FLockCount = 0) and FNeedsSort then
     begin
        // Sort the list
        FList.Sort(SortList);
        // Sort not needed
        FNeedsSort:=False;
     end;
  end
  else
     // Make sure its zero
     FLockCount:=0;

end;

procedure TIntList.BeginUpdate;
begin

  // Increment the update count
  Inc(FLockCount);

end;

function TIntList.GetItem(Index: Integer): Cardinal;
begin

  // Return the item at the given index
  result:=Cardinal(FList[Index]);

end;

procedure TIntList.UpdateList;
begin

  // Check locked state
  if (FLockCount > 0) then
     // Mark the list as needing a sort
     FNeedsSort:=True
  else
  begin
     // Sort the list
     FList.Sort(SortList);
     // No longer need a sort
     FNeedsSort:=False;
  end;

end;

function TIntList.Add(Value: Cardinal): Integer;
begin

  // Add the item to the list
  result:=FList.Add(Pointer(Value));

  // Update the list
  UpdateList;

end;

procedure TIntList.Clear;
begin

  // Clear the list
  FList.Clear;

  // Don't require an update
  FNeedsSort:=False;

end;

procedure TIntList.Delete(Index: Integer);
begin

  // Perform delete (this does not require a re-sort)
  FList.Delete(Index);

end;

function TIntList.GetCount: Integer;
begin

  // Return count from list
  result:=FList.Count;

end;

constructor TIntList.Create;
begin

  // Perform inherited
  inherited Create;

  // Set starting defaults
  FList:=TList.Create;
  FNeedsSort:=False;
  FLockCount:=0;

end;

destructor TIntList.Destroy;
begin

  // Resource protection
  try
     // Free the list
     FreeAndNil(FList);
  finally
     // Perform inherited
     inherited Destroy;
  end;

end;

end.
rllibby: This is very nice, but it doesn't work 100% with cardinals. The problem is in the sort.

function SortList(Item1, Item2: Pointer): Integer;
begin

  // Return the difference between the 2 items
  result:=Cardinal(Item1)-Cardinal(Item2);

end;

Item1-Item2 will at some point exceed an Integer's range (even substituting a LongInt will not always work)... thus the numbers in the list will not be in order and the Find functions will not work properly. I tried to rewrite some of the code, but haven't got it to work yet.
change
function SortList(Item1, Item2: Pointer): Integer;
to
function SortList(Item1, Item2: Pointer): int64;

that should fix it :P
yeah, I also found an issue in the InternalFind function that I can't figure out

In InternalFind, variable C cannot be an Integer, but changing it to an Int64 gives strange results.
first, change the sortlist function to:

function SortList(Item1, Item2: Pointer): Integer;
var c1,c2:cardinal;
begin
  c1:=Cardinal(Item1);
  c2:=Cardinal(Item2);
  if c1<c2 then result:=-1
               else
    if c1=c2 then result:=0
                 else result:=1;
end;

this will eliminate any problems due to using cardinals and/or 64 bit integers
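To make the failure concrete, here is a small sketch (hypothetical values) showing why casting the subtraction of two Cardinals to Integer can misorder the list:

program OverflowDemo;
var
  c1, c2: Cardinal;
begin
  c1 := $FFFFFFF0;   // a large CRC32 value
  c2 := $00000010;   // a small one
  // The subtraction wraps: $FFFFFFE0 reinterpreted as a signed
  // 32-bit Integer is -32, so the old SortList reports c1 < c2
  // even though c1 > c2, and the binary search then breaks.
  WriteLn(Integer(c1 - c2));
end.

The three-way comparison above avoids the arithmetic entirely, which is why it is safe for the full Cardinal range.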

about InternalFind,
if you still have issues after changing back to Integer, here is how I would do it with respect to that C variable:

function TIntList.InternalFind(Value: Cardinal): Integer;
var  L, H, I:    Integer;
     bFound:        Boolean;
begin

  // Check to see if the list needs updating
  if FNeedsSort then
  begin
     // Sort the list
     FList.Sort(SortList);
     // Sort not needed now
     FNeedsSort:=False;
  end;

  // Set hi and lo marks
  L:=0;
  H:=Pred(FList.Count);

  // Check for at least one item in list
  if H < L then
     // No items
     result:=-1
  else
  begin
     // Quick check for min and max values
     if (Value < Cardinal(FList[L])) or (Value > Cardinal(FList[H])) then
        // Not in valid range
        result:=-1
     else
     begin
        // Set default
        bFound:=False;
        // Hi lo check
        while L <= H do
        begin
           // Get mid point
           I:=(L + H) shr 1;
           // Check difference
           if Cardinal(FList[I])<Value then
              // Update the lo mark
              L:=Succ(I)
           else
           begin
              // Update the hi mark
              H:=Pred(I);
              // Check for exact match
              if Cardinal(FList[I])=Value then
              begin
                 // Found
                 bFound:=True;
                 // Update lo mark
                 L:=I;
              end;
           end;
        end;
        // Update result
        if bFound then
           result:=L
        else
           result:=-1;
     end;
  end;

end;

Also, in that code there are a lot of unneeded parentheses. I deleted some in the above function :)
I didn't test the code, I just eliminated the use of the C variable, so this should work.

if you still have issues, then post some test code :)
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
rllibby: No need to apologize, this has all been (or will be when I get it all implemented) very helpful.  I will try this updated code out when I get home tonight (I work a 2nd shift)...  Thanks again.

Oh by the way, is there any way to change the subject of this post?  I'd hate to accept an answer that doesn't really match up with what people are expecting when they open this question.

I'm considering splitting the points with the majority going to rllibby (say 300) for the useful code and the rest going to robert_marquardt (for the memory mapped files idea, which I haven't gotten around to trying yet but probably will use) & ciuly for hitting on the idea that the sort was what was slowing things down.  Any objections?

Also, If I award the points, does this discussion shut down or can we continue talking?  There seems to have been a lot of changes since I last used E-E so I want to ask these questions beforehand.
Let me know if you run into further problems with the binary list; hopefully you shouldn't <g>

Regarding

1.) Changing the post title: Meikl may be able to do this, but most likely you will need to ask a CS mod to change it for you.
2.) Splitting points: I am fine with the split, but might suggest including Madshi for his contribution, which was towards the end result of making your routines faster.
3.) Closing the q: comments and discussions can still continue on a Q even after the points have been assigned. It remains as an open thread in that aspect.

Russell


The binary search is awesome, it really sped things up!

Really you all have helped enormously, I will split the points between all of you.  But!...

 I may have jumped the gun here.  

I am now looking at this TJclFileMapping stuff and am totally confused about how it is supposed to be used.  Can one of you help me with this?

Glad things have sped up.. btw, what was the speed difference compared to before? And are you still using a TFileStream, or did you switch back to TMemoryStream? (TMemoryStream would allow direct read access to the memory).

Regarding the Jedi code, this would be Robert's area of expertise and he would be best to comment on this (I will help out if this does not get answered though).

Regards,
Russell
rllibby:  before, with the TMemoryStream, it was running in around 35 seconds...  with the new search it is completing in around 6 seconds.

This implementation loads the whole file into a memory stream, and that probably won't be acceptable for very large files, so I am pursuing some of the other options laid out in this post.  Using TFileStream instead of TMemoryStream brings the run time back up to 35-40 seconds, which is not good enough, so... fun stuff!
Can't comment on other implementations of mem mapped files, but here is a sample snippet of code that shows what you *could* do:

type
  PLargeByteArray   =  ^TLargeByteArray;
  TLargeByteArray   =  Array [0..Pred(MaxInt)] of Byte;

var  lplbaData:     PLargeByteArray;
     hFile:         THandle;
     hMapping:      THandle;
     dwSize:        DWORD;
     dwIndex:       DWORD;
     bData:         Byte;

const
  FileName          =  'c:\windows\notepad.exe';

begin

  // Open the file for reading
  hFile:=FileOpen(FileName, fmOpenRead or fmShareDenyNone);

  // Check handle
  if (hFile <> THandle(-1)) then
  begin
     // Resource protection
     try
        // Get the file size
        dwSize:=GetFileSize(hFile, nil);
        // Create the file mapping
        hMapping:=CreateFileMapping(hFile, nil, PAGE_READONLY or SEC_COMMIT, 0, 0, nil);
        // Check mapping
        if (hMapping > 0) then
        begin
           // Resource protection
           try
              // Map view of file
              lplbaData:=MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
              // Check data
              if Assigned(lplbaData) then
              begin
                 // Resource protection
                 try
                    //
                    // Read the data as an array of bytes, eg...
                    //

                    for dwIndex:=0 to Pred(dwSize) do
                       // Read a byte
                       bData:=lplbaData^[dwIndex];

                    //

                 finally
                    // Unmap view of file
                    UnmapViewOfFile(lplbaData);
                 end;
              end;
           finally
              // Close the handle
              CloseHandle(hMapping);
           end;
        end;
     finally
        // Close the handle
        CloseHandle(hFile);
     end;
  end;

end;
Very interesting, I'm not sure I understand what memory mapping is all about, but I will try it out.
Instead of this:

                    // Read the data as an array of bytes, eg...
                    //

                    for dwIndex:=0 to Pred(dwSize) do
                       // Read a byte
                       bData:=lplbaData^[dwIndex];

                    //

you could then directly call "not UpdateCrc32(dword(-1), lplbaData^, dwSize)".
rllibby: okay, I went ahead and put this in, and it runs about as fast as when I load the whole file into a TMemoryStream before processing it.  My question is: since the whole file is getting mapped, does this mean it will take up as much memory as a TMemoryStream does?

Madshi:  Thanks for the idea!  I've rewritten my CRC32 code to use that very same notion and it seems to work well.  I'm also wondering about my CRC code... as you can see in the code I posted above (towards the beginning of the thread), GetCRC makes a call to RecountCRC for each byte... would it be faster to just move the code for RecountCRC into GetCRC?  I don't know how much time (if any) a call to another function adds to the processing time.
Well, you can do some timing tests once you have it all working. Function calls do consume time. However, the biggest time loss in your situation will most probably be reading the file from the hard disk, so it's not too important how fast your CRC routine really is. But if you do care about a fast CRC implementation, then you should probably avoid function calls.
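For reference, a sketch of what an inlined, table-driven CRC32 inner loop might look like. (Crc32Table here is an assumed, precomputed 256-entry lookup table for the reflected polynomial $EDB88320; adapt the names to your own unit.)

function Crc32Block(Crc: Cardinal; Data: PByte; Len: Integer): Cardinal;
var
  dwIndex: Integer;
begin
  // Caller seeds with $FFFFFFFF and finalizes with "not result"
  result := Crc;
  for dwIndex := 0 to Pred(Len) do
  begin
    // One table lookup and one xor per byte, no per-byte function call
    result := (result shr 8) xor Crc32Table[Byte(result) xor Data^];
    Inc(Data);
  end;
end;

This is the same table/XOR scheme described at the top of the thread, just with the per-byte work done inline instead of through a helper call.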

That's the beauty of the mem mapped file... You get the speed of the memory stream, direct byte access, and don't have to worry about the memory aspect as the system handles paging the file in/out.

With a TMemoryStream, you need the actual memory to load the file in. If you don't have it, and you hit the paging file... then things come to a screeching halt. Eg, if I tried to process some of my > 1 GB ms-dvr (tv recordings) using TMemoryStream, my application would pretty much hang in the process. Using a mem mapped file, I can use the above code and process the file in about 7 seconds, with no discernible hit to memory. Running the function again executes in about .7 seconds, which brings up another point.

When doing these tests, make multiple runs to come up with an average. The first time through a given file will generally be slower than subsequent (consecutive) iterations, due to the system having some of the file information in cache (which saves you from a disk IO call). I personally use some code from sysinternals (www.sysinternals.com) to flush the system cache between runs when performing timing tests based on disk IO functionality. This helps me get a better feel for worst case timing results.
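One way to time those runs is the Win32 high-resolution counter; a sketch (the processing step is a placeholder):

var
  liFreq, liStart, liStop: Int64;
  liMsec: Int64;
  dwRun: Integer;
begin
  // Ticks-per-second of the high resolution counter
  QueryPerformanceFrequency(liFreq);

  // Average several consecutive runs; flush the file cache between
  // them if you want worst-case (cold cache) numbers
  for dwRun := 1 to 5 do
  begin
    QueryPerformanceCounter(liStart);
    // ... map the file and run the CRC scan here ...
    QueryPerformanceCounter(liStop);
    // Elapsed time for this run in milliseconds
    liMsec := ((liStop - liStart) * 1000) div liFreq;
  end;
end;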

As Madshi stated, in this scenario disk IO will probably be your most limiting factor. Your second biggest factor is going to be searching for CRC values in the list. Now, if you're interested in trying out a hash table for this problem, then I could take a look at putting one together that is specialized for this situation.

Russell


Using the hash implementation below, I found that for 20,000 items stored, the IntHash was about 2X faster than the IntList. Of course there is a tradeoff here: the IntList is much lighter on memory (the IntHash for 20,000 items was appx 1/2 MB of memory). The reason it's faster is that the numbers are spread among buckets by hashing the value with a prime number (the prime being the key), the hash index is calculated using a simple MOD, and when a bucket is checked, it *theoretically* will have fewer items in it to check than the IntList does.

eg:

IntList with 30,000 items takes appx 13..14 checks per find
IntHash with 30,000 items and a max depth of 6 would take at *most* 6 checks per find.

Of course this all falls apart if the numbers don't hash evenly (all the values get dumped in only a few buckets), when this happens the hash will perform much worse than the binary sort. There are a number of ways to generate the "index" as well (other than a simple mod), like rotating bits, etc.. but the more time spent hashing the value the longer it takes to perform a find.

Anyways, I'm not saying that this will make things better/worse performance-wise, but it does give you another avenue to explore.

Russell

----------

unit IntHash;
////////////////////////////////////////////////////////////////////////////////
//
//   Unit        :  IntHash
//   Author      :  rllibby
//   Date        :  11.05.2005
//   Description :  Implementation for a cardinal hash list
//
////////////////////////////////////////////////////////////////////////////////
interface

////////////////////////////////////////////////////////////////////////////////
//   Include units
////////////////////////////////////////////////////////////////////////////////
uses
  Windows, SysUtils, Classes;

////////////////////////////////////////////////////////////////////////////////
//   Hashing constants
////////////////////////////////////////////////////////////////////////////////
const
  HASH_SIZE         =  23039;

////////////////////////////////////////////////////////////////////////////////
//   Hash class definition
////////////////////////////////////////////////////////////////////////////////
type
  TIntHash          =  class(TObject)
  private
     // Private declarations
     FHash:         Array [0..Pred(HASH_SIZE)] of TList;
  protected
     // Protected declarations
     function       GetMaxDepth: Integer;
     function       GetMemorySize: Integer;
     function       HashIndex(Value: Cardinal): Integer;
  public
     // Public declarations
     constructor    Create;
     destructor     Destroy; override;
     procedure      Clear;
     procedure      Add(Value: Cardinal);
     function       Find(Value: Cardinal): Boolean;
     property       MaxDepth: Integer read GetMaxDepth;
     property       MemorySize: Integer read GetMemorySize;
  end;

implementation

function TIntHash.GetMemorySize: Integer;
var  dwIndex:       Integer;
begin

  // Calculate size of the int hash itself
  result:=InstanceSize;

  // Now sum up the instance size of all hash buckets
  for dwIndex:=0 to Pred(HASH_SIZE) do
  begin
     // Check bucket assignment
     if Assigned(FHash[dwIndex]) then
     begin
        // Get instance size PLUS the capacity * SizeOf(Pointer)
        Inc(result, FHash[dwIndex].InstanceSize + (FHash[dwIndex].Capacity * SizeOf(Pointer)));
     end;
  end;

end;

function TIntHash.GetMaxDepth: Integer;
var  dwIndex:       Integer;
begin

  // Default result
  result:=0;

  // Walk the table and get max bucket depth
  for dwIndex:=0 to Pred(HASH_SIZE) do
  begin
     // Check bucket assignment
     if Assigned(FHash[dwIndex]) then
     begin
        // Check bucket depth
        if (FHash[dwIndex].Count > result) then result:=FHash[dwIndex].Count;
     end;
  end;

end;

function TIntHash.HashIndex(Value: Cardinal): Integer;
begin

  // Simple mod function for hash index
  result:=Value mod HASH_SIZE;

end;

procedure TIntHash.Add(Value: Cardinal);
var  dwIndex:       Integer;
begin

  // Get bucket index
  dwIndex:=HashIndex(Value);

  // Check bucket
  if (FHash[dwIndex] = nil) then FHash[dwIndex]:=TList.Create;

  // Add to bucket
  FHash[dwIndex].Add(Pointer(Value));

end;

function TIntHash.Find(Value: Cardinal): Boolean;
var  listPtr:       PPointerList;
     dwCount:       Integer;
     dwIndex:       Integer;
begin

  // Set default result
  result:=False;

  // Get bucket index
  dwIndex:=HashIndex(Value);

  // Check bucket
  if Assigned(FHash[dwIndex]) then
  begin
     // Get list count
     dwCount:=FHash[dwIndex].Count;
     // Get list pointer
     listPtr:=FHash[dwIndex].List;
     // Check the list
     for dwIndex:=0 to Pred(dwCount) do
     begin
        // Compare
        if (Cardinal(listPtr^[dwIndex]) = Value) then
        begin
           // Found
           result:=True;
           // Done processing
           break;
        end;
     end;
  end;

end;

procedure TIntHash.Clear;
var  dwIndex:       Integer;
begin

  // Walk the table and free all buckets
  for dwIndex:=0 to Pred(HASH_SIZE) do
  begin
     // Check bucket assignment
     if Assigned(FHash[dwIndex]) then FreeAndNil(FHash[dwIndex]);
  end;

end;

constructor TIntHash.Create;
begin

  // Perform inherited
  inherited Create;

  // Init hash table
  ZeroMemory(@FHash, SizeOf(FHash));

end;

destructor TIntHash.Destroy;
begin

  // Resource protection
  try
     // Clear hash table
     Clear;
  finally
     // Perform inherited
     inherited Destroy;
  end;

end;

end.





rllibby: thanks for the additional search, I'm happy with the sorted binary search we worked out before so I think I'm going to stick with that.  Most of the sets of files I will be dealing with will only have around 2000-3000 CRC32's (or much fewer) so this should be sufficient.  I will of course save the above code just in case.

I'm very happy with the overall results, but am still trying to figure out what my biggest time consumer is... it's kind of elusive.  I figured it was disk I/O, but I can't seem to get a good measurement on that.  I benched the GetCRC and NextCRC routines and they are fast enough.  GetCRC was around ~3 ms and NextCRC was much, much faster.  I had NextCRC process the same block of data a million times and it came back with around 215 ms.  This is probably off, though, since I was just processing the same block of data over and over.  I may have to set up further tests for the CRC routines.

I processed one set of files (14 files total) which took (overall) around 1 second.  When I measured how much time was spent on each file (mapping the files, getting CRCs, etc.) it was only around 15 ms per file, which accounts for only about a fifth of the total time spent.

A big time consumer I found is when I check for the existence of a file and then get its file size.  Do you know a fast way to do this?  Right now I am using a TFileStream, and I feel there must be a better/faster way, although I haven't found it yet.
Do you need to check both existence and size at the same time? Or do you check existence, open file, get file size, process, etc?...
If you need both at the same time, then the following should work well, as it performs the two checks in one call:

function CheckFileExistence(FileName: String; var Size: Cardinal): Boolean;
var  wfdInfo:       TWin32FindData;
     hFind:         THandle;
begin

  // Attempt to find file
  hFind:=FindFirstFile(PChar(FileName), wfdInfo);

  // Check results
  if (hFind <> INVALID_HANDLE_VALUE) then
  begin
     // Set the size.
      // Note: If you are dealing with files larger than 4 GB, then you should
      // use Int64 and code accordingly, eg:
      //
      //    Int64 Size:=(Int64(wfdInfo.nFileSizeHigh) shl 32)+wfdInfo.nFileSizeLow
      //
      Size:=wfdInfo.nFileSizeLow;
     // Close find
     Windows.FindClose(hFind);
     // Set result
     result:=True;
  end
  else
     // File does not exist
     result:=False;

end;

// Usage
var  dwSize:        DWORD;
begin

  if (CheckFileExistence('c:\pagefile.sys', dwSize) and (dwSize > 0)) then
  begin
     // ...
  end;

end;

---------

If you only need to check existence first, and then plan on opening the file regardless of size, then you could use the following to check the file.

function CheckFileExistence(FileName: String): Boolean;
var  dwAttr:        DWORD;
begin

  // Get file attributes
  dwAttr:=GetFileAttributes(PChar(FileName));

  // Check
  result:=(dwAttr <> $FFFFFFFF) and ((dwAttr and FILE_ATTRIBUTE_DIRECTORY) = 0);

end;

Then using the code I already provided for opening the file (see below), you should notice that it gets the file size. So there really is no reason to be using a TFileStream at all (in either situation).

  // Open the file for reading
  hFile:=FileOpen(FileName, fmOpenRead or fmShareDenyNone);

  // Check handle
  if (hFile <> THandle(-1)) then
  begin
     // Resource protection
     try
        // Get the file size
        dwSize:=GetFileSize(hFile, nil); // <-----------------

---------

rllibby
Do you think your above CheckFileExistence is faster than FileExists?  I'll have to check it out.

I've managed to shave more time off the NextCRC function, which with heavily damaged files can save a lot of time.  My test bed previously ran in around 4 seconds (a little over, really); I now have it down to ~2.3 seconds.

I've also optimized my MD5 routines to use some of the stuff we have talked about and took around 300-500 milliseconds off my tests.  Do any of you know of a fast way of copying from one buffer to another?  Right now I'm using something like:

      for I := 0 to len-1 do
        buffer[I] := buf^[StartOffset+I];

My MD5 hash algorithm doesn't like the buffer, so for now I'm forced to copy from one buffer to another until I can fix it to work with the buffer type below (for the memory mapped file):

  PLargeByteArray   =  ^TLargeByteArray;
  TLargeByteArray   =  Array [0..Pred(MaxInt)] of Byte;

The MD5 functions (they use an Array of Byte) seem like they should work, but the program crashes for some reason when I try to use PLargeByteArray with them.

Thanks again for all your help!

>> Do you think your above CheckFileExistence is faster than FileExists?  I'll have to check it out.
1.) You will have to test and see, as you didn't display any code for usage...

>> My MD5 hash algorithm doesn't like the buffer so for now I'm forced to copy from one buffer to another until I can fix it to work with the below buffer type (for the memory mapped file)
2.) You should probably fix this, otherwise you will end up performing a lot of extra work (you already have the array of bytes).

>> for I := 0 to len-1 do
        buffer[I] := buf^[StartOffset+I];
3.) Move(buf^[StartOffset], buffer[0], len);

>> The MD5 functions (they use an Array of Byte) seem like they should work but the program crashes for some reason when I try to use PLargeByteArray with it.
4.) We would need to see the code as (Param: Array of Byte), is not the same as (Param: Pointer) or (Param:  PLargeByteArray)
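On point 4: when a static array (or a dereferenced PLargeByteArray) is passed to an open array parameter, High() inside the routine reflects the actual array's full declared bounds, so an MD5 routine taking "array of Byte" would see a MaxInt-sized buffer and read past the mapped view, which would likely explain the crash. One possible no-copy workaround is Slice; a sketch against an assumed signature (MD5Update, TMD5Context, StartOffset, and Len are all hypothetical names here):

// Hypothetical signature for the MD5 routine:
//   procedure MD5Update(var Ctx: TMD5Context; const Buffer: array of Byte);
//
// Slice passes only Len elements starting at StartOffset, so the
// open array's High() matches the window instead of MaxInt:
MD5Update(Ctx, Slice(PLargeByteArray(@lplbaData^[StartOffset])^, Len));

Note that Slice is only legal as an argument to an open array parameter, so this only helps if the MD5 routine keeps that declaration.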

At this point, you should probably think about opening a new Q if you still have further questions... Your original problem was with performance in regards to the CRC and the lookup, and I believe this has been answered completely. Your questions now are related to further optimizing your other routines.

Russell



Oh ok, I guess that makes sense.  Thanks for all the help.