Delphi_developer asked:
How to achieve the smallest data type for True/False (1/0)?

Hi

My application processes very large amounts of data, not from a database. I have lots of records and arrays of records, and I need to be careful when assigning data types to variables, otherwise I can easily get an 'Out of memory' error.

So I swapped integer for byte or smallint where possible, and string for char.

But I'm curious how to represent Yes/No, 1/0, True/False: a variable that indicates just two states, like a Boolean, but occupies fewer bytes.

From Delphi help
"A Boolean type is stored as a Byte, a ByteBool is stored as a Byte, a WordBool type is stored as a Word, and a LongBool is stored as a Longint.
"

So a Boolean is stored like a Byte (0..255).

Can I create my own type with just the values 0 and 1, or similar, so that it only takes 1 bit? :)

Thank you
SOLUTION by MerijnB (member-only content not shown)
Can Windows even read and write bit-level data at the Pascal level?

Delphi_developer (ASKER):

I have an analytical application which analyzes lots of data and cross-references it, so all the data needs to be in memory. And in some situations, with large data sets, I get into trouble with memory...

I tried all sorts of other mechanisms, like loading data into memory 'on demand', but this slows processing down by 100x...

I also made 'sets of values' for certain data: if a certain property name is used a lot, like 'Default File Name', I just use a Byte type and assign a value like 1 or 2 from the 'set of values'...

I optimized it very well...

I managed to get it down to 10% of the memory consumption it had at the beginning, without any optimization... still not good enough.
How about using kbmMemTable? It compresses the data automatically.
http://www.components4programmers.com/products/kbmmemtable/index.htm
Thank you, I had a quick look at it, and I don't believe this could be useful for my app. I would need to convert all the loading and searching through the data... The whole app is around 200K lines... lots of searches... so changing all the code related to the main data records is not an option, at all.

@MerijnB: What do you mean by "8 booleans in 1 byte"? 1 boolean is already 1 byte long...


Well, if 1 bit is not possible to achieve, what about 4 bits, half a byte (0..15)?
Use a Byte (or unsigned integer), as MerijnB suggested, and treat each bit in that byte as a Boolean variable (True = anything other than zero; False = 0).

You can use the Bitwise operators to extract individual values from these variables.

http://www.aspfree.com/c/a/.NET/The-Delphi-Language-Part-1/2/

See section headed Bitwise Operators

Bitwise operators enable you to modify individual bits of a given integral variable. Common bitwise operators enable you to shift the bits to the left or right or to perform bitwise "and," "not," "or," and "exclusive or" (xor) operations with two numbers. The shift-left and shift-right operators are shl and shr, respectively, and they're much like the C# << and >> operators. The remainder of Delphi's bitwise operators are easy enough to remember: and, not, or, and xor.
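As a minimal illustration of the operators described above (a sketch, not from the linked article; the sample values are arbitrary):

```pascal
program BitwiseDemo;
{$APPTYPE CONSOLE}
var
  a, b: Byte;
begin
  a := $0F;              // 0000 1111
  b := $35;              // 0011 0101
  Writeln(a and b);      // 0000 0101 = 5
  Writeln(a or b);       // 0011 1111 = 63
  Writeln(a xor b);      // 0011 1010 = 58
  Writeln(Byte(not a));  // 1111 0000 = 240
  Writeln(a shl 1);      // 0001 1110 = 30
  Writeln(a shr 2);      // 0000 0011 = 3
end.
```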
SOLUTION (member-only content not shown)
A salient question seems to be "How many boolean values are you tracking in a given record?"

If the answer is more than one, then you will be able to save space by using the technique described by moorhouselondon.

If you have many boolean values per record, you might also consider using some Integer allocations to avoid alignment headaches.
I suspect (I haven't looked at the generated assembler code) that the bitwise operators will also give greater speed, because they map almost directly onto the underlying machine-code instructions when translated by the Delphi compiler.
Good idea, but I usually have only 1 or 2 boolean values... but lots of 'array of byte' records. In one record I have 5 of these array-of-byte sub-records (2-dimensional arrays)... and the records can run up to a couple of million...

TRec = record
  MainID: word;
  MainSubID: word;
  MainSubPropA: string;
  IndicatorA: boolean;
  ValuesA: array of byte;
  ValuesB: array of byte;
  ValuesC: array of byte;
  ValuesD: array of byte;
  ValuesE: array of byte;
end;

In this case ValuesA to ValuesE form some sort of grid of 1s and 0s that identifies specific relationships...

So some sort of bit manipulation would come in very handy here, but one line of Values can hold from 100 to 10000 entries.
ASKER CERTIFIED SOLUTION (member-only content not shown)
Thank you, it will take me some time to go through all this...
OK, here is what I believe should be the code for your suggestion, aikimark. I use procedures from the Bitwise unit... and I have 2 procedures of my own, one to set a bit and one to read it. In my example I have a grid (array) of 3 lines of 3 bytes, and I set the 0th bit in the 1st line, the 11th in the 2nd and the 23rd bit in the 3rd line to 1...

Can you please verify that this is what you had in mind, and if it is, whether I have any pitfalls in my code, or how it could perhaps be made faster or optimized:


///// 1. from BitWise unit, to manipulate bits:
 
function IsBitSet(const val: longint; const TheBit: byte): boolean;
begin
  result := (val and (1 shl TheBit)) <> 0;
end;
 
function BitOn(const val: longint; const TheBit: byte): LongInt;
begin
  result := val or (1 shl TheBit);
end;
 
function BitOff(const val: longint; const TheBit: byte): LongInt;
begin
  result := val and ((1 shl TheBit) xor $FFFFFFFF);
end;
 
function BitToggle(const val: longint; const TheBit: byte): LongInt;
begin
  result := val xor (1 shl TheBit);
end;
 
 
 
 
///// 2. My SetBit and GetBit:
 
 
procedure SetBit(var vValues: array of byte; vBit: integer);
var vPos, vBitPos: integer;
begin
  // vBit div 8 replaces Ceil(vBit/8)-1, which needed a special
  // case for bit 0 and mis-mapped bits 8, 16, ... into the previous byte
  vPos := vBit div 8;     // which byte
  vBitPos := vBit mod 8;  // which bit in byte
 
  // set the bit
  vValues[vPos] := BitOn(vValues[vPos], vBitPos);
end;
 
function GetBit(const vValues: array of byte; vBit: integer): integer;
var vPos, vBitPos: integer;
begin
  vPos := vBit div 8;     // which byte
  vBitPos := vBit mod 8;  // which bit in byte
 
  // read whether the bit is set
  if IsBitSet(vValues[vPos], vBitPos) then
    Result := 1
  else
    Result := 0;
end;
 
 
 
///// 3. Usage 
 
 
procedure TForm1.Button2Click(Sender: TObject);
 
type TRec = record
      vId:integer;
      vBytes:array of byte;
end;
 
var
  i,j: Integer;
  vStr:string;
  vArray:array of TRec;
  k: Integer;
begin
 
  // set array 3x3 
  SetLength(vArray,3);
  SetLength(vArray[0].vBytes,3);
  SetLength(vArray[1].vBytes,3);
  SetLength(vArray[2].vBytes,3);
 
  // set 3 bits 
  SetBit(vArray[0].vBytes,0);
  SetBit(vArray[1].vBytes,11);
  SetBit(vArray[2].vBytes,23);
 
  // read same 3 bits
  ShowMessage('bit 0 = '+IntToStr(GetBit(vArray[0].vBytes,0)));
  ShowMessage('bit 11 = '+IntToStr(GetBit(vArray[1].vBytes,11)));
  ShowMessage('bit 23 = '+IntToStr(GetBit(vArray[2].vBytes,23)));
 
 
  // write 3x3 bytes to memo, to see 'binary' results
  vStr:='';
  for i := 0 to Length(vArray) - 1 do
  begin
    vStr:='';
    for k := 0 to Length(vArray[i].vBytes) - 1 do
    for j := 0 to 7 do
    begin
      if IsBitSet(vArray[i].vBytes[k],j) then
        vStr:='1'+vStr
      else
        vStr:='0'+vStr;
    end;
 
    Memo1.Lines.Add(vStr);
  end;
 
end;


From my example I get:

000000000000000000000001
000000000000100000000000
100000000000000000000000

It looks like it should be quite correct, right?
* You were correct to make this zero-origin. My math example was overly simplistic. Your invoking code can use a one-based origin and the underlying calculations can translate it into zero-origin. This might make documentation and maintenance easier. It's up to you.

* If you are going to do much rendering of your bit array as a string, you should probably avoid concatenation.  If you create a PChar of the appropriate length, you only need to set its bytes to 48 (='0') or 49 (='1') as you look at the bit values.

* You need to remember that your rendering is reading right to left.

* You haven't mentioned persistence, but you will save an equal amount of disk space by writing the byte values of your bit array as binary stream output. It's not so good for human reading, but it sure is fast to save and load.
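The no-concatenation rendering suggested above can be sketched like this: preallocate a string of the right length and write each character in place instead of building the string with repeated `'1'+vStr` concatenations (the `BitsToStr` name is made up for this example; `IsBitSet` is from the Bitwise unit quoted earlier):

```pascal
program RenderBits;
{$APPTYPE CONSOLE}

function IsBitSet(const val: LongInt; const TheBit: Byte): Boolean;
begin
  Result := (val and (1 shl TheBit)) <> 0;
end;

// Render a byte array as a bit string (highest bit leftmost)
// without repeated concatenation: preallocate, then fill in place.
function BitsToStr(const vValues: array of Byte): string;
var
  i, j, n: Integer;
begin
  n := Length(vValues) * 8;
  SetLength(Result, n);
  for i := 0 to High(vValues) do
    for j := 0 to 7 do
      if IsBitSet(vValues[i], j) then
        Result[n - (i * 8 + j)] := '1'   // Delphi strings are 1-based
      else
        Result[n - (i * 8 + j)] := '0';
end;

var
  Bytes: array[0..2] of Byte;
begin
  // bits 0, 11 and 23 set, matching the example output in the thread
  Bytes[0] := 1; Bytes[1] := 8; Bytes[2] := 128;
  Writeln(BitsToStr(Bytes));  // 100000000000100000000001
end.
```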
I agree. I'll have to make sure I only use the proper procedures to read/write so I don't mix right-to-left and left-to-right.

Anyway, this is very good concept, at least for this problem with arrays of 1/0 data.

Now I also understand why Byte is the smallest addressable data type; anything smaller is achievable, but only through bit manipulation and only where it makes sense.

Thanx
I split the points: MerijnB first pointed out using bits within bytes... moorhouselondon explained what MerijnB meant... and aikimark provided the final calculation.

Thank you guys
Although this question is closed (thanks for the points, btw), I need to ask if these bit arrays are sparsely populated.  That is, what is the ratio of 1 values to 0 values?

You can use a different encoding for sparse arrays/matrices that can give even greater space savings. Ultimately, you have the ability to mix the data structures you've accepted with this question with a sparse array, where the on/off bits are represented by a list of two-byte integers (given your 10000 max dimension size). If the sparse dimension is mostly 1 values, then the sparse representation would be the locations of the 0 values. Conversely, if the dimension is mostly 0 values, the sparse representation would be the 1 locations. Your code can decide when one representation saves the most space and convert the dimension dynamically. Doing this is a trade-off between program complexity, conversion overhead/optimization cycles, and memory.
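A minimal sketch of that sparse idea, with hypothetical names (`TSparseRow`, `Pack`, `Cell` are invented for this example): each row stores only the Word positions of its minority value, whichever that is.

```pascal
program SparseRowDemo;
{$APPTYPE CONSOLE}

type
  // Hypothetical sparse row: stores positions of the minority value.
  TSparseRow = record
    StoredValue: Byte;         // which value the positions list (1 or 0)
    Positions: array of Word;  // column indices holding StoredValue
  end;

// Build a sparse row from a dense 0/1 byte row.
function Pack(const Dense: array of Byte): TSparseRow;
var
  i, Ones, n: Integer;
begin
  Ones := 0;
  for i := 0 to High(Dense) do
    if Dense[i] = 1 then Inc(Ones);

  // store whichever value occurs less often
  if Ones * 2 <= Length(Dense) then
    Result.StoredValue := 1
  else
    Result.StoredValue := 0;

  n := 0;
  SetLength(Result.Positions, Length(Dense));
  for i := 0 to High(Dense) do
    if Dense[i] = Result.StoredValue then
    begin
      Result.Positions[n] := i;
      Inc(n);
    end;
  SetLength(Result.Positions, n);  // shrink to actual count
end;

// Read one cell back out of the sparse row.
function Cell(const Row: TSparseRow; Col: Word): Byte;
var
  i: Integer;
begin
  Result := 1 - Row.StoredValue;   // the majority value by default
  for i := 0 to High(Row.Positions) do
    if Row.Positions[i] = Col then
    begin
      Result := Row.StoredValue;
      Exit;
    end;
end;

var
  Dense: array[0..6] of Byte;
  Row: TSparseRow;
begin
  Dense[0] := 0; Dense[1] := 1; Dense[2] := 0; Dense[3] := 0;
  Dense[4] := 1; Dense[5] := 0; Dense[6] := 0;
  Row := Pack(Dense);     // mostly zeros, so the 1 positions are stored
  Writeln(Cell(Row, 1));  // 1
  Writeln(Cell(Row, 2));  // 0
end.
```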

======================
Also, what we never discussed were the non-bit-array parts of your record. I'm not sure what data
    MainSubPropA:string;
contains or how it's used. String data does not lend itself to tight packing and introduces alignment 'artifacts'. It might be worth considering a table of these string values, pointed to by a 32-bit value in the MainSubPropA of the records. Also, if there are redundancies among these string values, you can save even more space through duplicate elimination. You might even be able to use a Word-sized data type if the number of unique strings falls below 64K.
Yes, I actually use the Word type for those IDs. The issue I have is with the biggest array, which is a cross-reference between 7K objects, so 7K x 7K. For some situations I need the complete matrix; for some I also need another matrix with less data to combine results... quite complex.

With this example I'll get from 7000x7000 bytes down to 7000x875 bytes, 8x fewer, so memory consumption will come down to 12.5% of the original, and that is very good.

So, for example, the 7Kx7K grid indicates with a 1 where certain objects cross (hit) with their properties or data. And I export this to Excel, with a simple 'X' where rows and columns cross. In some situations, instead of the complete grid, I use an approach where each line holds the IDs of the objects it crosses: if the 1st line (1st object/1st row) only crosses 10 columns, I would only have a length of 10 Words, each holding an object ID, like 10, 11, 30, 400, 599, 2000, 3009, 4000, ... so much less memory consumption. But the range of hits can be from 0 to 7K, because one object can cross with all objects. And in this case I use Word instead of Byte... I believe this is what you mean by 'sparse representation'?

This situation with so much memory consumption is very rare, but I need to make it happen as rarely as possible.

I have also changed other records from string to char where I know that only 1 character identifies the value, like 'Y'es/'N'o.

It will take me some time to implement this, but I will let you know how it went and the final results.
A little bit late, but you can even use a set for this; it's much simpler.

If you add more than 8 elements, it will automatically become a 16-bit integer:
type TBit = (Bit0, Bit1, Bit2, Bit3, Bit4, Bit5, Bit6, Bit7);
     TBits = set of TBit;
 
var Bits: TBits;
    B: byte;
begin
 Bits := [Bit0, Bit3];
 
 B := Byte(Bits);  // B is now 9
 
 B := 129;
 Bits := TBits(B); // Bits is now [Bit0, Bit7]
 
 if (Bit0 in Bits) then
 begin
  // check for a single bit
 end;
end;

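For changing single bits in this set representation, Delphi's standard Include and Exclude procedures (or the + and - set operators) apply; a minimal sketch reusing the TBit/TBits types from the code above:

```pascal
program SetBitsDemo;
{$APPTYPE CONSOLE}

type
  TBit  = (Bit0, Bit1, Bit2, Bit3, Bit4, Bit5, Bit6, Bit7);
  TBits = set of TBit;

var
  Bits: TBits;
begin
  Bits := [];
  Include(Bits, Bit3);    // set bit 3
  Include(Bits, Bit0);    // set bit 0
  Exclude(Bits, Bit0);    // clear bit 0 again
  Writeln(Byte(Bits));    // 8
  Writeln(Bit3 in Bits);  // TRUE
  Writeln(Bit0 in Bits);  // FALSE
end.
```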

@MerijnB:

So, if I have for example a record of 100 values, i.e. 100 bits... I would have it defined as

vValues: array of TBits;
...
SetLength(vValues, Ceil(100/8)); // 13 TBits to store 100 bits

And to access for example the 20th bit, I would need to work out which of the TBits to access and which bit, similar to the example above:

 vPos := 20 div 8;      // which byte
 
  vBitPos := 20 mod 8;   // which bit in byte
 
  // set the bit without disturbing the others
  Include(vValues[vPos], TBit(vBitPos));

and to check the 20th bit:

If (TBit(vBitPos) in vValues[vPos]) then...


Actually it looks quite the same, but it seems a bit easier...


How would you clear a bit in this case, once you set it?
That's pretty slick, MerijnB.

You've shown how a single bit can be examined, but how would a single bit be changed?

=======================
@Delphi_developer

While I understand your reluctance to change your 200k line program, I think that this kind of matching problem really lends itself to a database engine.  Once you get some breathing room on this problem, you might want to post another question that concerns the representation of your record data in a relational format, suitable for a database engine to tackle the core part of this problem.  I'm a big proponent of code reduction through outsourcing.  Introducing a database component would likely add flexibility as well as simplicity to your codebase.

7kx7k = 49,000,000 bytes, which isn't very many bytes for today's processors, even with a one-byte-per-bit representation.
Thank you, but a database is not an option. My application is standalone, and the ease of use and absence of a database are the reasons it sells quite well... the competition had a database back-end... it wasn't accepted at all...


Actually the rest of the code is quite optimized, but the latest functionality, the matrix of cross-references and relations between objects, is the one having issues... but I'll fix it with one of the two solutions I got here.

You don't have to have a database backend, just a database engine.  Everything can exist in memory and the database engine dlls can be included INSIDE of your application code (.exe).  No client need know *how* you are doing your cross referencing.

The advantage of a database engine is that you don't have to change your code as the data requirements expand.
OK, if it is all in memory, then what would be the difference between what I have now and database engine?

Here is what I have now, and let me know where I could benefit with the change:

All of my data is in arrays of records, similar to tables. They can have from 5 to 50 fields, and up to a couple of million records. I have optimized data types from integer to word, smallint and byte, and string to char, where possible.

For speed I also use a sort of index... for example: most of the arrays have two-level IDs of records, like MainObjectID and SubObjectID. And for all FOR loops I have:

GetABCObjectIndexes(MainID, SubID, vStart, vStop);
for i := vStart to vStop do
...

So, the search for data is very fast.

* "The advantage of a database engine is that you don't have to change your code as the data requirements expand."

Well, I don't see this one... If I need to add another record field, I need to change the 'Load' procedures and all code that relates to that array, if the field changes anything relevant.
But it is the same with a database: if the field is 'important', affecting most of the searches, the code needs to change just as in my code, right?

So I don't believe the speed would improve; if it's all in memory, then memory consumption is not going to change, and the changes I would need to make in my code to manipulate arrays would also be needed with a database.

The only reason for a DB that I see is that it consumes very little memory, because it is a standalone DB. But if it is just an engine, as you said, all in memory, then I don't see any benefit over my code.

What do you think?
When you get/make some training time, please look up database normalization.

I don't know your problem sufficiently to state my opinion as fact, since you may be solving your problem in the optimum way.  However, when I see your 'records', I think of them as implementing non-normalized relationships.  I see this often in user-created databases and spreadsheets (used as databases).

If you represent your data differently, you might be able to pass it to a database engine and request (through SQL statements) the matching information you seek, including filtering grouping and sorting -- just ready for reporting.

I plead the same ignorance about your 'indexing' (efficiency, structure, performance, etc.)
I am a bit stubborn, so thank you for suggesting a DB, but I doubt I will ever change my code to use one, unless there is substantial money in it for me. Just too much code to change.

I know the general benefits of DBs, but I believe my code doesn't have much room left to optimize, perhaps 10-15%, and only with a lot of changes... 3 years ago my app analyzed, for example, 100MB of data in about an hour, a full analysis, which is about 40% of what it does now... now it does much more within 5 minutes. So, having everything in memory is the best option for speed. It used to take about 200MB of memory for 100MB of data. Now 2GB of data takes up 200MB of memory. So, I did a very good job.

The indexes... this is really cool solution. I don't know the details of DBs indexes, but my work this way: If you have a financial records, and let say the main ID is date. And you have 1M records for 2 years, 24 months. You just loop once through the array and build another array of date_from and date_to values. And each date_from and date_to are indexes (array[index]) to the main array. So when you want to just do something with records of month 7, you first call GetIndex(Month =7) and you get starting and ending index of the main array. So you don't loop thought the whole financial array but only through that monht's records. OK, if you loop through records only a couple of times, no problems, but as soon as you have 1000s of searches then this concept proves very very fast and efficient.