Link to home
Start Free TrialLog in
Avatar of SheratonGroup
SheratonGroup

asked on

Type of a Set in Delphi

Hi,

I am trying to create a type with a set of characters, that includes ranges.  I am using #xxx for characters rather than the actual character .  

I have tried several variations on
type TCharsToAllow = (#128, #131, #134..#135, #138, #140, #142, #152, #154, #156);

Does anyone have the proper syntax for this?

Basically I would like to be able to do something like:

var
S : string;
CharsToAllow : TCharsToAllow
..
begin
...
if NOT (S[j] in [CharsToAllow]) then  
 exit;


ASKER CERTIFIED SOLUTION
Avatar of cyberkiwi
cyberkiwi
Flag of New Zealand image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>ewangoya
aren't you testing only the first  valid char ?

you can use a set, but that's limited to 256 entries
not that this is a problem in this case

since these are all characters, you could just as well use a string
storing/retrieving is a lot easier as a string

var CharsToAllow: string;

CharsToAllow := #128#131#134#135#138#140#142#152#154#156;

function Validate(const S: string): boolean;
var
   I: Integer;
begin
  // All chars must be allowed in s
  Result := True;
  for I := 1 to length(S) do
    if Pos(S[I], CharsToAllow) = 0 then
    begin
      Result := False;
      Exit;
    end;
end;
using a string & Pos is just much longer than using a set. When you use the IN operator, it translates basically into a handful of bitwise operations (one shift, one And, another shift and one Or). The time to look for a char in a set is the same whatever the char and the set - extremely small.

The Pos function on the other hand can be time consuming, and the time to check a char in a string is not the same whatever the char and the string. If the char is not in it, it takes a maximum of time, that is proportional to the length of the string.

And the memory usage is different, for a char set it uses 256/8=32 bytes, whatever the number of chars in the set.

In short : don't use anything else than a set if you need a set. Especially of (ANSI)Chars

Little details :
- you don't have to redefine a type = Set of Char, it exists already as TCharSet
- ewangoya's function is the most useful/complete implementation, but is faulty (reversed logic)

function Validate(const S: string): Boolean;
const
 CCharsToAllow: TCharSet = [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156];
var
 i: Integer;
begin
 Result := False;
 for i := 1 to length(S) do
  if NOT (S[i] in CharsToAllow) then Exit;
 Result := True;
end;

Open in new window

@epasquier
>> ewangoya's function is the most useful/complete implementation, but is faulty (reversed logic)

You may have missed the first comment, where I thought I had already demonstrated how to define a const set and use it properly.  Once understood, the asker could turn it into a function like you have or use it inline if there are multiple areas in the program requiring different chars-to-allow sets.

Do you have any documentation about this claim? - "(one shift, one And, another shift and one Or) The time to look for a char in a set is the same whatever the char and the set"

Specifically, it is hard to fathom 4 ops (unless it is a quantum computer) resolving whether one char belongs to a set containing say 60 chars.  But I stand to be corrected.
[I am not endorsing the Pos() approach, just not following the claim]
Well, I know how the set are implemented in pascal for ages. About the real ops they use to implement IN operator, I have not disassembled to check , but that is how I would do it knowing the data structure. I only assumed that in all the years borland worked on pascal/Delphi, they found the best way to implement that.
That would be easy to check though. Before I'll do that, let me explain how a set is working :

A set variable is a bit field of 256 elements (max, but it's possible that for easy coding and performance they choose to reserve all 32 bytes - trade memory for performance is usual).

Let us see with a custom set of custom enumerated type :
Type
  TMyEnum=(e1,e2,e3,e4);
  TMySet=Set of TMyEnum;

Var
 aSet:TMySet;

aSet data is coded in the first 4 bits of 1 byte. Each bit say if the corresponding element is in the set
so testing the presence of e2 in aSet would be just an AND operation :
e2 IN aSet => (aSet[0] AND 2)>0
you see in that notation that I consider aSet as a byte Array, and that I have fixed 0 and 2 values. Those are respectively the Byte index of e2 bit within aSet, and the bit value of e2 in this byte
let's call them e2_Bi and e2_bv

if we say that for whatever set of enums, which values are ordered as ei , i being the enum index from 0 to N-1 (N<=256), then we can say that :
ei_Bi = i DIV 8 = i SHR 3  ( first bitwise right shift operation)
ei_bv = 2^ei_bi = 1 SHL ei_bi (second bitwise operation) , where I introduced ei_bi = bit index
and : ei_bi = i MOD 8 = i AND 7

so the final translation of ei IN aSet is :
(aSet[i SHR 3] AND (1 SHL (i AND 7)))>0

which means I said something wrong, it's 2 shifts and 2 AND, not one AND + one OR
Ok, time for checking ASM :

Type
  TMyEnum=(e1,e2,e3,e4);
  TMySet=Set of TMyEnum;
Var
 aSet:TMySet;

procedure TForm1.btn1Click(Sender: TObject);
begin
 aSet:=[e1,e3];
 if e3 IN aSet
  Then ShowMessage(Format('%d %x',[sizeof(aSet),Byte(aSet)]));
end;

==> Message = '1 5'
that settles the fact that the size of the set is optimized for the maximum elements it can contain. For set of Chars, that is 32 bytes. But for that simple set, only 1 byte.
You'll see that the integer value of this set is 5 = 1 + 4 , as explained.
You'll also see in the ASM below that the test is actually using the TEST operator, which is precisely there to check the presence of ONE bit directly, so it avoids one AND operation. I forgot about this because no modern language have the TEST operator, only possible in ASM. Even in C, you would have to do the same as what I explained.

Now this single OP test is optimized because the e3_Bi and e3_bv are already calculated at compile time. I'll make a test where the compiler can't optimize that
Main.pas.81: aSet:=[e1,e3];
00443325 A090334400       mov al,[$00443390]
0044332A A24C584400       mov [$0044584c],al
Main.pas.82: if e3 IN aSet
0044332F F6054C58440004   test byte ptr [aSet],$04
00443336 7437             jz TForm1.btn1Click + $63
Main.pas.83: Then ShowMessage(Format('%d %x',[sizeof(aSet),Byte(aSet)]));
...

Open in new window

If I may interject, you probably want to construct your tests against char in string specifically.
Because using an enum, it is already optimized into known types, whereas for string op (ascii, 256 possible chars) against set, I don't think it is possible in 4.
Type
  TMyEnum=(e1,e2,e3,e4,e5,e6,e7,e8,e9,e10);
  TMySet=Set of TMyEnum;

function Test(eVal:TMyEnum;aSet:TMySet):Boolean;
begin
 Result:=eVal IN aSet;
end;

Ok, now that is yet another operator : BT (Bit Test), introduced with 386 that does exactly what I said but in one instruction. Of course, since it is quite frequently used. It works with the base address as first operand, and the bit index in second (but global bit index, not index in the indexed Byte). Internally, you can bet hard currency that this instruction is cabled to perform exactly the same calculation. But I won't go that far :o)
Note the first 2 instructions before the BT test, to ensure that the enum value is within the scope of the set (10 values), and to normalize the whole 32 bit register EAX to this 8 bit index. They just want there to be on the safe side, because they don't know really what you call this function with.

Bottom Line : ALWAYS use sets when you have a maximum of 256 elements. All operations are blazingly fast. For bigger sets, you can use the TBits type that can hold a large number of bits dynamically, and read/rite them based on index. But you'll have to implement + (OR) and - (AND) operators equivalent, as it is not done (checked on Delphi 5, maybe it has been done after, but that is probably not used enough)
Main.pas.80: begin
0044330C 51               push ecx
0044330D 66891424         mov [esp],dx
Main.pas.81: Result:=eVal IN aSet;
00443311 3C0F             cmp al,$0f
00443313 7707             jnbe Test + $10
00443315 83E07F           and eax,$7f
00443318 0FA30424         bt [esp],eax
0044331C 0F92C0           setb al
Main.pas.82: end;

Open in new window

a set of char is no different than a set of an enum type having 256 possible values. Or a set of Byte for that matter.

Type
  TCharSet=Set of Char;
Var
 aCharSet:TCharSet;

function Test2(eVal:Char;Const aSet:TCharSet):Boolean;
begin
 Result:=eVal IN aSet;
end;

procedure TForm1.btn1Click(Sender: TObject);
begin
 aCharSet:=['A'..'Z'];
 if Test2('A',aCharSet)
  Then ShowMessage(Format('Size=%d ''@''..''_''=%x',[sizeof(aCharSet),PInt(Integer(@aCharSet)+8)^]));
end;

// note just the Const I added before the charset parameter, to avoid the copying of the whole 32 bytes in the stack. The following is then the plain ASM function, with all stack frame management and testings, which you can see are reduced to the bare minimum.
Main.pas.88: Result:=eVal IN aSet;
0044330C 25FF000000       and eax,$000000ff
00443311 0FA302           bt [edx],eax
00443314 0F92C0           setb al
Main.pas.89: end;
00443317 C3               ret 

Open in new window

@epasquier
In your code, 'A' is a constant and has been optimized.
See listing below, and the CPU code.
CPU
Result:=eVal IN aSet;
mov eax,[ebp-$08]
mov dl[ebp-$01]
and edx,$000000ff
bt [eax],edx
setb al
mov [ebp-$09],al
mov al,[ebp-$09]
end;

Delphi

{$APPTYPE CONSOLE}
program Project2;

Type
  TCharSet=Set of Char;

function Test2(eVal:Char;Const aSet:TCharSet):Boolean;
begin
 Result:=eVal IN aSet;
end;

Var
 aCharSet:TCharSet;
 s: string;
 i: integer;
 c: char;
begin
 aCharSet:=['a'..'m']; // 1st half lowercase alphabet
 ReadLn(s);// := 'This is a test string';
 for i := 1 to Length(s) do
 begin
  c := s[i];
  if Test2(c,aCharSet) Then Write(c);
 end;
end.

Open in new window

eVal parameter does not matter if it is from a constant 'A'  or read from a file as far as the function is concerned.

Out of curiosity I compiled and run your app from Delphi 5 and got the same result as my previous example.
I can't explain why you don't have the same ASM code as me, except that we don't use the same Delphi, but still if Delphi 5 is capable of producing a perfect ASM code, why yours (that I suspect is newer than Delphi 5, isn't it ?) is fiddling with the stack instead of directly use the registers ? Maybe an unoptimized compiler option.
Look especially the last 2, they are obviously useless
mov [ebp-$09],al
mov al,[ebp-$09]
same as :
i:=j;
j:=i;
...........

but whatever, even with your code it's obvious that the IN operator is really smartly implemented, and whatever it is that added some strange code around would not when you use the IN operator directly in the code of the functions that need it instead of having a function for it.
Avatar of SheratonGroup
SheratonGroup

ASKER

SO much useful information here.  I really enjoyed reading the discussion.  I actually used information from the first two posts to create a function.  So I am splitting points.

On a side note I am basically trying to validate e-mail addresses that may have umlauts/diactrics/n-tilde's/etc.  Basically e-mails with "special" letters greater than the 128.  

I am really only interested in chars <ascii 255.  I know it's not perfect, but it's good enough for our needs.  I am using Delphi 6, so unicode support is not automatic.

I would welcome comments on the code below.
function TForm1.validateEmail(S : String) : boolean;
var
  i : integer;
  localPart, domainRoot, TLD, strAfterAt : String;
  atPos, TLDPos : integer;

  LPStdCharsToAllow : TCharsToAllow;   //Local Part Chars to allow
  LPSpecCharsToAllow : TCharsToAllow;  //Local Part Chars >128 to allow (diacritics, umlauts, etc.)

  DRStdCharsToAllow : TCharsToAllow;   //Domain Root Chars to allow
  DRSpecCharsToAllow : TCharsToAllow;  //Domain Root Chars >128 to allow.

  TLDCharsToAllow  : TCharsToAllow; //Top Level Domain Chars to allow
  TLDIDNCharsToAllow : TCharsToAllow;  //International Top Level Domain Chars to allow, no chars over 128, but internationalized TLD's ok.

  InternationalTLD : Boolean;
  InternationalTLDCharStart : integer;
begin

InternationalTLD := FALSE;

LPStdCharsToAllow := [#33, #35..#39, #42..#43, #45..#57, #61, #63, #65..#90, #94..#96, #97..#126];
LPSpecCharsToAllow := [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156, #158..#159, #161..#165, #167, #181, #192..#246, #248..#255];

DRStdCharsToAllow := [#45..#46, #48..#57, #65..#90, #97..#122];
DRSpecCharsToAllow := [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156, #158..#159, #161..#165, #167, #181, #192..#246, #248..#255];

TLDCharsToAllow := [#65..#90, #97..#122];       //non-international top level domains
TLDIDNCharsToAllow := [#48..#57, #65..#90, #97..#122];  //for international domains (just the alphanumeric part after the "XN--")

    result := FALSE;
    S := trim(S);

    if ((length(S) < 7) AND (length(S) > 254)) then  //checking overall length of address
      exit;

    atPos := pos('@', S);  //find the @ symbol and split the string.

    if atPos = 0 then
      exit;  // no @ in email address.

    localPart := copy(S, 1, atPos-1);

    if length(localPart) > 64 then
      exit;  //localPart must be less than 64 chars.

    if length(localPart) > 0 then
      strAfterAt := copy(S, atPos+1, length(S)-atPos)
    else
      exit;  //zero length email prefix

    if pos('@', strAfterAt) > 0 then  //Ensure that there are no other @ symbols
      exit;

//Parsing local-part...
    if (localPart[1] = '.') OR (localPart[length(localPart)] = '.') then    //a period cannot begin or end the local-part
      exit;

    if ansiContainsText(S, '..') then        //ensure that 2 or more periods do not occur sequentially anywhere
      exit;

    //valid chars allowed in the local-part of an email address        all upper/lowercase letters and numbers
    //decimal values                                                   [48-57], [65-90], [97-122]

    //valid special chars allowed in the email prefix w/out quotes...  !  #  $  %  &  '  *  +  -  .  /  =  ?  ^  _  `  {   |   }   ~
    //decimal values for spec chars to allow...                        33 35 36 37 38 39 42 43 45 46 47 61 63 94 95 96 123 124 125 126

    for i := 1 to length(localPart) do
    begin
       if NOT (localPart[i] in LPStdCharsToAllow) then   
         if NOT (localPart[i] in LPSpecCharsToAllow) then   
           exit;  //invalid chars found in unquoted local-part

    end;

               //TODO LATER... maybe... check if necessary.
               //corner case not coded for.
               //according to RFC2822 nearly any char can be used in an email address if they're within double quotes
               //it is fairly widely accepted practice not to allow this, because of security risks, eg, embedding malicious
               //commands within an email address.

    TLDPos := 0;
    for i := length(strAfterAt) downto 1 do      //retrieving the domain root and the TLD in one step.
    begin
      if strAfterAt[i] = '.' then
      begin
        TLDPos := i;
        break;
      end;
    end;

    if TLDPos = 0 then   
      exit;

    domainRoot := copy(StrAfterAt, 1, TLDPos-1);     // subtracting one to exclude the period.
    TLD := copy(strAfterAt, TLDPos+1, length(strAfterAt) - TLDPos);

//Parsing domainRoot
    if length(domainRoot) = 0 then
      exit;  //ensuring that the domain root is not blank

    if (domainRoot[1] in ['.','-']) OR (domainRoot[length(domainRoot)] in ['.','-']) then
      exit;  //not allowed to have these chars as the first or last char in domainRoot

    for i := 1 to length(domainRoot) do
    begin
      if NOT (domainRoot[i] in DRStdCharsToAllow) then   
         if NOT (domainRoot[i] in DRSpecCharsToAllow) then   
           exit;  //invalid chars found in domainRoot
    end;

//Parsing Top Level Domain
    if length(TLD) < 2 then
      exit;  //Top Level Domains have a minimum of two chars, as of 8/23/10

    if (ansiStartsText('XN--', TLD) AND (length(TLD) > 4)) then  //International TLD (starts with 'XN--' prefix)
    begin
      for i := 5 to length(TLD) do
        if NOT (TLD[i] in TLDIDNCharsToAllow) then
          exit;
    end
    else
    begin
      for i := 1 to length(TLD) do
        if NOT (TLD[i] in TLDCharsToAllow) then
          exit;
    end;

    showmessage(localPart + #10#13 + domainRoot + #10#13 + TLD);
result := TRUE;
{
The total number of chars in the TLD are at least 2
See http://data.iana.org/TLD/tlds-alpha-by-domain.txt for an up to date list of valid TLD extensions.  Since this changes it is not a good idea to hard code against this list.  So, simply ensuring that there are at least 2 valid chars in the extension.  And that the dashes appear in the right places.  Side Cases… not coded for.

It is possible pass the domain part of an email in brackets (eg, [129.55.222.88]
It is considered valid to allow quoted emails with any character at all in between.
See http://www.regular-expressions.info/email.html
}

end;

Open in new window

PS - One advantage of using a set and not a string is that it allows for using ranges of chars, intermixed with individual chars.  I see now that my original example didn't have any wide ranging chars, but this was necessary.    

thanks again.
range in sets is only useful for the coder : less characters to type. They actually are not stored differently than if you explicitly put all your chars in your set