Solved

Type of a Set in Delphi

Posted on 2010-08-24
15
722 Views
Last Modified: 2013-11-23
Hi,

I am trying to create a type with a set of characters, that includes ranges.  I am using #xxx for characters rather than the actual character .  

I have tried several variations on
type TCharsToAllow = (#128, #131, #134..#135, #138, #140, #142, #152, #154, #156);

Does anyone have the proper syntax for this?

Basically I would like to be able to do something like:

var
S : string;
CharsToAllow : TCharsToAllow
..
begin
...
if NOT (S[j] in [CharsToAllow]) then  
 exit;


0
Comment
Question by:SheratonGroup
  • 7
  • 4
  • 2
  • +2
15 Comments
 
LVL 58

Accepted Solution

by:
cyberkiwi earned 250 total points
Comment Utility
const CharsToAllow = [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156];

procedure TForm1.Button1Click(Sender: TObject);
var
S : string;
j: integer;
begin
s := Edit1.Text;
for j := 1 to length(s) do
  if NOT (S[j] in CharsToAllow) then
   ShowMessage(s + ' is not valid');
end;
0
 
LVL 32

Assisted Solution

by:ewangoya
ewangoya earned 250 total points
Comment Utility
if you want to declare it as type and reuse it, you can do

type
  TCharsToAllow = set of char;

then use it as cyberkiwi suggested

function Validate(const S: string): Boolean;
const
  CCharsToAllow: TCharsToAllow = [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156];
var
   I: Integer;
begin
  for I := 1 to length(S) do
  begin
     Result := S[I] in CCharsToAllow;  
     if Result then
        Exit;
  end;
  Result := False;
end;
0
 
LVL 36

Expert Comment

by:Geert Gruwez
Comment Utility
>>ewangoya
aren't you testing only the first  valid char ?

you can use a set, but that's limited to 256 entries
not that this is a problem in this case

since these are all characters, you could just as well use a string
storing/retrieving is a lot easier as a string

var CharsToAllow: string;

CharsToAllow := #128#131#134#135#138#140#142#152#154#156;

function Validate(const S: string): boolean;
var
   I: Integer;
begin
  // All chars must be allowed in s
  Result := True;
  for I := 1 to length(S) do
    if Pos(S[I], CharsToAllow) = 0 then
    begin
      Result := False;
      Exit;
    end;
end;
0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
using a string & Pos is just much longer than using a set. When you use the IN operator, it translates basically into a handful of bitwise operations (one shift, one And, another shift and one Or). The time to look for a char in a set is the same whatever the char and the set - extremely small.

The Pos function on the other hand can be time consuming, and the time to check a char in a string is not the same whatever the char and the string. If the char is not in it, it takes a maximum of time, that is proportional to the length of the string.

And the memory usage is different, for a char set it uses 256/8=32 bytes, whatever the number of chars in the set.

In short : don't use anything else than a set if you need a set. Especially of (ANSI)Chars

Little details :
- you don't have to redefine a type = Set of Char, it exists already as TCharSet
- ewangoya's function is the most useful/complete implementation, but is faulty (reversed logic)

function Validate(const S: string): Boolean;

const

 CCharsToAllow: TCharSet = [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156];

var

 i: Integer;

begin

 Result := False;

 for i := 1 to length(S) do

  if NOT (S[i] in CharsToAllow) then Exit;

 Result := True;

end;

Open in new window

0
 
LVL 58

Expert Comment

by:cyberkiwi
Comment Utility
@epasquier
>> ewangoya's function is the most useful/complete implementation, but is faulty (reversed logic)

You may have missed the first comment, where I thought I had already demonstrated how to define a const set and use it properly.  Once understood, the asker could turn it into a function like you have or use it inline if there are multiple areas in the program requiring different chars-to-allow sets.

Do you have any documentation about this claim? - "(one shift, one And, another shift and one Or) The time to look for a char in a set is the same whatever the char and the set"

Specifically, it is hard to fathom 4 ops (unless it is a quantum computer) resolving whether one char belongs to a set containing say 60 chars.  But I stand to be corrected.
[I am not endorsing the Pos() approach, just not following the claim]
0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
Well, I know how the set are implemented in pascal for ages. About the real ops they use to implement IN operator, I have not disassembled to check , but that is how I would do it knowing the data structure. I only assumed that in all the years borland worked on pascal/Delphi, they found the best way to implement that.
That would be easy to check though. Before I'll do that, let me explain how a set is working :

A set variable is a bit field of 256 elements (max, but it's possible that for easy coding and performance they choose to reserve all 32 bytes - trade memory for performance is usual).

Let us see with a custom set of custom enumerated type :
Type
  TMyEnum=(e1,e2,e3,e4);
  TMySet=Set of TMyEnum;

Var
 aSet:TMySet;

aSet data is coded in the first 4 bits of 1 byte. Each bit say if the corresponding element is in the set
so testing the presence of e2 in aSet would be just an AND operation :
e2 IN aSet => (aSet[0] AND 2)>0
you see in that notation that I consider aSet as a byte Array, and that I have fixed 0 and 2 values. Those are respectively the Byte index of e2 bit within aSet, and the bit value of e2 in this byte
let's call them e2_Bi and e2_bv

if we say that for whatever set of enums, which values are ordered as ei , i being the enum index from 0 to N-1 (N<=256), then we can say that :
ei_Bi = i DIV 8 = i SHR 3  ( first bitwise right shift operation)
ei_bv = 2^ei_bi = 1 SHL ei_bi (second bitwise operation) , where I introduced ei_bi = bit index
and : ei_bi = i MOD 8 = i AND 7

so the final translation of ei IN aSet is :
(aSet[i SHR 3] AND (1 SHL (i AND 7)))>0

which means I said something wrong, it's 2 shifts and 2 AND, not one AND + one OR
0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
Ok, time for checking ASM :

Type
  TMyEnum=(e1,e2,e3,e4);
  TMySet=Set of TMyEnum;
Var
 aSet:TMySet;

procedure TForm1.btn1Click(Sender: TObject);
begin
 aSet:=[e1,e3];
 if e3 IN aSet
  Then ShowMessage(Format('%d %x',[sizeof(aSet),Byte(aSet)]));
end;

==> Message = '1 5'
that settles the fact that the size of the set is optimized for the maximum elements it can contain. For set of Chars, that is 32 bytes. But for that simple set, only 1 byte.
You'll see that the integer value of this set is 5 = 1 + 4 , as explained.
You'll also see in the ASM below that the test is actually using the TEST operator, which is precisely there to check the presence of ONE bit directly, so it avoids one AND operation. I forgot about this because no modern language have the TEST operator, only possible in ASM. Even in C, you would have to do the same as what I explained.

Now this single OP test is optimized because the e3_Bi and e3_bv are already calculated at compile time. I'll make a test where the compiler can't optimize that
Main.pas.81: aSet:=[e1,e3];

00443325 A090334400       mov al,[$00443390]

0044332A A24C584400       mov [$0044584c],al

Main.pas.82: if e3 IN aSet

0044332F F6054C58440004   test byte ptr [aSet],$04

00443336 7437             jz TForm1.btn1Click + $63

Main.pas.83: Then ShowMessage(Format('%d %x',[sizeof(aSet),Byte(aSet)]));

...

Open in new window

0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 58

Expert Comment

by:cyberkiwi
Comment Utility
If I may interject, you probably want to construct your tests against char in string specifically.
Because using an enum, it is already optimized into known types, whereas for string op (ascii, 256 possible chars) against set, I don't think it is possible in 4.
0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
Type
  TMyEnum=(e1,e2,e3,e4,e5,e6,e7,e8,e9,e10);
  TMySet=Set of TMyEnum;

function Test(eVal:TMyEnum;aSet:TMySet):Boolean;
begin
 Result:=eVal IN aSet;
end;

Ok, now that is yet another operator : BT (Bit Test), introduced with 386 that does exactly what I said but in one instruction. Of course, since it is quite frequently used. It works with the base address as first operand, and the bit index in second (but global bit index, not index in the indexed Byte). Internally, you can bet hard currency that this instruction is cabled to perform exactly the same calculation. But I won't go that far :o)
Note the first 2 instructions before the BT test, to ensure that the enum value is within the scope of the set (10 values), and to normalize the whole 32 bit register EAX to this 8 bit index. They just want there to be on the safe side, because they don't know really what you call this function with.

Bottom Line : ALWAYS use sets when you have a maximum of 256 elements. All operations are blazingly fast. For bigger sets, you can use the TBits type that can hold a large number of bits dynamically, and read/rite them based on index. But you'll have to implement + (OR) and - (AND) operators equivalent, as it is not done (checked on Delphi 5, maybe it has been done after, but that is probably not used enough)
Main.pas.80: begin

0044330C 51               push ecx

0044330D 66891424         mov [esp],dx

Main.pas.81: Result:=eVal IN aSet;

00443311 3C0F             cmp al,$0f

00443313 7707             jnbe Test + $10

00443315 83E07F           and eax,$7f

00443318 0FA30424         bt [esp],eax

0044331C 0F92C0           setb al

Main.pas.82: end;

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
a set of char is no different than a set of an enum type having 256 possible values. Or a set of Byte for that matter.

Type
  TCharSet=Set of Char;
Var
 aCharSet:TCharSet;

function Test2(eVal:Char;Const aSet:TCharSet):Boolean;
begin
 Result:=eVal IN aSet;
end;

procedure TForm1.btn1Click(Sender: TObject);
begin
 aCharSet:=['A'..'Z'];
 if Test2('A',aCharSet)
  Then ShowMessage(Format('Size=%d ''@''..''_''=%x',[sizeof(aCharSet),PInt(Integer(@aCharSet)+8)^]));
end;

// note just the Const I added before the charset parameter, to avoid the copying of the whole 32 bytes in the stack. The following is then the plain ASM function, with all stack frame management and testings, which you can see are reduced to the bare minimum.
Main.pas.88: Result:=eVal IN aSet;

0044330C 25FF000000       and eax,$000000ff

00443311 0FA302           bt [edx],eax

00443314 0F92C0           setb al

Main.pas.89: end;

00443317 C3               ret 

Open in new window

0
 
LVL 58

Expert Comment

by:cyberkiwi
Comment Utility
@epasquier
In your code, 'A' is a constant and has been optimized.
See listing below, and the CPU code.
CPU

Result:=eVal IN aSet;

mov eax,[ebp-$08]

mov dl[ebp-$01]

and edx,$000000ff

bt [eax],edx

setb al

mov [ebp-$09],al

mov al,[ebp-$09]

end;



Delphi



{$APPTYPE CONSOLE}

program Project2;



Type

  TCharSet=Set of Char;



function Test2(eVal:Char;Const aSet:TCharSet):Boolean;

begin

 Result:=eVal IN aSet;

end;



Var

 aCharSet:TCharSet;

 s: string;

 i: integer;

 c: char;

begin

 aCharSet:=['a'..'m']; // 1st half lowercase alphabet

 ReadLn(s);// := 'This is a test string';

 for i := 1 to Length(s) do

 begin

  c := s[i];

  if Test2(c,aCharSet) Then Write(c);

 end;

end.

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
eVal parameter does not matter if it is from a constant 'A'  or read from a file as far as the function is concerned.

Out of curiosity I compiled and run your app from Delphi 5 and got the same result as my previous example.
I can't explain why you don't have the same ASM code as me, except that we don't use the same Delphi, but still if Delphi 5 is capable of producing a perfect ASM code, why yours (that I suspect is newer than Delphi 5, isn't it ?) is fiddling with the stack instead of directly use the registers ? Maybe an unoptimized compiler option.
Look especially the last 2, they are obviously useless
mov [ebp-$09],al
mov al,[ebp-$09]
same as :
i:=j;
j:=i;
...........

but whatever, even with your code it's obvious that the IN operator is really smartly implemented, and whatever it is that added some strange code around would not when you use the IN operator directly in the code of the functions that need it instead of having a function for it.
0
 

Author Comment

by:SheratonGroup
Comment Utility
SO much useful information here.  I really enjoyed reading the discussion.  I actually used information from the first two posts to create a function.  So I am splitting points.

On a side note I am basically trying to validate e-mail addresses that may have umlauts/diactrics/n-tilde's/etc.  Basically e-mails with "special" letters greater than the 128.  

I am really only interested in chars <ascii 255.  I know it's not perfect, but it's good enough for our needs.  I am using Delphi 6, so unicode support is not automatic.

I would welcome comments on the code below.
function TForm1.validateEmail(S : String) : boolean;

var

  i : integer;

  localPart, domainRoot, TLD, strAfterAt : String;

  atPos, TLDPos : integer;



  LPStdCharsToAllow : TCharsToAllow;   //Local Part Chars to allow

  LPSpecCharsToAllow : TCharsToAllow;  //Local Part Chars >128 to allow (diacritics, umlauts, etc.)



  DRStdCharsToAllow : TCharsToAllow;   //Domain Root Chars to allow

  DRSpecCharsToAllow : TCharsToAllow;  //Domain Root Chars >128 to allow.



  TLDCharsToAllow  : TCharsToAllow; //Top Level Domain Chars to allow

  TLDIDNCharsToAllow : TCharsToAllow;  //International Top Level Domain Chars to allow, no chars over 128, but internationalized TLD's ok.



  InternationalTLD : Boolean;

  InternationalTLDCharStart : integer;

begin



InternationalTLD := FALSE;



LPStdCharsToAllow := [#33, #35..#39, #42..#43, #45..#57, #61, #63, #65..#90, #94..#96, #97..#126];

LPSpecCharsToAllow := [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156, #158..#159, #161..#165, #167, #181, #192..#246, #248..#255];



DRStdCharsToAllow := [#45..#46, #48..#57, #65..#90, #97..#122];

DRSpecCharsToAllow := [#128, #131, #134..#135, #138, #140, #142, #152, #154, #156, #158..#159, #161..#165, #167, #181, #192..#246, #248..#255];



TLDCharsToAllow := [#65..#90, #97..#122];       //non-international top level domains

TLDIDNCharsToAllow := [#48..#57, #65..#90, #97..#122];  //for international domains (just the alphanumeric part after the "XN--")



    result := FALSE;

    S := trim(S);



    if ((length(S) < 7) AND (length(S) > 254)) then  //checking overall length of address

      exit;



    atPos := pos('@', S);  //find the @ symbol and split the string.



    if atPos = 0 then

      exit;  // no @ in email address.



    localPart := copy(S, 1, atPos-1);



    if length(localPart) > 64 then

      exit;  //localPart must be less than 64 chars.



    if length(localPart) > 0 then

      strAfterAt := copy(S, atPos+1, length(S)-atPos)

    else

      exit;  //zero length email prefix



    if pos('@', strAfterAt) > 0 then  //Ensure that there are no other @ symbols

      exit;



//Parsing local-part...

    if (localPart[1] = '.') OR (localPart[length(localPart)] = '.') then    //a period cannot begin or end the local-part

      exit;



    if ansiContainsText(S, '..') then        //ensure that 2 or more periods do not occur sequentially anywhere

      exit;



    //valid chars allowed in the local-part of an email address        all upper/lowercase letters and numbers

    //decimal values                                                   [48-57], [65-90], [97-122]



    //valid special chars allowed in the email prefix w/out quotes...  !  #  $  %  &  '  *  +  -  .  /  =  ?  ^  _  `  {   |   }   ~

    //decimal values for spec chars to allow...                        33 35 36 37 38 39 42 43 45 46 47 61 63 94 95 96 123 124 125 126



    for i := 1 to length(localPart) do

    begin

       if NOT (localPart[i] in LPStdCharsToAllow) then   

         if NOT (localPart[i] in LPSpecCharsToAllow) then   

           exit;  //invalid chars found in unquoted local-part



    end;



               //TODO LATER... maybe... check if necessary.

               //corner case not coded for.

               //according to RFC2822 nearly any char can be used in an email address if they're within double quotes

               //it is fairly widely accepted practice not to allow this, because of security risks, eg, embedding malicious

               //commands within an email address.



    TLDPos := 0;

    for i := length(strAfterAt) downto 1 do      //retrieving the domain root and the TLD in one step.

    begin

      if strAfterAt[i] = '.' then

      begin

        TLDPos := i;

        break;

      end;

    end;



    if TLDPos = 0 then   

      exit;



    domainRoot := copy(StrAfterAt, 1, TLDPos-1);     // subtracting one to exclude the period.

    TLD := copy(strAfterAt, TLDPos+1, length(strAfterAt) - TLDPos);



//Parsing domainRoot

    if length(domainRoot) = 0 then

      exit;  //ensuring that the domain root is not blank



    if (domainRoot[1] in ['.','-']) OR (domainRoot[length(domainRoot)] in ['.','-']) then

      exit;  //not allowed to have these chars as the first or last char in domainRoot



    for i := 1 to length(domainRoot) do

    begin

      if NOT (domainRoot[i] in DRStdCharsToAllow) then   

         if NOT (domainRoot[i] in DRSpecCharsToAllow) then   

           exit;  //invalid chars found in domainRoot

    end;



//Parsing Top Level Domain

    if length(TLD) < 2 then

      exit;  //Top Level Domains have a minimum of two chars, as of 8/23/10



    if (ansiStartsText('XN--', TLD) AND (length(TLD) > 4)) then  //International TLD (starts with 'XN--' prefix)

    begin

      for i := 5 to length(TLD) do

        if NOT (TLD[i] in TLDIDNCharsToAllow) then

          exit;

    end

    else

    begin

      for i := 1 to length(TLD) do

        if NOT (TLD[i] in TLDCharsToAllow) then

          exit;

    end;



    showmessage(localPart + #10#13 + domainRoot + #10#13 + TLD);

result := TRUE;

{

The total number of chars in the TLD are at least 2

See http://data.iana.org/TLD/tlds-alpha-by-domain.txt for an up to date list of valid TLD extensions.  Since this changes it is not a good idea to hard code against this list.  So, simply ensuring that there are at least 2 valid chars in the extension.  And that the dashes appear in the right places.  Side Cases… not coded for.



It is possible pass the domain part of an email in brackets (eg, [129.55.222.88]

It is considered valid to allow quoted emails with any character at all in between.

See http://www.regular-expressions.info/email.html

}



end;

Open in new window

0
 

Author Comment

by:SheratonGroup
Comment Utility
PS - One advantage of using a set and not a string is that it allows for using ranges of chars, intermixed with individual chars.  I see now that my original example didn't have any wide ranging chars, but this was necessary.    

thanks again.
0
 
LVL 25

Expert Comment

by:epasquier
Comment Utility
range in sets is only useful for the coder : less characters to type. They actually are not stored differently than if you explicitly put all your chars in your set
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

The uses clause is one of those things that just tends to grow and grow. Most of the time this is in the main form, as it's from this form that all others are called. If you have a big application (including many forms), the uses clause in the in…
Introduction The parallel port is a very commonly known port, it was widely used to connect a printer to the PC, if you look at the back of your computer, for those who don't have newer computers, there will be a port with 25 pins and a small print…
THe viewer will learn how to use NetBeans IDE 8.0 for Windows to perform CRUD operations on a MySql database.
The viewer will learn how to synchronize PHP projects with a remote server in NetBeans IDE 8.0 for Windows.

728 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now