Link to home
Start Free TrialLog in
Avatar of LenWinSonSoft
LenWinSonSoft

asked on

Compare regular expressions!

Hello,

i have a big problem... In fact there is no way around...

Imagine you have two regular expressions "ab?yz" and maybe "?cd?". The "?" means any string. It doesnt matter which kind of regular expression i have to use!
But now the problem. How can I check weather this two expression can describe the same string, for example "abcdyz" would match both.
Dont hang on this two expressions... I need an algorithm for all thinkable expressions. Its enough to have the "?" for any string without backslashes and the "*" for any string, but also the backslashs.

So "?:\dir\*\test.txt" can be "C:\dir\private\test.txt" or "C:\dir\private\subdir\test.txt"...
"?:\dir\?\test.txt" can only be "C:\dir\private\test.txt" but not "C:\dir\private\subdir\test.txt"...

The expressions "?:\dir\*\test.txt" can be matched by the same string as "?:\dir\?\test.txt".
But "?:\dir\*\test.txt" cannot be matched by "?:\dir\*\test.exe" for example!

How can i realize this??? The next problem is, that this algorithm should process at least 10000 comparsions per secound! But thats not so important.

I am using Visual Studio 2003 and C++...

Thanks for advance!
Mfg Martin
Avatar of jkr
jkr
Flag of Germany image

This should be easier than it seems at the 1st glance. After thinking about it, I'd start by breaking the expressions down to substrings delimited by the wildcards. If these substrings don't match, the whole expressions won't either.
Avatar of LenWinSonSoft
LenWinSonSoft

ASKER

Yes I was on that way too... I generated blocks seperated by wildcards... But the problem is that your idea does'nt work by complex strings...
Imagine of this two expressions:

"?:\dir1\d?2\d?1\d?2\dir3"
"?:\*\dir1\*2\dir3"

They match... but nothing with my first idea... My algorithm is so complex ^^... I am able to compare this separted blocks in any case... But now the problem is in which order i should compare this blocks... there are so many different cases and circumstances... I work since two days and nearly 30 hours on this problem... I think i will get it but i thought there would be a solution in the internet...
What exactly are you trying to finnd?  

What's the input, exactly.  What's the output?   Exactly?

Especially the output, do you just need a yes or No, whether there is at least one string they match in common?
Or do you need a reduced list of patterns that they maych in common?

What are the rules for  the regular expressions?  Exactly?   Can the ? and * match the null string?
The problem with your and mine idea is that this jokers are able to skip blocks and so on... Blocks can ssem to match but later in the string they dont match... then you have to search from beginning to the next matching block and so on... Its crazy at all...
Input:

Expression1: "?:\dir1\d?2\d?1\d?2\dir3"
Expression2: "?:\*\dir1\*2\dir3"

Output:

True, means that this expressions can be matched by the same string.

--------------------------------

Input:

Expression1: "?:\dir1\d?2\d?1\d?2\dir4"
Expression2: "?:\*\dir1\*2\dir3"

Output:

False, means that this expressions cannt be matched by the same string.


"*" can be no character or all characters with undefined length... Also "\" can be in this string!
"?" can be at least one character or all characters with undefined length... "\" is forbidden in the string

But the exact specifications are changable... Thats only what i thought... if there would be a solution with another specification it would also be ok...
Regular expressions are much mightier than wildcard expressions.

If you need a wildcard comparision only I can give you working source code (it's part of my string class).

Regards, Alex
1. Why are you writing your own notation for regular expressions when there is a standard?
2. You have to define precisely what you mean by match. The following two expressions could match the empty string, but that's the only thing they agree in. Is that a match?

/a?b?c?/
/[0-9]*/

This one seems to be what you mean, where one or more values can match both patterns:

/abcdef/
/a*b*c*d*e*fg?/

3. Why do you care? What is the end goal. perhaps this can be approached in a  totally different way.

algorithm:
I am expressing this algorithm in terms of standard RE notation from perl, grep, etc.
For each character progressing through the patterns, pick the most restrictive rule. Example:

/./  /a/ => /a/
/[0-9]?/ /[\da-z]/ ==> /[0-9]

then, you have to manage overlaps. Because of * and +, a rule may overlap another:

/[0-9]+[aeiou]*/   /5[a-z]/ => /5[aeiou]/

advance one character at a time, determining which rule is the most specific. Whenever something is optional, you will have to branch and try both with it, and without it.

/a[0-9]*[a-z]*5/    /[a-z][39]*5*/ ==>  a5
itsmeandnobodyelse :

Yes it would be nice if you could send me this source code... What does it do in detail???

>Why do you care? What is the end goal. perhaps this can be approached in a  totally different way.

no it cannt... The end goal is an application... I cannt describe anything of it because it will be a total new kind and so it is secret...

>Why are you writing your own notation for regular expressions when there is a standard?

because i dont need the other stuff... So it only slow down the processing speed... Also there is no such algorithm for regular expressions...
The easy end-user should be able to use it and you can try to explain a normal human being how to describe a path in regular expressions. Have fun...

> empty string, but that's the only thing they agree in. Is that a match?

No... A match must be unequal to Null... Also with my wildcards there is no matching empty string except for a plain "*"...


Thanks all... Of couse you can post more solutions! I will take the best one...

itsmeandnobodyelse : I hope your code will do my task... If the relevant code is too large for posting i can give you a email-addy
dbkruger : This is for configuration validation... It must check specific wildcard expressions for conflicts with other configuration data... There is no way around...
>>>> tsmeandnobodyelse : I hope your code will do my task..

I do think so. It makes a wildcard compare where ? is the wildcard for a single char and * a wildcard for any string.

I'll give you my original source that is running in various projects (some of them in a 365 day and 24 hours). If you need help transferring it to a non-member function, tell me.

Note, Bool, Int, and Char are typedefs of bool, int, char. String is my String class. 'str' is the wildcard string. 'getFilled' is the length of the String object in bytes. (*this)[i1] is the char at position i1., mid(i1) is a substring beginning at position i1 (til end of string). 'replace' replaces a string sequence by another. If the third argument is true it replaces all ocvcurrences of the sequence searched for.'True' and 'False' are equivalent to 'true' and 'false'.


Regards, Alex


// @mfunc compare with wildcard string
Bool String::compareWildcard(String str) const
{
    Int     i1   = 0;
    Int     i2   = 0;
    Int     len1 = getFilled();
    Int     len2 = str.getFilled();
    String  next;
    Char    c1, c2;

    while (i1 < len1 && i2 < len2)
    {
        c1 = (*this)[i1];
        c2 = str[i2];
        switch (c2)
        {
        case '*':
           
            while (++i2 < len2 && ((c2 = str[i2]) == '?' || c2 == '*'))
            {
                if (c2 == '*')
                    continue;
                if (++i1 == len1)
                    return False;
                c1 = (*this)[i1];
            }
            if (i2 >= len2)
                return True;

            while ((i1 = find(c2, i1)) >= 0)
            {
                next = mid(i1);
                if (next.compareWildcard(str.mid(i2)))
                    return True;
                i1++;
            }
            return False;
        case '?':
            break;
        default:
            if (c1 != c2)
                return False;
        }
        if (++i1 == len1)
        {
            if (++i2 == len2)
                return True;
            next = str.mid(i2);
            next.replace("*", "", True);
            return next.isEmpty();
        }
        if (++i2 == len2)
            return False;
    }
    return True;
}
Maybe i dont see the clue but i think your code compares a string "abcdefg" with a wildcardstring....??? I am right????
ASKER CERTIFIED SOLUTION
Avatar of itsmeandnobodyelse
itsmeandnobodyelse
Flag of Germany image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
But my question is can it also compare "abc?fg*xyz" and "a?op*z"????
Here is my code... Yes I found it with your inspiration... Its not perfect at all but it works better than my non-recursive algorithms... I dont know why i tried 2 days to write a non-recursive algorithm when a recursive one is so short and simple...

bool Compare(TEXT Joker, TEXT Search, bool NodeJoker){
      wchar_t s, d;
      bool WasEqual = false;
      
      while(true){
            s = *Joker;
            d = *Search;

            if(s == '*')
                  return Compare(Joker + 1, Search, false);

            if(s == '?')
                  return Compare(Joker + 1, Search, true);

            if(d == '*')
                  return Compare(Search + 1, Joker, false);

            if(d == '?')
                  return Compare(Search + 1, Joker, true);

            if(s == d){
                  WasEqual = true;
                  Joker++;

                  if(s == 0)
                        return true;

            }else if(WasEqual)
                  return false;

            if((d == '\\') && (s != '\\') && NodeJoker)
                  return false;

            if(d == 0)
                  return false;

            Search++;
      }

      return true;
}

bool CompareJokerPaths(TEXT Src, TEXT Dst){      
      
      if((*Src == '?') || (*Src == '*'))
            return Compare(Src, Dst, (*Src == '?'));
      
      if((*Dst == '?') || (*Dst == '*'))
            return Compare(Dst, Src, (*Dst == '?'));

      return Compare(Src, Dst, false);
}


Very interesting how simple it is...
>>>>  But my question is can it also compare "abc?fg*xyz" and "a?op*z"????

   string s = "abcdefg";
   if (compareWildcard(s, "abc?fg*") && compareWildcard(s, "a*f?"))
      cout << "the string " << s << " matches both to " << "abc?fg*" << " and " << "a*f?" << endl;

Note, if you have two different wildcard strings these never are equivalent (beside of trivial differences like between "abc*" and "abc***"). The set of strings matching to real different wildcard strings are always different. So, you need to compare a concrete string with both wildcard strings to see whether the wildcard strings are equivalent in respect to that single string value.

Regards, Alex
No... You missunderstood my problem... I dont want to compare wildcard strings for equalence...
I want to check weather there is a string (without wildcards) that matches two wildcard strings...
Or in other words... weather there exists any possible string where Match(Expression1, String) and Match(Expression2, String) are true...
But in fact I solved my problem with your inspiration of recursive comparsion...
With your helped i saved days of work and get at least 100x more performance...

Thanks for advance...
>>>>  But my question is can it also compare "abc?fg*xyz" and "a?op*z"????

Ok, I got it. You want to use wildcards at both sides. I don't know exactly but I think the problem isn't well defined, what means that you got matches even if no character is equal.

Look at

  abc?*ab*ab*??aab  and  *xy*

These match cause the first * at the right maps to 'abc?', 'xy' maps to the first * of the left and the rest of the left side to the second * of the right,

Regards, Alex
Yes thats true...

But *[String]* isnt the general form...

You have to remember i like to use this with paths:

For example:

"?:\Documents and Settings\?\My Files\*.doc"
"?:\Documents and Settings\*Private\*.doc"

Thats the problem...

My function above should solve this problem... I hope so ^^

You also have to remember that the "?" stands for a directory (without "\\") and the "*" for a path (with "\\")... So there are many, many variants for two expressions without skipping the whole string with a wildcard...

Thx
>>>> weather there exists any possible string where Match(Expression1, String) and Match(Expression2, String) are true...

Ok, that's a new request ;-)

You need to consider that a * matches to an empty string as well... hence you should request that there is a non-empty string that matches both to ex1 and ex2.

In my sample above

   abcdababxyaab

would match to both.

But "???abc???" and "x*yzzz" don't match to any string. I wonder whether your func will check that.

Did you recognize the inner while loop in my compare function? That while is necessary cause in case of a * wildcard the next matching part might be found after some partial matches:

   abxyxyxyxyxyzxy   ab*xy?xy

In that example we find xy at the right side what is a match but it is not the one we can take. Nor the next 'xy' nor the next after next. The 4th 'xy' is the right one cause there is a single char 'z' that matches to ? and a final 'xy'.

Regards, Alex
 
>>>> My function above should solve this problem... I hope so ^^

Good luck.

>>>> "?:\Documents and Settings\?\My Files\*.doc"
>>>> "?:\Documents and Settings\*Private\*.doc"

I hope, you don't have to deal with

  \\MY_WORKSTATION\C$\Docume~1\Private/*.doc

Regards, Alex

Have you thought about the situation where two path names describe the same file, but look very different?

For example:


C:\dir1\dir2\..\file.a      and     C:\dir1\file.a                                are the same file.

C:\Programs\goo.exe         and   C:\PROGRA~1\goo.exe       are also the same file   (on a Windows FAT file system)

C:\Foo.exe    and  C:\FOO.EXE                                           are the same file (on most Windows file systems, but not Unix)

C:\FOO.exe    and  D:\FOO.exe     are the same file  (if the old DOS "subst D: C:" command is in effect )

C:\FOO.exe    and   G:\Mount_Point_For_C\FOO.exe  are the same file (on a NTFS file system)

foo       and     goo       are the same file  ( if goo is a Unix link)

\\NUTTY_Server\Almonds      and     f:\Amonds     are the same file, just different file name styles.


>I hope, you don't have to deal with
>\\MY_WORKSTATION\C$\Docume~1\Private/*.doc
>Have you thought about the situation where two path names describe the same file, but look very different?

Yes, I thought about anything... Thats a little problem... but not a big one, because we define strictly how a path has to look like and so we can analyze it... if it does not look like we have defined it will be rejected...

If i am ready with all that configuration stuff, I will write many routines for canonical path names...
The good aspect ist, that we get the most pathnames from the system... User path names will be checked by regular expressions for validility and some system routines to canonicalize them.

My routine above does work with some modifications... I am on the way to write it in assembler... It checks nearly 10.000.000 Jokerstrings per secound with hand optimizations...

I also appended the "+" as only one Joker char...

Thanks for advance...
Here i have some examples with match results:

true: "+:\\Dok\\*\\Secure\\test.?" with "C:\\Dok\\Win\\Win\\Sec?\\test.exe"
true: "+:\\Dok\\?\\Eigene\\test.?" with "C:\\Dok\\User\\Eig?\\test.exe"
false: "+:\\Dok\\?\\Eigene\\test.?" with "CB:\\Dok\\User\\Eig?\\test.exe"
false: "?:\\Dok\\?\\Eigene\\test.?" with "C:\\?k\\User\\Eig\\test.exe"
true: "?:\\Dok\\?\\Eigene\\Privat\\test.?" with "C:\\Dok\\User\\Eig*\\test.exe"
false: "?:\\Dok\\?\\Eigene\\Privat\\test.?" with "C:\\Dok\\User\\Eig?\\test.exe"
true: "?:\\Dok\\User1\\Dok\\User2\\Eigene\\test.?" with "C:\\*Dok\\User2\\Eig?\\test.exe"
false: "Dok\\User2Eigene\\test.exe" with "*Dok\\User2\\Eig?\\test.exe"
true: "?:\\Dok\\User2\\Dok\\User2\\Eigene\\test.?" with "C:\\*Dok\\User2\\Eig?\\test.exe"

All these matches are correctly detected by my routine...

Maybe you can post some more Strings... With some trickyness in it... so i can see weather my routine is ok...

Thanks for advance!