Link to home
Start Free TrialLog in
Avatar of darbid73
darbid73Flag for Germany

asked on

What would the best Regular Expression search expression be for this example.

I am looking for the most efficient way of finding matches in a string.

My attempt at a search expression is "\b[0-9]{4}[A-Z][0-9]{5}\b".  But I am not sure how to deal with all the alternatives.  I realize my search expression is for  2005Q99999.

So could someone help to get a search expression to match all these possible combinations.  It would be great if it could be done in one search expression but I am not sure if that is possible.

The possible combinations I am looking for are below.  
All numbers can be from 0-9 and all letters can be from A-Z.

 200599999
 2005Q99999
 2005Q99999DJ
 2005Q99999  TS  (two blank spaces are in there)
 2005Q99999  TS01 (two blank spaces are in there)
 2005Q99999DJTS
 2005Q99999DJTS01

05Q99999
05Q99999DJ
05Q99999  TS (two blank spaces are in there)
05Q99999  TS01 (two blank spaces are in there)
05Q99999DJTS
05Q99999DJTS01

*A little explanation:
Before the letter Q in my example there can only be 4 numbers or 2 numbers.
After the letter Q there can only be 5 numbers.

After the 5 numbers there are a number of alternatives
2 letters
2 spaces and 2 letters
2 spaces and 2 letters and 2 numbers
4 letters
4 letters and 2 numbers

Thanks for all your help people.
Avatar of mark_harris231
mark_harris231
Flag of United States of America image

There may be more elegant ways, but this works for me:

^(?>\d{2,4}Q?\d{5})(?=\n|\r| |[A-Z]) {0,2}[A-Z]{0,4}\d{0,2}$

To test, I also made sure the following values did NOT work:

2005999991   (extra digit with no letters/spaces present)
2005Q99999   TS (extra space between '9' and 'TS')
20505Q99999  TS01 (extra digit in rows with letters/spaces present)
05Q99999  TSTSTS01 (extra letters)
Avatar of darbid73

ASKER

Thank you Mark,

Looks great.

I cannot test right now.  Ill get back to you tomorrow.
just noticed that you were using word boundaries (\b) in your example.  What language/platform are you trying to implement this in?
It will be on Win 7 using vb.net.

The strings will come from as many file types as possible, but mainly Office documents and pdfs.

The actual cultural language of the string could be any type.
I actually have jumped on a computer at home to test this.  I changed the anchors to \b at the beginning and end and I get a pretty good result.

I am not sure what the difference is in these anchors between \b and ^   $

Do I need to know?

Also there is one unexpected good thing from your suggested expression
"2005Q99999 TS01" also is matched.  This is where there is only 1 space.  I like and think we should keep this please.

One unexpected thing that would be nice if we could fix (although I can check them after matching is the following.

"05Q99999 beer" is matched.  The text I am parsing will be sentances so it is possible that it has a string I want to match followed by a 1/2/3 or 4 letter word.

Would it be possible for you to refine it a little --- if it is possible of course?

After the 5 numbers there can never be only 1 character (number of letter)
After the 5 numbers there can never be 3 characters.

Of course there is nothing we can do about 2 or 4 letter words.  I will have to check that afterwards.

I think that is all we can do.
In the version of regex I used, ^ signifies the beginning of a line and $ signifies the end.  This may not be appropriate in the context you're using.

My mistake concerning the 1-character/3-character letters.  I used 0-4 times parameter.  Limiting to exactly 2 or 4 should just require a couple of additional lookaheads.


Can you post any examples of how this data might show up in the files you're working with?  Nothing sensitive please.
These Patterns can Turn up in any Text. It could be in an Email, Word Doc and I will be also Parsing the text from Excel as well for example.
Ok...I'll work from that premise.  A little more challenging (for me) than I first thought, but I'll see what I can come up with.  It'll be a few hours before I can come back to it, however.
That's absolutely no problem. What you've got so far is pretty good.
Actually just hold off for a bit longer, I might be able to narrow down a pattern for the 4 letters after the 5 numbers.  I don't think you will be able to include it in there but you might.

I'll get back to you on this in about 8 hours.
Modifying mark_harris231's pattern, this appears to fit the desired behaviour:
(?<![\dA-Z])\d\d(?:\d\d)?Q?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{1,4}|[A-Z]{2}\d\d)|[A-Z]{4}\d\d)?(?![A-Z\d])
Thank you I will give it a test.

I don't know if you want to go further with the expression or not.  If you do here is something I think can be added to the assumptions.

These additions deal with the letters and numbers after the five numbers in the examples after the 99999.  eg after this pattern 2005Q99999.

If there are 4 letters then the first two letters can only be "WO" or "WE" or "EP"

eg 2005Q99999WOTS 2005Q99999WETS 2005Q99999EPTS

If after the 2 or 4 letters there are numbers then there are always 2 numbers and they are never higher than 15.  Thus they range from 01, 02, 03 etc to 15 only.

eg 2005Q99999 TS01 2005Q99999WETS15 2005Q99999 EP06
SOLUTION
Avatar of mark_harris231
mark_harris231
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Your initial samples indicated that " 2005Q99999DJTS01" was a valid match, but your last post suggests that it isn't because the 4-letter character string ("DJTS") doesn't start with "WO", "WE" or "EP".  Can you clarify?
You are correct and that is my fault.  Originally I was not sure I could refine these letters to anything too specific.  

If there are 4 letters after the 5 numbers then the first two letters can only start with "WO", "WE" or "EP".
I have tested what you have already and done and i cannot get it to miss one.

I see however I have stuffed up in explaining again.  The "Q" is not a "Q" it can be any capital letter.

This is working but just want to check - I have just replaces the Q with [A-Z]

(?<![\dA-Z])\d\d(?:\d\d)?[A-Z]?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{1,4}|[A-Z]{2}0[0-9]|[A-Z]{2}1[0-5])|[A-Z]{4}0[0-9]|[A-Z]{4}1[0-5])?(?![A-Z\d])
Correct concerning the "Q".  Substituting the [A-Z] should take care of that.

Still testing the WO WE EP formulation, but if this current formula is working as needed, you may be able to begin using as-is, then refine for the additional constraints if we get that sorted out.
Another point of clarification - in all the examples, any 4-character occurrences immediately follow the 5-digit string.  When there are spaces, there is only ever a maximum of a 2-character string (with or without trailing digits).  Is this accurate?

Or is the following a possible valid match:  "2005Q99999  WETS15"
Here is my current test text and the result i am getting.

4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
1973P69691 DEW
1973P69691 DEW
1973P69691 DEW
1973P69691 DEW

The pattern in there is "1973P69691 DE".

I hope we can not return "1973P69691 DEW" but just "1973P69691 DE" or should I do this in the post processing check?

You can see the other numbers that are being returned.
randomtext.txt
Another point of clarification - in all the examples, any 4-character occurrences immediately follow the 5-digit string.  When there are spaces, there is only ever a maximum of a 2-character string (with or without trailing digits).  Is this accurate?

Or is the following a possible valid match:  "2005Q99999  WETS15"

This is a hard one to call on.  I have to allow for human error as I will be reading text from documents and emails.  STRICTLY speaking without human error what you types is not possible.  Strictly speaking it is "2005Q99999WETS15" or "2005Q99999  TS15" or "2005Q99999WE"

But then we add the human factor in there.  This is why i said I was happy with 1 or 2 spaces.  It was also why I understood that it returned results where there were 2 letters or 4 letters after the 5 numbers.
So it sounds like we should shoot for a slightly more relaxed intepretation, which you may have to parse through a bit more manually?

What are the consequences if a match is not made because of a strict interpretation of the ruleset?
I am on the side of relaxed simply because of human error.  Also I can then try and fix things in the post processing.

If the expression misses some I never have a hope of post processing them.
The pattern in there is "1973P69691 DE".

I hope we can not return "1973P69691 DEW" but just "1973P69691 DE" or should I do this in the post processing check?

You can see the other numbers that are being returned.

Not sure I'm completely understanding what you have here.  Are the results "good" except for the DEW?  (i.e., is "4969710      N" a desirable match or incorrect)?
Sorry I was thinking instead of typing.

The perfect result is 13 x 1973P69691 DE - or it might be 12 I forget.

1973P69691 DEW is "too much" it should be 1973P69691 DE.

All the results with 7 numbers and a letter are not wanted at all.
Got it.  Stay tuned...
This seems to work:

(?<![\dA-Z])\d\d(?:\d\d)?[A-Z]?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{2}|[A-Z]{4}|[A-Z]{2}0[0-9]|[A-Z]{2}1[0-5])|[A-Z]{4}0[0-9]|[A-Z]{4}1[0-5])(?!\d)

Open in new window

Nice work, Terry.  I learned quite a bit from your first pattern and I'll be studying this one as well.

In looking at some earlier test data, I did see some incomplete matching when the sample included 4-characters followed by 2-digits:

E.g., 05Q99999DJTS15  only matches the bolded text

I'll see if I can spot the issue/make a suggestion, but thought I'd make you aware.
Also, doesn't appear to address the WE|WO|EP restrictions on 4-character instances as per OPs earlier post: 4-character restriction
Thanks mark. The incomplete matching is to meet the requirement #a38317483, but it likely has room for improvement.

You're right that I didn't take the WE|WO|EP restriction into account; I'd overlooked it. However, this might start to cause trouble with the incomplete matching requirement though. For example, if we don't match a 4 letter suffix starting with WA then do we match just the WA because that meets a different requirement (where we match any 2 letters)?
Hi Mark and Terry I must say I am truly surprised at the time and effort you guys are putting into this.  Thank you so much.  I could not have achieved the level of result you guys have got to so far or are aiming at.  It is one of the rare times when 500 points just does not seem enough.

I am also really surprised at the complexity and length of the regular expression string that you are working on.  The comment by Terry above leads me to thinking that there is something else that you could add in there.  The last 2 letters when there are 4 are the 2 letter country codes.  For example  If you can and want to go this much effort and it is possible I could see if I could cut down this list as well. For example TD CHAD is never going to happen and I think there would be a few more as well.

I think also that I have now well and truly splintered the requirements so I will do a summary post.

Once again guys thank you so much.
The Specs in one place.  You guys decide what is pushing the friendship and is asking too much.

A standard or full pattern
2005Y48938WOFR01

Breaking the pattern up into sections.

before the Y e.g.. 2005
The Y
The Next 5 numbers e.g.. 48938
The Space
The 4 Letters eg. WOFR
The Last 2 Numbers e.g.. 01

before the Y e.g.. 2005
These numbers represent a year.  They should be a full 4 numbers but people of course cut it short to 2.  If you want you could restrict it to 1970 to 2069.  If it is 2 numbers I don't think you can do much.

The Y
This is just a letter of the alphabet

The Next 5 numbers e.g.. 48938
This can only be 5 numbers.  It must be 5 it cannot be less or more and this is 100% strict.

The Space
This should be there in certain circumstances but because we are relying on humans I am not sure we can do much about this one.  In any case here is what it should be.

If there are 4 letters after the five numbers = No space.
If there is "WO" or "WE" or "EP" after the five numbers = No Space
if there is a 2 letter country code = 2 spaces.  (but most put 1 space)

Once again I am really not sure you can make any assumptions here.  Maybe you can say that if there are 4 letters then there is no space.  But as soon as there are 2 it is up to the human if they put a space, put 1 space or put 2.

The 4 Letters eg. WOFR
This can be 2 or 4 letters.

IF 4
first two letters can only be "WO" or "WE" or "EP"
the second two are a 2 letter country code.  (Let me know if you want to implement a list)

IF 2
If it is two then it can be either the first 2 or the second 2.

1 letter or 3 letters or 5 letters are never valid.


The Last 2 Numbers e.g.. 01
These can be a number between 01 - 15.
It is always 2 numbers if it is there.  Thus 5 alone is not valid it must be 05.


Then of course there is one exception to this and that is that a pattern can be shortened to
200548938.
It must be 4 plus 5 numbers.  i.e..  It must be 9 numbers.  Further the first 4 follow the The Y rules above except of course it must be 4 numbers and not 2.
darbid - thanks for pulling together and expanding upon all  the "rules" governing your request.  I regret that I won't be able to put much time to this for the next couple of days due to work and family committments, but if not resolved by then, I'll have a go at it over the weekend.
No stress I have not tested the most recent version anyway, so I will test it and give some feedback.
I've come up with this:

(?<![A-Z\d])(?:(?:19|20)?\d\d[A-Z]\d{5}(\s\s?(?!WO|WE|EP)[A-Z]{2}(?![A-Z]{2})|(?:WO|WE|EP)[A-Z]{2}|(?:WO|WE|EP|[A-Z]{2}))(0[1-9]|1[0-5])?|(?:19|20)\d{7})(?!\d)

Open in new window


With a list of country codes, the way the regex is structured could be simplified (though the length of it might still increase if the list is long)
Hi Terry and Mark,

I am just putting this last suggestion through its paces and I cannot get a reasonable error with it.  I have got two errors but they seem unlikely.  I see what you guys think.

2008L73829WEDE34
I get back 2008L73829WE for this one.  It seems that if number follow which are higher than 15 then the last 2 letters are not matched.  By the way it does not do it for letters.  if letters follow the DE in this case it works fine.

2008L73829 WEDE
This is not found.  Now this is unlikely but a human might just put a space in there.
>> 2008L73829WEDE34

Would you like this case to fail to match, or to match 2008L73829WEDE ? We can handle this by adding a negative lookahead to the pattern, but we'll need to be careful that we don't exclude valid values when we do it.

>> 2008L73829 WEDE

This was intentional; it was my interpretation of the requirement:
If there are 4 letters after the five numbers = No space.
If there is "WO" or "WE" or "EP" after the five numbers = No Space
if there is a 2 letter country code = 2 spaces.  (but most put 1 space)
Does it need changing?
>> 2008L73829WEDE34

I am fine with what it returns.


>> 2008L73829 WEDE
This is not good as it does not return anything.  At least it should return 2008L73829

This second one has lead me to test a simple patter of  "2008R73829" and the expression does not pick it up.  4 numbers a letter and 5 numbers is the fundamental pattern that always needs to be returned, even if something else is sacrificed.
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Love it!

I have tested it and think it works pretty well and is enough for this question.

Terry I really appreciate that you stuck with me, thank you.  I also hope you understand that it is sometimes not only the end result but also the journey that is important.  So I will award some points to Mark as well.

From the looks of it you two work on many of these kinds of questions so I am hoping you have a little competition maybe going.

Thanks again guys.
Thanks Guys
I definitely learned some interesting techniques through this exchange (kudos, Terry).

Pleased to have participated and glad you have a solution that meets your needs.