Solved

# What would the best Regular Expression  search expression  be for this example.

Posted on 2012-08-20
312 Views
I am looking for the most efficient way of finding matches in a string.

My attempt at a search expression is "\b[0-9]{4}[A-Z][0-9]{5}\b".  But I am not sure how to deal with all the alternatives.  I realize my search expression is for  2005Q99999.

So could someone help to get a search expression to match all these possible combinations.  It would be great if it could be done in one search expression but I am not sure if that is possible.

The possible combinations I am looking for are below.
All numbers can be from 0-9 and all letters can be from A-Z.

200599999
2005Q99999
2005Q99999DJ
2005Q99999  TS  (two blank spaces are in there)
2005Q99999  TS01 (two blank spaces are in there)
2005Q99999DJTS
2005Q99999DJTS01

05Q99999
05Q99999DJ
05Q99999  TS (two blank spaces are in there)
05Q99999  TS01 (two blank spaces are in there)
05Q99999DJTS
05Q99999DJTS01

*A little explanation:
Before the letter Q in my example there can only be 4 numbers or 2 numbers.
After the letter Q there can only be 5 numbers.

After the 5 numbers there are a number of alternatives
2 letters
2 spaces and 2 letters
2 spaces and 2 letters and 2 numbers
4 letters
4 letters and 2 numbers

Thanks for all your help people.
0
Question by:darbid73

LVL 10

Expert Comment

There may be more elegant ways, but this works for me:

^(?>\d{2,4}Q?\d{5})(?=\n|\r| |[A-Z]) {0,2}[A-Z]{0,4}\d{0,2}\$

To test, I also made sure the following values did NOT work:

2005999991   (extra digit with no letters/spaces present)
2005Q99999   TS (extra space between '9' and 'TS')
20505Q99999  TS01 (extra digit in rows with letters/spaces present)
05Q99999  TSTSTS01 (extra letters)
0

LVL 19

Author Comment

Thank you Mark,

Looks great.

I cannot test right now.  Ill get back to you tomorrow.
0

LVL 10

Expert Comment

just noticed that you were using word boundaries (\b) in your example.  What language/platform are you trying to implement this in?
0

LVL 19

Author Comment

It will be on Win 7 using vb.net.

The strings will come from as many file types as possible, but mainly Office documents and pdfs.

The actual cultural language of the string could be any type.
0

LVL 19

Author Comment

I actually have jumped on a computer at home to test this.  I changed the anchors to \b at the beginning and end and I get a pretty good result.

I am not sure what the difference is in these anchors between \b and ^   \$

Do I need to know?

Also there is one unexpected good thing from your suggested expression
"2005Q99999 TS01" also is matched.  This is where there is only 1 space.  I like and think we should keep this please.

One unexpected thing that would be nice if we could fix (although I can check them after matching is the following.

"05Q99999 beer" is matched.  The text I am parsing will be sentances so it is possible that it has a string I want to match followed by a 1/2/3 or 4 letter word.

Would it be possible for you to refine it a little --- if it is possible of course?

After the 5 numbers there can never be only 1 character (number of letter)
After the 5 numbers there can never be 3 characters.

Of course there is nothing we can do about 2 or 4 letter words.  I will have to check that afterwards.

I think that is all we can do.
0

LVL 10

Expert Comment

In the version of regex I used, ^ signifies the beginning of a line and \$ signifies the end.  This may not be appropriate in the context you're using.

My mistake concerning the 1-character/3-character letters.  I used 0-4 times parameter.  Limiting to exactly 2 or 4 should just require a couple of additional lookaheads.

Can you post any examples of how this data might show up in the files you're working with?  Nothing sensitive please.
0

LVL 19

Author Comment

These Patterns can Turn up in any Text. It could be in an Email, Word Doc and I will be also Parsing the text from Excel as well for example.
0

LVL 10

Expert Comment

Ok...I'll work from that premise.  A little more challenging (for me) than I first thought, but I'll see what I can come up with.  It'll be a few hours before I can come back to it, however.
0

LVL 19

Author Comment

That's absolutely no problem. What you've got so far is pretty good.
0

LVL 19

Author Comment

Actually just hold off for a bit longer, I might be able to narrow down a pattern for the 4 letters after the 5 numbers.  I don't think you will be able to include it in there but you might.

I'll get back to you on this in about 8 hours.
0

LVL 34

Expert Comment

Modifying mark_harris231's pattern, this appears to fit the desired behaviour:
(?<![\dA-Z])\d\d(?:\d\d)?Q?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{1,4}|[A-Z]{2}\d\d)|[A-Z]{4}\d\d)?(?![A-Z\d])
0

LVL 19

Author Comment

Thank you I will give it a test.

I don't know if you want to go further with the expression or not.  If you do here is something I think can be added to the assumptions.

These additions deal with the letters and numbers after the five numbers in the examples after the 99999.  eg after this pattern 2005Q99999.

If there are 4 letters then the first two letters can only be "WO" or "WE" or "EP"

eg 2005Q99999WOTS 2005Q99999WETS 2005Q99999EPTS

If after the 2 or 4 letters there are numbers then there are always 2 numbers and they are never higher than 15.  Thus they range from 01, 02, 03 etc to 15 only.

eg 2005Q99999 TS01 2005Q99999WETS15 2005Q99999 EP06
0

LVL 10

Assisted Solution

Building on Terry's pattern, this addresses the numerical constraints (only 01-15 after 2-4 letters):

(?<![\dA-Z])\d\d(?:\d\d)?Q?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{1,4}|[A-Z]{2}0[0-9]|[A-Z]{2}1[0-5])|[A-Z]{4}0[0-9]|[A-Z]{4}1[0-5])?(?![A-Z\d])

Working on the WO/WE/EP constraints
0

LVL 10

Expert Comment

Your initial samples indicated that " 2005Q99999DJTS01" was a valid match, but your last post suggests that it isn't because the 4-letter character string ("DJTS") doesn't start with "WO", "WE" or "EP".  Can you clarify?
0

LVL 19

Author Comment

You are correct and that is my fault.  Originally I was not sure I could refine these letters to anything too specific.

If there are 4 letters after the 5 numbers then the first two letters can only start with "WO", "WE" or "EP".
0

LVL 19

Author Comment

I have tested what you have already and done and i cannot get it to miss one.

I see however I have stuffed up in explaining again.  The "Q" is not a "Q" it can be any capital letter.

This is working but just want to check - I have just replaces the Q with [A-Z]

(?<![\dA-Z])\d\d(?:\d\d)?[A-Z]?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{1,4}|[A-Z]{2}0[0-9]|[A-Z]{2}1[0-5])|[A-Z]{4}0[0-9]|[A-Z]{4}1[0-5])?(?![A-Z\d])
0

LVL 10

Expert Comment

Correct concerning the "Q".  Substituting the [A-Z] should take care of that.

Still testing the WO WE EP formulation, but if this current formula is working as needed, you may be able to begin using as-is, then refine for the additional constraints if we get that sorted out.
0

LVL 10

Expert Comment

Another point of clarification - in all the examples, any 4-character occurrences immediately follow the 5-digit string.  When there are spaces, there is only ever a maximum of a 2-character string (with or without trailing digits).  Is this accurate?

Or is the following a possible valid match:  "2005Q99999  WETS15"
0

LVL 19

Author Comment

Here is my current test text and the result i am getting.

4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
4261639      A
4969710      N
5191631      M
5218662      J
5224183      J
5225922      J
5261016      N
5274733      D
5343322      A
5355240      O
5361319      N
0531210            S
0532388            S
2681745            S
1973P69691 DE
1973P69691 DE
1973P69691 DEW
1973P69691 DEW
1973P69691 DEW
1973P69691 DEW

The pattern in there is "1973P69691 DE".

I hope we can not return "1973P69691 DEW" but just "1973P69691 DE" or should I do this in the post processing check?

You can see the other numbers that are being returned.
randomtext.txt
0

LVL 19

Author Comment

Another point of clarification - in all the examples, any 4-character occurrences immediately follow the 5-digit string.  When there are spaces, there is only ever a maximum of a 2-character string (with or without trailing digits).  Is this accurate?

Or is the following a possible valid match:  "2005Q99999  WETS15"

This is a hard one to call on.  I have to allow for human error as I will be reading text from documents and emails.  STRICTLY speaking without human error what you types is not possible.  Strictly speaking it is "2005Q99999WETS15" or "2005Q99999  TS15" or "2005Q99999WE"

But then we add the human factor in there.  This is why i said I was happy with 1 or 2 spaces.  It was also why I understood that it returned results where there were 2 letters or 4 letters after the 5 numbers.
0

LVL 10

Expert Comment

So it sounds like we should shoot for a slightly more relaxed intepretation, which you may have to parse through a bit more manually?

What are the consequences if a match is not made because of a strict interpretation of the ruleset?
0

LVL 19

Author Comment

I am on the side of relaxed simply because of human error.  Also I can then try and fix things in the post processing.

If the expression misses some I never have a hope of post processing them.
0

LVL 10

Expert Comment

The pattern in there is "1973P69691 DE".

I hope we can not return "1973P69691 DEW" but just "1973P69691 DE" or should I do this in the post processing check?

You can see the other numbers that are being returned.

Not sure I'm completely understanding what you have here.  Are the results "good" except for the DEW?  (i.e., is "4969710      N" a desirable match or incorrect)?
0

LVL 19

Author Comment

Sorry I was thinking instead of typing.

The perfect result is 13 x 1973P69691 DE - or it might be 12 I forget.

1973P69691 DEW is "too much" it should be 1973P69691 DE.

All the results with 7 numbers and a letter are not wanted at all.
0

LVL 10

Expert Comment

Got it.  Stay tuned...
0

LVL 34

Expert Comment

This seems to work:

``````(?<![\dA-Z])\d\d(?:\d\d)?[A-Z]?\d{5}(?:[A-Z]{2}(?:[A-Z]{2})?|\s\s?(?:[A-Z]{2}|[A-Z]{4}|[A-Z]{2}0[0-9]|[A-Z]{2}1[0-5])|[A-Z]{4}0[0-9]|[A-Z]{4}1[0-5])(?!\d)
``````
0

LVL 10

Expert Comment

Nice work, Terry.  I learned quite a bit from your first pattern and I'll be studying this one as well.

In looking at some earlier test data, I did see some incomplete matching when the sample included 4-characters followed by 2-digits:

E.g., 05Q99999DJTS15  only matches the bolded text

I'll see if I can spot the issue/make a suggestion, but thought I'd make you aware.
0

LVL 10

Expert Comment

Also, doesn't appear to address the WE|WO|EP restrictions on 4-character instances as per OPs earlier post: 4-character restriction
0

LVL 34

Expert Comment

Thanks mark. The incomplete matching is to meet the requirement #a38317483, but it likely has room for improvement.

You're right that I didn't take the WE|WO|EP restriction into account; I'd overlooked it. However, this might start to cause trouble with the incomplete matching requirement though. For example, if we don't match a 4 letter suffix starting with WA then do we match just the WA because that meets a different requirement (where we match any 2 letters)?
0

LVL 19

Author Comment

Hi Mark and Terry I must say I am truly surprised at the time and effort you guys are putting into this.  Thank you so much.  I could not have achieved the level of result you guys have got to so far or are aiming at.  It is one of the rare times when 500 points just does not seem enough.

I am also really surprised at the complexity and length of the regular expression string that you are working on.  The comment by Terry above leads me to thinking that there is something else that you could add in there.  The last 2 letters when there are 4 are the 2 letter country codes.  For example  If you can and want to go this much effort and it is possible I could see if I could cut down this list as well. For example TD CHAD is never going to happen and I think there would be a few more as well.

I think also that I have now well and truly splintered the requirements so I will do a summary post.

Once again guys thank you so much.
0

LVL 19

Author Comment

The Specs in one place.  You guys decide what is pushing the friendship and is asking too much.

A standard or full pattern
2005Y48938WOFR01

## Breaking the pattern up into sections.

before the Y e.g.. 2005
The Y
The Next 5 numbers e.g.. 48938
The Space
The 4 Letters eg. WOFR
The Last 2 Numbers e.g.. 01

before the Y e.g.. 2005
These numbers represent a year.  They should be a full 4 numbers but people of course cut it short to 2.  If you want you could restrict it to 1970 to 2069.  If it is 2 numbers I don't think you can do much.

The Y
This is just a letter of the alphabet

The Next 5 numbers e.g.. 48938
This can only be 5 numbers.  It must be 5 it cannot be less or more and this is 100% strict.

The Space
This should be there in certain circumstances but because we are relying on humans I am not sure we can do much about this one.  In any case here is what it should be.

If there are 4 letters after the five numbers = No space.
If there is "WO" or "WE" or "EP" after the five numbers = No Space
if there is a 2 letter country code = 2 spaces.  (but most put 1 space)

Once again I am really not sure you can make any assumptions here.  Maybe you can say that if there are 4 letters then there is no space.  But as soon as there are 2 it is up to the human if they put a space, put 1 space or put 2.

The 4 Letters eg. WOFR
This can be 2 or 4 letters.

IF 4
first two letters can only be "WO" or "WE" or "EP"
the second two are a 2 letter country code.  (Let me know if you want to implement a list)

IF 2
If it is two then it can be either the first 2 or the second 2.

1 letter or 3 letters or 5 letters are never valid.

The Last 2 Numbers e.g.. 01
These can be a number between 01 - 15.
It is always 2 numbers if it is there.  Thus 5 alone is not valid it must be 05.

Then of course there is one exception to this and that is that a pattern can be shortened to
200548938.
It must be 4 plus 5 numbers.  i.e..  It must be 9 numbers.  Further the first 4 follow the The Y rules above except of course it must be 4 numbers and not 2.
0

LVL 10

Expert Comment

darbid - thanks for pulling together and expanding upon all  the "rules" governing your request.  I regret that I won't be able to put much time to this for the next couple of days due to work and family committments, but if not resolved by then, I'll have a go at it over the weekend.
0

LVL 19

Author Comment

No stress I have not tested the most recent version anyway, so I will test it and give some feedback.
0

LVL 34

Expert Comment

I've come up with this:

``````(?<![A-Z\d])(?:(?:19|20)?\d\d[A-Z]\d{5}(\s\s?(?!WO|WE|EP)[A-Z]{2}(?![A-Z]{2})|(?:WO|WE|EP)[A-Z]{2}|(?:WO|WE|EP|[A-Z]{2}))(0[1-9]|1[0-5])?|(?:19|20)\d{7})(?!\d)
``````

With a list of country codes, the way the regex is structured could be simplified (though the length of it might still increase if the list is long)
0

LVL 19

Author Comment

Hi Terry and Mark,

I am just putting this last suggestion through its paces and I cannot get a reasonable error with it.  I have got two errors but they seem unlikely.  I see what you guys think.

2008L73829WEDE34
I get back 2008L73829WE for this one.  It seems that if number follow which are higher than 15 then the last 2 letters are not matched.  By the way it does not do it for letters.  if letters follow the DE in this case it works fine.

2008L73829 WEDE
This is not found.  Now this is unlikely but a human might just put a space in there.
0

LVL 34

Expert Comment

>> 2008L73829WEDE34

Would you like this case to fail to match, or to match 2008L73829WEDE ? We can handle this by adding a negative lookahead to the pattern, but we'll need to be careful that we don't exclude valid values when we do it.

>> 2008L73829 WEDE

This was intentional; it was my interpretation of the requirement:
If there are 4 letters after the five numbers = No space.
If there is "WO" or "WE" or "EP" after the five numbers = No Space
if there is a 2 letter country code = 2 spaces.  (but most put 1 space)
Does it need changing?
0

LVL 19

Author Comment

>> 2008L73829WEDE34

I am fine with what it returns.

>> 2008L73829 WEDE
This is not good as it does not return anything.  At least it should return 2008L73829

This second one has lead me to test a simple patter of  "2008R73829" and the expression does not pick it up.  4 numbers a letter and 5 numbers is the fundamental pattern that always needs to be returned, even if something else is sacrificed.
0

LVL 34

Accepted Solution

``````(?<![A-Z\d])(?:(?:19|20)?\d\d[A-Z]\d{5}((\s\s?(?!WO|WE|EP)[A-Z]{2}(?![A-Z]{2})|(?:WO|WE|EP)[A-Z]{2}|(?:WO|WE|EP|[A-Z]{2}))(0[1-9]|1[0-5])?)?|(?:19|20)\d{7})(?!\d)
``````
0

LVL 19

Author Comment

Love it!

I have tested it and think it works pretty well and is enough for this question.

Terry I really appreciate that you stuck with me, thank you.  I also hope you understand that it is sometimes not only the end result but also the journey that is important.  So I will award some points to Mark as well.

From the looks of it you two work on many of these kinds of questions so I am hoping you have a little competition maybe going.

Thanks again guys.
0

LVL 19

Author Closing Comment

Thanks Guys
0

LVL 10

Expert Comment

I definitely learned some interesting techniques through this exchange (kudos, Terry).

0

## Featured Post

### Suggested Solutions

by Batuhan Cetin Regular expression is a language that we use to edit a string or retrieve sub-strings that meets specific rules from a text. A regular expression can be applied to a set of string variables. There are many RegEx engines for uâ€¦
Do you hate spam? I do, and I am willing to bet you do as well. I often wonder, though, "if people hate spam so much, why do they still post their email addresses on the web?" I'm not talking about a plain-text posting here. I am referring to the faâ€¦
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will wâ€¦
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: Tâ€¦