What should my strategy be with regular expressions when there are multiple formats to match?

I am currently working in a .NET environment with its regular expression engine. I want to be able to match patent numbers. The issue is not only which regular expression to use but also how I should approach this, given that there are so many different formats. Just about every country has little differences or nuances, and then there is the human factor of people writing the numbers differently as well. Here is a small sample of the issue I am facing.

Examples:

**** European Patent  ****
Country Code: EP
Variable: 1-7 digits
Kind Codes: (A1, A2, A3, A4, A8, A9, B1, B2, B3, B8, B9, C8, C9)

But people write these in many different ways and I would like to catch them all if possible:

EP1606721 B1
EP 1606721 B1
EP1606721 B1
EP 1606721
EP 1,606,721
EP 1 606 721 A2

**** United States Patents ****
Country Code: US
Variable: 1-7 digits
Kind Codes: (A, B1, B2, B3, C1, C2, C3)
But people write these in many different ways and I would like to catch them all if possible:

5864868
5,864,868
US5,864,868
US5864868
US5864868 B1
US5864868B1
US 5864868


****     DE Patents     ****
1995-2003      
Country Code: DE      
Classification Code: Fixed: 1 digit*      
Filing Year: 2 digits      
Number: Fixed: 5 digits      
Kind Codes: (B3, B4, B8, B9, C1, C3, C5, C8, D1, D2, T0, T2, T3, T5, T8, T9, I2)

2004-Present      
Country Code: DE      
Classification Code: Fixed: 2 digits**      
Filing Year: 4 digits      
Number Fixed: 6 digits
Kind Codes: (B3, B4, B8, B9, C5, D1, D2, T2, T3, T5, T8, T9, I2)
Check Code: The last .X is a check digit and can be a number between 1-9.

1995-2003      
DE 198 43 316 A1
DE 19843316A1
DE19843316A1
DE198,43,316 A1

2004 to Present
DE502004002300
DE502004002300.1
DE 50 2004 002 300
DE 50,2004,002,300

These are only some examples; I would eventually like to add other countries as well. My goal is to be able to pass any amount of text to a method that uses a regular expression engine and returns a list of the patent numbers found.

How should my regular expression look?  Should I try to have one regular expression or one for each country?

I think I will make one assumption to help with the matching: the patent number must start with a country code, for example EP, US, or DE.
darbid73 Asked:

Terry Woods (IT Guru) Commented:
One (or possibly more than one in some cases) for each country would seem sensible, although if you structure your code well, you could potentially make it clear which patterns apply for each country and then (through code) combine them all into a single pattern.

Anyway, here's a pattern for Europe that matches your specification (and examples):
EP ?(\d[, ]?){1,7}(\s+[ABC][89]|[AB][123]|A4)?

https://regex101.com/r/eV5bQ8/1
darbid73 (Author) Commented:
Thanks Terry for the quick reply. Could you please elaborate (just air code, nothing definite) on what you mean by:

although if you structure your code well, you could potentially make it clear which patterns apply for each country and then (through code) combine them all into a single pattern

Also, with your EP suggestion, can we have it so that it does not match the examples below? In other words, after 7 digits, or after a kind code such as A1, there must be white space.

EP20040716578
EP040716578
darbid73 (Author) Commented:
It does not look like this question is going anywhere. If someone could help with the regular expression for the DE format then I will call it a day and accept a solution.
louisfr Commented:
To prevent longer codes from matching, add \b (word boundary) at the start and end of the regular expression. It will prevent a match if there are extra letters or digits at the start or end.
That is the best way if you're looking for patent numbers inside a text. If you want to check that a number is correct, you can use ^ at the start and $ at the end to check that the whole input matches the expression.
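
The difference between the two modes can be sketched in C# as follows. This is only an illustration under assumptions: the class name `EpBoundaryDemo` is hypothetical, and the EP pattern is a regrouped variant of the one posted earlier (separators are allowed only between digits, and the kind codes share one optional group), not the exact pattern from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class EpBoundaryDemo
{
    // Assumed variant of the EP pattern: "EP", 1-7 digits with optional
    // comma/space separators between digits, then an optional kind code.
    private const string Core =
        @"EP ?\d(?:[, ]?\d){0,6}(?:\s*(?:[ABC][89]|[AB][123]|A4))?";

    // Search mode: \b on both sides, for numbers embedded in running text.
    public static readonly Regex InText = new Regex(@"\b" + Core + @"\b");

    // Validation mode: ^ and $, the whole input must be one patent number.
    public static readonly Regex WholeInput = new Regex("^" + Core + "$");
}
```

With this sketch, `InText` finds `EP 1,606,721 A2` inside a sentence but finds nothing in `EP20040716578`, because the trailing \b cannot sit between two digits. `WholeInput` accepts `EP1606721 B1` on its own and rejects the same number when other text surrounds it.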

A regex for the DE format could be
\bDE\s?((\d[\s,]?){8}(\s?(B[3489]|C[1358]|D[12]|T[023589]|I2))?|(\d[ ,]?){12}(\s?(B[3489]|C5|D[12]|T[23589]|I2))?)\b

You can shorten it to
\bDE\s?(\d[\s,]?){8}(((\d[ ,]?){4})?(\s?(B[3489]|C5|D[12]|T[23589]|I2))?|(\s?(C[138]|T[0]))?)\b
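
For readability, the longer of the two DE patterns can also be laid out over several lines. The following is a sketch, assuming .NET's `RegexOptions.IgnorePatternWhitespace` option, which ignores unescaped white space in the pattern and treats `#` as the start of a comment. The class name `DePatentPattern` is hypothetical, groups are written as non-capturing, and the literal space in the separator class is written as `\x20` so that it cannot be mistaken for layout.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder class, for illustration only.
public static class DePatentPattern
{
    // Same structure as the longer DE pattern, one part per line.
    public static readonly Regex Verbose = new Regex(@"
        \bDE\s?
        (?:
            (?:\d[\s,]?){8}                                  # 1995-2003: 8 digits, optional separators
            (?:\s?(?:B[3489]|C[1358]|D[12]|T[023589]|I2))?   # optional kind code (older list)
          |
            (?:\d[\x20,]?){12}                               # 2004 onward: 12 digits, optional separators
            (?:\s?(?:B[3489]|C5|D[12]|T[23589]|I2))?         # optional kind code (newer list)
        )
        \b",
        RegexOptions.IgnorePatternWhitespace);
}
```

Note that the check digit is not part of this pattern, so for `DE502004002300.1` the match stops before the dot. The kind code A1 used in the sample numbers is also not in the kind code lists given in the question, so it is not recognised as a kind code here.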

Terry Woods (IT Guru) Commented:
Apologies for leaving this question hanging; I missed seeing the alert email for your reply.

Depending on how strings are appended in a particular language, you can build a maintainable pattern with multiple strings and comments next to them.

Though I'm not a .NET programmer, it looks like a pattern could be built from multiple strings in .NET in a way that looks something like this:
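// Verbatim strings (@"...") keep \b as a regex word boundary instead of a backspace character.
string pattern = @"\b(" +
    // US - 7 digits, with an optional country code
    @"(US)?\d{7}" +
    // DE - 12 digits with optional dot and number at the end.
    @"|DE\d{12}(\.\d)?" +  // Notice the | character at the start, which is essentially a logical OR.
    @")\b";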

Note I didn't try to cover all cases with that pattern. I'm really just demonstrating how alternation with a pipe character | combined with separately commented strings can combine the documentation with the code. That way, if the pattern for a particular type of code needs changing, it should be clear how to do it.
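
To show how this could scale to more countries, here is a minimal sketch in C# that keeps one pattern per country in a table, joins them with | in code, and returns the list of numbers found. Everything named here (`PatentFinder`, `CountryPatterns`, `Find`) is hypothetical, and the three patterns are simplified assumptions that cover only some of the formats in the question (for example, US is limited to exactly 7 digits and DE to the 2004 onward layout).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class, for illustration only.
public static class PatentFinder
{
    // One entry per country; illustrative patterns, not complete ones.
    private static readonly Dictionary<string, string> CountryPatterns =
        new Dictionary<string, string>
        {
            // EP: 1-7 digits, optional kind code
            ["EP"] = @"EP ?\d(?:[, ]?\d){0,6}(?:\s?(?:[ABC][89]|[AB][123]|A4))?",
            // US: 7 digits with optional thousands separators, optional kind code
            ["US"] = @"US ?\d(?:,?\d{3}){2}(?:\s?(?:A|B[123]|C[123]))?",
            // DE (2004 onward): 2 + 4 + 3 + 3 digits, optional check digit
            ["DE"] = @"DE ?\d{2}(?:[ ,]?\d{4})(?:[ ,]?\d{3}){2}(?:\.\d)?",
        };

    // Wrap each pattern in a named group and join them with | into one regex.
    private static readonly Regex Combined = new Regex(
        @"\b(?:" +
        string.Join("|", CountryPatterns.Select(p => "(?<" + p.Key + ">" + p.Value + ")")) +
        @")\b");

    // Returns every patent number found in the text, with its country code.
    public static List<(string Country, string Number)> Find(string text)
    {
        var found = new List<(string Country, string Number)>();
        foreach (Match m in Combined.Matches(text))
        {
            // The named group that succeeded identifies the country.
            string country = CountryPatterns.Keys.First(k => m.Groups[k].Success);
            found.Add((country, m.Value));
        }
        return found;
    }
}
```

Adding a country then means adding one commented entry to the table; the combined pattern is rebuilt from the table, so its length does not have to be managed by hand. Because each alternative sits in a named group, the caller can still tell which country's pattern produced a given match.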

darbid73 (Author) Commented:
Thanks guys I will take a look at it.

So Terry essentially what you are saying is I should build all of my different string patterns with a | pipe character between them? This would then create a pretty long regex string if I were to use say 15 different countries.
Terry Woods (IT Guru) Commented:
I'm not saying you should; I'm just saying you can, and still keep the code reasonably easy to understand.
darbid73 (Author) Commented:
"should" was a bad choice. Thank you again.