Solved

Regex pattern for SSN (OCR'd)

Posted on 2008-10-28
Last Modified: 2013-11-26
Hi,

I'm working on a small application where I have to read TIF files, OCR them, and look for SSNs in the OCR'd text. I use the regular expression below, but depending on the quality of the TIF file, the SSNs get OCR'd in a few different patterns. I would like to know whether a single pattern can identify all of them in one pass. I greatly appreciate your help.

Pattern that I use - \b\S{3}\-\S{2}\-\S{4}\b

Patterns I found so far
1. 99-99(space)-9999
2. 999-99-9999 (straightforward)
3. x99-99-9999
4. xx9-99-x9999
5. x9x-x9(space)-9999
where x could be any letter or special character. If the OCR software cannot identify the exact character, it injects some character in that position.
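To see which of the shapes listed above the original pattern actually catches, here is a minimal sketch. It assumes Python's `re` module (the thread does not name a regex engine), and the sample strings are hypothetical stand-ins for the five shapes, with the letter `x` playing the injected character.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"\b\S{3}\-\S{2}\-\S{4}\b")

# Hypothetical OCR outputs, one per numbered shape in the list above.
# "x" stands in for a character the OCR engine guessed at.
samples = [
    "99-99 -9999",
    "123-45-6789",
    "x23-45-6789",
    "xx3-45-x6789",
    "x2x-x5 -6789",
]

# Map each sample to True/False depending on whether the pattern finds a match.
results = {s: bool(ORIGINAL.search(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r:16} {'match' if ok else 'no match'}")
```

With these samples only shapes 2 and 3 match, because the pattern insists on exactly three, two, and four non-space characters around the hyphens and allows no space before the second hyphen.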

Thank you for looking into my problem.

Mohan
Question by:mohan_sekar
 
LVL 82

Expert Comment

by:hielo
ID: 22822657
>>If the OCR software cannot identify the exact character, it injects some character in that position.
And what do you want to do in said situation? Keep the non-numeric character or not?
Try:
 \b\d+\-\d+\s*\-\d{4}\b
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823040
Thanks Hielo, but it didn't work.
Yes, I want to keep the non-numeric character as is.
For example, your pattern does not match iii-99-9999 or i99-99-9999.
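The mismatch is easy to reproduce. The sketch below assumes Python's `re` module and a hypothetical helper named `has_match`; it runs the digits-only pattern against two clean strings and the two examples from this comment.

```python
import re

# The digits-only suggestion: \d+ cannot absorb a letter that the OCR
# engine injected in place of a digit.
DIGITS_ONLY = re.compile(r"\b\d+\-\d+\s*\-\d{4}\b")

def has_match(text: str) -> bool:
    """Return True if the digits-only pattern matches anywhere in text."""
    return DIGITS_ONLY.search(text) is not None

for text in ["999-99-9999", "99-99 -9999", "iii-99-9999", "i99-99-9999"]:
    print(f"{text!r:16} {has_match(text)}")
```

The first two strings match (the `\s*` handles the stray space), while both strings with an injected `i` do not.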
 
LVL 18

Accepted Solution

by:
Pawel Witkowski earned 500 total points
ID: 22823080
try this one:

\b.{2,3}\-.{2,3}\-.{4,5}\b
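As a rough check of this suggestion, the sketch below (assuming Python's `re` module and hypothetical sample strings for the five shapes in the question) shows that allowing any character, and a little slack in each group's length, covers all five.

```python
import re

# The accepted pattern: "." accepts any character, and the {m,n} ranges
# absorb one missing or one extra character per group.
LOOSE = re.compile(r"\b.{2,3}\-.{2,3}\-.{4,5}\b")

# Hypothetical OCR outputs for the five shapes listed in the question.
samples = [
    "99-99 -9999",
    "123-45-6789",
    "x23-45-6789",
    "xx3-45-x6789",
    "x2x-x5 -6789",
]

loose_results = [bool(LOOSE.search(s)) for s in samples]
print(loose_results)
```

All five samples match. The pattern still requires two literal hyphens, so it depends on the OCR engine reading both separators correctly.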

 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823434
Thanks, Wilg32. Your expression covers most of my cases, but I have issues with the following ones:

1. 999-99x9999 (instead of hyphen I get characters like ~ or i)
2. 999x-99-9999 (I get an extra character before the hyphen here)
 
LVL 82

Expert Comment

by:hielo
ID: 22823642
try:
\b.{11,12}\b
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823736
Hielo,
Your expression is too generic. It might match any 11- or 12-character string, not just SSNs; phone numbers, for example.
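The objection can be checked with a small sketch. It assumes Python's `re` module; the phone number and the plain-text phrase are made-up examples.

```python
import re

# The length-only suggestion: any 11 or 12 characters between word boundaries.
LENGTH_ONLY = re.compile(r"\b.{11,12}\b")

candidates = {
    "ssn": "123-45-6789",
    "phone": "555-123-4567",   # hypothetical phone number
    "words": "hello world",    # ordinary text of the right length
}

# True for every entry whose text the pattern matches somewhere.
length_results = {name: bool(LENGTH_ONLY.search(text))
                  for name, text in candidates.items()}
print(length_results)
```

All three entries match, so a length check alone cannot tell an SSN from other text of similar size.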
 
LVL 82

Expert Comment

by:hielo
ID: 22823929
But what you are describing is also a "generic" pattern. You don't always get a hyphen, but you also don't know what you are going to get instead of the hyphen. If you are always getting a "~" or an "i" as alternatives, then try:
\b.{2,3}[~i\-].{2,3}[~i\-].{4,5}\b
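A sketch of how the character-class idea behaves, assuming Python's `re` module. The prose names `~` and `i` as the stray separators, so the sketch uses `[~i\-]` in both positions; that reading is an assumption, and the test strings are hypothetical.

```python
import re

# Each separator may be a hyphen, "~" or "i" (assumed for both positions).
SEP_CLASS = re.compile(r"\b.{2,3}[~i\-].{2,3}[~i\-].{4,5}\b")

def sep_class_match(text: str) -> bool:
    """Return True if the separator-class pattern matches anywhere in text."""
    return SEP_CLASS.search(text) is not None

for text in ["123-45~6789", "123-45i6789", "123x-45-6789"]:
    print(f"{text!r:16} {sep_class_match(text)}")
```

The two misread-hyphen strings match, but the string with an extra character before the first hyphen does not, since that extra character pushes the first group to four characters.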
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22824264
I've slightly modified Wilg32's expression to suit my requirements. Thanks, Wilg32.
.{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}
Thanks for your help, Hielo.
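For anyone wanting to try the modified expression, here is a minimal sketch assuming Python's `re` module. The helper `matched_text` and all test strings are hypothetical; they cover the five shapes from the question, the two later problem cases, and one phone-number-shaped string.

```python
import re

# The modified pattern: each separator is either a hyphen or any single
# character that is immediately followed by a digit (the lookahead).
FINAL = re.compile(r".{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}")

def matched_text(text):
    """Return the substring the pattern matched, or None if there is no match."""
    m = FINAL.search(text)
    return m.group(0) if m else None

cases = [
    "99-99 -9999", "123-45-6789", "x23-45-6789",
    "xx3-45-x6789", "x2x-x5 -6789",   # the five shapes from the question
    "123-45~6789", "123x-45-6789",    # the two later problem cases
    "555-123-4567",                   # hypothetical phone number
]

for text in cases:
    print(f"{text!r:16} -> {matched_text(text)!r}")
```

Every string here produces a match. Two details may matter in practice: for `123x-45-6789` the match starts at the second character (`23x-45-6789`), and because the `\b` anchors were dropped, the phone-number-shaped string matches too, so some filtering after the regex may still be needed.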
 
LVL 15

Author Closing Comment

by:mohan_sekar
ID: 31510766
Thanks, Wilg32
 
LVL 18

Expert Comment

by:Pawel Witkowski
ID: 22826692
I'm just glad that I could help. Sorry that I couldn't help further, but I was playing volleyball ^^