Solved

Regex pattern for SSN (OCR'd)

Posted on 2008-10-28
Last Modified: 2013-11-26
Hi,

I'm working on a small application where I have to read TIF files, OCR them, and look for SSNs in the OCR'd text. I use the following regular expression, but depending on the quality of the TIF file, the SSNs get OCR'd in a few different patterns. Is there a single pattern that can identify all of them in one pass? I greatly appreciate your help.

Pattern that I use - \b\S{3}\-\S{2}\-\S{4}\b

Patterns I found so far
1. 99-99(space)-9999
2. 999-99-9999 (straightforward)
3. x99-99-9999
4. xx9-99-x9999
5. x9x-x9(space)-9999
where x could be any letter or special character. If the OCR software cannot identify the exact character, it injects some other character in that position.

Thank you for looking into my problem.

Mohan
Question by:mohan_sekar
10 Comments
 
LVL 82

Expert Comment

by:hielo
ID: 22822657
>>If the OCR software cannot identify the exact character, it injects some character in that position.
And what do you want to do in said situation? Keep the non-numeric character or not?
Try:
 \b\d+\-\d+\s*\-\d{4}\b
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823040
Thanks Hielo, but it didn't work.
Yes, I want to keep the non-numeric character as is.
For example, iii-99-9999 or i99-99-9999 will not match.
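To make the failure concrete, here is a minimal Python sketch that runs hielo's suggested pattern against a few sample strings. The sample strings and the names `HIELO_PATTERN`, `samples`, and `results` are made up for illustration; the thread never says which regex engine the application uses, so Python's `re` module is an assumption.

```python
import re

# Pattern suggested by hielo: digits only, optional whitespace before the
# second hyphen. Python's `re` is assumed here; the thread names no engine.
HIELO_PATTERN = re.compile(r"\b\d+\-\d+\s*\-\d{4}\b")

# Hypothetical OCR outputs modeled on the cases described in the question.
samples = [
    "999-99-9999",   # clean SSN
    "99-99 -9999",   # space before the second hyphen
    "i99-99-9999",   # first digit misread as a letter
    "iii-99-9999",   # whole first group misread
]

# Map each sample to whether the pattern finds a match anywhere in it.
results = {s: bool(HIELO_PATTERN.search(s)) for s in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The two clean-digit samples match, while the two with injected letters do not, because `\d` only accepts digits and `\b` cannot fall between a letter and a digit.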
 
LVL 18

Accepted Solution

by:
Pawel Witkowski earned 500 total points
ID: 22823080
try this one:

\b.{2,3}\-.{2,3}\-.{4,5}\b
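As a rough check, the sketch below runs this pattern against one sample string for each of the five patterns listed in the question. It assumes Python's `re` module (the thread does not name a regex engine), and the sample strings, with `i` standing in for the unknown character x, are illustrative only.

```python
import re

# The pattern from this answer: any 2-3 characters, a hyphen, any 2-3
# characters, a hyphen, then any 4-5 characters.
ACCEPTED_PATTERN = re.compile(r"\b.{2,3}\-.{2,3}\-.{4,5}\b")

# One hypothetical sample per pattern from the question ("i" plays the
# role of the unknown character x).
question_samples = [
    "99-99 -9999",    # 1. 99-99(space)-9999
    "999-99-9999",    # 2. 999-99-9999
    "i99-99-9999",    # 3. x99-99-9999
    "ii9-99-i9999",   # 4. xx9-99-x9999
    "i9i-i9 -9999",   # 5. x9x-x9(space)-9999
]

# Map each sample to whether the pattern finds a match anywhere in it.
accepted_results = {
    s: bool(ACCEPTED_PATTERN.search(s)) for s in question_samples
}

for text, matched in accepted_results.items():
    print(f"{text!r}: {matched}")
```

All five samples match, since `.` accepts the injected characters and the `{2,3}` and `{4,5}` ranges absorb the stray space and the extra leading character.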

LVL 15

Author Comment

by:mohan_sekar
ID: 22823434
Thanks, Wilg32. Your expression covers most of my cases, but I have issues with the following ones:

1. 999-99x9999 (instead of hyphen I get characters like ~ or i)
2. 999x-99-9999 (I get an extra character before the hyphen here)
 
LVL 82

Expert Comment

by:hielo
ID: 22823642
try:
\b.{11,12}\b
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823736
Hielo,
Your expression is too generic. It might match any 11- or 12-character string, not just SSNs. Phone numbers, for example.
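A small sketch of that objection, assuming Python's `re` module (the thread does not name an engine); the phone number and the plain-text string are invented examples.

```python
import re

# hielo's catch-all suggestion: any 11 or 12 characters between word
# boundaries.
GENERIC_PATTERN = re.compile(r"\b.{11,12}\b")

# Hypothetical non-SSN inputs of the right length.
non_ssn_samples = [
    "555-123-4567",  # 12-character phone number
    "hello world",   # 11 characters of ordinary text
]

# Map each sample to whether the pattern finds a match anywhere in it.
generic_results = {
    s: bool(GENERIC_PATTERN.search(s)) for s in non_ssn_samples
}

for text, matched in generic_results.items():
    print(f"{text!r}: {matched}")
```

Both strings match, which is the false-positive risk described above.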
 
LVL 82

Expert Comment

by:hielo
ID: 22823929
But what you are describing is also a "generic" pattern. You don't always get a hyphen, and you don't know what you are going to get in its place. If you always get "~" or "i" as the alternatives, then try:
\b.{2,3}[~i\-].{2,3}[~i\-].{4,5}\b
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22824264
I've slightly modified Wilg32's expression to suit my requirements. Thanks, Wilg32.
.{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}
Thanks for your help, Hielo.
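For anyone who wants to try the modified expression, here is a minimal sketch that runs it against the cases discussed in this thread. It assumes Python's `re` module (the engine actually used is not stated), and the sample strings are illustrative, with `i` and `~` standing in for the characters the OCR software injects.

```python
import re

# The modified expression: each separator is either a literal hyphen or
# any single character that is immediately followed by a digit.
FINAL_PATTERN = re.compile(r".{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}")

# Hypothetical samples covering the patterns raised in the thread.
thread_samples = [
    "99-99 -9999",    # space before the second hyphen
    "999-99-9999",    # clean SSN
    "i99-99-9999",    # one misread leading character
    "ii9-99-i9999",   # misread characters plus an extra one in the last group
    "i9i-i9 -9999",   # misread characters and a stray space
    "999-99~9999",    # second hyphen read as "~"
    "999i-99-9999",   # extra character before the first hyphen
]

# Record the matched text for each sample (None when nothing matches).
final_matches = {}
for s in thread_samples:
    m = FINAL_PATTERN.search(s)
    final_matches[s] = m.group(0) if m else None

for text, matched_text in final_matches.items():
    print(f"{text!r} -> {matched_text!r}")
```

Every sample yields a match. Two caveats are worth keeping in mind: the expression has no `\b` anchors, so it can match only part of a token (for "999i-99-9999" the matched text is "99i-99-9999"), and it still accepts other hyphenated numbers such as the phone number "555-123-4567".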
 
LVL 15

Author Closing Comment

by:mohan_sekar
ID: 31510766
Thanks, Wilg32
 
LVL 18

Expert Comment

by:Pawel Witkowski
ID: 22826692
I'm just glad that I could help. Sorry that I couldn't help further, but I was playing volleyball ^^
