Solved

Regex pattern for SSN (OCR'd)

Posted on 2008-10-28
Last Modified: 2013-11-26
Hi,

I'm working on a small application that reads TIF files, OCRs them, and looks for SSNs in the OCR'd text. I use the regular expression below, but depending on the quality of the TIF file, the SSNs get OCR'd in a few different patterns. Is there a single pattern that can identify all of these variants in one pass? I greatly appreciate your help.

Pattern that I use - \b\S{3}\-\S{2}\-\S{4}\b

Patterns I found so far
1. 99-99(space)-9999
2. 999-99-9999 (straight forward)
3. x99-99-9999
4. xx9-99-x9999
5. x9x-x9(space)-9999
where x could be any letter or special character. If the OCR software cannot identify the exact character, it injects some other character in that position.
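
As a rough illustration of the problem, the pattern can be run against one sample per variant. This is only a sketch: Python's re module is an assumption (the question does not name a regex engine), the digits are made up, and the letter "i" stands in for the unknown character x. A special character such as "~" would interact differently with \b.

```python
import re

# The asker's original pattern: 3 non-space chars, hyphen, 2, hyphen, 4.
ORIGINAL = re.compile(r"\b\S{3}-\S{2}-\S{4}\b")

# Hypothetical OCR outputs, one per numbered variant ("i" plays the role of x).
samples = [
    "23-45 -6789",    # variant 1: missing digit, stray space
    "123-45-6789",    # variant 2: clean SSN
    "i23-45-6789",    # variant 3
    "ii3-45-i6789",   # variant 4: extra character in the last group
    "i2i-i5 -6789",   # variant 5: stray space before the second hyphen
]

results = {s: bool(ORIGINAL.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

With these samples only variants 2 and 3 match; the stray space and the extra character break the fixed group widths.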

Thank you for looking into my problem.

Mohan
Question by:mohan_sekar
10 Comments
 
LVL 82

Expert Comment

by:hielo
ID: 22822657
>>If the OCR software cannot identify the exact character, it injects some character in that position.
And what do you want to do in said situation? Keep the non-numeric character or not?
Try:
 \b\d+\-\d+\s*\-\d{4}\b
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823040
Thanks Hielo, but it didn't work.
Yes, I want to keep the non-numeric character as is.
For example, iii-99-9999 or i99-99-9999 will not match.
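
A quick check of why the digit-only pattern misses those cases, sketched in Python (the re module and the sample digits are assumptions, not part of the thread):

```python
import re

# hielo's first suggestion: every group must consist of digits only.
DIGITS_ONLY = re.compile(r"\b\d+-\d+\s*-\d{4}\b")

samples = ["123-45-6789", "23-45 -6789", "iii-45-6789", "i23-45-6789"]

results = {s: bool(DIGITS_ONLY.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

The stray space is tolerated by \s*, but any non-digit inside a group defeats \d+, so the last two samples do not match.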
0
 
LVL 18

Accepted Solution

by:
Pawel Witkowski earned 500 total points
ID: 22823080
try this one:

\b.{2,3}\-.{2,3}\-.{4,5}\b
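
A minimal sketch of this pattern run against the five variants from the question, assuming Python's re module and made-up digits, with "i" standing in for the unrecognized character:

```python
import re

# Suggested pattern: groups of 2-3, 2-3 and 4-5 arbitrary characters,
# separated by literal hyphens.
FLEXIBLE = re.compile(r"\b.{2,3}-.{2,3}-.{4,5}\b")

samples = [
    "23-45 -6789",
    "123-45-6789",
    "i23-45-6789",
    "ii3-45-i6789",
    "i2i-i5 -6789",
]

results = {s: bool(FLEXIBLE.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

Because "." also matches a space and the group widths are ranges, all five samples match here.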

0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823434
Thanks, Wilg32. Your expression covers most of my cases, but I still have issues with the following ones:

1. 999-99x9999 (instead of hyphen I get characters like ~ or i)
2. 999x-99-9999 (I get an extra character before the hyphen here)
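
These two cases can be reproduced with a short sketch (Python's re module and the sample strings are assumptions; "i" and "~" are the substitute characters mentioned above):

```python
import re

# The accepted pattern still requires two literal hyphens.
FLEXIBLE = re.compile(r"\b.{2,3}-.{2,3}-.{4,5}\b")

samples = [
    "123-45i6789",   # case 1: second hyphen read as "i"
    "123-45~6789",   # case 1: second hyphen read as "~"
    "123i-45-6789",  # case 2: extra character before the first hyphen
]

results = {s: bool(FLEXIBLE.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

None of the three samples match: the first two contain only one hyphen, and in the third the first group would need four characters.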
0
 
LVL 82

Expert Comment

by:hielo
ID: 22823642
try:
\b.{11,12}\b
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823736
Hielo,
Your expression is too generic. It could match any 11- or 12-character string, not just SSNs (phone numbers, for example).
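
The objection is easy to confirm with a small sketch (Python's re module is assumed, and the sample strings are invented for illustration):

```python
import re

# Any 11 or 12 characters between two word boundaries.
GENERIC = re.compile(r"\b.{11,12}\b")

samples = [
    "123-45-6789",   # an SSN-shaped string (11 characters)
    "555-123-4567",  # a phone-number-shaped string (12 characters)
    "hello world",   # ordinary text (11 characters)
]

results = {s: bool(GENERIC.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

All three samples match, so the pattern cannot tell an SSN apart from other text of similar length.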
0
 
LVL 82

Expert Comment

by:hielo
ID: 22823929
But what you are describing is also a "generic" pattern. You don't always get a hyphen, and you don't know what you are going to get instead of it. If you are always getting a "~" or an "i" as the alternative, then try:
\b.{2,3}[~i\-].{2,3}[~i\-].{4,5}\b
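
A sketch of the character-class idea, assuming Python's re module and assuming both separator positions accept "~", "i", or a hyphen (the sample strings are invented):

```python
import re

# Each separator may be a hyphen or one of the known OCR substitutes.
SEPARATORS = re.compile(r"\b.{2,3}[~i\-].{2,3}[~i\-].{4,5}\b")

samples = ["123-45-6789", "123-45~6789", "123-45i6789"]

results = {s: bool(SEPARATORS.search(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

All three samples match. The class only helps while the list of substitute characters is known in advance; each new substitute has to be added by hand.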
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22824264
I've slightly modified Wilg32's expression to suit my requirements. Thanks, Wilg32.
.{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}
Thanks for your help, Hielo.
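
For reference, a sketch of the modified expression run against every variant mentioned in this thread. Python's re module, the sample digits, and the use of "i" and "~" as substitute characters are assumptions. The separator is either a hyphen or any single character that is immediately followed by a digit.

```python
import re

# Separator: a literal hyphen, or any one character followed by a digit.
FINAL = re.compile(r".{2,3}(-|.(?=\d)).{2,3}(-|.(?=\d)).{4,5}")

ssn_like = [
    "23-45 -6789",
    "123-45-6789",
    "i23-45-6789",
    "ii3-45-i6789",
    "i2i-i5 -6789",
    "123-45i6789",
    "123-45~6789",
    "123i-45-6789",
]
# Strings that are not SSNs but have a similar shape.
not_ssn = ["555-123-4567", "5551234567"]

ssn_results = {s: bool(FINAL.search(s)) for s in ssn_like}
other_results = {s: bool(FINAL.search(s)) for s in not_ssn}

for text, ok in {**ssn_results, **other_results}.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

All eight SSN-like samples match, the last one through the substring "23i-45-6789" because the expression has no \b anchors. The two non-SSN samples match as well, since a digit can itself act as the separator, so some extra validation of the hits is probably still needed.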
0
 
LVL 15

Author Closing Comment

by:mohan_sekar
ID: 31510766
Thanks, Wilg32
0
 
LVL 18

Expert Comment

by:Pawel Witkowski
ID: 22826692
I'm just glad I could help. Sorry that I couldn't help further, but I was playing volleyball ^^
0


Question has a verified solution.
