Solved

Regex pattern for SSN (OCR'd)

Posted on 2008-10-28
10
462 Views
Last Modified: 2013-11-26
Hi,

I'm working on a small application where I've to read TIF files, OCR them and look for SSNs in the OCR'd text. I use the following regular expression, but the issue is that depending on the quality of the TIF file, the SSNs get OCR'd in few different patterns. I would like to know if there is any single pattern that can help me identify all such patterns in one pass. I greatly appreciate your help.

Pattern that I use - \b\S{3}\-\S{2}\-\S{4}\b

Patterns I found so far
1. 99-99(space)-9999
2. 999-99-9999 (straight forward)
3. x99-99-9999
4. xx9-99-x9999
5. x9x-x9(space)-9999
where x could be any alphabet or special character. If the OCR software cannot identify the exact character, it injects some character in that position.

Thank you for looking into my problem.

Mohan
0
Comment
Question by:mohan_sekar
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 3
  • 2
10 Comments
 
LVL 82

Expert Comment

by:hielo
ID: 22822657
>>If the OCR software cannot identify the exact character, it injects some character in that position.
And what do you want to do in said situation? Keep the non-numeric character or not?
Try:
 \b\d+\-\d+\s*\-\d{4}\b
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823040
Thanks Hielo, but it didn't work.
Yes, I want to keep the non-numeric character as is.
For example, iii-99-9999 or i99-99-9999 will not match.
0
 
LVL 18

Accepted Solution

by:
Pawel Witkowski earned 500 total points
ID: 22823080
try this one:

\b.{2,3}\-.{2,3}\-.{4,5}\b

0
Guide to Performance: Optimization & Monitoring

Nowadays, monitoring is a mixture of tools, systems, and codes—making it a very complex process. And with this complexity, comes variables for failure. Get DZone’s new Guide to Performance to learn how to proactively find these variables and solve them before a disruption occurs.

 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823434
Thanks, Wilg32. Your expression covers most of my cases, but I have issues with the following ones

1. 999-99x9999 (instead of hyphen I get characters like ~ or i)
2. 999x-99-9999 (I get an extra character before the hyphen here)
0
 
LVL 82

Expert Comment

by:hielo
ID: 22823642
try:
\b.{11,12}\b
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22823736
Heilo,
Your expression is too generic. It might match with any 11 or 12 character strings and not just SSNs. Example phone numbers.
0
 
LVL 82

Expert Comment

by:hielo
ID: 22823929
but what you are describing is also a "Generic" pattern. You don't always get a hyphen, but you also don't know what you are going to get instead of the hyphen. IF you are always getting a "~" and an "i" as alternatives, then try:
\b.{2,3}[~i\-].{2,3}[~u\-].{4,5}\b
0
 
LVL 15

Author Comment

by:mohan_sekar
ID: 22824264
I've slightly modified Wilg32's expression to suit my requirements. Thanks, Wilg32.
.{2,3}(\-|.(?=\d)).{2,3}(\-|.(?=\d)).{4,5}
Thanks for your help, Hielo.
0
 
LVL 15

Author Closing Comment

by:mohan_sekar
ID: 31510766
Thanks, Wilg32
0
 
LVL 18

Expert Comment

by:Pawel Witkowski
ID: 22826692
I just glad that I can help, sorry that I couldnt help further but I was playing volleyball ^^
0

Featured Post

Guide to Performance: Optimization & Monitoring

Nowadays, monitoring is a mixture of tools, systems, and codes—making it a very complex process. And with this complexity, comes variables for failure. Get DZone’s new Guide to Performance to learn how to proactively find these variables and solve them before a disruption occurs.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Introduction This article is the second of three articles that explain why and how the Experts Exchange QA Team does test automation for our web site. This article covers the basic installation and configuration of the test automation tools used by…
JavaScript can be used in a browser to change parts of a webpage dynamically. It begins with the following pattern: If condition W is true, do thing X to target Y after event Z. Below are some tips and tricks to help you get started with JavaScript …
Viewers will learn about if statements in Java and their use The if statement: The condition required to create an if statement: Variations of if statements: An example using if statements:
This tutorial covers a practical example of lazy loading technique and early loading technique in a Singleton Design Pattern.

756 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question