Solved

Retrieving a date from a "Free Text"

Posted on 2004-08-10
Last Modified: 2012-05-05
Greetings,

I am working on a project and have a problem retrieving a date from a string that contains a university name and a graduation date in no fixed pattern. By "free text" or "string without patterns" I mean that anything goes in the string.

Here are some examples:

Community College of Phildadelphia, ASN 79
Temple University BSN 1999
De La Salle University 02
Harvard University, /85
Burlington College Ass. RN 5/1995
AMA Vocational Center, August - 1998
Univeristy of Santo Tomas, Philippines, BSN May 80

As you can see:

In the first example, the graduation date is 1979, at Community College of Philadelphia, with an ASN degree.

In the second example, the graduation date is 1999, at Temple University.

In the third example, the graduation date is 2002, at De La Salle University.

In the fifth example: Date: May 1995, School: Burlington College, Degree: RN

In the seventh example: Date: May 1980, School: University of Santo Tomas, Philippines, Degree: BSN

My question is this: is there a way to parse the dates from a string that doesn't have a fixed pattern? If yes, how? (It would be better if the degree and school were also parsed, but the date is the most important.)

Please feel free to ask questions if I am not clear.

Thanks,
Fred
Question by:insanekid
LVL 20

Expert Comment

by:TheAvenger
ID: 11760161
This is not possible in general. The best solution is to identify several different patterns: the year is the last 2 digits, or the last 4 digits, or sits in the middle separated by something, and so on. Then try to parse every line with every pattern you have found. If a line matches several patterns, you have a problem (you don't know which one is right), and the same goes for a line that matches no pattern. So after parsing all lines with all the patterns you can define, show the user the uncertain ones (i.e. those that matched none or more than one pattern) and let them extract the date themselves.
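The try-every-pattern approach described above could be sketched roughly as follows. This is only an illustration in Python's `re` module (the thread itself is about the .NET Regex class, whose syntax for these constructs is much the same). The pattern names, the three patterns themselves, and the `CENTURY_PIVOT` rule for two-digit years are all assumptions derived from the sample lines in the question, not something the thread specifies.

```python
import re

# Assumption: month names are English and may be abbreviated ("Aug", "August").
MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH = r"(?:" + "|".join(MONTH_ABBR) + r")[a-z]*"
YEAR = r"(\d{4}|\d{2})"

# Hypothetical patterns, derived only from the sample lines in the question.
PATTERNS = {
    # "August - 1998", "May 80"
    "month_name": re.compile(r"\b(" + MONTH + r")\W*" + YEAR + r"\s*$", re.I),
    # "5/1995"
    "month_number": re.compile(r"\b(\d{1,2})/" + YEAR + r"\s*$"),
    # "ASN 79", "BSN 1999", ", /85": trailing year whose preceding word is not a month
    "year_only": re.compile(
        r"\b(?!" + MONTH + r"\b)[A-Za-z.]+\W*" + YEAR + r"\s*$", re.I),
}

# Assumption: two-digit years below 10 mean 20xx, all others mean 19xx.
CENTURY_PIVOT = 10


def expand_year(text):
    """Turn a 2- or 4-digit year string into a full year."""
    year = int(text)
    if len(text) == 4:
        return year
    return 2000 + year if year < CENTURY_PIVOT else 1900 + year


def parse_date(line):
    """Try every pattern; return (status, results).

    status is 'ok' (exactly one pattern matched), 'none' or 'ambiguous'.
    Each result is (pattern_name, month_or_None, year).
    """
    results = []
    for name, pattern in PATTERNS.items():
        m = pattern.search(line)
        if not m:
            continue
        if name == "month_name":
            month = MONTH_ABBR.index(m.group(1)[:3].lower()) + 1
            year = m.group(2)
        elif name == "month_number":
            month, year = int(m.group(1)), m.group(2)
            if not 1 <= month <= 12:
                continue  # e.g. "13/99" is not a month/year pair
        else:
            month, year = None, m.group(1)
        results.append((name, month, expand_year(year)))
    if len(results) == 1:
        return "ok", results
    return ("none" if not results else "ambiguous"), results


for sample in ["Temple University BSN 1999",
               "Burlington College Ass. RN 5/1995",
               "Some College, date unknown"]:
    print(sample, "->", parse_date(sample))
```

Lines that come back as 'none' or 'ambiguous' are the ones to show to the user. These three patterns happen to be mutually exclusive on the sample lines, so 'ambiguous' only starts to appear once more patterns are added.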

Author Comment

by:insanekid
ID: 11760565
Hi Avenger,

I was thinking of using a regular expression split, but I am not that familiar with regular expressions. Could you help me out with this one?

Thanks,
Fred
LVL 20

Accepted Solution

by:
TheAvenger earned 75 total points
ID: 11760626
Have a look at the Regex class: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemTextRegularExpressionsRegexClassTopic.asp
Learn about regular expressions, e.g. from here: http://www.regular-expressions.info/
There are more tutorials available on the web; just search Google.
You can also test regular expressions, and even find some ready-made ones, here: http://www.regexlib.com/Search.aspx
LVL 3

Expert Comment

by:primeMover2004
ID: 11760804
Yes, regular expressions are the way to go here. To me it looks like your input strings are structured like this: <university> <degree> <date>

The easiest approach would be to define 3 captures: one for the date, one for the degree and one for the university. I'd run these 3 expressions over the input file.

Some questions before we go for the expressions:

Can you expect some end of line pattern such as CR/LF?
Do you know all the strings representing a degree?
Can you give a rough estimation of the percentage of strings conforming to the <university> <degree> <date> format?
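As a rough illustration of the three-capture idea, here is a minimal sketch that uses named groups in Python's `re` module (named groups work much the same way in the .NET Regex class). It combines the three captures into one expression for brevity. The `DEGREES` list is purely hypothetical, since the real set of degree strings is not known, and the date sub-pattern only covers the shapes seen in the sample lines.

```python
import re

# Hypothetical degree list; the real set of degree strings is unknown.
DEGREES = ["ASN", "BSN", "RN"]

# Assumption: English month names, possibly abbreviated.
MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"

RECORD = re.compile(
    # Shortest prefix that still lets the rest of the line match.
    r"^(?P<university>.+?)"
    # Optional degree token, separated by spaces and/or commas.
    r"(?:[\s,]+(?P<degree>" + "|".join(DEGREES) + r"))?"
    # Trailing date: optional month name, optional "/", digits, optional "/digits".
    r"[\s,]+(?P<date>(?:" + MONTH + r"\W*)?/?\d{1,4}(?:/\d{2,4})?)\s*$",
    re.IGNORECASE,
)


def split_record(line):
    """Return a dict with 'university', 'degree', 'date', or None if no match."""
    m = RECORD.match(line.strip())
    return m.groupdict() if m else None


print(split_record("Temple University BSN 1999"))
print(split_record("Harvard University, /85"))
print(split_record("No date here"))
```

Anything outside the degree list stays attached to the school name. For "Burlington College Ass. RN 5/1995", for example, this sketch would capture "Burlington College Ass." as the university, so qualifiers like "Ass." would need their own rule.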

Author Comment

by:insanekid
ID: 11761929
Hi primeMover2004,

Answers to your questions:
1. Can you explain what CR and LF are?
2. Nope, I don't know all the strings that represent a degree.
3. Yes, it is mostly <university> <degree> <date>, probably about 75%.

Do you have any suggestions?

Thanks,
Fred
LVL 3

Assisted Solution

by:primeMover2004
primeMover2004 earned 75 total points
ID: 11773503

1. CR and LF stand for carriage return and line feed. They mark the end of a line, or of a record in your case.
2. So it might be a good idea to construct a regular expression and extract the degree strings from that file. Do you think that's possible? Do you have access to the file?
3. This means your application has to rely on some additional information provided by users.

My suggestion is to find out more about the file using regular expressions, and to design your application so that if the input scanning finds an ambiguity, the user can provide more information. I don't think there's a reasonable solution that works fully automatically. Keep it simple.
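To make the record splitting and the "ask the user" fallback concrete, here is a small Python sketch. It assumes one record per line, terminated by CR, LF or CR/LF. The function names and the deliberately simple `trailing_four_digit_year` parser are hypothetical stand-ins for whatever expressions the real file turns out to need.

```python
import re


def split_records(raw_text):
    """Split raw text into non-empty records on CR, LF or CR/LF line endings."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def trailing_four_digit_year(record):
    """Hypothetical parser: accept only an unambiguous 4-digit year at the end."""
    m = re.search(r"(?<!\d)(\d{4})\s*$", record)
    return int(m.group(1)) if m else None


def partition_for_review(records, parser):
    """Return (parsed, needs_review); parser returns None when it is unsure."""
    parsed, needs_review = [], []
    for record in records:
        result = parser(record)
        if result is None:
            needs_review.append(record)  # leave these for the user to resolve
        else:
            parsed.append((record, result))
    return parsed, needs_review


raw = ("Temple University BSN 1999\r\n"
       "Harvard University, /85\n"
       "De La Salle University 02\r\n")
parsed, needs_review = partition_for_review(split_records(raw),
                                            trailing_four_digit_year)
print("parsed:", parsed)
print("needs review:", needs_review)
```

Keeping the parser a plain function that returns None when unsure makes it easy to swap in stricter or looser expressions later without touching the review step.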


Question has a verified solution.
