• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 238
  • Last Modified:

How to trap text without unique identifier

I asked a question similiar to this in another area and have had only one response so will repost here. Using VBA in Access 2002, I need to parse data from the body of received emails. There is no unique identifier for the string. Here is an example:

------------------------------
Jane Doe's Address
------------------------------

Jane Doe
123 Frenchtown Rd
P.O. Box 1156
Gainsville, FL
21903-0012
USA

Thank you for stopping by,


I need to grab the address information from this block and add it to the appropriate fields in my DB. fldAddress1, fldAddress2, fldCity... etc. When I receive these emails, this block of data does not always begin or end on same line.

I have no idea how difficult this is but can find nothing written about it so will assume it is not an easy task.

Really need your help. TIA

0
colonelmustard
Asked:
colonelmustard
  • 6
  • 4
  • 3
  • +2
1 Solution
 
rdrunnerCommented:
Hmmm....

this is "kinda hard"

What happens if there are only 1 address line ?

'----
Jane Doe
123 Frenchtown Rd
Gainsville, FL
21903-0012
USA

'----

What i could think of is trying to get a hook on the address by extracting the ZIP-code and work your way from there ...

Will a country allways be there?

Will that text allways be at the END of the body?

If its allways the last part of the body you could use a small collection and limit it to say the last 6 lines read..

Another thought would be to look for the "," and split from there so you have Town/state....

I need to know more about what the user can/MUST enter in order to do some real processing....

If i know more about whats a MUST then you could try to tackle it by using regular expressions and generate a matchsstring that will break those lines up as needed...

Hope this helps so far ...
0
 
colonelmustardAuthor Commented:
Thanks for responding rdrunner. Yes the Country is always there....not always USA though. These lines always start on lines 32-35 and end at no more than line 40. More often then not there is only one address line. What you see on that little sniplet is the maximum information the user can enter on the web form. And everything listed between the:

------------------------------
Jane Doe's Address
------------------------------

and the:


Thank you for stopping by,


I need.


Hope this helps
0
 
rdrunnerCommented:
show an example for the MIMIMUM dataentry please ...

0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
rdrunnerCommented:
i THINK i could cut it appart with regular expressions ... Lemme try to get my pattern streight ...

0
 
colonelmustardAuthor Commented:
My comment above was incorrect: The minimum data in this block would be:

Jane Doe
123 Frenchtown Rd
Gainsville, FL
21903
USA

the maximum would be:

Jane Doe
c/o America Corp
123 Frenchtown Rd
Gainsville, FL
21903-0012
USA

Is this more clear?
0
 
rdrunnerCommented:
Ill take another look on Sunday if it isnt solved till then ... I think i know whats needed, but i dont have the right version of IE installed ;)

I wanted to go for a match by the double occurence by the name and work from there on ...



0
 
JohnMcCannCommented:
What you are trying to do is extremly difficult if not impossible.  My reasoning for this is quite simple.

a) There is no consistent start or end to the text.
b) The text will have no identifyable data.

If you attempted to implemnt some sort of searching algorythm it would be to complex and very unlikely to sucseed.  Your database would end up with invalidate data all over the place.

Thee only solution I would recommend to this problem, would be to use some form of data capture to capture the addresses eg. a standard from that is filled in and emailed with the address on formatted in a specific way.  Or an online form with a backend cgi script.


Best of luck

John McCann
0
 
JohnMcCannCommented:
Just read the comment

What you see on that little sniplet is the maximum information the user can enter on the web form.

So you already have a web form creating the emails?
0
 
colonelmustardAuthor Commented:
JohnMcCann,
This email that I am trying to process is a payment notification from an online service. There are other fields that I have already captured because they are actually named on the email and easily extracted. This is the only instance of a shipping address...which is nuts. I have no control over these notifications except to accept them in there current form. But I need to get the addresses into the DB for shipping purposes. The only consistantcy is that the data always is the word "Thank" just after the empty line following the address block is the only instance of that word in the email. I was hoping that using that I could capture the 8 lines of text preceding it. Excluding lines that are carriage returns.

Thanks for responding
0
 
colonelmustardAuthor Commented:
EDITED----sorry for last post!

JohnMcCann,
This email that I am trying to process is a payment notification from an online service. There are other fields that I have already captured because they are actually named on the email and easily extracted. This is the only instance of a shipping address...which is nuts. I have no control over these notifications except to accept them in there current form. But I need to get the addresses into the DB for shipping purposes. The only consistantcy is that the data always preceeds the word "Thank", which is just after the empty line following the address block. It is the only instance of that word in the email. I was hoping that using that I could capture the 8 lines of text preceding it. Excluding lines that are carriage returns.

Thanks for responding

0
 
cyberknetCommented:
Just a thought:

dim strStart as string, strEnd as string
dim lngStart as long, lngEnd as long
dim strAddress as string, strAddressLines() as string
dim strResult(5) as string

' match the start with a line of dashes and two returns.
strStart = "------------------------------" & vbCrLf & vbCrLf
' match the end with two returns and thank you...
strEnd = vbCrLf & vbCrLf & "Thank you for stopping by,"

' find the start character
lngStart = instr(1, strEmail, strStart)
lngStart = lngStart + len(strStart)
' find the end character
lngEnd = instr(lngStart, strEmail, strEnd)

if lngStart > 0 and lngEnd > 0 then
    strAddress = mid(StrEmail, lngStart, (lngEnd - lngStart))
    strAddressLines = split(strAddress, vbCrLf)
   
    if ubound(strAddressLines) = 5 then
        strResult(0) = strAddressLines(0)
        strResult(1) = strAddressLines(1)
        strResult(2) = strAddressLines(2)
        strResult(3) = strAddressLines(3)
        strResult(4) = strAddressLines(4)
        strResult(5) = strAddressLines(5)
    elseif ubound(strAddressLines) = 4 then
        strResult(0) = strAddressLines(0)
        strResult(1) = strAddressLines(1)
        strResult(3) = strAddressLines(2)
        strResult(4) = strAddressLines(3)
        strResult(5) = strAddressLines(4)
    end if
end if

----- end code
strResult(0) = "Jane Doe"
strResult(1) = 123 Frenchtown Rd (Address 1)
strResult(2) = P.O. Box 1156 (Address 2, if available)
strResult(3) = Gainsville, FL (CIty, State)
strResult(4) = 21903-0012 (zip code)
strResult(5) = USA (country)

this is pretty rough... but you should get the idea.
                 
0
 
colonelmustardAuthor Commented:
Thanks for responding...in the above example you make reference to a string named "strEmail"...can you tell me where that value is coming from? TIA
0
 
colonelmustardAuthor Commented:
Thanks for responding...in the above example you make reference to a string named "strEmail"...can you tell me where that value is coming from? TIA
0
 
cyberknetCommented:
strEmail is the text in the email. You would need to populate it.
0
 
cyberknetCommented:
also be aware that if there is no text between those two lines, or if those two lines do not appear in strEmail (after you populate strEmail with the text of the email) then there will be no address populated, just an array with 6 indexes (0 to 5) of blank strings.
0
 
y2kswCommented:
This is a typical problem solved best by Artificial Neural Networks ([A]NN). You have an arbitrary number of lines and contents, and the order may or might not be the same. Components may miss or my be complete, might be incorrect, or spaced differently, right and wrong mixed upper and lower case, interpunctuation, and so stuff, and all that recognition will be far too much to solve it easily with standard little code and efford.

Effectively, all main post offices have a NN address recognizer (sort of intelligent OCR), which distributes the mail to the respective mail boxes, trained with just about 6000 different ways to write a generic post address.

The types of NN you should look for is the Hopfield or the Bidirectional Associated Memory (BAM) algorithm. Both have pro's and cont's, but it's a nice and quiet simple way to solve all eventuallities and possibilities a human brain may invent for writing down his/hers address. Note that some of the algo's are capable to elaborate even legastenic problems (casual or regular letter and word inversions).

For easy source code (C++) have a look at:

http://www.geocities.com/CapeCanaveral/1624/
0

Featured Post

Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

  • 6
  • 4
  • 3
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now