US Postal Address Parsing into Separate Fields

dsr1811
I need a PHP regular expression to find the parts of a US postal address, preferably returned in an array keyed by field name, so I can feed it into a LIKE search in MySQL. See the examples of possible input and expected output below. This is a challenging problem. Thanks in advance. If you know of a better way to do this, by all means let me know. MySQL's boolean search looked interesting for this type of problem. Thanks again!

Possible input fields are: Number, Direction, Name, Suffix, City, State, Zip. The direction can come on either side of the number.

Case does not matter; fields may be separated by commas or spaces.

Suffixes need to be validated, and abbreviations should be replaced by the full names:
STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD

Compass directions need to be validated, and abbreviations should be replaced by the full names:
W|West|SW|Southwest|NW|Northwest|S|South|E|East|SE|Southeast|NE|Northeast|N|North
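The validation and expansion described above could be sketched with plain lookup tables, as below. The names `SUFFIXES`, `DIRECTIONS`, and `expandToken` are hypothetical, and the tables contain only the values listed in the question; this is one possible approach, not a complete USPS suffix list.

```php
<?php
// Every accepted spelling maps to its full name (values taken from the lists above).
const SUFFIXES = [
    'street' => 'street', 'st' => 'street',
    'drive' => 'drive', 'dr' => 'drive',
    'avenue' => 'avenue', 'ave' => 'avenue',
    'road' => 'road', 'rd' => 'road',
    'court' => 'court', 'ct' => 'court',
    'circle' => 'circle',
    'lane' => 'lane', 'ln' => 'lane',
    'boulevard' => 'boulevard', 'blvd' => 'boulevard',
];

const DIRECTIONS = [
    'n' => 'north', 'north' => 'north',
    's' => 'south', 'south' => 'south',
    'e' => 'east', 'east' => 'east',
    'w' => 'west', 'west' => 'west',
    'ne' => 'northeast', 'northeast' => 'northeast',
    'nw' => 'northwest', 'northwest' => 'northwest',
    'se' => 'southeast', 'southeast' => 'southeast',
    'sw' => 'southwest', 'southwest' => 'southwest',
];

/**
 * Return the full lowercase name for a token, or null if the token
 * is not a recognized spelling in $map. A trailing period is ignored.
 */
function expandToken(string $token, array $map): ?string
{
    $key = strtolower(rtrim(trim($token), '.'));
    return $map[$key] ?? null;
}
```

A null result means the token is neither a valid suffix nor a valid direction, so it can be left for the name or city tests.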

Cities and streets may contain more than one word.

Currently the user input is turned into an array.

The output should also be an array, which I will then feed to a class that creates the MySQL 'AND' logic for the fields.

array(
      [number] => 421
      [direction] => 'west'
      [city] => 'john glenn'
      [state] => 'ca'
)

Example of the solution array applied to a query:

WHERE number like '%421%' AND direction like '%west%' AND city like '%john glenn%' AND state like 'ca'
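One way to turn the solution array into that WHERE clause is sketched below. It assumes the column names match the array keys, and the function name `buildWhere` is hypothetical. It returns placeholders plus bound values rather than interpolating user input into the SQL string.

```php
<?php
/**
 * Build a WHERE clause with "?" placeholders and the matching bound values.
 * Column names come from a fixed whitelist, never from user input.
 *
 * @return array [string $sql, array $params]
 */
function buildWhere(array $parts): array
{
    $allowed = ['number', 'direction', 'name', 'suffix', 'city', 'state', 'zip'];
    $clauses = [];
    $params  = [];

    foreach ($allowed as $column) {
        if (!isset($parts[$column]) || $parts[$column] === '') {
            continue; // field not supplied by the user
        }
        $clauses[] = "$column LIKE ?";
        // As in the example query, state is matched without wildcards.
        $params[] = $column === 'state'
            ? (string) $parts[$column]
            : '%' . $parts[$column] . '%';
    }

    $sql = $clauses ? 'WHERE ' . implode(' AND ', $clauses) : '';
    return [$sql, $params];
}
```

The result can be passed to a prepared statement, for example `$pdo->prepare("SELECT * FROM addresses $sql")` followed by `execute($params)`, where the table name `addresses` is an assumption. Literal `%` or `_` characters typed by the user are not escaped in this sketch.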

I am currently creating an array from the input and separating it as below:

$addressParts = explode(",", str_replace(" ",",",$PostAddr));
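Note that replacing spaces with commas and then exploding produces empty elements for input such as "las vegas, nv", because the comma and the space each become a separator. A possible alternative is a single `preg_split` call, sketched here; the function name `tokenizeAddress` is hypothetical.

```php
<?php
/**
 * Split an address string on any run of whitespace and/or commas,
 * lowercase it, and drop empty tokens.
 */
function tokenizeAddress(string $input): array
{
    return preg_split('/[\s,]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
}
```

Lowercasing up front reflects the statement that case does not matter. The commas are discarded here, so any hint they give about where the city starts is lost; keeping them would need a different split.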


INPUT: las vegas, nv
OUTPUT: city =>las vegas, state => nv

INPUT: 90210
OUTPUT: zip => 90210

INPUT: 910 hamilton 90210
OUTPUT: number => 910, name => hamilton, zip => 90210

INPUT: 910 hamilton ave 90210
OUTPUT: number => 910, name => hamilton, suffix => avenue, zip => 90210

INPUT: 220 hamilton john glenn ca
OUTPUT: number => 220, name => hamilton, city => john glenn, state => ca

INPUT: 421 w 14th st john glenn ca
OUTPUT: number => 421, direction => west, name => 14th, suffix => street, city => john glenn, state => ca

INPUT: 220 hamilton john glenn
OUTPUT: number => 220, name => hamilton, city => john glenn

INPUT: 910 hamilton ave, campbell, ca
OUTPUT: number => 910, name => hamilton, suffix => avenue, city => campbell, state => ca

INPUT: w hamilton ln, john glenn, ca 90210
OUTPUT: direction => west, name => hamilton, suffix => lane, city => john glenn, state => ca, zip => 90210

INPUT: w hamilton ave john glenn ca 90210
OUTPUT: direction => west, name => hamilton, suffix => avenue, city => john glenn, state => ca, zip => 90210

"Cities and streets may contain more than one word."

"INPUT: 220 hamilton john glenn
OUTPUT: number => 220, name => hamilton, city => john glenn"

What you are asking for is very complicated because there are many possible variations. You would need a script that not only tries to extract those variations but also shows them back to the user, so they can confirm which one they meant.

It may be better to have different input fields for the different parts. The user can leave blank any they don't know, and you can build the SQL query from the data the user did enter. The page could use something like the code snippet below for the address data entry.

Sorry this isn't the solution you are looking for.
<input name="address" value="421w hamilton ave">
<input name="city" value="john glenn">
<input name="state" value="ca">
<input name="zip" value="90210">


Most Valuable Expert 2011
Top Expert 2016

Commented:
I wonder if the database design is wrong for this application.  The USPS may validate some of these fields against its standards, but it carries the address as a single clear-text field in uppercase.  The only time I have ever needed to be aware of the street-type abbreviations (ST, RD, CT, etc.) is when I was creating the addresses myself.

Since this is not so much a question as a need for application development, let's try a different approach to be productive.  Please tell us what you're trying to accomplish and maybe we can suggest a well-known design pattern.

Best, ~Ray

Author

Commented:
Thanks for the responses, guys!

And I know this is a real teaser.

I am trying to do something similar to the Realtor.com address search.

I have the live search for the postal code and city worked out as they do, but when someone types in an address that does not match either, I need to construct the best search I can based on the info provided.

After determining that they have not typed in a matching zip or city, we need to interrogate each array item. Based on what we know about what the individual address items could look like, we apply a set of regexes against each of them and pop the matches from the array.

For example:

If the user typed 220 w summit 90210, the array components would first be tested against a zip code pattern.

$zipcode_pattern = '/^([0-9]{5})(-[0-9]{4})?$/';

Pop the zip code and then test:

1. State abbreviation (none)
2. Direction (pop the w and convert to west)
3. Street number (can only test for numeric, so pop 220)
4. and so on
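The steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names `parseAddress`, `expand`, `DIRS`, `SUFX`, and `STATES`, the shortened state list, and the `unresolved` key used when no suffix marks the boundary between street name and city.

```php
<?php
const DIRS = ['n' => 'north', 's' => 'south', 'e' => 'east', 'w' => 'west',
    'ne' => 'northeast', 'nw' => 'northwest', 'se' => 'southeast', 'sw' => 'southwest'];
const SUFX = ['st' => 'street', 'dr' => 'drive', 'ave' => 'avenue', 'rd' => 'road',
    'ct' => 'court', 'circle' => 'circle', 'ln' => 'lane', 'blvd' => 'boulevard'];
const STATES = ['ca', 'nv', 'va']; // assumption: replace with the full list

/** Expand an abbreviation, accept a full name as-is, or return null. */
function expand(string $t, array $map): ?string
{
    return $map[$t] ?? (in_array($t, $map, true) ? $t : null);
}

function parseAddress(string $input): array
{
    $tokens = preg_split('/[\s,]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];

    // 1. ZIP code: only tested at the end of the input.
    if ($tokens && preg_match('/^\d{5}(-\d{4})?$/', end($tokens))) {
        $out['zip'] = array_pop($tokens);
    }
    // 2. State abbreviation: also only tested at the end.
    if ($tokens && in_array(end($tokens), STATES, true)) {
        $out['state'] = array_pop($tokens);
    }
    // 3. Number and direction from the front, in either order.
    for ($i = 0; $i < 2 && $tokens; $i++) {
        if (!isset($out['number']) && ctype_digit($tokens[0])) {
            $out['number'] = array_shift($tokens);
        } elseif (!isset($out['direction']) && expand($tokens[0], DIRS) !== null) {
            $out['direction'] = expand(array_shift($tokens), DIRS);
        }
    }
    // 4. The first suffix after at least one name token splits name from city.
    foreach ($tokens as $i => $t) {
        if ($i > 0 && ($full = expand($t, SUFX)) !== null) {
            $out['name']   = implode(' ', array_slice($tokens, 0, $i));
            $out['suffix'] = $full;
            $city = array_slice($tokens, $i + 1);
            if ($city) {
                $out['city'] = implode(' ', $city);
            }
            return $out;
        }
    }
    // No suffix: the name/city boundary cannot be decided from the tokens alone.
    if ($tokens) {
        $out['unresolved'] = implode(' ', $tokens);
    }
    return $out;
}
```

Some inputs stay ambiguous under this scheme. With a full state list, a trailing "ct" would be read as Connecticut rather than court, and input such as "220 hamilton john glenn" needs a city lookup to split the name from the city.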


This was my line of thinking, and part of my question is: is this the best approach?

I found a couple of regex patterns that I thought would identify the sub-parts of the address, but no luck.  See below.



$full_address_pattern = '/^\s*((?:(?:\d+(?:\x20+\w+\.?)+(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|(?:General\x20+Delivery)|(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|(?:BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR))?)\,?\s+((?:(?:\d+(?:\x20+\w+\.?)+(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|(?:General\x20+Delivery)|(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|(?:BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR))?)?\,?\s+((?:[A-Za-z]+\x20*)+)\,\s+(A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\s+(\d+(?:-\d+)?)\s*$/';

$full_address_pattern1 = '/^(?n:(?(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\.\ Box\ \d{1,5}))\s{1,2}(?i:(?(((APT|B LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\x20\w{1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\.?)\s{1,2})?)(?[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\, \x20(?A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY])\x20(?(?!0{5})\d{5}(-\d {4})?))$/';


Thanks again.
Commented:
You may still be overworking the problem.  FWIW, "220 w summit 90210" did not return an address from the Google Geocoder, so I put in my address like this: "1446 Colleen 22101" and the resulting map is exactly accurate.

http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=1446+Colleen+22101&sll=39.137492,-77.194211&sspn=0.006333,0.014452&ie=UTF8&hq=&hnear=1446+Colleen+Ln,+McLean,+Fairfax,+Virginia+22101&t=h&z=17

So maybe a better design pattern would be to call the Google and Yahoo geocoders with the address string and see if they can normalize the address for you, as Google did with my abbreviated address.
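As a rough sketch of that idea, the helpers below build a request URL for the Google Geocoding web service and pull the normalized address out of its JSON response. The function names `geocodeUrl` and `normalizedAddress` are hypothetical, the API key is a placeholder you would supply yourself, and the HTTP call itself is left out.

```php
<?php
/** Build a Geocoding API request URL for a free-form address string. */
function geocodeUrl(string $address, string $apiKey): string
{
    return 'https://maps.googleapis.com/maps/api/geocode/json?'
        . http_build_query(['address' => $address, 'key' => $apiKey]);
}

/**
 * Extract the normalized address from a JSON response body.
 * Returns null when the response is not valid JSON, the status
 * is not "OK", or no result is present.
 */
function normalizedAddress(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || ($data['status'] ?? '') !== 'OK') {
        return null;
    }
    return $data['results'][0]['formatted_address'] ?? null;
}
```

The response body could be fetched with `file_get_contents(geocodeUrl($input, $key))` or cURL. A null result would then be the signal to fall back to your own parsing.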
