dsr1811

US Postal Address Parsing into Separate Fields

I need a PHP function or regular expression that finds the parts of a US postal address, preferably returned in an array keyed by field name, so I can feed them into a LIKE search in MySQL. See the examples of possible input and expected output below. This is a challenging problem. Thanks in advance. If you know of a better way to do this, by all means let me know. MySQL's boolean search looked interesting for this type of problem. Thanks again!

Possible input fields are: Number, Direction, Name, Suffix, City, State, Zip. The direction could come on either side of the number.

Case does not matter. Fields may be separated by commas or spaces.

Suffixes need to be validated, and abbreviations should be replaced by the full names:
STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD

Compass directions need to be validated, and abbreviations should be replaced by the full names:
W|West|SW|Southwest|NW|Northwest|S|South|E|East|SE|Southeast|NE|Northeast|N|North
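
One way to handle both validation and expansion is a lookup table per list. The sketch below is only an illustration: the constant and function names are made up, and the lowercase output is an assumption taken from the example outputs further down. A token that is not in the table is treated as invalid and returns null.

```php
<?php
// Hypothetical lookup tables built from the two lists above.
// Keys are lowercase so that case does not matter.
const ADDRESS_SUFFIXES = [
    'street' => 'street',       'st'   => 'street',
    'drive' => 'drive',         'dr'   => 'drive',
    'avenue' => 'avenue',       'ave'  => 'avenue',
    'road' => 'road',           'rd'   => 'road',
    'court' => 'court',         'ct'   => 'court',
    'circle' => 'circle',
    'lane' => 'lane',           'ln'   => 'lane',
    'boulevard' => 'boulevard', 'blvd' => 'boulevard',
];

const ADDRESS_DIRECTIONS = [
    'n' => 'north',      'north' => 'north',
    's' => 'south',      'south' => 'south',
    'e' => 'east',       'east' => 'east',
    'w' => 'west',       'west' => 'west',
    'ne' => 'northeast', 'northeast' => 'northeast',
    'nw' => 'northwest', 'northwest' => 'northwest',
    'se' => 'southeast', 'southeast' => 'southeast',
    'sw' => 'southwest', 'southwest' => 'southwest',
];

// Returns the full suffix name, or null when the token is not a known suffix.
function normalizeSuffix(string $token): ?string
{
    $key = strtolower(rtrim(trim($token), '.'));
    return ADDRESS_SUFFIXES[$key] ?? null;
}

// Returns the full direction name, or null when the token is not a direction.
function normalizeDirection(string $token): ?string
{
    $key = strtolower(rtrim(trim($token), '.'));
    return ADDRESS_DIRECTIONS[$key] ?? null;
}
```

For example, normalizeSuffix('Ave') gives 'avenue' and normalizeDirection('SW') gives 'southwest', while normalizeSuffix('hamilton') gives null.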

Cities and streets may contain more than one word.

Currently the user input is turned into an array.

The output should also be an array, which I will then feed to a class that creates the MySQL 'AND' logic for the fields.

array(
      'number'    => 421,
      'direction' => 'west',
      'city'      => 'john glenn',
      'state'     => 'ca',
)

Example of the solution array applied to a query:

WHERE number like '%421%' AND direction like '%west%' AND city like '%john glenn%' AND state like 'ca'
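
A rough sketch of how such an array could be turned into that WHERE clause follows. It is not the class mentioned above; buildWhere is a made-up name, and it assumes the column names match the array keys. It uses placeholders instead of quoting the values into the SQL, and it follows the example query in leaving state without wildcards.

```php
<?php
// Hypothetical helper: turns the parsed parts into "col LIKE ? AND ..." plus
// the list of values to bind. Column names are assumed to equal the keys.
function buildWhere(array $parts): array
{
    $allowed = ['number', 'direction', 'name', 'suffix', 'city', 'state', 'zip'];
    $clauses = [];
    $params  = [];

    foreach ($parts as $field => $value) {
        // Only known columns may reach the SQL text.
        if (!in_array($field, $allowed, true)) {
            continue;
        }
        // Escape LIKE wildcards the user may have typed (backslash first).
        $value = str_replace(['\\', '%', '_'], ['\\\\', '\\%', '\\_'], (string) $value);

        $clauses[] = "$field LIKE ?";
        // State is matched without wildcards, as in the example query.
        $params[] = ($field === 'state') ? $value : '%' . $value . '%';
    }

    // With no usable parts the clause is an empty string; the caller decides
    // what to do in that case.
    return [implode(' AND ', $clauses), $params];
}

[$where, $params] = buildWhere([
    'number'    => 421,
    'direction' => 'west',
    'city'      => 'john glenn',
    'state'     => 'ca',
]);
// $where  is "number LIKE ? AND direction LIKE ? AND city LIKE ? AND state LIKE ?"
// $params is ['%421%', '%west%', '%john glenn%', 'ca']
```

With PDO, the pair could then be passed to prepare() and execute(), for example against a hypothetical listings table.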

I am currently creating an array from the input and separating it as below:

$addressParts = explode(",", str_replace(" ",",",$PostAddr));
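
Note that the line above produces empty elements whenever a comma is followed by a space: "las vegas, nv" becomes "las,vegas,,nv" and then four elements, one of them empty. A possible alternative, shown only as a sketch with a made-up function name, is to split on any run of commas and whitespace in one step:

```php
<?php
// Hypothetical helper: split on runs of commas and/or whitespace, drop empty
// pieces, and lowercase everything since case does not matter.
function tokenizeAddress(string $input): array
{
    return preg_split('/[\s,]+/', strtolower(trim($input)), -1, PREG_SPLIT_NO_EMPTY);
}

$addressParts = tokenizeAddress('Las Vegas, NV');
// $addressParts is ['las', 'vegas', 'nv']
```

This throws away the commas, which do carry some information about where one field ends. If that matters, the input could be split on commas first and each piece split on spaces afterwards.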


INPUT: las vegas, nv
OUTPUT: city =>las vegas, state => nv

INPUT: 90210
OUTPUT: zip => 90210

INPUT: 910 hamilton 90210
OUTPUT: number => 910, name => hamilton, zip => 90210

INPUT: 910 hamilton ave 90210
OUTPUT: number => 910, name => hamilton, suffix => avenue, zip => 90210

INPUT: 220 hamilton john glenn ca
OUTPUT: number => 220, name => hamilton, city => john glenn, state => ca

INPUT: 421 w 14th st john glenn ca
OUTPUT: number => 421, direction => west, name => 14th, suffix => street, city => john glenn, state => ca

INPUT: 220 hamilton john glenn
OUTPUT: number => 220, name => hamilton, city => john glenn

INPUT: 910 hamilton ave, campbell, ca
OUTPUT: number => 910, name => hamilton, suffix => avenue, city => campbell, state => ca

INPUT: w hamilton ln, john glenn, ca 90210
OUTPUT: direction => west, name => hamilton, suffix => lane, city => john glenn, state => ca, zip => 90210

INPUT: w hamilton ave john glenn ca 90210
OUTPUT: direction => west, name => hamilton, suffix => avenue, city => john glenn, state => ca, zip => 90210

ASKER CERTIFIED SOLUTION
SleepinDevil

This solution is only available to members.
I wonder if the database design is wrong for this application.  The USPS may validate some of these fields against its standards, but it carries the address as a single clear-text field in uppercase.  The only place I have ever needed to be aware of the street-type abbreviations (ST, RD, CT, etc.) is when I was creating the addresses myself.

Since this is not a question, so much as a need for application development, let's try a different approach to be productive.  Please tell us what you're trying to accomplish and maybe we can suggest a well-known design pattern.

Best, ~Ray
dsr1811

ASKER

Thanks for the response, guys!

And I know this is a real teaser.

I am trying to do something similar to the Realtor.com address search.

I have the live search for the postal code and city worked out as they do, but when someone types in an address that does not match either, I need to construct the best search I can based on the info provided.

After determining that they have not typed in a matching zip or city first, we know we need to interrogate each array item. Based on what we know about what the individual address items could look like, we apply a set of regexes against each of them and pop them from the array.

For example:

If the user types 220 w summit 90210, the array components would first be tested against a zip code pattern.

$zipcode_pattern = '/^([0-9]{5})(-[0-9]{4})?$/';

Pop the zip code and then test:

1. State abbreviation (none)
2. Direction (pop the w and convert to west)
3. Street number (can only test for numeric, so pop 220)
4. and so on
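
The steps above can be sketched roughly as follows. This is only one possible reading of the idea, not a finished parser: parseAddress is a made-up name, and splitting the street name from the city when there is no suffix is a guess (the first leftover word is taken as the name whenever a number or direction was found). A lookup against the known cities from the live search would be more reliable.

```php
<?php
// Sketch of the "test and pop" idea. Function name and heuristics are assumptions.
function parseAddress(string $input): array
{
    $suffixes = [
        'street' => 'street', 'st' => 'street', 'drive' => 'drive', 'dr' => 'drive',
        'avenue' => 'avenue', 'ave' => 'avenue', 'road' => 'road', 'rd' => 'road',
        'court' => 'court', 'ct' => 'court', 'circle' => 'circle',
        'lane' => 'lane', 'ln' => 'lane', 'boulevard' => 'boulevard', 'blvd' => 'boulevard',
    ];
    $directions = [
        'n' => 'north', 'north' => 'north', 's' => 'south', 'south' => 'south',
        'e' => 'east', 'east' => 'east', 'w' => 'west', 'west' => 'west',
        'ne' => 'northeast', 'northeast' => 'northeast',
        'nw' => 'northwest', 'northwest' => 'northwest',
        'se' => 'southeast', 'southeast' => 'southeast',
        'sw' => 'southwest', 'southwest' => 'southwest',
    ];
    $states = explode(' ', 'al ak az ar ca co ct de fl ga hi id il in ia ks ky la me md '
        . 'ma mi mn ms mo mt ne nv nh nj nm ny nc nd oh ok or pa ri sc sd tn tx ut vt va wa wv wi wy dc');

    $tokens = preg_split('/[\s,]+/', strtolower(trim($input)), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];

    // Zip code: only looked for at the end of the input.
    if ($tokens && preg_match('/^[0-9]{5}(-[0-9]{4})?$/', end($tokens))) {
        $out['zip'] = array_pop($tokens);
    }
    // State abbreviation: the last remaining token.
    if ($tokens && in_array(end($tokens), $states, true)) {
        $out['state'] = array_pop($tokens);
    }
    // Street number and direction at the front, in either order.
    for ($i = 0; $i < 2 && $tokens; $i++) {
        $first = $tokens[0];
        if (!isset($out['number']) && ctype_digit($first)) {
            $out['number'] = array_shift($tokens);
        } elseif (!isset($out['direction']) && isset($directions[$first])) {
            $out['direction'] = $directions[array_shift($tokens)];
        } else {
            break;
        }
    }
    // Suffix: the first suffix word that has at least one word before it.
    $suffixAt = null;
    foreach ($tokens as $i => $token) {
        if ($i > 0 && isset($suffixes[rtrim($token, '.')])) {
            $suffixAt = $i;
            break;
        }
    }
    if ($suffixAt !== null) {
        $out['name']   = implode(' ', array_slice($tokens, 0, $suffixAt));
        $out['suffix'] = $suffixes[rtrim($tokens[$suffixAt], '.')];
        $city = array_slice($tokens, $suffixAt + 1);
    } elseif ($tokens && (isset($out['number']) || isset($out['direction']))) {
        // Guess: one-word street name, the rest is the city.
        $out['name'] = array_shift($tokens);
        $city = $tokens;
    } else {
        $city = $tokens;
    }
    if ($city) {
        $out['city'] = implode(' ', $city);
    }
    return $out;
}
```

This reproduces the ten example inputs from the question, but it is easy to break. A multi-word street name without a suffix, a city that starts with a direction such as "north las vegas", or a trailing "ct" (Court or Connecticut) and "ne" (Northeast or Nebraska) are all ambiguous without more context.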


This was my line of thinking, and maybe part of my question is: is this the best approach?

I found a couple of regex patterns that I thought would ascertain the sub-parts of the address, but no luck.  See below.



$full_address_pattern = '/^\s*((?:(?:\d+(?:\x20+\w+\.?)+(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|(?:General\x20+Delivery)|(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|(?:BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR))?)\,?\s+((?:(?:\d+(?:\x20+\w+\.?)+(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|(?:General\x20+Delivery)|(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|(?:BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR))?)?\,?\s+((?:[A-Za-z]+\x20*)+)\,\s+(A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\s+(\d+(?:-\d+)?)\s*$/';

$full_address_pattern1 = '/^(?n:(?(\d{1,5}(\ 1\/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\.\ Box\ \d{1,5}))\s{1,2}(?i:(?(((APT|B LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\x20\w{1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\.?)\s{1,2})?)(?[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2})\, \x20(?A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY])\x20(?(?!0{5})\d{5}(-\d {4})?))$/';
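
Some likely reasons the first pattern finds nothing on the example inputs: it has no /i flag, so a lowercase state such as "ca" cannot match; it requires a leading street part, a comma before the state, and a trailing zip, so partial inputs fail; and in the suffix group the \x20+ belongs only to the STREET alternative. The short patterns below are simplified stand-ins, not the full pattern, and are only meant to show the last two points in isolation.

```php
<?php
// Simplified stand-ins for the suffix group; not the full pattern above.
// In $loose the "\x20+" is part of the first alternative only, so the other
// suffixes must follow the street name with no space in between.
$loose = '/^\d+ \w+(?:\x20+STREET|ST|AVENUE|AVE)$/';

// In $tight the space is outside the group and /i makes case irrelevant.
$tight = '/^\d+ \w+\x20+(?:STREET|ST|AVENUE|AVE)$/i';

var_dump(preg_match($loose, '910 hamilton STREET')); // int(1)
var_dump(preg_match($loose, '910 hamilton AVE'));    // int(0)
var_dump(preg_match($loose, '910 hamiltonAVE'));     // int(1)
var_dump(preg_match($tight, '910 hamilton ave'));    // int(1)
```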


Thanks again.