Solved

sort countries and cities from delimited file

Posted on 2002-03-13
34
703 Views
Last Modified: 2012-06-27
I have a large text file which has over 26,000 records. They are " ~ " seperated. If I import it in excel, I get number of columns depending upon the way I select import option ( like last name, first name, address etc). Now I want that my code could seperate the adress( country, city and zip) from that delimited text file. Once its done it could be exported to SQL server.
The records don't have consistancy. Some times coulmn two has address and along with it city is written, while some times in coulmn three its given city and country. I need to devise an algorithm, so that once its run the recored are seperated and and stored in SQL server in respected fields(like name, city, coutry and post code).
It should recognize if its FL then it florida and its i USA(like stuff.
Hope some one will interact with me and we can come up with a solution
0
Comment
Question by:fmufti
  • 16
  • 11
  • 4
  • +2
34 Comments
 
LVL 4

Expert Comment

by:trkcorp
Comment Utility
Code your brains out & pray.
0
 
LVL 22

Expert Comment

by:rspahitz
Comment Utility
Are you saying that you're able to get Excel to parse the data?  If so, what's the problem?  That the source file is bad and causes the parse to put things in the wrong columns?  Then either fix the source file or create an "intelligence" function to examine columns to decide when something doesn't look right, and prompt to shift it left or right one cell.

0
 
LVL 44

Expert Comment

by:bruintje
Comment Utility
Hi fmufi, this will take some time i guess but it sounds like regular expressions are hte way to explore since it seems like each record isn't that consistent only in character or number sequence?

for a start
http://www.freedownloadscenter.com/Programming/Visual_Basic_Tools/9626.html

:O)Bruintje
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Post first few lines from the file:

Assume you can connect to sql using ADO



CN.Open ' open a connection

' Create a table
SQL="Create Table AddressData( " + _
   "NameField varchar(50)," + _
   "Address1 varchar(50)," + _
   "Address2 varchar(50)" + _
   ")"
   
CN.Execute SQL

' Open Source data
dim wlfn as long

wlfn=freefile
open "c:\youfile.txt" for input as #wlfn
redim l$
reim Flds(0) as string
' keep reading until end of file
do while not eof(wlfn)
  Line Input #wlfn, l$

  ' extract the data into an array
  flds=Split(l$,"~")
  redim preserve fld(maxfields) ' make sure each record has all of the fields even if empty

  ' fld(0) now hold first field, 1 then next,etc

  SQL="Insert Into  AddressData (NameField,Address1,Address2,etc.) Values("

  for c=0 to ubound(fld)
    if c>0 then sql=sql+", "
    sql=sql+"'"+fld(c)+"'"
  next c
  sql=sql+")"
 
  ' Save the data into the database
  cn.execute sql
loop

close wlfn

You can now reacall the database using a select statement

sql="Select * from AddressData where State='NY'"

Set RS= New ADODB.RecordSet
RS.Open SQL, CN

If you need any more help....
 
0
 

Author Comment

by:fmufti
Comment Utility
Please try to understand. Actually once the data is fed in it is fed in a form which has three fields for address. A user can write 621, Durand lane, California, USAor he/she can write 281 Durand lane, CA or even 621, Durand Lane, USA. Either in one address field or in three fields, its upto him/her. The data is then stored as delimited text file. Now my job is to seperate the data such that it could be stored in a database with City in cit field, Country in country filed. So now tell me how to proceed???
0
 
LVL 44

Expert Comment

by:bruintje
Comment Utility
ouch can't you just mold the frontend so entry would be more consistent?
0
 
LVL 22

Expert Comment

by:rspahitz
Comment Utility
Ah...so the records are separated with "~" but not the fields.

Adding intelligence into a parsing routine is always tough.

Here's the catch-22:
The easiest way to parse is to match things against records in a database, but since you're using the data to build the database, you don't have those records.

I'd suggest starting by parsing based on commas, then examine each part and compare it against certain known things (like states and countries) then put the "proposed" records into a temporary table to be reviewed.  Anything that passes the review can be updated; other things will need to be modified.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Or simple method to Get the file into excel:

wlfn=freefile
open "c:\youfile.txt" for input as #wlfn
clipboard.clear
' read data
' replace ~ with tabs
' place on clipboard
clipboard.settext Replace(Input(lof(wlfn),#wlfn),"~",chr(9))


msgbox "The file has been place on the clipboard, now paste into excel."


If you show us the format of the file, just the top 3 lines, we can do a better job....forever inthedark


0
 

Author Comment

by:fmufti
Comment Utility
The file data is sort of confidential, I can give you my email and can send you part of the file in my reply.
fmufti@softhome.net
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
There is simple way to pasrse the data to do what you want but I realy want to see and example of the top of the file.

0
 
LVL 22

Expert Comment

by:rspahitz
Comment Utility
If it's confidential, just XXX out the critical parts.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
I have done this before but first create a table in the format you want it.

The trick is to save the orriginal data in the file too and a CheckIt CHAR(1) flag to use when checking the data.

So when you save the data if the code detects strange format, e.g. state or zipcode missing the record can be flagged for visual inspection

Checkit = "0" = Programmer thinks there is a problem
="1" = Programmer thinks the record is ok
="9" = Has been verified by a manual visual inspection

After the data has been saved the inspection process can begin just sort records sequence by the field checkit.

0
 

Author Comment

by:fmufti
Comment Utility
Nick email me and I have fwd him the file. I cann't post it on the web. Hope ur emails are not that confidential :)
0
 

Author Comment

by:fmufti
Comment Utility
The major issue which I feel is not of seperating the data rather devising an algorithm to break it into city and country.
0
 
LVL 22

Expert Comment

by:rspahitz
Comment Utility
Well, if the datastream consists of the city and country, then the parsing part is easy:  pull off the last word, locate it in the database.  If found the rest was the city; if not found, pull of the second-to-last word and compare both against the database, etc.   Do this because some wiseacre may have entered: Daytona Beach Florida United States of America

Regardless, parsing of freeform text WILL require an error report.  Follow my suggestion and that of inthedark above and you will save yourself a lot of misery.

For example, what if someone puts: Daytoona Bleach Forida, United State of Ameriga?  (People are lousy spellers, and even worse on a computer keyboard!)
0
 

Author Comment

by:fmufti
Comment Utility
If u could see the file things would be more clearer to u.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
How to parse free format address data:

Step 1

Create a text file which is a list of the full state names and also the standard and common and common mispellings.
Load the file into a string field called states.
New York may be "NY", "N.Y." etc.

Also load a list of countries

Step 2

Extract each word from the record so you get a word list.

Words=Split(Record," ")

Step 3

Profile each word to see what type of data it is.  I have pasted a profile example.

A word can be letters or numbers or a mix so the rpfile just returns A if a word or N if a number or a uk post code would look like AN and the next field looks like NA. A US zip code would be a simple N.


' delare profile like redim Profile(0) as string

redim Profile(Ubound(Words))

' add one extra dummy word to make code simple
redim Words(Ubound(Words)+1)

checkit="8"
Statesc=0
Zips=0
ZipFound=-1
StateFound=-1
CountryFound=-1
Countryc=0



For c = 0 to Ubound(Words)
    Profile(c)=StringProfile(Words(c))
Next c

For c = 0 to Ubound(Words)

    if Profile(c)="N" and len(Words(c))>5 ' may be a zip code
      ZipFound=c
      Zips = Zips+1
      Profile(c)="Z"
    else  
       if instr(States, Words(c))>0 then
          Profile(c)="S"
          Statesc=Statesc+1
          StateFound=C
          ' Now see if the state is a double word state e.g.
           if instr(States, Words(c)+" "+Words(c+1))>0 then
              ' join words
              Words(c)=Words(c)+" "+Words(c+1)
              profile(c+1)="x" ' ignore then next word
           end if
       elseif instr(Countries, Words(c))>0 then
          Countryc=Countryc+1
          Profile(c)="C"
          CountryFound=c
       end if
    end if
           
Next c

So you can now check the flags so see if record looks good

If StatesC<>1 or Zips<>1 Then
   Checkit="0"
End if

You can then use the positions of zip codes etc to deduce other info.

So when saving in the database you can now kida be certain about some things.

Zip=Words(ZipFound)
State=Words(StateFound)
Country=Words(CountryFound)

Does this help?



Function StringProfile(TextData) As String

Dim sc As Long
Dim sp$, c$

sp$ = UCase$(CStr(TextData))

StringProfile = Space$(Len(TextData))

For sc = 1 To Len(TextData)
    c$ = Mid$(sp$, sc, 1)
    If c$ >= "0" And c$ <= "9" Then
        Mid$(StringProfile, sc, 1) = "N"
    ElseIf c$ >= "A" And c$ <= "Z" Then
        Mid$(StringProfile, sc, 1) = "A"
    Else
        Mid$(StringProfile, sc, 1) = c$
   End If
   
Next

End Function

0
Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

 
LVL 17

Expert Comment

by:inthedark
Comment Utility
The file has a fixed format but you are right that the address is free format.  

Record: 0

000
001
002 NAMAMA AJAJHAJH/AAAHGHAG AHGGJHGH
003 dsaasdsa sd saddsa
004
005 1234 AAAAA RD LAKE WORTH FL 12345-1234 }
006                                        }Address fields
007                                        }
008
009
010
011
012
013
014
015
016
017 A JIM FRANK
018
019
020
021
022
023
024 sdasdsadsa
025
026 FLORIDA                   } this is mainly right
027 asdsdasdsa
028
029


The above came from:


Private Sub Command1_Click()

Dim GF As New zGF ' my class for read a file
Dim d$
ReDim recs(0) As String
Dim df
Dim w$

d$ = GF.ReadFile("C:\file_del.txt")
recs = Split(d$, vbCrLf)

ReDim flds(0) As String
Dim fc As Long
Dim rc As Long

For rc = 0 To UBound(recs)
    w$ = w$ + "Record: " + CStr(rc) + vbCrLf
    w$ = w$ + vbCrLf
    flds = Split(recs(rc), "~")
    For fc = 0 To UBound(flds)
        w$ = w$ + Format(fc, "000") + " " + flds(fc) + vbCrLf
    Next fc
    w$ = w$ + vbCrLf
Next rc

Clipboard.Clear

Clipboard.SetText w$

MsgBox "Done"


End Sub
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
There wre always 29 fields in each record so the address part can be located without problem.

Fields 5-9 seem to make up the address.

So you could use the code example posted - taking words from fields 5 to 9.

But if you recognoze a word or 2 or 3 as a state use rasphitz idea to lookup the wor you recognised as a state and convert into a common format. e.g. FL or F.L. becomes Florida.

Do you see any light at the end of the tunnel?
0
 

Author Comment

by:fmufti
Comment Utility
I am willing to give more points for this question if I am able to get answer( posting the again and awarding all the points to one who can help me provide the soltution). But the thing is how to start with and secondly will I be able to complete this project or its impossible ?????. The project is not lengthy but its the alogorithm that will take time.
inthedark Can u send me the code so that I can start working on it, atleast we have to start somewhere.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
First you need to build up a database of states that will take you about an hour:

http://www.genealogy.org/~st-clair/counties/welcome.html

For example:

Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, North Carolina, North Dakota, Nevada, New Hampshire, New Jersey, New Mexico, New York, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming

But we ned the data in a meaning full manner. For example:

[States]
State1=Alabama
State2=Alaska
State3=Arizona
etc.

[Alabama]
Abreviation=AL
Counties1=Autauga, Baldwin, Barbour, Bibb, Blount
Counties2=Bullock, Butler, Calhoun, Chambers, Cherokee
Counties3=Chilton, Choctaw, Clarke, Clay, Cleburne
Counties4=Coffee, Colbert, Conecuh, Coosa, Covington
Counties5=Crenshaw, Cullman, Dale, Dallas, Dekalb, Elmore
Counties6=Escambia, Etowah, Fayette, Franklin, Geneva
Counties7=Greene, Hale, Henry, Houston, Jackson
Counties8=Jefferson, Lamar, Lauderdale, Lawrence, Lee
Counties9=Limestone, Lowndes, Macon, Madison, Marengo
Counties10=Marion, Marshall, Mobile, Monroe, Montgomery
Counties11=Morgan, Perry, Pickens, Pike, Randolph
Counties12=Russell, Shelby, St. Clair, Sumter, Talladega
Counties13=Tallapoosa, Tuscaloosa, Walker, Washington
Counties14=Wilcox, Winston


[Alaska]


etc.

It took about 2 minutes to do above so it should not take long.







0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
This looks interesting: http://www.qsl.net/ok1rr/uscty.html
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
So I download the MDB above and also used this page:

http://www.infoplease.com/ipa/A0110468.html

for the abbreviations:

Save into a file:

Alabama,  Ala., AL
Alaska,  Alaska, AK
Arizona,  Ariz., AZ
Arkansas,  Ark., AR
California,  Calif., CA
Colorado,  Colo., CO
Connecticut,  Conn., CT
Delaware,  Del., DE
Dist. of Columbia,  D.C., DC
District of Columbia,  D.C., DC
Florida,  Fla., FL
Georgia,  Ga., GA
Guam,  Guam, GU
Hawaii,  Hawaii, HI
Idaho,  Idaho, ID
Illinois,  Ill., IL
Indiana,  Ind., IN
Iowa,  Iowa, IA
Kansas,  Kans., KS
Kentucky,  Ky., KY
Louisiana,  La., LA
Maine,  Maine, ME
Maryland,  Md., MD
Massachusetts,  Mass., MA
Michigan,  Mich., MI
Minnesota,  Minn., MN
Mississippi,  Miss., MS
Missouri,  Mo., MO
Montana,  Mont., MT
Nebraska,  Nebr., NE
Nevada,  Nev., NV
New Hampshire,  N.H., NH
New Jersey,  N.J., NJ
New Mexico,  N.M., NM
New York,  N.Y., NY
North Carolina,  N.C., NC
North Dakota,  N.D., ND
Ohio,  Ohio, OH
Oklahoma,  Okla., OK
Oregon,  Ore., OR
Pennsylvania,  Pa., PA
Puerto Rico,  P.R., PR
Rhode Island,  R.I., RI
South Carolina,  S.C., SC
South Dakota,  S.D., SD
Tennessee,  Tenn., TN
Texas,  Tex., TX
Utah,  Utah, UT
Vermont,  Vt., VT
Virginia,  Va., VA
Virgin, Islands,  V.I. VI
Washington,  Wash., WA
West Virginia,  W.Va., WV
Wisconsin  Wis. WI
Wyoming,  Wyo., WY

0
 

Author Comment

by:fmufti
Comment Utility
Ok lets first start by seperating the data in columns i.e to convert delimited into number of column(the work that get it from excel). Can u mail me that code. Once its seperated in to columns then we can move to the next step of filtering.
Code should get the txt file and save with formated columns(just like we get it in excel) then we'll move on, as we have to remove those coumns as well which are just blank will be created as there is space. See if you can do something on it.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Countries:

Afghanistan
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bhutan
Bolivia
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Terr.
Brunei
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos(Keeling) Islands
Cocos
Columbia
Comoros
Congo
Cook Islands
Costa Rica
Croatia
Cuba
Cyprus
Czech Republic
Denmark
Djibouti
Dominica
Dominican Republic
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Faeroe (Faroe) Islands
Faeroe Islands
Faroe Islands
Falkland Islands
Fiji
Finland
France
French Guyana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Heard and McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea Democratic People's Republic Of
South Korea
North Korea
Korea Republic Of
Republic Of Korea
Kuwait
Kyrgyzstan
Laos Peoples Dem. Rep.
Laos
Latvia
Lebanon
Lesotho
Liberia
Libyan Arab Jamahiriya
Liechtenstein
Lithuania
Luxembourg
Macao
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia
Moldova
Monaco
Mongolia
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Island
Northern Island
Norway
Oman
Pakistan
Palau
Panama
Papuan W Guinea
Paraguay
Peru
Philippines
Pitcairn Island
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Rwanda
S. Georgia and the S. Sandwich Island
Samoa
San Marino
Saotome and Principe
Saudi Arabia
Senegal
Seychelles
Sierra Leone
Singapore
Slovakia
Slovenia
Solomon Islands
Somalia
South Africa
Spain
Sri Lanka
St .Pierre and Miquelon
St. Helena
St. Lucia
St.Kitts-Nevis-Anguilla
St.Vincent and the Grenadines
Sudan
Suriname
Svalbard and Janmayen Islands
Swaziland
Sweden
Switzerland
Syrian Arab Republic
Taiwan
Tajikistan
Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad and Tobago
Tunisia
Turkey
Turkmenistan
Turks and Caicos Islands
Tuvalu
Uganda
Ukraine SSR
United Arab Emirates
United Kingdom
England
Scotland
Wales
Ireland
N.I.
UK
U. K.
United States
USA
U.S.A.
U.S.A
USA.
Uruguay
Uzbekistan
Vanuatu
Vatican City-State
Venezuela
Vietnam
Virgin Islands (British)
British Virgin Islands
Virgin Islands (U.S.)
Virgin Islands
Wallis And Futuna Islands
Western Sahara
Yemen
Yugoslavia
Zaire
Zambia
Zimbabwe
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Putting it all together:

Option Explicit

Function ReadFile(FileName As String) As String

' Read whole file.

Dim wlfn As Long

wlfn = FreeFile
Open FileName For Input As #wlfn
ReadFile = Input(LOF(wlfn), #wlfn)
Close wlfn


End Function

Private Sub Command1_Click()

' Create a project reference o DAO

' Record processing
ReDim profile(0) As String
ReDim words(0) As String
Dim l$
ReDim Flds(0) As String
Dim wlfn As Long
Dim Recs As Long


' support databases
Dim AllCountries As String
ReDim Countries(0) As String
ReDim States(0) As String
ReDim Counties(0) As String ' for each state
Dim AllStates As String
Dim ZipDB As DAO.Database
Dim ZipRS As DAO.Recordset

' Destination database
Dim fc As Long
Dim DB As DAO.Database
Dim WS As DAO.Workspace
Dim Fld As DAO.Field
Dim TD As DAO.TableDef
Dim RS As DAO.Recordset
Dim sql As String

Dim Address As String
   
Const maxfields = 29
Dim Ok


' Record stuff
Dim CheckIt As String
Dim Confidence As Long
Dim County As String
Dim Country As String
Dim State As String
Dim StateCode As String
Dim Zip As String
Dim cc As Long
   
Dim Zips As Long
Dim StatesC As Long
Dim StateFound As Long
Dim Countryc As Long
Dim CountryFound  As Long
Dim ZipFound As Long


'===========================================
' Contents
' 1 Create Destination Table
' 2 Load Country,State and open Zip database
' 3 Open/process records
' 4 Close up
' 5 Prase Data



'===========================================
' 1 Create Destination Table

' Create database for destination file

' First check if re-run/debug mode
If Len(Dir("C:\dest.mdb")) Then
    Ok = MsgBox("Delete old database", vbYesNo + vbExclamation, "Confirm Delete")
   
    If Ok = vbNo Then
        End
    End If
   
    Kill "c:\dest.mdb"
   
End If

Screen.MousePointer = vbHourglass
DoEvents

' Create a new database
Set WS = Workspaces(0)
' the syntax of this may change dependent what access version you have:
Set DB = WS.CreateDatabase("c:\dest.mdb", dbLangGeneral, DAO.DatabaseTypeEnum.dbVersion40)

' Create table
Set TD = DB.CreateTableDef("Addresses")


' Create fields

' primary key
Set Fld = TD.CreateField("ID", DAO.DataTypeEnum.dbLong)
Fld.Attributes = DAO.FieldAttributeEnum.dbAutoIncrField
TD.Fields.Append Fld

' Status of record
Set Fld = TD.CreateField("Checkit", DAO.DataTypeEnum.dbText, 1)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Confidence Factor
Set Fld = TD.CreateField("Confidence", DAO.DataTypeEnum.dbLong)
TD.Fields.Append Fld

' Zip Code
Set Fld = TD.CreateField("Zip", DAO.DataTypeEnum.dbText, 20)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' State
Set Fld = TD.CreateField("State", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Statecode
Set Fld = TD.CreateField("StateCode", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Country
Set Fld = TD.CreateField("Country", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' County
Set Fld = TD.CreateField("County", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Address
Set Fld = TD.CreateField("Address", DAO.DataTypeEnum.dbText, 255)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Orriginal Fields
For fc = 0 To maxfields
    Set Fld = TD.CreateField("F" + Format(fc, "00"), DAO.DataTypeEnum.dbText, 255)
    Fld.AllowZeroLength = True
    TD.Fields.Append Fld
    DoEvents ' always in a loop
Next fc

Set Fld = Nothing

' Save new table spec.
DB.TableDefs.Append TD

Set TD = Nothing

DoEvents


' Open open the new table to receive data
Set RS = DB.OpenRecordset("Select * from Addresses", dbOpenDynaset)

'===========================================
' 2 Load Country,State and open Zip database
' 3 Open/process records

' read data
AllCountries = UCase(ReadFile("c:\countries.txt"))
Countries = Split(AllCountries, vbCrLf)
AllStates = UCase(ReadFile("c:\states.txt"))
States = Split(AllStates, vbCrLf)
ReDim statedetails(UBound(States))
For fc = 0 To UBound(States)
    statedetails(fc) = Split(States(fc), ",")
   
    ' StatesDetail(0)(0) = full name
    ' StatesDetail(0)(1) = Abbrev
    ' StatesDetail(0)(2) = letter code (StateCode)
   
    ' Make sure no silly spaces in data
    For cc = 0 To UBound(statedetails(fc))
        statedetails(fc)(cc) = Trim$(statedetails(fc)(cc))
        DoEvents

    Next cc
Next fc

' Get zip code database
Set ZipDB = WS.OpenDatabase("c:\uscounties.mdb")
' load in county details for each state
ReDim Counties(UBound(States))
For fc = 0 To UBound(States)
    Counties(fc) = ""
    'states(fc)
    sql = "SELECT uscounties.COUNTY  From uscounties" + _
    " Where (State='" + statedetails(fc)(1) + "');"
    Set ZipRS = ZipDB.OpenRecordset(sql, dbOpenSnapshot)
    Do While Not ZipRS.EOF
        If Len(Counties(fc)) > 0 Then
            Counties(fc) = Counties(fc) + " "
        End If
        Counties(fc) = Counties(fc) + ZipRS("County")
        ZipRS.MoveNext
    Loop
    ZipRS.Close
Next fc


DoEvents

sql = "SELECT uscounties.ZIP, uscounties.STATE, uscounties.COUNTY " + _
    " From uscounties" + _
    " ORDER BY uscounties.ZIP;"

Set ZipRS = ZipDB.OpenRecordset(sql, dbOpenSnapshot)

Screen.MousePointer = vbHourglass ' remake
DoEvents


'===========================================
' 3 Open/process records

' Open Source data
wlfn = FreeFile
Open "C:\file_del.txt" For Input As #wlfn

' keep reading until end of file
Do While Not EOF(wlfn)
    Line Input #wlfn, l$
   
   
    If Len(l$) > 0 Then
        Recs = Recs + 1
       
        ' extract the data into an array
        Flds = Split(l$, "~")
        ReDim Preserve Flds(maxfields) ' make sure each record has all of the fields even if empty
       
        ' Add the address fields together
        Address = ""
        For fc = 5 To 9
            If Len(Flds(fc)) > 0 Then
                If Len(Address) > 0 Then
                    Address = Address + " "
                End If
                Address = Address + Flds(fc)
            End If
        Next fc
       
        County = ""
        State = ""
        Country = ""
        StateCode = ""
        Zip = ""
        Confidence = 0
       
        GoSub ParseData
         
        ' save new record
       
        WS.BeginTrans
       
        RS.AddNew
        For fc = 0 To maxfields
            RS("F" + Format(fc, "00")) = Flds(fc)
        Next fc
       
        RS("Country") = Country
        RS("County") = County
        RS("State") = State
        RS("Statecode") = StateCode
        RS("Zip") = Zip
        RS("Checkit") = CheckIt
        RS("Confidence") = Confidence
        RS("Address") = Address
        RS.Update
       
        WS.CommitTrans ' free locks
       
    End If
    DoEvents
Loop

'===========================================
' 4  Close up

Close wlfn

RS.Close
ZipRS.Close
DB.Close
ZipDB.Close
WS.Close

Set RS = Nothing
Set ZipRS = Nothing
Set DB = Nothing
Set ZipDB = Nothing
Set WS = Nothing

Screen.MousePointer = vbDefault

MsgBox CStr(Recs) + " records imported!"

Exit Sub

'===========================================
' 5 Parse Data

ParseData:

' remove double spaces
Address = Replace(Address, Space$(2), Space$(1))
Address = Replace(Address, " S ", " SOUTH ")
Address = Replace(Address, " N ", " NORTH ")
Address = Replace(Address, " E ", " EAST ")
Address = Replace(Address, " W ", " WEST ")
Address = Replace(Address, " RD ", " ROAD ")


words = Split(Address, " ")

ReDim profile(UBound(words))

' add one extra dummy word to make code simple
ReDim Preserve words(UBound(words) + 1)

CheckIt = "0"
ZipFound = -1
StateFound = -1
CountryFound = -1
Zips = 0
StatesC = 0
Countryc = 0

For fc = 0 To UBound(words) - 1
   profile(fc) = StringProfile(words(fc), True)
Next fc


' first time parse find the obvious zip code
For fc = 0 To UBound(words) - 1

   Do
   
        If Len(words(fc)) = 0 Then Exit Do
       
        If (profile(fc) = "N" Or profile(fc) = "N-N") And Len(words(fc)) > 4 Then  ' may be a zip code
       
            Confidence = Confidence + 100
            ZipFound = fc
            Zips = Zips + 1
            Zip = words(fc)
            If profile(fc) = "N-N" Then
                Zip = Left(Zip, InStr(Zip, "-") - 1)
            End If
            profile(fc) = "Z"
           
            ' this bit could be a bit slow but who cars for a one-off job
            ZipRS.MoveFirst
            ZipRS.FindFirst "Zip = '" + Zip + "'"
            If Not ZipRS.NoMatch Then
                 Confidence = Confidence + 10000
                 County = ZipRS("County")
                 StateCode = ZipRS("State")
                 
                 GoSub GetState
                 If Len(State) Then
                    CheckIt = "8"
                    Confidence = Confidence + 10000
                 End If
            End If
        ElseIf InStr(AllStates, words(fc)) > 0 Then
        'ADDRESS
                profile(fc) = "S"
                StatesC = StatesC + 1
                StateFound = fc
                If Len(words(fc)) > 2 Then
                    Confidence = Confidence + 20
                End If
                ' Now see if the state is a double word state e.g.
               
                cc = fc
                Do
               
                   If InStr(AllStates, words(fc) + " " + words(cc + 1)) = 0 Then
                       Exit Do
                   End If
                   
                   ' join words
                   words(fc) = words(fc) + " " + words(cc + 1)
                   words(cc + 1) = ""
                   profile(cc + 1) = "x" ' ignore then next word
                   cc = cc + 1
                   
                Loop
           
               
        ElseIf Len(words(fc)) > 4 And InStr(AllCountries, words(fc)) > 0 Then
            ' this bit is a bit hazy
            Countryc = Countryc + 1
            profile(fc) = "C"
            Country = words(fc)
            CountryFound = fc
            Confidence = Confidence + 100
            cc = fc
            Do
           
                If InStr(AllCountries, words(fc) + " " + words(cc + 1)) = 0 Then
                    Exit Do
                End If
               
                ' join words
                words(fc) = words(fc) + " " + words(cc + 1)
                words(cc + 1) = ""
                profile(cc + 1) = "x" ' ignore then next word
                cc = cc + 1
                Country = words(fc)
            Loop
                       
            For cc = 0 To UBound(Countries)
                If Countries(cc) = words(fc) Then
                    Confidence = Confidence + 5000
                    Exit For
                End If
            Next cc
           
       
            End If
        Exit Do
   Loop
Next fc

If Len(Zip) = 0 And ZipFound >= 0 Then
    Zip = words(ZipFound)
End If
   
If Len(State) = 0 And StateFound >= 0 Then
    State = words(StateFound)
End If
If Len(Country) = 0 And CountryFound >= 0 Then
    Country = words(CountryFound)
End If
If Confidence < 100 And Len(State) > 0 Then
    If Len(State) < 3 Then
        State = State + "?"
    End If
End If
If Len(State) > 0 Then

    ' now get full name of state
    For fc = 0 To UBound(States)
    If InStr(States(fc), State) > 0 Then
        State = statedetails(fc)(0)
        If Len(StateCode) = 0 Then
            StateCode = statedetails(fc)(2)
        End If
        Exit For
    End If
    Next fc

End If

If CheckIt < "8" And Confidence > 100 Then
    CheckIt = "2"
End If
If Confidence > 10000 Then

Select Case Country
    Case Is = ""
        If Confidence > 10000 Then
            Country = "USA"
        End If
    Case Is = "U.S.A"
        Country = "USA"
    Case Is = "U.S.A."
        Country = "USA"
    Case Is = "U.S. OF A"
        Country = "USA"
    Case Is = "U.S. OF A."
        Country = "USA"
    Case Is = "UNITED STATES"
         Country = "USA"
   
    Case Is = "UNITED STATES OF AMERICA"
         Country = "USA"
    Case Is = "US"
         Country = "USA"
    Case Else
        If Confidence > 10000 Then
            CheckIt = "3"
        End If
    End Select
   
End If

Return


GetState:
    For cc = 0 To UBound(States)
        If StateCode = statedetails(cc)(2) Then
            State = statedetails(cc)(0)
            Exit For
        End If
    Next cc
Return

End Sub


Function StringProfile(TextData, Optional Summary As Boolean = False) As String

Dim sc As Long
Dim sp$, c$

sp$ = UCase$(CStr(TextData))

If Summary Then
    StringProfile = ""
Else
    StringProfile = Space$(Len(TextData))
End If

For sc = 1 To Len(TextData)
    c$ = Mid$(sp$, sc, 1)
    If c$ >= "0" And c$ <= "9" Then
        c$ = "N"
    ElseIf c$ >= "A" And c$ <= "Z" Then
        c$ = "A"
    End If
    If Summary Then
        If Len(StringProfile) = 0 Then
            StringProfile = c$
        Else
            If Right(StringProfile, 1) <> c$ Then
                StringProfile = StringProfile + c$
            End If
        End If
    Else
        Mid$(StringProfile, sc, 1) = c$
    End If
Next

End Function

0
 

Author Comment

by:fmufti
Comment Utility
I have yet to gone in depth of your code. However I would like to point out few things which I saw from the output.
1. The file has 26,000 records. So if the whole file is run it runs out of srcipt error ( however when I checked for 24 records it worked fine)
2. Once the country, county, state, Zip and city are extracted they should be elimited from the address block ( the column where the whole address is present). So that the rest,  that is left,  becomes the street address automatically.
3. We have to see why the code despite putting URAGUAY and CANADA as country puts DC columbia and Delaware as the States(probably have to find the list of all major cities of the the countries)
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Hi fmufti,

Do you want to learn VB or do you want a free job? Please choose?

Forever inthedark, eager to help, but with bills to pay too!
0
 

Author Comment

by:fmufti
Comment Utility
Hi
inthedark, I use to do programming in VB 1 1/2 years back so I could not come up with the right code immediately. Your help is definately very helpful. But a little more from you will provide me the launching pad to do next steps. So can you comment on my above comment(1,2,3)
Hope that wont bother you much.
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
Change the do line here marked **:


           Country = words(fc)
           CountryFound = fc
           Confidence = Confidence + 100
           cc = fc
           Do While cc+1<Ubound(Words) '**While was missing
                         
               If InStr(AllCountries, words(fc) + " " + words(cc + 1)) = 0 Then

Also change the area:


               StateFound = fc
               If Len(words(fc)) > 2 Then
                   Confidence = Confidence + 20
               End If
               ' Now see if the state is a double word state e.g.
               
               cc = fc
               While cc+1<Ubound(Words) '**While was missing

0
 
LVL 17

Accepted Solution

by:
inthedark earned 300 total points
Comment Utility
Replacement for command1

Private Sub Command1_Click()

' Create a project reference o DAO

' Record processing
ReDim profile(0) As String
ReDim words(0) As String
Dim l$
ReDim Flds(0) As String
Dim wlfn As Long
Dim Recs As Long


' support databases
Dim AllCountries As String
ReDim countries(0) As String
ReDim States(0) As String
ReDim Counties(0) As String ' for each state
Dim AllStates As String
Dim ZipDB As DAO.Database
Dim ZipRS As DAO.Recordset


' Destination database
Dim fc As Long
Dim DB As DAO.Database
Dim WS As DAO.Workspace
Dim Fld As DAO.Field
Dim TD As DAO.TableDef
Dim RS As DAO.Recordset
Dim sql As String

Dim FullZip As String

Dim Address As String
Dim NewAddress As String
   
Const maxfields = 29
Dim Ok


' Record stuff
Dim CheckIt As String
Dim Confidence As Long
Dim County As String
Dim Country As String
Dim State As String
Dim StateCode As String
Dim Zip As String
Dim cc As Long
Dim CurrentState As Long
Dim Zips As Long
Dim StatesC As Long
Dim StateFound As Long
Dim Countryc As Long
Dim CountryFound  As Long
Dim ZipFound As Long
Dim worddone As Boolean
Dim ccs As Long


'===========================================
' Contents
' 1 Create Destination Table
' 2 Load Country,State and open Zip database
' 3 Open/process records
' 4 Close up
' 5 Prase Data



'===========================================
' 1 Create Destination Table

' Create database for destination file

' First check if re-run/debug mode
If Len(Dir("C:\dest.mdb")) Then
    Ok = MsgBox("Delete old database", vbYesNo + vbExclamation, "Confirm Delete")
   
    If Ok = vbNo Then
        End
    End If
   
    Kill "c:\dest.mdb"
   
End If

Screen.MousePointer = vbHourglass
DoEvents

' Create a new database
Set WS = Workspaces(0)
' the syntax of this may change dependent what access version you have:
Set DB = WS.CreateDatabase("c:\dest.mdb", dbLangGeneral, DAO.DatabaseTypeEnum.dbVersion40)

' Create table
Set TD = DB.CreateTableDef("Addresses")


' Create fields

' primary key
Set Fld = TD.CreateField("ID", DAO.DataTypeEnum.dbLong)
Fld.Attributes = DAO.FieldAttributeEnum.dbAutoIncrField
TD.Fields.Append Fld

' Status of record
Set Fld = TD.CreateField("Checkit", DAO.DataTypeEnum.dbText, 1)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Confidence Factor
Set Fld = TD.CreateField("Confidence", DAO.DataTypeEnum.dbLong)
TD.Fields.Append Fld

' Zip Code
Set Fld = TD.CreateField("Zip", DAO.DataTypeEnum.dbText, 20)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' State
Set Fld = TD.CreateField("State", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Statecode
Set Fld = TD.CreateField("StateCode", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Country
Set Fld = TD.CreateField("Country", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' County
Set Fld = TD.CreateField("County", DAO.DataTypeEnum.dbText, 40)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' Address
Set Fld = TD.CreateField("Address", DAO.DataTypeEnum.dbText, 255)
Fld.AllowZeroLength = True
TD.Fields.Append Fld

' New Address
Set Fld = TD.CreateField("NewAddress", DAO.DataTypeEnum.dbText, 255)
Fld.AllowZeroLength = True
TD.Fields.Append Fld


' Orriginal Fields
For fc = 0 To maxfields
    Set Fld = TD.CreateField("F" + Format(fc, "00"), DAO.DataTypeEnum.dbText, 255)
    Fld.AllowZeroLength = True
    TD.Fields.Append Fld
    DoEvents ' always in a loop
Next fc

Set Fld = Nothing

' Save new table spec.
DB.TableDefs.Append TD

Set TD = Nothing

DoEvents


' Open open the new table to receive data
Set RS = DB.OpenRecordset("Select * from Addresses", dbOpenDynaset)

'===========================================
' 2 Load Country,State and open Zip database
' 3 Open/process records

' read data
AllCountries = UCase(ReadFile("c:\countries.txt"))
countries = Split(AllCountries, vbCrLf)
AllStates = UCase(ReadFile("c:\states.txt"))
States = Split(AllStates, vbCrLf)
ReDim statedetails(UBound(States))

For fc = 0 To UBound(States)

    statedetails(fc) = Split(States(fc), ",")
   
    ' StatesDetail(0)(0) = full name
    ' StatesDetail(0)(1) = Abbrev
    ' StatesDetail(0)(2) = letter code (StateCode)
   
    ' Make sure no silly spaces in data
    For cc = 0 To UBound(statedetails(fc))
        statedetails(fc)(cc) = Trim$(statedetails(fc)(cc))
        DoEvents

    Next cc
Next fc

' Get zip code database
Set ZipDB = WS.OpenDatabase("c:\uscounties.mdb")
' load in county details for each state
ReDim Counties(UBound(States))
ReDim CountiesInState(UBound(States))

For fc = 0 To UBound(States)
    Counties(fc) = ""
    'states(fc)
    sql = "SELECT uscounties.COUNTY  From uscounties" + _
    " Where (State='" + statedetails(fc)(1) + "');"
    Set ZipRS = ZipDB.OpenRecordset(sql, dbOpenSnapshot)
    Do While Not ZipRS.EOF
        If Len(Counties(fc)) > 0 Then
            Counties(fc) = Counties(fc) + ";"
        End If
        Counties(fc) = Counties(fc) + ZipRS("County")
        ZipRS.MoveNext
    Loop
    ZipRS.Close
    CountiesInState(fc) = Split(Counties(fc), ";")
Next fc

DoEvents

sql = "SELECT uscounties.ZIP, uscounties.STATE, uscounties.COUNTY " + _
    " From uscounties" + _
    " ORDER BY uscounties.ZIP;"

Set ZipRS = ZipDB.OpenRecordset(sql, dbOpenSnapshot)

Screen.MousePointer = vbHourglass ' remake
DoEvents


'===========================================
' 3 Open/process records

' Open Source data
wlfn = FreeFile
Open "C:\file_del2.txt" For Input As #wlfn

' keep reading until end of file
Do While Not EOF(wlfn)
    Line Input #wlfn, l$
   
   
    If Len(l$) > 0 Then
        Recs = Recs + 1
           
       ' If Recs = 4 Then Stop
       
           
        ' extract the data into an array
        Flds = Split(l$, "~")
        ReDim Preserve Flds(maxfields) ' make sure each record has all of the fields even if empty
       
        ' Add the address fields together
        Address = ""
        For fc = 5 To 9
            If Len(Flds(fc)) > 0 Then
                If Len(Address) > 0 Then
                    Address = Address + " "
                End If
                Address = Address + Flds(fc)
            End If
        Next fc
       
        County = ""
        State = ""
        Country = ""
        StateCode = ""
        Zip = ""
        Confidence = 0

        FullZip = ""
       
        GoSub ParseData
         
        ' save new record
       
        WS.BeginTrans
       
        RS.AddNew
        For fc = 0 To maxfields
            RS("F" + Format(fc, "00")) = Flds(fc)
        Next fc
       
        RS("Country") = Country
        RS("County") = County
        RS("State") = State
        RS("Statecode") = StateCode
        If Len(FullZip) > 0 Then
            RS("Zip") = FullZip
        Else
            RS("Zip") = Zip
        End If
        RS("Checkit") = CheckIt
        RS("Confidence") = Confidence
        RS("Address") = Address
        RS("NewAddress") = NewAddress
        RS.Update
       
        WS.CommitTrans ' free locks
       
    End If
    DoEvents
Loop

'===========================================
' 4  Close up

Close wlfn

RS.Close
ZipRS.Close
DB.Close
ZipDB.Close
WS.Close

Set RS = Nothing
Set ZipRS = Nothing
Set DB = Nothing
Set ZipDB = Nothing
Set WS = Nothing

Screen.MousePointer = vbDefault

MsgBox CStr(Recs) + " records imported!"

Exit Sub

'===========================================
' 5 Parse Data

ParseData:

' remove dots and commas
Address = Replace(Address, ".", " ")
Address = Replace(Address, ",", " ")

' handle obvious abbreviouns
Address = Replace(Address, " S ", " SOUTH ")
Address = Replace(Address, " ST ", " STREET ")
Address = Replace(Address, " AVE ", " AVENUE ")
Address = Replace(Address, " N ", " NORTH ")
Address = Replace(Address, " E ", " EAST ")
Address = Replace(Address, " W ", " WEST ")
Address = Replace(Address, " RD ", " ROAD ")

' remove double spaces
Address = Replace(Address, Space$(2), Space$(1))

words = Split(Address, " ")



ReDim profile(UBound(words))

' add one extra dummy word to make code simple
ReDim Preserve words(UBound(words) + 1)

CheckIt = "0"
ZipFound = -1
StateFound = -1
CountryFound = -1
Zips = 0
StatesC = 0
Countryc = 0

For fc = 0 To UBound(words) - 1
   profile(fc) = StringProfile(words(fc), True)
Next fc


' first time parse find the obvious zip code
For fc = 0 To UBound(words) - 1

   If Len(words(fc)) > 0 Then
       
        worddone = False
        Do
       
            GoSub TestForZIP
            If worddone Then Exit Do
            GoSub TestForState
            If worddone Then Exit Do
            GoSub TestForCountry
            If worddone Then Exit Do
            Exit Do
        Loop
       
   End If
   
Next fc

If Len(Zip) = 0 And ZipFound >= 0 Then
    Zip = words(ZipFound)
End If
   
If Len(State) = 0 And StateFound >= 0 Then
    State = words(StateFound)
End If
If Len(Country) = 0 And CountryFound >= 0 Then
    Country = words(CountryFound)
End If


If Len(State) > 0 Then
    Do
        For cc = 0 To UBound(States)
            For ccs = 0 To UBound(statedetails(cc))
                If State = statedetails(cc)(ccs) Then
                    State = statedetails(cc)(0)
                    CurrentState = cc
                    StateCode = statedetails(cc)(2)
                    Exit Do
                End If
            Next
        Next cc
    Loop

End If

'If Confidence < 100 And Len(State) > 0 Then
'    If Len(State) < 3 Then
'        State = State + "?"
'    End If
'End If

' now get county
' this bit needs to handle counties with double words
' I don't have time for that

ReDim countywords(0) As String

If CurrentState >= 0 Then
    For cc = 0 To UBound(CountiesInState(CurrentState))
        If InStr(Address, CountiesInState(CurrentState)(cc)) > 0 Then
            ' We now now the county
            If Len(Country) > 0 Then
                If County <> CountiesInState(CurrentState)(cc) Then
                    Confidence = Confidence - 4000
                    CheckIt = "2"
                End If
            Else
                County = CountiesInState(CurrentState)(cc)
                Confidence = Confidence + 500
            End If
            'now find which words need to be
            ' removed from the profile
            ' this is complex as the county may have x words
            countywords = Split(CountiesInState(CurrentState)(cc), " ")
            For fc = 0 To UBound(words) - 1 - UBound(countywords)
                worddone = True
                For ccs = 0 To UBound(countywords)
                    If words(fc + ccs) <> countywords(ccs) Then
                        worddone = False
                        Exit For
                    End If
                Next ccs
               
                If worddone Then
                    For ccs = 0 To UBound(countywords)
                        profile(fc + ccs) = "T"
                    Next
                    Exit For
                End If
            Next fc
        End If
    Next
End If


If CheckIt < "8" And Confidence > 100 Then
    CheckIt = "2"
End If

If Confidence > 10000 Then

    If Len(Country) > 0 Then
   
        Select Case Country
            Case Is = ""
                If Confidence > 10000 Then
                    Country = "USA"
                End If
            Case Is = "U.S.A"
                Country = "USA"
            Case Is = "U.S.A."
                Country = "USA"
            Case Is = "U.S. OF A"
                Country = "USA"
            Case Is = "U.S. OF A."
                Country = "USA"
            Case Is = "UNITED STATES"
                 Country = "USA"
           
            Case Is = "UNITED STATES OF AMERICA"
                 Country = "USA"
            Case Is = "US"
                 Country = "USA"
            Case Else
                If Confidence > 10000 Then
                    CheckIt = "3"
                End If
        End Select
       
    Else
        If Len(State) > 0 Then
            Country = "USA"
        End If
    End If
Else
    If Len(State) > 0 Then
        Country = "USA"
    End If

End If

' the non-usa then make the state back to a normal word
If Country <> "USA" Then
    State = ""
    If StateFound >= 0 Then
        profile(StateFound) = "A"
    End If
End If
   
' Assume unfound zip codes not found in zip db were realy zips
If Len(Zip) = 0 Then

    If UBound(words) - 4 > 0 Then
   
        For fc = UBound(words) - 4 To UBound(words) - 1
            If profile(fc) = "N" And Len(words(fc)) = 5 Then
                Zip = words(fc)
                profile(fc) = "Z"
                Exit For
            End If
        Next fc
    End If
End If

' Now create the new address
Address = Address
NewAddress = ""
For fc = 0 To UBound(words) - 1
    If InStr("SCTZx", profile(fc)) = 0 Then
        If Len(NewAddress) > 0 Then
            NewAddress = NewAddress + " "
        End If
        NewAddress = NewAddress + words(fc)
    End If
Next fc

' reomove state if still in address
If Len(State) > 0 Then
    NewAddress = Replace(NewAddress, State, "")
End If


Return


GetState:
    For cc = 0 To UBound(States)
        If StateCode = statedetails(cc)(2) Then
            State = statedetails(cc)(0)
            CurrentState = cc
            Exit For
        End If
    Next cc
Return

TestForZIP:
    If fc < 1 Then Return ' zip code is never at the front
    If (profile(fc) = "N" Or profile(fc) = "N-N") And Len(words(fc)) > 4 Then
        ' may be a zip code
            Zip = words(fc)
       
            If profile(fc) = "N-N" Then
                Zip = Left(Zip, InStr(Zip, "-") - 1)
                FullZip = words(fc)
            End If
           
            profile(fc) = "Z"
           
            ' this bit could be a bit slow but who cars for a one-off job
            ZipRS.MoveFirst
            ZipRS.FindFirst "Zip = '" + Zip + "'"
            If Not ZipRS.NoMatch Then
                ZipFound = fc
                profile(fc) = "Z"
                Zips = Zips + 1
                Confidence = Confidence + 10000
                worddone = True
                County = ZipRS("County")
                StateCode = ZipRS("State")
               
                GoSub GetState
                If Len(State) Then
                    CheckIt = "8"
                    Confidence = Confidence + 10000
                End If
            End If
    End If

Return

TestForState:
       
    ' state is probably in the last half of words
    If fc < UBound(words) * 0.5 Then Return
   
   
    If InStr(AllStates, words(fc)) = 0 Then Return
   

    ' almost and two letter pair will be found in the
    ' allstates string
    ' so we scane just as a quick check
    ' more formal check is later
                   
    ' Now see if the state is a double word state e.g.
    cc = fc
    Do While cc <= UBound(words)
   
       If InStr(AllStates, words(fc) + " " + words(cc + 1)) = 0 Then
           Exit Do
       End If
       
       ' we should realy be working in a temp array which
       ' incase this combination of words is not found to be a state
       
       ' join words
       words(fc) = words(fc) + " " + words(cc + 1)
       words(cc + 1) = ""
       profile(cc + 1) = "x" ' ignore then next word
       cc = cc + 1
       
    Loop
   
    ' formal check
    ' now see if the word is realy a state
   
    For cc = 0 To UBound(States)
        For ccs = 0 To UBound(statedetails(cc))
            If words(fc) = statedetails(cc)(ccs) Then
           
                ' bingo we have found a state
                profile(fc) = "S"
                StatesC = StatesC + 1
                StateFound = fc
                If Len(words(fc)) > 2 Then
                    Confidence = Confidence + 1000
                End If
                worddone = True
                Return
               
            End If
        Next
    Next
   
Return

TestForCountry:

'If Recs = 8 Then Stop
    If words(fc) = "US" Or words(fc) = "USA" Then
        Countryc = Countryc + 1
        profile(fc) = "C"
        Country = words(fc)
        CountryFound = fc
        Confidence = Confidence + 100
        worddone = True
        Return
    End If
   

    If Len(words(fc)) > 4 And InStr(AllCountries, words(fc)) > 0 Then
         
           
        cc = fc
        Do While cc <= UBound(words)
       
            If InStr(AllCountries, words(fc) + " " + words(cc + 1)) = 0 Then
                Exit Do
            End If
           
            ' join words
            words(fc) = words(fc) + " " + words(cc + 1)
            words(cc + 1) = ""
            profile(cc + 1) = "x" ' ignore then next word
            cc = cc + 1
           
        Loop
       
        ' now see if realy a country
        For cc = 0 To UBound(countries)
            If countries(cc) = words(fc) Then
                Confidence = Confidence + 5000
                Countryc = Countryc + 1
                profile(fc) = "C"
                Country = words(fc)
                CountryFound = fc
                Confidence = Confidence + 100

                worddone = True
                Return
            End If
        Next cc
    End If
Return

End Sub

0
 

Author Comment

by:fmufti
Comment Utility
Thanx inthedark, I am still in modifying it so that I will be able to have Zip, city , State and Country  seperate while rest left is street.
Still Thanx a lot for hepling me a lot!
0
 
LVL 17

Expert Comment

by:inthedark
Comment Utility
What percentage had checkit = "0"  ?

Sorting the results by checkit and confidence were there many records that needed checking?
0
 

Author Comment

by:fmufti
Comment Utility
Thanx for all that you have done for me so far!

But, I hope you won’t mind guiding me a little further. I am foreseeing a problem and I would like you to comment on it. After modifying the code and running it on huge amount of data I see that address block is so flexible that it cannot be guaranteed that a user has fed in like street, city, state, country, it can be in any order. However it is still from field 5 to 9. However I have modified so that all data separated by “~” goes in separate column(field) as there are 29 fields and every “~” is for separate entry like email, No, offr, Mail , Tel  etc and so on and even in address block of 5-9 (5=address, 6=city, 7=state, 8=zip, 9=country). Some times data is entered like the pattern it should be and some times not. I have two options and I would like you to guide me.

 

First sol: I have separate every “~” in separate field so the one that is entered right goes in correct columns. Then I should pick the address field (5) and apply a address decode algorithm to separate the records where its not filled right. But I see one difficulty in it. Some time the user has entered 9,8,7,6 right but entered partial duplication of that info in address (5) as well. So now stuck what to do.

 

Second way: as you did it make 5-9 one field and then separate it. Although that will unnecessarily combine 6-9 into 5 for those address that are properly fed, but for the other it has to be...am I right?????

So if we go by your algorithm then how to make sure that 2342-1234 written either at the start or the end of address block will not be take as house address but the zip code.  Same goes for city and country (no need for county). This is a big Q and I am still confused how to proceed for that.

If you look at the last three records, here incorrect Zip was extracted from last record where 74000 was zip instead of “405-406” also how to extract Zips of Canada and UK where they are in form at AXT4U3 etc.

I am mailing my code as well yours. Please direct me which method should I adopt.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

There are many ways to remove duplicate entries in an SQL or Access database. Most make you temporarily insert an ID field, make a temp table and copy data back and forth, and/or are slow. Here is an easy way in VB6 using ADO to remove duplicate row…
This article describes some techniques which will make your VBA or Visual Basic Classic code easier to understand and maintain, whether by you, your replacement, or another Experts-Exchange expert.
Get people started with the process of using Access VBA to control Outlook using automation, Microsoft Access can control other applications. An example is the ability to programmatically talk to Microsoft Outlook. Using automation, an Access applic…
Get people started with the process of using Access VBA to control Excel using automation, Microsoft Access can control other applications. An example is the ability to programmatically talk to Excel. Using automation, an Access application can laun…

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now