Solved

Macro to Tidy address data in Excel

Posted on 2014-10-08
68
279 Views
Last Modified: 2014-10-25
See attached file showing various types of rough data in columns A to E
For each type (one row per type), the output in terms of clean data is in columns G-N.

Currently I am cleaning this data using a series of text-to-columns, Left/right formulas and sorting. I wonder if there is a better way if someone has a clever way of writing a macro to make it faster.

There are anything from 20,000 rows upwards in each set of data.
Assume numbers in this example file could be any number from 1-100
Assume 'Name' could be any name or word - I'm just showing the number of words and spaces for each type.
It doesn't matter if macro isn't 100%, if it does 70% that helps.
0
Comment
Question by:hawkeye_zzz
  • 32
  • 31
  • 3
  • +2
68 Comments
 
LVL 11

Expert Comment

by:tel2
ID: 40369845
> See attached file showing various types of rough data in columns A to E
I can't see it.
0
 
LVL 27

Expert Comment

by:Glenn Ray
ID: 40369938
@Hawkeye_zzz:  Can you please attach your example file?  

I'm sure an automated solution - based on your description - can be generated.  But we'll need to see the data. :-)

-Glenn
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40370293
Sorry! don't know why it didn't upload before.

One other thing, these columns are part of a wider set. If any sorting is involved, we'll need to sort all columns together or just add a key column to this data.
Sample-Address-formats.xls
0
 
LVL 25

Expert Comment

by:ProfessorJimJam
ID: 40370536
seems to be very messed up data :-)
0
 
LVL 27

Expert Comment

by:Glenn Ray
ID: 40370860
Thanks for posting an example, but in order to code a good set of logic, it will be better to see an actual sample of address information.  Redact it, if needed; there doesn't appear to be any personal information in this set, however.

For example, one issue is in row 11:  you have "City" in Address Line 4, but it does not show on the converted fields to the right.

This also appears to be English addresses; if so, having a lookup table of Postal Codes will help strip those out properly (and verify their entry).

-Glenn
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40371410
Is this data already in Excel or are you importing it from an ASCII text file?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40371998
Thank you all for your questions.
The original data is voter data so has to be redacted to be untraceable.
However I have left the spaces and words as they are, and from 70,000 records, given one example of each type I can find. Thus where I have written "name road" this could be holly road, klaxon drive, winterbottom crescent etc. the only real identifier here are the spaces, and the way the numbers fall. So for example the postcode is easy to pick out, it's always the rightmost data and it's always letter/doubleletter-number-space-number-letter-letter. But it will not always be in the rightmost column.

Then if in the rest of it there is only one number (1 or 15 or 109 or 14b or 27c) with a space after it, and no other numbers in the group, that is definitely the house identifier, and what follows is the street name. But the street name could be two or three words (Maverly street or Great Toad Street), the only way we'd know is to recognise key terms like: "road, street, crescent, drive, avenue" and if these terms are in the macro I can learn and add new terms as they come up.

If there are two numbers (not including the postcode) such as 14 sunny mansions 27 High street new town, the 27 here would be the house identifier and 14 Sunny Mansions would be the Flat identifier. Other flat identifiers would have to be recognised such as G-F, T-F, U-F, again if they are listed in the macro I can add to the list and learn as I use it. the point is always that two numbers in a group, the rightmost one is the street identifier and anything before it should be dumped in the flat identifier field.

Sorry about the missing 'City' in row 11. City in that case should be in the rightmost field before the postcode.

Yes these are all British addresses. It's not a bad idea Glen, but I don't have a full postcode database of the UK to match against and they are expensive.

I hope that helps. If I don't get the perfect answer here, I am happy to award points and begin a new question with new points to further refine the macro once I've added new information.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40372074
Have you looked at address validation software?

You might qualify for a free copy:
http://www.poweredbypaf.com/developer/

There are some free digital town/postal code data sources:
http://www.dangibbs.co.uk/journal/free-uk-postcode-towns-counties-database

================
You can use references, such as Wikipedia to get your postal code city validation.
http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom
Note: The larger postal districts have a live link to a detailed page with the towns/cities

Here's a 16 page table of the postal code and city reference
http://www.doogal.co.uk/PostcodeDistricts.php
Here is a tab-separated list of just the cities/towns in the UK:
Abingdon, Oxfordshire	Abingdon
Accrington	Accrington
Acle	Acle
Acton, London	Acton
Adlington, Lancashire	Adlington
Alcester	Alcester
Aldeburgh	Aldeburgh
Aldershot	Aldershot
Alford, Lincolnshire	Alford
Alfreton	Alfreton
Alnwick	Alnwick
Alsager	Alsager
Alston, Cumbria	Alston
Alton, Hampshire	Alton
Altrincham	Altrincham
Amble	Amble
Ambleside	Ambleside
Amersham	Amersham
Amesbury	Amesbury
Ampthill	Ampthill
Andover, Hampshire	Andover
Appleby-in-Westmorland	Appleby-in-Westmorland
Arlesey	Arlesey
Arundel	Arundel
Ashbourne, Derbyshire	Ashbourne
Ashburton, Devon	Ashburton
Ashby-de-la-Zouch	Ashby-de-la-Zouch
Ashby Woulds	Ashby Woulds
Ashford, Kent	Ashford
Ashington	Ashington
Ashton-under-Lyne	Ashton-under-Lyne
Askern	Askern
Aspatria	Aspatria
Atherstone	Atherstone
Attleborough	Attleborough
Axbridge	Axbridge
Axminster	Axminster
Aylesbury	Aylesbury
Aylsham	Aylsham
Bacup	Bacup
Bakewell	Bakewell
Banbury	Banbury
Barking	Barking
Barnard Castle	Barnard Castle
Barnes, London	Barnes
Barnet	Barnet
Barnoldswick	Barnoldswick
Barnsley	Barnsley
Barnstaple	Barnstaple
Barrow-in-Furness	Barrow-in-Furness
Barton-upon-Humber	Barton-upon-Humber
Basingstoke	Basingstoke
Batley	Batley
Battle, East Sussex	Battle
Bawtry	Bawtry
Beaconsfield	Beaconsfield
Beaminster	Beaminster
Bebington	Bebington
Beccles	Beccles
Beckenham	Beckenham
Bedale	Bedale
Bedford	Bedford
Bedworth	Bedworth
Belper	Belper
Bentham, North Yorkshire	Bentham
Berkeley, Gloucestershire	Berkeley
Berkhamsted	Berkhamsted
Berwick-upon-Tweed	Berwick-upon-Tweed
Beverley	Beverley
Bewdley	Bewdley
Bexhill-on-Sea	Bexhill-on-Sea
Bexley	Bexley
Bicester	Bicester
Biddulph	Biddulph
Bideford	Bideford
Biggleswade	Biggleswade
Billericay	Billericay
Billingham	Billingham
Bilston	Bilston
Bingham, Nottinghamshire	Bingham
Bingley	Bingley
Birchwood	Birchwood
Birkenhead	Birkenhead
Bishop Auckland	Bishop Auckland
Bishop's Castle	Bishop's Castle
Bishop's Stortford	Bishop's Stortford
Bishop's Waltham	Bishop's Waltham
Blackburn	Blackburn
Blackpool	Blackpool
Blackrod	Blackrod
Blackwater and Hawley	Blackwater and Hawley
Blandford Forum	Blandford Forum
Bletchley and Fenny Stratford	Bletchley and Fenny Stratford
Blyth, Northumberland	Blyth
Bodmin	Bodmin
Bognor Regis	Bognor Regis
Bollington	Bollington
Bolsover	Bolsover
Bolton	Bolton
Bootle	Bootle
Boroughbridge	Boroughbridge
Boston, Lincolnshire	Boston
Bottesford, Lincolnshire	Bottesford
Bourne, Lincolnshire	Bourne
Bournemouth	Bournemouth
Bovey Tracey	Bovey Tracey
Brackley	Brackley
Bradford-on-Avon	Bradford-on-Avon
Brading	Brading
Bradley Stoke	Bradley Stoke
Bradninch	Bradninch
Braintree, Essex	Braintree
Brampton, Carlisle	Brampton
Brandon, Suffolk	Brandon
Braunstone Town	Braunstone Town
Brentford	Brentford
Brentwood, Essex	Brentwood
Bridgnorth	Bridgnorth
Bridgwater	Bridgwater
Bridlington	Bridlington
Bridport	Bridport
Brierfield, Lancashire	Brierfield
Brierley	Brierley
Brigg	Brigg
Brighouse	Brighouse
Brightlingsea	Brightlingsea
Brixham	Brixham
Broadstairs and St Peter's	Broadstairs and St Peter's
Bromborough	Bromborough
Bromley	Bromley
Bromsgrove	Bromsgrove
Bromyard and Winslow	Bromyard and Winslow
Broseley	Broseley
Broughton, Lincolnshire	Broughton
Broughton-in-Furness	Broughton-in-Furness
Bruton	Bruton
Buckfastleigh	Buckfastleigh
Buckingham	Buckingham
Bude-Stratton	Bude-Stratton
Budleigh Salterton	Budleigh Salterton
Bulwell	Bulwell
Bungay, Suffolk	Bungay
Buntingford	Buntingford
Burford	Burford
Burgess Hill	Burgess Hill
Burgh-le-Marsh	Burgh-le-Marsh
Burnham-on-Crouch	Burnham-on-Crouch
Burnham-on-Sea and Highbridge	Burnham-on-Sea and Highbridge
Burnley	Burnley
Burntwood	Burntwood
Burslem	Burslem
Burton Latimer	Burton Latimer
Burton upon Trent	Burton upon Trent
Bury	Bury
Bury St Edmunds	Bury St Edmunds
Bushey	Bushey
Buxton	Buxton
Caistor	Caistor
Callington, Cornwall	Callington
Calne	Calne
Camborne	Camborne
Camelford	Camelford
Cannock	Cannock
Canvey Island	Canvey Island
Carnforth	Carnforth
Carlton Colville	Carlton Colville
Carshalton	Carshalton
Carterton, Oxfordshire	Carterton
Castle Cary	Castle Cary
Castleford	Castleford
Chagford	Chagford
Chapel-en-le-Frith	Chapel-en-le-Frith
Chard, Somerset	Chard
Charlbury	Charlbury
Chatham, Kent	Chatham
Chatteris	Chatteris
Cheadle, Staffordshire	Cheadle
Cheltenham	Cheltenham
Chertsey	Chertsey
Chesham	Chesham
Cheshunt	Cheshunt
Chesterfield	Chesterfield
Chester-le-Street	Chester-le-Street
Chickerell	Chickerell
Chingford	Chingford
Chippenham	Chippenham
Chipping Campden	Chipping Campden
Chipping Norton, Oxfordshire	Chipping Norton
Chipping Sodbury	Chipping Sodbury
Chorley	Chorley
Chorleywood	Chorleywood
Christchurch, Dorset	Christchurch
Chudleigh	Chudleigh
Chulmleigh	Chulmleigh
Church Stretton	Church Stretton
Cinderford	Cinderford
Cirencester	Cirencester
Clare, Suffolk	Clare
Clay Cross	Clay Cross
Cleator Moor	Cleator Moor
Cleethorpes	Cleethorpes
Cleobury Mortimer	Cleobury Mortimer
Clevedon	Clevedon
Clitheroe	Clitheroe
Clun	Clun
Cockermouth	Cockermouth
Coggeshall	Coggeshall
Colburn, North Yorkshire	Colburn
Colchester	Colchester
Coleford, Gloucestershire	Coleford
Coleshill, Warwickshire	Coleshill
Colne	Colne
Colyton, Devon	Colyton
Congleton	Congleton
Conisbrough	Conisbrough
Corbridge	Corbridge
Corby	Corby
Corringham, Essex	Corringham
Corsham	Corsham
Cotgrave	Cotgrave
Cowes	Cowes
Coulsdon	Coulsdon
Cramlington	Cramlington
Cranbrook, Kent	Cranbrook
Craven Arms	Craven Arms
Crawley	Crawley
Crediton	Crediton
Crewe	Crewe
Crewkerne	Crewkerne
Cricklade	Cricklade
Cromer	Cromer
Crosby, Merseyside	Crosby
Crowborough	Crowborough
Croydon	Croydon
Crowland	Crowland
Crowle, Lincolnshire	Crowle
Cullompton	Cullompton
Dagenham	Dagenham
Dalton Town with Newton	Dalton Town with Newton
Darley Dale	Darley Dale
Darlington	Darlington
Dartford	Dartford
Dartmouth, Devon	Dartmouth
Darwen	Darwen
Daventry	Daventry
Dawley	Dawley
Dawlish	Dawlish
Deal, Kent	Deal
Dereham	Dereham
Desborough	Desborough
Devizes	Devizes
Dewsbury	Dewsbury
Didcot	Didcot
Dinnington St John's	Dinnington St John's
Diss	Diss
Doncaster	Doncaster
Dorchester, Dorset	Dorchester
Dorking	Dorking
Dover	Dover
Dovercourt	Dovercourt
Downham Market	Downham Market
Driffield	Driffield
Droitwich Spa	Droitwich Spa
Dronfield	Dronfield
Dudley	Dudley
Dukinfield	Dukinfield
Dulverton	Dulverton
Dunstable	Dunstable
Dunwich	Dunwich
Dursley	Dursley
Ealing	Ealing
Earl Shilton	Earl Shilton
Earley	Earley
Easingwold	Easingwold
East Cowes	East Cowes
East Grinstead	East Grinstead
East Ham	East Ham
Eastbourne	Eastbourne
Eastleigh	Eastleigh
Retford	East Retford
Eastwood, Nottinghamshire	Eastwood
Eccles, Greater Manchester	Eccles
Eccleshall	Eccleshall
Edenbridge, Kent	Edenbridge
Edgware	Edgware
Edmonton, London	Edmonton
Egremont, Cumbria	Egremont
Elland	Elland
Ellesmere, Shropshire	Ellesmere
Ellesmere Port	Ellesmere Port
Elstree and Borehamwood	Elstree and Borehamwood
Emsworth	Emsworth
Enfield Town	Enfield
Epping	Epping
Epworth, Lincolnshire	Epworth
Erith	Erith
Eton, Berkshire	Eton
Evesham	Evesham
Exmouth	Exmouth
Eye, Suffolk	Eye
Fairford	Fairford
Fakenham	Fakenham
Falmouth, Cornwall	Falmouth
Fareham	Fareham
Faringdon	Faringdon
Farnham	Farnham
Faversham	Faversham
Fazeley	Fazeley
Featherstone	Featherstone
Felixstowe	Felixstowe
Ferndown	Ferndown
Ferryhill	Ferryhill
Filey	Filey
Filton	Filton
Finchley	Finchley
Fleet, Hampshire	Fleet
Fleetwood	Fleetwood
Flitwick	Flitwick
Folkestone	Folkestone
Fordbridge	Fordbridge
Fordingbridge	Fordingbridge
Fordwich	Fordwich
Fowey	Fowey
Framlingham	Framlingham
Frinton and Walton	Frinton and Walton
Frodsham	Frodsham
Frome	Frome
Gainsborough, Lincolnshire	Gainsborough
Garstang	Garstang
Gateshead	Gateshead
Gillingham, Dorset	Gillingham
Gillingham, Kent	Gillingham
Glastonbury	Glastonbury
Glossop	Glossop
Godalming	Godalming
Godmanchester	Godmanchester
Goole	Goole
Gorleston	Gorleston
Gosport	Gosport
Grange-over-Sands	Grange-over-Sands
Grantham	Grantham
Gravesend, Kent	Gravesend
Grays	Grays
Great Dunmow	Great Dunmow
Great Torrington	Great Torrington
Great Yarmouth	Great Yarmouth
Greater Willington	Greater Willington
Grimsby	Grimsby
Guildford	Guildford
Guisborough	Guisborough
Hadleigh, Suffolk	Hadleigh
Hailsham	Hailsham
Halesowen	Halesowen
Halesworth	Halesworth
Halifax, West Yorkshire	Halifax
Halstead	Halstead
Haltwhistle	Haltwhistle
Redenhall with Harleston	Redenhall with Harleston
Harlow	Harlow
Harpenden	Harpenden
Harrogate	Harrogate
Harrow, London	Harrow
Hartland, Devon	Hartland
Hartlepool	Hartlepool
Harwich	Harwich
Harworth and Bircotes	Harworth and Bircotes
Haslemere	Haslemere
Haslingden	Haslingden
Hastings	Hastings
Hatfield, Hertfordshire	Hatfield
Hatherleigh	Hatherleigh
Havant	Havant
Haverhill, Suffolk	Haverhill
Haxby	Haxby
Hayle	Hayle
Haywards Heath	Haywards Heath
Heanor and Loscoe	Heanor and Loscoe
Heathfield, East Sussex	Heathfield
Hebden Royd	Hebden Royd
Hedge End	Hedge End
Hednesford	Hednesford
Hedon	Hedon
Helmsley	Helmsley
Helston	Helston
Hemel Hempstead	Hemel Hempstead
Hemsworth	Hemsworth
Hendon	Hendon
Henley-in-Arden	Henley-in-Arden
Henley-on-Thames	Henley-on-Thames
Hertford	Hertford
Hessle	Hessle
Hetton-le-Hole	Hetton
Hexham	Hexham
Heywood, Greater Manchester	Heywood
Higham Ferrers	Higham Ferrers
Highworth	Highworth
High Wycombe	High Wycombe
Hinckley	Hinckley
Hingham, Norfolk	Hingham
Hitchin	Hitchin
Hoddesdon	Hoddesdon
Holbeach	Holbeach
Holsworthy, Devon	Holsworthy
Holt, Norfolk	Holt
Honiton	Honiton
Horley	Horley
Horncastle, Lincolnshire	Horncastle
Hornsea	Hornsea
Hornsey	Hornsey
Horsforth	Horsforth
Horsham	Horsham
Horwich	Horwich
Houghton Regis	Houghton Regis
Howden	Howden
Huddersfield	Huddersfield
Hungerford	Hungerford
Hunstanton	Hunstanton
Huntingdon	Huntingdon
Hyde, Greater Manchester	Hyde
Hythe, Kent	Hythe
Ilford	Ilford
Ilfracombe	Ilfracombe
Ilkeston	Ilkeston
Ilkley	Ilkley
Ilminster	Ilminster
Immingham	Immingham
Ingleby Barwick	Ingleby Barwick
Ipswich	Ipswich
Irthlingborough	Irthlingborough
Ivybridge	Ivybridge
Jarrow	Jarrow
Keighley	Keighley
Kempston	Kempston
Kendal	Kendal
Kenilworth	Kenilworth
Kesgrave	Kesgrave
Keswick, Cumbria	Keswick
Kettering	Kettering
Keynsham	Keynsham
Kidderminster	Kidderminster
Kidsgrove	Kidsgrove
Kimberley, Nottinghamshire	Kimberley
Kingsbridge	Kingsbridge
King's Lynn	King's Lynn
Kingston-upon-Thames	Kingston-upon-Thames
Kington, Herefordshire	Kington
Kirkby-in-Ashfield	Kirkby-in-Ashfield
Kirkby Lonsdale	Kirkby Lonsdale
Kirkby Stephen	Kirkby Stephen
Kirkbymoorside	Kirkbymoorside
Kirkham, Lancashire	Kirkham
Kirton-in-Lindsey	Kirton-in-Lindsey
Knaresborough	Knaresborough
Knutsford	Knutsford
Langport	Langport
Launceston, Cornwall	Launceston
Leatherhead	Leatherhead
Lechlade	Lechlade
Ledbury	Ledbury
Leek, Staffordshire	Leek
Leigh, Greater Manchester	Leigh
Leighton-Linslade	Leighton-Linslade
Leigh-on-Sea	Leigh-on-Sea
Leiston	Leiston
Leominster	Leominster
Letchworth Garden City	Letchworth Garden City
Lewes	Lewes
Leyburn	Leyburn
Leyton	Leyton
Liskeard	Liskeard
Littlehampton	Littlehampton
Loddon, Norfolk	Loddon
Loftus, North Yorkshire	Loftus
Long Sutton, Lincolnshire	Long Sutton
Longridge	Longridge
Longtown, Cumbria	Longtown
Looe	Looe
Lostwithiel	Lostwithiel
Loughborough	Loughborough
Loughton	Loughton
Louth, Lincolnshire	Louth
Lowestoft	Lowestoft
Ludgershall, Wiltshire	Ludgershall
Ludlow	Ludlow
Luton	Luton
Lutterworth	Lutterworth
Lydd	Lydd
Lydney	Lydney
Lyme Regis	Lyme Regis
Lynton and Lynmouth	Lynton and Lynmouth
Lytham St Annes	Lytham St Annes
Mablethorpe and Sutton	Mablethorpe and Sutton
Macclesfield	Macclesfield
Madeley, Shropshire	Madeley
Maghull	Maghull
Maidenhead	Maidenhead
Maidstone	Maidstone
Maldon, Essex	Maldon
Malmesbury, Wiltshire	Malmesbury
Maltby, South Yorkshire	Maltby
Malton, North Yorkshire	Malton
Malvern, Worcestershire	Malvern
Manningtree	Manningtree
Mansfield	Mansfield
Marazion	Marazion
March, Cambridgeshire	March
Margate	Margate
Market Bosworth	Market Bosworth
Market Deeping	Market Deeping
Market Drayton	Market Drayton
Market Harborough	Market Harborough
Market Rasen	Market Rasen
Market Weighton	Market Weighton
Marlborough, Wiltshire	Marlborough
Marlow, Buckinghamshire	Marlow
Maryport	Maryport
Masham	Masham
Matlock, Derbyshire	Matlock
Medlar with Wesham	Medlar with Wesham
Melksham	Melksham
Meltham	Meltham
Melton Mowbray	Melton Mowbray
Mere, Wiltshire	Mere
Mexborough	Mexborough
Middleham	Middleham
Middlesbrough	Middlesbrough
Middleton, Greater Manchester	Middleton
Middlewich	Middlewich
Midhurst	Midhurst
Midsomer Norton	Midsomer Norton
Mildenhall, Suffolk	Mildenhall
Millom	Millom
Minchinhampton	Minchinhampton
Minehead	Minehead
Minster, Swale	Minster
Mirfield	Mirfield
Mitcham	Mitcham
Mitcheldean	Mitcheldean
Morecambe	Morecambe
Moretonhampstead	Moretonhampstead
Moreton-in-Marsh	Moreton-in-Marsh
Morley, West Yorkshire	Morley
Morpeth, Northumberland	Morpeth
Mossley	Mossley
Much Wenlock	Much Wenlock
Nailsea	Nailsea
Nailsworth	Nailsworth
Nantwich	Nantwich
Needham Market	Needham Market
Nelson, Lancashire	Nelson
Neston	Neston
New Alresford	New Alresford
New Mills	New Mills
New Milton	New Milton
New Romney	New Romney
Newark-on-Trent	Newark-on-Trent
Newbiggin-by-the-Sea	Newbiggin-by-the-Sea
Newbury, Berkshire	Newbury
Newcastle-under-Lyme	Newcastle-under-Lyme
Newent	Newent
Newhaven, East Sussex	Newhaven
Newlyn	Newlyn
Newmarket, Suffolk	Newmarket
Newport, Isle of Wight	Newport
Newport, Shropshire	Newport
Newport Pagnell	Newport Pagnell
Newquay	Newquay
Newton Abbot	Newton Abbot
Newton-le-Willows	Newton-le-Willows
Normanton, West Yorkshire	Normanton
North Hykeham	North Hykeham
North Petherton	North Petherton
North Tawton	North Tawton
North Walsham	North Walsham
Northallerton	Northallerton
Northam, Devon	Northam
Northampton	Northampton
Northfleet	Northfleet
Northleach with Eastington	Northleach with Eastington
Northwich	Northwich
Norton-on-Derwent	Norton-on-Derwent
Nuneaton	Nuneaton
Oakengates	Oakengates
Oakham	Oakham
Okehampton	Okehampton
Oldbury, West Midlands	Oldbury
Oldham	Oldham
Ollerton and Boughton	Ollerton and Boughton
Olney, Buckinghamshire	Olney
Chipping Ongar	Ongar
Orford, Suffolk	Orford
Ormskirk	Ormskirk
Ossett	Ossett
Oswestry	Oswestry
Otley	Otley
Ottery St Mary	Ottery St Mary
Oundle	Oundle
Paddock Wood	Paddock Wood
Padiham	Padiham
Padstow	Padstow
Paignton	Paignton
Painswick	Painswick
Partington, Greater Manchester	Partington
Patchway	Patchway
Pateley Bridge	Pateley Bridge
Peacehaven	Peacehaven
Penistone	Penistone
Penkridge	Penkridge
Penrith, Cumbria	Penrith
Penryn, Cornwall	Penryn
Penwortham	Penwortham
Penzance	Penzance
Pershore	Pershore
Peterlee	Peterlee
Petersfield	Petersfield
Petworth	Petworth
Pickering, North Yorkshire	Pickering
Pocklington	Pocklington
Polegate	Polegate
Pontefract	Pontefract
Ponteland	Ponteland
Poole	Poole
Porthleven	Porthleven
Portishead and North Weston	Portishead and North Weston
Isle of Portland	Portland
Potton	Potton
Poynton-with-Worth	Poynton-with-Worth
Preesall	Preesall
Prescot	Prescot
Princes Risborough	Princes Risborough
Prudhoe	Prudhoe
Pudsey	Pudsey
Queenborough	Queenborough-in-Sheppey
Radstock	Radstock
Ramsey, Cambridgeshire	Ramsey
Ramsgate	Ramsgate
Raunds	Raunds
Rawtenstall	Rawtenstall
Rayleigh, Essex	Rayleigh
Reading, Berkshire	Reading
Redcar	Redcar
Redruth	Redruth
Reepham, Norfolk	Reepham
Reigate	Reigate
Richmond, London	Richmond
Richmond, North Yorkshire	Richmond
Ringwood	Ringwood
Ripley, Derbyshire	Ripley
Ripon	Ripon
Rochdale	Rochdale
Rochester, Kent	Rochester
Rochford	Rochford
Romford	Romford
Romsey	Romsey
Ross-on-Wye	Ross-on-Wye
Rothbury	Rothbury
Rotherham	Rotherham
Rothwell, Northamptonshire	Rothwell
Rothwell, West Yorkshire	Rothwell
Rowley Regis	Rowley Regis
Royal Leamington Spa	Royal Leamington Spa
Royal Tunbridge Wells	Royal Tunbridge Wells
Royal Wootton Bassett	Royal Wootton Bassett
Royston, Hertfordshire	Royston
Rugby, Warwickshire	Rugby
Rugeley	Rugeley
Rushden	Rushden
Ryde	Ryde
Rye, East Sussex	Rye
Saffron Walden	Saffron Walden
St Austell	St Austell
St Blazey	St Blaise
St Columb Major	St Columb Major
St Helens, Merseyside	St Helens
St Ives, Cambridgeshire	St Ives
St Ives, Cornwall	St Ives
St Just-in-Penwith	St Just-in-Penwith
St Mawes	St Mawes
St Neots	St Neots
Salcombe	Salcombe
Sale, Greater Manchester	Sale
Saltash	Saltash
Sandbach	Sandbach
Sandhurst, Berkshire	Sandhurst
Sandiacre	Sandiacre
Sandown	Sandown
Sandwich, Kent	Sandwich
Sandy, Bedfordshire	Sandy
Sawbridgeworth	Sawbridgeworth
Saxmundham	Saxmundham
Scarborough, North Yorkshire	Scarborough
Scunthorpe	Scunthorpe
Seaford, East Sussex	Seaford
Seaham	Seaham
Seaton, Devon	Seaton
Sedbergh	Sedbergh
Selby	Selby
Selsey	Selsey
Settle	Settle
Sevenoaks	Sevenoaks
Shaftesbury	Shaftesbury
Shanklin	Shanklin
Shefford, Bedfordshire	Shefford
Shepshed	Shepshed
Shepton Mallet	Shepton Mallet
Sherborne	Sherborne
Sheringham	Sheringham
Shifnal	Shifnal
Shildon	Shildon
Shipston-on-Stour	Shipston-on-Stour
Shirebrook	Shirebrook
Shoreham-by-Sea	Shoreham-by-Sea
Shrewsbury	Shrewsbury
Sidmouth	Sidmouth
Silloth	Silloth
Silsden	Silsden
Sittingbourne	Sittingbourne
Skegness	Skegness
Skelmersdale	Skelmersdale
Skelton-in-Cleveland	Skelton-in-Cleveland
Skipton	Skipton
Sleaford	Sleaford
Slough	Slough
Smethwick	Smethwick
Snaith and Cowick	Snaith and Cowick
Snodland	Snodland
Soham	Soham
Solihull	Solihull
Somerton	Somerton
South Cave	South Cave
South Elmsall	South Elmsall
South Kirkby and Moorthorpe	South Kirkby and Moorthorpe
South Molton	South Molton
South Petherton	South Petherton
South Shields	South Shields
South Woodham Ferrers	South Woodham Ferrers
Southam	Southam
Southall	Southall
Southborough, Kent	Southborough
Southend-on-Sea	Southend-on-Sea
Southgate, London	Southgate
Southminster	Southminster
Southport	Southport
Southsea	Southsea
Southwell, Nottinghamshire	Southwell
Southwick, Hampshire	Southwick
Southwold	Southwold
Spalding, Lincolnshire	Spalding
Spennymoor	Spennymoor
Spilsby	Spilsby
Stafford	Stafford
Staines-upon-Thames	Staines-upon-Thames
Stainforth, South Yorkshire	Stainforth
Stalbridge	Stalbridge
Stalham	Stalham
Stalybridge	Stalybridge
Stamford, Lincolnshire	Stamford
Stanley, County Durham	Stanley
Stanhope, County Durham	Stanhope
Stapleford, Nottinghamshire	Stapleford
Staveley, Derbyshire	Staveley
Stevenage	Stevenage
Steyning	Steyning
St Mary Cray	St Mary Cray
Stockport	Stockport
Stocksbridge	Stocksbridge
Stockton-on-Tees	Stockton-on-Tees
Stone, Staffordshire	Stone
Stonehouse, Gloucestershire	Stonehouse
Stony Stratford	Stony Stratford
Stotfold	Stotfold
Stourbridge	Stourbridge
Stourport-on-Severn	Stourport-on-Severn
Stowmarket	Stowmarket
Stow-on-the-Wold	Stow-on-the-Wold
Stratford-upon-Avon	Stratford-upon-Avon
Stretford	Stretford
Strood	Strood
Stroud, Gloucestershire	Stroud
Sturminster Newton	Sturminster Newton
Sudbury, Suffolk	Sudbury
Surbiton	Surbiton
Sutton, London	Sutton
Sutton Coldfield	Sutton Coldfield
Swaffham	Swaffham
Swanage	Swanage
Swanley	Swanley
Swanscombe and Greenhithe	Swanscombe and Greenhithe
Swindon	Swindon
Syston	Syston
Tadcaster	Tadcaster
Tadley	Tadley
Tamworth, Staffordshire	Tamworth
Taunton	Taunton
Tavistock, Devon	Tavistock
Teignmouth	Teignmouth
Telscombe	Telscombe
Tenbury Wells	Tenbury Wells
Tenterden	Tenterden
Tetbury	Tetbury
Tewkesbury	Tewkesbury
Thame	Thame
Thatcham	Thatcham
Thaxted	Thaxted
Thetford	Thetford
Thirsk	Thirsk
Thornaby-on-Tees	Thornaby-on-Tees
Thornbury, South Gloucestershire	Thornbury
Thorne, South Yorkshire	Thorne
Thorpe St Andrew	Thorpe St Andrew
Thrapston	Thrapston
Tickhill	Tickhill
Tidworth	Tidworth
Tipton	Tipton
Tisbury, Wiltshire	Tisbury
Tiverton, Devon	Tiverton
Todmorden	Todmorden
Tonbridge	Tonbridge
Topsham, Devon	Topsham
Torpoint	Torpoint
Torquay	Torquay
Totnes	Totnes
Tottenham	Tottenham
Totton and Eling	Totton and Eling
Tow Law	Tow Law
Towcester	Towcester
Tring	Tring
Trowbridge	Trowbridge
Twickenham	Twickenham
Tynemouth	Tynemouth
Uckfield	Uckfield
Ulverston	Ulverston
Uppingham	Uppingham
Upton-upon-Severn	Upton-upon-Severn
Uttoxeter	Uttoxeter
Uxbridge	Uxbridge
Ventnor	Ventnor
Verwood	Verwood
Wadebridge	Wadebridge
Wadhurst	Wadhurst
Wainfleet, Lincolnshire	Wainfleet All Saints
Wallasey	Wallasey
Wallsend	Wallsend
Wallingford, Oxfordshire	Wallingford
Walsall	Walsall
Waltham Abbey (town)	Waltham Abbey
Waltham Cross	Waltham Cross
Walthamstow	Walthamstow
Walton-on-Thames	Walton-on-Thames
Wantage	Wantage
Ware, Hertfordshire	Ware
Wareham, Dorset	Wareham
Warminster	Warminster
Warrington	Warrington
Warwick	Warwick
Watchet	Watchet
Watford	Watford
Wath-upon-Dearne	Wath-upon-Dearne
Watlington, Oxfordshire	Watlington
Watton, Norfolk	Watton
Wellingborough	Wellingborough
Wellington, Shropshire	Wellington
Wellington, Somerset	Wellington
Wells-next-the-Sea	Wells-next-the-Sea
Wem	Wem
Wembley	Wembley
Wendover	Wendover
West Bedlington	West Bedlington
West Bromwich	West Bromwich
West Ham	West Ham
West Malling	West Malling
West Mersea	West Mersea
West Tilbury	West Tilbury
Westbury, Wiltshire	Westbury
Westerham	Westerham
Westhoughton	Westhoughton
Weston-super-Mare	Weston-super-Mare
Wetherby	Wetherby
Weybridge	Weybridge
Weymouth, Dorset	Weymouth
Whaley Bridge	Whaley Bridge
Whitby	Whitby
Whitchurch, Hampshire	Whitchurch
Whitchurch, Shropshire	Whitchurch
Whitehaven	Whitehaven
Whitehill, Hampshire	Whitehill
Whitnash	Whitnash
Whittlesey	Whittlesey
Whitworth, Lancashire	Whitworth
Wickham	Wickham
Wickwar	Wickwar
Widnes	Widnes
Wigan	Wigan
Wigton	Wigton
Willenhall	Willenhall
Willesden	Willesden
Wilton, Wiltshire	Wilton
Wimbledon, London	Wimbledon
Wimborne Minster	Wimborne Minster
Wincanton	Wincanton
Winchcombe	Winchcombe
Winchelsea	Winchelsea
Windermere, Cumbria	Windermere
Windsor, Berkshire	Windsor
Winsford	Winsford
Winslow, Buckinghamshire	Winslow
Winterton, Lincolnshire	Winterton
Wirksworth	Wirksworth
Wisbech	Wisbech
Witham	Witham
Withernsea	Withernsea
Witney	Witney
Wiveliscombe	Wiveliscombe
Wivenhoe	Wivenhoe
Woburn, Bedfordshire	Woburn
Woburn Sands	Woburn Sands
Woking	Woking
Wokingham	Wokingham
Wolsingham	Wolsingham
Wolverton and Greenleys	Wolverton and Greenleys
Wood Green	Wood Green
Woodbridge, Suffolk	Woodbridge
Woodley, Berkshire	Woodley
Woodstock, Oxfordshire	Woodstock
Wooler	Wooler
Workington	Workington
Worksop	Worksop
Worthing	Worthing
Wotton-under-Edge	Wotton-under-Edge
Wragby	Wragby
Wymondham	Wymondham

Open in new window

================
While your worksheet does show some examples, it doesn't tell me in what form the data currently resides.

With a postal code/city reference, you can parse the right most data from each entry (if it exists).  Next, you need to make several passes at the remaining text, looking for the most complex first and then trying to match against the least complex patterns.
0
 
LVL 27

Expert Comment

by:Glenn Ray
ID: 40372117
aikimark's suggestions are just the thing you're going to need to get you from, say a 50% to 60% hit rate to probably over 90%.  I think it's totally worth looking into.

In the meantime, I've made some progress strictly on a parsing level.  I know enough about the UK Postal Code nomenclature so that those codes can be identified by general alphanumeric matching.  Fortunately, your raw data isn't too badly convoluted to pull that out.  Where you'll start having issues is properly allocating space for missing cities/towns or improper street/flat assignment.

I'll post an example workbook tomorrow with basic formulas set up to get you started.

-Glenn
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40372127
To see an example of multiple patterns being applied to data, look through this question thread about parsing citation data.
http:Q_28254706.html

It is likely that your records that have no town identifier will probably need to be manually parsed or placed into a "bad" list.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40375159
Thanks for all your comments. aikimark that's very helpful, I've downloaded some PAF files (which seem to be just samples) and that should be useful, but only as a verification step and I'll have to clean up the addresses first.

It's clear I haven't phrased my question well. Let me therefore have another go and break it down into some smaller easier chunks that should more obviously turn into the macro I'm looking for:

Assume 7 columns (A to G) of raw data, and 7 columns (H to N) of Output data. (Flat Identifier, House Identifier, Street Name, 2nd Line, Locality, Post Town, Postcode)

1. Find the postcode and place in column N.
 - this might be as simple as taking the data in the rightmost data-containing column of A-G
 - It might add some basic verification to check that field is a postcode.

2. Where there is one clear number block in the leftmost field, (1, 22b, 115, etc), drop that number block into column I, and the rest of that field into column J, and the remaining fields dropped into J-M unchanged.

3. Where there are two clear number blocks in the raw data, take everything to the left of the 2nd number block and place in column H, then run step 2 on the rest

4. Where the single number block is not to the far left of the raw data, split everything to the left of it off into column H, then run step 2 on the rest.

5. Where there is no number block, place the first field in column H, leave I blank, and put the next data in column J and following columns.

6. If the sole number block has the word "flat" before it, place that field in column H and leave column I blank

7. If there is a sole number block, but the next but one word after that is from a set list (named list called "HouseBlockNames") which would include words like: Mansions, Court, House etc., then that whole field goes in column H rather than being split over I and J.

8. The icing on the cake would be to take take the PAF file, and verify the now-cleaned data against it.

Add a help note to explain how to change the macro if raw data is in eg. columns X onwards and the output is columns AF onwards.

Hopefully these steps can be recreated in a macro that I can run on the data carrying out most of the tidying up with one button click.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40375285
In what form is the RAW DATA?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40375298
Excel, just like the attached (same file as before with a few errors corrected)
In the main data there are additional columns  for name etc. otherwise the same.
Sample-Address-formats.xls
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40375317
@hawkeye

That worksheet shows mapping.  Are you saying that the raw data is a mapping example?

The workbook is your attempt to communicate the different possible address formats to us and their mapping.

I would expect to have you tell us that the worksheet is sequentially populated cells or that the raw data is an ASCII text file or that the raw data is a tab (or other character) delimited file
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40375402
Hi Aikimark,
I attach a file with a number of tabbed examples. I've left one anonymised row in each example.

I received these as tab delimited files which I've opened and saved in Excel. If there's a way of operating on them as raw tab files please explain how I would do that. I assumed I would have to open them in excel to run excel macros on them.

One problem I'm having is that there are often too many rows of data for Excel so when I open in Excel I am losing data. I'm assuming I have to open in a text programme and chunk the data into separate files. Another option is to upload directly into Salesforce (Force.com platform) which is the database I'm using ultimately, and write code to operate in salesforce. The wider question is maybe I should be asking this question under programming algorithms directly for salesforce, but I am very familiar with excel and I don't see any salesforce headings so I have planned to clean in excel and upload, and thus keep a good understanding of the processes so I can spot errors. I am currently cleaning all the data in excel manually using a combination of sorting, text-to-columns, left and right functions, delete spaces etc. I was hoping here to simply speed up that very manual process with a macro to do the majority of that work but ultimately, once I understand the programming steps, I would try and get an algorithm written in Salesforce and upload raw data. I'd rather get the process working in Excel where I can understand it and follow the steps and track the output results as a first stage.

Does that answer your question?
Example.xlsx
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40375651
too many rows
You have more than 1 million rows of data?!?  If so, you should be using a database.

You have an alignment problem, too.  Examples 1,3,5 seem to be the same and Examples 2,6 seem to be the same.  Examples 4 and 7 are two different sets of data (from each other and the other two sets).
Elector Number Prefix	Elector Number	Elector Number Suffix	Elector Markers	Elector DOB	Elector Name	PostCode	Address1	Address2	Address3	Address4	Address5	Address6																
Elector Number Prefix	Elector Number	Elector Number Suffix	Elector Markers	Elector DOB	Elector Surname	Elector Forename	PostCode	Address1	Address2	Address3	Address4	Address5	Address6															
Elector Number Prefix	Elector Number	Elector Number Suffix	Elector Markers	Elector DOB	Elector Name	PostCode	Address1	Address2	Address3	Address4	Address5	Address6																
Elector	Number	Prefix	Elector	Number	Elector	Number	Suffix	Elector	Markers	Elector	DOB	Elector	Name	PostCode	Address1	Address2	Address3	Address4	Address5	Address6								
Elector Number Prefix	Elector Number	Elector Number Suffix	Elector Markers	Elector DOB	Elector Name	PostCode	Address1	Address2	Address3	Address4	Address5	Address6																
Elector Number Prefix	Elector Number	Elector Number Suffix	Elector Markers	Elector DOB	Elector Surname	Elector Forename	PostCode	Address1	Address2	Address3	Address4	Address5	Address6															
Import Code	Constituency Code	Polling District Code	Full Roll Number	Title	First Name	Initials	Last Name	Suffix	Date Of Birth	Gender	Address (all one line)	Mobile Telephone Number	Home Telephone Number	Email Address	Mosaic Code	Address Line 1	Address Line 2	Address Line 3	Address Line 4	Address Line 5	Address Line 6	Postcode	House Identifier	Street Name	Election (Last Known VI)	Voting Intention (Last Time)	Voting Intention (Last Known)	Voting Intention (This Time)

Open in new window

The good news is that each set has a PostCode column.  This means that we won't have to look through the columns to find that pattern for the town lookup.

Please double-check the anonymized data you posted.  Example 4 looks strange.

Although there is an inconsistency in the Example 7 data, it might be the best of the extracts from a parsing perspective.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40382632
Sorry for the delay I've been away without internet access.

too many rows?
...sorry, I was locked in xls. and 64k lines. I can do it in xlsx and 1 million is more than enough.

Yes some Councils deliver data in very similar formats. You're right, we can ignore 3, 5 and 6 as being identical formats to others. Even then, it is negligible for me to rearrange the columns manually into an identical form where we have columns A-F as Address 1-6 and G as Postcode. So you can assume that this has been done, with blank columns created in H-N ready for the processed data as the starting point for the macro.

Ignore the fact that example 7 has 'address in one line' different from 'address in separate lines, I over-randomised the data - they are in fact the same.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40382804
* So, this data comes from different sources?
* Does this (raw) data come to you in CSV format or have you done some editing to create the CSV (or Excel workbook) files we see in this thread?
* Will the one line address datum be available from all your sources or is that only from a subset of the data sources?

* Have you assembled your city/postalcode table?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40383392
Yes, every Council in the country is a different source, each with their own format.

They all vary, but seem to boil down to a handful of different types, in any case it looks like they all have columns somewhere that equate to Address 1-5 and Postcode. (We'll also need to look at names in a future question - Address is the important one). It won't be difficult to create a macro for each council putting their columns in the right place before processing. But we could automate that: eg. Look for column heading "address 1", cut and paste/insert into A... etc.

The data seems to come in tab delimited formats, some may come in csv, but in any case all can by opened in excel.

The one line of data was just one random line I picked out and changed parts of words in, to give a sense of where the spaces are and how many words etc. There will generally be 50k to 100k rows in a file. When I've cleaned it in excel I will upload it to a Salesforce database.

I attach a lookup of City to Postcode-FirstPart
Postcode-FirstPart-Lookup.xlsx
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40383563
Now obfuscate at least one address of each type you listed in the Sample Address Formats workbook.  Do NOT obfuscate the postal code or town data for those rows.  It would be best if they were all in the same 'style' -- place on a single worksheet based on one of the Example # worksheets.

You can paste either a .CSV or tab-delimited file.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40388205
Hi Aikimark,
Is this what you're looking for? See attached.
Example2-Raw-Data.csv
Example2-columns-aligned.csv
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40388690
Thanks.

What is  (with full stop)?

I won't be able to spend any time on this until tomorrow (Sunday).
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40389373
That's just me pointing out there is a full stop there, probably you didn't need to know that!

I've cleaned up the files a bit more for that, and getting rid of spare lines and headers, see attached.

Sunday is fine, thanks
Example2-Raw-Data.csv
Example2-columns-aligned.csv
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40390705
I discovered that your aligned sample has some high-range ASCII values (160) that look like space characters (32), but aren't.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40390733
Can you treat anything that looks like a space, as a space?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40390850
I can programmatically replace the 160.

Here is my first pass results for matching data.  Assume that the town and postcode values are correct.
The House~5 Road Name~Village
              FlatID        The House
              HouseID       5
              Street        Road Name
              2ndLine       Village
The House~5 Road Name~Village
              FlatID        The House
              HouseID       5
              Street        Road Name
              2ndLine       Village
5 Road Name~Village
              HouseID       5
              Street        Road Name
              2ndLine       Village
5 Road Name~Village
              HouseID       5
              Street        Road Name
              2ndLine       Village
HouseName~5 Road Name~Village
              FlatID        HouseName
              HouseID       5
              Street        Road Name
              2ndLine       Village
HouseName~5 Road Name~Village
              FlatID        HouseName
              HouseID       5
              Street        Road Name
              2ndLine       Village
HouseName~5 Road Name~Village
              FlatID        HouseName
              HouseID       5
              Street        Road Name
              2ndLine       Village
817 Browns Lane
              HouseID       817
              Street        Browns Lane
Flat 3~7-9 Double Name Avenue
              FlatID        Flat 3
              HouseID       7-9
              Street        Double Name Avenue
HouseName~Name Road
              FlatID        HouseName
              Street        Name Road
Flat 3~158 Name Road
              FlatID        Flat 3
              HouseID       158
              Street        Name Road
St.Name Sch. Building~85 Name Road
              HouseID       85
              Street        Name Road
Lower Flat~18 Name Avenue
              FlatID        Lower Flat
              HouseID       18
              Street        Name Avenue
Flat 7~23 Name Avenue
              FlatID        Flat 7
              HouseID       23
              Street        Name Avenue
5 Name Cottages~Name Road
              HouseID       5
              Street        Name Cottages
              2ndLine       Name Road
The Cottage~Double Name Way
              FlatID        The Cottage
              Street        Double Name Way
729 The Name
              HouseID       729
              Street        The Name
Double Name~702 The RoadName
              HouseID       702
              Street        The RoadName
Three Part Name~The RoadName
              FlatID        Three Part Name
              Street        The RoadName
Flat 7~501 Name Lane
              FlatID        Flat 7
              HouseID       501
              Street        Name Lane
5 Name Court~7A Name Road
              HouseID       5
              Street        Name Court
              2ndLine       7A Name Road
4 Name Road
              HouseID       4
              Street        Name Road
5 Name Court~1B Trinity Avenue
              HouseID       5
              Street        Name Court
              2ndLine       1B Trinity Avenue
Flat 3~702 Name Road
              FlatID        Flat 3
              HouseID       702
              Street        Name Road
Lower Flat~57 Name Road
              FlatID        Lower Flat
              HouseID       57
              Street        Name Road
5 Name Cottages~Village
              HouseID       5
              Street        Name Cottages
              2ndLine       Village
4 Name House~The Name~Mapledurwell
              HouseID       4
              Street        Name House
              2ndLine       The Name~Mapledurwell
2 Name Farm~St Double Name
              HouseID       2
              Street        Name Farm
              2ndLine       St Double Name
Name Hotel~6 Name Road
              HouseID       6
              Street        Name Road
7 Name Court
              HouseID       7
              Street        Name Court
78 Name Close
              HouseID       78
              Street        Name Close
9 Name Court~Double Name Close
              HouseID       9
              Street        Name Court
              2ndLine       Double Name Close
Name House~75 Name Road~Village
              FlatID        Name House
              HouseID       75
              Street        Name Road
              2ndLine       Village
Name House~57A Double Name
              HouseID       57A
              Street        Double Name
57 Name Lane
              HouseID       57
              Street        Name Lane
118 Name Lane~Essington
              HouseID       118
              Street        Name Lane
              2ndLine       Essington
Flat~6 Name Street
              FlatID        Flat
              HouseID       6
              Street        Name Street
5 Name Road
              HouseID       5
              Street        Name Road

Open in new window

Exceptions:
1. EN4 should indicate a town of Barnet.  However, two of your EN4 lines have a town of Hertfordshire

2. the following postcode column values have a trailing space character:
"WS3 2RG "
"WS2 7LB "
"WS3 3SS "
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40390977
Thanks Aikimark, this is all great.

First of all your exceptions:
1. EN4 giving county rather than town - we could insert the town before the county, or replace the county. This is straying beyond the bounds of the question so I think I should make that a separate question. To deal with it we'd have to lookup both post-town and county, put them in specific fields and search and delete them if we find them in the given address. Then we'd have to adjust for misspellings. It would be nice to get the data that clean, but it's not necessary for now, and I can't afford the PAF licence (yet) so maybe I will post this up as an additional question when I have all the lookup data in place.

2. Trailing space in postcode - it would be nice to lose it but not essential.

Second the errors in your first pass:

Please note that I have been looking for something to dramatically speed things up, not necessarily to give me perfection, so don't worry if we don't solve everything, but here are the ones that could be better and some ideas as to how one might solve them:

line 48 - Type A - St.Name Sch. Building would be in Flat ID
line 59 - Type B: "5 Name Cottages" should be Flat ID: 5; House ID: Name Cottages; Street: Name Road
line 63  - Type C: "The Cottage" would be house ID
line 69 - Type A: same as 48: "double name" would be in Flat ID
line 72 - Type C same as 63: "Three Part Name" would be house ID
line 79 - Type D: "5 Name Court" would be Flat ID; 7A is House ID and Name Road is the Street
Line 86 - Type D same as 79 "5 Name Court" would be Flat ID
Line 102 - Type E: "Mapledurell" would be third line
line 119 - Type D same as 79 "9 Name Court" would be Flat ID
Line 128 - Type A same as 48: Name House would be Flat ID

Type A we would know because there is a number block not in the leftmost field. Can you do something like "if there's a number block 'in the middle' put anything left of it in 'Flat ID'"?

Type B, would be solved by that Type A step before treating as normal.

Type C, we'd need a rule like: "if there's no number block put the first field in House ID and the next field in Street ID" that may not be perfect but I think it would solve most.

Type D, is like Type A, only there's a leading number block in addition to the 'one in the middle'. Same rule as A would solve this as well though I think.

Type E, you may have only put them both in line 2 to save space, but we have 6 address fields so can use them.

I attach a file with the errors next to the lines referenced if that helps.

If any of those are particularly hard I don't mind if we spin those off into another question.

Many thanks
Charles
Algorithm-Output-with-errors-19-Oct.xlsx
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40391039
what is Type A,B,C,D,E?

I'll take a look at your comments,  The first one you pointed out was one I actually expected to see.  It has to do with the order of the patterns I try.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40391688
Please take a look at the attached workbook.

I copied the address data columns into a new workbook and added the parsed columns to the right.  Immediately to the right of the PostalCode column, I added some codes that let me know what exceptions happened in the program.
ZLU= zip/postal code lookup failure
CLU= city lookup failure

Note: all the 160 characters have been changed to spaces and the postal code values have been trimmed of leading/trailing spaces.
Q-28534211.xls
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40394463
Hi Aikimark,

This is terrific, it looks like there are only two 'errors' left - the CLU and the Type B.

The Type A, Type B etc. were just my attempts to categorise the errors, in the order I found them. (made up labels). Although I was wrong about 119 above being Type D, it's Type B, which is the only one it looks like remains unsolved in your recent attachment.

Re. CLU errors - Currently you seem to be leaving a blank when a city isn't found. Can we lookup the city from the postcode in that case and resolve the address anyway?

Re. Type B errors - I'm not sure how we could solve these. There's no second number block to help us locate the road. The only indication that we're putting the building name in the road field is that there's another field following it with an identifiable name ("road", "avenue", "close", "way", "street", "crescent" etc.). We could have a lookup, and if the last word in that field (column L in your example Q-28534211) matches a lookup table of road types, and column I is blank, shift all three fields one to the left.

If you can resolve those two it would be amazing. I don't mind moving the Type Bs to a new question.
What is the mechanism then for the execution? a macro in excel or something running externally to the file?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40395050
How many CLU errors do you think you'll have? (percentage or absolute number)

identifiable name ("road", "avenue", "close", "way", "street", "crescent" etc.)
What goes into the etc list?

I assume that you would create some aligned version of your data and run the code against it.  The final version of the code would prompt you to select the data in the postalcode column.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40395943
I have no idea how many CLU errors we might encounter. We could assume the same ratio as our test data.

Re, identifiable names - I think I would start with that list and add to it when I found others from eye-scanning the data. I would either create a named array called "RoadNames", or follow your instructions on where to put the list.

Are you now able to compile the final code?
If it's not an excel macro please give me a guide how to run it.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40396046
The macro is in the workbook I posted.  You are welcome to add other patterns or change existing patterns.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40396397
Ok got it.
I'm getting a Run-time error 9 "Subscript out of range" when it hits:
Set wkb = Workbooks("Postcode-FirstPart-Lookup.xlsx").
I have that file open, and moved it to the same folder as the open file with the data. Do you know why it wouldn't work?

I tried deleting that line, moving the worksheet into the file and changing the next line to:
Set wks = Worksheets("Postcode Simple Lookup")
That seemed to work.

Next run time error (which I couldn't work out) is on line:
  dicZipCities(vData(lngRow, 1)) = dicZipCities(vData(lngRow, 1)) & "^" & vData(lngRow, 2)

More generally, I don't understand your objective with the dicZipCities, or what the CLU error is? For example in rows 18 and 19 of your example file throwing up the CLU error, EN4 is Barnet, and Barnet is in Field B, what is stopping it parsing out. Do we need this error and why not just parse those rows anyway? I don't mind having not perfect data eg as per the 'Type B' issue above, but blank lines are not good.  Can you change it to leave an error code and parse anyway?

Once this is finished, can I post your completed file to another question and create a new question around the 'type B' problem?
0
How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

 
LVL 45

Expert Comment

by:aikimark
ID: 40396599
That workbook must be open when the code runs.  I know that's a bit slack, but I was concentrating on the validation code.  I thought I'd included that requirement in an earlier comment.  I apologize for to omission.

I'm using a dictionary object to do the postalcode -> city lookups.  I thought about using a VLookup() function, but you have several duplicate postalcode entries and I would need to use a filter to get all of them.  The dictionary object is much faster than any of the Excel data lookup methods.

===================
This raises an interesting question...if there are multiple cities for a postalcode value, which one should be used if the row doesn't have a matching city?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40396655
I don't mind having the workbook open. The problem is the workbook is open, and it still doesn't work. I wonder why?
But I can put the worksheet in the file which solves it.

I didn't know about the dictionary tool. It looks pretty cool.
I hadn't realised there were postcode values with multiple cities. I would have no problem, if the row doesn't have either of the two alternatives, with just going with the first in the list.
Can you fix it so it does that?
And do you know why it broke on that line?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40396684
What is the active worksheet when you started the routine?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40396854
Q-28534211.xls is the active workbook, I click within it several times to be sure.
I run the macro by alt.t.m.m and choosing the macro and clicking run.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40396890
Are you running it against the data in the workbook I posted or different data?  If different, is the data consistent with the aligned sample data? (cols A:F =  Address1:Address6; col G =  postcode)

What are the values of
vData(lngRow, 1)
vData(lngRow, 2)
when the code fails?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40398222
I tried running both against new data in an aligned sample, and against the data you posted, same error.

How/where would I find the values of
vData(lngRow, 1)
vData(lngRow, 2)?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40398463
When the code errors, you will be in debug mode.  You should be able to hover your mouse over the variables.  You can also display the values in the Immediate Window by preceding the variable with a question mark.  You can also select an expression and click on the eye glasses toolbar button.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40399687
Thanks, found the values.
Sorry this is going so slow, thank you for your patience:
I got the two bugs (hadn't noticed the postcode name had dashes in the code, and row 619 had an n/a#, both now rectified).
The macro now runs.

Running on your data works perfectly.
Running on some new data delivers error ZLU in the header row (expected) and ZLU2 on the next 56 lines then stops.

The data is Address 1-6 and postcode (A-G), new sheet. I checked the Postcodes (RG24 and RG27) are in the postcode lookup file.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40399859
The operating range is hard coded.  It is 57 rows tall to get all the sample data.  For testing purposes, you can change that range address to accommodate more data rows for your testing.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40400169
I got the range ok. Now the whole lot is ZLU2 error. What is this telling me?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40400213
What does the code currently look like?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40400225
I don't understand the question
The only thing I have changed is G57 is now G60249
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40400404
earlier, you wrote that you had commented some lines of code.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40400568
All I did was put the postcode into the same file and remove the line looking for it in another workbook. I didn't save those changes, and later solved the problem by re-naming the postcode file to match the code and reverting to the original code. So nothing has been altered in the code other than changing the array from G57 to G60249.

The oustanding issues are:
1. What the ZLU2 error is telling us
2. What the CLU error is telling us, and parsing those lines, rather than leaving them blank
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40400914
look for the ZLU2 string literal in the code.

Are you sure the dictionary contains anything?  You might want to set a breakpoint at the place where the dictionary is being filled and step through the code.  Also, the dictionary defaults to case sensitive key comparisons.  If you have lower case zip codes, this will need to change.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40401775
Ok, I finally figured that out. A non-breaking space in the data.

Excel couldn't find and replace it (or said it had but hadn't). Is there a better way to get rid of those than opening in notepad, copying to word (so I can see the things), find&replace, copy back into notepad, save as csv and reopen in Excel?

Can I ask you to resolve the CLU error as the last point before closing? So these lines are parsed out rather than leaving the destination blank. I have tried to understand the code but can't work out what it's doing.
Else
                    Set rngSrc = wks.Range(rngSrc.Cells(1, 1), rngSrc.Cells(1, rngFind.Column))
                    Set rngFind = rngSrc.Find(dicZipCities(strZipCode), , , xlWhole, , xlPrevious)
                    If rngFind Is Nothing Then
                        rng.Offset(0, 1).Value = "CLU"
                        Exit Do

Open in new window

What is the error this has found? and can we drop the error code in there but proceed anyway to parse that line?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40402006
Please upload a workbook with the non-breaking space.  I will add it to the 160 character cleanup.

When getting CLU errors, can you determine if it is in cases where there are multiple cities associated with a postal code?  I just realized that this lookup statement would never find a city in such cases, since the city names have been put in a delimited string in the dictionary.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40402097
I attach the non-breaking space example.
re. CLU errors. In this case the Code is EN4 and only has one city associated with it: Barnet.
The problem is the City is in a field after a comma as follows:

Address1      
5 Name Mews                              

Address2      
Hadley Wood, Barnet

Address3      
Hertfordshire

Address4      
EN4 0ED      

Address5      

Address6      

PostCode
EN4 0ED

Having a field with two items separated by a comma throws up no errors if the city is visible in another field
If I add Barnet in another field the code works
This error appears to indicate that the city found in the postcode lookup is not in an existing standalone field.

In this case I would like it to go ahead and add the city anyway rather than stop the process, perhaps leave the CLU code as a notifier.

There are 56 cases of one postcode 2 cities. I'll put those cases together with the 'Type B' error in another question.
Example-of-non-breaking-space.xls
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40402173
did you come up with a list of road (category) strings for me?
0
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40402285
This version of the code looks for an open postalcode lookup workbook.  If not open, it opens it (from same path as the code is running).

The data is cleaned up before processing.  I guess that 160 is the same as a non-breaking space.

The first city associated for the row's postal code is used if not found in a search of the row's data.  The CLU code will still appear for those rows.
Q-28534211.xls
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40402307
Sorry, I listed all the road strings and didn't submit, here they are:

Road, Street, Avenue, Drive, Close, Lane, Crescent, Parade, Place, Way, Row, Mews, Hill, Gardens, Grove, Walk, Square, Boulevard.

Thanks for the above, I'll test it shortly. Nice touch to open the PC file if not open. Thanks.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40402990
A little problem with the clean up sub.
Line:
vData = wks.Range(wks.Cells(lngRow, 1), parmCell).Value
"Method 'range' of object '_Worksheet' failed"

Only for the data that had the non-breaking space problem (for other data it worked perfectly), it also failed with the same error when I replaced the non-breaking space with a normal space. Do you know what the problem might be?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40402993
Is the non-breaking space the same as the ASCII 160 character or something else?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40403823
It's the same data as before with the non-breaking space  

Presumably this is only searching for those two characters and replacing if it finds them. why would it break at that line?  What kind of thing would make it break there? I don"t understand he code enough.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40403900
Have you clicked the Reset toolbar button (Run Reset) or close/reopened the workbook?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40403932
Trying it on my mac laptop and it breaks on
Run-time error 429.
ActiveX component can't create object:
Set oRE = CreateObject("vbscript.regexp")
might be because it's a mac

Yes I've clicked the reset button, but I'll check in case I haven't reopened when I get back to PC.

It would be helpful if you could say what that line does, then I'll be able to trace error more intelligently.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40403937
No MACs allowed.  Windows only.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40404027
What does that line do? Can you write out the logical statement in plain English? I can't locate the error if I don't know what it's doing. I had assumed it would be something like: "look for a 160 or n-b-space character, if found, replace with space, else carry on" If that's the case it's hard to see what would make that code stop.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40404120
vData = wks.Range(wks.Cells(lngRow, 1), parmCell).Value

Open in new window

This statement transfers the data in the cells from column 1 through the postalcode column on the current row (in the calling routine) into the vData variable.  vData is a variant that contains an array.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40404193
Interesting:
following reboot of the PC:
1. Running the code on the Basingstoke data which had previously had the n-b-spaces replaced, and moved into the file with the code: code runs and works.
2. Running the code on the Basingstoke data in its original file with three files open (Basingstoke data, file with the code, postal code file): appears to run (you wait about as long as per (1)), but ultimately it stops, no errors nothing happens.
3. Move the Basingstoke data (moving the whole sheet) into the file with the code, then running it, causes this error.
4. Saving the file as per (3), closing Excel and reopening it then running the code as you suggested, works.

So it looks like moving a sheet into a new workbook does something to mess up the vData variable, it doesn't matter what I just know to not do that.

Interesting that (2) doesn't work. Do you know why?
(It doesn't matter, I just have to transfer the data into the file with the code, save, close and re-open), but curious if you have an explanation.

On that basis I'm accepting the answer and transferring those two outstanding elements to a new question, since they weren't part of the original brief. Thank you for working so hard on this.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40404212
I don't know what might be happening to cause those errors.  Here is a slightly better version of the cleanup routine.
Sub CleanRowData(parmCell As Range)
    Static wks As Worksheet
    Static rng As Range
    Dim vData As Variant
    Dim strValue As String
    Dim lngCol As Long
    Dim lngRow As Long
    Dim lngBefore As Long
    
    If wks Is Nothing Then
        Set wks = parmCell.Worksheet
    End If
    
    lngRow = parmCell.Row
    vData = wks.Range(wks.Cells(lngRow, 1), parmCell).Value
    
    For lngCol = 1 To UBound(vData, 2)
        lngBefore = Len(vData(1, lngCol))
        strValue = Trim(vData(1, lngCol))
        If (lngBefore <> Len(strValue)) Or (InStr(strValue, Chr(160)) <> 0) Then
            wks.Cells(lngRow, lngCol).Value = Replace(strValue, Chr(160), " ")
        End If
    Next
End Sub

Open in new window


During my testing, I did encounter a few instances of invoking the code and it going off into never land.  At least one time, I found that I had double-clicked a cell and was in edit mode.  This will certainly cause a problem.

It is possible to restructure this so that you would point to a CSV file (file open dialog) and the code would detect the six address# columns and the postcode column.  The original and cleaned data could be added to the current or a new worksheet.  However, that would probably only be a worthwhile effort if you were going to do this often.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40404214
If you are getting adequate results, it doesn't seem that adding the road names is warranted.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40404227
We rely on knowing the street name so this last step would really help. I have asked it under the following Question: ID: http:Q_28544577.html
0
 
LVL 1

Author Closing Comment

by:hawkeye_zzz
ID: 40404230
Great solution, did more than I originally asked and thanks for putting so much work in to get there.
0

Featured Post

Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

Join & Write a Comment

Introduction This Article briefly covers methods of calculating the NPV and IRR variants in Excel as well as the limitations in calculating and interpreting IRR results. Paraphrasing Richard Shockley, author of my favourite finance reference tex…
This article explains all about SQL Server Piecemeal Restore with examples in step by step manner.
The viewer will learn how to simulate a series of sales calls dependent on a single skill level and learn how to simulate a series of sales calls dependent on two skill levels. Simulating Independent Sales Calls: Enter .75 into cell C2 – “skill leve…
This Micro Tutorial will demonstrate how to use longer labels with horizontal bar charts instead of the vertical column chart.

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!

Get 1:1 Help Now