Solved

# New Data new problems (more cleaning)

Posted on 2014-10-29
51 Views
1. Why does it need postcodes in 2 columns to work? There's a postcode in G but seems to need same postcode in A-F or it delivers ZLU2 error. Should just find the postcode in G.

2. Why is it messing up perfectly good data and putting tildas in the data? See attached for Errors and how they should be (Test Data rows 3-16)

Prior related question: http:Q_28545779.html
Algorithm-for-cleaning-Electoral-Roll-Fi
0
Question by:hawkeye_zzz
[X]
###### Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

• Help others & share knowledge
• Earn cash & points
• 10
• 6

LVL 45

Accepted Solution

aikimark earned 500 total points
ID: 40412354
1. Why does it need postcodes in 2 columns to work?
a. Because it was a starting condition from which I developed my algorithm and wrote my code.
b. Because matching against the postalcode values in the same column is much simpler and faster
c. Because I observed and asserted that all rows in your sample data had a postalcode column and you didn't correct me.

2. Why is it messing up perfectly good data and putting tildas in the data?
a. Because your sample data did not include that data configuration.
b. Because your testing did not reveal this
c. Because the cell data (to the left of the town) is concatenated into a tilde-delimited string.  The regular expression pattern that matched the string was looking for fewer 'columns'.
d. Maybe the columnar data contains a tilde.  I'll look.
0

LVL 1

Author Comment

ID: 40412379
1a and 1c, very good reasons!
1b. I don't understand. What are you matching? I thought you were just taking the postcode, stripping it and looking up the city. Sounds like you are doing something else I didn't realise. Are you checking one postcode with the other? If so is it hard to change to not need that?

2a and 2b, also very good reasons!
I formatted all the data as text on opening into excel in case that affects anything.

Please take the questions as operational/technical not judgmental. Would have been better written as "how is the code doing this...?" (Curiosity), but existential answers are good and interesting too. I hope these aren't requiring wholesale rewrite of code. I thought I'd covered everything, reality always undermines that hubris.
0

LVL 45

Expert Comment

ID: 40412407
1b explained
Without the postalcodes being in column G, the code would have to find the right-most non-empty cell and see if it was a postalcode and then try an match it against the city/zip data.  Iterating one column (G) and then finding that data to the left is more reliable.

The pattern matching is done by the regular expression object.  The patterns use a tilde character which matches the way I concatenated the cell data.  If I had used some other delimiter for the concatenation, I would have had to use that same character in the regex patterns.
0

LVL 1

Author Comment

ID: 40412754
I have postcodes in column G.
The problem is the code breaks (error ZLU2) if there is not, in addition, a postcode in one of the address fields A-F.
I wonder why one postcode in column G is not enough.

Asking about design assumptions is mainly curiosity, me learning more, trying to understand the code, partly just interest, partly so I can be more helpful at debugging. I find I give bad briefs when I don't understand the code design assumptions. I end up asking for complex things at cross purposes to the patterns of the code. Most times a good coder like yourself will re-interpret and do things another way that's in harmony but it is better if the problem is approached from the start with a good understanding of the tools available to fix it.

I see, so you concatenate all the fields then split them up again by the rules. I wonder if we could introduce a 'presumption of innocence' to the code, where we check and if Address2/Building ID has a number block in it (no spaces), and Address3/Road has an identifiable road in it, we leave those and the Address1/Flat ID as they are?  That looks like it would work with all our current test cases.
0

LVL 45

Expert Comment

ID: 40413048
the code breaks
Does the code stop iterating the postalcode values in column G?
Does the code throw an error?
How many rows don't have two postalcode values?

It might help you to think about the processing like Pac Man.  We are going to nibble data from the right side of the Address # cells.  Once we've got a matching value, we can do city look-ups and copy data to their respective positions to the right.

This approach reduced the parsing problem complexity.  Now the code only has to deal with the columns to the left of the city.  I've eliminated having to pattern match two or three columns.
0

LVL 1

Author Comment

ID: 40413213
Sorry. The Code runs through to completion but drops error code ZLU2 in H all the way down.
This error states it can't find the postcode (I think).
Summary:
With a postcode in G but not in A-F we get ZLU2 error
With a postcode in G AND copy that postcode into any other column A-F then the code works no problem.
With no postcode in G and only one copy of the postcode in any of A-F, nothing happens.
We appear to need the postcode in G AND one other column A-F.

The Pacman seems to be starting at G but wanting to eat twice.
I'm not clear why this is happening.
0

LVL 45

Expert Comment

ID: 40413245
That is how it was designed to work.  I used the fact that your data had a postalcode column in my design considerations
0

LVL 1

Author Comment

ID: 40413252
So is it hard to tell it to use the postcode in G if there isn't a second postcode in A-F?
0

LVL 45

Expert Comment

ID: 40413293
yes.  harder.

What's the problem with this requirement?
0

LVL 1

Author Comment

ID: 40413312
The Camden data I just received does not have a double postcode. So I'd have to shift the data leftwards and copy the postcodes into the next column before processing. Not awful but leaves room for manual errors.

When it's taking good data and making it messy is a bigger problem.
0

LVL 45

Expert Comment

ID: 40413430
How many other sets of data haven't you seen yet?
That is value of representative sample data.  It establishes the nature of the problem and constraints.

This entire process would be much simpler if you would post the actual data.  It seems to be from a publicly available data source.

It looks like the current parsing problems are caused by data that is in all upper case.
0

LVL 45

Expert Comment

ID: 40413449
In addition to being upper-case (which I think I've fixed), the data doesn't resemble the other data.  These rows don't start in column A and, as you've already noted, don't contain the postalcode column as well as postalcode data in one of the first six columns.
0

LVL 45

Expert Comment

ID: 40413462
After further analysis, I don't think the Cambridge data needs to be parsed.  It already appears to be parsed as you wanted.
0

LVL 1

Author Comment

ID: 40413554
There are 433 Councils in the UK, so room for plenty of surprises.
What I'm dealing with is electoral roll data and I have access to it as a registered person with very strict controls on what I can release into the public domain, i.e. any identifiable data that could refer to a specific house.

You're right it does look pretty clean. But with 133k rows, there is a lot of flipping between columns. I'll have another look at it and repost when I've looked at the detail.
0

LVL 45

Expert Comment

ID: 40413744
Voter registration is publicly available in the US.  What is considered sensitive about such data in the UK?

What kind of "flipping" are you doing?
0

LVL 45

Expert Comment

ID: 40413751
If you need to, you can open questions as "private".  This keeps them from being indexed by the Internet search engines.
0

LVL 47

Expert Comment

ID: 40466431
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction This Article briefly covers methods of calculating the NPV and IRR variants in Excel as well as the limitations in calculating and interpreting IRR results. Paraphrasing Richard Shockley, author of my favourite finance reference texâ€¦
How to quickly and accurately populate Word documents with Excel data, charts and images (including Automated Bookmark generation) David Miller (dlmille) Synopsis In this article youâ€™ll learn how to use ExcelToWord! to copy data,charts, shapes â€¦
This Micro Tutorial will demonstrate how to use longer labels with horizontal bar charts instead of the vertical column chart.
This Micro Tutorial will demonstrate the scrolling table in Microsoft Excel using the INDEX function.
###### Suggested Courses
Course of the Month2 days, 7 hours left to enroll