New Data new problems (more cleaning)

1. Why does it need postcodes in 2 columns to work? There's a postcode in G but seems to need same postcode in A-F or it delivers ZLU2 error. Should just find the postcode in G.

2. Why is it messing up perfectly good data and putting tildas in the data? See attached for Errors and how they should be (Test Data rows 3-16)

Prior related question: http:Q_28545779.html
Algorithm-for-cleaning-Electoral-Roll-Fi
LVL 1
hawkeye_zzzAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

aikimarkCommented:
1. Why does it need postcodes in 2 columns to work?
a. Because it was a starting condition from which I developed my algorithm and wrote my code.
b. Because matching against the postalcode values in the same column is much simpler and faster
c. Because I observed and asserted that all rows in your sample data had a postalcode column and you didn't correct me.

2. Why is it messing up perfectly good data and putting tildas in the data?
a. Because your sample data did not include that data configuration.
b. Because your testing did not reveal this
c. Because the cell data (to the left of the town) is concatenated into a tilde-delimited string.  The regular expression pattern that matched the string was looking for fewer 'columns'.
d. Maybe the columnar data contains a tilde.  I'll look.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
hawkeye_zzzAuthor Commented:
1a and 1c, very good reasons!
1b. I don't understand. What are you matching? I thought you were just taking the postcode, stripping it and looking up the city. Sounds like you are doing something else I didn't realise. Are you checking one postcode with the other? If so is it hard to change to not need that?

2a and 2b, also very good reasons!
I formatted all the data as text on opening into excel in case that affects anything.

Please take the questions as operational/technical not judgmental. Would have been better written as "how is the code doing this...?" (Curiosity), but existential answers are good and interesting too. I hope these aren't requiring wholesale rewrite of code. I thought I'd covered everything, reality always undermines that hubris.
0
aikimarkCommented:
1b explained
Without the postalcodes being in column G, the code would have to find the right-most non-empty cell and see if it was a postalcode and then try an match it against the city/zip data.  Iterating one column (G) and then finding that data to the left is more reliable.

Why do you ask about this design assumption?

The pattern matching is done by the regular expression object.  The patterns use a tilde character which matches the way I concatenated the cell data.  If I had used some other delimiter for the concatenation, I would have had to use that same character in the regex patterns.
0
Cloud Class® Course: MCSA MCSE Windows Server 2012

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

hawkeye_zzzAuthor Commented:
I have postcodes in column G.
The problem is the code breaks (error ZLU2) if there is not, in addition, a postcode in one of the address fields A-F.
I wonder why one postcode in column G is not enough.

Asking about design assumptions is mainly curiosity, me learning more, trying to understand the code, partly just interest, partly so I can be more helpful at debugging. I find I give bad briefs when I don't understand the code design assumptions. I end up asking for complex things at cross purposes to the patterns of the code. Most times a good coder like yourself will re-interpret and do things another way that's in harmony but it is better if the problem is approached from the start with a good understanding of the tools available to fix it.

I see, so you concatenate all the fields then split them up again by the rules. I wonder if we could introduce a 'presumption of innocence' to the code, where we check and if Address2/Building ID has a number block in it (no spaces), and Address3/Road has an identifiable road in it, we leave those and the Address1/Flat ID as they are?  That looks like it would work with all our current test cases.
0
aikimarkCommented:
the code breaks
Does the code stop iterating the postalcode values in column G?
Does the code throw an error?
How many rows don't have two postalcode values?

It might help you to think about the processing like Pac Man.  We are going to nibble data from the right side of the Address # cells.  Once we've got a matching value, we can do city look-ups and copy data to their respective positions to the right.

This approach reduced the parsing problem complexity.  Now the code only has to deal with the columns to the left of the city.  I've eliminated having to pattern match two or three columns.
0
hawkeye_zzzAuthor Commented:
Sorry. The Code runs through to completion but drops error code ZLU2 in H all the way down.
This error states it can't find the postcode (I think).
Summary:
With a postcode in G but not in A-F we get ZLU2 error
With a postcode in G AND copy that postcode into any other column A-F then the code works no problem.
With no postcode in G and only one copy of the postcode in any of A-F, nothing happens.
We appear to need the postcode in G AND one other column A-F.

The Pacman seems to be starting at G but wanting to eat twice.
I'm not clear why this is happening.
0
aikimarkCommented:
That is how it was designed to work.  I used the fact that your data had a postalcode column in my design considerations
0
hawkeye_zzzAuthor Commented:
So is it hard to tell it to use the postcode in G if there isn't a second postcode in A-F?
0
aikimarkCommented:
yes.  harder.

What's the problem with this requirement?
0
hawkeye_zzzAuthor Commented:
The Camden data I just received does not have a double postcode. So I'd have to shift the data leftwards and copy the postcodes into the next column before processing. Not awful but leaves room for manual errors.

When it's taking good data and making it messy is a bigger problem.
0
aikimarkCommented:
How many other sets of data haven't you seen yet?
That is value of representative sample data.  It establishes the nature of the problem and constraints.

This entire process would be much simpler if you would post the actual data.  It seems to be from a publicly available data source.

It looks like the current parsing problems are caused by data that is in all upper case.
0
aikimarkCommented:
In addition to being upper-case (which I think I've fixed), the data doesn't resemble the other data.  These rows don't start in column A and, as you've already noted, don't contain the postalcode column as well as postalcode data in one of the first six columns.
0
aikimarkCommented:
After further analysis, I don't think the Cambridge data needs to be parsed.  It already appears to be parsed as you wanted.
0
hawkeye_zzzAuthor Commented:
There are 433 Councils in the UK, so room for plenty of surprises.
What I'm dealing with is electoral roll data and I have access to it as a registered person with very strict controls on what I can release into the public domain, i.e. any identifiable data that could refer to a specific house.

You're right it does look pretty clean. But with 133k rows, there is a lot of flipping between columns. I'll have another look at it and repost when I've looked at the detail.
0
aikimarkCommented:
Voter registration is publicly available in the US.  What is considered sensitive about such data in the UK?

What kind of "flipping" are you doing?
0
aikimarkCommented:
If you need to, you can open questions as "private".  This keeps them from being indexed by the Internet search engines.
0
Martin LissOlder than dirtCommented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft Excel

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.