Solved

New Data new problems (more cleaning)

Posted on 2014-10-29
18
45 Views
Last Modified: 2014-11-26
1. Why does it need postcodes in 2 columns to work? There's a postcode in G but seems to need same postcode in A-F or it delivers ZLU2 error. Should just find the postcode in G.

2. Why is it messing up perfectly good data and putting tildas in the data? See attached for Errors and how they should be (Test Data rows 3-16)

Prior related question: http:Q_28545779.html
Algorithm-for-cleaning-Electoral-Roll-Fi
0
Comment
Question by:hawkeye_zzz
  • 10
  • 6
18 Comments
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40412354
1. Why does it need postcodes in 2 columns to work?
a. Because it was a starting condition from which I developed my algorithm and wrote my code.
b. Because matching against the postalcode values in the same column is much simpler and faster
c. Because I observed and asserted that all rows in your sample data had a postalcode column and you didn't correct me.

2. Why is it messing up perfectly good data and putting tildas in the data?
a. Because your sample data did not include that data configuration.
b. Because your testing did not reveal this
c. Because the cell data (to the left of the town) is concatenated into a tilde-delimited string.  The regular expression pattern that matched the string was looking for fewer 'columns'.
d. Maybe the columnar data contains a tilde.  I'll look.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40412379
1a and 1c, very good reasons!
1b. I don't understand. What are you matching? I thought you were just taking the postcode, stripping it and looking up the city. Sounds like you are doing something else I didn't realise. Are you checking one postcode with the other? If so is it hard to change to not need that?

2a and 2b, also very good reasons!
I formatted all the data as text on opening into excel in case that affects anything.

Please take the questions as operational/technical not judgmental. Would have been better written as "how is the code doing this...?" (Curiosity), but existential answers are good and interesting too. I hope these aren't requiring wholesale rewrite of code. I thought I'd covered everything, reality always undermines that hubris.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40412407
1b explained
Without the postalcodes being in column G, the code would have to find the right-most non-empty cell and see if it was a postalcode and then try an match it against the city/zip data.  Iterating one column (G) and then finding that data to the left is more reliable.

Why do you ask about this design assumption?

The pattern matching is done by the regular expression object.  The patterns use a tilde character which matches the way I concatenated the cell data.  If I had used some other delimiter for the concatenation, I would have had to use that same character in the regex patterns.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40412754
I have postcodes in column G.
The problem is the code breaks (error ZLU2) if there is not, in addition, a postcode in one of the address fields A-F.
I wonder why one postcode in column G is not enough.

Asking about design assumptions is mainly curiosity, me learning more, trying to understand the code, partly just interest, partly so I can be more helpful at debugging. I find I give bad briefs when I don't understand the code design assumptions. I end up asking for complex things at cross purposes to the patterns of the code. Most times a good coder like yourself will re-interpret and do things another way that's in harmony but it is better if the problem is approached from the start with a good understanding of the tools available to fix it.

I see, so you concatenate all the fields then split them up again by the rules. I wonder if we could introduce a 'presumption of innocence' to the code, where we check and if Address2/Building ID has a number block in it (no spaces), and Address3/Road has an identifiable road in it, we leave those and the Address1/Flat ID as they are?  That looks like it would work with all our current test cases.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413048
the code breaks
Does the code stop iterating the postalcode values in column G?
Does the code throw an error?
How many rows don't have two postalcode values?

It might help you to think about the processing like Pac Man.  We are going to nibble data from the right side of the Address # cells.  Once we've got a matching value, we can do city look-ups and copy data to their respective positions to the right.

This approach reduced the parsing problem complexity.  Now the code only has to deal with the columns to the left of the city.  I've eliminated having to pattern match two or three columns.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40413213
Sorry. The Code runs through to completion but drops error code ZLU2 in H all the way down.
This error states it can't find the postcode (I think).
Summary:
With a postcode in G but not in A-F we get ZLU2 error
With a postcode in G AND copy that postcode into any other column A-F then the code works no problem.
With no postcode in G and only one copy of the postcode in any of A-F, nothing happens.
We appear to need the postcode in G AND one other column A-F.

The Pacman seems to be starting at G but wanting to eat twice.
I'm not clear why this is happening.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413245
That is how it was designed to work.  I used the fact that your data had a postalcode column in my design considerations
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40413252
So is it hard to tell it to use the postcode in G if there isn't a second postcode in A-F?
0
Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

 
LVL 45

Expert Comment

by:aikimark
ID: 40413293
yes.  harder.

What's the problem with this requirement?
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40413312
The Camden data I just received does not have a double postcode. So I'd have to shift the data leftwards and copy the postcodes into the next column before processing. Not awful but leaves room for manual errors.

When it's taking good data and making it messy is a bigger problem.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413430
How many other sets of data haven't you seen yet?
That is value of representative sample data.  It establishes the nature of the problem and constraints.

This entire process would be much simpler if you would post the actual data.  It seems to be from a publicly available data source.

It looks like the current parsing problems are caused by data that is in all upper case.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413449
In addition to being upper-case (which I think I've fixed), the data doesn't resemble the other data.  These rows don't start in column A and, as you've already noted, don't contain the postalcode column as well as postalcode data in one of the first six columns.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413462
After further analysis, I don't think the Cambridge data needs to be parsed.  It already appears to be parsed as you wanted.
0
 
LVL 1

Author Comment

by:hawkeye_zzz
ID: 40413554
There are 433 Councils in the UK, so room for plenty of surprises.
What I'm dealing with is electoral roll data and I have access to it as a registered person with very strict controls on what I can release into the public domain, i.e. any identifiable data that could refer to a specific house.

You're right it does look pretty clean. But with 133k rows, there is a lot of flipping between columns. I'll have another look at it and repost when I've looked at the detail.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413744
Voter registration is publicly available in the US.  What is considered sensitive about such data in the UK?

What kind of "flipping" are you doing?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40413751
If you need to, you can open questions as "private".  This keeps them from being indexed by the Internet search engines.
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 40466431
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
0

Featured Post

Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

Join & Write a Comment

Suggested Solutions

A little background as to how I came to I design this code: Around 5 years ago I designed an add-in that formatted Excel files to a corporate standard, applying different cell colours and font type depending on whether the cells contained inputs,…
Improved? Move/Copy Add-in Replacement - How to avoid the annoying, “A formula or sheet you want to move or copy contains the name XXX, which already exists on the destination worksheet.” David Miller (dlmille)  It was one of those days… I wa…
This Micro Tutorial will demonstrate on a Mac how to change the sort order for chart legend values and decrpyt the intimidating chart menu.
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now