helpchrisplz

asked on

Fix broken .CSV: extra quotes?

Hi, I have this CSV that has data in the wrong columns, and I would like to find a way of cleaning it up.

I am guessing that there might be quotes inside the columns, and that this might be causing the problem.

This is the CSV: http://christophersowerby.com/stock.CSV

This is an example of the problem:

If you open the CSV in Excel and look at the column called Stock Number, you will see the unique identifiers for each product.

Now if you search for the Stock Number 1002015, it is not in the correct column; it should be with the other stock numbers.

If it is extra commas or quotes in the CSV causing this, does anyone know a PHP script to clean it up?

Thanks
Ray Paseur

This looks like a general-purpose "data scrubbing" problem.  Where does the data come from?   How does the CSV get created?
helpchrisplz

ASKER

Hi Ray, it's the same source CSV file as the last question you helped me with:

https://www.experts-exchange.com/questions/28327984/loop-through-CSV-and-update-delete-columns.html

I have now finished the full fopen script (that you helped me with), but I see that some of the stock numbers are not showing in the fixed CSV that the script outputs.

The stock numbers missing from the fixed CSV are the ones that sit outside of the Stock Number column (in the source CSV) when opened in Excel.

So I am hoping there is a way of cleaning the source CSV before I run it through my fopen script.
SOLUTION
Ray Paseur
OK then.

If it is commas inside cells, then can't we check each cell for commas and remove them?

Not sure if this is a dead end?

See under the heading "How to Sanitize Data for CSV Compatibility":
https://developers.google.com/analytics/solutions/articles/gdataAnalyticsCsv
Commas inside cells should have been wrapped in quotes when the CSV file was created, and quotes inside quoted cells should have been doubled.  I think the Google "before and after" chart appears to be accurate.  But the magnitude of the task may be significant.  "OUT OF 1502 LINES, WE FOUND 336 WITH MISMATCHED COLUMNS" tells us we had roughly 1,838 lines (1,502 clean plus 336 mismatched).  But this does not match the number of lines in TextPad.  It's a big project if you have to start with this data set.
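For reference, the mismatch count above can be reproduced with a few lines. This is only a sketch: it assumes the header row has the correct field count, and note that str_getcsv() works one physical line at a time, so it cannot cope with newlines embedded inside quoted fields (which may itself be part of the damage here).

```php
<?php
// Sketch: count how many CSV lines have a field count different from the
// header row. Operates on a string so it is easy to test; for the real
// 7 MB file you would stream it line by line instead.
function count_mismatched_lines(string $csv): array
{
    $lines = preg_split('/\r\n|\r|\n/', trim($csv));
    $header = array_shift($lines);
    $expected = count(str_getcsv($header, ',', '"', '\\')); // header defines the column count
    $bad = 0;
    foreach ($lines as $line) {
        if (count(str_getcsv($line, ',', '"', '\\')) !== $expected) {
            $bad++;
        }
    }
    return [count($lines), $bad]; // [total data lines, mismatched lines]
}
```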
OK then, I will try to talk to the stock control company to see what they can do to make the CSV better.
I think that's the best approach.  If you can get them to produce the data in another format that can be "validated" you might get good results.  For example, if they can produce valid XML or JSON documents, it may be easier to turn those into CSV files.
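To illustrate why another format helps: if the supplier could deliver JSON, a few lines of PHP would regenerate a properly quoted CSV, because fputcsv() handles the quoting and escaping for you. A minimal sketch (the field names here are made up for the example):

```php
<?php
// Sketch: turn a JSON array of objects into a correctly quoted CSV.
// fputcsv() wraps any field containing commas or quotes for us.
function json_to_csv(string $json, $out): void
{
    $rows = json_decode($json, true);
    fputcsv($out, array_keys($rows[0]), ',', '"', '\\'); // header row
    foreach ($rows as $row) {
        fputcsv($out, array_values($row), ',', '"', '\\');
    }
}
```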
SOLUTION
Julian Hansen
ASKER CERTIFIED SOLUTION
Interesting.
Here's a screenshot that illustrates the problem (the truncated data at the 4k mark), and how Excel is treating the data.
1-13-2014-9-33-36-AM.jpg
Something else you might want to suggest to them is to eliminate the unnecessary spaces around the quotes. You only have 2.2 megs of actual valid characters in that CSV file, and a whopping ~5 megs of SPACES outside of the data fields.

For example, the file is full of this:
,""                              ,""  ,""                    ,""

...which is no different than doing this:
,"","","",""


If you remove all of the unnecessary spaces, the file will go from around 7,500,000 bytes to about 2,300,000 bytes. That's a LOT of wasted bandwidth, space, and time.
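In the meantime, the padding can also be stripped on the receiving side. A minimal sketch, assuming the stray spaces only ever sit between a quote and the adjacent comma (spaces inside a quoted value are left alone because the patterns anchor on the quote/comma boundary):

```php
<?php
// Sketch: strip the runs of spaces that sit between a closing quote and
// the next comma, and between a comma and the following opening quote.
function strip_padding(string $line): string
{
    $line = preg_replace('/"[ \t]+(,|$)/', '"$1', $line); // after a closing quote
    $line = preg_replace('/(^|,)[ \t]+"/', '$1"', $line); // before an opening quote
    return $line;
}
```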
@gr8gonzo: Good catch!  In any case, the responsibility goes back to the creator of the document.  I have a gut feeling that once the extraneous blank space is eliminated, the 4,096-byte line length will be sufficient.  The creator should test the document by loading it into Excel to see if the structure is consistent with common sense (i.e., the line count matches the number of rows in the spreadsheet).
I'm not sure removing the extra blank space will be enough to fix the line length. Really, there's no reason for a 4k limit unless you're dealing with fixed-width data files. I think the massive number of attribute fields could still cause problems. The author really just needs to check the write buffers and write the next 4k if there's still data to be written before moving on to the next record.
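The fix on the supplier's side could be as simple as looping over the output until the whole record is flushed; a sketch of the idea (the 4,096-byte chunk size mirrors the observed truncation point):

```php
<?php
// Sketch: write a record of any length in 4 KB chunks so nothing is
// truncated at the 4,096-byte mark before moving on to the next record.
function write_record($out, string $record): void
{
    foreach (str_split($record . "\r\n", 4096) as $chunk) {
        fwrite($out, $chunk);
    }
}
```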
ok i just sent an email to them outlining what was said here. waiting for the response :)

Thanks everyone.
What software did you open the CSV in to get the byte numbers at the top of your screenshot? Just interested.
UltraEdit, the king of text editors. I bought a lifetime license about 8 years ago and it's been worth every penny. Best purchase I've ever made: it has tons of useful features that I use all the time (syntax highlighting in code, column editing, macros, hex view, ASCII tables, etc.), all without getting in your way. They also have other products, bundles, suites, etc., but I just use the editor.

In a pinch, NotePad++ is free and can give you a smaller portion of that functionality, but it's not quite as polished ("easy to use", IMO). I would buy another license for UE in a heartbeat over settling for Notepad++ or another free editor.

And for the record, UltraEdit has been the only commercial product that I regularly recommend. I'm an open-source guy whenever I can be.
OK, will take a look. I have been using Sublime Text for the last 6 months.
Hi, they have fixed the truncation.

Please can you take a look at a new problem that has occurred now that they have fixed the truncation problem?

https://www.experts-exchange.com/questions/28359357/csv-breaking-php.html