Solved

fix broken .CSV extra quotes?

Posted on 2014-01-13
19
473 Views
Last Modified: 2014-02-07
hi i have this csv that has data in the wrong columns.
and i would like to find a way of cleaning it up.

i am guessing that there might be quotes inside the columns and this might be causing the problem?


This is the csv: http://christophersowerby.com/stock.CSV

This is an example of the problem:

If you open the csv in excel and look at the column called Stock Number you will see the unique identifiers for each product.

Now if you do a search for the Stock Number: 1002015
it is not in the correct column, it should be with the other stock numbers.

if it is extra commas or quotes in the CSV casing this, does any one know a php script to clean it up?

thanks
0
Comment
Question by:helpchrisplz
  • 8
  • 5
  • 5
  • +1
19 Comments
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39776324
This looks like a general-purpose "data scrubbing" problem.  Where does the data come from?   How does the CSV get created?
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776337
hi ray its the same source CSV file as that last question you helped me with:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28327984.html


i have now finished the full fopen script (that you helped me with) but i see that it doesn't have some of the stock numbers showing in the fixed csv that the script outputs.

The stock numbers that are not in the fixed csv are the ones that are outside of the stock number column (in the source csv) when opened in excel.

so i am hoping there is a way of cleaning the source csv before i run it through my fopen script.
0
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 50 total points
ID: 39776375
There may be, but it's not clear to me how to scrub the data.  There's so much of it that is wrong.  It might make more sense to go back to the data source and try to correct it where it is being generated.

Have a look at the output of this script (I am not going to leave this online).  According to TextPad, this is a 7+ megabyte file with 1881 lines.  Obviously Excel thinks otherwise.  So I think your instinct is right about quotes or other delimiters being wrong, but it would be a substantial research project to try to find a way to fix this many errors.

http://www.laprbass.com/RAY_temp_helpchrisplz.php

<?php // RAY_temp_helpchrisplz.php
error_reporting(E_ALL);
echo '<pre>';

// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28336931.html

// THE CSV DATA IS AVAILABLE HERE - OPEN THE FILE
$url = 'http://christophersowerby.com/stock.CSV';
$fpr = fopen($url, 'r');
if (!$fpr) trigger_error("UNABLE TO READ $url", E_USER_ERROR);

// WITH THE FILE OPENED, READ THE FIRST RECORD
$top = fgetcsv($fpr);
$num_top = count($top);
echo PHP_EOL . "EXPECTING $num_top COLUMNS IN EACH ROW";

// NOW READ THE OTHER RECORDS
$num = 1;
$err = 0;
while (!feof($fpr))
{
    // READ EACH ROW INTO AN ARRAY
    $cur = fgetcsv($fpr);
    if (empty($cur)) continue;
    $num_cur = count($cur);
    if ($num_cur != $num_top)
    {
        echo PHP_EOL . "MISMATCH AT LINE $num, HAS $num_cur COLUMNS";
        $err++;
    }
    $num++;
}
echo PHP_EOL . "OUT OF $num LINES, WE FOUND $err WITH MISMATCHED COLUMNS";

Open in new window

0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776392
ok then.

if it is commas inside cells then cant we check each cell for commas and remove them?

not sure if this is a dead end?

under heading: How to Sanitize Data for CSV Compatibility
https://developers.google.com/analytics/solutions/articles/gdataAnalyticsCsv
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39776417
Commas inside cells should have been wrapped in quotes when the CSV file was created.  Quotes are doubled, etc.  I think the Google "before and after" chart appears to be accurate.  But the magnitude of the task may be significant.  "OUT OF 1502 LINES, WE FOUND 336 WITH MISMATCHED COLUMNS" tells us we had 1,838 lines.  But this does not match the number of lines in TextPad.  It's a big project if you have to start with this data set.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776424
ok then i will try and talk to the stock control company to see what they can do to make the csv better.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39776429
I think that's the best approach.  If you can get them to produce the data in another format that can be "validated" you might get good results.  For example, if they can produce valid XML or JSON documents, it may be easier to turn those into CSV files.
0
 
LVL 51

Assisted Solution

by:Julian Hansen
Julian Hansen earned 50 total points
ID: 39776465
Your data file is not consistent. I couldn't see record 1002015 in Excel - it did not load.

Try running this script against your file. This script will parse each line and
a) Count fields
b) Look for unclosed quotes
As you will see from the output
i) There is not much consistency in terms of the number of fields output
ii) More importantly there are many lines that have opening quotes that are not closed.
This is probably where the problem is.
<?php
// Modify to find your file
$fp = fopen("stock.CSV", "rt");
$line = 1;
while($data = fgets($fp)) {
  $fields = 0;
  $instring = false;
  for($i=0; $i < strlen($data);$i++) {
    if ($data[$i] == '"') {
      if ($instring) {
        if ($data[$i-1] != "\\") {
          $instring = false;
        }
      }
      else {
        $instring = true;
      }
    }
    if ($data[$i] == ',' && !$instring) {
      $fields++;
    }
  }
  echo "Line $line: Fields : $fields [";
  echo ($instring) ? 'Unclosed string detected]<br/>' : ']<br/>';
  $line++;
}
?>

Open in new window

0
 
LVL 34

Accepted Solution

by:
gr8gonzo earned 400 total points
ID: 39776794
The problem is that whoever was writing the file did it in 4k chunks and didn't bother to finish writing out the complete record before moving onto the next one. If you look closely at the data, you'll see that they only go up to 4096 bytes per record. Several records seem to be incomplete because they were longer than 4096 bytes.

I don't think there's anything wrong with the quotes or anything - it's simply truncated data. That truncation is screwing with everything else (quotes and so on).

It's sloppy programming on the part of whoever created this CSV.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776821
interesting
0
 
LVL 34

Expert Comment

by:gr8gonzo
ID: 39776825
Here's a screenshot that illustrates the problem (the truncated data at the 4k mark), and how Excel is treating the data.
1-13-2014-9-33-36-AM.jpg
0
 
LVL 34

Expert Comment

by:gr8gonzo
ID: 39776861
Something else you might want to suggest to them is to eliminate the unnecessary spaces around the quotes. You only have 2.2 megs of actual valid characters in that CSV file, and a whopping ~5 megs of SPACES outside of the data fields.

For example, the file is full of this:
,""                              ,""  ,""                    ,""

Open in new window

...which is no different than doing this:
,"","","",""

Open in new window


If you remove all of the unnecessary spaces, the file will go from around 7,500,000 bytes to about 2,300,000 bytes. That's a LOT of wasted bandwidth, space, and time.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39779584
@gr8gonzo: Good catch!  In any case, the responsibility goes back to the creator of the document.  I have a gut-feeling that once the extraneous blank space is eliminated, the 4,096 byte line length will be sufficient.  The creator should test the document by loading it into Excel to see if the structure is consistent with common sense (ie: the line count matches the number of rows in the spreadsheet).
0
 
LVL 34

Expert Comment

by:gr8gonzo
ID: 39779766
I'm not sure the extra blank space will be enough to fix the line length. Really, there's no reason for a 4k limit unless you're dealing with fixed-width data files. I think the massive amounts of attribute fields could still cause problems. The author really just needs to check the write buffers and write the next 4k if there's still data to be written before moving onto the next record.
0
 
LVL 1

Author Closing Comment

by:helpchrisplz
ID: 39782066
ok i just sent an email to them outlining what was said here. waiting for the response :)

Thanks everyone.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39784796
What software did you open the CSV in to get the byte numbers at the top of your screenshot? - just interested.
0
 
LVL 34

Expert Comment

by:gr8gonzo
ID: 39785514
UltraEdit, the king of the text editor. Bought a lifetime license about 8 years ago and it's been worth every penny. Best purchase I've ever made - it has tons of useful features that I use all the time (syntax highlighting in code, column editing, macros, hex view, ASCII tables, etc... all without getting in your way). They also have other products, bundles, suites, etc, but I just use the editor.

In a pinch, NotePad++ is free and can give you a smaller portion of that functionality, but it's not quite as polished ("easy to use", IMO). I would buy another license for UE in a heartbeat over settling for Notepad++ or another free editor.

And for the record, UltraEdit has been the only commercial product that I regularly recommend. I'm an open-source guy whenever I can be.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39785620
ok will take a look.  i have been using sublime text for the last 6 month.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39842509
Hi they have fixed the truncation.

please can you take a look at a new problem that has occurred now they fixed the truncation problem.

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28359357.html
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Introduction HTML checkboxes provide the perfect way for a web developer to receive client input when the client's options might be none, one or many.  But the PHP code for processing the checkboxes can be confusing at first.  What if a checkbox is…
This article discusses four methods for overlaying images in a container on a web page
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now