Solved

fix broken .CSV extra quotes?

Posted on 2014-01-13
19
535 Views
Last Modified: 2014-02-07
hi i have this csv that has data in the wrong columns.
and i would like to find a way of cleaning it up.

i am guessing that there might be quotes inside the columns and this might be causing the problem?


This is the csv: http://christophersowerby.com/stock.CSV

This is an example of the problem:

If you open the csv in excel and look at the column called Stock Number you will see the unique identifiers for each product.

Now if you do a search for the Stock Number: 1002015
it is not in the correct column, it should be with the other stock numbers.

if it is extra commas or quotes in the CSV casing this, does any one know a php script to clean it up?

thanks
0
Comment
Question by:helpchrisplz
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 8
  • 5
  • 5
  • +1
19 Comments
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39776324
This looks like a general-purpose "data scrubbing" problem.  Where does the data come from?   How does the CSV get created?
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776337
hi ray its the same source CSV file as that last question you helped me with:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28327984.html


i have now finished the full fopen script (that you helped me with) but i see that it doesn't have some of the stock numbers showing in the fixed csv that the script outputs.

The stock numbers that are not in the fixed csv are the ones that are outside of the stock number column (in the source csv) when opened in excel.

so i am hoping there is a way of cleaning the source csv before i run it through my fopen script.
0
 
LVL 110

Assisted Solution

by:Ray Paseur
Ray Paseur earned 50 total points
ID: 39776375
There may be, but it's not clear to me how to scrub the data.  There's so much of it that is wrong.  It might make more sense to go back to the data source and try to correct it where it is being generated.

Have a look at the output of this script (I am not going to leave this online).  According to TextPad, this is a 7+ megabyte file with 1881 lines.  Obviously Excel thinks otherwise.  So I think your instinct is right about quotes or other delimiters being wrong, but it would be a substantial research project to try to find a way to fix this many errors.

http://www.laprbass.com/RAY_temp_helpchrisplz.php

<?php // RAY_temp_helpchrisplz.php
error_reporting(E_ALL);
echo '<pre>';

// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28336931.html

// THE CSV DATA IS AVAILABLE HERE - OPEN THE FILE
$url = 'http://christophersowerby.com/stock.CSV';
$fpr = fopen($url, 'r');
if (!$fpr) trigger_error("UNABLE TO READ $url", E_USER_ERROR);

// WITH THE FILE OPENED, READ THE FIRST RECORD
$top = fgetcsv($fpr);
$num_top = count($top);
echo PHP_EOL . "EXPECTING $num_top COLUMNS IN EACH ROW";

// NOW READ THE OTHER RECORDS
$num = 1;
$err = 0;
while (!feof($fpr))
{
    // READ EACH ROW INTO AN ARRAY
    $cur = fgetcsv($fpr);
    if (empty($cur)) continue;
    $num_cur = count($cur);
    if ($num_cur != $num_top)
    {
        echo PHP_EOL . "MISMATCH AT LINE $num, HAS $num_cur COLUMNS";
        $err++;
    }
    $num++;
}
echo PHP_EOL . "OUT OF $num LINES, WE FOUND $err WITH MISMATCHED COLUMNS";

Open in new window

0
Salesforce Has Never Been Easier

Improve and reinforce salesforce training & adoption using WalkMe's digital adoption platform. Start saving on costly employee training by creating fast intuitive Walk-Thrus for Salesforce. Claim your Free Account Now

 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776392
ok then.

if it is commas inside cells then cant we check each cell for commas and remove them?

not sure if this is a dead end?

under heading: How to Sanitize Data for CSV Compatibility
https://developers.google.com/analytics/solutions/articles/gdataAnalyticsCsv
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39776417
Commas inside cells should have been wrapped in quotes when the CSV file was created.  Quotes are doubled, etc.  I think the Google "before and after" chart appears to be accurate.  But the magnitude of the task may be significant.  "OUT OF 1502 LINES, WE FOUND 336 WITH MISMATCHED COLUMNS" tells us we had 1,838 lines.  But this does not match the number of lines in TextPad.  It's a big project if you have to start with this data set.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776424
ok then i will try and talk to the stock control company to see what they can do to make the csv better.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39776429
I think that's the best approach.  If you can get them to produce the data in another format that can be "validated" you might get good results.  For example, if they can produce valid XML or JSON documents, it may be easier to turn those into CSV files.
0
 
LVL 57

Assisted Solution

by:Julian Hansen
Julian Hansen earned 50 total points
ID: 39776465
Your data file is not consistent. I couldn't see record 1002015 in Excel - it did not load.

Try running this script against your file. This script will parse each line and
a) Count fields
b) Look for unclosed quotes
As you will see from the output
i) There is not much consistency in terms of the number of fields output
ii) More importantly there are many lines that have opening quotes that are not closed.
This is probably where the problem is.
<?php
// Modify to find your file
$fp = fopen("stock.CSV", "rt");
$line = 1;
while($data = fgets($fp)) {
  $fields = 0;
  $instring = false;
  for($i=0; $i < strlen($data);$i++) {
    if ($data[$i] == '"') {
      if ($instring) {
        if ($data[$i-1] != "\\") {
          $instring = false;
        }
      }
      else {
        $instring = true;
      }
    }
    if ($data[$i] == ',' && !$instring) {
      $fields++;
    }
  }
  echo "Line $line: Fields : $fields [";
  echo ($instring) ? 'Unclosed string detected]<br/>' : ']<br/>';
  $line++;
}
?>

Open in new window

0
 
LVL 35

Accepted Solution

by:
gr8gonzo earned 400 total points
ID: 39776794
The problem is that whoever was writing the file did it in 4k chunks and didn't bother to finish writing out the complete record before moving onto the next one. If you look closely at the data, you'll see that they only go up to 4096 bytes per record. Several records seem to be incomplete because they were longer than 4096 bytes.

I don't think there's anything wrong with the quotes or anything - it's simply truncated data. That truncation is screwing with everything else (quotes and so on).

It's sloppy programming on the part of whoever created this CSV.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39776821
interesting
0
 
LVL 35

Expert Comment

by:gr8gonzo
ID: 39776825
Here's a screenshot that illustrates the problem (the truncated data at the 4k mark), and how Excel is treating the data.
1-13-2014-9-33-36-AM.jpg
0
 
LVL 35

Expert Comment

by:gr8gonzo
ID: 39776861
Something else you might want to suggest to them is to eliminate the unnecessary spaces around the quotes. You only have 2.2 megs of actual valid characters in that CSV file, and a whopping ~5 megs of SPACES outside of the data fields.

For example, the file is full of this:
,""                              ,""  ,""                    ,""

Open in new window

...which is no different than doing this:
,"","","",""

Open in new window


If you remove all of the unnecessary spaces, the file will go from around 7,500,000 bytes to about 2,300,000 bytes. That's a LOT of wasted bandwidth, space, and time.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39779584
@gr8gonzo: Good catch!  In any case, the responsibility goes back to the creator of the document.  I have a gut-feeling that once the extraneous blank space is eliminated, the 4,096 byte line length will be sufficient.  The creator should test the document by loading it into Excel to see if the structure is consistent with common sense (ie: the line count matches the number of rows in the spreadsheet).
0
 
LVL 35

Expert Comment

by:gr8gonzo
ID: 39779766
I'm not sure the extra blank space will be enough to fix the line length. Really, there's no reason for a 4k limit unless you're dealing with fixed-width data files. I think the massive amounts of attribute fields could still cause problems. The author really just needs to check the write buffers and write the next 4k if there's still data to be written before moving onto the next record.
0
 
LVL 1

Author Closing Comment

by:helpchrisplz
ID: 39782066
ok i just sent an email to them outlining what was said here. waiting for the response :)

Thanks everyone.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39784796
What software did you open the CSV in to get the byte numbers at the top of your screenshot? - just interested.
0
 
LVL 35

Expert Comment

by:gr8gonzo
ID: 39785514
UltraEdit, the king of the text editor. Bought a lifetime license about 8 years ago and it's been worth every penny. Best purchase I've ever made - it has tons of useful features that I use all the time (syntax highlighting in code, column editing, macros, hex view, ASCII tables, etc... all without getting in your way). They also have other products, bundles, suites, etc, but I just use the editor.

In a pinch, NotePad++ is free and can give you a smaller portion of that functionality, but it's not quite as polished ("easy to use", IMO). I would buy another license for UE in a heartbeat over settling for Notepad++ or another free editor.

And for the record, UltraEdit has been the only commercial product that I regularly recommend. I'm an open-source guy whenever I can be.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39785620
ok will take a look.  i have been using sublime text for the last 6 month.
0
 
LVL 1

Author Comment

by:helpchrisplz
ID: 39842509
Hi they have fixed the truncation.

please can you take a look at a new problem that has occurred now they fixed the truncation problem.

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28359357.html
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Deprecated and Headed for the Dustbin By now, you have probably heard that some PHP features, while convenient, can also cause PHP security problems.  This article discusses one of those, called register_globals.  It is a thing you do not want.  …
This article discusses how to implement server side field validation and display customized error messages to the client.
The viewer will learn how to dynamically set the form action using jQuery.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question