Link to home
Start Free TrialLog in
Avatar of trevor1940
trevor1940

asked on

Editing CSV file

Hi
I have an unquoted CSV file that also has columns / cells with comas in
I need to fix it prior to ingesting into a database (This is it, it can't be generated properly)
 
I've wrote a short perl script isolate the rows with too many columns outputting the deliminator as a '|' pipe
I now need to edit the rest

Bringing the file into excel or open office calc can help to work out where the problems with each row is because you can see patterns in the data BUT totally messes up formatting like leading zeros and dates

Using a plane text editor keeps this formatting but is sole destroying looking at 100's lines of CSV data comparing with the spreadsheet  and error prone

Any one know of a CSV editor that can do both?

In 1 pane have an text /CSV editor  and as the deliminator  is removed or added dynamically updates a spreadsheet like interface so you can see instantly the column data lines up
ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
1) Start with this in your code...

use Text::CSV_XS qw( csv );

Open in new window


Then adjust Text::CSV_XS input settings to ingest your .csv file data.

Then output data however you like, by adjusting Text::CSV_XS output settings.

2) To just edit raw .csv files use vi/vim or any other text based editor.
Avatar of trevor1940
trevor1940

ASKER

To just edit raw .csv files use vi/vim or any other text based editor.

The ordinal file before a ran my simple script was coma separated
my Bad output file converted these to pipes
 
The problem with using a simple editor  (I'm Using Katle BTW) is you can't see the patterns in the data the way  you can in a spreadsheet

EG Col G is open text containing comas/pipes  coll Y is a date you can see in the Spreedsheet some rows it's Coll Z ++
 
In a Text editor I see

......FOO|BAR|BISH|BOSH|||||||||||||||2019-04-09
it should be
......FOO,BAR,BISH,BOSH|||||||||||||||2019-04-09

I need to edit each line manually  as  each row is different so pattern matching  would be so complex

BP

Unfortunately  I haven't got Visual Studio on this PC I had thought about writing something in .NET on another PC
I was hopping someone had come across  the problem before
Visual Studio Code is Microsofts free text editor, not the same as the commercial Visual Studio coding IDE / suite.  It's free and is a good programmers text editor.

Visual Studio Code - Code Editing. Redefined


»bp
Do I understand correctly?
You got a CSV file which should have n columns per row.

Due to the fact,
that the CSV generator was broken, strings containing commas weren't surrounded by double quotes, every line containing m commas results in a line containing n + m columns instead of the desired n.

I assume regenerating the CSV with a proper CSV writer is out of question?

You look for a solution to identify all rows with that problem and a simple way to say, which ',' is NOT a separator?

I could imagine a small script, that reads the file line by line and asks you for every line which ',' should not be interpreted as a comma and writes out a file that adds the required escaping (double quotes) or changes all ',' that should be separators with tabs.

example let's imagine the file

miller, peter, 2019-01-01
smith, mr, 2019-01-02
jones, john, paul, peter and mary, 20190-01-03

Open in new window



I could imagine a script, that would just copy the first and second line and prompts you for the 3rd one in following way:
jones, john, paul, peter and mary, 20190-01-03
     a     b     c               d 20190-01-03

So it names each comma with an id (a,b,c,d) and you hust have to answer which commas you'd like to change.
in that case you'd answer  "bc" which means, that the comma with the id b and the comma with the id c should be commas and not separators and will write the result accordingly.

Would something like this help? It would handle 26 - number of columns commas per line and could be enhanced to more if necessary.

If yes, I might help to write a small python script doing that.

Open in new window

It would be helpful if you posted a representative sample of the file.
Yes please do supply a sample, the bigger the better. I can imagine editing the file to flag those lines which have the wrong number of delimiters (i.e columns - 1). You could subsequently edit the flagged file with a normal editor.
With your sample, please indicate how many columns / delimiters there are / should be.
Hi
Up loading a sample isn't possible I could probably make something up though
My original CSV file was 110mb it contained  32 real columns with  a header row

I wrote a small perl script that counted the , in each row if a row had  the same number as the header row it went to a good file if not it went to a bad file and changed the coma to a pipe

I now have a 1.5 mb file that I need to "clean"  row by row
1 problem I don't know which column is incorrect, on one row it could be G another it's, H, M   or T
I found 1 row where there were over 100 columns

If data exists in J it will be a specific format

G & H are reference lists
M is free text
T is another list
Some columns are dates

From these clues you can work out what should be comas or remain as pipes (This is helped if can see it in a spreedsheet)

I don't think additional scripting will help apart from find & replace in a text editor

I'm surprised no one has had a similar problem before , I have not quite on this scale

Am I correct in thinking there isn't a piece of software out there that has both free text editor and spreedsheet like viewer?
Well here a script, (I hope you can run python (python3 or python2 if you read my other posts))  that prompts for each bad line and allows you to choose which columns to join

The code could also suggest guesses, but for writing a decent guessing function we would really need some example lines.
We do not need the entire 1.5 MB, but at least a few examples of lines and expected results for the different cases, that you showed.

#!/usr/bin/env python

import csv

EXPECTED_NUM_OF_COLUMS = 4
IDS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def get_input():
    while True:
        line = input().strip()
        if line and not line.startswith("#"):
            break
    return line


def convert_input_to_indexes(separator_ids):
    """ returns list of reverse indexes if input is valid,
        None otherwise
    """

    separator_indexes = []
    for separator_id in separator_ids:
        try:
            idx = IDS[:len(row) - 1].index(separator_id)
        except:
            print("separator id %s is out of range (should be between"
                  "'a' and '%s'" % (separator_id, IDS[len(row)-1]))
            return None
        separator_indexes.append(idx)
    print(separator_indexes)
    return sorted(separator_indexes, reverse=True)


def adapt_row(row, indexes):
    new_row = list(row)
    print(row, new_row)
    for index in indexes:
        new_row[index:index+2] = [new_row[index] + "," + new_row[index + 1]]
        print(new_row)
    return new_row

def guess_fix(row):
    """ here code could be added to guess fix, but for that
        we'd really need some example lines
    """
    # The current guess assumes the first , is a real ,
    new_row = list(row)
    new_row[0:2] = [new_row[0] + "," + new_row[1]]
    return row

def prompt_for_change(row, lineno, tracefile):
    """ display bad line and ask which commas should be real commas
        wait for user input, change lines accordingly
        and write a trace file, that can be piped back into this script
        as a replay in case an error was made during the first run
    """
    def print_and_log(msg):
        """ prints a message and adds it to the trace file """
        print(msg)
        tracefile.write("# " + msg + "\n")

    # report the bad line and prompt
    print_and_log("line %5d has %2d columns instead of %d" %
                  (lineno, len(row), EXPECTED_NUM_OF_COLUMS))
    print_and_log("please choose which separator should be considered as comma")
    print_and_log("   " + row[0])
    for field_id, field in zip(IDS, row[1:]):
        print_and_log("%s: %s" % (field_id, field))
    print_and_log("")
    guessed_fix = guess_fix(row)
    print("My guessed fix would be", guessed_fix)
    print("Please enter the separators (a-%s) to convert to a real comma"
          " followed by CR (enter '!' to quit (and copy all other lines as is) or '$' to choose the automatic 'guess'" % (IDS[len(row)-1]), end=": ")
    input_str = get_input()
    tracefile.write(input_str + "\n")
    return input_str


with open("input.csv") as fin,\
        open("output.csv", "w") as fout,\
        open("changetrace.log", "w") as tracefile:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(fout, delimiter=",", quotechar='"')
    shall_quit = False
    for lineno, row in enumerate(reader, 1):
        while not shall_quit and len(row) > EXPECTED_NUM_OF_COLUMS:
            input_str = prompt_for_change(row, lineno, tracefile)
            if input_str == '!':
                shall_quit = True
                break
            if input_str == "$":
                row = guess_fix(row)
                break
            separator_indexes = convert_input_to_indexes(input_str)
            row = adapt_row(row, separator_indexes)
        writer.writerow(row)

Open in new window


I tested my code with following input
miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,john,paul,peter and mary,2019-01-03,0031205705200
walker,beam or james,paula,ann or jill,2019-01-03,2010-01-04 or any other date,0012127089400

Open in new window


and when entering '12' for line 3
and '024' for line 4

I'd get following csv output:
miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,"john,paul,peter and mary",2019-01-03,0031205705200
"walker,beam or james","paula,ann or jill","2019-01-03,2010-01-04 or any other date",0012127089400

Open in new window

SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
A few lines of good data and the entire 1.5MB bad file will do.
Libreoffice is a free tool that can read spreadsheets. It's a complete Office drop-in, very popular on Windows as a replacement for Microsoft's product. There are Linux binaries (which I usually install) or you can build from source
if you give enough info it could be possible to write code, that fixes all lines where it is sure about its guessing and that prompts only the lines where guessing is ambigiuous.

Please tell me whether you can run my python script. It must be run with python3.
If your csv doesn't contain unicode characters and you have only python2, then add following lines after line 1
from __future__ import print_function

import sys
if sys.version_info < (3,):
    input = raw_input

Open in new window

and the script should also run with python2
Yep got OpenOffice running in 1 window

Sorry I'm not able to upload even a edited version as it's on a closed system
The originating system is no longer in existance

As I said I could make something up if it'll help

It would be very helpful if you could explain for EACH column waht can be expected.
Good Question well presented

first 4 numbers
3 text
1 text        looks empty
date
3 text        problem columns possibly coma separated list no set format I'm guessing where these should be seperated
text           free text this is an anchor because someone has written comments here also may have comas
number   if filled
text           [A-Z_0-9}{3}/\d+  if filled
5 text        if filled
2 date       if filled
9 text         if filled

hope that helps or do you need to make up an example?

BTW:
On this system I'm running fedora I have perl but not python
So, this isn't a Windows 7 environment?
As mentioned in my earlier post, Visual Studio Code can do side by side, as shown in the image below.  It's not perfect, but might be worth a look.  It's free, and supports multiple platforms including Windows and Linux (I've only used on Windows).

User generated image

»bp
Can you please check / verify / correct: I just see 31 columns

A,B,C, D numbers
E,F, G text without commas
H empty text
I date
J, K, L text, that might contain commas (anything, that might help, which text belongs to which column?)
M text that might contain commas (any way to detect start / end of this column and distinguish it from J,K, L?)
N empty or number
O empty or  [A-Z_0-9}{3}/\d+
P, Q, R, S, T empty or text  without commas
U,V empty or date
W, X, Y, Z, AA, AB, AC, AD, AE text without comma

A few more questions:
Q1: Is it of any help to post python code if you cannot install python on your closed system
Q2: DO you have no way at all of exchanging files between the closed system and other machines?
No ssh / ftp / TCP port open between your work station and the closed system?
No USB stick to copy files to?
Any remote access (RDP / VNC where you can copy paste at least something)


Does that mean, that if we would provide you a perl script you had to type it manually into your closed system?
@gelonida

I just see 31 columns
Were you expecting more or fewer columns?
@aikimark:

Trevor wrote:

My original CSV file was 110mb it contained  32 real columns with  a header row
good catch
I've just run this I'm now left whit just over 500 lines
use strict;

my @strs = ("XX33|11-1 ED 2", "XX 33|11-1 ED 2");

my $regex = qr/([A-Z]{2}\s*\d{2})\|(\d{2}-\d+\sED\s\d*)/mp;

foreach my $str (@strs){
if ( $str =~ s/$regex/$1,$2/g ) {
  print "str out $str\n";

}
}

Open in new window


Generally I can copy from here to the closed system but not the other way round

Can you please check / verify / correct: I just see 31 columns


A,B,C, D numbers
E,F, G,H text without commas
I empty text
J date
K, L, M text, that might contain commas (anything, that might help, which text belongs to which column?)
(Sorry that is not easy to define the above regex fixed column L) K had a different pattern and M was empty
Other rows have different list patterns of varying length some of these I recognize others I'm guessing )
N text that might contain commas (any way to detect start / end of this column and distinguish it from K, L,M?)
Only that it is a sentence not a code list
O empty or number
P empty or  [A-Z_0-9}{3}/\d+
Q, R, S, T,U empty or text  without commas
V, W empty or date
X, Y, Z, AA, AB, AC, AD, AE text without comma or empty

AF if not empty should be uppercase with '/'  Last column
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Well without detailed infos we can mostly help you restructuring your thoughts and suggest generic tools

Yes knowing I couldn't upload even a sample data set my opening question was to do with if software was available that had flexibility of good programmers   text  editor and could also view delimited data in a spreed sheet like format

BP has suggested visualstudio which I'm hopping to get installed today

I hadn't expected this level of response  consequently unsure which post should be the winner
I hadn't expected this level of response  consequently unsure which post should be the winner
Just so you know, it doesn't have to be just one, you can select multiple answers as solutions...


»bp
Hi bp
Currently trying to install   Visual Studio Code on the  system, Although I can boot into Win7 if necessary I've very  little software on this OS most is in Fedora
Hadn't released  Visual Studio Code had dependancies

Any ideas where to get the Excel Viewer add-in you mentioned?

This page sugests Excel Viewer has been retired
Thanx for your help
Yes I know its a late contribution but replace ships with mysql.
E.G. replace "|" "," -- ../my.csv