Editing CSV file

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

David Favor

1) Start with this in your code...

use Text::CSV_XS qw( csv );

Then adjust Text::CSV_XS input settings to ingest your .csv file data.

Then output data however you like, by adjusting Text::CSV_XS output settings.

2) To just edit raw .csv files use vi/vim or any other text based editor.

ASKER

To just edit raw .csv files use vi/vim or any other text based editor.

The ordinal file before a ran my simple script was coma separated
my Bad output file converted these to pipes

The problem with using a simple editor (I'm Using Katle BTW) is you can't see the patterns in the data the way you can in a spreadsheet

EG Col G is open text containing comas/pipes coll Y is a date you can see in the Spreedsheet some rows it's Coll Z ++

In a Text editor I see

......FOO|BAR|BISH|BOSH|||||||||||||||2019-04-09
it should be
......FOO,BAR,BISH,BOSH|||||||||||||||2019-04-09

I need to edit each line manually as each row is different so pattern matching would be so complex

BP

Unfortunately I haven't got Visual Studio on this PC I had thought about writing something in .NET on another PC
I was hopping someone had come across the problem before

Visual Studio Code is Microsofts free text editor, not the same as the commercial Visual Studio coding IDE / suite. It's free and is a good programmers text editor.

Visual Studio Code - Code Editing. Redefined

»bp

Do I understand correctly?
You got a CSV file which should have n columns per row.

Due to the fact,
that the CSV generator was broken, strings containing commas weren't surrounded by double quotes, every line containing m commas results in a line containing n + m columns instead of the desired n.

I assume regenerating the CSV with a proper CSV writer is out of question?

You look for a solution to identify all rows with that problem and a simple way to say, which ',' is NOT a separator?

I could imagine a small script, that reads the file line by line and asks you for every line which ',' should not be interpreted as a comma and writes out a file that adds the required escaping (double quotes) or changes all ',' that should be separators with tabs.

example let's imagine the file

miller, peter, 2019-01-01
smith, mr, 2019-01-02
jones, john, paul, peter and mary, 20190-01-03

I could imagine a script, that would just copy the first and second line and prompts you for the 3rd one in following way:

jones, john, paul, peter and mary, 20190-01-03
     a     b     c               d 20190-01-03

So it names each comma with an id (a,b,c,d) and you hust have to answer which commas you'd like to change.
in that case you'd answer  "bc" which means, that the comma with the id b and the comma with the id c should be commas and not separators and will write the result accordingly.

Would something like this help? It would handle 26 - number of columns commas per line and could be enhanced to more if necessary.

If yes, I might help to write a small python script doing that.

It would be helpful if you posted a representative sample of the file.

Duncan Roe

Yes please do supply a sample, the bigger the better. I can imagine editing the file to flag those lines which have the wrong number of delimiters (i.e columns - 1). You could subsequently edit the flagged file with a normal editor.
With your sample, please indicate how many columns / delimiters there are / should be.

ASKER

Hi
Up loading a sample isn't possible I could probably make something up though
My original CSV file was 110mb it contained 32 real columns with a header row

I wrote a small perl script that counted the , in each row if a row had the same number as the header row it went to a good file if not it went to a bad file and changed the coma to a pipe

I now have a 1.5 mb file that I need to "clean" row by row
1 problem I don't know which column is incorrect, on one row it could be G another it's, H, M or T
I found 1 row where there were over 100 columns

If data exists in J it will be a specific format

G & H are reference lists
M is free text
T is another list
Some columns are dates

From these clues you can work out what should be comas or remain as pipes (This is helped if can see it in a spreedsheet)

I don't think additional scripting will help apart from find & replace in a text editor

I'm surprised no one has had a similar problem before , I have not quite on this scale

Am I correct in thinking there isn't a piece of software out there that has both free text editor and spreedsheet like viewer?

Well here a script, (I hope you can run python (python3 or python2 if you read my other posts)) that prompts for each bad line and allows you to choose which columns to join

The code could also suggest guesses, but for writing a decent guessing function we would really need some example lines.
We do not need the entire 1.5 MB, but at least a few examples of lines and expected results for the different cases, that you showed.

#!/usr/bin/env python

import csv

EXPECTED_NUM_OF_COLUMS = 4
IDS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def get_input():
    while True:
        line = input().strip()
        if line and not line.startswith("#"):
            break
    return line


def convert_input_to_indexes(separator_ids):
    """ returns list of reverse indexes if input is valid,
        None otherwise
    """

    separator_indexes = []
    for separator_id in separator_ids:
        try:
            idx = IDS[:len(row) - 1].index(separator_id)
        except:
            print("separator id %s is out of range (should be between"
                  "'a' and '%s'" % (separator_id, IDS[len(row)-1]))
            return None
        separator_indexes.append(idx)
    print(separator_indexes)
    return sorted(separator_indexes, reverse=True)


def adapt_row(row, indexes):
    new_row = list(row)
    print(row, new_row)
    for index in indexes:
        new_row[index:index+2] = [new_row[index] + "," + new_row[index + 1]]
        print(new_row)
    return new_row

def guess_fix(row):
    """ here code could be added to guess fix, but for that
        we'd really need some example lines
    """
    # The current guess assumes the first , is a real ,
    new_row = list(row)
    new_row[0:2] = [new_row[0] + "," + new_row[1]]
    return row

def prompt_for_change(row, lineno, tracefile):
    """ display bad line and ask which commas should be real commas
        wait for user input, change lines accordingly
        and write a trace file, that can be piped back into this script
        as a replay in case an error was made during the first run
    """
    def print_and_log(msg):
        """ prints a message and adds it to the trace file """
        print(msg)
        tracefile.write("# " + msg + "\n")

    # report the bad line and prompt
    print_and_log("line %5d has %2d columns instead of %d" %
                  (lineno, len(row), EXPECTED_NUM_OF_COLUMS))
    print_and_log("please choose which separator should be considered as comma")
    print_and_log("   " + row[0])
    for field_id, field in zip(IDS, row[1:]):
        print_and_log("%s: %s" % (field_id, field))
    print_and_log("")
    guessed_fix = guess_fix(row)
    print("My guessed fix would be", guessed_fix)
    print("Please enter the separators (a-%s) to convert to a real comma"
          " followed by CR (enter '!' to quit (and copy all other lines as is) or '$' to choose the automatic 'guess'" % (IDS[len(row)-1]), end=": ")
    input_str = get_input()
    tracefile.write(input_str + "\n")
    return input_str


with open("input.csv") as fin,\
        open("output.csv", "w") as fout,\
        open("changetrace.log", "w") as tracefile:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(fout, delimiter=",", quotechar='"')
    shall_quit = False
    for lineno, row in enumerate(reader, 1):
        while not shall_quit and len(row) > EXPECTED_NUM_OF_COLUMS:
            input_str = prompt_for_change(row, lineno, tracefile)
            if input_str == '!':
                shall_quit = True
                break
            if input_str == "$":
                row = guess_fix(row)
                break
            separator_indexes = convert_input_to_indexes(input_str)
            row = adapt_row(row, separator_indexes)
        writer.writerow(row)

I tested my code with following input

miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,john,paul,peter and mary,2019-01-03,0031205705200
walker,beam or james,paula,ann or jill,2019-01-03,2010-01-04 or any other date,0012127089400

and when entering '12' for line 3
and '024' for line 4

I'd get following csv output:

miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,"john,paul,peter and mary",2019-01-03,0031205705200
"walker,beam or james","paula,ann or jill","2019-01-03,2010-01-04 or any other date",0012127089400

SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

A few lines of good data and the entire 1.5MB bad file will do.

Duncan Roe

Libreoffice is a free tool that can read spreadsheets. It's a complete Office drop-in, very popular on Windows as a replacement for Microsoft's product. There are Linux binaries (which I usually install) or you can build from source

if you give enough info it could be possible to write code, that fixes all lines where it is sure about its guessing and that prompts only the lines where guessing is ambigiuous.

Please tell me whether you can run my python script. It must be run with python3.
If your csv doesn't contain unicode characters and you have only python2, then add following lines after line 1

from __future__ import print_function

import sys
if sys.version_info < (3,):
    input = raw_input

and the script should also run with python2

ASKER

Yep got OpenOffice running in 1 window

Sorry I'm not able to upload even a edited version as it's on a closed system
The originating system is no longer in existance

As I said I could make something up if it'll help

It would be very helpful if you could explain for EACH column waht can be expected.

Good Question well presented

first 4 numbers
3 text
1 text looks empty
date
3 text problem columns possibly coma separated list no set format I'm guessing where these should be seperated
text free text this is an anchor because someone has written comments here also may have comas
number if filled
text [A-Z_0-9}{3}/\d+ if filled
5 text if filled
2 date if filled
9 text if filled

hope that helps or do you need to make up an example?

BTW:
On this system I'm running fedora I have perl but not python

So, this isn't a Windows 7 environment?

As mentioned in my earlier post, Visual Studio Code can do side by side, as shown in the image below. It's not perfect, but might be worth a look. It's free, and supports multiple platforms including Windows and Linux (I've only used on Windows).

»bp

Can you please check / verify / correct: I just see 31 columns

A,B,C, D numbers
E,F, G text without commas
H empty text
I date
J, K, L text, that might contain commas (anything, that might help, which text belongs to which column?)
M text that might contain commas (any way to detect start / end of this column and distinguish it from J,K, L?)
N empty or number
O empty or [A-Z_0-9}{3}/\d+
P, Q, R, S, T empty or text without commas
U,V empty or date
W, X, Y, Z, AA, AB, AC, AD, AE text without comma

A few more questions:
Q1: Is it of any help to post python code if you cannot install python on your closed system
Q2: DO you have no way at all of exchanging files between the closed system and other machines?
No ssh / ftp / TCP port open between your work station and the closed system?
No USB stick to copy files to?
Any remote access (RDP / VNC where you can copy paste at least something)

Does that mean, that if we would provide you a perl script you had to type it manually into your closed system?

@gelonida

I just see 31 columns

Were you expecting more or fewer columns?

@aikimark:

Trevor wrote:

My original CSV file was 110mb it contained 32 real columns with a header row

good catch

ASKER

I've just run this I'm now left whit just over 500 lines

use strict;

my @strs = ("XX33|11-1 ED 2", "XX 33|11-1 ED 2");

my $regex = qr/([A-Z]{2}\s*\d{2})\|(\d{2}-\d+\sED\s\d*)/mp;

foreach my $str (@strs){
if ( $str =~ s/$regex/$1,$2/g ) {
  print "str out $str\n";

}
}

Generally I can copy from here to the closed system but not the other way round

Can you please check / verify / correct: I just see 31 columns

A,B,C, D numbers
E,F, G,H text without commas
I empty text
J date
K, L, M text, that might contain commas (anything, that might help, which text belongs to which column?)
(Sorry that is not easy to define the above regex fixed column L) K had a different pattern and M was empty
Other rows have different list patterns of varying length some of these I recognize others I'm guessing )
N text that might contain commas (any way to detect start / end of this column and distinguish it from K, L,M?)
Only that it is a sentence not a code list
O empty or number
P empty or [A-Z_0-9}{3}/\d+
Q, R, S, T,U empty or text without commas
V, W empty or date
X, Y, Z, AA, AB, AC, AD, AE text without comma or empty

AF if not empty should be uppercase with '/' Last column

SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

Well without detailed infos we can mostly help you restructuring your thoughts and suggest generic tools

Yes knowing I couldn't upload even a sample data set my opening question was to do with if software was available that had flexibility of good programmers text editor and could also view delimited data in a spreed sheet like format

BP has suggested visualstudio which I'm hopping to get installed today

I hadn't expected this level of response consequently unsure which post should be the winner

I hadn't expected this level of response consequently unsure which post should be the winner

Just so you know, it doesn't have to be just one, you can select multiple answers as solutions...

»bp

ASKER

Hi bp
Currently trying to install Visual Studio Code on the system, Although I can boot into Win7 if necessary I've very little software on this OS most is in Fedora
Hadn't released Visual Studio Code had dependancies

Any ideas where to get the Excel Viewer add-in you mentioned?

This page sugests Excel Viewer has been retired

Excel Viewer - Visual Studio Marketplace