trevor1940
asked on
Editing CSV file
Hi
I have an unquoted CSV file that also has columns / cells with comas in
I need to fix it prior to ingesting into a database (This is it, it can't be generated properly)
I've wrote a short perl script isolate the rows with too many columns outputting the deliminator as a '|' pipe
I now need to edit the rest
Bringing the file into excel or open office calc can help to work out where the problems with each row is because you can see patterns in the data BUT totally messes up formatting like leading zeros and dates
Using a plane text editor keeps this formatting but is sole destroying looking at 100's lines of CSV data comparing with the spreadsheet and error prone
Any one know of a CSV editor that can do both?
In 1 pane have an text /CSV editor and as the deliminator is removed or added dynamically updates a spreadsheet like interface so you can see instantly the column data lines up
I have an unquoted CSV file that also has columns / cells with comas in
I need to fix it prior to ingesting into a database (This is it, it can't be generated properly)
I've wrote a short perl script isolate the rows with too many columns outputting the deliminator as a '|' pipe
I now need to edit the rest
Bringing the file into excel or open office calc can help to work out where the problems with each row is because you can see patterns in the data BUT totally messes up formatting like leading zeros and dates
Using a plane text editor keeps this formatting but is sole destroying looking at 100's lines of CSV data comparing with the spreadsheet and error prone
Any one know of a CSV editor that can do both?
In 1 pane have an text /CSV editor and as the deliminator is removed or added dynamically updates a spreadsheet like interface so you can see instantly the column data lines up
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
To just edit raw .csv files use vi/vim or any other text based editor.
The ordinal file before a ran my simple script was coma separated
my Bad output file converted these to pipes
The problem with using a simple editor (I'm Using Katle BTW) is you can't see the patterns in the data the way you can in a spreadsheet
EG Col G is open text containing comas/pipes coll Y is a date you can see in the Spreedsheet some rows it's Coll Z ++
In a Text editor I see
......FOO|BAR|BISH|BOSH|||
it should be
......FOO,BAR,BISH,BOSH|||
I need to edit each line manually as each row is different so pattern matching would be so complex
BP
Unfortunately I haven't got Visual Studio on this PC I had thought about writing something in .NET on another PC
I was hopping someone had come across the problem before
Visual Studio Code is Microsofts free text editor, not the same as the commercial Visual Studio coding IDE / suite. It's free and is a good programmers text editor.
Visual Studio Code - Code Editing. Redefined
»bp
Visual Studio Code - Code Editing. Redefined
»bp
Do I understand correctly?
You got a CSV file which should have n columns per row.
Due to the fact,
that the CSV generator was broken, strings containing commas weren't surrounded by double quotes, every line containing m commas results in a line containing n + m columns instead of the desired n.
I assume regenerating the CSV with a proper CSV writer is out of question?
You look for a solution to identify all rows with that problem and a simple way to say, which ',' is NOT a separator?
I could imagine a small script, that reads the file line by line and asks you for every line which ',' should not be interpreted as a comma and writes out a file that adds the required escaping (double quotes) or changes all ',' that should be separators with tabs.
example let's imagine the file
I could imagine a script, that would just copy the first and second line and prompts you for the 3rd one in following way:
You got a CSV file which should have n columns per row.
Due to the fact,
that the CSV generator was broken, strings containing commas weren't surrounded by double quotes, every line containing m commas results in a line containing n + m columns instead of the desired n.
I assume regenerating the CSV with a proper CSV writer is out of question?
You look for a solution to identify all rows with that problem and a simple way to say, which ',' is NOT a separator?
I could imagine a small script, that reads the file line by line and asks you for every line which ',' should not be interpreted as a comma and writes out a file that adds the required escaping (double quotes) or changes all ',' that should be separators with tabs.
example let's imagine the file
miller, peter, 2019-01-01
smith, mr, 2019-01-02
jones, john, paul, peter and mary, 20190-01-03
I could imagine a script, that would just copy the first and second line and prompts you for the 3rd one in following way:
jones, john, paul, peter and mary, 20190-01-03
a b c d 20190-01-03
So it names each comma with an id (a,b,c,d) and you hust have to answer which commas you'd like to change.
in that case you'd answer "bc" which means, that the comma with the id b and the comma with the id c should be commas and not separators and will write the result accordingly.
Would something like this help? It would handle 26 - number of columns commas per line and could be enhanced to more if necessary.
If yes, I might help to write a small python script doing that.
It would be helpful if you posted a representative sample of the file.
Yes please do supply a sample, the bigger the better. I can imagine editing the file to flag those lines which have the wrong number of delimiters (i.e columns - 1). You could subsequently edit the flagged file with a normal editor.
With your sample, please indicate how many columns / delimiters there are / should be.
With your sample, please indicate how many columns / delimiters there are / should be.
ASKER
Hi
Up loading a sample isn't possible I could probably make something up though
My original CSV file was 110mb it contained 32 real columns with a header row
I wrote a small perl script that counted the , in each row if a row had the same number as the header row it went to a good file if not it went to a bad file and changed the coma to a pipe
I now have a 1.5 mb file that I need to "clean" row by row
1 problem I don't know which column is incorrect, on one row it could be G another it's, H, M or T
I found 1 row where there were over 100 columns
If data exists in J it will be a specific format
G & H are reference lists
M is free text
T is another list
Some columns are dates
From these clues you can work out what should be comas or remain as pipes (This is helped if can see it in a spreedsheet)
I don't think additional scripting will help apart from find & replace in a text editor
I'm surprised no one has had a similar problem before , I have not quite on this scale
Am I correct in thinking there isn't a piece of software out there that has both free text editor and spreedsheet like viewer?
Up loading a sample isn't possible I could probably make something up though
My original CSV file was 110mb it contained 32 real columns with a header row
I wrote a small perl script that counted the , in each row if a row had the same number as the header row it went to a good file if not it went to a bad file and changed the coma to a pipe
I now have a 1.5 mb file that I need to "clean" row by row
1 problem I don't know which column is incorrect, on one row it could be G another it's, H, M or T
I found 1 row where there were over 100 columns
If data exists in J it will be a specific format
G & H are reference lists
M is free text
T is another list
Some columns are dates
From these clues you can work out what should be comas or remain as pipes (This is helped if can see it in a spreedsheet)
I don't think additional scripting will help apart from find & replace in a text editor
I'm surprised no one has had a similar problem before , I have not quite on this scale
Am I correct in thinking there isn't a piece of software out there that has both free text editor and spreedsheet like viewer?
Well here a script, (I hope you can run python (python3 or python2 if you read my other posts)) that prompts for each bad line and allows you to choose which columns to join
The code could also suggest guesses, but for writing a decent guessing function we would really need some example lines.
We do not need the entire 1.5 MB, but at least a few examples of lines and expected results for the different cases, that you showed.
I tested my code with following input
and when entering '12' for line 3
and '024' for line 4
I'd get following csv output:
The code could also suggest guesses, but for writing a decent guessing function we would really need some example lines.
We do not need the entire 1.5 MB, but at least a few examples of lines and expected results for the different cases, that you showed.
#!/usr/bin/env python
import csv
EXPECTED_NUM_OF_COLUMS = 4
IDS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def get_input():
while True:
line = input().strip()
if line and not line.startswith("#"):
break
return line
def convert_input_to_indexes(separator_ids):
""" returns list of reverse indexes if input is valid,
None otherwise
"""
separator_indexes = []
for separator_id in separator_ids:
try:
idx = IDS[:len(row) - 1].index(separator_id)
except:
print("separator id %s is out of range (should be between"
"'a' and '%s'" % (separator_id, IDS[len(row)-1]))
return None
separator_indexes.append(idx)
print(separator_indexes)
return sorted(separator_indexes, reverse=True)
def adapt_row(row, indexes):
new_row = list(row)
print(row, new_row)
for index in indexes:
new_row[index:index+2] = [new_row[index] + "," + new_row[index + 1]]
print(new_row)
return new_row
def guess_fix(row):
""" here code could be added to guess fix, but for that
we'd really need some example lines
"""
# The current guess assumes the first , is a real ,
new_row = list(row)
new_row[0:2] = [new_row[0] + "," + new_row[1]]
return row
def prompt_for_change(row, lineno, tracefile):
""" display bad line and ask which commas should be real commas
wait for user input, change lines accordingly
and write a trace file, that can be piped back into this script
as a replay in case an error was made during the first run
"""
def print_and_log(msg):
""" prints a message and adds it to the trace file """
print(msg)
tracefile.write("# " + msg + "\n")
# report the bad line and prompt
print_and_log("line %5d has %2d columns instead of %d" %
(lineno, len(row), EXPECTED_NUM_OF_COLUMS))
print_and_log("please choose which separator should be considered as comma")
print_and_log(" " + row[0])
for field_id, field in zip(IDS, row[1:]):
print_and_log("%s: %s" % (field_id, field))
print_and_log("")
guessed_fix = guess_fix(row)
print("My guessed fix would be", guessed_fix)
print("Please enter the separators (a-%s) to convert to a real comma"
" followed by CR (enter '!' to quit (and copy all other lines as is) or '$' to choose the automatic 'guess'" % (IDS[len(row)-1]), end=": ")
input_str = get_input()
tracefile.write(input_str + "\n")
return input_str
with open("input.csv") as fin,\
open("output.csv", "w") as fout,\
open("changetrace.log", "w") as tracefile:
reader = csv.reader(fin, delimiter=",")
writer = csv.writer(fout, delimiter=",", quotechar='"')
shall_quit = False
for lineno, row in enumerate(reader, 1):
while not shall_quit and len(row) > EXPECTED_NUM_OF_COLUMS:
input_str = prompt_for_change(row, lineno, tracefile)
if input_str == '!':
shall_quit = True
break
if input_str == "$":
row = guess_fix(row)
break
separator_indexes = convert_input_to_indexes(input_str)
row = adapt_row(row, separator_indexes)
writer.writerow(row)
I tested my code with following input
miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,john,paul,peter and mary,2019-01-03,0031205705200
walker,beam or james,paula,ann or jill,2019-01-03,2010-01-04 or any other date,0012127089400
and when entering '12' for line 3
and '024' for line 4
I'd get following csv output:
miller,peter,2019-01-01,00442073238299
smith,mr,2019-01-02,0033140205317
jones,"john,paul,peter and mary",2019-01-03,0031205705200
"walker,beam or james","paula,ann or jill","2019-01-03,2010-01-04 or any other date",0012127089400
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
A few lines of good data and the entire 1.5MB bad file will do.
Libreoffice is a free tool that can read spreadsheets. It's a complete Office drop-in, very popular on Windows as a replacement for Microsoft's product. There are Linux binaries (which I usually install) or you can build from source
if you give enough info it could be possible to write code, that fixes all lines where it is sure about its guessing and that prompts only the lines where guessing is ambigiuous.
Please tell me whether you can run my python script. It must be run with python3.
If your csv doesn't contain unicode characters and you have only python2, then add following lines after line 1
Please tell me whether you can run my python script. It must be run with python3.
If your csv doesn't contain unicode characters and you have only python2, then add following lines after line 1
from __future__ import print_function
import sys
if sys.version_info < (3,):
input = raw_input
and the script should also run with python2
ASKER
Yep got OpenOffice running in 1 window
Sorry I'm not able to upload even a edited version as it's on a closed system
The originating system is no longer in existance
As I said I could make something up if it'll help
first 4 numbers
3 text
1 text looks empty
date
3 text problem columns possibly coma separated list no set format I'm guessing where these should be seperated
text free text this is an anchor because someone has written comments here also may have comas
number if filled
text [A-Z_0-9}{3}/\d+ if filled
5 text if filled
2 date if filled
9 text if filled
hope that helps or do you need to make up an example?
BTW:
On this system I'm running fedora I have perl but not python
Sorry I'm not able to upload even a edited version as it's on a closed system
The originating system is no longer in existance
As I said I could make something up if it'll help
It would be very helpful if you could explain for EACH column waht can be expected.Good Question well presented
first 4 numbers
3 text
1 text looks empty
date
3 text problem columns possibly coma separated list no set format I'm guessing where these should be seperated
text free text this is an anchor because someone has written comments here also may have comas
number if filled
text [A-Z_0-9}{3}/\d+ if filled
5 text if filled
2 date if filled
9 text if filled
hope that helps or do you need to make up an example?
BTW:
On this system I'm running fedora I have perl but not python
So, this isn't a Windows 7 environment?
Can you please check / verify / correct: I just see 31 columns
A,B,C, D numbers
E,F, G text without commas
H empty text
I date
J, K, L text, that might contain commas (anything, that might help, which text belongs to which column?)
M text that might contain commas (any way to detect start / end of this column and distinguish it from J,K, L?)
N empty or number
O empty or [A-Z_0-9}{3}/\d+
P, Q, R, S, T empty or text without commas
U,V empty or date
W, X, Y, Z, AA, AB, AC, AD, AE text without comma
A few more questions:
Q1: Is it of any help to post python code if you cannot install python on your closed system
Q2: DO you have no way at all of exchanging files between the closed system and other machines?
No ssh / ftp / TCP port open between your work station and the closed system?
No USB stick to copy files to?
Any remote access (RDP / VNC where you can copy paste at least something)
Does that mean, that if we would provide you a perl script you had to type it manually into your closed system?
A,B,C, D numbers
E,F, G text without commas
H empty text
I date
J, K, L text, that might contain commas (anything, that might help, which text belongs to which column?)
M text that might contain commas (any way to detect start / end of this column and distinguish it from J,K, L?)
N empty or number
O empty or [A-Z_0-9}{3}/\d+
P, Q, R, S, T empty or text without commas
U,V empty or date
W, X, Y, Z, AA, AB, AC, AD, AE text without comma
A few more questions:
Q1: Is it of any help to post python code if you cannot install python on your closed system
Q2: DO you have no way at all of exchanging files between the closed system and other machines?
No ssh / ftp / TCP port open between your work station and the closed system?
No USB stick to copy files to?
Any remote access (RDP / VNC where you can copy paste at least something)
Does that mean, that if we would provide you a perl script you had to type it manually into your closed system?
@gelonida
I just see 31 columnsWere you expecting more or fewer columns?
@aikimark:
Trevor wrote:
Trevor wrote:
My original CSV file was 110mb it contained 32 real columns with a header row
good catch
ASKER
I've just run this I'm now left whit just over 500 lines
Generally I can copy from here to the closed system but not the other way round
A,B,C, D numbers
E,F, G,H text without commas
I empty text
J date
K, L, M text, that might contain commas (anything, that might help, which text belongs to which column?)
(Sorry that is not easy to define the above regex fixed column L) K had a different pattern and M was empty
Other rows have different list patterns of varying length some of these I recognize others I'm guessing )
N text that might contain commas (any way to detect start / end of this column and distinguish it from K, L,M?)
Only that it is a sentence not a code list
O empty or number
P empty or [A-Z_0-9}{3}/\d+
Q, R, S, T,U empty or text without commas
V, W empty or date
X, Y, Z, AA, AB, AC, AD, AE text without comma or empty
AF if not empty should be uppercase with '/' Last column
use strict;
my @strs = ("XX33|11-1 ED 2", "XX 33|11-1 ED 2");
my $regex = qr/([A-Z]{2}\s*\d{2})\|(\d{2}-\d+\sED\s\d*)/mp;
foreach my $str (@strs){
if ( $str =~ s/$regex/$1,$2/g ) {
print "str out $str\n";
}
}
Generally I can copy from here to the closed system but not the other way round
Can you please check / verify / correct: I just see 31 columns
A,B,C, D numbers
E,F, G,H text without commas
I empty text
J date
K, L, M text, that might contain commas (anything, that might help, which text belongs to which column?)
(Sorry that is not easy to define the above regex fixed column L) K had a different pattern and M was empty
Other rows have different list patterns of varying length some of these I recognize others I'm guessing )
N text that might contain commas (any way to detect start / end of this column and distinguish it from K, L,M?)
Only that it is a sentence not a code list
O empty or number
P empty or [A-Z_0-9}{3}/\d+
Q, R, S, T,U empty or text without commas
V, W empty or date
X, Y, Z, AA, AB, AC, AD, AE text without comma or empty
AF if not empty should be uppercase with '/' Last column
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Well without detailed infos we can mostly help you restructuring your thoughts and suggest generic tools
Yes knowing I couldn't upload even a sample data set my opening question was to do with if software was available that had flexibility of good programmers text editor and could also view delimited data in a spreed sheet like format
BP has suggested visualstudio which I'm hopping to get installed today
I hadn't expected this level of response consequently unsure which post should be the winner
I hadn't expected this level of response consequently unsure which post should be the winnerJust so you know, it doesn't have to be just one, you can select multiple answers as solutions...
»bp
ASKER
Hi bp
Currently trying to install Visual Studio Code on the system, Although I can boot into Win7 if necessary I've very little software on this OS most is in Fedora
Hadn't released Visual Studio Code had dependancies
Any ideas where to get the Excel Viewer add-in you mentioned?
This page sugests Excel Viewer has been retired
Currently trying to install Visual Studio Code on the system, Although I can boot into Win7 if necessary I've very little software on this OS most is in Fedora
Hadn't released Visual Studio Code had dependancies
Any ideas where to get the Excel Viewer add-in you mentioned?
This page sugests Excel Viewer has been retired
ASKER
Thanx for your help
Yes I know its a late contribution but replace ships with mysql.
E.G. replace "|" "," -- ../my.csv
E.G. replace "|" "," -- ../my.csv
Open in new window
Then adjust Text::CSV_XS input settings to ingest your .csv file data.
Then output data however you like, by adjusting Text::CSV_XS output settings.
2) To just edit raw .csv files use vi/vim or any other text based editor.