Link to home
Start Free TrialLog in
Avatar of Swadhin Ray
Swadhin RayFlag for United States of America

asked on

remove double quotes from CSV file

Hello Experts,

I have a CSV file which contains the data like :

"a","a,a","a"a",","a"","a" 

Open in new window



Now few fields are missing the second enclosed double quotes.

I want to write a script which will take the file name and will remove the double quotes only for the missing fields.

Boz I want to load  the file into my oracle database which is causing me error due to missing second enclosed double quotes  .
Avatar of farzanj
farzanj
Flag of Canada image

Looks like there are un escaped double quotes as well within the fields.

Also please give example as to what exactly you would like to do-from input to output
Which of those double quotes do you want to remove?
If it's just the ,",  your looking to change to ,, try:

sed -i  's/,",/,,/g'  filename.csv

Open in new window


Or:

perl -pe  's/,",/,,/g'  filename.csv > filename_new.csv

Open in new window

Avatar of pepr
pepr

What about the "a,a" (the second and third a). Should those also be two elements?

The problem with unreliable sources is that there may be no rules to transform it correctly. It depends on how regular is the data -- if you can define alternative rules that help to fix the problem. For example, it can be fairly easy if any string element never contains a comma.

The mess should be cleaned as early as possible. Can you fix the code that generates the .csv file?

Can you attach a real fragment of the file (you can change all letters to, say, a)? Otherwise, it is difficult to say what can be done.
Avatar of Swadhin Ray

ASKER

You can find my sample file in the question that I have posted :

https://www.experts-exchange.com/questions/27869641/SQL-Loader-issue.html
OK. I have it. But when reading the file, I can observe no problem. All records have exactly 21 elements. Can you extract few lines for which you observe problems?
Here is a Python snippet that just reads the file and prints separately the header and the rows with non-zero (as string) used for the second element:
import csv

with open('test.csv', 'rb') as f:
    reader = csv.reader(f)
    header = next(reader)
    print header
    print '========================================='
    for row in reader:
       if row[1] != '0':
           print row
       ##if len(row) != 21:
       ##    print len(row), row

Open in new window

The first record it self getting me an issue :

"","0","","xxxxxxx_IT","2012/08/29 19:57:51","0","3040","1197984991","UPDATE-Ticket State","ts00","0","2012/08/29 19:57:51","000000044960961","Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now “Ready to Close”. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxxx.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************************","1","*","ts","000000122173173","4","2012/08/29 19:57:51","2012/08/29 19:57:51"

Notification Feedback Log field it self having double quotes in between which is causing me an issue.  Which is also having first and last with double quotes.

Now if I want to load the file then I need to read like :

fields terminated by "," 
 optionally enclosed by '"' 

Open in new window


Where as the column is having 4 double quotes which is stopping me to load the data into oracle database.
When I run the python code it gives me :

invalid mode ('rb')
This is my shell script which I use it to load flat files into my oracle database :

#!/bin/sh
export ORACLE_BASE=/opt/app/oracle
export ORACLE_SID=MYSID
export ORACLE_HOME=$ORACLE_BASE/product/11.1.0
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export ORACLE_PATH=$ORACLE_HOME/bin
export PATH=$ORACLE_HOME/bin:$PATH

sqlldr userid=username$ORACLE_SID/password control=/home/myfiles/control.ctl log=/home/myfiles/test.log skip=1 data= /home/myfiles/test.csv

Open in new window



Now if I want to incorporate the python script to this shell script then how do i do that .
So that it should replace any double quote that wasn't at the beginning or  end .

So that my field value should look like :

"Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now Ready to Close. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxxx.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************************"

Open in new window

What version of Python do you use? (Just type python and see at the first lines of the interactive mode or the "python --version". The script above was written for Python 2.x. Then the open(fname, 'rb') means read in binary mode and it should work.

For the quotes around “Ready to Close”. Actually, they are not ASCII quotes. This way they do not affect the Python csv module. However, the file is in UTF-8 (outside of ASCII), and your Oracle table expects VARCHAR2. You should use NVARCHAR2 for the purpose, or you have to get rid somehow of the non-ASCII characters.

For reading the csv correctly by Python, it depends on whether you use Python 2.x or Python 3.x. For Python 2, the binary data (old strings) should be decoded from UTF-8 encoding and the Unicode strings should be processed later. For Python 3, the file should not be open in binary mode and the encoding should be passed also. The result of reading will also be Unicode strings.
Back to the trees. I am wrong with UTF-8 (I was probably looking at the result of some coversion via editor). Anyway, the principe is the same. You need to work with some encoding. When working with Python 3, csv.reader wants to return strings (i.e. the Unicode strings).
SOLUTION
Avatar of arober11
arober11
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@arober11:

Thanks for the information. But ' ~ ' is also having in the same field ...

What I understand that you are asking to make '~' in first and last for all the fields.

But what if the same field values will have this '~' on them ?
You'll hit the same issue / error, ideally you want whatever is creating the CSV files to filter the fields to ensure they don't contain your field enclosing characters.
I do not know what tool you want to process the csv file. However Python consumes it quite happily. To remove the non-ASCII quotes from rows that you get through reading the test.csv, you can replace the known unwanted characters like this:
import csv

with open('test.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
    
       # Replace the non-ASCII quotes in the row[13].
       s = row[13].replace('\x93', '')  # the opening quote
       row[13] = s.replace('\x94', '')  # the closing quote
       
       # Process the row...
       print row[13]

Open in new window

I just wanted to add one little logic for this as to add "|" in opening and closing field or like
Fields terminated by '|' symbol.

Then I can check if that works for me..
I am able to terminate fields with '|' with the below code:

import csv
csvFile = open(r'C:\Users\sloba\Desktop\MY_DIR\test.csv').read().replace('","', '|')
csvReader = csv.reader(csvFile, delimiter='|')
for row in csvReader:
print row

Open in new window


Here I am able to print the rows but I want to save the file with a new name one the same path.

Any suggestion ..
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I will test this and update it.
But the replacement is probably not going to help you with the problem of non-ASCII quotes anyway.
I want to make that to save as .txt file
I think just need few changes on this code which will work for me :

import os

# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = fname + 'X' + extension           # X in front of the extension dot

# Just to see it.
print fnameIn
print fnameOut

# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
f.close()

# Open the output file, write the content, close.
f  = open(fnameOut, 'w')
f.write(content)
f.close()

Open in new window


To add a logic to make it a text file in place of .csv and then remove ' " ' double quotes stating and end of the entire rows.
I am able to convert it into .txt file from this below code:

import os
 
# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = r'C:\Users\sloba\Desktop\MY_DIR\test.txt'          # export into .txt file

# Just to see it.
print fnameIn
print fnameOut
 
# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
content = content.replace('\x93', '')  # the opening quote
content = content.replace('\x94', '')  # the closing quote
f.close()
 
# Open the output file, write the content, close.
f  = open(fnameOut, 'w')
f.write(content)
f.close()

Open in new window

Only change is now required for removing the
content = content.replace('\x93', '')  # the opening quote
content = content.replace('\x94', '')  # the closing quote

for example in my .txt file I am getting like :

from
"Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM"

Open in new window



to

Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM

Open in new window

But the replacement of "," is quite unrelated to .replace('\x93', ''). The .replace() is used to get rid of the non-ASCII characters.

The first double quote is not such, and there also is the final double quote that should be removed. This way you only change the delimiter and quoting of the original csv file thus making the process of reading the csv file more complicated. Actually, there were no problems with reading the original csv file. There probably was only the problem with the 14th elemet containing non-ASCII characters (row[13]). After a lot of extra work, you will get the same as in http:Q_27869822.html#a38416837
Shared the points as everyone has contributed to get an idea on my issue .