Swadhin Ray
asked on
remove double quotes from CSV file
Hello Experts,
I have a CSV file which contains the data like :
Now few fields are missing the second enclosed double quotes.
I want to write a script which will take the file name and will remove the double quotes only for the missing fields.
Boz I want to load the file into my oracle database which is causing me error due to missing second enclosed double quotes .
I have a CSV file which contains the data like :
"a","a,a","a"a",","a"","a"
Now few fields are missing the second enclosed double quotes.
I want to write a script which will take the file name and will remove the double quotes only for the missing fields.
Boz I want to load the file into my oracle database which is causing me error due to missing second enclosed double quotes .
Which of those double quotes do you want to remove?
If it's just the ,", your looking to change to ,, try:
Or:
sed -i 's/,",/,,/g' filename.csv
Or:
perl -pe 's/,",/,,/g' filename.csv > filename_new.csv
What about the "a,a" (the second and third a). Should those also be two elements?
The problem with unreliable sources is that there may be no rules to transform it correctly. It depends on how regular is the data -- if you can define alternative rules that help to fix the problem. For example, it can be fairly easy if any string element never contains a comma.
The mess should be cleaned as early as possible. Can you fix the code that generates the .csv file?
Can you attach a real fragment of the file (you can change all letters to, say, a)? Otherwise, it is difficult to say what can be done.
The problem with unreliable sources is that there may be no rules to transform it correctly. It depends on how regular is the data -- if you can define alternative rules that help to fix the problem. For example, it can be fairly easy if any string element never contains a comma.
The mess should be cleaned as early as possible. Can you fix the code that generates the .csv file?
Can you attach a real fragment of the file (you can change all letters to, say, a)? Otherwise, it is difficult to say what can be done.
ASKER
You can find my sample file in the question that I have posted :
https://www.experts-exchange.com/questions/27869641/SQL-Loader-issue.html
https://www.experts-exchange.com/questions/27869641/SQL-Loader-issue.html
OK. I have it. But when reading the file, I can observe no problem. All records have exactly 21 elements. Can you extract few lines for which you observe problems?
Here is a Python snippet that just reads the file and prints separately the header and the rows with non-zero (as string) used for the second element:
import csv
with open('test.csv', 'rb') as f:
reader = csv.reader(f)
header = next(reader)
print header
print '========================================='
for row in reader:
if row[1] != '0':
print row
##if len(row) != 21:
## print len(row), row
ASKER
The first record it self getting me an issue :
"","0","","xxxxxxx_IT","20 12/08/29 19:57:51","0","3040","1197 984991","U PDATE-Tick et State","ts00","0","2012/08 /29 19:57:51","000000044960961 ","Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now “Ready to Close”. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxx x.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************** ********** ","1","*", "ts","0000 0012217317 3","4","20 12/08/29 19:57:51","2012/08/29 19:57:51"
Notification Feedback Log field it self having double quotes in between which is causing me an issue. Which is also having first and last with double quotes.
Now if I want to load the file then I need to read like :
Where as the column is having 4 double quotes which is stopping me to load the data into oracle database.
"","0","","xxxxxxx_IT","20
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now “Ready to Close”. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxx
Notification Feedback Log field it self having double quotes in between which is causing me an issue. Which is also having first and last with double quotes.
Now if I want to load the file then I need to read like :
fields terminated by ","
optionally enclosed by '"'
Where as the column is having 4 double quotes which is stopping me to load the data into oracle database.
ASKER
When I run the python code it gives me :
invalid mode ('rb')
invalid mode ('rb')
ASKER
This is my shell script which I use it to load flat files into my oracle database :
Now if I want to incorporate the python script to this shell script then how do i do that .
So that it should replace any double quote that wasn't at the beginning or end .
So that my field value should look like :
#!/bin/sh
export ORACLE_BASE=/opt/app/oracle
export ORACLE_SID=MYSID
export ORACLE_HOME=$ORACLE_BASE/product/11.1.0
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export ORACLE_PATH=$ORACLE_HOME/bin
export PATH=$ORACLE_HOME/bin:$PATH
sqlldr userid=username$ORACLE_SID/password control=/home/myfiles/control.ctl log=/home/myfiles/test.log skip=1 data= /home/myfiles/test.csv
Now if I want to incorporate the python script to this shell script then how do i do that .
So that it should replace any double quote that wasn't at the beginning or end .
So that my field value should look like :
"Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now Ready to Close. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxxx.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************************"
What version of Python do you use? (Just type python and see at the first lines of the interactive mode or the "python --version". The script above was written for Python 2.x. Then the open(fname, 'rb') means read in binary mode and it should work.
For the quotes around “Ready to Close”. Actually, they are not ASCII quotes. This way they do not affect the Python csv module. However, the file is in UTF-8 (outside of ASCII), and your Oracle table expects VARCHAR2. You should use NVARCHAR2 for the purpose, or you have to get rid somehow of the non-ASCII characters.
For reading the csv correctly by Python, it depends on whether you use Python 2.x or Python 3.x. For Python 2, the binary data (old strings) should be decoded from UTF-8 encoding and the Unicode strings should be processed later. For Python 3, the file should not be open in binary mode and the encoding should be passed also. The result of reading will also be Unicode strings.
For the quotes around “Ready to Close”. Actually, they are not ASCII quotes. This way they do not affect the Python csv module. However, the file is in UTF-8 (outside of ASCII), and your Oracle table expects VARCHAR2. You should use NVARCHAR2 for the purpose, or you have to get rid somehow of the non-ASCII characters.
For reading the csv correctly by Python, it depends on whether you use Python 2.x or Python 3.x. For Python 2, the binary data (old strings) should be decoded from UTF-8 encoding and the Unicode strings should be processed later. For Python 3, the file should not be open in binary mode and the encoding should be passed also. The result of reading will also be Unicode strings.
Back to the trees. I am wrong with UTF-8 (I was probably looking at the result of some coversion via editor). Anyway, the principe is the same. You need to work with some encoding. When working with Python 3, csv.reader wants to return strings (i.e. the Unicode strings).
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
@arober11:
Thanks for the information. But ' ~ ' is also having in the same field ...
What I understand that you are asking to make '~' in first and last for all the fields.
But what if the same field values will have this '~' on them ?
Thanks for the information. But ' ~ ' is also having in the same field ...
What I understand that you are asking to make '~' in first and last for all the fields.
But what if the same field values will have this '~' on them ?
You'll hit the same issue / error, ideally you want whatever is creating the CSV files to filter the fields to ensure they don't contain your field enclosing characters.
I do not know what tool you want to process the csv file. However Python consumes it quite happily. To remove the non-ASCII quotes from rows that you get through reading the test.csv, you can replace the known unwanted characters like this:
import csv
with open('test.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
# Replace the non-ASCII quotes in the row[13].
s = row[13].replace('\x93', '') # the opening quote
row[13] = s.replace('\x94', '') # the closing quote
# Process the row...
print row[13]
ASKER
I just wanted to add one little logic for this as to add "|" in opening and closing field or like
Fields terminated by '|' symbol.
Then I can check if that works for me..
Fields terminated by '|' symbol.
Then I can check if that works for me..
ASKER
I am able to terminate fields with '|' with the below code:
Here I am able to print the rows but I want to save the file with a new name one the same path.
Any suggestion ..
import csv
csvFile = open(r'C:\Users\sloba\Desktop\MY_DIR\test.csv').read().replace('","', '|')
csvReader = csv.reader(csvFile, delimiter='|')
for row in csvReader:
print row
Here I am able to print the rows but I want to save the file with a new name one the same path.
Any suggestion ..
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
I will test this and update it.
But the replacement is probably not going to help you with the problem of non-ASCII quotes anyway.
ASKER
I want to make that to save as .txt file
ASKER
I think just need few changes on this code which will work for me :
To add a logic to make it a text file in place of .csv and then remove ' " ' double quotes stating and end of the entire rows.
import os
# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = fname + 'X' + extension # X in front of the extension dot
# Just to see it.
print fnameIn
print fnameOut
# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
f.close()
# Open the output file, write the content, close.
f = open(fnameOut, 'w')
f.write(content)
f.close()
To add a logic to make it a text file in place of .csv and then remove ' " ' double quotes stating and end of the entire rows.
ASKER
I am able to convert it into .txt file from this below code:
import os
# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = r'C:\Users\sloba\Desktop\MY_DIR\test.txt' # export into .txt file
# Just to see it.
print fnameIn
print fnameOut
# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
content = content.replace('\x93', '') # the opening quote
content = content.replace('\x94', '') # the closing quote
f.close()
# Open the output file, write the content, close.
f = open(fnameOut, 'w')
f.write(content)
f.close()
ASKER
Only change is now required for removing the
content = content.replace('\x93', '') # the opening quote
content = content.replace('\x94', '') # the closing quote
for example in my .txt file I am getting like :
from
to
content = content.replace('\x93', '') # the opening quote
content = content.replace('\x94', '') # the closing quote
for example in my .txt file I am getting like :
from
"Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM"
to
Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM
But the replacement of "," is quite unrelated to .replace('\x93', ''). The .replace() is used to get rid of the non-ASCII characters.
The first double quote is not such, and there also is the final double quote that should be removed. This way you only change the delimiter and quoting of the original csv file thus making the process of reading the csv file more complicated. Actually, there were no problems with reading the original csv file. There probably was only the problem with the 14th elemet containing non-ASCII characters (row[13]). After a lot of extra work, you will get the same as in http:Q_27869822.html#a38416837
The first double quote is not such, and there also is the final double quote that should be removed. This way you only change the delimiter and quoting of the original csv file thus making the process of reading the csv file more complicated. Actually, there were no problems with reading the original csv file. There probably was only the problem with the 14th elemet containing non-ASCII characters (row[13]). After a lot of extra work, you will get the same as in http:Q_27869822.html#a38416837
ASKER
Shared the points as everyone has contributed to get an idea on my issue .
Also please give example as to what exactly you would like to do-from input to output