?
Solved

remove double quotes from CSV file

Posted on 2012-09-18
27
Medium Priority
?
2,485 Views
Last Modified: 2012-09-26
Hello Experts,

I have a CSV file which contains the data like :

"a","a,a","a"a",","a"","a" 

Open in new window



Now few fields are missing the second enclosed double quotes.

I want to write a script which will take the file name and will remove the double quotes only for the missing fields.

Boz I want to load  the file into my oracle database which is causing me error due to missing second enclosed double quotes  .
0
Comment
Question by:Swadhin Ray
  • 13
  • 9
  • 3
  • +2
27 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 38412067
Looks like there are un escaped double quotes as well within the fields.

Also please give example as to what exactly you would like to do-from input to output
0
 
LVL 85

Expert Comment

by:ozo
ID: 38412069
Which of those double quotes do you want to remove?
0
 
LVL 26

Expert Comment

by:arober11
ID: 38412474
If it's just the ,",  your looking to change to ,, try:

sed -i  's/,",/,,/g'  filename.csv

Open in new window


Or:

perl -pe  's/,",/,,/g'  filename.csv > filename_new.csv

Open in new window

0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 
LVL 29

Expert Comment

by:pepr
ID: 38412587
What about the "a,a" (the second and third a). Should those also be two elements?

The problem with unreliable sources is that there may be no rules to transform it correctly. It depends on how regular is the data -- if you can define alternative rules that help to fix the problem. For example, it can be fairly easy if any string element never contains a comma.

The mess should be cleaned as early as possible. Can you fix the code that generates the .csv file?

Can you attach a real fragment of the file (you can change all letters to, say, a)? Otherwise, it is difficult to say what can be done.
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38412734
You can find my sample file in the question that I have posted :

http://www.experts-exchange.com/Database/Oracle/Q_27869641.html
0
 
LVL 29

Expert Comment

by:pepr
ID: 38413092
OK. I have it. But when reading the file, I can observe no problem. All records have exactly 21 elements. Can you extract few lines for which you observe problems?
0
 
LVL 29

Expert Comment

by:pepr
ID: 38413129
Here is a Python snippet that just reads the file and prints separately the header and the rows with non-zero (as string) used for the second element:
import csv

with open('test.csv', 'rb') as f:
    reader = csv.reader(f)
    header = next(reader)
    print header
    print '========================================='
    for row in reader:
       if row[1] != '0':
           print row
       ##if len(row) != 21:
       ##    print len(row), row

Open in new window

0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38413136
The first record it self getting me an issue :

"","0","","xxxxxxx_IT","2012/08/29 19:57:51","0","3040","1197984991","UPDATE-Ticket State","ts00","0","2012/08/29 19:57:51","000000044960961","Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now “Ready to Close”. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxxx.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************************","1","*","ts","000000122173173","4","2012/08/29 19:57:51","2012/08/29 19:57:51"

Notification Feedback Log field it self having double quotes in between which is causing me an issue.  Which is also having first and last with double quotes.

Now if I want to load the file then I need to read like :

fields terminated by "," 
 optionally enclosed by '"' 

Open in new window


Where as the column is having 4 double quotes which is stopping me to load the data into oracle database.
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38413141
When I run the python code it gives me :

invalid mode ('rb')
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38413150
This is my shell script which I use it to load flat files into my oracle database :

#!/bin/sh
export ORACLE_BASE=/opt/app/oracle
export ORACLE_SID=MYSID
export ORACLE_HOME=$ORACLE_BASE/product/11.1.0
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export ORACLE_PATH=$ORACLE_HOME/bin
export PATH=$ORACLE_HOME/bin:$PATH

sqlldr userid=username$ORACLE_SID/password control=/home/myfiles/control.ctl log=/home/myfiles/test.log skip=1 data= /home/myfiles/test.csv

Open in new window



Now if I want to incorporate the python script to this shell script then how do i do that .
So that it should replace any double quote that wasn't at the beginning or  end .

So that my field value should look like :

"Auto Notification Feedback - Ticket: 000000122173173 Account: COMPANY Asset ID: Iphone Active Org: RTLSVCDESK-ALL
Message Sent: Trouble Ticket 000000122173173 Resolved::Your ticket is now Ready to Close. This either means the issue was resolved or we don’t have sufficient information to continue working on this ticket. If the issue persists, please contact us by Click-to-Chat (available on the RSS Helpdesk website) or by phone within 5 calendar days. If we do not receive a response, we will assume the issue is resolved and close this ticket. Ticket Number: 000000122173173 Ticket Status: Ready to Close Creation Date: Tue Mar 30 14:23:14 EDT 2010 Trouble Description: Iphone will not download the loop If this ticket requires escalation, have your manager e-mail RSSTicketEscalation@xxxxxxx.com. Thank you, RSS Helpdesk RSS Helpdesk Website: http://dashboardsuite.it.xxxxxxx.com:8180/blazeds/RSS/RSS.html# Phone: 877.448.6767 – Option 1, then Option 2 OR Option 3 ************************************"

Open in new window

0
 
LVL 29

Expert Comment

by:pepr
ID: 38413268
What version of Python do you use? (Just type python and see at the first lines of the interactive mode or the "python --version". The script above was written for Python 2.x. Then the open(fname, 'rb') means read in binary mode and it should work.

For the quotes around “Ready to Close”. Actually, they are not ASCII quotes. This way they do not affect the Python csv module. However, the file is in UTF-8 (outside of ASCII), and your Oracle table expects VARCHAR2. You should use NVARCHAR2 for the purpose, or you have to get rid somehow of the non-ASCII characters.

For reading the csv correctly by Python, it depends on whether you use Python 2.x or Python 3.x. For Python 2, the binary data (old strings) should be decoded from UTF-8 encoding and the Unicode strings should be processed later. For Python 3, the file should not be open in binary mode and the encoding should be passed also. The result of reading will also be Unicode strings.
0
 
LVL 29

Expert Comment

by:pepr
ID: 38413327
Back to the trees. I am wrong with UTF-8 (I was probably looking at the result of some coversion via editor). Anyway, the principe is the same. You need to work with some encoding. When working with Python 3, csv.reader wants to return strings (i.e. the Unicode strings).
0
 
LVL 26

Assisted Solution

by:arober11
arober11 earned 1000 total points
ID: 38414338
As a stop gap you could always alter string enclosure character from a " to something a bit less common in your control.ctl e.g.

fields terminated by "," optionally enclosed by "~"

Then tweak the file to match e.g.

sed -i -e 's/^"/~/'  -e  's/"$/~/' -e 's/","/~,~/g'  filename.csv

Open in new window

0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38414490
@arober11:

Thanks for the information. But ' ~ ' is also having in the same field ...

What I understand that you are asking to make '~' in first and last for all the fields.

But what if the same field values will have this '~' on them ?
0
 
LVL 26

Expert Comment

by:arober11
ID: 38414721
You'll hit the same issue / error, ideally you want whatever is creating the CSV files to filter the fields to ensure they don't contain your field enclosing characters.
0
 
LVL 29

Expert Comment

by:pepr
ID: 38416837
I do not know what tool you want to process the csv file. However Python consumes it quite happily. To remove the non-ASCII quotes from rows that you get through reading the test.csv, you can replace the known unwanted characters like this:
import csv

with open('test.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
    
       # Replace the non-ASCII quotes in the row[13].
       s = row[13].replace('\x93', '')  # the opening quote
       row[13] = s.replace('\x94', '')  # the closing quote
       
       # Process the row...
       print row[13]

Open in new window

0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38418186
I just wanted to add one little logic for this as to add "|" in opening and closing field or like
Fields terminated by '|' symbol.

Then I can check if that works for me..
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38420096
I am able to terminate fields with '|' with the below code:

import csv
csvFile = open(r'C:\Users\sloba\Desktop\MY_DIR\test.csv').read().replace('","', '|')
csvReader = csv.reader(csvFile, delimiter='|')
for row in csvReader:
print row

Open in new window


Here I am able to print the rows but I want to save the file with a new name one the same path.

Any suggestion ..
0
 
LVL 29

Accepted Solution

by:
pepr earned 1000 total points
ID: 38421181
Your csvFile variable now contains the content of the original test.csv. This way you can just open new file for writing, and write the content.
import os

# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = fname + 'X' + extension           # X in front of the extension dot

# Just to see it.
print fnameIn
print fnameOut

# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
f.close()

# Open the output file, write the content, close.
f  = open(fnameOut, 'w')
f.write(content)
f.close()

Open in new window

0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38421460
I will test this and update it.
0
 
LVL 29

Expert Comment

by:pepr
ID: 38421488
But the replacement is probably not going to help you with the problem of non-ASCII quotes anyway.
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38422450
I want to make that to save as .txt file
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38422517
I think just need few changes on this code which will work for me :

import os

# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = fname + 'X' + extension           # X in front of the extension dot

# Just to see it.
print fnameIn
print fnameOut

# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
f.close()

# Open the output file, write the content, close.
f  = open(fnameOut, 'w')
f.write(content)
f.close()

Open in new window


To add a logic to make it a text file in place of .csv and then remove ' " ' double quotes stating and end of the entire rows.
0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38422569
I am able to convert it into .txt file from this below code:

import os
 
# Process the filename separately.
fnameIn = r'C:\Users\sloba\Desktop\MY_DIR\test.csv'
fname, extension = os.path.splitext(fnameIn)
fnameOut = r'C:\Users\sloba\Desktop\MY_DIR\test.txt'          # export into .txt file

# Just to see it.
print fnameIn
print fnameOut
 
# Open the input file, read the content and modify it, close the file.
f = open(fnameIn)
content = f.read().replace('","', '|')
content = content.replace('\x93', '')  # the opening quote
content = content.replace('\x94', '')  # the closing quote
f.close()
 
# Open the output file, write the content, close.
f  = open(fnameOut, 'w')
f.write(content)
f.close()

Open in new window

0
 
LVL 16

Author Comment

by:Swadhin Ray
ID: 38422585
Only change is now required for removing the
content = content.replace('\x93', '')  # the opening quote
content = content.replace('\x94', '')  # the closing quote

for example in my .txt file I am getting like :

from
"Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM"

Open in new window



to

Assigned To|Auto Escalation Indicator|CFN|Client Id|Create Date|Delivery Status|Event Code|Event Sequence Id|Event Type|Last Modified By|Log String Size|Modified Date|Notification Id|Notification Feedback Log|Seq No|Short Description|Submitter|Ticket|Ticket Role|FBResponseLog ModifiedDtMinM|FBResponseLog ModifiedDtMaxM

Open in new window

0
 
LVL 29

Expert Comment

by:pepr
ID: 38424221
But the replacement of "," is quite unrelated to .replace('\x93', ''). The .replace() is used to get rid of the non-ASCII characters.

The first double quote is not such, and there also is the final double quote that should be removed. This way you only change the delimiter and quoting of the original csv file thus making the process of reading the csv file more complicated. Actually, there were no problems with reading the original csv file. There probably was only the problem with the 14th elemet containing non-ASCII characters (row[13]). After a lot of extra work, you will get the same as in http:Q_27869822.html#a38416837
0
 
LVL 16

Author Closing Comment

by:Swadhin Ray
ID: 38437323
Shared the points as everyone has contributed to get an idea on my issue .
0

Featured Post

Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…
Suggested Courses

621 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question