How to remove unicode characters from csv file?

How to remove the unicode characters from csv file using Python 3?
yogesh bansalAsked:
MishaProgrammerCommented:
0
gelonidaCommented:
How is the CSV file encoded (utf8?)?
What exactly do you mean by removing unicode characters?
- removing all unicode characters
- replacing them with a '?' or another placeholder
0
Shaun VermaakTechnical Specialist IVCommented:
@Misha: Did you read the answer in that link?

@OP: Please post a sample file
0
yogesh bansalAuthor Commented:
Hi,
I had gone through that link before, but it did not solve my problem.

I have a csv file like this. It is a big csv file with more patterns of unicode characters.

messi \u0632\u064a\u0646 \u0645 \u0632\u0639 \u0647\u0646  \u0645\u064a\u0633\u064a \u0641\u064a \u0643\u0631\u062a\u064a\u0646 \u0633\u0628\u0642\u062a\u064a\u0646 \u0645 \u0645\u0632\u0639 \u0634\u064a\u062a \u0643\u0646\u062a 4 \u0645 2
@sarkhat7ajar \u0632\u0648\u064a\u064a\u064a\u0646 \u0647 \ud83d\ude02\ud83d\ude02
\uc774\uc81c \ucd95\uad6c \uc880 \ubcfc\uae4c \ud558\uace0 \ud2f0\ube44 \ucf30\ub354\ub2c8 \uba54\uc2dc \uace8;
UGHHHHHHHHHHH

I want to remove all the unicode characters from this csv file.

I tried in Python 2.7 as well as in Python 3.5

Code for Python 2.7:-
import re
myre = re.compile('\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]|[\u2600-\u26FF\u2700-\u27BF]')

def clean(inputFile,outputFile):
    with open(inputFile, 'rb') as n5,open(outputFile, 'w+') as n6:
        for line in n5:
            line = myre.sub('', line.decode('ascii'))
            n6.write(line)

clean("test.csv","n8.csv")

with this code, I am getting a blank csv file. I think I am making some mistake in the regular expressions.

Code in Python 3.5:
import csv
import re

re_pattern = re.compile(r"\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]|[\u2600-\u26FF\u2700-\u27BF]", re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(patt, unicode(value, 'utf8')).encode('utf8')

with open('test.csv', 'rU') as ifile, open('outest1.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)

This code also does not work for this kind of csv.
0
yogesh bansalAuthor Commented:
decode does not work in Python 3.5
0
yogesh bansalAuthor Commented:
@gelonida I hope I have answered your question.
0
gelonidaCommented:
Hi Yogesh thanks for your post.

I'm still not sure about the file encoding and without knowing the encoding it is difficult to find the solution.

Could you please run the following script and post the output? It will help to understand the file encoding and the type of csv file:
with open('input.csv', 'rb') as fin:
    for i in range(20):
        byteline = fin.read(30)
        print(repr(byteline))

Another question that I have:
Do you want to remove the unicode characters because you're not interested in them, or just because they're not ASCII and can cause trouble?

The reason I'm asking is that there is code that can convert at least some unicode characters into ASCII characters.

For example, é and è would just be converted to an e without an accent.

I do this when I have unicode first and last names and want to convert them into an ASCII email address.
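
A minimal sketch of that kind of conversion, assuming Python 3 and the standard unicodedata module. The function name strip_accents is made up for this illustration, and characters with no ASCII base letter are simply dropped:

```python
import unicodedata


def strip_accents(text):
    """Map accented letters to their unaccented ASCII base letter where possible."""
    # NFKD normalization splits a letter like 'é' into 'e' plus a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining accents
    # (and any other character that has no ASCII form)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("é and è"))
```

Note that this only helps with real unicode characters. Text that contains escape strings instead of the characters themselves would have to be decoded first.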
0
yogesh bansalAuthor Commented:
Hi,

I have attached the csv file I am working on.
text.csv
0
yogesh bansalAuthor Commented:
I want to remove all the unicode characters from this csv.
0
yogesh bansalAuthor Commented:
Hi,

I don't have occurrences such as é and è in the csv file. I just want to remove unicode characters like \ud863. My intention is not to convert to ASCII.
0
yogesh bansalAuthor Commented:
Hi,

I ran your code on the csv file I have also uploaded above.
 This is the output.

'\\u201c@Dorytbh: why did god ma'
'ke me a horny slut\\u201d @Sadi'
'e_marciano @EmmyzAEatchuu  haa'
'aaaaaaallpppppp\nBesides hollan'
'd, portugal can beat any team '
'that has played before without'
' a question.\nMessi will never '
'have a great game because team'
's focus on stopping him and he'
' always plays in the middle\n@s'
'arkhat7ajar \\u0632\\u0648\\u064a'
'\\u064a\\u064a\\u0646 \\u0648\\u062'
'7\\u0644\\u0644\\u0647 \\ud83d\\ude'
'02\\ud83d\\ude02\nFlares &lt;3\n@e'
'lliebrileyy @vunirinio leave n'
'ow pls x http:\\/\\/t.co\\/IiXHh6'
'6E17\n@_Mrbr1ghtside ily too La'
'uren\nMy talents include gettin'
'g ill on a regular basis, proc'
'rastinating and feeling sorry '
0
gelonidaCommented:
Please execute the code snippet that I suggested:

My suggested approach is very robust towards all these issues, should work with python2 and python3, and creates pure ASCII output that you can paste without losing information.

Your posted file doesn't look at all like a CSV file and I'm not sure that it really contains Unicode.
It looks more as if it is a pure ASCII file with escape strings for Unicode.

If you post the output of my suggested code snippet I can confirm this theory.
0
gelonidaCommented:
Ah thanks, our messages crossed.

Will look at the result.
0
yogesh bansalAuthor Commented:
Hi,
"Your posted file doesn't look at all like a CSV file and I'm not sure that it really contains Unicode.
It looks more as if it is a pure ASCII file with escape strings for Unicode"

It is a csv file I downloaded. I am not sure if these are unicode characters or not. You may be right in saying it is a pure ASCII file with escape strings for Unicode.
0
gelonidaCommented:
This seems to be a pure ASCII file.

It is not really a CSV file; at least it doesn't have any columns.
It looks just like a text file in which everything that is not ASCII has been escaped.
In fact even '/' characters seem to be escaped.

So if I understand you correctly, you want to convert this text (not CSV) file into a text file that has all \uxxxx sequences removed?

Well, the file might have a .csv suffix, but it is not a normal CSV file.
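
For anyone who wants to verify the "pure ASCII" part on their own file, here is a small sketch, assuming Python 3. The function name is_pure_ascii is only an illustration and not part of any library:

```python
def is_pure_ascii(path):
    """Return True if every byte in the file is in the ASCII range 0-127."""
    with open(path, 'rb') as fin:
        # Read in chunks so that a big file does not have to fit in memory
        for chunk in iter(lambda: fin.read(65536), b''):
            # Iterating over bytes yields integers in Python 3
            if any(byte > 127 for byte in chunk):
                return False
    return True
```

Calling is_pure_ascii('input.csv') would return True for a file that only contains escape strings, and False as soon as a real non-ASCII byte (for example from UTF-8 encoded text) shows up.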
0
yogesh bansalAuthor Commented:
Hi,

You are absolutely right.
0
yogesh bansalAuthor Commented:
Sorry, I didn't know the difference between ASCII and unicode characters. I was looking in the wrong direction. I don't have much knowledge about this encoding stuff.
0
gelonidaCommented:
Well, in fact this file seems to be HTML with unicode escapes.

For example, I found the following in your file:
&lt;3

which is the HTML escape for
<3

Anyway, if you just want to get rid of the unicode escapes, you might try the following script.

It should work for python2 and python3.

import re

uni_escape = re.compile(r'\\u[0-9a-f]{4}')

with open("input.csv", "r") as fin:
    with open("output.txt", "w") as fout:
        for line in fin:
            unesc_line = re.sub(uni_escape, '', line)
            fout.write(unesc_line)

If you also want to unescape all potential HTML characters, then a little more work has to be done.
0
yogesh bansalAuthor Commented:
Thanks a lot. It worked. I was trying very hard to accomplish this since morning. Again, thanks a ton.
0
yogesh bansalAuthor Commented:
r'\\u[0-9a-f]{4}'

Here, we are removing words with 0 to 9 or a to f. I didn't get why we are using \\ and {4}
0
gelonidaCommented:
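We are removing any '\' character followed by a 'u', followed by exactly four characters that must each be 0-9 or a-f. In the pattern, \\ stands for one literal backslash (a single \ has a special meaning in regular expressions, so it has to be doubled), and {4} means that the preceding [0-9a-f] must occur exactly four times.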
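
To see the pattern at work on a single line, here is a small sketch. The sample string is a shortened line modelled on the output posted earlier in this thread, not an exact copy from the file:

```python
import re

# A literal backslash, then 'u', then exactly four lowercase hex digits
uni_escape = re.compile(r'\\u[0-9a-f]{4}')

# Raw string, so the backslashes are real characters, just like in the file
sample = r'@sarkhat7ajar \u0632\u0648\u064a \ud83d\ude02 Flares'

cleaned = uni_escape.sub('', sample)
print(repr(cleaned))
```

The spaces that surrounded the escapes stay in the line, so the result can contain runs of several spaces. Escapes written with uppercase hex digits (A-F) would not be matched by this pattern.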
1
gelonidaCommented:
The following python3-only script might do an even better job for your context, as it also handles the HTML escaping.

#!/usr/bin/env python3

import html
import re

uni_escape = re.compile(r'\\u[0-9a-f]{4}')

with open("input.csv", "r") as fin, open("output.txt", "w") as fout:
    for line in fin:
        # First decode the HTML entities. This might result in unicode characters,
        # so we remove them with the two statements after.
        # (html.unescape exists since Python 3.4; the older HTMLParser().unescape
        # was removed in Python 3.9.)
        unesc_html_line = html.unescape(line)
        print(repr(unesc_html_line))

        # Now create an ASCII byte string (unicode characters will be skipped)
        encoded_line = bytes(unesc_html_line, 'ascii', errors='ignore')
        print(repr(encoded_line))

        # and convert it back to ASCII text
        decoded_ascii_line = encoded_line.decode('ascii')
        print(repr(decoded_ascii_line))

        # Now remove the ASCII-escaped unicode characters
        unesc_line = re.sub(uni_escape, '', decoded_ascii_line)
        print(repr(unesc_line))
        print()

        # Now write to your file
        fout.write(unesc_line)

I added print statements for visualisation and created lots of intermediate variables to make the script easier to follow.
Just remove the prints for your final code.

This script first gets rid of the HTML escapes.
Unfortunately, unescaping the HTML might introduce unicode characters. I'm not sure this is the case in your file, but if the file contained, for example, the sequence &#169;, you would get the copyright character, which is not ASCII.
So you have to get rid of it, as you said you don't want to have any unicode characters.
This can be done by encoding to an ASCII byte string (skipping errors) and decoding back to a str.

At the end you can handle the ASCII-escaped unicode characters that you want to get rid of.
1

gelonidaCommented:
As I mentioned at the beginning of this thread, it really depends on how many foreign-language words are in these files.

If there are many words like Vis-à-vis or Ångström, the text might be difficult to read once you remove all of these characters. It could be better to decode the unicode escapes and, if they are 'just' accented letters, to get rid of the accents.

For mostly English text this might not be necessary, except that the participants' names should remain understandable.
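
If decoding turns out to be the better route, the escapes first have to be turned back into real characters. A minimal sketch, assuming Python 3 and a file that uses \uXXXX escapes with surrogate pairs for emoji, as in the posted sample. The name decode_escapes is made up for this illustration:

```python
import re

uni_escape = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_escapes(text):
    """Turn \\uXXXX escape sequences into the characters they stand for."""
    # Replace each escape by the character with that code point.
    # For emoji this first yields two separate surrogate characters.
    decoded = uni_escape.sub(lambda m: chr(int(m.group(1), 16)), text)
    # A UTF-16 round trip joins each surrogate pair into one real character.
    # An unpaired surrogate in the input would raise UnicodeDecodeError here.
    return decoded.encode('utf-16', 'surrogatepass').decode('utf-16')


print(decode_escapes(r'caf\u00e9'))
```

After this step the accented letters are ordinary unicode characters, so the accents can be removed in a separate pass.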
1
aikimarkCommented:
You might want to use pattern(s) that don't include \u0000-\u00ff.
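
One possible reading of this suggestion, sketched under the assumption that escapes in the range \u0000-\u00ff (Latin letters such as é) should stay in the text and only the higher ones should be removed. The name uni_escape_above_ff is hypothetical:

```python
import re

# (?!00) is a negative lookahead: it rejects escapes whose first two hex digits
# are 00, i.e. the range \u0000-\u00ff, so those escapes are left untouched
uni_escape_above_ff = re.compile(r'\\u(?!00)[0-9a-f]{4}')

sample = r'caf\u00e9 \u0632\u064a messi'
print(uni_escape_above_ff.sub('', sample))
```

The escape for é survives in the output, while the two Arabic escapes are removed.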
1