Solved

How can I get rid of character based on their first byte in utf8?

Posted on 2013-06-12
2
327 Views
Last Modified: 2013-06-12
Hi

How can I get rid of character based on their first byte in utf8?
I want to get rid of all the control characters, or characters with first byte '\xc2'
Many control characters are double byte characters like '\xc2\x8a' , '\xc2\x90', etc

Thanks
Jamie
0
Comment
Question by:jamie_lynn
2 Comments
 
LVL 25

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39243405
Not all of the utf-8 characters beginning with c2 are control characters (http://www.utf8-chartable.de/), but regardless...

Decode your utf-8 string to unicode, replace the character range you want to get rid of with a regular expression and then re-encode back into utf-8 (if that's the encoding you're after).

sample_utf8_string = '\xc2\x8ahello there\xc2\x90'
unicode_string = sample_utf8_string.decode('utf-8')  # decode it into unicode
updated_unicode_string = re.sub(u'[\x7f-\x9f]','',unicode_string)  # remove control characters
final_utf8_string = updated_unicode_string.encode('utf-8')
print final_utf8_string

Open in new window


In the code above, I only replaced the true control characters ([\x7f-\x9f].  If you really want everything that starts with a c2 gone, change the regular expression replace to:

  updated_unicode_string = re.sub(u'[\x7f-\xbf]','',unicode_string)  # remove all c2 utf-8 chars

And if you want the one-liner version:

   re.sub(u'[\x7f-\x9f]','','\xc2\x8ahello there\xc2\x90'.decode('utf-8')).encode('utf-8')
0
 

Author Closing Comment

by:jamie_lynn
ID: 39243552
Works great!
Thanks!
0

Featured Post

Courses: Start Training Online With Pros, Today

Brush up on the basics or master the advanced techniques required to earn essential industry certifications, with Courses. Enroll in a course and start learning today. Training topics range from Android App Dev to the Xen Virtualization Platform.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Python threading 1 73
Is my Java RTS game window portable to Python? 7 99
SPSS Output Table Formatting 3 91
Ptyhon Dex.txt how to use this regex for this python script? 40 91
Strings in Python are the set of characters that, once defined, cannot be changed by any other method like replace. Even if we use the replace method it still does not modify the original string that we use, but just copies the string and then modif…
Article by: Swadhin
Introduction of Lists in Python: There are six built-in types of sequences. Lists and tuples are the most common one. In this article we will see how to use Lists in python and how we can utilize it while doing our own program. In general we can al…
Learn the basics of strings in Python: declaration, operations, indices, and slicing. Strings are declared with quotations; for example: s = "string": Strings are immutable.: Strings may be concatenated or multiplied using the addition and multiplic…
Learn the basics of if, else, and elif statements in Python 2.7. Use "if" statements to test a specified condition.: The structure of an if statement is as follows: (CODE) Use "else" statements to allow the execution of an alternative, if the …

816 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

8 Experts available now in Live!

Get 1:1 Help Now