Solved

How can I get rid of characters based on their first byte in UTF-8?

Posted on 2013-06-12
Last Modified: 2013-06-12
Hi

How can I get rid of characters based on their first byte in UTF-8?
I want to get rid of all the control characters, or of every character whose first byte is '\xc2'.
Many control characters are two-byte sequences such as '\xc2\x8a', '\xc2\x90', etc.

Thanks
Jamie
Question by:jamie_lynn
Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39243405
Not all of the UTF-8 characters beginning with c2 are control characters (see http://www.utf8-chartable.de/): U+0080 through U+009F are controls, while U+00A0 through U+00BF are printable characters such as the no-break space and ©. But regardless...

Decode your UTF-8 string to unicode, remove the character range you want to get rid of with a regular expression, and then re-encode back into UTF-8 (if that's the encoding you're after).

import re

sample_utf8_string = '\xc2\x8ahello there\xc2\x90'  # Python 2 byte string
unicode_string = sample_utf8_string.decode('utf-8')  # decode it into unicode
updated_unicode_string = re.sub(u'[\x7f-\x9f]', '', unicode_string)  # remove control characters
final_utf8_string = updated_unicode_string.encode('utf-8')  # re-encode as UTF-8
print final_utf8_string


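In the code above, I only replaced the true control characters in that range ([\x7f-\x9f]), that is, DEL plus the C1 controls U+0080 through U+009F. The ASCII control characters below \x20 are single-byte and are left alone. If you really want everything that starts with a c2 gone, change the regular expression replace to:

  updated_unicode_string = re.sub(u'[\x7f-\xbf]', '', unicode_string)  # remove DEL and all c2-led UTF-8 chars (U+0080-U+00BF)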
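If the goal is literally to drop every sequence whose lead byte is c2, one alternative is to match on the raw bytes without decoding at all. The following is only a sketch, written in Python 3 syntax (the answer's own code is Python 2). The name `strip_c2_sequences` is made up for illustration, and the approach assumes the input is valid UTF-8, where 0xC2 can only appear as the lead byte of a two-byte sequence. Unlike the regular expression above, it leaves DEL (0x7f) untouched.

```python
import re

# A UTF-8 sequence with lead byte 0xC2 is always two bytes long, and its
# second byte is a continuation byte in the range 0x80-0xBF.
C2_SEQUENCE = re.compile(rb'\xc2[\x80-\xbf]')

def strip_c2_sequences(data: bytes) -> bytes:
    """Remove every two-byte sequence starting with 0xC2 (hypothetical helper)."""
    return C2_SEQUENCE.sub(b'', data)

sample = b'\xc2\x8ahello there\xc2\x90'
print(strip_c2_sequences(sample))  # b'hello there'
```

Characters with other lead bytes, such as é (`\xc3\xa9`), pass through unchanged.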

And if you want the one-liner version:

   re.sub(u'[\x7f-\x9f]','','\xc2\x8ahello there\xc2\x90'.decode('utf-8')).encode('utf-8')
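The snippets above are Python 2, where a plain string literal is a byte string with a `.decode()` method. In Python 3 the same decode, substitute, re-encode steps might look like the sketch below. The function name `strip_controls` and the constant `CONTROL_CHARS` are hypothetical, not part of the original answer.

```python
import re

# DEL (U+007F) plus the C1 control block (U+0080-U+009F), as in the answer.
CONTROL_CHARS = re.compile('[\x7f-\x9f]')

def strip_controls(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8, drop DEL and C1 controls, re-encode (hypothetical helper)."""
    text = utf8_bytes.decode('utf-8')       # bytes -> str
    cleaned = CONTROL_CHARS.sub('', text)   # remove the control characters
    return cleaned.encode('utf-8')          # str -> bytes

print(strip_controls(b'\xc2\x8ahello there\xc2\x90'))  # b'hello there'
```

Printable c2-led characters such as © (`\xc2\xa9`) are kept by this version, since they fall outside the U+007F to U+009F range.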

Author Closing Comment

by:jamie_lynn
ID: 39243552
Works great!
Thanks!