Solved

How can I get rid of characters based on their first byte in UTF-8?

Posted on 2013-06-12
Last Modified: 2013-06-12
Hi

How can I get rid of characters based on their first byte in UTF-8?
I want to get rid of all the control characters, or all characters whose first byte is '\xc2'.
Many control characters are two-byte sequences in UTF-8, like '\xc2\x8a', '\xc2\x90', etc.

Thanks
Jamie
Question by:jamie_lynn

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39243405
Not all of the UTF-8 characters beginning with c2 are control characters (http://www.utf8-chartable.de/), but regardless...

Decode your UTF-8 string to unicode, use a regular expression to remove the character range you want to get rid of, and then re-encode the result back into UTF-8 (if that's the encoding you're after).

import re

sample_utf8_string = '\xc2\x8ahello there\xc2\x90'
unicode_string = sample_utf8_string.decode('utf-8')  # decode the UTF-8 bytes into unicode
updated_unicode_string = re.sub(u'[\x7f-\x9f]', '', unicode_string)  # remove DEL and the C1 control characters
final_utf8_string = updated_unicode_string.encode('utf-8')  # re-encode as UTF-8
print final_utf8_string
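
The snippet above is written for Python 2 (print statement, decode() on a plain str). As a rough sketch of the same decode, substitute, re-encode steps on Python 3, assuming the input arrives as a bytes object, something like the following should work. The function name strip_control_chars is made up for illustration and is not part of the original answer.

```python
import re

# DEL (U+007F) plus the C1 control block (U+0080 to U+009F),
# the same range the answer's pattern uses.
CONTROL_CHARS = re.compile('[\x7f-\x9f]')


def strip_control_chars(utf8_bytes):
    """Hypothetical helper: decode UTF-8 bytes, drop DEL and C1 controls, re-encode."""
    text = utf8_bytes.decode('utf-8')      # bytes -> str
    cleaned = CONTROL_CHARS.sub('', text)  # remove the unwanted code points
    return cleaned.encode('utf-8')         # str -> bytes


sample = b'\xc2\x8ahello there\xc2\x90'
print(strip_control_chars(sample))  # b'hello there'
```

Characters such as the copyright sign (b'\xc2\xa9') also start with c2 but fall outside this range, so they are left alone.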


In the code above, I only replaced the true control characters ([\x7f-\x9f]), that is, DEL (U+007F, a single byte in UTF-8) and the C1 controls (U+0080 to U+009F).  If you really want everything that starts with a c2 gone, change the regular expression replace to:

  updated_unicode_string = re.sub(u'[\x7f-\xbf]','',unicode_string)  # remove DEL and all c2 utf-8 chars (U+0080 to U+00BF)

And if you want the one-liner version:

   re.sub(u'[\x7f-\x9f]','','\xc2\x8ahello there\xc2\x90'.decode('utf-8')).encode('utf-8')
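
If you want to check for yourself which code points end up with c2 as their first byte, and how many of them are actually control characters, a small Python 3 sketch along these lines may help. It only uses the standard library; the variable names are just for illustration.

```python
import unicodedata

# Every code point in the two-byte UTF-8 range whose encoding starts with 0xC2.
c2_code_points = [
    cp for cp in range(0x80, 0x800)
    if chr(cp).encode('utf-8')[0] == 0xC2
]

# Of those, the ones Unicode classifies as control characters (category "Cc").
c2_controls = [cp for cp in c2_code_points
               if unicodedata.category(chr(cp)) == 'Cc']

print(hex(c2_code_points[0]), hex(c2_code_points[-1]))  # 0x80 0xbf
print(len(c2_code_points), len(c2_controls))            # 64 32
print(chr(0x7f).encode('utf-8'))                        # b'\x7f' (DEL is a single byte, not c2)
```

So only the first half of the c2 block (U+0080 to U+009F) is control characters; the rest (U+00A0 to U+00BF) are printable symbols such as the copyright sign, which is why the narrower range is usually the safer choice.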

Author Closing Comment

by:jamie_lynn
ID: 39243552
Works great!
Thanks!