Solved

How can I get rid of characters based on their first byte in UTF-8?

Posted on 2013-06-12
Last Modified: 2013-06-12
Hi

How can I get rid of characters based on their first byte in UTF-8?
I want to get rid of all the control characters, or all characters whose first byte is '\xc2'.
Many control characters are two-byte sequences like '\xc2\x8a', '\xc2\x90', etc.

Thanks
Jamie
Question by:jamie_lynn

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 39243405
Not all of the UTF-8 characters beginning with c2 are control characters (see http://www.utf8-chartable.de/); for example, '\xc2\xa9' is the copyright sign. But regardless...
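As a quick check of that claim, here is a minimal sketch, assuming Python 3 and the standard unicodedata module (neither appears in the original thread). It lists the code points whose UTF-8 encoding starts with c2 and counts how many are classified as control characters (general category "Cc"):

```python
import unicodedata

# Every code point from U+0080 to U+00BF encodes to two bytes starting with 0xc2.
c2_chars = [chr(cp) for cp in range(0x80, 0xC0)]
assert all(ch.encode('utf-8')[0] == 0xC2 for ch in c2_chars)

# The Unicode general category "Cc" marks control characters.
controls = [ch for ch in c2_chars if unicodedata.category(ch) == 'Cc']
non_controls = [ch for ch in c2_chars if unicodedata.category(ch) != 'Cc']

print(len(controls), len(non_controls))  # 32 32
```

Only U+0080 through U+009F are controls; the other half of the c2 range holds printable characters such as the copyright sign.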

Decode your UTF-8 string to unicode, use a regular expression to remove the character range you want to get rid of, and then re-encode the result as UTF-8 (if that's the encoding you're after).

import re

sample_utf8_string = '\xc2\x8ahello there\xc2\x90'
unicode_string = sample_utf8_string.decode('utf-8')  # decode the UTF-8 bytes into unicode
updated_unicode_string = re.sub(u'[\x7f-\x9f]', '', unicode_string)  # remove DEL and the C1 control characters
final_utf8_string = updated_unicode_string.encode('utf-8')  # re-encode as UTF-8
print final_utf8_string


In the code above, I only removed the true control characters ([\x7f-\x9f]).  If you really want everything that starts with c2 gone, change the regular expression substitution to:

  updated_unicode_string = re.sub(u'[\x7f-\xbf]', '', unicode_string)  # remove DEL plus every character with a c2 lead byte (U+0080-U+00BF)

And if you want the one-liner version:

   re.sub(u'[\x7f-\x9f]','','\xc2\x8ahello there\xc2\x90'.decode('utf-8')).encode('utf-8')
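The snippets above are Python 2. For Python 3, a rough equivalent might look like the sketch below, assuming the input arrives as a bytes object. The names strip_chars, CONTROL_CHARS and C2_LEAD_BYTE_CHARS are made up for illustration and are not part of the original answer:

```python
import re

# DEL (U+007F) plus the C1 control characters (U+0080-U+009F).
CONTROL_CHARS = re.compile('[\x7f-\x9f]')
# Every character whose UTF-8 encoding starts with the byte 0xc2 (U+0080-U+00BF).
C2_LEAD_BYTE_CHARS = re.compile('[\x80-\xbf]')


def strip_chars(utf8_bytes, pattern=CONTROL_CHARS):
    """Decode UTF-8 bytes, delete characters matching pattern, re-encode."""
    text = utf8_bytes.decode('utf-8')
    return pattern.sub('', text).encode('utf-8')


sample = b'\xc2\x8ahello there\xc2\x90'
print(strip_chars(sample))  # b'hello there'

copyright_sample = 'a\u00a9b'.encode('utf-8')  # b'a\xc2\xa9b'
print(strip_chars(copyright_sample))  # b'a\xc2\xa9b' (the copyright sign is kept)
print(strip_chars(copyright_sample, C2_LEAD_BYTE_CHARS))  # b'ab'
```

Matching on decoded characters rather than raw bytes avoids accidentally cutting a multi-byte sequence in half.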

Author Closing Comment

by:jamie_lynn
ID: 39243552
Works great!
Thanks!