Solved

How can I get rid of characters based on their first byte in UTF-8?

Posted on 2013-06-12
Last Modified: 2013-06-12
Hi

How can I get rid of characters based on their first byte in UTF-8?
I want to get rid of all the control characters, or all characters whose first byte is '\xc2'.
Many control characters are two-byte characters such as '\xc2\x8a', '\xc2\x90', etc.

Thanks
Jamie
Question by:jamie_lynn
2 Comments
 

Accepted Solution

by:
clockwatcher earned 2000 total points
ID: 39243405
Not all of the utf-8 characters beginning with c2 are control characters (http://www.utf8-chartable.de/), but regardless...
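As a quick check of that claim, here is a minimal sketch. It assumes Python 3 (the code in this thread is Python 2), and the variable names are only illustrative. It collects every code point whose UTF-8 encoding starts with the byte c2 and splits them by Unicode general category, where 'Cc' marks a control character.

```python
import unicodedata

# Code points U+0080-U+07FF are exactly those with two-byte UTF-8 encodings.
# Keep the ones whose first encoded byte is 0xC2.
c2_chars = [chr(cp) for cp in range(0x80, 0x800)
            if chr(cp).encode('utf-8')[0] == 0xC2]

# Split them by Unicode general category: 'Cc' means control character.
controls = [c for c in c2_chars if unicodedata.category(c) == 'Cc']
others = [c for c in c2_chars if unicodedata.category(c) != 'Cc']

print(len(c2_chars), len(controls), len(others))  # 64 32 32
```

Only U+0080 through U+009F are controls. The other half (U+00A0 through U+00BF) holds printable characters such as the pound sign and the copyright sign, so stripping everything that starts with c2 would remove those too.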

Decode your utf-8 string to unicode, replace the character range you want to get rid of with a regular expression and then re-encode back into utf-8 (if that's the encoding you're after).

import re  # needed for re.sub

# Python 2: a plain str holds raw bytes, so decode it before matching
sample_utf8_string = '\xc2\x8ahello there\xc2\x90'
unicode_string = sample_utf8_string.decode('utf-8')  # decode it into unicode
updated_unicode_string = re.sub(u'[\x7f-\x9f]', '', unicode_string)  # remove control characters
final_utf8_string = updated_unicode_string.encode('utf-8')  # re-encode as utf-8
print final_utf8_string



In the code above, I only replaced the true control characters ([\x7f-\x9f]). If you really want everything that starts with c2 gone, change the regular expression replace to the following (the range starts at \x80 because \x7f is a single byte in utf-8 and never starts with c2):

  updated_unicode_string = re.sub(u'[\x80-\xbf]','',unicode_string)  # remove all c2 utf-8 chars

And if you want the one-liner version:

   re.sub(u'[\x7f-\x9f]','','\xc2\x8ahello there\xc2\x90'.decode('utf-8')).encode('utf-8')
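
For anyone on Python 3, where byte strings and text are separate types, the same decode, substitute, re-encode idea might look like the sketch below. The function and constant names are hypothetical, not from the answer above. The second function is an alternative that matches the raw bytes directly; it relies on the input being valid UTF-8, where the bytes 7f and c2 never appear as continuation bytes.

```python
import re

# Same range as the answer above: DEL (U+007F) plus the C1 controls.
CONTROLS = re.compile('[\x7f-\x9f]')
# Every character whose UTF-8 encoding starts with 0xC2 (U+0080-U+00BF).
ALL_C2 = re.compile('[\x80-\xbf]')
# Byte-level equivalent of CONTROLS: a lone 0x7F, or 0xC2 plus 0x80-0x9F.
CONTROLS_BYTES = re.compile(rb'\x7f|\xc2[\x80-\x9f]')


def strip_chars(data: bytes, pattern=CONTROLS) -> bytes:
    """Decode UTF-8, drop the matching characters, and re-encode."""
    return pattern.sub('', data.decode('utf-8')).encode('utf-8')


def strip_controls_bytes(data: bytes) -> bytes:
    """Drop the same characters without decoding (assumes valid UTF-8)."""
    return CONTROLS_BYTES.sub(b'', data)


sample = b'\xc2\x8ahello there\xc2\x90'
print(strip_chars(sample))           # b'hello there'
print(strip_controls_bytes(sample))  # b'hello there'
```

Passing ALL_C2 as the pattern removes the printable c2 characters as well, for example strip_chars(b'\xc2\xa9hi', ALL_C2) returns b'hi'.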

Author Closing Comment

by:jamie_lynn
ID: 39243552
Works great!
Thanks!