dmontgom

Python and a unicode char in my html

Hi,

I get this character in HTML that I am reverse proxying:

ý

It should look like a big dot or a bullet point. This happens because I am converting the code to a string from Beautiful Soup.

So, how do I find and replace ý with what I want?

When I try to find and replace it, I get this error in Python:

if html.find('ý')>=0:
       html = html.replace('ý','.')

SyntaxError: Non-ASCII character '\xef' in file /home/da/workspace/call_tracking/www.preci.com/index.py on line 541, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details (index.py, line 541)







pepr

When a non-ASCII character (one with a code greater than 127) appears as a literal in the source code, you have to declare the encoding of the source file. This is what PEP 263 says. It could be, for example:

# -*- coding: latin-1 -*-

The idea is to tell Python which encoding your editor uses when it saves the file, so that Python interprets the bytes behind the displayed character (the glyph) the same way.

In your case, you probably have a different intention: you want to replace one specific byte (no matter how it is displayed) with a dot. For that, you can simply write the byte the same way it is written in the error message, as an escape sequence. Moreover, you probably do not want the "if html.find(..." test, as it does not optimize your code: if the character is found, it is searched for a second time by replace().

Try this (no need to declare the encoding using the special comment above):

html = html.replace('\xef','.')

dmontgom

ASKER

Hi,

I tried the code...did not seem to find the character.  

 if html.find('\xef')>=0:
      html = html.replace('\xef','ff')

Have a look at this page; you can see what is going on. It's the diamonds with the ? mark.
http://www.precisionautoelectric.adhui.com

The real page is www.precisionautoelectric.com

Well, the page does not explicitly define the encoding of its content. This can be done via a meta element inside the head element (see the snippet below). If it is not defined explicitly, then utf-8 is assumed (if I recall correctly).

However, in the UTF-8 encoding of Unicode text, a character is stored in a single byte only when it is ASCII. Characters with codes greater than 127 are stored as sequences of 2 to 4 bytes (usually up to 3). This means that you cannot replace such a character by searching for and replacing a single byte.
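
To make the multi-byte point concrete, here is a small sketch. It assumes the intended character is the bullet U+2022, which the thread does not confirm; the variable names are made up for illustration. It is written so that it runs on both Python 2.6+ and Python 3.

```python
# Illustration only: how a bullet (U+2022, an assumed example) is stored in UTF-8.
bullet = u'\u2022'
encoded = bullet.encode('utf-8')
print(repr(encoded))   # three bytes: E2 80 A2
print(len(encoded))

# Replacing a single byte of the sequence leaves invalid UTF-8 behind.
damaged = encoded.replace(b'\xe2', b'.')
try:
    damaged.decode('utf-8')
    damaged_is_valid = True
except UnicodeDecodeError:
    damaged_is_valid = False
print(damaged_is_valid)

# Replacing on the decoded (unicode) text is safe, because the whole
# character is matched rather than one of its bytes.
text = encoded.decode('utf-8').replace(bullet, u'*')
print(repr(text))
```

The byte-level replace corrupts the sequence, while the replace on decoded text swaps the whole character.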

Moreover, the \xef can actually come from the so-called BOM (Byte-Order Mark) that may be placed at the beginning of UTF-8 documents (the full sequence is EF BB BF). This has nothing to do with the textual content.

In other words, you do not want to replace the BOM. Also, you never want to replace a character with '\xff', as that byte is not valid anywhere in a UTF-8 sequence. It could also be the case that the page is not intended to be in UTF-8. Then the encoding must be explicitly stated in the head element.
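
A quick way to check for a BOM and to drop it while decoding might look like the sketch below. The sample document is made up; codecs.BOM_UTF8 and the 'utf-8-sig' codec are part of the standard library.

```python
import codecs

# Made-up sample: a UTF-8 document that starts with a BOM.
raw = codecs.BOM_UTF8 + u'<html>\u2022</html>'.encode('utf-8')

# codecs.BOM_UTF8 is the byte sequence EF BB BF.
starts_with_bom = raw.startswith(codecs.BOM_UTF8)
print(starts_with_bom)

# Plain 'utf-8' keeps the BOM as the character U+FEFF at the start of the text.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if there is one, and otherwise
# behaves like 'utf-8'.
without_bom = raw.decode('utf-8-sig')
print(repr(without_bom))
```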
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />


Hi,

I just added the line below before the <head> tag. Still not working. You can see the update on the proxy site.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Any ideas?

Thanks

The meta tag with the charset only explicitly states what was assumed earlier. Still, you cannot simply replace one byte with another if the value of the byte is greater than 127 (i.e. any byte that causes the error quoted in the original question). The problem is unrelated to whether the page contains the meta tag with the charset or not.

Probably, the correct approach is to convert the string representing the content (the html variable) to unicode, do the desired replacement, and then write the unicode string to the target file using the utf-8 encoding.

Thanks,

What is the code to convert the string?

html = html.encode() gave me this error...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x95 in position 21737: unexpected code byte

I will not understand this right away.  Please show the code.

Thanks

The .encode() method is for the opposite conversion (when you have a unicode string and want to get a non-unicode string in the given encoding).

To convert a non-unicode string into a unicode string, use the built-in function unicode() http://docs.python.org/library/functions.html#unicode

html = unicode(html, 'utf-8')

If you want to write unicode texts into a file in utf-8, use the codecs module like this (http://docs.python.org/library/codecs.html#codecs.open):

import codecs
# 'utf-8-sig' writes a BOM at the start of the file; use 'utf-8' to omit it.
f = codecs.open('output.html', 'w', 'utf-8-sig')
f.write(myUnicodeTextVariable)
f.close()
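
The error quoted above (byte 0x95 rejected by the utf8 codec) can be reproduced with a made-up sample. The sketch below also shows the two directions: decode goes from bytes to unicode, encode goes from unicode to bytes. The choice of cp1252 is only an assumption about the page, and the sample bytes are invented. In Python 2, unicode(raw, 'cp1252') is equivalent to raw.decode('cp1252').

```python
# Made-up sample bytes containing 0x95, as they might arrive from the server.
raw = b'Item \x95 one'

# 0x95 cannot start a UTF-8 sequence, so decoding as UTF-8 fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)

# bytes -> unicode: decode with the encoding the bytes are really in
# (cp1252 is an assumption here).
text = raw.decode('cp1252')

# unicode -> bytes: encode with the encoding you want to write or serve.
out = text.encode('utf-8')
print(repr(out))
```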
Hi,

This is not working.....I just dont get it.

1) I use
 f = urllib2.urlopen('http://www.precisionautoelectric.com')
 html = f.read()

2) I find and replace the strings I need replaced.

3) I then use web.py to print out the html.

So, given that the fetch is simple, how would you modify the html? Also, doesn't the fetch keep the utf encoding?

Thanks

OK, I fetched the page myself. The truth is that it does not define its own encoding and it is not in utf-8 (bad page). It is stored in some 8-bit encoding. web.py probably needs a correctly encoded page. Try the snippet below. It replaces the 'dot' (byte 0x95) with an asterisk and then converts the document content to unicode, assuming it is in latin-1. The unicode content is then stored to a file using the utf-8 encoding.

I do not know your exact intention, so you should try it.
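import codecs
import urllib2

# Fetch the raw bytes of the page (Python 2).
f = urllib2.urlopen('http://www.precisionautoelectric.com')
content = f.read()
f.close()

# Replace the 0x95 byte with '*', then decode the bytes as latin-1.
unicode_content = unicode(content.replace('\x95', '*'), 'latin-1')

# Write the unicode text out as UTF-8.
f = codecs.open('out.html', 'w', 'utf-8')
f.write(unicode_content)
f.close()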


So awesome!  I will need your help again.  How can I request you?

This is what I am doing.

1) I am doing a reverse proxy using python and beautiful soup to replace content on the fly for advertising for clients.  

2) Sometimes..not often...I get this situation....where some characters are lost and not converted properly.

3) Ideally I would like to keep the original character, but I don't know how. For the time being I just replace it with a *.

I am using web.py as my web framework. I don't save the file; once the html is modified, I serve the page. This is all done in real time. See http://www.precisionautoelectric.adhui.com for the end result.




As far as I know, the latest versions of BeautifulSoup are less forgiving of badly formed documents. To preserve all special characters, the first thing to do is to find out what encoding the pages use. I used cp1250 only because 0x95 is the bullet there and it was the first encoding I found when searching. When the encoding is explicitly stated in the document, BeautifulSoup should not spoil anything, as it works with Unicode internally (if I am not wrong).
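
To see why the choice of encoding matters for keeping the bullet, the sketch below decodes the single byte 0x95 with three candidate encodings. Which of them the page really uses is not established in this thread; the list of candidates is only an illustration.

```python
raw = b'\x95'

# Code point that each candidate encoding assigns to the byte 0x95.
results = {}
for name in ('latin-1', 'cp1250', 'cp1252'):
    ch = raw.decode(name)
    results[name] = ord(ch)
    print('%s -> U+%04X' % (name, ord(ch)))
```

In latin-1 the byte maps to U+0095, an invisible control character, so decoding as latin-1 does not produce a bullet. In cp1250 and cp1252 it maps to U+2022 (the bullet), so decoding with one of those keeps the original character without any replacement.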

Occasionally, you may want to run HTML Tidy to clean up the structure of the document (http://tidy.sourceforge.net/). It can be called from Python as an external utility (http://stackoverflow.com/questions/700051/how-do-i-run-html-tidy-from-python-without-extra-libraries), or the Tidy library can be used in the form of a Python module (http://utidylib.berlios.de/). I am not sure how, but I guess it could detect the existing encoding and make corrections to produce a valid HTML document, including the headers with the encoding. (Expect some experimentation, though; cleaning up the mess may not be that easy. A more pragmatic approach may be easier and sufficient.)

Then you should probably keep everything in Unicode (probably in UTF-8 when stored in a file). This way you should never be forced to replace any characters. It should also be easier to apply various transformations.
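
A minimal sketch of that "decode once, work in Unicode, encode once" flow follows. The helper names to_unicode and rewrite, the cp1252 fallback, and the phone numbers are all made up for illustration; a real proxy should prefer the charset declared by the server or the page when there is one. Trying UTF-8 first is only a heuristic, since legacy-encoded text can occasionally happen to be valid UTF-8.

```python
def to_unicode(raw, fallback='cp1252'):
    """Decode raw page bytes: try UTF-8 first, then a legacy fallback.

    The fallback encoding is an assumption, not something detected.
    """
    try:
        # 'utf-8-sig' also drops a leading BOM if present.
        return raw.decode('utf-8-sig')
    except UnicodeDecodeError:
        # 'replace' substitutes U+FFFD for bytes the fallback cannot map.
        return raw.decode(fallback, 'replace')


def rewrite(raw):
    """Hypothetical proxy step: decode, edit as unicode, encode as UTF-8."""
    text = to_unicode(raw)
    text = text.replace(u'555-0100', u'555-0199')  # made-up replacement
    return text.encode('utf-8')


served = rewrite(b'\x95 Call 555-0100')
print(repr(served))
```

Since the output is always UTF-8, the response should also declare charset=utf-8 (in the Content-Type header or the meta element) so the browser does not have to guess.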