cpeters5
asked on
Python 3.4 - Extracting unicode
The attached code fails to match extended ASCII characters (e.g. the cross character ×, or the u-umlaut in Künkele).
The sample code fetches url = "http://www.theplantlist.org/1.1/browse/A/Orchidaceae/Dactylorhiza/"
and looks for lines of the form
<i class="genus">Dactylorhiza</i> <i class="specieshybrid">×</i> <i class="species">abantiana</i> <span class="authorship">H.Baumann & Künkele</span>
It then extracts 6 values (bold faced). The code fails to return strings that contain any extended ASCII characters. Decoding to utf-8 doesn't help.
How do I fix this?
fetchFromBrowse3.py
That page says it is already serving its content in UTF-8, so it's not surprising that 'Decoding to utf-8 doesn't help.' The small 'x' is found on this page http://www.alanwood.net/unicode/spacing_modifier_letters.html and it is a two-byte character in UTF-8, which is not "extended ASCII".
ASKER
Thank you clockwatcher, I got the extended characters part fixed. Now, how do I get the cross character "x"? Result still shows "None" for them.
Dave Baldwin, thanks for pointing out that the source already declares utf-8 (however, without the gline = gline.decode('utf-8') block, I got \xc3\xbc output for the u-umlaut). I don't really understand how to handle non-ASCII characters.
I don't either. I just know about the problem, not so much what to do about it.
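For readers following along: the \xc3\xbc output mentioned above is what the u-umlaut looks like while the line is still a bytes object. A minimal sketch of the decode step, assuming the page really is UTF-8 as declared (the sample bytes are typed in by hand here, not fetched from the site):

```python
# Bytes as they would arrive from the network; \xc3\xbc is "ü" encoded in UTF-8.
raw = b"H.Baumann & K\xc3\xbcnkele"

# Decoding turns the two bytes into one str character (U+00FC).
text = raw.decode("utf-8")

# ascii() is used so the result can be shown even on a console
# that cannot display the character itself.
print(ascii(text))
print(len(raw), len(text))
```

After decoding, the string is one character shorter than the byte sequence, because the two bytes c3 bc collapse into the single character ü.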
ASKER
Sorry guys for the messy code. This is my first Python code and I haven't shed my Perl habits quite yet!
However, I took clockwatcher's and pepr's suggestions and simplified the code.
It turns out it is really a font problem. When I stripped off all other search criteria and searched only for the cross character, I got the error message below:
t = re.search('(?:"specieshybrid">(.*))</i>',gline)
print(t.group(1))
'charmap' codec can't encode character '\xd7' in position 0: character maps to <undefined>
According to the UTF-8 character table, the Unicode code point of the cross sign × is U+00D7 (is that what \xd7 means?) and it is represented in hex by the bytes c3 97.
I suspect it is a matter of how to decode "gline" so that \xd7 is detected.
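As a rough sketch of the search step on an already decoded line: the sample line is the one quoted in the question, the variable names other than gline are made up for illustration, and the pattern uses a non-greedy group, since a greedy (.*) would run on to the last </i> on the line.

```python
import re

# The sample line from the question, encoded and decoded to mimic a fetched line.
raw_line = (
    '<i class="genus">Dactylorhiza</i> <i class="specieshybrid">\u00d7</i> '
    '<i class="species">abantiana</i> '
    '<span class="authorship">H.Baumann & K\u00fcnkele</span>'
).encode("utf-8")
gline = raw_line.decode("utf-8")

# Raw-string pattern; (.*?) stops at the first closing </i>.
pattern = re.compile(r'class="specieshybrid">(.*?)</i>')
match = pattern.search(gline)
hybrid = match.group(1) if match else None

# ascii() avoids the console encoding problem when showing the result.
print(ascii(hybrid))
```

The match itself succeeds here; the 'charmap' error in the post is raised later, when print tries to show the character.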
Yes, the `\x` escape means that the Unicode character is expressed as a number using the next two hexadecimal digits. Similarly, `\u` says that the next four characters are the hexadecimal digits of the character's ordinal value.
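A small check of the escapes described here, using only built-in string operations; it also shows how the code point U+00D7 relates to the bytes c3 97 mentioned in the question:

```python
cross = "\xd7"  # two hex digits after \x

# The same character written with \u (four hex digits) and by its Unicode name.
same_as_u = (cross == "\u00d7")
same_as_name = (cross == "\N{MULTIPLICATION SIGN}")

# The code point is a number; the UTF-8 encoding is a separate byte sequence.
code_point = ord(cross)
encoded = cross.encode("utf-8")

print(same_as_u, same_as_name, hex(code_point), encoded)
```

So \xd7 in a str literal names the code point, while c3 97 is how that code point is stored in a UTF-8 file or HTTP response.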
Your code is fine. The problem is in the `print` itself: it cannot convert the special character for display on the console. (The console does not support Unicode here, and the character has no equivalent in its 8-bit encoding, which is what the 'charmap' error is saying.) You can write it to a file instead of printing to the console. Then you will see that it works.
Use r'raw string' literals for regular expression patterns to avoid doubling backslashes (similar to Perl's /rawstring/ in patterns, if I recall correctly).
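If the text still has to go to the console, one possible workaround is to replace anything the console encoding cannot represent with a backslash escape before printing. This is only a sketch; console_safe is a made-up helper name, and it assumes sys.stdout.encoding reports the console's encoding:

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # Hypothetical helper: falls back to ASCII if the encoding is unknown.
    enc = encoding or sys.stdout.encoding or "ascii"
    return text.encode(enc, errors="backslashreplace").decode(enc)

print(console_safe("Dactylorhiza \u00d7 abantiana"))
```

On a UTF-8 terminal the line is printed unchanged; on a console whose code page lacks the character, \xd7 appears instead of an exception.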
ASKER
Hi pepr,
I did just that. No deal. Problem with both print and write to file.
At this point, I will just drop this part of the question. I need to finish the code soon before my attention goes somewhere else :-)
I am using a workaround: I simply detect the specieshybrid string and, if it exists, write the "×" to that record. This works (I am 95% sure the specieshybrid tag is consistent with the existence of the ×). I still wish to learn how to handle these non-ASCII characters properly in Python.
Thank you all for your help.
ASKER
Thank you all for the help. Thanks also to clockwatcher, although I didn't try his suggestion to use an HTML parser, being new to Python (I haven't gotten very deep into the tutorial yet.) This is just a quick and dirty code that I need to finish in a short time. I will revisit it later and by then I will have more time and skill to do it properly, perhaps with BeautifulSoup.
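For a later revisit, here is a rough sketch of the HTML-parser route using only the standard library's html.parser (not BeautifulSoup). The class name NameParser and the fields dictionary are invented for illustration, and the input is the sample line from the question rather than the live page:

```python
from html.parser import HTMLParser

class NameParser(HTMLParser):
    """Collect the text of <i>/<span> elements, keyed by their class attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # class of the element we are inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag in ("i", "span") and cls:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        self._current = None

line = ('<i class="genus">Dactylorhiza</i> <i class="specieshybrid">\u00d7</i> '
        '<i class="species">abantiana</i> '
        '<span class="authorship">H.Baumann & K\u00fcnkele</span>')

parser = NameParser()
parser.feed(line)
parser.close()
print(sorted(parser.fields))
```

The parser works on str, so the fetched bytes would still need to be decoded first, and the × and ü come through as ordinary characters.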
pax
Try the following:
s = '\xd7' # the same as '\u00d7'
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(s)
print(s)
The test.txt file should contain two bytes with hex values c3 97. When you open it in your favourite UTF-8 capable editor, you will see the special 'x'. The print on the last line will fail in your case.
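To check the file contents without an editor, the bytes can be read back in binary mode. A small sketch along the same lines; it writes to a temporary directory rather than the current one, which is a choice made here and not part of the suggestion above:

```python
import os
import tempfile

s = "\u00d7"

# Write the character as UTF-8 text to a scratch file.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(s)

# Read the raw bytes back; no decoding happens in "rb" mode.
with open(path, "rb") as f:
    data = f.read()

print(data)
```

Printing a bytes object is safe on any console, because its repr contains only ASCII characters.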
ASKER
Hello,
I just posted another related question:
https://www.experts-exchange.com/questions/28551363/Python-Pattern-matching-matches-previous-match.html
Thanks!
pax
ASKER
Never mind, I got it!
Thanks