Solved

Python 3.4 - Extracting unicode

Posted on 2014-10-31
Last Modified: 2014-11-05
The attached code fails to match extended ASCII characters (e.g. the cross character ×, or the u-umlaut in Künkele).
The sample code fetches url = "http://www.theplantlist.org/1.1/browse/A/Orchidaceae/Dactylorhiza/"
and looks for lines of the form

<i class="genus">Dactylorhiza</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">abantiana</i> <span class="authorship">H.Baumann & Künkele</span>

It then extracts 6 values (bold faced).  The code fails to return strings that contain any extended ASCII characters.  Decoding to utf-8 doesn't help.

How do I fix this?
fetchFromBrowse3.py
Question by:cpeters5
13 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 40416907
That page says it is displaying the content in UTF-8 already... so it's not surprising that 'Decoding to utf-8 doesn't help.'  The small 'x' is found on this page http://www.alanwood.net/unicode/spacing_modifier_letters.html and it is a two-byte character in UTF-8, which is not "extended ASCII".
 
LVL 25

Accepted Solution

by:
clockwatcher earned 300 total points
ID: 40416936
Not exactly sure what you think it's failing to return.  

I think your biggest problem is that you're masking your true problem in your try/except block.   The default codepage for an English language Windows console doesn't have the characters you want to display, so Python throws an exception on your print(x).  That exception will cause it to fall into the except block and skip over the code that writes those lines to your file.

Either move the write to the file up before the print or encode the data that you're planning on printing to the console to something a little safer  (e.g.)
   print(x.encode('unicode-escape'))




In other words, try this:
	try:	#-- look for source info
		w = re.search('class="genus">([^<]+)</i>.*?(?:class="specieshybrid">([^<]*)</i>){0,1}.*?class="species">([^<]+)</i>.*?(?:class="infraspr">([^<]+)</span>){0,1}.*?(?:class="infraspe">([^<]+)</i>){0,1}.*?class="authorship">([^<]+)</span>',str(gline))
		x = '{0},{1},{2},{3},{4},{5}\n'.format(w.group(1),w.group(2),w.group(3),w.group(4),w.group(5),w.group(6))
		f.write(x)
		print(x.encode('unicode-escape')) 
	except Exception as e:
		#print("3",str(e))
		continue
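
As an alternative to unicode-escape, here is a minimal sketch of a console-safe print helper. The name safe_console_text is hypothetical, and cp437 is only an assumed example of a Windows console codepage; the idea is to escape just the characters the console encoding lacks, so that print() can no longer raise inside the try block.

```python
import sys


def safe_console_text(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes.

    Hypothetical helper: by default it uses the encoding reported by
    sys.stdout and falls back to ASCII if none is reported.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \xNN / \uNNNN text
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


record = "Dactylorhiza,\u00d7,abantiana,None,None,H.Baumann & K\u00fcnkele"

# Always printable, whatever encoding the current console uses
print(safe_console_text(record))

# What an assumed cp437 console would receive: only the cross is escaped
cp437_view = safe_console_text(record, "cp437")
```

Catching a narrower exception than Exception (or testing the regex result for None instead of relying on the except block) would also keep an encoding error from being silently swallowed.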

 

Author Comment

by:cpeters5
ID: 40417092
Thank you clockwatcher, I got the extended characters part fixed.  Now, how do I get the cross character "×"?  The result still shows "None" for it.

Dave Baldwin, thanks for pointing out that utf-8 is already declared by the source (however, without the gline = gline.decode('utf-8') block, I got \xc3\xbc output for the u-umlaut).  I don't really understand how to handle non-ASCII characters.
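
A small sketch of why the \xc3\xbc shows up without the decode step. It assumes gline is a bytes object, as returned when reading from urlopen(); the sample bytes below are built by hand for illustration rather than fetched from the site.

```python
# Stand-in for one line read from the HTTP response (bytes, UTF-8 encoded)
raw = "H.Baumann & K\u00fcnkele".encode("utf-8")

# str() on bytes does not decode: it returns the repr, escapes included
as_repr = str(raw)

# decode() interprets the two bytes c3 bc as the single character ü
as_text = raw.decode("utf-8")

print(ascii(as_repr))
print(ascii(as_text))
```

So calling str(gline) on undecoded bytes hands the regular expression the escaped repr, while gline.decode('utf-8') hands it the real text.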

 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 40417382
I don't either.  I just know about the problem, not so much what to do about it.
 
LVL 25

Assisted Solution

by:clockwatcher
clockwatcher earned 300 total points
ID: 40417624
Your problem isn't with non-ascii characters.  It's with your regular expression.

class="genus">([^<]+)</i>.*?(?:class="specieshybrid">([^<]*)</i>){0,1}.*?class="species">([^<]+)</i>.*?(?:class="infraspr">([^<]+)</span>){0,1}.*?(?:class="infraspe">([^<]+)</i>){0,1}.*?class="authorship">([^<]+)</span>

I know what you're trying to do, but that regular expression is incredibly tough to read with all the optional parts in it.  And the optional parts are what is causing your problem: with the {0,1} on the specieshybrid group, group 2 isn't picked up.  Without it, it is:

import re

line = '<i class="genus">Dactylorhiza</i> <i class="specieshybrid">WHATEVER</i>&nbsp;<i class="species">abantiana</i> <span class="authorship">H.Baumann & Künkele</span>'

m = re.search('class="genus">([^<]+)</i>.*?(?:class="specieshybrid">([^<]*)</i>){0,1}.*?class="species">([^<]+)</i>.*?(?:class="infraspr">([^<]+)</span>){0,1}.*?(?:class="infraspe">([^<]+)</i>){0,1}.*?class="authorship">([^<]+)</span>',line)
print(m.group(2))   # returns None

m = re.search('class="genus">([^<]+)</i>.*?(?:class="specieshybrid">([^<]*)</i>).*?class="species">([^<]+)</i>.*?(?:class="infraspr">([^<]+)</span>){0,1}.*?(?:class="infraspe">([^<]+)</i>){0,1}.*?class="authorship">([^<]+)</span>',line)
print(m.group(2))  # returns WHATEVER



Anyway... it has nothing to do with the non-ascii characters.   As you can see from the above.

I really suggest using an HTML parser/treebuilder (e.g., BeautifulSoup) to parse HTML rather than mucking about with regular expressions like you're trying to do.  It's going to make your code much more maintainable.
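
For a rough idea of the parser approach without installing anything, here is a sketch that uses the standard library's html.parser in place of BeautifulSoup (a deliberate swap, not what the comment above names). The class name ClassTextCollector and its WANTED set are hypothetical; the sample line is the one from the question.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of elements whose class attribute is in WANTED."""

    WANTED = {"genus", "specieshybrid", "species", "authorship"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # class of the element being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.WANTED:
            self._current = cls
            self.fields[cls] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # The elements of interest are not nested, so any end tag closes them
        self._current = None


sample = ('<i class="genus">Dactylorhiza</i> <i class="specieshybrid">\u00d7</i>&nbsp;'
          '<i class="species">abantiana</i> '
          '<span class="authorship">H.Baumann & K\u00fcnkele</span>')

collector = ClassTextCollector()
collector.feed(sample)
collector.close()
fields = collector.fields
print(ascii(fields))
```

A missing specieshybrid element simply leaves that key out of fields, so there is no optional group to get wrong.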

If you want to stick with just regular expressions, I would suggest breaking things up into manageable chunks and using multiple regular expressions to find what you're after rather than try to get everything done in one.   Something along the lines of:

import re
import urllib.request

url = "http://www.theplantlist.org/1.1/browse/A/Orchidaceae/Dactylorhiza/"

page = urllib.request.urlopen(url).read().decode('utf-8')
for td_match in re.finditer(r"<td.*?>(.*?)</td>", page, flags=re.DOTALL):
    td_html = td_match.group(1)
    if re.search('class="genus"', td_html):
        # Reset for every cell so a missing tag cannot reuse the previous row's value
        genus = species_hybrid = authorship = ''

        match = re.search('<i class="genus">(.*?)</i>', td_html)
        if match:
            genus = match.group(1)

        match = re.search('<i class="specieshybrid">(.*?)</i>', td_html)
        if match:
            species_hybrid = match.group(1)

        match = re.search('<span class="authorship">(.*?)</span>', td_html)
        if match:
            authorship = match.group(1)

        # unicode-escape keeps the console from choking on non-ASCII characters
        print("Genus: {0}\nHybrid: {1}\nAuthor: {2}".format(genus, species_hybrid.encode('unicode-escape'), authorship.encode('unicode-escape')))

 
LVL 29

Assisted Solution

by:pepr
pepr earned 200 total points
ID: 40417819
The problem is that you probably cannot force the regular expression engine to choose the variant that you prefer. When you look at
(?:class="specieshybrid">([^<]*)</i>){0,1}.*?


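a valid match is also one that skips the optional specieshybrid group and lets the following .*? consume that text.  The engine finds this match first, because the lazy .*? in front of the group initially matches nothing and the group cannot start at that position.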
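
One possible way around this, sketched under the assumption that each input string holds a single record: move the lazy skip inside the optional group, so the group itself is allowed to scan forward to the specieshybrid tag. The pattern below is a simplified, hypothetical version of the original (the infraspr/infraspe parts are left out), and the second sample line is made up for illustration.

```python
import re

# The ".*?" now lives inside the optional group, so the group is tried
# (and can succeed) before the engine falls back to skipping it.
PATTERN = re.compile(
    r'class="genus">([^<]+)</i>'
    r'(?:.*?class="specieshybrid">([^<]*)</i>)?'
    r'.*?class="species">([^<]+)</i>'
    r'.*?class="authorship">([^<]+)</span>'
)

hybrid_line = ('<i class="genus">Dactylorhiza</i> <i class="specieshybrid">\u00d7</i>&nbsp;'
               '<i class="species">abantiana</i> '
               '<span class="authorship">H.Baumann & K\u00fcnkele</span>')

# Made-up record without a specieshybrid element
plain_line = ('<i class="genus">Dactylorhiza</i> <i class="species">example</i> '
              '<span class="authorship">Some Author</span>')

hybrid = PATTERN.search(hybrid_line)
plain = PATTERN.search(plain_line)

print(ascii(hybrid.groups()))
print(ascii(plain.groups()))
```

If several records can share one string, the inner .*? could run into the next record, so splitting the input per record first (or using a parser) would be safer.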

The more complex the regular expression is, the less control you have.

Anyway, you can at least break it into more readable pieces:
rex = re.compile(r'''class="genus">([^<]+)</i>.*?
                     (?:class="specieshybrid">([^<]*)</i>){0,1}.*?
                     class="species">([^<]+)</i>.*?
                     (?:class="infraspr">([^<]+)</span>){0,1}.*?
                     (?:class="infraspe">([^<]+)</i>){0,1}.*?
                     class="authorship">([^<]+)
                     </span>''', re.VERBOSE)

 

Author Comment

by:cpeters5
ID: 40418020
Sorry guys for the messy code.  This is my first Python code and I haven't shed my Perl habits quite yet!
However, I took clockwatcher's and pepr's suggestions and simplified.
It turns out it is really a font problem.  When I stripped off all other search criteria and searched only for the cross character, I got this error message:

t = re.search('(?:"specieshybrid">(.*))</i>',gline)
print(t.group(1))

'charmap' codec can't encode character '\xd7' in position 0: character maps to <undefined>



According to the UTF-8 character table, the Unicode code point of the cross sign × is U+00D7 (which is written as \xd7?) and it is represented in UTF-8 by the hex bytes c3 97.

I suspect it is the matter of how to decode "gline" to get \xd7 detected
 
LVL 29

Expert Comment

by:pepr
ID: 40418105
Yes, the `\x` escape means that the unicode character is expressed as a number using the next two hexadecimal digits. Similarly, the `\u` says that the next four characters are hexadecimal digits of the character ordinal value.
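
A short sketch of the difference between the code point and its UTF-8 bytes, using only built-in string methods. The variable names are illustrative.

```python
cross = "\xd7"  # one character, code point U+00D7

# \xd7, \u00d7 and chr(0xD7) are three spellings of the same character
same_character = cross == "\u00d7" == chr(0xD7)

# Encoding turns the single code point into bytes; UTF-8 needs two of them
utf8_bytes = cross.encode("utf-8")
latin1_bytes = cross.encode("latin-1")

print(hex(ord(cross)), utf8_bytes, latin1_bytes)
```

In other words, '\xd7' in a str names the character itself, while c3 97 is only how UTF-8 stores that character in a file or on the wire.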

Your code is fine. The problem is related to the `print` itself. It cannot convert the special character for display on the console. (The console does not support Unicode, and the character cannot be converted to the console's 8-bit encoding.) You can write it to a file instead of printing it to the console. Then you will see it works.

Use r'raw string' literals for regular expression patterns to avoid doubling backslashes (similar to Perl's /rawstring/ in patterns, if I recall correctly).
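
To illustrate the raw-string advice, a tiny sketch using the \b word-boundary escape (chosen here only as an example; it does not appear in the original pattern):

```python
import re

text = 'class="species"'

# Raw string: the regex engine receives backslash + b, a word boundary
word = re.search(r"\bspecies\b", text)

# Normal string: Python turns \b into a backspace character before the
# regex engine ever sees it, so the pattern looks for literal backspaces
not_word = re.search("\bspecies\b", text)

print(word is not None, not_word is None)
```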
 

Author Comment

by:cpeters5
ID: 40418134
Hi pepr,
I did just that. No deal.  Problem with both print and write to file.  
At this point, I will just drop this part of the question.  I need to finish the code soon before my attention goes somewhere else :-)

I am taking a workaround by simply detecting the specieshybrid string and, if it exists, writing the "×" to that record.  This works (I am 95% sure the specieshybrid tag is consistent with the existence of the ×).  I still wish to learn how to handle these non-ASCII characters properly in Python.

Thank you all for your help.
 

Author Closing Comment

by:cpeters5
ID: 40418142
Thank you all for the help.  Thanks also to  clockwatcher, although I didn't try his suggestion to use an HTML parser, being new to Python (I haven't gotten very deep into the tutorial yet.)  This is just a quick and dirty code that I need to finish in a short time.  I will revisit it later and by then I will have more time and skill to do it properly, perhaps with BeautifulSoup.
pax
 
LVL 29

Expert Comment

by:pepr
ID: 40418234
Try the following:
s = '\xd7'    # the same as '\u00d7'
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(s)
print(s)    


The test.txt file should contain two bytes with the hex values c3 97. When you view it in your favourite UTF-8 capable editor, you will see the special 'x'. The final print will fail in your case.
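
The file contents can also be checked without an editor. This sketch writes to a temporary directory (an assumption made so it does not touch your working folder) and reads the file back in binary mode:

```python
import os
import tempfile

s = "\xd7"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# An explicit encoding avoids depending on the platform default
with open(path, "w", encoding="utf-8") as f:
    f.write(s)

# Binary mode returns the bytes exactly as stored on disk
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

If open() is called without encoding=, Python uses a platform-dependent default, which on Windows may be a codepage that cannot represent every character; that might be why writing to the file failed as well as printing.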
 

Author Comment

by:cpeters5
ID: 40424455
Hello,
I just posted another related question:

http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_28551363.html
Thanks!
pax
 

Author Comment

by:cpeters5
ID: 40424510
Never mind, I got it!
Thanks