
Strings, Binary Data - Regular Expressions

Posted on 2009-12-17
Last Modified: 2012-05-08
Hi, I'm new to Python. I tried to pick a project that would help me learn the language, so I decided on a script that would let me search Google.

For the most part I've got the basics. However, I'm confused about strings: my strings are prefixed with b, as in b'string'. I had to do that to my regular expression to get it to work properly. What does the prefix mean exactly, and is it really necessary?


import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(b'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        for res in result:
            print(res)
            
c = GoogleQuery()
if c.query("blah"):
    c.parse()
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)

Question by:BrianGEFF719

Author Comment

by:BrianGEFF719
ID: 26070678
Also, I'm familiar with PHP, where a list such as {'Item' => 'Value', 'Item2' => 'Value2'} is called an associative array. Is that the same thing as a dictionary in Python? And instead of the indexed array of (url, desc) that I have now, how would I build a dictionary?

Accepted Solution

by:
pepr earned 2000 total points
ID: 26071656
The b prefix for literals comes with Python 3.x.  Python 3 distinguishes between strings (always Unicode) and sequences of bytes.  The latter are represented by the type 'bytes' (see http://docs.python.org/3.1/library/functions.html#bytes and  http://docs.python.org/3.1/reference/lexical_analysis.html#string-and-bytes-literals for details).
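
As a small illustration of the difference between the two types (the sample word and variable names are made up for this sketch, not taken from the question):

```python
# Python 3: text (str) and binary data (bytes) are separate types.
text = 'caf\u00e9'          # str: a sequence of Unicode code points
raw = b'caf\xc3\xa9'        # bytes: the same word encoded as UTF-8

print(type(text).__name__)   # str
print(type(raw).__name__)    # bytes
print(len(text), len(raw))   # 4 5
print(raw[0])                # 99, indexing bytes gives an int
print(text == raw)           # False, str and bytes never compare equal
```

The lengths differ because the accented letter is one character in the str but two bytes in its UTF-8 encoding.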

For regular expression patterns, you want to use r'raw strings' (with r in front of the opening quote).  For example,

reobj = re.compile(r'<a href="([^"]*)" class=l>(.*?)</a>')

Python parses raw string literals so that backslashes and the characters after them are not interpreted as escape sequences.  You can use normal string literals instead; however, you would then have to double every backslash (and you probably do not want to do that).
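
The pattern above contains no backslashes, so the r prefix does not change it. The difference shows up with patterns such as \d. A minimal sketch, with a made-up pattern and sample text:

```python
import re

# A raw literal keeps the backslash; a normal literal needs it doubled.
raw_pattern = r'\d+\.\d+'
plain_pattern = '\\d+\\.\\d+'
print(raw_pattern == plain_pattern)   # True

# In a normal literal '\n' is one newline character; in a raw literal it is
# two characters, a backslash followed by 'n'.
print(len('\n'), len(r'\n'))          # 1 2

print(re.findall(raw_pattern, 'pi is 3.14, e is 2.72'))   # ['3.14', '2.72']
```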

Expert Comment

by:pepr
ID: 26071773
... and yes. The associative array (or hash table in other languages) is the Python dictionary.

To correct my statement above: if you get the data as bytes, you cannot apply a regular expression compiled from a string pattern.  The pattern must also be of the bytes type.  I have no deep experience with Python 3 and regular expressions on bytes; however, you can probably use br'raw bytes', i.e. the br prefix, for the patterns.
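
A short sketch of that rule, using a made-up bytes value in place of the data returned by r1.read():

```python
import re

# Sample data standing in for the bytes returned by r1.read() (made up).
data = b'<a href="http://example.com/" class=l>Example</a>'

bytes_pattern = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
str_pattern = re.compile(r'<a href="([^"]*)" class=l>(.*?)</a>')

# A bytes pattern matches bytes data and returns bytes groups.
print(bytes_pattern.findall(data))   # [(b'http://example.com/', b'Example')]

# A str pattern applied to bytes data raises TypeError.
try:
    str_pattern.findall(data)
except TypeError as exc:
    print('TypeError:', exc)
```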

Expert Comment

by:pepr
ID: 26071887
For the last part of your question: because your regular expression defines two groups, findall() returns a list of tuples of size 2.  In each tuple, the first item is the URL and the second is the displayed text.  Try the following snippet:
import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        d = {}   # empty dictionary
        for res in result:
            d[res[0]] = res[1]  # insert the value for the key
        return d
            
c = GoogleQuery()
if c.query("blah"):
    d = c.parse()
    for k in d:
        print(k, ' --> ', d[k])
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)
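
Because findall() already yields (key, value) pairs, the loop that fills the dictionary can also be written with dict(). This is an alternative sketch, not the code from the answer, and the sample data is made up so that it runs without a network connection:

```python
import re

# Made-up sample standing in for self.data.
data = (b'<a href="http://a.example/" class=l>First</a>'
        b'<a href="http://b.example/" class=l>Second</a>')

# Two groups in the pattern, so each match is a 2-tuple (url, text).
pairs = re.findall(br'<a href="([^"]*)" class=l>(.*?)</a>', data)
print(pairs)

# dict() accepts an iterable of (key, value) pairs directly.
links = dict(pairs)
for url, text in links.items():
    print(url, ' --> ', text)
```

As with the explicit loop, a URL that appears more than once keeps only the last text seen for it.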


Author Comment

by:BrianGEFF719
ID: 26072111
Excellent answer, thank you.

Author Comment

by:BrianGEFF719
ID: 26072206
Oh, one last thing: is there any way to convert the bytes to a string and resolve that whole issue?

Expert Comment

by:pepr
ID: 26074021
Python has a built-in function str() that converts an object to a string.  Because a string in Python 3 must be unambiguous in its interpretation, you must also supply the encoding when converting an object of the bytes type (see http://docs.python.org/3.1/library/functions.html#str).  This means that you must know the encoding of the downloaded data.
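
A minimal sketch of the conversion, assuming the data is UTF-8 (for a real HTTP response the encoding is usually declared in the Content-Type header; the sample bytes here are made up):

```python
import re

raw = b'Gr\xc3\xbc\xc3\x9fe'           # "Grüße" encoded as UTF-8 (made-up sample)

text = str(raw, 'utf-8')               # str() with an explicit encoding
same = raw.decode('utf-8')             # equivalent bytes method
print(text == same)                    # True

# Without an encoding, str() returns the repr of the bytes object instead.
print(str(b'abc'))                     # b'abc'

# Once decoded, an ordinary str pattern works.
print(re.findall(r'\w+', text))        # ['Grüße']
```

Decoding once, right after reading the response, means the rest of the script can use plain str patterns and no b prefix.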

For your GoogleQuery class (and the like), you may want to implement the special method named __str__ (see http://docs.python.org/3.1/reference/datamodel.html#basic-customization and http://docs.python.org/3.1/reference/datamodel.html#object.__str__). This method of the object is called by the built-in function str() when the object is passed as its argument. It is also used when you print() the object.
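
For illustration, here is what such a method could look like. The class QueryResult is a hypothetical stand-in for GoogleQuery, reduced to two attributes so that it runs without a network connection:

```python
class QueryResult:
    """Hypothetical stand-in for GoogleQuery, reduced to two attributes."""

    def __init__(self, status, reason):
        self.status = status
        self.reason = reason

    def __str__(self):
        # Called by str() and by print().
        return 'status {} ({})'.format(self.status, self.reason)


result = QueryResult(200, 'OK')
print(result)            # status 200 (OK)
print(str(result))       # status 200 (OK)
```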