Strings, Binary Data - Regular Expressions

Hi, I'm new to Python. I wanted a project that would help me learn the language, so I decided to write a script that searches Google.

For the most part I've got the basics. However, I'm confused about strings: some of my strings are prefixed with b, as in b'string'. What does that mean? I had to add the prefix to my regular expression to get it to work properly. What exactly does it do, and is it really necessary?

import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(b'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        for res in result:
            print(res)
            
c = GoogleQuery()
if c.query("blah"):
    c.parse()
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)

BrianGEFF719

ASKER

Also, I'm familiar with PHP, where a structure such as {'Item' => 'Value', 'Item2' => 'Value2'} is called an associative array. Is that the same thing as a dictionary in Python? And instead of the indexed (url, desc) results I have now, how would I build a dictionary?

ASKER CERTIFIED SOLUTION

pepr

pepr

... and yes. The associative array (or hash table in other languages) is the Python dictionary.

To correct my statement above: if you get the data as bytes, you cannot apply a regular expression compiled from a string pattern. The pattern must also be of the bytes type. I have no deep experience with Python 3 and regular expressions on bytes; however, you can probably use br'raw bytes', i.e. the br prefix, for the patterns.
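
To illustrate the point about matching types, here is a minimal sketch. The HTML snippet is made up and only stands in for what r1.read() returns; the real page markup is an assumption. It shows a bytes pattern working on bytes data, and a string pattern being rejected with a TypeError.

```python
import re

# Made-up stand-in for the bytes returned by r1.read()
data = b'<a href="http://example.com/" class=l>Example</a>'

# A bytes pattern (br prefix: raw bytes literal) works on bytes data
bytes_pattern = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
matches = bytes_pattern.findall(data)
print(matches)

# A str pattern applied to bytes data is rejected by the re module
str_pattern = re.compile(r'<a href="([^"]*)" class=l>(.*?)</a>')
try:
    str_pattern.findall(data)
    mixed_ok = True
except TypeError as err:
    mixed_ok = False
    print("TypeError:", err)
```

So the b prefix is necessary only as long as the data you search is of the bytes type.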

pepr

For the last part of your question: because your regular expression defines two groups, findall() returns a list of tuples of size 2. The first element of each tuple is the URL, the second is the displayed text. Try the following snippet...

import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        d = {}   # empty dictionary
        for res in result:
            d[res[0]] = res[1]  # insert the value for the key
        return d
            
c = GoogleQuery()
if c.query("blah"):
    d = c.parse()
    for k in d:
        print(k, ' --> ', d[k])
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)

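
The same idea can be tried without a network connection. This is only a sketch: the sample bytes are invented, and dict() is used here as a shortcut for the explicit loop above, since it accepts a sequence of (key, value) pairs.

```python
import re

# Invented sample data with two links, in place of a real response body
sample = (b'<a href="http://a.example/" class=l>First</a>'
          b'<a href="http://b.example/" class=l>Second</a>')

# Two groups in the pattern, so findall() returns a list of 2-tuples
pairs = re.findall(br'<a href="([^"]*)" class=l>(.*?)</a>', sample)

# Build the dictionary directly from the (url, text) pairs
links = dict(pairs)

for url, text in links.items():
    print(url, ' --> ', text)
```

Note that a dictionary keeps only one value per key, so if the same URL appears twice, the later text replaces the earlier one.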
BrianGEFF719

ASKER

Excellent answer, thank you.

BrianGEFF719

ASKER

Oh, one last thing: is there any way to convert the bytes to a string and resolve that whole issue?

pepr

There is a built-in function str() in Python that is used to convert an object to a string. Because a string in Python 3 must be unambiguous (concerning its interpretation), you must also supply the encoding when converting an object of the bytes type (see http://docs.python.org/3.1/library/functions.html#str). This means that you must know the encoding of the downloaded data.
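
A minimal sketch of the conversion, assuming the data is encoded as UTF-8. That encoding is an assumption made for the example; the real one is usually declared by the server and should be checked. Once the bytes are decoded, an ordinary string pattern without the b prefix can be used.

```python
import re

# Invented stand-in for the downloaded bytes; UTF-8 is assumed here
raw = b'<a href="http://example.com/" class=l>Example</a>'

# Two equivalent ways to turn bytes into str with an explicit encoding
text = str(raw, 'utf-8')
same_text = raw.decode('utf-8')

# After decoding, a plain string pattern works
links = re.findall(r'<a href="([^"]*)" class=l>(.*?)</a>', text)
print(links)
```

Calling str() on a bytes object without giving an encoding does not decode anything; it returns the printable representation, including the b prefix and the quotes.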

For your GoogleQuery class (and the like), you may want to implement the special method named __str__ (see http://docs.python.org/3.1/reference/datamodel.html#basic-customization and http://docs.python.org/3.1/reference/datamodel.html#object.__str__). This method of the object is called by the built-in function str() when the object is passed as its argument. It is also used when you print() the object.
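
As a rough sketch of the __str__ idea: QueryResult below is a hypothetical, cut-down stand-in for your GoogleQuery class that only stores the status and the reason, and the output format is just one possible choice.

```python
class QueryResult:
    """Hypothetical stand-in for GoogleQuery; holds status and reason only."""

    def __init__(self, status, reason):
        self.status = status
        self.reason = reason

    def __str__(self):
        # Called by str(obj) and by print(obj)
        return "QueryResult(status={}, reason={})".format(self.status, self.reason)


r = QueryResult(200, "OK")
as_text = str(r)   # uses __str__
print(r)           # also uses __str__
```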