Strings, Binary Data - Regular Expressions

Hi, I'm new to Python. I tried to pick a project that would help me learn the language, so I decided on a script that would let me search Google.

For the most part I've got the basics; however, I'm confused about strings. My strings are prefixed with b, as in b'string'. What does that mean? I had to add the prefix to my regular expression to get it to work properly. What does it mean exactly, and is it really necessary?


import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(b'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        for res in result:
            print(res)
            
c = GoogleQuery()
if c.query("blah"):
    c.parse()
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)

BrianGEFF719 Asked:
 
pepr Commented:
The b prefix for literals becomes meaningful in Python 3.x.  Python 3 distinguishes between strings (always Unicode) and sequences of bytes.  The latter have the type 'bytes' (see http://docs.python.org/3.1/library/functions.html#bytes and http://docs.python.org/3.1/reference/lexical_analysis.html#string-and-bytes-literals for details).
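
To make the distinction concrete, here is a minimal sketch. The variable names are illustrative only and do not come from the script in the question.

```python
# A str holds Unicode text; a bytes object holds raw 8-bit values.
text = 'string'
raw = b'string'

print(type(text))   # <class 'str'>
print(type(raw))    # <class 'bytes'>

# The two types never compare equal, even with "the same" content.
print(text == raw)  # False

# Indexing bytes yields an integer; indexing str yields a 1-character str.
print(raw[0])       # 115
print(text[0])      # s
```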

For regular expression patterns, you want to use r'raw strings' (with r in front of the opening quote).  For example,

reobj = re.compile(r'<a href="([^"]*)" class=l>(.*?)</a>')

Python parses raw string literals so that a backslash and the character after it are not interpreted as an escape sequence.  Otherwise, you can use normal string literals; however, you would be forced to double the backslashes (and you probably do not want to do that).
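
As a small illustration of the difference (the pattern and the sample text are made up for this sketch):

```python
import re

# Without the r prefix, each backslash meant for the regex must be doubled.
plain = '\\d+\\.\\d+'
raw = r'\d+\.\d+'

print(plain == raw)            # True: both spell the same pattern

# In a normal literal, '\n' is one newline character; in a raw literal
# it is a backslash followed by the letter n.
print(len('\n'), len(r'\n'))   # 1 2

match = re.search(raw, 'version 3.1 released')
print(match.group(0))          # 3.1
```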
 
BrianGEFF719 (Author) Commented:
Also, I'm familiar with PHP, where a list such as {'Item' => 'Value', 'Item2' => 'Value2'} is called an associative array. Is that the same thing as a dictionary in Python? And how would I build a dictionary instead of the indexed array of (url, desc) that I have now?
 
pepr Commented:
... and yes. The PHP associative array (called a hash table in some other languages) corresponds to the Python dictionary.
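
A minimal sketch of the dictionary basics, reusing the keys and values from the question (the name items is just an illustration):

```python
# Rough Python equivalent of the PHP array from the question.
items = {'Item': 'Value', 'Item2': 'Value2'}

items['Item3'] = 'Value3'      # add or overwrite an entry
print(items['Item'])           # Value
print('Item2' in items)        # True

# Iterate over key/value pairs.
for key, value in items.items():
    print(key, '=>', value)
```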

To correct my statement above: if you get the data as bytes, you cannot apply a regular expression compiled from a string pattern. The pattern must also be of the bytes type.  I have no deep experience with Python 3 and regular expressions on bytes; however, you can probably use br'raw bytes', i.e. the br prefix for the patterns.
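
The following sketch shows what happens when the pattern type and the data type do not match. The data value is a hypothetical stand-in for what r1.read() returns, so the example runs without a network connection.

```python
import re

# Hypothetical stand-in for the bytes returned by r1.read().
data = b'<a href="http://example.com/" class=l>Example</a>'

bytes_pattern = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
str_pattern = re.compile(r'<a href="([^"]*)" class=l>(.*?)</a>')

# A bytes pattern works on bytes data and returns bytes groups.
print(bytes_pattern.findall(data))
# [(b'http://example.com/', b'Example')]

# A str pattern applied to bytes data raises TypeError.
try:
    str_pattern.findall(data)
    mixing_failed = False
except TypeError as err:
    mixing_failed = True
    print('TypeError:', err)
```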
 
pepr Commented:
For the last part of your question... Because your regular expression defines two groups, findall() returns a list of tuples of size 2.  In each tuple, the first element is the URL and the second is the displayed text.  Try the following snippet...
import http.client
import re


class GoogleQuery:
    def query(self,q):
        conn = http.client.HTTPConnection("www.google.com")
        conn.request("GET", "/search?q=" + q)
        r1 = conn.getresponse()
        self.status = r1.status
        self.reason = r1.reason
        if self.status == 200:
            self.data = r1.read()
            return True
        else:
            return False
        
    def parse(self):
        reobj = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
        result = reobj.findall(self.data)
        d = {}   # empty dictionary
        for res in result:
            d[res[0]] = res[1]  # insert the value for the key
        return d
            
c = GoogleQuery()
if c.query("blah"):
    d = c.parse()
    for k in d:
        print(k, ' --> ', d[k])
else:
    print("Unable to query google, got error: ",c.status," -- ", c.reason)
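
Because findall() already yields 2-tuples, the explicit loop in parse() could probably be shortened by passing the result straight to dict(). This is only a sketch of that alternative; the sample bytes are hypothetical and stand in for self.data so that it runs offline.

```python
import re

# Hypothetical sample standing in for self.data, so the sketch runs offline.
sample = (b'<a href="http://a.example/" class=l>First</a>'
          b'<a href="http://b.example/" class=l>Second</a>')

reobj = re.compile(br'<a href="([^"]*)" class=l>(.*?)</a>')
result = reobj.findall(sample)   # list of (url, text) tuples

# dict() accepts an iterable of 2-tuples, so no explicit loop is needed.
d = dict(result)

for url, text in d.items():
    print(url, ' --> ', text)
```

Note that a URL appearing twice keeps only its last text, which is the same behaviour as the loop version.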

 
BrianGEFF719 (Author) Commented:
Excellent answer, thank you.
 
BrianGEFF719 (Author) Commented:
Oh, one last thing: is there any way to convert the bytes to a string and resolve that whole issue?
 
pepr Commented:
There is a built-in function str() in Python that converts an object to a string.  Because a string in Python 3 must be unambiguous (concerning its interpretation), you must also supply the encoding when converting an object of the bytes type (see http://docs.python.org/3.1/library/functions.html#str).  This means that you must know the encoding of the downloaded data.
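
A minimal sketch of the conversion follows. UTF-8 is an assumption made for the example; for real downloaded data the encoding should be taken from the response (for example from its Content-Type header).

```python
# Hypothetical downloaded bytes; UTF-8 is an assumption made for this sketch.
data = 'caf\u00e9 <a href="http://example.com/">x</a>'.encode('utf-8')

text1 = str(data, 'utf-8')     # str() with an explicit encoding
text2 = data.decode('utf-8')   # equivalent method on the bytes object

print(text1 == text2)          # True
print(type(text1))             # <class 'str'>

# Without an encoding, str() gives the printable representation, not the text.
print(str(b'abc'))             # b'abc'
```

Once the data is decoded, an ordinary r'...' string pattern can be used on it instead of a br'...' bytes pattern.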

For your GoogleQuery class (and the like), you may want to implement the special method named __str__ (see http://docs.python.org/3.1/reference/datamodel.html#basic-customization and http://docs.python.org/3.1/reference/datamodel.html#object.__str__). This method of the object is called by the built-in function str() when the object is passed as its argument. It is also used when you print() the object.
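
As an illustration, here is a cut-down, hypothetical version of the class with a __str__ method. The constructor and the output format are assumptions for this sketch, not part of the original script.

```python
class GoogleQuery:
    """Cut-down stand-in for the class above; it does no network access."""

    def __init__(self, status, reason):
        self.status = status
        self.reason = reason

    def __str__(self):
        # Called by str() and by print().
        return 'GoogleQuery(status={}, reason={})'.format(self.status, self.reason)


c = GoogleQuery(200, 'OK')   # hypothetical values
print(c)                     # GoogleQuery(status=200, reason=OK)
print(str(c))                # same text
```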
Question has a verified solution.
