Uncompress xml gz file

Posted on 2010-01-11
Medium Priority
Last Modified: 2012-05-08
Hi experts,

I work on Linux. Here is a quick question.

Please refer to the attached code.  The first three lines of scripts downloaded a compress xml file. Its name is something like "myFile.US.gz".  There is no "xml" suffix in this name. I uncompressed this file with "gunzip", then parsed it. However, soup.findAll('keyword') just retrieved "[]".  Where is wrong?

I need to retrieve following information from the xml file.


<Keyword mincpv="0.012000" maxcpv="0.022000 type="BOTH">loan%20home</Keyword>
<Keyword mincpv="0.012000" maxcpv="0.022000 type="QS">canada</Keyword>
<Keyword mincpv="0.012000" maxcpv="0.020200 type="BOTH">poker</Keyword>
<Keyword mincpv="0.01000" maxcpv="0.02000 type="SEKW" >morgage</Keyword>

Thanks for your ideas.
query_url = 'http://.......'
    req = urllib2.Request(query_url)
    xml_gz = urllib2.urlopen(req).read()

    command      = ('gunzip %s') % xml_gz

    soup        = BeautifulStoneSoup(xml_gz)
    keywords      = soup.findAll('Keyword')

Open in new window

Question by:davidw88
  • 2
LVL 29

Expert Comment

ID: 26291728
If xml_gz contains the name of the .gz file, you have to pass the result of unzipping the file to the BeautifulStoneSoup() parser, not the .gz file.  In other words, you do not use the result of gunzip at all.

Author Comment

ID: 26295667
Thanks prpr. I think you are right.  Now I changed my script however



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument must have 'read' attribute

xml_file is not a file. It is a string therefore it gave an error.

any ideas to fix this?

req = urllib2.Request(query_url)
xml_file = urllib2.urlopen(req).read()

p = xml.parsers.expat.ParserCreate()

Open in new window

LVL 29

Accepted Solution

pepr earned 500 total points
ID: 26301405
The urllib2.urlopen() returns file-like object.  You call its method .read() that reads the content into a string variable.  The xml parser method .ParseFile() expects the file-like object (that supports .read(n) method); however, you pass the string variable.

There are basically two ways to fix it.  Or remove the .read() method from the end of the line 2 to get file-like object instead of string, or use a different parser method at line 6 that expects a strin:

p.Parse(s)    # see http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.Parse

You probably should also close the file-like object returned by urllib2.urlopen().

Featured Post

Become an Android App Developer

Ready to kick start your career in 2018? Learn how to build an Android app in January’s Course of the Month and open the door to new opportunities.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Here I am using Python IDLE(GUI) to write a simple program and save it, so that we can just execute it in future. Because when we write any program and exit from Python then program that we have written will be lost. So for not losing our program we…
Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
Learn the basics of lists in Python. Lists, as their name suggests, are a means for ordering and storing values. : Lists are declared using brackets; for example: t = [1, 2, 3]: Lists may contain a mix of data types; for example: t = ['string', 1, T…
Learn the basics of if, else, and elif statements in Python 2.7. Use "if" statements to test a specified condition.: The structure of an if statement is as follows: (CODE) Use "else" statements to allow the execution of an alternative, if the …

569 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question