Solved

Uncompress xml gz file

Posted on 2010-01-11
Last Modified: 2012-05-08
Hi experts,

I work on Linux. Here is a quick question.

Please refer to the attached code. The first three lines of the script download a compressed XML file. Its name is something like "myFile.US.gz"; there is no "xml" suffix in the name. I uncompressed this file with "gunzip" and then parsed it. However, soup.findAll('Keyword') just returned "[]". What is wrong?

I need to retrieve the following information from the XML file.

<KeywordSplitters>%80|%3A</KeywordSplitters>

<Keywords>
<Keyword mincpv="0.012000" maxcpv="0.022000" type="BOTH">loan%20home</Keyword>
<Keyword mincpv="0.012000" maxcpv="0.022000" type="QS">canada</Keyword>
<Keyword mincpv="0.012000" maxcpv="0.020200" type="BOTH">poker</Keyword>
...
<Keyword mincpv="0.01000" maxcpv="0.02000" type="SEKW">morgage</Keyword>
</Keywords>
....

Thanks for your ideas.
query_url = 'http://.......'
req = urllib2.Request(query_url)
xml_gz = urllib2.urlopen(req).read()

command = ('gunzip %s') % xml_gz
os.system(command)

soup = BeautifulStoneSoup(xml_gz)
keywords = soup.findAll('Keyword')

Question by:davidw88

Expert Comment

by:pepr
ID: 26291728
If xml_gz contains the name of the .gz file, you have to pass the result of unzipping the file to the BeautifulStoneSoup() parser, not the .gz file.  In other words, you do not use the result of gunzip at all.
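A minimal sketch of this point, assuming Python 3's standard library rather than the Python 2 urllib2/BeautifulStoneSoup setup in the question. The downloaded bytes are decompressed in memory with gzip.decompress(), and the decompressed XML, not the compressed data, is handed to the parser. The sample document, its Feed root element, and the helper name extract_keywords are made up for illustration; xml.etree.ElementTree stands in for BeautifulStoneSoup.

```python
import gzip
import xml.etree.ElementTree as ET


def extract_keywords(gz_bytes):
    """Decompress gzip data in memory and return (text, type) for each <Keyword>."""
    xml_bytes = gzip.decompress(gz_bytes)  # the parser needs this, not gz_bytes
    root = ET.fromstring(xml_bytes)
    return [(kw.text, kw.get("type")) for kw in root.iter("Keyword")]


# Hypothetical sample standing in for the bytes returned by urlopen(...).read().
sample_xml = (
    b'<Feed><Keywords>'
    b'<Keyword mincpv="0.012000" maxcpv="0.022000" type="BOTH">loan%20home</Keyword>'
    b'<Keyword mincpv="0.012000" maxcpv="0.022000" type="QS">canada</Keyword>'
    b'</Keywords></Feed>'
)
gz_bytes = gzip.compress(sample_xml)

keywords = extract_keywords(gz_bytes)
print(keywords)
```

Decompressing in memory avoids the external gunzip call entirely, so no temporary file name is needed.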
 

Author Comment

by:davidw88
ID: 26295667
Thanks, pepr. I think you are right. I changed my script, but now

p.ParseFile(xml_file)

gave

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument must have 'read' attribute

xml_file is not a file. It is a string, which is why the call raises the error.

Any ideas on how to fix this?

Thanks.
req = urllib2.Request(query_url)
xml_file = urllib2.urlopen(req).read()


p = xml.parsers.expat.ParserCreate()
p.ParseFile(xml_file)


Accepted Solution

by:
pepr earned 500 total points
ID: 26301405
urllib2.urlopen() returns a file-like object. You call its .read() method, which reads the content into a string variable. The XML parser method .ParseFile() expects a file-like object (one that supports the .read(n) method); however, you pass it the string variable.

There are basically two ways to fix it. Either remove the .read() call from the end of line 2 to get a file-like object instead of a string, or use a different parser method at line 6 that expects a string:

p.Parse(s)    # see http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.Parse

You probably should also close the file-like object returned by urllib2.urlopen().
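The two fixes can be sketched side by side. This is only an illustration under stated assumptions: it uses Python 3, the helper name collect_keywords and the sample data are hypothetical, and io.BytesIO stands in for the file-like object that urlopen() would return so that the example runs without a network connection.

```python
import io
import xml.parsers.expat


def collect_keywords(feed):
    """Parse XML given either as bytes/str or as a file-like object."""
    found = []
    state = {"in_keyword": False}

    def start(name, attrs):
        if name == "Keyword":
            state["in_keyword"] = True
            found.append("")

    def end(name):
        if name == "Keyword":
            state["in_keyword"] = False

    def chars(data):
        # Character data may arrive in several chunks, so accumulate it.
        if state["in_keyword"]:
            found[-1] += data

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = chars

    if hasattr(feed, "read"):
        p.ParseFile(feed)    # file-like object: do not call .read() first
    else:
        p.Parse(feed, True)  # data already in memory; True marks the final chunk
    return found


data = b"<Keywords><Keyword>poker</Keyword><Keyword>canada</Keyword></Keywords>"

from_string = collect_keywords(data)
with io.BytesIO(data) as fh:  # stand-in for the response; closed on leaving the block
    from_file = collect_keywords(fh)
print(from_string, from_file)
```

The with block also covers the last remark: the file-like object is closed as soon as parsing is done.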