Solved

Parse local HTML file with Python and BeautifulSoup

Posted on 2010-08-30
2,745 Views
Last Modified: 2012-05-10
Hello,

I am trying to extract some data from an HTML file using Python with BeautifulSoup; the ultimate aim is to extract the data into a CSV/Excel file.
The data that I want to extract is in the following format:

<h1 class="title">Title Information</h1>

<div id="letter">
<div id="a">
      <h2>A Info</h2>
      <div>
      Start Info,<br>
More Info
      </div>
      <div>
      Some More Info
      </div>
      <div>Even More info</div>
</div>
<div id="b">
      <h2>B Info</h2>
      <p>
      B1:
            B1 Info
      <br>
      B2:
            B2 Info
      <br>
      B3E:
            <a href="mailto:<a href='mailto:some@one.com'>some@one.com</a></a>
      <br>
      B3W:
            <a href="http://www.one.com">http://www.one.com</a>
      </p>
</div>
<div id="c">
      <h2>C Info</h2>
      <dl id="C DIV">
            <dt><a href='c.pl?u=1'>1st Info</a></dt>
            <dd
            >
            &nbsp;</dd>
            <dt><a href='c.pl?u=2'>2nd Info</a></dt>
            <dd
            >
            &nbsp;</dd>
      </dl>
</div>
</div>

Breaking down the elements I want to extract and how I want them output to Excel:

Title:

The Title – Column A

div a:

Start Info, Some More Info, Even More Info – Column B

div b:

B1 Info – Column C
B2 Info – Column D
B3E – Column E
B3W – Column F

div c:

1st Info – Column G
2nd Info – Column H

The code I have at the moment is:

import urllib

file = urllib.urlopen("file:///c:/x/y/z/test.htm")
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(file)

title = soup.title
g_a = soup.findAll("div", {"id" : "a"})

for row in g_a:
      cat = row.find('div')

g_b = soup.findAll("div", {"id" : "b"})

for row in g_b:
      con = row.find('p')

g_c = soup.findAll("div", {"id" : "c"})

for row in g_c:
    pers = row.findAll('a', href=True)


print title, cat, con, pers

My problem is that I cannot split out and concatenate all the information from div a; I cannot parse the information between the <br> tags in div b; and I cannot extract the text inside the links in div c. If you could also show me how to output it into the Excel columns mentioned above, it would be much appreciated.
Question by:newbey

16 Comments
 
LVL 28

Expert Comment

by:pepr
ID: 33564670
Does one file contain only one set of such information?

As a side note, you should not use 'file' as an identifier: in Python 2.x it is the built-in class of file objects, and rebinding it shadows the built-in.
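
For illustration, a minimal sketch of the shadowing problem (the file name is just the test.html from this question):

# Python 2.x: 'file' names the built-in file type, so rebinding it shadows it.
file = open('test.html')          # 'file' now refers to an open file object
# file('test.html')               # would now fail: 'file' is no longer the type
file.close()

f = open('test.html')             # a neutral name such as 'f' avoids the clash
f.close()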
 
LVL 28

Expert Comment

by:pepr
ID: 33564798
Here is the first fragment that shows how to extract the title and the wanted information from inside the div a element.  No formatting into the columns yet.  The test.html was stored locally and opened as a file (no urllib).  It prints this on my screen:

c:\tmp\___python\newbey\Q_26439956>a.py
[<h1 class="title">Title Information</h1>]
Title Information
[u'\n      Start Info,', <br />, u'\nMore Info\n      ']
[u'\n      Some More Info\n      ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
from BeautifulSoup import BeautifulSoup

f = open("test.html")      # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

titleLst = soup.findAll("h1", {"class": "title"})
print titleLst             # the list with a single element

title = titleLst[0].contents[0]
print title                # the string from the h1 element

g_a = soup.findAll("div", {"id" : "a"})  # the elements from inside the div a element
alst = []                                # the future result list
for x in g_a:
    for elem in x.findAll('div'):        # the divs inside the div a
        print elem.contents              # just to see what is inside
        alst.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
        
print alst               # wanted result in a technical form            
print ', '.join(alst)    # wanted result as a string (items separated by comma and space)

 
LVL 28

Expert Comment

by:pepr
ID: 33566214
Here is the second version, which extracts all the info into a list of strings.  (The input around mailto: seems corrupted -- not solved here.)  Notice the simplified searching for the information.  The script prints:

c:\tmp\___python\newbey\Q_26439956>b.py
<h1 class="title">Title Information</h1>
Title Information
[u'\n      Start Info,', <br />, u'\nMore Info\n      ']
[u'\n      Some More Info\n      ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
[u'B1 Info', u'B2 Info', u'', u'mailto:some@one.com', u'\n', u'', u'http://www.one.com', u'\n']
[u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com']
<a href="c.pl?u=1">1st Info</a>
<a href="c.pl?u=2">2nd Info</a>
[u'Title Information', u'Start Info, Some More Info, Even More info', u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com', u'1st Info', u'2nd Info']

from BeautifulSoup import BeautifulSoup
import re

f = open('test.html')      # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

h1_title = soup.find('h1', {'class': 'title'})
print h1_title             # the header element

title = h1_title.contents[0]
print title                # the string from the h1 element

rowLst = [ title ]
div_a = soup.find('div', {'id' : 'a'})   # the div a element
lst_a = []                               # the future result list
for elem in div_a.findAll('div'):        # the divs inside the div a
    print elem.contents                  # just to see what is inside
    lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
        
print lst_a              # wanted result in a technical form
cat = ', '.join(lst_a)   # wanted result as a string (items separated by comma and space)
print cat                # (I do not know what 'cat' means for you.)
rowLst.append(cat)

div_b = soup.find('div', {'id' : 'b'})
rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
p = div_b.find('p')      # The p element from div b
lst_b = []
for x in p.contents:
    if getattr(x, 'name', None) == 'br':
        continue                          # ignore the br elements
    elif getattr(x, 'name', None) == 'a':
        lst_b.append(x['href'])           # href attribute here (not the enclosed string)
    else:    
        lst_b.append(rexClean.sub(r'\2', x))  # only the info without whitespaces
print lst_b    

# Remove the empty strings and newline strings, leave out the 'mailto:'.
lst_b2 = []
for s in lst_b:
    if s == u'' or s == u'\n':
        continue
    else:
        lst_b2.append(s.replace('mailto:', ''))
print lst_b2    

# Append all elements from lst_b2 to the column list.
rowLst.extend(lst_b2)

# Extraction from the div c.
div_c = soup.find('div', {'id' : 'c'})
cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
print cu1
rowLst.append(cu1.contents[0])

cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
print cu2
rowLst.append(cu2.contents[0])

# Show the rowLst -- this should be the Excel line in future.
print rowLst

 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 33566315
And the last snippet writes the extracted info into one line of a csv file.
from BeautifulSoup import BeautifulSoup
import csv
import re

f = open('test.html')      # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

# Extract the title.
h1_title = soup.find('h1', {'class': 'title'})  # the header element
title = h1_title.contents[0]             # the string from the h1 element

rowLst = [ title ]                       # first column in the result list

# Extract the div a info as one string.
div_a = soup.find('div', {'id' : 'a'})   # the div a element
lst_a = []                               # the future result list
for elem in div_a.findAll('div'):        # the divs inside the div a
    lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
cat = ', '.join(lst_a)   # wanted result as a string (items separated by comma and space)
rowLst.append(cat)

# Extract the div b info as several strings.
div_b = soup.find('div', {'id' : 'b'})
rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
p = div_b.find('p')      # The p element from div b
lst_b = []
for x in p.contents:
    if getattr(x, 'name', None) == 'br':
        continue                          # ignore the br elements
    elif getattr(x, 'name', None) == 'a':
        lst_b.append(x['href'])           # href attribute here (not the enclosed string)
    else:    
        lst_b.append(rexClean.sub(r'\2', x))  # only the info without whitespaces

# Remove the empty strings and newline strings, leave out the 'mailto:'.
lst_b2 = []
for s in lst_b:
    if s == u'' or s == u'\n':
        continue
    else:
        lst_b2.append(s.replace('mailto:', ''))

# Append all elements from lst_b2 to the column list.
rowLst.extend(lst_b2)

# Extraction from the div c.
div_c = soup.find('div', {'id' : 'c'})
cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
rowLst.append(cu1.contents[0])

cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
rowLst.append(cu2.contents[0])

# Save the result as a csv file for Excel.  Beware of Excel: the dialect
# may not be enough for your purpose.  Some experimentation may be necessary.
f = open('result.csv', 'wb')              # must be in binary mode
writer = csv.writer(f, dialect='excel')   # csv writer wrap around the f
writer.writerow(rowLst)
f.close()

 

Author Comment

by:newbey
ID: 33571713
Thanks Pepr, fantastic!

I am trying this now; I think there may be one tweak required, but I will post back.
 

Author Closing Comment

by:newbey
ID: 33650664
Thanks for your help, just what I was looking for.

Not sure if this is easy or not, but if I had more than one file, sequential in number, i.e. test1.html, test2.html, test3.html, what would be the easiest way to loop through these files (using the code above) and save them to Excel?
 
LVL 28

Expert Comment

by:pepr
ID: 33654896
This is a logical next step.  See the code below.  The code for extracting the data from one file became the body of the new function (just indent all the related lines one level to the right -- a good editor can do that for the selected lines with TAB or something like Alt+right arrow).

The main body now contains a new for-loop that calls the function with the filenames ;)

However, this does not solve how you get the HTML files into your local-disk directory.

from BeautifulSoup import BeautifulSoup
import csv
import glob
import re

def getRowFromFilename(fname):
    f = open(fname)              # simplified for the example (no urllib)
    soup = BeautifulSoup(f)
    f.close()

    # Extract the title.
    h1_title = soup.find('h1', {'class': 'title'})  # the header element
    title = h1_title.contents[0]             # the string from the h1 element

    rowLst = [ title ]                       # first column in the result list

    # Extract the div a info as one string.
    div_a = soup.find('div', {'id' : 'a'})   # the div a element
    lst_a = []                               # the future result list
    for elem in div_a.findAll('div'):        # the divs inside the div a
        lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
    cat = ', '.join(lst_a)                   # wanted result as a string
    rowLst.append(cat)

    # Extract the div b info as several strings.
    div_b = soup.find('div', {'id' : 'b'})
    rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
    p = div_b.find('p')      # The p element from div b
    lst_b = []
    for x in p.contents:
        if getattr(x, 'name', None) == 'br':
            continue                          # ignore the br elements
        elif getattr(x, 'name', None) == 'a':
            lst_b.append(x['href'])           # href attribute here
        else:    
            lst_b.append(rexClean.sub(r'\2', x))  # only the info without whitespaces

    # Remove the empty strings and newline strings, leave out the 'mailto:'.
    lst_b2 = []
    for s in lst_b:
        if s == u'' or s == u'\n':
            continue
        else:
            lst_b2.append(s.replace('mailto:', ''))

    # Append all elements from lst_b2 to the column list.
    rowLst.extend(lst_b2)

    # Extraction from the div c.
    div_c = soup.find('div', {'id' : 'c'})
    cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
    rowLst.append(cu1.contents[0])

    cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
    rowLst.append(cu2.contents[0])
    
    # Return the resulting row.
    return rowLst


#-------------------------------------------------------------------------
# Get the sequence of the processed filenames, extract the row from each
# of them, and write it to the csv file for Excel.  Beware of Excel: the dialect
# may not be enough for your purpose.  Some experimentation may be necessary.

f = open('result.csv', 'wb')                  # must be in binary mode
writer = csv.writer(f, dialect='excel')       # csv writer wrap around the f

for fname in sorted(glob.glob('test*.html')): # can be modified to fit your needs
    print fname                               # just to see what is processed
    rowLst = getRowFromFilename(fname)        # extract the info from one file
    writer.writerow(rowLst)                   # write it to one row of the result

f.close()            # do not forget to close the result file
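
One caveat about sorted() here: it orders the names lexicographically, so test10.html would sort before test2.html.  If the numeric order ever matters, a numeric sort key helps; a small sketch (the numericKey name is mine, assuming the test<N>.html naming):

import re

def numericKey(fname):
    # Pull the first run of digits out of names like 'test12.html'.
    m = re.search(r'(\d+)', fname)
    return int(m.group(1)) if m else 0

# for fname in sorted(glob.glob('test*.html'), key=numericKey):
#     ...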

 

Author Comment

by:newbey
ID: 33666680
Thanks again!

Before, I was getting the extended error message, but now it works perfectly. Thanks again, really helpful.

 

Author Comment

by:newbey
ID: 33666736
Also, if I ever wanted to go directly to the web rather than parsing local files, how could I do this? I would still like to be able to loop, e.g. www.test.com/100.html, www.test.com/101.html, ...
 
LVL 28

Expert Comment

by:pepr
ID: 33675435
urllib is OK for the purpose.  The question is how to get the list of the HTML files.  That is a kind of active task on the server side; because of that, it should be quite easy if the site supports, say, PHP or ActiveX.  However, I am not good at that, so I cannot give you an exact answer.
 

Author Comment

by:newbey
ID: 33679951
The other HTML files do follow a logical structure like I mentioned above, so I already know the names. I was thinking of something along the lines of:

for x in range(100, 120):
    url = 'www.test.com/' + str(x) + '.html'
    f = urllib.urlopen(url)
    # ...run the code

Would this work, or is there a better way?
 

Author Comment

by:newbey
ID: 33764332
Sorry to ask another question on this, but I have noticed that on some of the files there are no contact details, and the code falls over because of it:

File "C:\Python27\LoopFilesHTML", line 61, in getRowFromFilename
    for x in cu1.contents:
AttributeError: 'NoneType' object has no attribute 'contents'

Can you let me know how to skip that line if there are no contact details in the file?
 
LVL 28

Expert Comment

by:pepr
ID: 33767848
For the earlier question on generating the URLs, try the following.  Read the documentation about string interpolation (http://docs.python.org/library/stdtypes.html?highlight=string%20interpolation#string-formatting-operations):
for n in range(1, 122):
    url = 'http://www.test.com/%d.html' % n
    print url

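
To actually fetch and parse each generated page, a minimal sketch combining this loop with urllib and BeautifulSoup (the host and range are the placeholders from the question; getRowFromFilename would need a small change to accept the downloaded text instead of a filename):

import urllib
from BeautifulSoup import BeautifulSoup

for n in range(100, 120):
    url = 'http://www.test.com/%d.html' % n
    f = urllib.urlopen(url)          # file-like object over the HTTP response
    soup = BeautifulSoup(f.read())   # parse the downloaded HTML
    f.close()
    # ...run the same extraction steps on this soup and write the csv row...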
 
LVL 28

Expert Comment

by:pepr
ID: 33767901
I cannot see it directly, and your script is already a bit modified.  However, it is probably because

    cu1 = div_c.find('a', {'href': 'c.pl?u=1'})

did not find anything and therefore set cu1 to None.  The last script shown used

    rowLst.append(cu1.contents[0])

which would fail for exactly the same reason, because a None object has no .contents attribute.  In such a case you can simply avoid the error: do not access the contents attribute when there is nothing there.  You only have to decide what should be appended to rowLst instead.  Try:

if cu1 is not None:
    rowLst.append(cu1.contents[0])
else:
    rowLst.append('')             # i.e. empty string
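
The same guard applies to cu2.  If the pattern repeats, a small helper keeps the extraction tidy; a sketch (the appendContentsOrEmpty name is mine, not from the script above):

def appendContentsOrEmpty(rowLst, tag):
    # Append the tag's first child, or an empty string when the tag is missing.
    if tag is not None:
        rowLst.append(tag.contents[0])
    else:
        rowLst.append('')

appendContentsOrEmpty(rowLst, cu1)
appendContentsOrEmpty(rowLst, cu2)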

 
