newbey

asked on

Parse local HTML file with Python and BeautifulSoup

Hello,

I am trying to extract some data from an HTML file using Python with BeautifulSoup; the ultimate aim is to extract the data into a CSV/Excel file.
The data that I want to extract is in the following format:

<h1 class="title">Title Information</h1>

<div id="letter">
<div id="a">
      <h2>A Info</h2>
      <div>
      Start Info,<br>
More Info
      </div>
      <div>
      Some More Info
      </div>
      <div>Even More info</div>
</div>
<div id="b">
      <h2>B Info</h2>
      <p>
      B1:
            B1 Info
      <br>
      B2:
            B2 Info
      <br>
      B3E:
            <a href="mailto:<a href='mailto:some@one.com'>some@one.com</a></a>
      <br>
      B3W:
            <a href="http://www.one.com">http://www.one.com</a>
      </p>
</div>
<div id="c">
      <h2>C Info</h2>
      <dl id="C DIV">
            <dt><a href='c.pl?u=1'>1st Info</a></dt>
            <dd
            >
            &nbsp;</dd>
            <dt><a href='c.pl?u=2'>2nd Info</a></dt>
            <dd
            >
            &nbsp;</dd>
      </dl>
</div>
</div>

Breaking down the elements I want to extract and how I want them output to Excel:

Title:

The Title – Column A

div a:

Start Info, Some More Info, Even More Info – Column B

div b:

B1 Info – Column C
B2 Info – Column D
B3E – Column E
B3W – Column F

div c:

1st Info – Column G
2nd Info – Column H

The code I have at the moment is:

import urllib

file = urllib.urlopen("file:///c:/x/y/z/test.htm")
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(file)

title = soup.title
g_a = soup.findAll("div", {"id" : "a"})

for row in g_a:
      cat = row.find('div')

g_b = soup.findAll("div", {"id" : "b"})

for row in g_b:
      con = row.find('p')

g_c = soup.findAll("div", {"id" : "c"})

for row in g_c:
    pers = row.findAll('a', href=True)


print title, cat, con, pers

My problem is that I cannot split out and concatenate all the information from div a; I cannot parse the information between the <br> tags in div b; and I cannot extract the text inside the links in div c. If you could also show me how to output it into the Excel columns mentioned above, it would be much appreciated.
pepr

Does one file contain only one set of such information?

A side note: you should not freely use the 'file' identifier, as it is the class of file objects in Python 2.x.
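For illustration (a minimal sketch, not from the original question), rebinding the name hides the built-in type:

file = open('test.html')     # rebinding 'file' shadows the built-in file type
f2 = file('other.html')      # TypeError: 'file' object is not callable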
Here is the first fragment that shows how to extract the title and the wanted information from inside the div a element.  No formatting to the columns yet.  The test.html was stored locally and opened as a file (no urllib).  It prints on my screen:

c:\tmp\___python\newbey\Q_26439956>a.py
[<h1 class="title">Title Information</h1>]
Title Information
[u'\n      Start Info,', <br />, u'\nMore Info\n      ']
[u'\n      Some More Info\n      ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
from BeautifulSoup import BeautifulSoup

f = open("test.html")      # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

titleLst = soup.findAll("h1", {"class": "title"})
print titleLst             # the list with a single element

title = titleLst[0].contents[0]
print title                # the string from the h1 element

g_a = soup.findAll("div", {"id" : "a"})  # the elements from inside the div a element
alst = []                                # the future result list
for x in g_a:
    for elem in x.findAll('div'):        # the divs inside the div a
        print elem.contents              # just to see what is inside
        alst.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
        
print alst               # wanted result in a technical form            
print ', '.join(alst)    # wanted result as a string (items separated by comma and space)


Here is the second version, which extracts all the info into a list of strings.  (The input around mailto: seems corrupted -- not solved here.)  Notice the simplified searching for the information.  The script prints:

c:\tmp\___python\newbey\Q_26439956>b.py
<h1 class="title">Title Information</h1>
Title Information
[u'\n      Start Info,', <br />, u'\nMore Info\n      ']
[u'\n      Some More Info\n      ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
[u'B1 Info', u'B2 Info', u'', u'mailto:some@one.com', u'\n', u'', u'http://www.one.com', u'\n']
[u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com']
<a href="c.pl?u=1">1st Info</a>
<a href="c.pl?u=2">2nd Info</a>
[u'Title Information', u'Start Info, Some More Info, Even More info', u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com', u'1st Info', u'2nd Info']

from BeautifulSoup import BeautifulSoup
import re

f = open('test.html')      # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

h1_title = soup.find('h1', {'class': 'title'})
print h1_title             # the header element

title = h1_title.contents[0]
print title                # the string from the h1 element

rowLst = [ title ]
div_a = soup.find('div', {'id' : 'a'})   # the div a element
lst_a = []                               # the future result list
for elem in div_a.findAll('div'):        # the divs inside the div a
    print elem.contents                  # just to see what is inside
    lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
        
print lst_a              # wanted result in a technical form
cat = ', '.join(lst_a)   # wanted result as a string (items separated by comma and space)
print cat                # (I do not know what meaning has 'cat' for you.)
rowLst.append(cat)

div_b = soup.find('div', {'id' : 'b'})
rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
p = div_b.find('p')      # The p element from div b
lst_b = []
for x in p.contents:
    if getattr(x, 'name', None) == 'br':
        continue                          # ignore the br elements
    elif getattr(x, 'name', None) == 'a':
        lst_b.append(x['href'])           # href attribute here (not the enclosed string)
    else:    
        lst_b.append(rexClean.sub(r'\2', x))  # only the info without whitespaces
print lst_b    

# Remove the empty strings and newline strings, leave out the 'mailto:'.
lst_b2 = []
for s in lst_b:
    if s == u'' or s == u'\n':
        continue
    else:
        lst_b2.append(s.replace('mailto:', ''))
print lst_b2    

# Append all elements from lst_b2 to the column list.
rowLst.extend(lst_b2)

# Extraction from the div c.
div_c = soup.find('div', {'id' : 'c'})
cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
print cu1
rowLst.append(cu1.contents[0])

cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
print cu2
rowLst.append(cu2.contents[0])

# Show the rowLst -- this should be the Excel line in future.
print rowLst

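If the rexClean pattern looks opaque, here is roughly what one substitution does to the text node preceding the first <br> (a minimal sketch; the whitespace mirrors the sample input above):

import re

rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
raw = u'\n      B1:\n            B1 Info\n      '
print rexClean.sub(r'\2', raw)     # group 2 is the named 'text' group; prints: B1 Info

Strings that never match the pattern (like the bare u'\n' between the links) pass through unchanged, which is why the follow-up loop still has to drop the empty and newline-only strings.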

ASKER CERTIFIED SOLUTION from pepr
newbey (ASKER)

Thanks Pepr, fantastic!

I am trying this now; I think there may be one tweak required, but I will post back.
newbey (ASKER)

Thanks for your help, just what I was looking for.

Not sure if this is easy or not, but if I had more than one file, sequential in number, i.e. test1.html, test2.html, test3.html, what would be the easiest way to loop through these files (using the code above) and save the results to Excel?
This is a logical step ahead.  See the code below.  The code for extraction of the data from one file became the body of the new function.  (Just indent all the related lines one level left -- a good editor can do that with the selected lines by one TAB or something like Alt+right arrow.)

The main body of the script now contains the new for-loop that calls the function with the file names ;)

However, this does not solve how you get the HTML files to your local-disk directory.
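If the pages are reachable over HTTP and the URLs are known in advance, one hedged way to get them to the local disk is urllib.urlretrieve (the host and file names below are placeholders):

import urllib

# Save each remote page straight to a local file (Python 2 urllib).
for n in range(1, 4):
    urllib.urlretrieve('http://www.test.com/test%d.html' % n, 'test%d.html' % n)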

from BeautifulSoup import BeautifulSoup
import csv
import glob
import re

def getRowFromFilename(fname):
    f = open(fname)              # simplified for the example (no urllib)
    soup = BeautifulSoup(f)
    f.close()

    # Extract the title.
    h1_title = soup.find('h1', {'class': 'title'})  # the header element
    title = h1_title.contents[0]             # the string from the h1 element

    rowLst = [ title ]                       # first column in the result list

    # Extract the div a info as one string.
    div_a = soup.find('div', {'id' : 'a'})   # the div a element
    lst_a = []                               # the future result list
    for elem in div_a.findAll('div'):        # the divs inside the div a
        lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
    cat = ', '.join(lst_a)                   # wanted result as a string
    rowLst.append(cat)

    # Extract the div b info as several strings.
    div_b = soup.find('div', {'id' : 'b'})
    rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
    p = div_b.find('p')      # The p element from div b
    lst_b = []
    for x in p.contents:
        if getattr(x, 'name', None) == 'br':
            continue                          # ignore the br elements
        elif getattr(x, 'name', None) == 'a':
            lst_b.append(x['href'])           # href attribute here
        else:    
            lst_b.append(rexClean.sub(r'\2', x))  # only the info without whitespaces

    # Remove the empty strings and newline strings, leave out the 'mailto:'.
    lst_b2 = []
    for s in lst_b:
        if s == u'' or s == u'\n':
            continue
        else:
            lst_b2.append(s.replace('mailto:', ''))

    # Append all elements from lst_b2 to the column list.
    rowLst.extend(lst_b2)

    # Extraction from the div c.
    div_c = soup.find('div', {'id' : 'c'})
    cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
    rowLst.append(cu1.contents[0])

    cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
    rowLst.append(cu2.contents[0])
    
    # Return the resulting row.
    return rowLst


#-------------------------------------------------------------------------
# Get the sequence of the processed filenames, extract the row from each
# of them, and write it to the CSV file for Excel.  Beware: the 'excel'
# dialect may not be enough for your purpose; some experiments may be necessary.

f = open('result.csv', 'wb')                  # must be in binary mode
writer = csv.writer(f, dialect='excel')       # csv writer wrap around the f

for fname in sorted(glob.glob('test*.html')): # can be modified to fit your needs
    print fname                               # just to see what is processed
    rowLst = getRowFromFilename(fname)        # extract the info from one file
    writer.writerow(rowLst)                   # write it to one row of the result

f.close()            # do not forget to close the result file

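If Excel does not open the result cleanly (this depends on the locale), the writer parameters can be overridden instead of naming a dialect; a minimal sketch using standard csv module options (the semicolon delimiter is just an example, not a recommendation):

import csv

f = open('result.csv', 'wb')
# Some localized Excel versions expect a semicolon as the delimiter.
writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
writer.writerow([u'Title Information', u'Start Info, Some More Info, Even More info'])
f.close()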

Well, the left hand is the one where the thumb is on the right side.  The above should say "indent... one level to the right".  ;)  Also, the return statement was added to the end of the body.
newbey (ASKER)

Thanks again!

Before, I was getting the extended error message, but now it works perfectly. Thanks again, really helpful.

newbey (ASKER)

Also, if I ever wanted to go directly to the web rather than parsing local files, what way could I do this? I would still like to be able to loop like www.test.com/100.html, www.test.com/101.html, ...
The urllib is OK for the purpose.  The question is how to get the list of the HTML files.  This is a kind of active task on the server side.  Because of that, it should be quite easy if the site supports, say, PHP or ActiveX.  However, I am not good at it, so I cannot give you the exact answer.
newbey (ASKER)

The other HTML files do follow a logical structure, like I mentioned above, so I already know the names. I was thinking something along the lines of:

For x in range (100,120)
        url =  'www.test.com' + 'x' + 'html'
      f = urllib.urlopen(url)
      .......run the code

Would this work or is there a better way?
newbey (ASKER)

Sorry to ask another question on this, but I have noticed that on some of the files there are no contact details, and the code falls over because of it:

File "C:\Python27\LoopFilesHTML", line 61, in getRowFromFilename
    for x in cu1.contents:
AttributeError: 'NoneType' object has no attribute 'contents'

Can you let me know how to skip the line if there are no contact details in the file?
For the earlier question on generating the URLs, try the following.  Read the documentation about string interpolation (http://docs.python.org/library/stdtypes.html?highlight=string%20interpolation#string-formatting-operations):
for n in range(1, 122):
    url = 'http://www.test.com/%d.html' % n
    print url

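To actually run the extraction over those pages, one sketch (assuming the parsing logic from getRowFromFilename is reused on a soup built from the downloaded page; the host is the placeholder from the question) could look like this:

from BeautifulSoup import BeautifulSoup
import urllib

for n in range(100, 120):
    url = 'http://www.test.com/%d.html' % n
    f = urllib.urlopen(url)        # urlopen returns a file-like object
    soup = BeautifulSoup(f.read())
    f.close()
    # ... extract rowLst from soup exactly as in getRowFromFilename,
    # then write it with the csv writer as before ...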

As for the AttributeError: I cannot see it directly, and your script is a bit modified already.  However, it is probably because the

    cu1 = div_c.find('a', {'href': 'c.pl?u=1'})

did not find anything and therefore returned None into cu1.  The last shown script used

    rowLst.append(cu1.contents[0])

which would fail for exactly the same reason, because the None object has no .contents attribute.  In such a case, you can simply avoid the error: do not access the contents attribute if there is no such element.  You only have to decide what will be appended to the rowLst instead.  Try:

    if cu1 is not None:
        rowLst.append(cu1.contents[0])
    else:
        rowLst.append('')             # i.e. empty string


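Since cu2 needs exactly the same guard, a small helper keeps it readable (a sketch; safe_first is a made-up name, not part of BeautifulSoup):

def safe_first(tag, default=u''):
    # Hypothetical helper: return the first child of a found tag, or the
    # default when find() returned None or the tag has no children.
    if tag is not None and tag.contents:
        return tag.contents[0]
    return default

rowLst.append(safe_first(div_c.find('a', {'href': 'c.pl?u=1'})))
rowLst.append(safe_first(div_c.find('a', {'href': 'c.pl?u=2'})))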