newbey
asked on
Parse local html file with python and beautifulsoup
Hello,
I am trying to extract some data from an HTML file using Python with BeautifulSoup; the ultimate aim is to extract the data into a CSV/Excel file.
The data that I want to extract is in the following format:
<h1 class="title">Title Information</h1>
<div id="letter">
<div id="a">
<h2>A Info</h2>
<div>
Start Info,<br>
More Info
</div>
<div>
Some More Info
</div>
<div>Even More info</div>
</div>
<div id="b">
<h2>B Info</h2>
<p>
B1:
B1 Info
<br>
B2:
B2 Info
<br>
B3E:
<a href="mailto:<a href='mailto:some@one.com' >some@one. com</a></a >
<br>
B3W:
<a href="http://www.one.com">http://www.one.com</a>
</p>
</div>
<div id="c">
<h2>C Info</h2>
<dl id="C DIV">
<dt><a href='c.pl?u=1'>1st Info</a></dt>
<dd
>
</dd>
<dt><a href='c.pl?u=2'>2nd Info</a></dt>
<dd
>
</dd>
</dl>
</div>
</div>
Breaking down the elements I want to extract and how I want them output to Excel:
Title:
The Title – Column A
div a:
Start Info, Some More Info, Even More Info – Column B
div b:
B1 Info – Column C
B2 Info – Column D
B3E – Column E
B3W – Column F
div c:
1st Info – Column G
2nd Info – Column H
The code I have at the moment is:
import urllib
file = urllib.urlopen("file:///c:/x/y/z/test.htm")
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(file)
title = soup.title
g_a = soup.findAll("div", {"id": "a"})
for row in g_a:
    cat = row.find('div')
g_b = soup.findAll("div", {"id": "b"})
for row in g_b:
    con = row.find('p')
g_c = soup.findAll("div", {"id": "c"})
for row in g_c:
    pers = row.findAll('a', href=True)
print title, cat, con, pers
My problem is that I cannot split out and concatenate all the information from div a; I cannot parse the information between the <br> tags in div b; and I cannot extract the info between the links in div c. If you could also show me how to output it into the Excel columns mentioned above, it would be much appreciated.
Here is the first fragment that shows how to extract the title and the wanted information from inside the div a element. No formatting to the columns yet. The test.html was stored locally and opened as a file (no urllib). It prints on my screen:
c:\tmp\___python\newbey\Q_26439956>a.py
[<h1 class="title">Title Information</h1>]
Title Information
[u'\n Start Info,', <br />, u'\nMore Info\n ']
[u'\n Some More Info\n ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
from BeautifulSoup import BeautifulSoup

f = open("test.html")                     # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

titleLst = soup.findAll("h1", {"class": "title"})
print titleLst                            # the list with the single h1 element
title = titleLst[0].contents[0]
print title                               # the string from the h1 element

g_a = soup.findAll("div", {"id": "a"})    # the elements from inside the div a element
alst = []                                 # the future result list
for x in g_a:
    for elem in x.findAll('div'):         # the divs inside the div a
        print elem.contents               # just to see what is inside
        alst.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
print alst                                # wanted result in a technical form
print ', '.join(alst)                     # wanted result as a string (items separated by comma and space)
Here is the second version, which extracts all the info into a list of strings. (The input around mailto: seems corrupted -- not solved here.) Notice the simplified searching for the information. The script prints:
c:\tmp\___python\newbey\Q_26439956>b.py
<h1 class="title">Title Information</h1>
Title Information
[u'\n Start Info,', <br />, u'\nMore Info\n ']
[u'\n Some More Info\n ']
[u'Even More info']
[u'Start Info', u'Some More Info', u'Even More info']
Start Info, Some More Info, Even More info
[u'B1 Info', u'B2 Info', u'', u'mailto:some@one.com', u'\n', u'', u'http://www.one.com', u'\n']
[u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com']
<a href="c.pl?u=1">1st Info</a>
<a href="c.pl?u=2">2nd Info</a>
[u'Title Information', u'Start Info, Some More Info, Even More info', u'B1 Info', u'B2 Info', u'some@one.com', u'http://www.one.com', u'1st Info', u'2nd Info']
from BeautifulSoup import BeautifulSoup
import re

f = open('test.html')                     # simplified for the example (no urllib)
soup = BeautifulSoup(f)
f.close()

h1_title = soup.find('h1', {'class': 'title'})
print h1_title                            # the header element
title = h1_title.contents[0]
print title                               # the string from the h1 element
rowLst = [title]

div_a = soup.find('div', {'id': 'a'})     # the div a element
lst_a = []                                # the future result list
for elem in div_a.findAll('div'):         # the divs inside the div a
    print elem.contents                   # just to see what is inside
    lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
print lst_a                               # wanted result in a technical form
cat = ', '.join(lst_a)                    # wanted result as a string (items separated by comma and space)
print cat                                 # (I do not know what 'cat' means for you.)
rowLst.append(cat)

div_b = soup.find('div', {'id': 'b'})
rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
p = div_b.find('p')                       # the p element from div b
lst_b = []
for x in p.contents:
    if getattr(x, 'name', None) == 'br':
        continue                          # ignore the br elements
    elif getattr(x, 'name', None) == 'a':
        lst_b.append(x['href'])           # href attribute here (not the enclosed string)
    else:
        lst_b.append(rexClean.sub(r'\2', x))  # only the info without the whitespace

print lst_b

# Remove the empty strings and newline strings; drop the 'mailto:' prefix.
lst_b2 = []
for s in lst_b:
    if s == u'' or s == u'\n':
        continue
    lst_b2.append(s.replace('mailto:', ''))
print lst_b2

# Append all elements from lst_b2 to the column list.
rowLst.extend(lst_b2)

# Extraction from the div c.
div_c = soup.find('div', {'id': 'c'})
cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
print cu1
rowLst.append(cu1.contents[0])
cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
print cu2
rowLst.append(cu2.contents[0])

# Show the rowLst -- this should be the Excel line in future.
print rowLst
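In case the rexClean pattern is puzzling: it matches a labelled line such as "B1:" followed by the info, and the substitution keeps only the named text group, dropping the label and surrounding whitespace. A minimal standalone illustration (the input string is modelled on the p contents from the question; runs under Python 2 or 3):

```python
import re

# Same pattern as in the script above: an id label like 'B1:' plus the info text.
rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)

s = '\nB1:\nB1 Info\n'            # a typical chunk between two <br> elements
print(rexClean.sub(r'\2', s))     # group 2 is the 'text' group -> B1 Info

# A chunk with a label but no info collapses to an empty string,
# which is why the script filters u'' afterwards.
print(repr(rexClean.sub(r'\2', '\nB3E:\n')))
```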
ASKER
Thanks Pepr, fantastic!
I am trying this now, I think there may be 1 tweak required but I will post back
ASKER
Thanks for your help, just what I was looking for.
Not sure if this is easy or not but if I had more than 1 file that were sequential in number i.e. test1.html, test2.html, test3.html, what would be the easiest way to loop through these files (using the code above) and saving them to excel?
This is the logical next step; see the code below. The code for extracting the data from one file became the body of a new function. (Just indent all the related lines one level to the left -- a good editor can do that for the selected lines with TAB or something like Alt+right arrow.)
The body now contains the new for-loop that calls the function with the names ;)
However, this does not solve how you get the HTML files to your local-disk directory.
from BeautifulSoup import BeautifulSoup
import csv
import glob
import re

def getRowFromFilename(fname):
    f = open(fname)                       # simplified for the example (no urllib)
    soup = BeautifulSoup(f)
    f.close()
    # Extract the title.
    h1_title = soup.find('h1', {'class': 'title'})  # the header element
    title = h1_title.contents[0]          # the string from the h1 element
    rowLst = [title]                      # first column in the result list
    # Extract the div a info as one string.
    div_a = soup.find('div', {'id': 'a'})  # the div a element
    lst_a = []                            # the future result list
    for elem in div_a.findAll('div'):     # the divs inside the div a
        lst_a.append(elem.contents[0].strip('\n\t ,'))  # collect the wanted info
    cat = ', '.join(lst_a)                # wanted result as a string
    rowLst.append(cat)
    # Extract the div b info as several strings.
    div_b = soup.find('div', {'id': 'b'})
    rexClean = re.compile(r'^\s*(?P<id>\w+:)\s*(?P<text>.*?)\s*$', re.MULTILINE)
    p = div_b.find('p')                   # the p element from div b
    lst_b = []
    for x in p.contents:
        if getattr(x, 'name', None) == 'br':
            continue                      # ignore the br elements
        elif getattr(x, 'name', None) == 'a':
            lst_b.append(x['href'])       # href attribute here
        else:
            lst_b.append(rexClean.sub(r'\2', x))  # only the info without the whitespace
    # Remove the empty strings and newline strings; drop the 'mailto:' prefix.
    lst_b2 = []
    for s in lst_b:
        if s == u'' or s == u'\n':
            continue
        lst_b2.append(s.replace('mailto:', ''))
    # Append all elements from lst_b2 to the column list.
    rowLst.extend(lst_b2)
    # Extraction from the div c.
    div_c = soup.find('div', {'id': 'c'})
    cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
    rowLst.append(cu1.contents[0])
    cu2 = div_c.find('a', {'href': 'c.pl?u=2'})
    rowLst.append(cu2.contents[0])
    # Return the resulting row.
    return rowLst

#-------------------------------------------------------------------------
# Get the sequence of the processed filenames, extract the row from each
# of them, and write it to the csv file for Excel. Beware: the 'excel'
# dialect may not be enough for your purpose; some experiments may be necessary.
f = open('result.csv', 'wb')              # must be opened in binary mode
writer = csv.writer(f, dialect='excel')   # csv writer wrapped around f
for fname in sorted(glob.glob('test*.html')):  # can be modified to fit your needs
    print fname                           # just to see what is processed
    rowLst = getRowFromFilename(fname)    # extract the info from one file
    writer.writerow(rowLst)               # write it to one row of the result
f.close()                                 # do not forget to close the result file
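To see what the 'excel' dialect actually emits, here is a tiny standalone sketch (the demo file name is made up; it runs under Python 2 or 3, though under Python 3 you would normally also pass newline='' to open()):

```python
import csv

# Write one row; a field that contains the delimiter gets quoted automatically.
with open('demo.csv', 'w') as out:
    writer = csv.writer(out, dialect='excel')
    writer.writerow(['Title Information', 'Start Info, Some More Info', 'B1 Info'])

print(open('demo.csv').read().strip())
# -> Title Information,"Start Info, Some More Info",B1 Info
```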
Well, the left hand is the one with the thumb on the right side. ;) The above should say "indent... one level to the right". Also, the return command was added to the end of the body.
ASKER
Thanks again!
Before I was getting the extended error message but now it works perfect. Thanks again, really helpful.
ASKER
Also, if I ever wanted to go directly to the web rather than parsing local files what way could I do this? I would still like to be able to loop like www.test.com/100.html, www.test.com/101.html....
The urllib is OK for the purpose. The question is how to get the list of the HTML files -- that is a kind of active task on the server side. Because of that, it should be quite easy if the site supports, say, PHP or ActiveX. However, I am not good at that, so I cannot give you the exact answer.
ASKER
The other html files do follow a logical structure like I mentioned above so I already know the names. I was thinking something along the lines of:
For x in range (100,120)
url = 'www.test.com' + 'x' + 'html'
f = urllib.urlopen(url)
.......run the code
Would this work or is there a better way?
ASKER
Sorry to ask another question on this, but I have noticed that some of the files have no contact details and the code falls over because of it:
File "C:\Python27\LoopFilesHTML", line 61, in getRowFromFilename
for x in cu1.contents:
AttributeError: 'NoneType' object has no attribute 'contents'
Can you let me know how to skip that part if there are no contact details in the file?
For the earlier question on generating the URLs, try the following. Read the documentation about string interpolation (http://docs.python.org/library/stdtypes.html?highlight=string%20interpolation#string-formatting-operations).
for n in range(1, 122):
    url = 'http://www.test.com/%d.html' % n
    print url
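Putting that together with the earlier function, a loop over remote pages could look roughly like this. The site is the hypothetical one from the question, and getRowFromFile is a hypothetical variant of getRowFromFilename that would take a file object instead of a name; the urlopen calls are commented out so the sketch runs without network access:

```python
# Build the sequence of page URLs with string interpolation.
urls = []
for n in range(100, 103):
    url = 'http://www.test.com/%d.html' % n
    urls.append(url)
    # import urllib
    # f = urllib.urlopen(url)        # Python 2: fetch the page like a file
    # rowLst = getRowFromFile(f)     # hypothetical variant taking a file object

print(urls)
```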
I cannot see it directly, and your script is already a bit modified. However, it is probably because
cu1 = div_c.find('a', {'href': 'c.pl?u=1'})
did not find anything and therefore set cu1 to None. The last shown script used
rowLst.append(cu1.contents[0])
which would fail for exactly the same reason: a None object has no .contents. In such a case you can simply avoid the error -- do not access the contents attribute if there is nothing like that. You only have to decide what will be appended to the rowLst instead. Try:
if cu1 is not None:
    rowLst.append(cu1.contents[0])
else:
    rowLst.append('')               # i.e. an empty string
A side note: you should not freely use 'file' as an identifier, as it is the class of file objects in Python 2.x.
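A quick sketch of why that shadowing matters (in Python 3 the built-in file type is gone, but the same pitfall applies to rebinding names such as open or list):

```python
file = "report.csv"        # rebinding the name hides the built-in 'file' type

try:
    file("test.html")      # in Python 2 this would have constructed a file object
except TypeError as e:
    print("shadowed: %s" % e)   # the name now refers to a plain string
```

Renaming the variable to something like f, fin, or html_file avoids the problem entirely.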