Solved

extract string from html

Posted on 2008-11-16
Last Modified: 2012-06-10
Hi! I need to extract data from a web page using this pattern:

the data I need starts after

<tr height="20"><td style="width: 80px;" height="80" rowspan="2">

<a href="/test1/" title="Test1 title"><img class="individual" alt="Test1name" src="/test1/images/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/testg/" title="Test1 Name">Test1</a></b></td>
<td style="width: 35px;">22.02.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">432</td>


I need to extract the two hrefs with their related titles, the date (22.02.08), and the number (432) in the last column, and put them in a single pipe-delimited row. The result should look like this:

/test1/|Test1 title|/testg/|Test1 Name|Test1|22.02.08|432

There is more than one row in each HTML file.

Any help is really appreciated! Thanks!!!!



Question by:catalini
12 Comments
 
LVL 29

Expert Comment

by:pepr
ID: 22971198
I suggest using a decent parser rather than relying on a string pattern. Can you attach a realistic example of the file?
 

Author Comment

by:catalini
ID: 22971229
here we go...


In this case it should return two rows:

/test1/|Test1 title|/testg/|Test1 Name|Test1|22.02.08|432
/test1/|Test2 title|/testh/|Test2 Name and details|Test2|16.04.08|999

As you see, the structure always repeats itself.
Thanks!
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<body>
<table width="900" cellspacing=0 cellpadding=0 margin=0 border=0>
<tbody id="data">
<tr height="20"><td style="width: 80px;" height="80" rowspan="2">
<a href="/test1/" title="Test1 title"><img class="individual" alt="Test1name" src="/test1/images/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/testg/" title="Test1 Name">Test1</a></b></td>
<td style="width: 35px;">22.02.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">432</td>
<tr height="20"><td style="width: 80px;" height="80" rowspan="2">
<a href="/test2/" title="Test2 title"><img class="artist" alt="Test2name" src="/test2/images/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/testh/" title="Test2 Name and details">Test2</a></b></td>
<td style="width: 35px;">16.04.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">999</td>
</tr>
</tbody></table>
</body>
</html>


 
LVL 29

Expert Comment

by:pepr
ID: 22971231
When using Python, one of the parsers that is probably very suitable for the task is Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/#Download). When downloaded, you get a single text file of about 70 KB, BeautifulSoup.py, that you can put in the directory with your script.

Is the line <tr height="20"><td style="width: 80px;" height="80" rowspan="2"> introducing a single table, or is it one of several tables that is rather specific?
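For reference, the same "use a parser, not a string pattern" idea can be sketched with only the standard library; the snippet below uses modern Python 3's html.parser instead of Beautiful Soup, and the one-line sample string is a made-up stand-in for the real page:

```python
# Stdlib-only sketch of the parser approach (Python 3).
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> start tag."""
    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            d = dict(attrs)          # attrs arrives as a list of (name, value) pairs
            self.anchors.append((d.get('href', ''), d.get('title', '')))

sample = '<td><a href="/test1/" title="Test1 title">Test1</a></td>'
parser = AnchorCollector()
parser.feed(sample)
print(parser.anchors)   # -> [('/test1/', 'Test1 title')]
```

A dedicated HTML parser tolerates attribute reordering, whitespace, and unquoted values that would break a fixed string pattern.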
 
LVL 29

Expert Comment

by:pepr
ID: 22971236
Is the line <tr height="20"><td style="width: 80px;" height="80" rowspan="2"> introducing a single row, or is it one of several rows that is rather specific?
 

Author Comment

by:catalini
ID: 22971249
Every row of data I need to parse always starts with
<tr height="20"><td style="width: 80px;" height="80" rowspan="2">
and ends with
</tr>
 
LVL 29

Expert Comment

by:pepr
ID: 22971297
Try the following snippet and ask for modifications.
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect all td's in one row into the auxiliary list.
    lst = tr.findAll('td')        # list of td in one row
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and print it.
    print '|'.join(result)


 
LVL 29

Expert Comment

by:pepr
ID: 22971317
I forgot to extract one string before the date (see the modified snippet below). For the sample stored as b.html in the working directory, you should see the following result:

/test1/|Test1 title|/testg/|Test1 Name|Test1|22.02.08|432
/test2/|Test2 title|/testh/|Test2 Name and details|Test2|16.04.08|999
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect all td's in one row into the auxiliary list.
    lst = tr.findAll('td')        # list of td in one row
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and print it.
    print '|'.join(result)


 

Author Comment

by:catalini
ID: 22971343
It works perfectly on the test file, but on a more complex one it returns...

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    result.append(lst[0].a['title'])
  File "/home/me/parse/BeautifulSoup.py", line 536, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'title'


I've uploaded one of these more complex files here
www.catalini.com/b.html

the pattern is always the same

your help is really appreciated! thanks!
 

Author Comment

by:catalini
ID: 22975633
I've posted a more detailed list of the patterns here

http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23910724.html
 
LVL 29

Accepted Solution

by:
pepr earned 2000 total points
ID: 22992513
Sorry for the delay. The reason is that the file contains several tables whose rows follow different patterns. I do not know how well you know Python. Anyway, BeautifulSoup extracts the HTML content so that each element behaves like a Python dictionary (a lookup table, a.k.a. hash table or associative array). The above error says that the <a ...>...</a> element does not contain the 'title' attribute.

See the modified script below. It simply restricts extraction to rows that have some specific features. It should work with your b.html. Because the content is Unicode, the output is written to output.txt in UTF-8. Otherwise, Python complains that some characters cannot be converted to the output encoding used by the console.

Some explanation: tr.get('height', '') is almost the same as tr['height'], but if the attribute does not exist, it returns the given default value '' (an empty string).
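The same default-value behaviour can be illustrated with a plain Python dict (the values below are made up for illustration, not taken from the page):

```python
# Illustration only: BeautifulSoup Tag attribute lookup behaves like a dict lookup.
attrs = {'height': '20', 'rowspan': '2'}

print(attrs['height'])           # prints 20
print(attrs.get('width', ''))    # missing key -> default '', prints an empty line

# attrs['width'] would raise KeyError instead, which is exactly what
# lst[0].a['title'] did for rows whose <a> element has no title attribute.
```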

Tested with Python 2.6.
import codecs
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Output in Unicode may be difficult to display on a console.
# Therefore, the output strings are written to an output file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Keep only the rows with the specific height attribute.
    if tr.get('height', '') != '20':
        continue
 
    # Collect all td's in one row into the auxiliary list.
    lst = tr.findAll('td')        # list of td in one row
 
    # If lst does not have exactly 4 elements, then it is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and write it.
    f.write('|'.join(result) + '\n')
 
f.close()


 

Author Comment

by:catalini
ID: 22993798
this is simply amazing!!! thanks pepr!!!!!

P.S. could you please post the solution also here http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23910724.html

 

Author Closing Comment

by:catalini
ID: 31517240
amazing answer!!! thanks
