extract data from a recurring pattern

I'm following up on my previous question
https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html?cid=239

I need to extract data from a web page where it is always presented in the same way. The 3 patterns I'm interested in (each occurs multiple times in the same page) are shown below and NEVER change.

For each occurrence I need to extract, pipe-delimited: the pattern it comes from, the second href (with title and source), the date, and the number in the final <td>.

e.g.
for pattern 1

filenameparsed|pattern1|/abcdefg/|Abcd Degea|12.05.07|510

for pattern 2

filenameparsed|pattern2|/fffffg/|ffff hhhh|06.09.08|40

for pattern 3

filenameparsed|pattern3|/fydsdfs/|asdas asdasdsadas|14.10.07|285

thanks!!!!!!!!!!!

Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>


ozo

$_ = <<HERE;
Pattern 1

<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>

Pattern 2

<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>

Pattern 3

<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>
HERE

print "filenameparsed|pattern$1|$2|$3|$4|$5\n" while /(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<].*?([\d.]+)<\/td>.*?([\d]+)<\/td>/gs;
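
For comparison, here is a rough Python rendering of the same regular expression. It is only a sketch: the helper name `extract_records` and the shortened sample string are made up for illustration, and like the Perl one-liner it relies on a "Pattern N" label appearing right before each block, which a real page may not have.

```python
import re

# Same expression as the Perl one-liner; re.S plays the role of the /s flag,
# so "." also matches newlines.
RECORD = re.compile(
    r'(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<]'
    r'.*?([\d.]+)</td>.*?(\d+)</td>',
    re.S,
)


def extract_records(text, filename="filenameparsed"):
    """Return one pipe-delimited line per match (hypothetical helper)."""
    return [
        "|".join([filename, "pattern" + m.group(1), *m.groups()[1:]])
        for m in RECORD.finditer(text)
    ]


# Shortened stand-in for pattern 2 from the question (illustrative only).
sample = '''Pattern 2

<tr height="20"><td><a href="/fffffg/" title="ffff hhhh"><img src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
'''

for line in extract_records(sample):
    print(line)
```

On this sample it prints `filenameparsed|pattern2|/fffffg/|ffff hhhh|06.09.08|40`. Note that the title captured is the first `title="..."` found after the first href, so on other rows it may come from the image link rather than the text link.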

catalini

ASKER

thanks ozo! One more question: how do I adapt the starting part so that the script processes all documents in a folder, filling in the "filenameparsed" field with each file's name? (see example below)

Please post your solution also under https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html

thanks!!!

local $/;    # slurp mode: <$in> returns the whole file at once

foreach (<*>) {
    # do { } so the warning is printed before skipping to the next file
    open(my $in, '<', $_)
        or do { warn "could not open $_: $!\n"; next };
    open(my $out, '>', "$_.results")
        or do { warn "could not open $_.results: $!\n"; next };

    my $data = <$in>;
    $data =~ s/.*?<tr height(.*?)<\/td>.*/$1/s;

    while ($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "$_|$1|$2|$3|$4|$5\n";
    }

    close($in);
    close($out);
}


catalini

ASKER

I forgot to mention that the source files are html files, where the patterns (1-3) appear inside.

pepr

Try the snippet below. See the explanation in the previous question (Q_23908943, comment #22992513).

import codecs
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Output in Unicode may be difficult to display on console.
# Therefore, the output strings will be written to the output
# file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect only the rows with specific attributes.
    if tr.get('height', '') != '20':
        continue
 
    # Collect all td's in one row into the auxiliary list.
    lst = tr.findAll('td')        # list of td in one row

    # If lst does not have exactly 4 elements, then it is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and write it to the output file.
    f.write('|'.join(result) + '\n')
 
f.close()


catalini

ASKER

pepr, how could I automate the script to run on all files in a given folder and return a filename.results file for each of them (single file)?

thanks

ASKER CERTIFIED SOLUTION

pepr


This solution is only available to members.

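
The accepted code is not reproduced in this thread, and the following is not pepr's solution. It is just a minimal sketch of one way to do what was asked (one `filename.results` file per input file), using only the Python standard library instead of BeautifulSoup. The names `RowCollector`, `extract_rows` and `process_folder`, the `*.html` file pattern, and the choice of output fields are all assumptions, and it expects reasonably well-formed rows (a missing closing quote, as in some of the samples above, will confuse it).

```python
from html.parser import HTMLParser
from pathlib import Path


class RowCollector(HTMLParser):
    """Collect href/title/text of each <td> in every <tr height="20">."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell dicts
        self.row = None     # cells of the row being read, or None
        self.cell = None    # cell being read, or None

    def finish_row(self):
        if self.row is not None:
            self.rows.append(self.row)
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.finish_row()
            # Same filter as the snippet above: only rows with height="20".
            self.row = [] if attrs.get("height") == "20" else None
        elif tag == "td" and self.row is not None:
            self.cell = {"href": "", "title": "", "text": ""}
            self.row.append(self.cell)
        elif tag == "a" and self.cell is not None:
            self.cell["href"] = attrs.get("href") or ""
            self.cell["title"] = attrs.get("title") or ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell = None
        elif tag == "tr":
            self.finish_row()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell["text"] += data


def extract_rows(html, label):
    """Return 'label|href|title|link text|date|number' for each 4-cell row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    parser.finish_row()     # in case the last <tr> was never closed
    lines = []
    for cells in parser.rows:
        if len(cells) != 4:
            continue
        link, date, number = cells[1], cells[2], cells[3]
        lines.append("|".join([label, link["href"], link["title"],
                               link["text"].strip(), date["text"].strip(),
                               number["text"].strip()]))
    return lines


def process_folder(folder, pattern="*.html"):
    """Write <name>.results next to every file in folder matching pattern."""
    written = []
    for path in sorted(Path(folder).glob(pattern)):
        html = path.read_text(encoding="utf-8", errors="replace")
        out = path.with_name(path.name + ".results")
        out.write_text("".join(line + "\n" for line in
                               extract_rows(html, path.name)),
                       encoding="utf-8")
        written.append(out)
    return written
```

Calling `process_folder(".")` would then handle every `.html` file in the current directory. Restricting the glob to `*.html` also means that `.results` files left over from an earlier run are not picked up as input.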

catalini

ASKER

perfect! wonderful code! thank you soooo much!

catalini

ASKER

wonderful solution, great code!