catalini

extract data from a recurring pattern

I'm following up on my previous question
https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html?cid=239

The problem is I need to extract the data from a web page which is always presented in the same way. The 3 patterns I'm interested in (which occur multiple times in the same page) are the following and NEVER change (see below).

The data I need to extract (pipe-delimited) are: the pattern they come from, the second href (with title and source), the date, and the number in the final <td>.

e.g.
for pattern 1

filenameparsed|pattern|/abcdefg/|Abcd  Degea|510

for pattern 2

filenameparsed|pattern2|/fffffg/|ffff hhhh|06.09.08|40

for pattern 3

filenameparsed|pattern2|/fydsdfs/|asdas asdasdsadas|14.10.07|285

thanks!!!!!!!!!!!

Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>


ozo

$_ = <<HERE;
Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>
HERE

print "filenameparsed|pattern$1|$2|$3|$4|$5\n" while /(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<].*?([\d.]+)<\/td>.*?([\d]+)<\/td>/gs;
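For readers who want to see what each part of that regular expression does, here is a minimal sketch of the same expression ported to Python and split into commented pieces. The names `ROW_RE` and `extract_rows` are made up for this illustration, and the sample is just the Pattern 1 block from the question; it is not a general HTML parser.

```python
import re

# Python port of the Perl regular expression above; re.S plays the role of /s.
ROW_RE = re.compile(
    r'(\d+)\s*<'                 # pattern number taken from the "Pattern N" label
    r'.*?href="([^"]*)"'         # first href after the label
    r'.*?title="([^"]*?)["<]'    # next title value, ended by a quote or by "<"
    r'.*?([\d.]+)</td>'          # date cell, e.g. 12.05.07
    r'.*?(\d+)</td>',            # number in the final <td>
    re.S,
)

def extract_rows(text, filename="filenameparsed"):
    """Return one pipe-delimited line per match found in text."""
    return [
        "|".join((filename, "pattern" + m.group(1), *m.groups()[1:]))
        for m in ROW_RE.finditer(text)
    ]

sample = '''Pattern 1

<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
'''

for line in extract_rows(sample):
    print(line)   # filenameparsed|pattern1|/abcdefg/|abcdefg ghe|12.05.07|510
```

Note that on this row the title field comes from the title attribute of the first link, not from the link text "Abcd  Degea", because the expression takes the first `title="` it finds after the href.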
catalini

ASKER

thanks ozo! one more question: how do I adapt the starting part so that the script processes all documents in a folder (filling in the "filenameparsed" field with each file name)? (see example below)

Please post your solution also under https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html

thanks!!!
local $/;    # slurp mode: <$in> returns the whole file at once

foreach my $file (<*>) {
    # Skip files that cannot be opened, after printing a warning.
    open(my $in, '<', $file)
        or do { warn "could not open $file: $!\n"; next };
    open(my $out, '>', "$file.results")
        or do { warn "could not open $file.results: $!\n"; next };

    my $data = <$in>;
    $data =~ s/.*?<tr height(.*?)<\/td>.*/$1/s;

    while ($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "$file|$1|$2|$3|$4|$5\n";
    }

    close($in);
    close($out);
}


I forgot to mention that the source files are HTML files, and the patterns (1-3) appear inside them.
Try the snippet below. See the explanation at http:Q_23908943.html#22992513
import codecs
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Output in Unicode may be difficult to display on console.
# Therefore, the output strings will be written to the output
# file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect only the rows with specific attributes.
    if tr.get('height', '') != '20':
        continue
 
    # Collect all td's in one row into the auxiliary list.
    lst = []                      # list of td in one row
    for td in tr.findAll('td'):
        lst.append(td)
 
    # If lst does not have exactly 4 elements, the row is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and write it to the output file.
    f.write('|'.join(result) + '\n')
 
f.close()


pepr, how could I automate the script to run on all files in a given folder and return a filename.results file for each of them (single file)?

thanks
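The accepted answer is not reproduced in this thread. As a rough sketch only (not pepr's code), per-file processing could look like the following in Python 3 with just the standard library. The names `ROW_RE` and `process_folder` and the `*.html` glob are assumptions made for this illustration. The regular expression is also an assumption: it expects well-formed rows like Pattern 1, and rows whose title attribute is missing its closing quote (as in the Pattern 2 and 3 samples) would need a looser pattern or a real HTML parser.

```python
import re
from pathlib import Path

# Hypothetical row pattern for illustration: href and text of the second link,
# followed by the date cell and the final number cell.
ROW_RE = re.compile(
    r'<b><a href="([^"]*)"[^>]*>([^<]*)</a></b></td>\s*'
    r'<td[^>]*>([\d.]+)</td>\s*'
    r'<td[^>]*>(\d+)</td>'
)

def process_folder(folder, glob="*.html"):
    """Write <name>.results next to every matching file in folder.

    Returns the list of result files that were written.
    """
    written = []
    for src in sorted(Path(folder).glob(glob)):
        try:
            text = src.read_text(encoding="utf-8", errors="replace")
        except OSError as exc:
            print(f"could not open {src}: {exc}")
            continue
        # One pipe-delimited line per row, prefixed with the source file name.
        lines = ["|".join((src.name, *m.groups())) for m in ROW_RE.finditer(text)]
        out = src.with_name(src.name + ".results")
        out.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
        written.append(out)
    return written
```

Calling `process_folder(".")` would then create `page.html.results` beside a `page.html` in the current directory. Restricting the glob to `*.html` keeps a second run from picking up the `.results` files written by the first.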
ASKER CERTIFIED SOLUTION
pepr

perfect! wonderful code! thank you soooo much!
wonderful solution, great code!