catalini asked:
extract data from a recurring pattern
I'm following up on my previous question
https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html?cid=239
The problem is I need to extract the data from a web page which is always presented in the same way. The 3 patterns I'm interested in (which occur multiple times in the same page) are the following and NEVER change (see below).
The data I need to extract (pipe-delimited) are: the pattern they come from, the second href (with title and source), the date, and the number in the final <td>.
e.g.
for pattern 1
filenameparsed|pattern|/ab cdefg/|Abc d Degea|510
for pattern 2
filenameparsed|pattern2|/f ffffg/|fff f hhhh|06.09.08|40
for pattern 3
filenameparsed|pattern2|/f ydsdfs/|as das asdasdsadas|14.10.07|285
thanks!!!!!!!!!!!
Pattern 1
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
Pattern 2
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
Pattern 3
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>
ASKER
Thanks ozo! One more question: how do I adapt the starting part so that the script processes all documents in a folder (filling in the "filenameparsed" field)? (See the example below.)
Please post your solution also under https://www.experts-exchange.com/questions/23908943/extract-string-from-html.html
thanks!!!
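local $/;    # undefine the record separator so <$in> reads a whole file at once
foreach (<*>) {
    # "warn ..., next" would run next before warn could print anything,
    # so the warning and the next are grouped in a do block instead.
    open(my $in, '<', $_) or do { warn "could not open $_: $!\n"; next };
    open(my $out, '>', "$_.results") or do { warn "could not open $_.results: $!\n"; next };
    my $data = <$in>;
    $data =~ s/.*?<tr height(.*?)<\/td>.*/$1/s;
    while ($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        # $_ still holds the current file name here.
        print $out "$_|$1|$2|$3|$4|$5\n";
    }
    close($in);
    close($out);
}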
ASKER
I forgot to mention that the source files are html files, where the patterns (1-3) appear inside.
Try the snippet below. See the explanation at http:Q_23908943.html#22992513
import codecs
from BeautifulSoup import BeautifulSoup

# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()

# Parse the string.
soup = BeautifulSoup(page)

# Output in Unicode may be difficult to display on console.
# Therefore, the output strings will be written to the output
# file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')

# Find and process all the table rows.
for tr in soup.findAll('tr'):

    # Collect only the rows with specific attributes.
    if tr.get('height', '') != '20':
        continue

    # Collect all td's in one row into the auxiliary list.
    lst = []    # list of td in one row
    for td in tr.findAll('td'):
        lst.append(td)

    # If lst does not have exactly 4 elements, then it is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue

    # Process the info from one row into the result list.
    # The first two data cells contain hrefs with titles,
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
    result.append(lst[2].string)
    result.append(lst[3].string)

    # Join the resulting list into one string and write it.
    f.write('|'.join(result) + '\n')

f.close()
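In the sample rows for patterns 2 and 3, the title attribute of the second link has no closing quote, so an HTML parser may not read those cells the way the snippet above expects. As a rough alternative, here is a regex-only sketch in Python 3 using just the standard library. The names ROW_RE and extract_records are made up for illustration, and the sketch assumes that the link cell always starts with exactly <td style="width: 300px;"><b><a and is followed directly by the date cell and the number cell, as in the posted patterns. When the link text is missing, it falls back to the title value.

```python
import re

# Assumption: the second link, the date and the number always appear in
# three consecutive <td> cells laid out exactly as in the posted patterns.
ROW_RE = re.compile(
    r'<td style="width: 300px;"><b><a href="(?P<href>[^"]*)" title="'
    r'(?P<title>[^"<]*)(?:">(?P<text>[^<]*))?</a></b></td>\s*'
    r'<td[^>]*>(?P<date>\d\d\.\d\d\.\d\d)</td>\s*'
    r'<td[^>]*>(?P<number>\d+)</td>'
)


def extract_records(html, filename):
    """Return one pipe-delimited line per matched row (hypothetical helper)."""
    lines = []
    for m in ROW_RE.finditer(html):
        # Patterns 2 and 3 lack the closing quote of the title, so there is
        # no separate link text; use the title value in that case.
        label = m.group("text") or m.group("title")
        fields = [filename, m.group("href"), label,
                  m.group("date"), m.group("number")]
        lines.append("|".join(fields))
    return lines
```

Calling extract_records(page, 'b.html') on the text of one file returns a list of strings that can be written out one per line. The field order here (file name, href, label, date, number) is only one reading of the requested output.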
ASKER
pepr, how could I automate the script to run on all files in a given folder and return a filename.results file for each of them (single file)?
thanks
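One possible way to run the extraction over a whole folder, sketched in Python 3 with only the standard library. The function name process_folder and its parse argument are hypothetical: parse stands for whatever routine turns the text of one file into a list of output lines, and it receives the file name so that it can fill the "filenameparsed" field. The sketch assumes the source files end in .html and can be read as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
from pathlib import Path


def process_folder(folder, parse, pattern="*.html"):
    """Write a <name>.results file next to every matching file in folder.

    parse(html, filename) is a caller-supplied function (an assumption of
    this sketch) that returns a list of already formatted output lines.
    """
    written = []
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: UTF-8 input; bad bytes are replaced, not fatal.
        html = path.read_text(encoding="utf-8", errors="replace")
        lines = parse(html, path.name)
        # b.html -> b.html.results, in the same folder as the source file.
        out_path = path.with_name(path.name + ".results")
        out_path.write_text("".join(line + "\n" for line in lines),
                            encoding="utf-8")
        written.append(out_path)
    return written
```

Because only files matching the pattern are read, the .results files written by an earlier run are not picked up again as input.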
ASKER CERTIFIED SOLUTION
ASKER
perfect! wonderful code! thank you soooo much!
ASKER
wonderful solution, great code!