extract data from a recurring pattern

I'm following up on my previous question
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html?cid=239

The problem is I need to extract the data from a web page which is always presented in the same way. The 3 patterns I'm interested in (which occur multiple times in the same page) are the following and NEVER change (see below).

The data I need to extract (pipe-delimited) are: the pattern the row comes from, the second href (with title and source), the date, and the number in the final <td>.

e.g.
for pattern 1

filenameparsed|pattern1|/abcdefg/|Abcd  Degea|12.05.07|510

for pattern 2

filenameparsed|pattern2|/fffffg/|ffff hhhh|06.09.08|40

for pattern 3

filenameparsed|pattern3|/fydsdfs/|asdas asdasdsadas|14.10.07|285

thanks!!!!!!!!!!!

Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>


catalini asked:

ozo commented:
$_ = <<HERE;
Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>
HERE

print "filenameparsed|pattern$1|$2|$3|$4|$5\n" while /(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<].*?([\d.]+)<\/td>.*?([\d]+)<\/td>/gs;
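
For comparison, the same single-regex idea can be sketched in Python 3. This is only an illustration under stated assumptions, not ozo's method: the function name extract_rows is made up, the sample rows are trimmed copies of patterns 1 and 2, and the regex anchors on the second link (the one wrapped in <b>) and captures its title attribute rather than the link text. It tolerates the missing closing quote seen in patterns 2 and 3.

```python
import re

# One match per table row. Anchoring on "<b><a" skips the first (image) link.
ROW = re.compile(
    r'<b><a href="([^"]*)"'        # href of the second link
    r'\s+title="([^"<]*)'          # its title; stops at '"' or '<' (unclosed quote)
    r'.*?<td[^>]*>([\d.]+)</td>'   # date cell, e.g. 12.05.07
    r'\s*<td[^>]*>(\d+)</td>',     # final numeric cell
    re.DOTALL,
)

# Trimmed copies of the rows from patterns 1 and 2 (an assumption for the demo).
SAMPLE = '''<tr height="20"><td rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right;">510</td>
</tr><tr height="20"><td rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right;">40</td>
'''


def extract_rows(html):
    """Return a list of (href, title, date, number) tuples, one per matched row."""
    return [m.groups() for m in ROW.finditer(html)]


for fields in extract_rows(SAMPLE):
    print("|".join(("filenameparsed",) + fields))
```

On the trimmed sample this prints filenameparsed|/abcdefg/|abcdfegs|12.05.07|510 and filenameparsed|/fffffg/|ffff hhhh|06.09.08|40. A regex like this is brittle against markup changes, which is why a real HTML parser is often the safer choice.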
catalini (author) commented:
thanks ozo! one more question: how do I adapt the starting part so that the script processes all documents in a folder (filling in the "filenameparsed" field with each file's name)? (see example below)

Please post your solution also under http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html

thanks!!!
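local $/;    # slurp mode: read each file in one go

foreach my $file (<*>) {
    # "do { warn ...; next }" is needed here: with "warn ..., next" the
    # next would run before the warning is ever printed.
    open(my $in, '<', $file) or do { warn "could not open $file: $!\n"; next };
    open(my $out, '>', "$file.results") or do { warn "could not open $file.results: $!\n"; next };

    my $data = <$in>;
    $data =~ s/.*?<tr height(.*?)<\/td>.*/$1/s;

    # One output line per match, prefixed with the name of the parsed file.
    while ($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "$file|$1|$2|$3|$4|$5\n";
    }

    close($in);
    close($out);
}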

catalini (author) commented:
I forgot to mention that the source files are html files, where the patterns (1-3) appear inside.

pepr commented:
Try the snippet below. See the explanation at http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html#22992513
import codecs
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Output in Unicode may be difficult to display on console.
# Therefore, the output strings will be written to the output
# file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect only the rows with specific attributes.
    if tr.get('height', '') != '20':
        continue
 
    # Collect all td's in one row into the auxiliary list.
    lst = []                      # list of td in one row
    for td in tr.findAll('td'):
        lst.append(td)
 
    # If lst does not have exactly 4 elements, then the row is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and write it to the output file.
    f.write('|'.join(result) + '\n')
 
f.close()

catalini (author) commented:
pepr, how could I automate the script to run on all files in a given folder and return a filename.results file for each of them (single file)?

thanks
pepr commented:
Read the comments in the source below. In a case like this, it is usually best to wrap the reusable code in a function definition, here extract_info(). Notice that it does not use any global variables (only the imported modules, which is quite usual). The input and output filenames are simply passed as arguments. This way you can decide later which files will be processed, how the output filenames will be constructed, etc.

sys.argv can be used to get the arguments passed on the command line. If you name the script, say, extractor.py, then you can run it like:

    python extractor.py ./my/path/to/html/filenames/

os.path.join() is the preferred way to join parts of a path.

glob.glob(mask) returns the list of filenames that match the mask.

Feel free to ask for details/modifications.
 
    Petr
from BeautifulSoup import BeautifulSoup
import codecs
import glob
import os
import sys 
 
def extract_info(fnameIn, fnameOut):
 
    # Get the content of your document (somehow) into one string.
    f = open(fnameIn)
    page = f.read()
    f.close()
     
    # Parse the string.
    soup = BeautifulSoup(page)
     
    # Output in Unicode may be difficult to display on console.
    # Therefore, the output strings will be written to the output
    # file.
    f = codecs.open(fnameOut, 'w', 'utf-8-sig')
     
    # Find and process all the table rows.
    for tr in soup.findAll('tr'):
        # Collect only the rows with specific attributes.
        if tr.get('height', '') != '20':
            continue
     
        # Collect all td's in one row into the auxiliary list.
        lst = []                      # list of td in one row
        for td in tr.findAll('td'):
            lst.append(td)
     
        # If lst does not have exactly 4 elements, then the row is unexpected.
        # Skip it.
        if len(lst) != 4:
            continue
     
        # Process the info from one row into the result list. 
        # The first two data cells contain hrefs with titles, 
        # the other two the date and the number.
        result = []
        result.append(lst[0].a['href'])
        result.append(lst[0].a['title'])
     
        result.append(lst[1].a['href'])
        result.append(lst[1].a['title'])
        result.append(lst[1].a.string)
     
        result.append(lst[2].string)
        result.append(lst[3].string)
     
        # Join the resulting list into one string and write it to the output file.
        f.write('|'.join(result) + '\n')
     
    f.close()
    
 
if __name__ == '__main__':    # then this was executed as a script
    
    # Get the command line argument -- the path to be searched for .html
    # files. No error checking here (i.e. quick hack). Or you could assign
    # myPath your favourite constant string.
    myPath = sys.argv[1]
    
    # Use the glob module and the glob() function from inside to get
    # the *.html filenames.
    mask = os.path.join(myPath, '*.html')
    print "Searching for '%s' files..." % mask
    
    for fnameIn in glob.glob(mask):
        # Construct the output filename. The simplest way is just to append
        # the .result extension.
        fnameOut = fnameIn + '.result'
        print fnameIn + ' --> ' + fnameOut
        
        # Call the above function to extract the needed info.
        extract_info(fnameIn, fnameOut)
        
    print '(finished)'    
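
The script above is written for Python 2 and, as its comments say, does no error checking on the command-line argument. A minimal Python 3 sketch of just the driver part, with that check added, might look like the following. The helper names find_html_files, output_name and main are hypothetical, and the call to extract_info() is left as a comment because that function is not ported here.

```python
import glob
import os
import sys


def find_html_files(folder):
    """Return the sorted list of *.html paths directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, '*.html')))


def output_name(fname_in):
    """Build the output filename by appending .result to the input name."""
    return fname_in + '.result'


def main(argv):
    """Process every *.html file in the folder given as argv[1]."""
    # Print a usage message instead of failing with IndexError
    # when the folder argument is missing or is not a directory.
    if len(argv) != 2 or not os.path.isdir(argv[1]):
        print('usage: python extractor.py FOLDER', file=sys.stderr)
        return 1

    for fname_in in find_html_files(argv[1]):
        fname_out = output_name(fname_in)
        print(fname_in, '-->', fname_out)
        # A Python 3 port of extract_info(fname_in, fname_out) would be called here.

    print('(finished)')
    return 0
```

To use it as a script, call sys.exit(main(sys.argv)) under the usual if __name__ == '__main__': guard. Sorting the glob result is a design choice that keeps the processing order stable between runs.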

catalini (author) commented:
perfect! wonderful code! thank you soooo much!
catalini (author) commented:
wonderful solution, great code!