• Status: Solved
  • Priority: Medium
  • Security: Public

Extract data from a recurring pattern

I'm following up on my previous question
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html?cid=239

The problem is that I need to extract data from a web page that is always presented in the same way. The three patterns I'm interested in (which occur multiple times on the same page) are shown below and NEVER change.

The data I need to extract (pipe-delimited) are: the pattern the row comes from, the second href (with its title/name), the date, and the number in the final <td>.

e.g.
for pattern 1

filenameparsed|pattern1|/abcdefg/|Abcd  Degea|12.05.07|510

for pattern 2

filenameparsed|pattern2|/fffffg/|ffff hhhh|06.09.08|40

for pattern 3

filenameparsed|pattern3|/fydsdfs/|asdas asdasdsadas|14.10.07|285

thanks!!!!!!!!!!!

Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">285</td>


Asked by: catalini
 
ozo Commented:
$_ = <<HERE;    # the three sample rows from the question, used here as test data
Pattern 1
 
<tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/abcdefg/" title="abcdefg ghe"><img class="artist" alt="abcdefg ghe" src="/abcdefg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/abcdefg/" title="abcdfegs">Abcd  Degea</a></b></td>
<td style="width: 35px;">12.05.07</td>
<td width="40" style="text-align: right; padding-right: 4px;">510</td>
 
Pattern 2
 
<td colspan="7" style="margin-top: 4px"><div style="overflow-y: auto; padding: 2px; margin-top: 4px">asdasdasdas asdasdasd asdasdsadasasd</div></td></tr><tr height="20"><td style="width: 90px;" height="82" rowspan="2"><a href="/fffffg/" title="ffff hhhh"><img class="individual" alt="ffff hhhh" src="/fffffg/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fffffg/" title="ffff hhhh</a></b></td>
<td style="width: 35px;">06.09.08</td>
<td width="40" style="text-align: right; padding-right: 4px;">40</td>
 
Pattern 3
 
<td style="width: 90px;" height="82" rowspan="2"><a href="/fydsdfs/" titPle="fydsdfs asdasdas"><img class="individual" alt="asdasdas sdasdsad" src="/individual/photo/search.jpeg" /></a></td>
<td style="width: 300px;"><b><a href="/fydsdfs/" title="asdas asdasdsadas</a></b></td>
<td style="width: 35px;">14.10.07</td>
<td width="40" style="text-align: right; padding-right: 4px;"">285</td>
HERE

print "filenameparsed|pattern$1|$2|$3|$4|$5\n" while /(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<].*?([\d.]+)<\/td>.*?([\d]+)<\/td>/gs;
 
catalini (Author) Commented:
thanks ozo! One more question: how do I adapt the start of the script so that it processes all documents in a folder (filling in the "filenameparsed" field)? (see example below)

Please post your solution also under http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html

thanks!!!
local $/;    # slurp whole files

foreach (<*>) {

    open(my $in, "<", $_)
        or do { warn "could not open $_: $!\n"; next };
    open(my $out, ">", "$_.results")
        or do { warn "could not open $_.results: $!\n"; next };

    my $data = <$in>;

    $data =~ s/.*?<tr height(.*?)<\/td>.*/$1/s;

    while ($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "$_|$1|$2|$3|$4|$5\n";
    }

    close($in);
    close($out);
}
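
For completeness, here is a minimal sketch of how ozo's pattern could be dropped into a loop like the one above, assuming the HTML files sit in the current directory and that the pattern needs no adjustment for the full pages (the *.html mask and the .results suffix are only placeholders):

local $/;    # slurp whole files

foreach my $file (<*.html>) {

    open(my $in, "<", $file)
        or do { warn "could not open $file: $!\n"; next };
    open(my $out, ">", "$file.results")
        or do { warn "could not open $file.results: $!\n"; next };

    my $data = <$in>;

    # ozo's pattern, applied to the whole file; $file fills the
    # "filenameparsed" field in the output.
    while ($data =~ /(\d+)\s*<.*?href="([^"]*)".*?title="([^"]*?)["<].*?([\d.]+)<\/td>.*?([\d]+)<\/td>/gs) {
        print $out "$file|pattern$1|$2|$3|$4|$5\n";
    }

    close($in);
    close($out);
}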

 
catalini (Author) Commented:
I forgot to mention that the source files are HTML files, with patterns 1-3 appearing inside them.
 
pepr Commented:
Try the snippet below. See the explanation at http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23908943.html#22992513
import codecs
from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
# Output in Unicode may be difficult to display on console.
# Therefore, the output strings will be written to the output
# file.
f = codecs.open('output.txt', 'w', 'utf-8-sig')
 
# Find and process all the table rows.
for tr in soup.findAll('tr'):
    # Collect only the rows with specific attributes.
    if tr.get('height', '') != '20':
        continue
 
    # Collect all td's in one row into the auxiliary list.
    lst = []                      # list of td in one row
    for td in tr.findAll('td'):
        lst.append(td)
 
    # If lst does not have exactly 4 elements, the row is unexpected.
    # Skip it.
    if len(lst) != 4:
        continue
 
    # Process the info from one row into the result list. 
    # The first two data cells contain hrefs with titles, 
    # the other two the date and the number.
    result = []
    result.append(lst[0].a['href'])
    result.append(lst[0].a['title'])
 
    result.append(lst[1].a['href'])
    result.append(lst[1].a['title'])
    result.append(lst[1].a.string)
 
    result.append(lst[2].string)
    result.append(lst[3].string)
 
    # Join the resulting list into one string and write it to the output file.
    f.write('|'.join(result) + '\n')
 
f.close()

 
catalini (Author) Commented:
pepr, how could I automate the script to run on all files in a given folder and produce a filename.results file for each of them (one output file per input file)?

thanks
 
pepr Commented:
Read the comments in the source below. In a case like this, it is usually best to wrap the reusable code in a function definition -- here extract_info(). Notice that it does not use any global variables (only the imported modules, which is a fairly common approach). The input filename and the output filename are simply passed as arguments. This way you can decide later which files will be processed, how the output filenames will be constructed, etc.

sys.argv can be used to get the arguments passed from the command line. If you name the script, say, extractor.py, then you can run it like:

    python extractor.py ./my/path/to/html/filenames/

os.path.join() is the preferred way of joining parts of a path.

glob.glob(mask) returns the list of filenames that match the mask.

Feel free to ask for details/modifications.
 
    Petr
from BeautifulSoup import BeautifulSoup
import codecs
import glob
import os
import sys 
 
def extract_info(fnameIn, fnameOut):
 
    # Get the content of your document (somehow) into one string.
    f = open(fnameIn)
    page = f.read()
    f.close()
     
    # Parse the string.
    soup = BeautifulSoup(page)
     
    # Output in Unicode may be difficult to display on console.
    # Therefore, the output strings will be written to the output
    # file.
    f = codecs.open(fnameOut, 'w', 'utf-8-sig')
     
    # Find and process all the table rows.
    for tr in soup.findAll('tr'):
        # Collect only the rows with specific attributes.
        if tr.get('height', '') != '20':
            continue
     
        # Collect all td's in one row into the auxiliary list.
        lst = []                      # list of td in one row
        for td in tr.findAll('td'):
            lst.append(td)
     
        # If lst does not have exactly 4 elements, the row is unexpected.
        # Skip it.
        if len(lst) != 4:
            continue
     
        # Process the info from one row into the result list. 
        # The first two data cells contain hrefs with titles, 
        # the other two the date and the number.
        result = []
        result.append(lst[0].a['href'])
        result.append(lst[0].a['title'])
     
        result.append(lst[1].a['href'])
        result.append(lst[1].a['title'])
        result.append(lst[1].a.string)
     
        result.append(lst[2].string)
        result.append(lst[3].string)
     
        # Join the resulting list into one string and write it to the output file.
        f.write('|'.join(result) + '\n')
     
    f.close()
    
 
if __name__ == '__main__':    # then this was executed as a script
    
    # Get the command line argument -- the path to be searched for .html
    # files. No error checking here (i.e. quick hack). Or you could assign
    # myPath your favourite constant string.
    myPath = sys.argv[1]
    
    # Use the glob module and the glob() function from inside to get
    # the *.html filenames.
    mask = os.path.join(myPath, '*.html')
    print "Searching for '%s' files..." % mask
    
    for fnameIn in glob.glob(mask):
        # Construct the output filename. The simplest way is just to append
        # the .result extension.
        fnameOut = fnameIn + '.result'
        print fnameIn + ' --> ' + fnameOut
        
        # Call the above function to extract the needed info.
        extract_info(fnameIn, fnameOut)
        
    print '(finished)'    

 
catalini (Author) Commented:
perfect! wonderful code! thank you soooo much!
 
catalini (Author) Commented:
wonderful solution, great code!