Solved

How can I extract parts of a website with PHP?

Posted on 2015-01-21
Last Modified: 2015-01-22
Hi guys,

I need to get some data once or twice a week from different websites. I usually do this by hand, but these websites display far too much information, so I thought it would be easier to write a script, run by a cron job, that fetches exactly what I want and lists it on a PHP page. I did something similar about 10 years ago, but my coding skills are quite rusty, so I would appreciate a hand here.

One of the sites lists the information like this:

<table width="100%" border="1" cellpadding="0" cellspacing="0" bordercolor="#cccccc"> 
        <tr> 
          <td><table width="100%" border="0" cellpadding="1" cellspacing="2" bordercolor="0"> 
              <tr>  
                <td colspan="3">  
                    <p align="center">   
                    <img src="/images/classes/49.png" alt="Item 1 Number 49" width="40" height="40">  
                      
                    <img src="/images/classes/118.png" alt="Item 2 Number 118" width="40" height="40">  
                      
                    <img src="/images/classes/491.png" alt="Item 3 Number 491" width="40" height="40">  
                      
                    <img src="/images/classes/24.png" alt="Item 4 Number 24" width="40" height="40">  
                     
                    </p> 
                </td> 
              </tr>
......


What I would like to get is something like this:
Item 1 - 49 - 49.png
Item 2 - 118 - 118.png
Item 3 - 491 - 491.png
Item 4 - 24 - 24.png

or something similar in an array, so that I can even build a database from it. How can this be done?

Thanks in advance!
Question by:Caracena
2 Comments
 

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 40563356
It might be helpful for us to see this "in situ." What is the URL of the site?

Something like this should work.
http://iconoun.com/demo/temp_caracena.php

<?php // demo/temp_caracena.php

/**
 * See: http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28601108.html
 *
 *  Item 1 - 49 - 49.png
 *
 *  Item 2 - 118 - 118.png
 *
 *  Item 3 - 491- 491.png
 *
 *  Item 4 - 24- 24.png
 *
 * or something alike in an array or something so I can even make a DB out from this. How can be done?
 *
 */

error_reporting(E_ALL);
echo '<pre>';

// THE TEST DATA
$htm = <<<EOD
<table width="100%" border="1" cellpadding="0" cellspacing="0" bordercolor="#cccccc">
        <tr>
          <td><table width="100%" border="0" cellpadding="1" cellspacing="2" bordercolor="0">
              <tr>
                <td colspan="3">
                    <p align="center">
                    <img src="/images/classes/49.png" alt="Item 1 Number 49" width="40" height="40">

                    <img src="/images/classes/118.png" alt="Item 2 Number 118" width="40" height="40">

                    <img src="/images/classes/491.png" alt="Item 3 Number 491" width="40" height="40">

                    <img src="/images/classes/24.png" alt="Item 4 Number 24" width="40" height="40">

                    </p>
                </td>
              </tr>
......
EOD;

// GET IMAGE NAMES
$rgx
= '#'         // REGEX DELIMITER
. 'ses/'      // LITERAL STRING
. '('         // CAPTURE GROUP
. '.*?'       // ANYTHING OR NOTHING
. ')'         // END CAPTURE GROUP
. '"'         // LITERAL STRING
. '#'         // REGEX DELIMITER
;
preg_match_all($rgx, $htm, $images);

// GET ALT TEXT STRINGS
$rgx
= '#'         // REGEX DELIMITER
. 'alt="'     // LITERAL STRING
. '('         // CAPTURE GROUP
. '.*?'       // ANYTHING OR NOTHING
. ')'         // END CAPTURE GROUP
. '"'         // LITERAL STRING
. '#'         // REGEX DELIMITER
;
preg_match_all($rgx, $htm, $alts);

// PREPARE THE OUTPUT
$out = array();
foreach ($alts[1] as $key => $val)
{
    $val = str_replace('Number', '-', $val);

    $out[$key] = $val . ' - ' . $images[1][$key];
}

// SHOW THE WORK PRODUCT
print_r($out);
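
If the markup on the real pages varies (different attribute order, single quotes, extra images), a DOM parser may hold up better than regular expressions. The following is only a minimal sketch, separate from the demo above, assuming the alt text always follows the "Item N Number X" pattern shown in the question. The function name extract_items and the array keys are made up for illustration, and it needs PHP's DOM extension.

```php
<?php
// Hypothetical helper (not part of the original demo): collect one row per <img>.
function extract_items(string $html): array
{
    // Collect parser warnings internally; real-world or truncated HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $alt = $img->getAttribute('alt');

        // Assumed alt format: "Item 1 Number 49". Skip anything else.
        if (!preg_match('/^(.+?)\s+Number\s+(\d+)$/', $alt, $m)) {
            continue;
        }

        $rows[] = array(
            'item'   => $m[1],                                 // e.g. "Item 1"
            'number' => (int) $m[2],                           // e.g. 49
            'file'   => basename($img->getAttribute('src')),   // e.g. "49.png"
        );
    }
    return $rows;
}

// Usage with a shortened copy of the sample markup from the question
$sample = '<p><img src="/images/classes/49.png" alt="Item 1 Number 49">'
        . '<img src="/images/classes/118.png" alt="Item 2 Number 118"></p>';

foreach (extract_items($sample) as $row) {
    echo $row['item'] . ' - ' . $row['number'] . ' - ' . $row['file'] . PHP_EOL;
}
```

Each row is an associative array, so it could be passed to a prepared INSERT statement more or less as-is if the data ends up in a database.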


Author Comment

by:Caracena
ID: 40564108
Hi Ray, thanks for the answer. It works perfectly! Unfortunately, the sites I fetch the data from are very private university (research) sites.