Solved

Getting product block with simple html dom parser

Posted on 2014-04-29
Last Modified: 2014-05-01
I am trying to scrape some information from a website, but I am stuck on getting the product block. The source page is [here](/www.topocentras.lt/Telefonai-Navigacijos/Ismanieji-telefonai-GSM/).

My code looks like this:

    <h2>Telefonai topocentras</h2>
     </br>
    <?php
     include_once('simple_html_dom.php');
     $url = "http://www.topocentras.lt/Telefonai-Navigacijos/Ismanieji-telefonai-/";
     // Start from the main page
      $nextLink = $url;
     // Loop on each next link as long as it exists
     while ($nextLink) {
     echo "<hr>nextLink: $nextLink<br>";
     //Create a DOM object
     $html = new simple_html_dom();
     // Load HTML from a url
     $html->load_file($nextLink);
     //Try to find phone block
     $phones = $html->find('li#product-picture img[src]');
      
      foreach($phones as $phone) {
      //Try to find phone cost
        $cost = $phone->find('strong[class=price]', 0)->plaintext;
         //Try to find phone link
        $link = $phone->href;
         //Try to find phone name
        $name=$phone->find('li[a=title]',0)->plaintext;
         //Try to find phone photo source
        $photo= $phone->find('img[src]',0);
        echo $name, " #----# ", $cost, " #----# ", $link, " #----# ", $photo, "<br>";
        }
      $nextLink = ( ($temp = $html->find('a.href span[=""]', 0)) ?    "https://www.topocentras.lt".$temp->href : NULL );
    // Clear DOM object
    $html->clear();
    unset($html);
     }
     ?>

So the problem is getting the block for each phone. I need every phone's name, cost (price), and link. I have tried many variations, but nothing works. Can someone tell me how to get the phone block correctly?
Question by:Nekasas
5 Comments
 

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 40029958
Please see http://iconoun.com/demo/temp_nekasas.php

Simple_HTML_DOM is only one way to scrape the document.  I find it easier to use conventional PHP instructions, rather than the object.

I am going to recommend that you do not do this at all.  If the publisher wants to share this information with you in a programmatic manner, they will expose an API that will give you structured data.  If you depend on scraping a web page to get the data you will find that your application is very brittle - it will break as soon as the publisher changes the format or tags inside the document.  And it follows that if the publisher discovers that you're using automation to capture their data, and the publisher does not want you to have this information from web scraping, it will be very easy for them to break your script without notice.

    <?php // demo/temp_nekasas.php
    error_reporting(E_ALL);

    // SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28422236.html

    // THE HTML DOCUMENT
    $htm = file_get_contents('http://www.topocentras.lt/Telefonai-Navigacijos/Ismanieji-telefonai-GSM/');

    // FIRST SIGNAL STRING
    $ssa = <<<EOD
    <div class="pages">
    EOD;

    // LAST SIGNAL STRING
    $ssz = <<<EOD
    <div class="pages bottom-locator">
    EOD;

    // PRODUCT SEPARATORS
    $ssp = <<<EOD
    <div class="product-links clear">
    EOD;

    // MINIMIZE WHITESPACE
    $htm = preg_replace('/\s\s+/', ' ', $htm);

    // USE SIGNAL STRINGS TO DISCARD UNWANTED PARTS OF THE HTML DOCUMENT
    $arr = explode($ssa, $htm);
    $arr = explode($ssz, $arr[1]);

    // LOCATE THE PRODUCTS DETAIL
    $arr = explode($ssp, $arr[0]);
    unset($arr[0]);

    echo '<pre>';

    foreach ($arr as $str)
    {
        // GET TITLE
        $xyz = explode('title="', $str);
        $xyz = explode('">', $xyz[1]);
        $title = $xyz[0];

        // GET PRICE
        $xyz = explode('<strong class="price">', $str);
        $xyz = explode('<sup', $xyz[1]);
        $price = $xyz[0];

        // GET LINK
        $xyz = explode('<div class="list-ref">', $str);

        // AT END OF FILE
        if (empty($xyz[1])) break;

        $xyz = explode('</a>', $xyz[1]);
        $link  = trim($xyz[0]) . '</a>';

        // SHOW THE EXTRACTED DATA ELEMENTS
        echo PHP_EOL . 'TITLE: ' . $title;
        echo PHP_EOL . 'PRICE: ' . $price;
        echo PHP_EOL . 'AHREF: ' . htmlentities($link);
        echo PHP_EOL;
    }
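
As an alternative to splitting on signal strings, PHP's bundled DOM extension can parse the markup and query it by element and attribute. The sketch below is not the accepted solution's method. It runs against a small hypothetical fragment: only the `price` class name comes from this thread, while the `product` class, the element nesting, and the `extractProducts` function name are assumptions made for illustration.

```php
<?php
// Hypothetical markup: only the class name "price" comes from the thread;
// the "product" wrapper and the nesting are assumptions.
$html = <<<HTML
<ul>
  <li class="product">
    <a href="/phone-a.html" title="Phone A">Phone A</a>
    <strong class="price">499<sup>99</sup></strong>
  </li>
  <li class="product">
    <a href="/phone-b.html" title="Phone B">Phone B</a>
    <strong class="price">299<sup>00</sup></strong>
  </li>
</ul>
HTML;

function extractProducts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $products = [];

    // Select each product container first, then search inside it.
    foreach ($xpath->query('//li[@class="product"]') as $item) {
        $anchor = $xpath->query('.//a[@title]', $item)->item(0);
        // First text node of the price element, i.e. the part before <sup>.
        $price = $xpath->query('.//strong[@class="price"]/text()', $item)->item(0);
        if ($anchor === null || $price === null) {
            continue; // skip blocks that do not have the expected shape
        }
        $products[] = [
            'title' => $anchor->getAttribute('title'),
            'link'  => $anchor->getAttribute('href'),
            'price' => trim($price->nodeValue),
        ];
    }
    return $products;
}

print_r(extractProducts($html));
```

The key design choice is to query the container element first and then look for the title, price, and link inside each container. The expression `@class="product"` matches only an exact class attribute, so a real page that puts several classes on one element would need something like `contains(@class, "product")` instead.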


Author Comment

by:Nekasas
ID: 40034134
Thank you very much. It would be wonderful if you could also show how to get the other pages. I mean the same website, but I want the phone information from every page, so I need pagination too.

Expert Comment

by:Ray Paseur
ID: 40034277
"... wonderful if you can write to how to get another page."
That seems like a separate question, but in reality it's not a "question" so much as a requirement for application development.  You should consider hiring a professional application developer if you want to pursue this line of business.  A better approach would be to contact the web publisher and ask them to expose an API that gives you the data.  That will lead to a better web application and it will also keep you out of legal trouble!

Thanks for the points and thanks for using EE, ~Ray
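
For readers who still need to walk several result pages, the general shape of a pagination loop can be sketched as follows. This is a sketch under stated assumptions, not code tested against the site in question: the `class="next"` marker, the function names `findNextLink` and `crawlPages`, and the in-memory `$fakeSite` array are all hypothetical, and the real page would have to be inspected to find its actual "next page" link.

```php
<?php
// Assumption: the "next page" link looks like <a class="next" href="...">.
// This class name is hypothetical; inspect the real page to find the actual marker.
function findNextLink(string $html): ?string
{
    if (preg_match('/<a\s+class="next"\s+href="([^"]+)"/', $html, $m)) {
        return html_entity_decode($m[1]);
    }
    return null;
}

// $fetch is any callable that returns the HTML for a URL, or false on failure
// (for a live site this could wrap file_get_contents).
function crawlPages(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $pages = [];
    $seen = [];
    $url = $startUrl;
    while ($url !== null && !isset($seen[$url]) && count($pages) < $maxPages) {
        $seen[$url] = true;      // guard against link cycles
        $html = $fetch($url);
        if ($html === false) {
            break;               // stop on a failed download
        }
        $pages[$url] = $html;
        $url = findNextLink($html);
    }
    return $pages;
}

// In-memory stand-in for the website, so the sketch runs without network access.
$fakeSite = [
    '/list?p=1' => '<a class="next" href="/list?p=2">Next</a>',
    '/list?p=2' => '<a class="next" href="/list?p=3">Next</a>',
    '/list?p=3' => '<p>last page, no next link</p>',
];
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? false;
};

$pages = crawlPages('/list?p=1', $fetch);
echo implode(PHP_EOL, array_keys($pages)), PHP_EOL;
```

Passing the download step in as a callable keeps the loop testable offline, and the `$maxPages` limit together with the `$seen` set keeps the loop from running forever if the site links pages in a circle.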

Author Comment

by:Nekasas
ID: 40034291
I am a student and I need this for my bachelor's thesis, so the professional application developer has to be me, with help from sites like this one or Stack Overflow.

Author Comment

by:Nekasas
ID: 40034316
Sorry for the duplicate comment, but I want to ask about your code. How can I cut the link down so that I get only the link itself, without the title? You are using this

    $xyz = explode('<sup', $xyz[1]);

to show where to start, but how do I show where to stop?
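
In the accepted solution each value is cut out with a pair of `explode()` calls: the first splits at the start marker and keeps what follows it, and the second splits that remainder at the stop marker and keeps what precedes it. A minimal sketch of that pattern as a reusable helper is below. The function name `between` and the sample `$link` fragment are hypothetical, chosen only to illustrate the idea.

```php
<?php
// Return the text between the first $start and the next $stop,
// or null if either marker is missing.
function between(string $haystack, string $start, string $stop): ?string
{
    $parts = explode($start, $haystack, 2);   // [before start, after start]
    if (count($parts) < 2) {
        return null;
    }
    $parts = explode($stop, $parts[1], 2);    // [wanted text, after stop]
    if (count($parts) < 2) {
        return null;
    }
    return $parts[0];
}

// Hypothetical fragment shaped like the $link value in the accepted solution.
$link = '<a href="/phone-a.html" title="Phone A">Phone A</a>';

echo between($link, 'href="', '"'), PHP_EOL;   // prints /phone-a.html
echo between($link, 'title="', '"'), PHP_EOL;  // prints Phone A
```

So "where to start" is the first delimiter (`href="`) and "where to stop" is the second one (the closing `"`). The `null` return avoids undefined-index notices when a marker is absent.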