curl php gurus

Hi,

We exported products from our Y store, but it doesn't have an option to get the categories.

I have the URL for every product page in a txt file.

Is it possible to grab the category using PHP cURL? E.g., see the URL below (remove the word "(remove)" from it).

http://www.vinyl(remove)disorder.com/01faceh5.html

Expected output:

Col A (URL): http://www.vinyl(remove)disorder.com/01faceh5.html
Col B (Category): Home > Wall Quotes > Holiday Quotes > Halloween Quotes

PS: While replying, please don't use the URL (change it to VD, please).

Thanks
Asked by magento

Julian Hansen commented:
I can't see anything glaringly obvious that could be causing it. However, the script does suppress warnings on the DOM load, and there are a lot of warnings, as these pages are far from being valid XML docs. It is possible that something in the page is causing the error.

From a scripting perspective the script is correct and will work on a valid XML doc. What you need to do is try to find out what is different between those two pages.

I did a file compare and there are not that many differences. What I would do is save a working page and this page (you can do this by adding the following line to the code):
  ...
  $page = get_page($url);
  file_put_contents('filename.txt', $page);
  ...


Run that for a working URL and for the broken one.
Then use this script to test the output:
<?php
// Load a saved page and test the breadcrumb extraction against it
$page = file_get_contents('filename.txt');
$dom = new DOMDocument();
@$dom->loadHTML($page); // suppress warnings - the pages are not valid XML
$xpath = new DOMXpath($dom);
$result = $xpath->query('//div[@class="scBreadcrumbs"]');
echo $result->length; // 0 means the breadcrumb div was not found

if ($result->length > 0) {
  fnDump($result->item(0)->nodeValue);
}

function fnDump(&$obj)
{
  echo "<pre>";
  print_r($obj);
  echo "</pre>";
}
?>


For each line that is different, change that line, one at a time, from the working page to the page that does not work. At some point the broken page will work, which will tell you what line is causing the problem.
You can then further narrow it down by making small changes in the offending line until you find the problem.
 
Ray Paseur commented:
What is the question?  Do you want to take a string of characters and delete some of the characters?  If that is the question, this is the general format of the solution.

$url = 'http://www.vinyl(remove)disorder.com/01faceh5.html';
$new = str_replace('(remove)', '', $url); // replace the unwanted substring with nothing


Does that help? ~Ray
 
Julian Hansen commented:
I think I understand the question.
You want to curl each of the URLs in the list and pull the breadcrumb from each page.

Here is some code that does what you want. It assumes a \n-separated list of URLs in a file called urls.txt:
<?php

// Read the \n-separated URL list and process each page
$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60); // reset the timeout for each page fetched
  $page = get_page($u);
  $category = parse_category($page);
  echo $u . ',' . $category . "<br/>";
}

// Extract the breadcrumb text from the scBreadcrumbs div
function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page); // suppress warnings - the pages are not valid XML
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

// Fetch a page with cURL and return it as a string
function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec($ch);

  curl_close($ch);
  return $server_output;
}
?>

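If some URLs redirect or respond slowly, the fetch can come back empty. A few extra options inside get_page() may help; these are standard cURL settings, not something the script above relies on - just a hedged sketch:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up on a page after 30 seconds
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // some hosts reject requests with no user agent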


 
magento (Author) commented:
Hi Ray,

Thanks for the code.

Exactly what I'm looking for, except I need this in an output file.

I can copy the output and paste it into Excel for formatting, but the URL list is ~50k, so it would be good to have the output written to a file as well.

Thanks
 
Ray Paseur commented:
You can write an output file with file_put_contents(). If you have an array and you want to make a text-type file, you can use implode(PHP_EOL, $array) to collapse the array into a string.
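
A minimal sketch of that idea - the $results data and the file name are placeholders, not your real data:

<?php
// Collect one CSV row per URL, then write the whole file in one call
$results = array(
  'http://www.VD.com/page1.html' => 'Home > Category A',
  'http://www.VD.com/page2.html' => 'Home > Category B',
);
$rows = array();
foreach ($results as $url => $category) {
  $rows[] = $url . ',' . $category;
}
file_put_contents('output.txt', implode(PHP_EOL, $rows));
?>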
 
Julian Hansen commented:
@magento - just to clarify: are you following Ray_Paseur's code (the first post) or mine (the second post)?

I understood your post to mean that you want to curl the pages and pull the category value from those pages, and that the (remove) thing was a way to protect the URLs in this post.

Or am I missing something?
 
magento (Author) commented:
Hi ,

I'm sorry.

My reply is to JulianH.

Thanks
 
Julian Hansen commented:
You can use Ray's suggestion and put the items into an array and then do a file_put_contents(), or do it this way:
<?php
// Write each URL and its category to output.txt as comma-separated values
$fp = fopen("output.txt", "wt");
$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60); // reset the timeout for each page fetched
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n");
}
fclose($fp);

// Extract the breadcrumb text from the scBreadcrumbs div
function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page); // suppress warnings - the pages are not valid XML
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

// Fetch a page with cURL and return it as a string
function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec($ch);

  curl_close($ch);
  return $server_output;
}
?>


 
magento (Author) commented:
Hi ,

I tried the code, but it's not writing the contents properly.

I opened the file in Excel and the URL and category come out on different rows.

I need $u in Col A and the $category value in Col B.

Thanks
 
Julian Hansen commented:
Try this - the script was not stripping "\r\n" from the end of the lines read from the urls file:
<?php

$fp = fopen("output.txt" ,"wt");
$urls = file("urls.txt", FILE_IGNORE_NEW_LINES); // read URLs without the trailing line endings
foreach($urls as $u) {
  set_time_limit(60);
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n");
}
fclose($fp);

function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,$url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec ($ch);

  curl_close ($ch);
  return $server_output;
}
?>

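As an aside, if a category value could ever contain a comma itself, fputcsv() would quote the field so Excel keeps it in one column. A minimal replacement for the fputs() line above, not something the script requires:

fputcsv($fp, array($u, $category)); // quotes any field that contains a comma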

 
Ray Paseur commented:
FWIW, trim() or FILE_IGNORE_NEW_LINES is almost always necessary, since the line endings from different operating systems may consist of \r, \n, or \r\n. In the instant context, PHP_EOL will generate a context-appropriate line ending; however, once the data is sent to a different context, the line endings may not be right. It's a Mac/Windows/Linux thing...
http://php.net/manual/en/function.trim.php

Best to all, ~Ray
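
For illustration, a one-line defensive equivalent (the file name is just an example):

<?php
// trim() strips \r and \n from each line no matter which OS wrote the file
$urls = array_map('trim', file('urls.txt'));
print_r($urls);
?>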
 
magento (Author) commented:
Hi JulianH,

The output is coming through for most of the URLs except a few.

http://www.vinyl(remove)disorder.com/dogpawns001.html --> working fine
http://www.vinyl(remove)disorder.com/helmet.html --> not working

E.g., for the 1st page the category comes through (page sources below), and for the 2nd it comes out as ">&>>>>":

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="animals.html">Animals</a> &gt; <a href="animals-dogs.html">Dogs</a> &gt; Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</div></div><h1 class="pagebanner">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Dogs</span><br /><span class="fn">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="beermugs.html">Cups & Mugs</a> &gt; <a href="cuyodrglorco.html">Customize your Drinking Glasses or  Coffee Mugs with Any custom text! </a> &gt; <a href="sports.html">Sports</a> &gt; <a href="sports-hunting.html">Hunting</a> &gt; Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</div></div><h1 class="pagebanner">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Hunting</span><br /><span class="fn">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.


 
magento (Author) commented:
Hi ,

I have created a file with 2 URLs (one working and one not working) and ran your code; it shows the output as 0 on the screen.

I don't understand what that means. Is the problem with the pages?

Thanks
 
Julian Hansen commented:
Did you read my previous post?

There seems to be something specific with that particular page.
What you need to do is

1. Save a working page
2. Save the page that does not work
3. Compare the two and find the differences
4. Line by line, replace the lines in the non-working page with the corresponding lines from the working page, testing after each change.

At some point the non-working page will work. That will tell you which line the problem is on; you can then narrow it down from there.

The script I posted above was to help you with the above. If you save each page to a file and then run the second script against those pages it will dump the category for that page.
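
A minimal sketch of steps 1 and 2, reusing get_page() from the earlier script (URLs shortened to VD as requested; the file names are just examples):

<?php
// Save both pages so they can be diffed with a file-compare tool
file_put_contents('working.html', get_page('http://www.VD.com/dogpawns001.html'));
file_put_contents('broken.html', get_page('http://www.VD.com/helmet.html'));
?>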
 
Ray Paseur commented:
Can we step back from the technical details here and take a higher view of the problem?  The right way to assess and solve the problem is to look at the inputs and outputs.

Please post a sample of the input you have and show us exactly, line-for-line what output you want to get from that input.  When we see these things, we can show you ways that the programming can be written to make that transformation.

Then if the programmatic transformation does not work for all test cases we can move on to the next test case and, using the first and second test cases together, we can modify the programming so it works for both test cases.  And we can add test cases one at a time until we have coverage of all or most of the data set.

This article explains the thinking behind this sort of process.  It's not rocket science; it is simple, plodding grunt work, testing until the results are achieved.  The TDD strategy brings discipline to our thought processes.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
 
magento (Author) commented:
Thanks Ray, I am checking it.
 
Julian Hansen commented:
@Ray - I have already been through some of the testing on this. The script is correct for what the author wants and is performing as expected.

The problem comes from the fact that he essentially wants to screen-scrape another site - the solution uses XML parsing of curl'd pages that are not properly formed XML documents.

Although the source pages are generated on the target site, there appears to be one anomaly where the data in the page is present but the XML parsing is not pulling it out.

Given that each page is substantially identical to the others, I believe it is something specific in the content of the problem page that is the issue.

Finding it is fairly straightforward, but it involves a lot of grunt work, which I felt the author would be better placed to pursue. However, I believe there is an understanding gap between my suggested approach to finding the culprit and his interpretation.