magento

asked on

curl php gurus

Hi,

We exported products from Y store, but it doesn't have an option to get the categories.

I have the URLs for all the product pages in a txt file.

Is it possible to grab the category using PHP cURL? E.g., see the URL below (remove the word "(remove)" from it).

http://www.vinyl(remove)disorder.com/01faceh5.html

O/p:

Col A (URL)                                          Col B (Category)
http://www.vinyl(remove)disorder.com/01faceh5.html   Home > Wall Quotes > Holiday Quotes > Halloween Quotes

PS: While replying, please don't use the URL (change it to VD, please).

Thanks
Ray Paseur

What is the question?  Do you want to take a string of characters and delete some of them?  If so, this is the general format of the solution.

$url = 'http://www.vinyl(remove)disorder.com/01faceh5.html';
$new = str_replace('(remove)', '', $url); // an empty string as the replacement deletes the marker


Does that help? ~Ray
Julian Hansen
I think I understand the question.
You want to curl each of the URLs in the list and pull the breadcrumb from the page.

Here is some code that does what you want; it assumes a \n-separated list of URLs in a file urls.txt.
<?php

// Read the newline-separated URL list and process each entry
$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60); // reset the timeout for each fetch
  $page = get_page($u);
  $category = parse_category($page);
  echo $u . ',' . $category . "<br/>";
}

// Pull the breadcrumb text out of the page's scBreadcrumbs div
function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page); // @ suppresses warnings from malformed HTML
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

// Fetch a page with cURL and return the body as a string
function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec($ch);

  curl_close($ch);
  return $server_output;
}
?>


magento

ASKER

Hi Ray,

Thanks for the code.

It's exactly what I'm looking for, except I need the output in a file.

I can copy the output and paste it into Excel for formatting, but the URL list is ~50k, so it would be good to have the output written to a file as well.

Thanks
You can write an output file with file_put_contents().  If you have an array and you want to make a text-type file, you can use implode(PHP_EOL, $array) to collapse the array into a string.
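For example, a minimal sketch of that idea (the $lines data and file names here are made up for illustration):

<?php
// Hypothetical rows; in practice these would be built up in the fetch loop
$lines = array(
    'http://www.example.com/a.html,Home > Wall Quotes',
    'http://www.example.com/b.html,Home > Animals',
);

// implode() collapses the array into one string with a line ending between rows
file_put_contents('output.txt', implode(PHP_EOL, $lines));
?>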
@magento - just to clarify: are you following Ray_Paseur's code (the first post) or mine (the second post)?

I understood your post to mean that you want to curl the pages and pull the category value from those pages, and that the (remove) thing was a way to protect the URLs in this post?

Or am I missing something?
magento

ASKER

Hi,

I'm sorry.

My reply is to JulianH.

Thanks
You can use Ray's suggestion and put the items into an array and then do a file_put_contents(), or do it this way.
<?php
$fp = fopen("output.txt", "wt"); // open the output file for writing
$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60); // reset the timeout for each fetch
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n"); // one "url,category" row per line
}
fclose($fp);

// Pull the breadcrumb text out of the page's scBreadcrumbs div
function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page); // @ suppresses warnings from malformed HTML
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

// Fetch a page with cURL and return the body as a string
function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec($ch);

  curl_close($ch);
  return $server_output;
}
?>


magento

ASKER

Hi,

I tried the code, but it's not writing the contents properly.

I opened the file in Excel and the data ends up on different rows.

I need $u in Col A and the $category value in Col B.

Thanks
Try this - the script was not stripping "\r\n" from the ends of the lines read from the urls file.
<?php

$fp = fopen("output.txt", "wt");
// file() with FILE_IGNORE_NEW_LINES strips the trailing \r\n from each URL
$urls = file("urls.txt", FILE_IGNORE_NEW_LINES);
foreach($urls as $u) {
  set_time_limit(60); // reset the timeout for each fetch
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n");
}
fclose($fp);

// Pull the breadcrumb text out of the page's scBreadcrumbs div
function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page); // @ suppresses warnings from malformed HTML
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

// Fetch a page with cURL and return the body as a string
function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec($ch);

  curl_close($ch);
  return $server_output;
}
?>

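A side note, in case any category value ever contains a comma: fputcsv() quotes fields for you. A minimal sketch of the same write loop (the $rows data here is made up for illustration; in the real script each row would be array($u, $category)):

<?php
$rows = array(
    array('http://www.example.com/a.html', 'Home > Wall Quotes'),
    array('http://www.example.com/b.html', 'Home > Dogs, Cats & Pets'), // embedded comma
);

$fp = fopen('output.csv', 'w');
foreach ($rows as $row) {
    fputcsv($fp, $row); // quotes any field that contains a comma
}
fclose($fp);
?>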

FWIW, trim() or FILE_IGNORE_NEW_LINES is almost always necessary, since the line endings from different OSes may consist of \r, \n, or \r\n.  In the instant context, PHP_EOL will generate a context-appropriate line ending; however, once the data is sent to a different context, the line endings may not be right.  It's a Mac/Windows/Linux thing...
http://php.net/manual/en/function.trim.php
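
A rough standalone illustration of the point (not part of the scripts above):

<?php
// A line read from a file may end in "\n", "\r\n", or a bare "\r",
// depending on which OS produced the file
$raw = "http://www.example.com/a.html\r\n";

$clean = trim($raw); // strips any mix of \r, \n, spaces, and tabs from both ends

echo $clean; // the trailing "\r\n" is gone
?>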

Best to all, ~Ray
magento

ASKER

Hi JulianH,

The output is coming through for most of the URLs, except a few.

http://www.vinyl(remove)disorder.com/dogpawns001.html --> Working fine
http://www.vinyl(remove)disorder.com/helmet.html --> Not working

E.g., for the first page (page source below) the category comes through, but for the second it comes out as ">&>>>>".

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="animals.html">Animals</a> &gt; <a href="animals-dogs.html">Dogs</a> &gt; Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</div></div><h1 class="pagebanner">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Dogs</span><br /><span class="fn">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="beermugs.html">Cups & Mugs</a> &gt; <a href="cuyodrglorco.html">Customize your Drinking Glasses or  Coffee Mugs with Any custom text! </a> &gt; <a href="sports.html">Sports</a> &gt; <a href="sports-hunting.html">Hunting</a> &gt; Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</div></div><h1 class="pagebanner">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Hunting</span><br /><span class="fn">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.


ASKER CERTIFIED SOLUTION
Julian Hansen
(The accepted solution is visible to Experts Exchange members only and is not shown here.)
magento

ASKER

Hi,

I created a file with 2 URLs (one working and one not working) and ran your code; it shows the output as 0 on the screen.

I don't understand what that means. Is the problem with the pages?

Thanks
Did you read my previous post?

There seems to be something specific about that particular page.
What you need to do is:

1. Save a working page
2. Save the page that does not work
3. Compare the two and find the differences
4. Line by line, replace the lines in the non-working page with the corresponding lines from the working page, testing between each change

At some point the non-working page will start working. That will tell you which line the problem is on; you can then narrow it down from there.

The script I posted above was meant to help you with this. If you save each page to a file and then run the second script against those saved pages, it will dump the category for each page; a minimal sketch of that kind of harness is below.
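
A sketch only; the file names working.html and broken.html are assumptions, and parse_category() is the same function as in the scripts above:

<?php
// Run the breadcrumb parser against locally saved copies of the two pages
foreach (array('working.html', 'broken.html') as $file) {
    $page = file_get_contents($file);
    echo $file . ': ' . parse_category($page) . "\n";
}

function parse_category(&$page)
{
    $category = 'not found';
    $dom = new DOMDocument();
    @$dom->loadHTML($page); // @ suppresses warnings from malformed HTML
    $xpath = new DOMXpath($dom);
    $result = $xpath->query('//div[@class="scBreadcrumbs"]');
    if ($result->length > 0) {
        $category = $result->item(0)->nodeValue;
    }
    return $category;
}
?>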
Can we step back from the technical details here and take a higher view of the problem?  The right way to assess and solve the problem is to look at the inputs and outputs.

Please post a sample of the input you have and show us exactly, line-for-line what output you want to get from that input.  When we see these things, we can show you ways that the programming can be written to make that transformation.

Then if the programmatic transformation does not work for all test cases we can move on to the next test case and, using the first and second test cases together, we can modify the programming so it works for both test cases.  And we can add test cases one at a time until we have coverage of all or most of the data set.

This article explains the thinking behind this sort of process.  It's not rocket science; it is simple, plodding grunt work, testing until the results are achieved.  The TDD strategy brings discipline to our thought processes.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
magento

ASKER

Thanks Ray, I am checking it.
@Ray - I have already been through some of the testing on this. The script is correct for what the author wants and is performing as expected.

The problem comes from the fact that he essentially wants to screen-scrape another site; the solution uses XML parsing of cURLed pages that are not well-formed XML documents.

Although the source pages are generated on the target site, there appears to be this one anomaly where the data is present in the page, yet the XML parsing is not pulling it out.

Given that each page is substantially identical to the others, I believe the issue is something specific in the content of the problem page.

Finding it is fairly straightforward, but it involves a lot of grunt work, which I felt the author would be better placed to pursue. However, I believe there is an understanding gap between my suggested approach to finding the culprit and his interpretation.
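
One way to chase down that kind of parsing anomaly (a sketch only, not the accepted solution): DOMDocument reports the warnings that the @ operator hides through libxml, so surfacing them can point at the markup the parser choked on. The file name broken.html is an assumption.

<?php
// Surface the parse warnings that @$dom->loadHTML() normally hides
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(file_get_contents('broken.html'));

foreach (libxml_get_errors() as $error) {
    // line/column of each place where the parser had to recover
    echo 'line ' . $error->line . ', col ' . $error->column . ': ' . trim($error->message) . "\n";
}
libxml_clear_errors();
?>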