Solved

curl php gurus

Posted on 2012-08-16
Last Modified: 2013-04-07
Hi,

We exported products from Y store, but the export doesn't have an option to include categories.

I have the URLs for all the product pages in a txt file.

Is it possible to grab the category using PHP cURL? E.g., see the URL below (remove the word "(remove)").

http://www.vinyl(remove)disorder.com/01faceh5.html

Output:
Col A (URL) | Col B (Category)
http://www.vinyl(remove)disorder.com/01faceh5.html | Home > Wall Quotes > Holiday Quotes > Halloween Quotes

PS: When replying, please don't use the URL (change it to VD, please).

Thanks
Question by:magento
17 Comments
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 38303205
What is the question?  Do you want to take a string of characters and delete some of the characters?  If that is the question, this is the general format of the solution.

$url = 'http://www.vinyl(remove)disorder.com/01faceh5.html';
$new = str_replace('(remove)', '', $url);  // replace with an empty string; passing NULL is deprecated in PHP 8.1+


Does that help? ~Ray
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38303812
I think I understand the question.
You want to curl each of the URLs in the list and pull the breadcrumb from each page.

Here is some code that does what you want - it assumes a \n separated list of URLs in a file urls.txt
<?php

$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60);
  $page = get_page($u);
  $category = parse_category($page);
  echo $u . ',' . $category . "<br/>";
}

function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,$url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec ($ch);

  curl_close ($ch);
  return $server_output;
}
?>


0
 
LVL 5

Author Comment

by:magento
ID: 38305045
Hi Ray,

Thanks for the code.

Exactly what I'm looking for, except I need this in an output file.

I can copy the output and insert it into Excel for formatting, but the URL list is ~50k, so it would be good to have the output written to a file as well.

Thanks
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 38305187
You can write an output file with file_put_contents().  If you have an array and you want to make a text-type file you can use implode(PHP_EOL) to collapse the array into a string.
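
A minimal sketch of the file_put_contents() plus implode(PHP_EOL) approach described above. The two rows and the output path are made up for illustration; in the real script the rows would be collected inside the scraping loop.

```php
<?php
// Hypothetical rows; the real ones would come from the URL loop.
$rows = [
    'http://example.com/a.html,Home > Animals > Dogs',
    'http://example.com/b.html,Home > Sports > Hunting',
];

// Assumed output location, chosen only so the sketch runs anywhere.
$outFile = sys_get_temp_dir() . '/categories_example.txt';

// Collapse the array into one string, one row per line, and write it in one call.
file_put_contents($outFile, implode(PHP_EOL, $rows) . PHP_EOL);
```

Building the whole array first keeps the code short, but it also holds every row in memory until the end, which is worth keeping in mind for a ~50k URL list.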
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38305218
@magento - just to clarify are you following Ray_Paseur's code (the first post) or mine (second post)?

I understood your post to mean that you want to curl the pages and pull the category value from those pages - and the (remove) thing was a way to protect the URLs in this post?

Or am I missing something?
0
 
LVL 5

Author Comment

by:magento
ID: 38305652
Hi ,

I'm sorry .

My reply  is to JulianH.

Thanks
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38305736
You can use Ray's suggestion and put the items into an array and then do a file_put_contents - or do it this way.
<?php
$fp = fopen("output.txt" ,"wt");
$urls = explode("\n", file_get_contents("urls.txt"));
foreach($urls as $u) {
  set_time_limit(60);
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n");
}
fclose($fp);

function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,$url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec ($ch);

  curl_close ($ch);
  return $server_output;
}
?>


0
 
LVL 5

Author Comment

by:magento
ID: 38307074
Hi ,

I tried the code, but it's not writing the contents properly.

I opened the file in Excel and the values come out in different rows.

I need $u in Col A and the $category value in Col B.

Thanks
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38307714
Try this - script was not stripping "\r\n" from end of lines read from urls file
<?php

$fp = fopen("output.txt" ,"wt");
$urls = file("urls.txt", FILE_IGNORE_NEW_LINES);
foreach($urls as $u) {
  set_time_limit(60);
  $page = get_page($u);
  $category = parse_category($page);
  fputs($fp, $u . ',' . $category . "\n");
}
fclose($fp);

function parse_category(&$page)
{
  $category = 'not found';
  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXpath($dom);
  $result = $xpath->query('//div[@class="scBreadcrumbs"]');
  if ($result->length > 0) {
    $category = $result->item(0)->nodeValue;
  }
  return $category;
}

function get_page($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,$url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  $server_output = curl_exec ($ch);

  curl_close ($ch);
  return $server_output;
}
?>
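
If the goal is a file that Excel splits cleanly into two columns, one option (not what the script above does) is to write the rows with fputcsv(), which quotes fields containing commas or spaces. This is only a sketch: the rows and the output path are hypothetical, and collapsing whitespace in the category is an added assumption to keep each record on one line.

```php
<?php
// Hypothetical URL/category pairs standing in for the scraped results.
$rows = [
    ['http://example.com/a.html', 'Home > Animals > Dogs'],
    ['http://example.com/b.html', "Home > Cups & Mugs >\n Customize, engrave"],
];

// Assumed output location, chosen only so the sketch runs anywhere.
$csvFile = sys_get_temp_dir() . '/categories_example.csv';
$fp = fopen($csvFile, 'w');
foreach ($rows as $row) {
    // Collapse newlines and repeated whitespace so each record stays on one row.
    $row[1] = trim(preg_replace('/\s+/', ' ', $row[1]));
    // Delimiter, enclosure and escape are passed explicitly.
    fputcsv($fp, $row, ',', '"', '\\');
}
fclose($fp);
```

Saving the file with a .csv extension lets Excel open it directly with the URL in column A and the category in column B.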


0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 38307727
FWIW, trim() or FILE_IGNORE_NEW_LINES is almost always necessary since the line endings from different OS may consist of \r, \n, or \r\n.  In the instant context, PHP_EOL will generate a context-appropriate line ending; however, once the data is sent to a different context, the line endings may not be right.  It's a Mac/Windows/Linux thing...
http://php.net/manual/en/function.trim.php

Best to all, ~Ray
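
A small sketch of the line-ending point, assuming a URL list that arrives with mixed \r\n, \r and \n endings. The sample string is invented; a real script would read it from urls.txt.

```php
<?php
// Simulated file contents with Windows, old Mac and Unix line endings mixed together.
$raw = "http://example.com/a.html\r\nhttp://example.com/b.html\rhttp://example.com/c.html\n\n";

// Split on any of the three styles, trim each entry, and drop empty lines.
$urls = array_values(array_filter(
    array_map('trim', preg_split('/\r\n|\r|\n/', $raw)),
    'strlen'
));

print_r($urls);
```

Dropping empty entries also avoids sending a request for the blank line that often ends such files.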
0
 
LVL 5

Author Comment

by:magento
ID: 38328073
Hi JulianH,

The output comes through for most of the URLs, except a few.

http://www.vinyl(remove)disorder.com/dogpawns001.html --> Working fine
http://www.vinyl(remove)disorder.com/helmet.html --> Not working

E.g., for the 1st page (source below) the category comes through, and for the 2nd it comes out as ">&>>>>".

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="animals.html">Animals</a> &gt; <a href="animals-dogs.html">Dogs</a> &gt; Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</div></div><h1 class="pagebanner">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Dogs</span><br /><span class="fn">Dog Paw NS001 Animal animals xxx Decal Wall Art Sticker Mural</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.

<a href="http://twitter.com/#!/xxxDisorder" target="_blank"><img src="http://lib.store.yahoo.net/lib/yhst/twitter.png" alt="Follow us on Twitter" /></a></div></div><div id="bd"><div class="bdinner"><div id="yui-main"><div class="yui-b"><div class="yui-g"><div class="outer-breadcrumbs"><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="beermugs.html">Cups & Mugs</a> &gt; <a href="cuyodrglorco.html">Customize your Drinking Glasses or  Coffee Mugs with Any custom text! </a> &gt; <a href="sports.html">Sports</a> &gt; <a href="sports-hunting.html">Hunting</a> &gt; Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</div></div><h1 class="pagebanner">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</h1><div class="hproduct"><span class="brand"></span><br /><span class="category">Hunting</span><br /><span class="fn">Hunting In A Tree Hunter Sport Sports xxx Decal Stickers 001</span><br /><span class="description">Apply this sticker to your car, truck, boat or anywhere you want... We can make this in a variety of sizes and colors. If you want a custom larger or smaller size please just drop us an email.


0
 
LVL 60

Accepted Solution

by:
Julian Hansen earned 2000 total points
ID: 38328200
I can't see anything glaringly obvious that could be causing it - however, the script does suppress warnings on the DOM load, and there are a lot of warnings because these pages are far from being valid XML docs. It is possible that something in the page is causing the error.

From a scripting perspective the script is correct and will work on a valid XML doc - what you need to do is try to find out what is different between those two pages.

I did a file compare and there are not that many differences. What I would do is save a working page and this page. You can do this by adding the following line to the code:
  ...
  $page = get_page($u);
  file_put_contents('filename.txt', $page);
  ...


Run that for a working URL and the broken one.
Then use this script to test the output
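<?php
// Load a page previously saved with file_put_contents() and test the breadcrumb query on it.
$page = file_get_contents('filename.txt');
$dom = new DOMDocument();
@$dom->loadHTML($page);   // @ suppresses the warnings raised by malformed markup
$xpath = new DOMXpath($dom);
$result = $xpath->query('//div[@class="scBreadcrumbs"]');
echo $result->length;     // number of matching breadcrumb divs; 0 means none was found

// Only dump the value when a match exists; item(0) is null otherwise.
if ($result->length > 0) {
  fnDump($result->item(0)->nodeValue);
}

function fnDump($obj)
{
  echo "<pre>";
  print_r($obj);
  echo "</pre>";
}
?>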


For each line that is different, replace that line in the broken page with the corresponding line from the working page, one line at a time, testing after each change. At some point the broken page will work - which will tell you what line is causing the problem.
You can then further narrow it down by making small changes in the offending line until you find the problem.
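
Since the script suppresses warnings on the DOM load, one way to see what the parser complains about is libxml_use_internal_errors(), which collects the messages instead of emitting them. The sketch below is only an illustration: the page fragment is hypothetical, the unescaped "&" is just an example of markup that may produce a message, and whether any message appears depends on the installed libxml version.

```php
<?php
// Hypothetical fragment; the bare "&" is only an example of imperfect markup.
$page = '<html><body><div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; '
      . '<a href="beermugs.html">Cups & Mugs</a> &gt; Item</div></body></html>';

libxml_use_internal_errors(true);   // collect parser messages instead of raising warnings
$dom = new DOMDocument();
$dom->loadHTML($page);
$errors = libxml_get_errors();      // array of LibXMLError objects (possibly empty)
libxml_clear_errors();

foreach ($errors as $e) {
    echo "line {$e->line}, col {$e->column}: " . trim($e->message) . PHP_EOL;
}
```

Run against a saved copy of a working page and the broken page, the two message lists can be compared to see whether the broken page triggers anything the working one does not.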
0
 
LVL 5

Author Comment

by:magento
ID: 38328246
Hi ,

I have created a file with 2 URLs (1 working and 1 not working) and ran your code; it shows 0 as the output on the screen.

I don't understand what that means. Is the problem with the pages?

Thanks
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38328284
Did you read my previous post?

There seems to be something specific with that particular page.
What you need to do is

1. Save a working page
2. Save the page that does not work
3. Compare the two and find the differences
4. Line by line replace the lines in the non-working page with the corresponding line from the working page testing in between each change.

At some point the non-working page will work. That will tell you what line the problem is on - you can then narrow from there.

The script I posted above was to help you with the above. If you save each page to a file and then run the second script against those pages it will dump the category for that page.
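
Step 3 above can be roughed out in code. This is a naive sketch, not a real diff: the helper name diff_lines() is made up, it compares the two pages position by position, and it will over-report once one page has extra lines. The two sample strings are invented stand-ins for the saved pages.

```php
<?php
// Return the 1-based line numbers at which two texts differ (naive positional compare).
function diff_lines(string $a, string $b): array
{
    $la = preg_split('/\r\n|\r|\n/', $a);
    $lb = preg_split('/\r\n|\r|\n/', $b);
    $diff = [];
    $max = max(count($la), count($lb));
    for ($i = 0; $i < $max; $i++) {
        // A line missing from the shorter text counts as a difference.
        if (($la[$i] ?? null) !== ($lb[$i] ?? null)) {
            $diff[] = $i + 1;
        }
    }
    return $diff;
}

// Invented stand-ins for the saved working and non-working pages.
$working = "<div>\n<a>Animals</a>\n</div>";
$broken  = "<div>\n<a>Cups & Mugs</a>\n</div>";
print_r(diff_lines($working, $broken));
```

The reported line numbers give the candidates to swap one at a time, as described in step 4.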
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 38329060
Can we step back from the technical details here and take a higher view of the problem?  The right way to assess and solve the problem is to look at the inputs and outputs.

Please post a sample of the input you have and show us exactly, line-for-line what output you want to get from that input.  When we see these things, we can show you ways that the programming can be written to make that transformation.

Then if the programmatic transformation does not work for all test cases we can move on to the next test case and, using the first and second test cases together, we can modify the programming so it works for both test cases.  And we can add test cases one at a time until we have coverage of all or most of the data set.

This article explains the thinking behind this sort of process.  It's not rocket science; it is simple, plodding grunt work, testing until the results are achieved.  The TDD strategy brings discipline to our thought processes.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
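
The test-case idea can be sketched as a small table of inputs and expected outputs run against the parsing function. The function below is a simplified variant of the parse_category() posted earlier in the thread (it trims the result and collects libxml errors instead of using @), and both test cases are hypothetical.

```php
<?php
// Simplified variant of the thread's parse_category(); returns 'not found' when no breadcrumb matches.
function parse_category(string $page): string
{
    libxml_use_internal_errors(true);   // keep malformed-markup warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($page);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    $result = $xpath->query('//div[@class="scBreadcrumbs"]');
    return $result->length > 0 ? trim($result->item(0)->nodeValue) : 'not found';
}

// Hypothetical test cases: input HTML => expected category string.
$cases = [
    '<div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; <a href="animals.html">Animals</a> &gt; Dogs</div>'
        => 'Home > Animals > Dogs',
    '<p>No breadcrumb here</p>' => 'not found',
];

$failures = 0;
foreach ($cases as $html => $expected) {
    $actual = parse_category($html);
    if ($actual !== $expected) {
        $failures++;
        echo "FAIL: expected [$expected], got [$actual]" . PHP_EOL;
    }
}
echo ($failures === 0 ? 'all cases pass' : "$failures case(s) failed") . PHP_EOL;
```

Each page that misbehaves can be added as a new case, so a change that fixes one page is checked against all the earlier ones.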
0
 
LVL 5

Author Comment

by:magento
ID: 38329071
Thanks Ray , i am checking it.
0
 
LVL 60

Expert Comment

by:Julian Hansen
ID: 38329293
@Ray - I have already been through some of the testing on this. The script is correct for what the author wants and is performing as expected.

The problem comes from the fact that he essentially wants to screen scrape another site - the solution uses XML parsing of curl'd pages that are not properly formed XML documents.

Although the source pages are generated on the target site, there appears to be this one anomaly where the data in the page is present, but the XML parsing is not pulling it out.

Given that each page is almost identical to the others - I believe it is something specific in the content of the problem page that is the issue.

Finding it is fairly straightforward, but it involves a lot of grunt work which I felt the author would be better placed to pursue. However, I believe there is an understanding gap between my suggested approach to finding the culprit and his interpretation.
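
As a stopgap while the culprit is being tracked down, a different technique from the DOM approach used in this thread is to pull the breadcrumb out with a regular expression and strip the tags. This is a sketch only: breadcrumb_fallback() is a made-up name, it assumes the breadcrumb div contains no nested div, and regex scraping is brittle if the markup changes.

```php
<?php
// Fallback extraction: regex match on the breadcrumb div, then strip tags and decode entities.
function breadcrumb_fallback(string $page): string
{
    if (!preg_match('~<div class="scBreadcrumbs">(.*?)</div>~s', $page, $m)) {
        return 'not found';
    }
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace so the result fits on one line.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Shortened, hypothetical version of the breadcrumb markup from the non-working page.
$page = '<div class="scBreadcrumbs"><a href="index.html">Home</a> &gt; '
      . '<a href="beermugs.html">Cups & Mugs</a> &gt; <a href="sports.html">Sports</a> &gt; Hunting In A Tree</div>';
echo breadcrumb_fallback($page) . PHP_EOL;
```

It could be called only for URLs where the DOM-based parse returns 'not found' or an obviously wrong value.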
0
