Fernanditos asked:

Getting a portion of HTML code from url to database Using CURL


I need help from an expert with cURL: how can I fetch this HTML table's content and update it in my database where id_tbl = 1?

Page with Table

The entire HTML table, with its content:

<table ... id="ctl00_CP1_grdFDACalendar"> .... </table>

Can someone please help me with the right code? Here is the relevant part of my table schema:
  `id_tbl` int(11) NOT NULL auto_increment,
  `content_tbl` text NOT NULL,
  `lastUpdated_tbl` timestamp NOT NULL default '0000-00-00 00:00:00' on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id_tbl`)


Ray Paseur:

Hi, F.   Happy New Year!  What web site do you want to grab with CURL?
Fernanditos:


Hi Ray, thank you and happy new year for you too :)

I posted the link above:

Thank you for having a look.
I would like to grab only the TABLE with id="ctl00_CP1_grdFDACalendar", not the whole page. Is that possible? Here is what I have so far:
$ch = curl_init();
$timeout = 10; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, '');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);

$pattern = '/<table cellspacing="2" cellpadding="6" align="Center" border="0" id="ctl00_CP1_grdFDACalendar".*?>.*?<\/table>/s';
echo "matches = " . preg_match($pattern, $contents, $matches) . "<br>\n";
if (!empty($matches)) {
    echo $matches[0];
}



hpierson posted an answer (hidden behind the site's sign-up wall).
hpierson: It is great to know that it is possible! Your solution is close, but look at the result:

It stops at the first </table> found. Since every cell in the last column here contains a small nested table, I need to tell my script to keep reading until the 10th </table>.

How can I specify matching up to the 10th </table> instead of stopping at the first one?
I thought you only wanted the first one, since every table should have a unique ID and you only specified one ID.

I'll take another look, it should be a simple change.
Yes, I want all of that table's rows (15). So it should match up to this code:

You can use the SQL REPLACE statement.
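A minimal sketch of that suggestion with mysqli, assuming a hypothetical table name `html_table` and placeholder connection values (the stand-in string takes the place of the HTML captured by the regex):

```php
<?php
// Stand-in for the table HTML captured earlier; the table name
// html_table and the connection values are hypothetical.
$tableHtml = '<table><tr><td>demo</td></tr></table>';

$db = new mysqli('localhost', 'user', 'password', 'mydb');

// REPLACE deletes any row that shares the primary key, then inserts anew,
// so id_tbl = 1 always holds the latest copy of the HTML.
$stmt = $db->prepare('REPLACE INTO html_table (id_tbl, content_tbl) VALUES (1, ?)');
$stmt->bind_param('s', $tableHtml);
$stmt->execute();
$stmt->close();
```

One caveat: REPLACE is a delete-plus-insert, so the schema's ON UPDATE CURRENT_TIMESTAMP clause will not fire; a plain UPDATE keeps lastUpdated_tbl refreshing automatically.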
Replace the original value of $pattern with:

$pattern = '/<div>\s+<table cellspacing="2" cellpadding="6" align="Center" border="0" id="ctl00_CP1_grdFDACalendar".*?>.*?<\/table>\s+<\/div>/s';


If necessary, strip out the enclosing div.

Tested and works:
BTW, PHP is an awful, clumsy way to do it.

Using perl and LWP is far easier. It's specifically written to handle manipulation of web pages, finding elements is simple and clean.

But that's an answer for a different question :)
It's a good answer; if I had posted the question, I would have given you points.

Thank you for the kind words. Since the code change gives Fernanditos EXACTLY what he asked for, and I provided a link to show it works, I expect credit for the answer.
Sure, hpierson, your solution works great and I will award you the points. I'm just trying to get it inserted into my database, as the question mentions. :)
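The thread never shows that last database step, so here is one hedged way to wire hpierson's extraction into the table from the question (the URL, the credentials, and the table name `html_table` are all placeholders):

```php
<?php
// 1. Fetch the page with cURL (the URL here is a placeholder).
$ch = curl_init('http://www.example.com/fda-calendar');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$contents = curl_exec($ch);
curl_close($ch);

// 2. Extract the table with hpierson's pattern.
$pattern = '/<div>\s+<table cellspacing="2" cellpadding="6" align="Center" border="0" id="ctl00_CP1_grdFDACalendar".*?>.*?<\/table>\s+<\/div>/s';
if (!preg_match($pattern, $contents, $matches)) {
    die('Table not found');
}

// 3. Store it where id_tbl = 1; an UPDATE (rather than REPLACE) lets the
//    schema's ON UPDATE CURRENT_TIMESTAMP refresh lastUpdated_tbl for us.
//    Credentials and the table name html_table are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
$stmt = $pdo->prepare('UPDATE html_table SET content_tbl = ? WHERE id_tbl = 1');
$stmt->execute(array($matches[0]));
```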
Great, it's been a pleasure.

Best Regards,

Harry Pierson
It's not all that hard to get this information with PHP.  REGEX may be making it "harder" as well as the fact that the web site is made from invalid HTML.

One thing I have found to make "scraping" easier is that you can use explode() on the HTML string.  Between explode(), strip_tags(), strpos() and substr() you can take just about any HTML page apart and isolate the data elements you need.  It's easier if the source is XHTML or at least valid HTML, because then the DOM functions will work.
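A toy illustration of that explode()/strip_tags()/strpos()/substr() approach, using a made-up snippet of HTML (the id "t1" is hypothetical):

```php
<?php
// Made-up markup standing in for a downloaded page.
$htm = '<p>before</p><table id="t1"><tr><td>A</td><td>B</td></tr></table><p>after</p>';

// strpos() finds the opening tag of the target table; substr() keeps
// everything from that point onward.
$start = strpos($htm, '<table id="t1"');
$chunk = substr($htm, $start);

// Cut at the matching close (no nested tables in this toy example).
$end   = strpos($chunk, '</table>') + strlen('</table>');
$table = substr($chunk, 0, $end);

// explode() splits the cells; strip_tags() leaves only the text.
foreach (explode('</td>', $table) as $cell) {
    $text = trim(strip_tags($cell));
    if ($text !== '') {
        echo $text, "\n"; // prints A, then B
    }
}
```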

Anyway, glad you've got it moving in the right direction, ~Ray
<?php // RAY_temp_fernanditos.php

$htm = file_get_contents('');

echo htmlentities($htm);



1) He asked for curl. On many hosted sites, file_get_contents() is disabled for remote content, so curl is necessary.
2) In no way does my regex "make it harder." On the contrary, it works perfectly DESPITE the fact that the page doesn't validate, and he has no control over that page to make it validate.
3) As I said, this is FAR easier and cleaner in perl with LWP than in PHP. With LWP, there are native functions to handle precisely this type of extraction. The guts of the program are 4 lines. No struggling with explode, strip_tags, strpos, or substr.

#!/usr/local/bin/perl -w

use LWP::Simple;
use HTML::TreeBuilder;
use CGI qw(:standard);

print header;
print start_html( -title=>'Using perl and LWP to Extract Table');

my $url = '';
my $contents = get($url) || die "Can't download $url: $!\n";
my $root = HTML::TreeBuilder->new_from_content($contents);
my $table = $root->find_by_attribute("id","ctl00_CP1_grdFDACalendar");

print $table->as_HTML();


Hi, Harry.  I was not talking about YOUR regex, so much as regex in general.  From my experience here at EE I find a lot of people are confounded by a language that is almost nothing but punctuation (little wonder).  And some are amazed, too, when the web page scraper they have meticulously put together suddenly breaks because of a little change in a site they do not control.

You're entitled to your opinion, of course, but I do not find using explode(), strip_tags(), strpos() and substr() to be a struggle!

best regards, ~Ray
You might be interested in the fact that after the entire page is loaded into $root, it only takes the one line. I could have loaded the page into $root in one line, but I thought it was clearer to the reader to do it in several (I grant you, not the typical perl approach).

my $table = $root->find_by_attribute("id","ctl00_CP1_grdFDACalendar");

Gets exactly what he needs. Compared to that, having to parse the content with PHP IS a struggle. If you have anything comparable in PHP, I'd be interested in hearing about it.

I've been programming for 44 years, in perhaps 40 different programming and scripting languages.  I worked at the IBM Research Center in Poughkeepsie when they were putting APL onto mainframes. I use whatever tool is most appropriate to the job. I don't remember the last time I wasn't able to accomplish whatever I needed to, usually without "struggling."
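For what it's worth, PHP's DOM extension does offer something comparable to the one-line Perl lookup above. A sketch, assuming the sample HTML is a stand-in for the page fetched with cURL:

```php
<?php
// Stand-in for the downloaded page; the real $contents would come from cURL.
$contents = '<html><body><h1>x</h1><table id="ctl00_CP1_grdFDACalendar">'
          . '<tr><td>demo</td></tr></table></body></html>';

libxml_use_internal_errors(true);   // tolerate markup that does not validate
$doc = new DOMDocument();
$doc->loadHTML($contents);

// An XPath query on the id attribute is the reliable way to find the
// element in loose HTML that lacks a DTD.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//table[@id="ctl00_CP1_grdFDACalendar"]');
if ($nodes->length > 0) {
    echo $doc->saveHTML($nodes->item(0)); // just the table's markup
}
```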