Cybervanes

PHP screen scraping from a page that requires you to be logged in.

I have a URL that, when used in a regular browser, writes a login cookie and then allows me to access a different page with info I need to scrape for use in other applications.

Login URL:
http://theirDomain.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811
Content URL:
http://theirDomain.com/ActualMain.aspx?sid=d4&pvsid=18950

I'm using PHP Simple HTML DOM Parser to scrape the information I need in other applications that don't require a login.

I'll probably have to use PHP cURL to accomplish this, but I can't find any easily understandable information.

Does anybody have any ideas that may help me get to the data I need?
HackneyCab

Don't screen scrape from a site before talking to the site's owner. If you get their agreement, they may be able to help you access the data.

On the other hand, if you're caught scraping content from a site without permission the owner of that content can have your pages delisted from Google.
Cybervanes

ASKER

Thanks Hackney... I have permission, and it is the only way I can get the information that I need. The information is updated hourly and they don't have the capability to send a feed.

Anybody else?
Well, I believe you can use cURL (needs to be compiled into PHP as it's not part of the default build) with cookie options so that it can handle this sort of thing. This isn't something I've done, but while you're waiting for other experts to suggest something, take a look at these links:

http://php.net/manual/en/book.curl.php

http://www.electrictoolbox.com/php-curl-cookies/
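As a rough illustration of the cookie options mentioned above, here is a minimal sketch, assuming the cURL extension is installed. The helper names `buildCookieCurlOptions` and `fetchWithCookies` and the cookie file path are hypothetical and not taken from this thread; only the `CURLOPT_*` constants are part of PHP's cURL extension.

```php
<?php
// Hypothetical helper: builds the cURL options for a request that reads and
// writes cookies in the same file, so a cookie set by one request is sent
// along with the next one.
function buildCookieCurlOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects issued after login
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here before the request
    ];
}

// Hypothetical helper: fetches a URL and returns the response body,
// or false if the transfer failed.
function fetchWithCookies(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCookieCurlOptions($url, $cookieFile));
    $body = curl_exec($ch);
    curl_close($ch); // closing the handle flushes the cookies to the jar file
    return $body;
}
```

The idea would be to call `fetchWithCookies()` once with the login URL and then again with the content URL, passing the same cookie file both times. Whether that is enough depends on how the site actually tracks the session.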
Sad to say, but there is no one-size-fits-all functionality here.  You must read the login page, extract the form vars, fill in the correct fields and submit the page.  This can be done with CURL and a custom PHP script.
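The "extract the form vars, fill in the correct fields" step might look something like the sketch below. It assumes the DOM extension is available. The helper names and the field names used in the example (`token`, `user`, `pass`) are hypothetical, since the real login form's field names are not shown in this thread.

```php
<?php
// Hypothetical helper: collects the name/value pairs of every named <input>
// in the page, so hidden fields are carried over to the login POST.
function extractFormFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            // A missing value attribute comes back as an empty string.
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical helper: overlays the credentials on the scraped fields and
// encodes the result as a URL-encoded body suitable for CURLOPT_POSTFIELDS.
function buildLoginPostBody(array $formFields, array $credentials): string
{
    return http_build_query(array_merge($formFields, $credentials));
}
```

The resulting string would then be sent to the form's action URL with `CURLOPT_POST` enabled. Inputs without a `name` attribute (a bare submit button, for instance) are skipped here, which may or may not match what a given server expects.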

Please post the ACTUAL URLs involved and I will show you how to get started.  There are a lot of moving parts and it is a brittle implementation when you do it with CURL, but it is better than nothing.  

A better answer is one that uses an API provided by the data source.  A REST implementation of this that relies on a URL GET string to send credentials and request parameters and that returns either XML or CSV is a very common and well-understood interface.  For examples, look up the Yahoo and Google geocoders.

Standing by, ~Ray
Thanks Ray,

log-in URL:
http://solarweb.fronius.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811
The above URL writes a cookie that the next URL looks for.

URL containing the data:
http://solarweb.fronius.com/ActualMain.aspx?sid=d4&pvsid=18950
The above page updates its information every 5 minutes.

I need to get the information contained in the span id="ctl00_MainContent_LblDayEnergy" on line #167. It is the current daily production reading in kWh.

and

span id="ctl00_MainContent_LblTotalEnergy", also on line #167, which is the total production reading in MWh.

Thank You very much for your help.
(Be aware that those links will now let anyone login and take a look. If that's a problem, you'll need to change the password once you've got your code working.)

How many panels is it taking to produce 31kW?
Yes, I am aware. The login is for a guest with no administrator privileges.

There are quite a few panels up on the roof... I'm just the IT guy, so I don't know the count. I'll try to attach an image.
They produce an average of 480 kWh per day!
DSC00271.JPG
That is impressive. In what country are they based? I know that solar panels are a waste of money in the UK because our solar flux is too weak and frequently interrupted.
Colorado USA... Awesome incentives right now.
OK, I clicked on the link to this page:
http://solarweb.fronius.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811

And got redirected to this page:
http://solarweb.fronius.com/PVSystem.aspx?sid=m3_1&pvsid=18950

So next I clicked on this link:
http://solarweb.fronius.com/ActualMain.aspx?sid=d4&pvsid=18950

... and got a page that contained this on line 170 (deconstructed here to make it easy to read)
<span id="ctl00_MainContent_LblDayEnergy" 
      style="font-weight:bold;Z-INDEX: 103; LEFT: 535px; POSITION: absolute; TOP: 429px">
	  471 kWh
</span>
<span id="ctl00_MainContent_LblTotalEnergy" 
      style="font-weight:bold;Z-INDEX: 104; LEFT: 535px; POSITION: absolute; TOP: 456px">
	  10.441 MWh
</span> 
<span id="ctl00_MainContent_LblTemp1" 
      style="font-weight:bold;Z-INDEX: 105; LEFT: 587px; POSITION: absolute; TOP: 214px">
      ---
</span> 
<img id="ctl00_MainContent_ImgThermometer1" 
     src="internet/img/thermometer_small_1.jpg" 
	 style="border-width:0px;Z-INDEX: 106; LEFT: 558px; POSITION: absolute; TOP: 193px" />
<span id="ctl00_MainContent_LblTemp2" 
      style="font-weight:bold;Z-INDEX: 107; LEFT: 766px; POSITION: absolute; TOP: 373px">
	  ---
</span> 
<BR />
<span id="ctl00_MainContent_lblInverters" 
      style="font-weight:bold;Z-INDEX: 108; LEFT: 382px; POSITION: absolute; TOP: 325px">
	  6 IG Plus 10.0-1 UNI
<br>
</span>


Right!

I need a way to scrape this information with PHP.

I need to get the information contained in the span id="ctl00_MainContent_LblDayEnergy"

and

span id="ctl00_MainContent_LblTotalEnergy"
It seems like a couple of things are happening.  First, the SID argument in the URL indicates that there is a persistent login, and when I used the old SID by clicking the link I got into an already-logged-in web page.

The time stamps in the data page looked good.  And we found the kWh and MWh values.  Next I am going to try closing all the browser instances, removing the cookies and trying the data page again.
Alright...

F.Y.I.
I'm using code provided by http://simplehtmldom.sourceforge.net/ to scrape other inverter sites to obtain the information I require, but they don't require a login cookie.
It does not look like there is a login cookie - there were no "fronius" domain cookies on my browser, and the presence of the SID argument in the URL probably means that they are transmitting the session ID in the URL links.  Nevertheless, we still need to hit the login page and provide the appropriate credentials, then follow any redirects.  If you want, you can email me the login and password you want me to test with.  Please use my GMail address shown in my public profile here at EE.

Scraping the page is probably as simple as using a REGEX or two, once we get logged in and read the HTML with CURL.
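A regex along these lines could pull the text out of a span by its id. This is only a sketch: the helper name `extractSpanText` is hypothetical, and the pattern assumes the id is written in double quotes and that the span contains no nested spans, which holds for the markup quoted earlier in this thread but might not hold elsewhere.

```php
<?php
// Hypothetical helper: returns the trimmed text inside <span id="...">,
// or null when no span with that id is found.
function extractSpanText(string $html, string $id): ?string
{
    // "s" lets "." match newlines, "i" ignores tag-name case.
    $pattern = '~<span\b[^>]*\bid="' . preg_quote($id, '~') . '"[^>]*>(.*?)</span>~is';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // Drop any inner tags such as <br> and the surrounding whitespace.
    return trim(strip_tags($m[1]));
}
```

With the page HTML in a variable, `extractSpanText($html, 'ctl00_MainContent_LblDayEnergy')` would return the daily reading as a string, unit included. A DOM parser is generally the sturdier choice if the markup ever becomes more complicated.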
You da man Ray!
Did you get my email?
Yes, I logged in, but I got some anomalous results that I am still sorting out.  Some screenshots follow.  This is what I saw after I logged in.  On the left you can see that it says, "There is no PV system assigned..."
after-login.png
So I clicked on that area in the left where it says, "No PV system" and got this.
abbruch.png
I tried using the admin links, but got this page...  I am accepting cookies, so there is something else wrong - possibly a server redirect loop.
try-pvsystem.png
I will continue to tinker around with it a little bit more - it looks like the login does not use CAPTCHA which is good news, nor any form tokens.  All the JS validation seems to be about acceptable character strings and empty fields, so we may not need to deal with that.  

Still, the best way to handle this is to get an API from the web site owners.

More to follow...
Sorry Ray, I forgot to enable the PV system for your login. It's now enabled.

I've tried to get some type of CSV or XML feed, but they do not offer one or any type of API at this time.

Thanks.
ASKER CERTIFIED SOLUTION
Ray Paseur
This solution is only available to members.
Looks awesome, Ray! You're like a PHP Jedi... Impressive!

I've modified your code to target the specific data I needed (below).

You're the man, Ray. Thanks!
// REFINE THE DATA SOME: keep only <div>, <td> and <span> tags, then pad every tag with spaces
$xyz = strip_tags($xyz, '<div><td><span>');
$xyz = str_replace('>', '> ', $xyz);
$xyz = str_replace('<', ' <', $xyz);

// Current load: the text between the lblCurrentPower opening tag and " kW "
$load_arr = explode(' <span id="ctl00_MainContent_lblCurrentPower" style="Z-INDEX: 101; LEFT: 552px; POSITION: absolute; TOP: 387px"> ', $xyz);
$load = explode(" kW ", $load_arr[1]);
$load = $load[0];

// Total energy: the first two space-separated tokens after the LblTotalEnergy opening tag are the value and its unit
$kWh_arr = explode(' <span id="ctl00_MainContent_LblTotalEnergy" style="Z-INDEX: 104; LEFT: 535px; POSITION: absolute; TOP: 456px"> ', $xyz);
$kWh = explode(" ", $kWh_arr[1]);
$unit = $kWh[1];
$kWh = $kWh[0];

// Convert MWh to kWh
if ($unit == "MWh") {
    $kWh = $kWh * 1000;
}

echo "<b>Total kWhs = " . $kWh . ' kWhs</b><br/>';

echo "<b>Current kW load = " . $load . ' kW</b><br/>';
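The unit check above could also be folded into a small helper, so that an unexpected reading (for example the "---" placeholder that the temperature spans show) is detected rather than silently multiplied. This is a sketch only: `energyToKwh` is a hypothetical name, and it assumes readings use "." as the decimal separator, as the multiplication by 1000 above already does.

```php
<?php
// Hypothetical helper: converts a reading such as "471 kWh" or "10.441 MWh"
// to kilowatt-hours. Returns null when the text is not in that shape.
function energyToKwh(string $reading): ?float
{
    // Number, optional whitespace, then Wh, kWh or MWh (case-sensitive).
    if (preg_match('/^\s*(\d+(?:\.\d+)?)\s*([kM]?Wh)\s*$/', $reading, $m) !== 1) {
        return null;
    }
    $factors = ['Wh' => 0.001, 'kWh' => 1.0, 'MWh' => 1000.0];
    return (float) $m[1] * $factors[$m[2]];
}
```

The match is deliberately case-sensitive, so a lowercase "mWh" is rejected instead of being treated as megawatt-hours.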


Awesome!
Your solution looks great!  Thanks for the points and for your kind words, and good luck with the project, ~Ray
Just caught a code fault - please replace lines 64 through 68 of my last script post above with this... It might never come up (error handler) but better to produce an accurate message than a false positive for an error page!
if ($xyz === FALSE)
{
    echo "\nCURL 2ND GET FAIL: $nexturl CURL_ERRNO=$err ";
    var_dump($inf);
}
