Cybervanes
asked on
PHP screen scraping from a page that requires you to be logged in.
I have a URL that, when used in a regular browser, writes a login cookie and then allows me to access a different page with info I need to scrape for use in other applications.
Login URL:
http://theirDomain.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811
Content URL:
http://theirDomain.com/ActualMain.aspx?sid=d4&pvsid=18950
I'm using PHP Simple HTML DOM Parser to scrape the information I need in other applications that don't require a login.
I'll probably have to use PHP cURL to accomplish this, but I can't find any easily understandable information.
Does anybody have any ideas that may help me get to the data I need?
ASKER
Thanks Hackney... I have permission, and it's the only way I could get the information that I need. The information is updated hourly and they don't have the capability to send a feed.
anybody else?
Well, I believe you can use cURL (needs to be compiled into PHP as it's not part of the default build) with cookie options so that it can handle this sort of thing. This isn't something I've done, but while you're waiting for other experts to suggest something, take a look at these links:
http://php.net/manual/en/book.curl.php
http://www.electrictoolbox.com/php-curl-cookies/
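As a rough illustration of the cookie idea, here is a minimal sketch. It assumes the login URL sets its cookie on a plain GET, with no form to post, and it has not been run against the real site. The function name `fetch_after_login` is made up for this example; the cURL options are the standard ones from the PHP manual.

```php
<?php
// Sketch only: request the login URL, then the content URL, on ONE cURL handle
// so that cookies set by the first response are sent with the second request.
// fetch_after_login() is a hypothetical helper name, not part of any library.
function fetch_after_login($loginUrl, $contentUrl)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow any redirect the login page issues
        CURLOPT_COOKIEFILE     => '',   // empty string: enable in-memory cookie handling
        CURLOPT_TIMEOUT        => 30,   // give up after 30 seconds
    ]);

    // First request: the login URL, which is expected to set the cookie.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    if (curl_exec($ch) === false) {
        return false;
    }

    // Second request: the page holding the data; the cookie is sent automatically.
    curl_setopt($ch, CURLOPT_URL, $contentUrl);
    return curl_exec($ch);
}

// Possible usage, with the placeholder URLs from the question:
// $html = fetch_after_login(
//     'http://theirDomain.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811',
//     'http://theirDomain.com/ActualMain.aspx?sid=d4&pvsid=18950'
// );
```

To keep cookies between script runs, `CURLOPT_COOKIEJAR` and `CURLOPT_COOKIEFILE` can both point at a writable file instead.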
Sad to say, but there is no one-size-fits-all functionality here. You must read the login page, extract the form vars, fill in the correct fields and submit the page. This can be done with CURL and a custom PHP script.
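To make the "extract the form vars" step concrete, here is a small sketch using PHP's built-in DOMDocument. The sample form, its field names and the `guest`/`secret` credentials are invented for illustration; a real login page will have its own fields, and `extract_form_fields` is a made-up helper name.

```php
<?php
// Sketch only: collect every <input> name/value pair from a page so the
// hidden fields can be posted back along with the credentials.
function extract_form_fields($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; hide parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Invented sample form (assumption: an ASP.NET-style page with a hidden field).
$sample = '<form method="post" action="login.aspx">'
        . '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        . '<input type="text" name="username" value="" />'
        . '<input type="password" name="password" value="" />'
        . '</form>';

$fields = extract_form_fields($sample);
$fields['username'] = 'guest';   // placeholder credentials
$fields['password'] = 'secret';

// URL-encoded body, ready to hand to cURL via CURLOPT_POSTFIELDS.
$postBody = http_build_query($fields);
```

The resulting string would then be sent with `CURLOPT_POST` and `CURLOPT_POSTFIELDS` on the same handle that read the login page.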
Please post the ACTUAL URLs involved and I will show you how to get started. There are a lot of moving parts and it is a brittle implementation when you do it with CURL, but it is better than nothing.
A better answer is one that uses an API provided by the data source. A REST implementation of this that relies on a URL GET string to send credentials and request parameters and that returns either XML or CSV is a very common and well-understood interface. For examples, look up the Yahoo and Google geocoders.
Standing by, ~Ray
ASKER
Thanks Ray,
log-in URL:
http://solarweb.fronius.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811
the above URL writes a cookie that the next URL looks for.
URL containing the data:
http://solarweb.fronius.com/ActualMain.aspx?sid=d4&pvsid=18950
The above page updates its information every 5 minutes.
I need to get to the information contained in the span id="ctl00_MainContent_LblDayEnergy" on line #167. It's the current daily kWh production reading.
and
span id="ctl00_MainContent_LblTotalEnergy", also on line #167, which is a total production reading in MWh.
Thank you very much for your help.
(Be aware that those links will now let anyone log in and take a look. If that's a problem, you'll need to change the password once you've got your code working.)
How many panels is it taking to produce 31kW?
ASKER
Yes, I am aware. The login is for a guest with no administrator privileges.
There are quite a few panels up on the roof... I'm just the IT guy, so I don't know the count. I'll try to attach an image.
ASKER
They produce an average of 480 kWh per day!
DSC00271.JPG
That is impressive. In what country are they based? I know that solar panels are a waste of money in the UK because our solar flux is too weak and frequently interrupted.
ASKER
Colorado USA... Awesome incentives right now.
OK, I clicked on the link to this page:
http://solarweb.fronius.com/customer_login.aspx?SID=1981be913628e1f7f70e5a574c1b4811
And got redirected to this page:
http://solarweb.fronius.com/PVSystem.aspx?sid=m3_1&pvsid=18950
So next I clicked on this link:
http://solarweb.fronius.com/ActualMain.aspx?sid=d4&pvsid=18950
... and got a page that contained this on line 170 (deconstructed here to make it easy to read)
<span id="ctl00_MainContent_LblDayEnergy"
style="font-weight:bold;Z-INDEX: 103; LEFT: 535px; POSITION: absolute; TOP: 429px">
471 kWh
</span>
<span id="ctl00_MainContent_LblTotalEnergy"
style="font-weight:bold;Z-INDEX: 104; LEFT: 535px; POSITION: absolute; TOP: 456px">
10.441 MWh
</span>
<span id="ctl00_MainContent_LblTemp1"
style="font-weight:bold;Z-INDEX: 105; LEFT: 587px; POSITION: absolute; TOP: 214px">
---
</span>
<img id="ctl00_MainContent_ImgThermometer1"
src="internet/img/thermometer_small_1.jpg"
style="border-width:0px;Z-INDEX: 106; LEFT: 558px; POSITION: absolute; TOP: 193px" />
<span id="ctl00_MainContent_LblTemp2"
style="font-weight:bold;Z-INDEX: 107; LEFT: 766px; POSITION: absolute; TOP: 373px">
---
</span>
<BR />
<span id="ctl00_MainContent_lblInverters"
style="font-weight:bold;Z-INDEX: 108; LEFT: 382px; POSITION: absolute; TOP: 325px">
6 IG Plus 10.0-1 UNI
<br>
</span>
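For what it's worth, values like these can be pulled out of the markup by id. The sketch below uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and runs on a trimmed copy of the two spans above rather than on the live page. `span_text_by_id` is a made-up helper name, and it assumes the id contains no quote characters.

```php
<?php
// Sketch only: return the trimmed text of the first <span> with the given id,
// or null when no such span exists.
function span_text_by_id($html, $id)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup without warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $id holds no double quotes, so it can be embedded directly.
    $nodes = $xpath->query('//span[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Trimmed copy of the two spans from the page source shown above.
$html = '<span id="ctl00_MainContent_LblDayEnergy" style="font-weight:bold;">471 kWh</span>'
      . '<span id="ctl00_MainContent_LblTotalEnergy" style="font-weight:bold;">10.441 MWh</span>';

$day   = span_text_by_id($html, 'ctl00_MainContent_LblDayEnergy');   // "471 kWh"
$total = span_text_by_id($html, 'ctl00_MainContent_LblTotalEnergy'); // "10.441 MWh"
```

Matching on the id alone means the lookup keeps working if the inline style positions change.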
ASKER
Right!
I need a way to scrape this information with PHP.
I need to get to the information contained in the span id="ctl00_MainContent_LblDayEnergy"
and
span id="ctl00_MainContent_LblTotalEnergy"
It seems like a couple of things are happening. First, the SID argument in the URL indicates that there is a persistent login, and when I used the old SID by clicking the link I got into an already-logged-in web page.
The time stamps in the data page looked good, and we found the kWh and MWh values. Next I am going to try closing all the browser instances, removing the cookies and trying the data page again.
ASKER
Alright...
F.Y.I.
I'm using code provided by http://simplehtmldom.sourceforge.net/ to scrape other inverter sites for the information I require, but they don't require a log-in cookie.
It does not look like there is a login cookie - there were no "fronius" domain cookies on my browser, and the presence of the SID argument in the URL probably means that they are transmitting the session ID in the URL links. Nevertheless, we still need to hit the login page and provide the appropriate credentials, then follow any redirects. If you want, you can email me the login and password you want me to test with. Please use my GMail address shown in my public profile here at EE.
Scraping the page is probably as simple as using a REGEX or two, once we get logged in and read the HTML with CURL.
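A sketch of what such a REGEX could look like, run here against a copy of the span markup quoted earlier in the thread rather than the live page. It assumes `id` is the first attribute of the span, as it is in that markup; `span_value_regex` is a made-up helper name.

```php
<?php
// Sketch only: capture the text inside <span id="..."> ... </span>.
// Assumes id is the first attribute, as in the page source quoted in this thread.
function span_value_regex($html, $id)
{
    $pattern = '~<span\s+id="' . preg_quote($id, '~') . '"[^>]*>\s*(.*?)\s*</span>~is';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Copy of the two spans, line breaks included.
$html = <<<'HTML'
<span id="ctl00_MainContent_LblDayEnergy"
style="font-weight:bold;Z-INDEX: 103; LEFT: 535px; POSITION: absolute; TOP: 429px">
471 kWh
</span>
<span id="ctl00_MainContent_LblTotalEnergy"
style="font-weight:bold;Z-INDEX: 104; LEFT: 535px; POSITION: absolute; TOP: 456px">
10.441 MWh
</span>
HTML;

$dayEnergy   = span_value_regex($html, 'ctl00_MainContent_LblDayEnergy');
$totalEnergy = span_value_regex($html, 'ctl00_MainContent_LblTotalEnergy');
```

The `s` modifier lets the match span line breaks, and the lazy `(.*?)` stops at the first closing tag.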
ASKER
You da man Ray!
ASKER
did you get my email?
Yes, I logged in, but I got some anomalous results that I am still sorting out. Some screenshots follow. This is what I saw after I logged in. On the left you can see that it says, "There is no PV system assigned..."
after-login.png
So I clicked on that area in the left where it says, "No PV system" and got this.
abbruch.png
I tried using the admin links, but got this page... I am accepting cookies, so there is something else wrong - possibly a server redirect loop.
try-pvsystem.png
I will continue to tinker around with it a little bit more - it looks like the login does not use CAPTCHA which is good news, nor any form tokens. All the JS validation seems to be about acceptable character strings and empty fields, so we may not need to deal with that.
Still, the best way to handle this is to get an API from the web site owners.
More to follow...
ASKER
Sorry Ray, I forgot to enable the PV system for your login. It's now enabled.
I've tried to get some type of a csv or xml feed and they do not offer it or any type of API at this time.
thanks.
ASKER CERTIFIED SOLUTION
This solution is only available to members of Experts Exchange.
ASKER
Looks awesome, Ray! You're like a PHP Jedi... Impressive!
I've modified your code to target the specific data I needed (below).
You're the man, Ray. Thanks!
// REFINE THE DATA SOME
// Keep only the tags we search on, then pad every tag with spaces so the values split cleanly.
$xyz = strip_tags($xyz, '<div><td><span>');
$xyz = str_replace('>', '> ', $xyz);
$xyz = str_replace('<', ' <', $xyz);

// Current load: the text between the lblCurrentPower span and " kW "
$load_arr = explode(' <span id="ctl00_MainContent_lblCurrentPower" style="Z-INDEX: 101; LEFT: 552px; POSITION: absolute; TOP: 387px"> ', $xyz);
$load = explode(" kW ", $load_arr[1]);
$load = $load[0];

// Total energy: the first two words after the LblTotalEnergy span are the value and its unit
$kWh_arr = explode(' <span id="ctl00_MainContent_LblTotalEnergy" style="Z-INDEX: 104; LEFT: 535px; POSITION: absolute; TOP: 456px"> ', $xyz);
$kWh = explode(" ", $kWh_arr[1]);
$unit = $kWh[1];
$kWh = $kWh[0];

// Convert MWh to kWh
if ($unit == "MWh") {
    $kWh = $kWh * 1000;
}
echo "<b>Total kWhs = " . $kWh . ' kWhs</b><br/>';
echo "<b>Current kW load = " . $load . ' kW</b><br/>';
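One possible refinement, offered only as a sketch: the value and unit can be split with a single pattern instead of relying on word positions. This assumes a reading looks like "10.441 MWh" or "471 kWh", with the dot as a decimal point (the same assumption the code above makes). `reading_to_kwh` is a made-up helper name.

```php
<?php
// Sketch only: turn a reading such as "10.441 MWh" into a number of kWh.
// Returns null when the text is not a number followed by kWh or MWh
// (for example the "---" placeholder the page shows for missing values).
function reading_to_kwh($reading)
{
    if (preg_match('/^\s*([0-9]+(?:\.[0-9]+)?)\s*(kWh|MWh)\s*$/', $reading, $m) !== 1) {
        return null;
    }
    $value = (float) $m[1];
    // 1 MWh = 1000 kWh
    return $m[2] === 'MWh' ? $value * 1000 : $value;
}

$totalKwh = reading_to_kwh('10.441 MWh'); // roughly 10441 kWh
$dayKwh   = reading_to_kwh('471 kWh');    // 471 kWh
```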
ASKER
Awesome!
Your solution looks great! Thanks for the points and for your kind words, and good luck with the project, ~Ray
Just caught a code fault - please replace lines 64 through 68 of my last script post above with this... It might never come up (error handler) but better to produce an accurate message than a false positive for an error page!
if ($xyz === FALSE)
{
    echo "\nCURL 2ND GET FAIL: $nexturl CURL_ERRNO=$err ";
    var_dump($inf);
}
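The full script is not visible in this thread, so the following is only a guess at the surrounding context: a sketch in which `$err` comes from `curl_errno()` and `$inf` from `curl_getinfo()`, both real cURL functions. The wrapper name `get_or_report` is made up, and the original script may well be organized differently.

```php
<?php
// Sketch only (assumed context): run a GET on an existing cURL handle and,
// on failure, report the cURL error number and the transfer info.
// get_or_report() is a hypothetical helper, not taken from the original script.
function get_or_report($ch, $nexturl)
{
    curl_setopt($ch, CURLOPT_URL, $nexturl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string

    $xyz = curl_exec($ch);
    if ($xyz === false) {
        $err = curl_errno($ch);   // numeric cURL error code
        $inf = curl_getinfo($ch); // array with HTTP code, timings, final URL, etc.
        echo "\nCURL 2ND GET FAIL: $nexturl CURL_ERRNO=$err ";
        var_dump($inf);
    }
    return $xyz;
}
```

The strict `=== false` test matters because an empty page body is a successful result, not an error.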
On the other hand, if you're caught scraping content from a site without permission, the owner of that content can have your pages delisted from Google.