Solved

PowerShell HTML parse

Posted on 2014-11-29
Last Modified: 2014-12-04
My foreach loop is not filtering the InnerHTML array with the string I use as the filter. There seem to be 2 sets for each game because of the 2 box scores. I want the results from the first set.

It looks like this works for the first line, but then the code falls apart.

The result should look like this for every line:
Jets 4, Blue Jackets 2


Add-Type -path C:\PStemp\HtmlAgilityPack\Net40\htmlagilitypack.dll
CLS

	$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
	$wc = New-Object System.Net.WebClient;
	$doc = New-Object HtmlAgilityPack.HtmlDocument
	$doc.LoadHtml($wc.DownloadString($Website))
	
	$game = $doc.DocumentNode.SelectNodes('.//table["mod-container mod-no-header-footer mod-scorebox final mod-scorebox-final"]') | select -first 4
	$scores = @()
	$i = 0

	ForEach ($innerHTML in $game.InnerHTML | Where-Object { $_.InnerHTML -notlike "*-totalScoreHome*" }) #-or $game.InnerHTML -notlike "*-totalScoreAway*"
	{
		
		$Teams = $innerHTML -split "`"><a href=`""
		
		$Team1 = $Teams[1].Substring($Teams[1].IndexOf("http://espn.go.com") + 48, $Teams[1].IndexOf("</a>") - $Teams[1].IndexOf("http://espn.go.com") - 53).Replace("/", "").Replace("`"", "")
		$Team2 = $Teams[2].Substring($Teams[2].IndexOf("http://espn.go.com") + 48, $Teams[2].IndexOf("</a>") - $Teams[2].IndexOf("http://espn.go.com") - 53).Replace("/", "").Replace("`"", "")
	
		$Score1 = $Teams[1].Substring($Teams[1].IndexOf("-awayHeaderScore`">") + 18, 2).Replace("<", "").Replace("/", "-1")
		$Score2 = $Teams[2].Substring($Teams[2].IndexOf("-homeHeaderScore`">") + 18, 2).Replace("<", "").Replace("/", "-1")
	
		$TeamScore = $Team1 + ' ' + $Score1 + ', ' + $Team2 + ' ' + $Score2
		
		$scores += New-Object PsObject -Property @{ Scores = $TeamScore; }
		$i = $i + 2
	}
	$scores | select Scores | Format-Table -AutoSize
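
Two things in the code above may explain why the filter has no effect. In XPath, a bare string literal used as a predicate is always true, so `table["..."]` matches every table rather than one class. And `$game.InnerHTML` puts plain strings on the pipeline, which have no `InnerHTML` property of their own, so `$_.InnerHTML` is `$null` and `-notlike` always succeeds. A minimal sketch of both effects, using the built-in `[xml]` parser and made-up markup in place of HtmlAgilityPack and the real page (all sample data here is hypothetical):

```powershell
# Hypothetical, simplified markup; the real ESPN page is far larger.
[xml]$page = '<root><table class="mod-scorebox final"/><table class="standings"/></root>'

# A bare string literal as a predicate is always true, so this matches every table.
$literal = $page.SelectNodes('//table["mod-scorebox final"]')

# Comparing the class attribute narrows the match to the intended table.
$byClass = $page.SelectNodes('//table[@class="mod-scorebox final"]')

# Hypothetical stand-ins for the strings that $game.InnerHTML puts on the pipeline.
$fragments = @('<td id="1-awayHeaderScore">4</td>', '<td id="1-totalScoreHome">2</td>')

# Strings have no InnerHTML property, so $_.InnerHTML is $null and -notlike is always true.
$unfiltered = @($fragments | Where-Object { $_.InnerHTML -notlike '*-totalScoreHome*' })

# Testing the string itself applies the filter.
$filtered = @($fragments | Where-Object { $_ -notlike '*-totalScoreHome*' })

'XPath: {0} vs {1} tables; filter: {2} vs {3} strings' -f $literal.Count, $byClass.Count, $unfiltered.Count, $filtered.Count
```

This assumes strict mode is off (the default); under `Set-StrictMode -Version 2` or later, reading the missing property would raise an error instead of returning `$null`.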

Question by:Leo Torres
6 Comments
 
LVL 10

Expert Comment

by:JoeKlimis
ID: 40472870
Hi

Using PowerShell 3 or above, I would do something like the following instead of using the HTML Agility Pack:

$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$Request = Invoke-WebRequest -URI $webSite
$h = $request.ParsedHtml.getElementsByTagName("div")
$h | where classname -eq 'team-name' | select InnerText
$a = $h | where classname -eq 'span-2' | select innerhtml
$teama = ($a.innerHTML -split "</A>")[0].split(">")[11]
$scorea =  ($a.innerHTML -split "</A>")[1].split("<")[4].split(">")[1]
$teamb = (($a.innerHTML -split "</A>")[1] -split ">")[17]
$scoreb = ($a.innerHTML -split "</A>")[2].split(">")[4].split("<")

write-output $teama , $scorea , $teamb , $scoreb


If you detail your requirements, I can help you using this method.

Regards
Joe
 
LVL 8

Author Comment

by:Leo Torres
ID: 40472988
The requirement is just to extract the team name and score for the day in question.

The output of your code is this:
Jets
4
Blue Jackets
2
/SPAN


This is only one game. I need all the results for that day, and it should not bring back "/SPAN".


Just so I know, why would you not use the Agility Pack? Is there a drawback? I used it because I thought it was easier, but whatever works is fine with me. I like taking different approaches; it serves as a teaching point for myself.
 
LVL 10

Accepted Solution

by:
JoeKlimis earned 500 total points
ID: 40473440
Hi Leo

I have never used the Agility Pack, perhaps I should take a look :-), but not all sites I work on allow downloading additional tools, so I usually try to make things work using out-of-the-box features.

This, I think, will do what you want:
$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$Request = Invoke-WebRequest -URI $webSite   #  fetch web page
$h = $request.ParsedHtml.getElementsByTagName("table")  #  split page by tag  to isolate the required information
$results = ($h | where classname -eq "game-header-table" | select innerhtml) #  create an array of game results

foreach ( $result in  $Results )   # loop through each result , extracting the required information.
{
	$a = $result.innerhtml
	$teama = ($a -split "</A>")[0].split(">")[5]
	$scorea  = ($a  -split "</A>")[1].split("<")[4].split(">")[1]
	$teamb = (($a -split "</A>")[1] -split ">")[17]
	$scoreb = ($a -split "</A>")[2].split(">")[4].split("<").split("/")[0]
	write-output "$teama  $scorea    VS  $teamb  $scoreb "
}
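
The fixed split indexes above depend on the exact markup of the page. As a rough alternative sketch, a regular expression can pull each name and score by pattern instead of by position. The fragment below is hypothetical (the `team-name` and `team-score` class names are assumed, and the real page's markup will differ), so the pattern would need adjusting against the actual `innerHTML`:

```powershell
# Hypothetical fragment standing in for one game's innerHTML.
$sample = @'
<div class="team-name"><a href="#">Jets</a></div><span class="team-score">4</span>
<div class="team-name"><a href="#">Blue Jackets</a></div><span class="team-score">2</span>
'@

# Capture the link text after a team-name class, then the digits after the next team-score class.
$pattern = 'class="team-name".*?>([^<>]+)</a>.*?class="team-score"[^>]*>(\d+)<'
$options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'
$found   = [regex]::Matches($sample, $pattern, $options)

# One "Name Score" string per team, in page order.
$pairs = @($found | ForEach-Object { '{0} {1}' -f $_.Groups[1].Value, $_.Groups[2].Value })

# Join the teams two at a time into one line per game.
$lines = @(for ($i = 0; $i + 1 -lt $pairs.Count; $i += 2) { '{0}, {1}' -f $pairs[$i], $pairs[$i + 1] })
$lines
```

For this sample it prints `Jets 4, Blue Jackets 2`.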


 
LVL 8

Author Comment

by:Leo Torres
ID: 40474411
Wow, indeed it works, thank you!
 
LVL 8

Author Closing Comment

by:Leo Torres
ID: 40474414
thanks
 
LVL 69

Expert Comment

by:Qlemo
ID: 40481166
Coming late, but here it is. I had to use dummy variables to ignore some content, as I was not able to filter that stuff out appropriately via XPath:
Add-Type -path C:\temp\HtmlAgilityPack\Net40\htmlagilitypack.dll
CLS

$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$wc = New-Object System.Net.WebClient;
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($wc.DownloadString($Website))

$games = $doc.DocumentNode.SelectNodes('//*[@class="team-name"]|//*[@class="team-score"]') | select -Expand InnerText

while ($games)
{
  $Team1, $Score1, $dummy, $Team2, $Score2, $dummy, $dummy, $dummy, $games = $games
  Write-Host $Team1 $Score1', '$Team2 $Score2
}
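
The loop above relies on PowerShell's multiple assignment: each variable on the left takes one element, and the last variable receives whatever is left, becoming `$null` once the list runs out. A minimal sketch of that pattern, using the built-in `[xml]` parser on hypothetical, well-formed markup in place of HtmlAgilityPack (the sample has no extra nodes, so no dummy variables are needed here):

```powershell
# Hypothetical markup with two games; class names follow the XPath used above.
[xml]$page = @'
<root>
  <div class="team-name">Jets</div><span class="team-score">4</span>
  <div class="team-name">Blue Jackets</div><span class="team-score">2</span>
  <div class="team-name">Senators</div><span class="team-score">3</span>
  <div class="team-name">Blues</div><span class="team-score">2</span>
</root>
'@

# Select names and scores in document order and keep only their text.
$values = @($page.SelectNodes('//*[@class="team-name" or @class="team-score"]') |
    ForEach-Object { $_.InnerText })

$results = @()
while ($values)
{
    # Peel off four items per game; $values keeps the remainder, or $null at the end.
    $team1, $score1, $team2, $score2, $values = $values
    $results += "$team1 $score1, $team2 $score2"
}
$results
```

The `or` predicate is used in place of the `|` union only to keep the sketch to one expression.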

Question has a verified solution.