Solved

PowerShell HTML parse

Posted on 2014-11-29
Last Modified: 2014-12-04
My foreach loop is not filtering the InnerHTML array by the string I use as the filter. There seem to be two sets of nodes for each game because there are two box scores. I want the results from the first set only.

It looks like this works for the first line, but then the code falls apart.

The result should look like this for every line:
Jets 4, Blue Jackets 2


Add-Type -path C:\PStemp\HtmlAgilityPack\Net40\htmlagilitypack.dll
CLS

	$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
	$wc = New-Object System.Net.WebClient;
	$doc = New-Object HtmlAgilityPack.HtmlDocument
	$doc.LoadHtml($wc.DownloadString($Website))
	
	$game = $doc.DocumentNode.SelectNodes('.//table["mod-container mod-no-header-footer mod-scorebox final mod-scorebox-final"]') | select -first 4
	$scores = @()
	$i = 0

	ForEach ($innerHTML in $game.InnerHTML | Where-Object { $_.InnerHTML -notlike "*-totalScoreHome*" }) #-or $game.InnerHTML -notlike "*-totalScoreAway*"
	{
		
		$Teams = $innerHTML -split "`"><a href=`""
		
		$Team1 = $Teams[1].Substring($Teams[1].IndexOf("http://espn.go.com") + 48, $Teams[1].IndexOf("</a>") - $Teams[1].IndexOf("http://espn.go.com") - 53).Replace("/", "").Replace("`"", "")
		$Team2 = $Teams[2].Substring($Teams[2].IndexOf("http://espn.go.com") + 48, $Teams[2].IndexOf("</a>") - $Teams[2].IndexOf("http://espn.go.com") - 53).Replace("/", "").Replace("`"", "")
	
		$Score1 = $Teams[1].Substring($Teams[1].IndexOf("-awayHeaderScore`">") + 18, 2).Replace("<", "").Replace("/", "-1")
		$Score2 = $Teams[2].Substring($Teams[2].IndexOf("-homeHeaderScore`">") + 18, 2).Replace("<", "").Replace("/", "-1")
	
		$TeamScore = $Team1 + ' ' + $Score1 + ', ' + $Team2 + ' ' + $Score2
		
		$scores += New-Object PsObject -Property @{ Scores = $TeamScore; }
		$i = $i + 2
	}
	$scores | select Scores | Format-Table -AutoSize

Question by:Leo Torres

Expert Comment

by:Joe Klimis
ID: 40472870
Hi

Using PowerShell 3 or above, I would do something like the following instead of using the HTML Agility Pack:

$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$Request = Invoke-WebRequest -URI $webSite
$h = $request.ParsedHtml.getElementsByTagName("div")
$h | where classname -eq 'team-name' | select InnerText
$a = $h | where classname -eq 'span-2' | select innerhtml
$teama = ($a.innerHTML -split "</A>")[0].split(">")[11]
$scorea =  ($a.innerHTML -split "</A>")[1].split("<")[4].split(">")[1]
$teamb = (($a.innerHTML -split "</A>")[1] -split ">")[17]
$scoreb = ($a.innerHTML -split "</A>")[2].split(">")[4].split("<")

write-output $teama , $scorea , $teamb , $scoreb

If you detail your requirements, I can help you using this method.

Regards
Joe

Author Comment

by:Leo Torres
ID: 40472988
The requirement is just to extract the team names and scores for the day in question.

The output of your code is this:
Jets
4
Blue Jackets
2
/SPAN

This is only one game. I need all results for that day, and the output should not bring back "/SPAN".


Just so I know, why would you not use the Agility Pack? Is there a drawback? I used it because I thought it was easier, but whatever works is fine with me. I like taking different approaches; it serves as a teaching point for myself.

Accepted Solution

by:
Joe Klimis earned 500 total points
ID: 40473440
Hi Leo

I have never used the Agility Pack; perhaps I should take a look :-). Not all sites I work on allow downloading additional tools, so I usually try to make things work using out-of-the-box features.

This, I think, will do what you want:
$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$Request = Invoke-WebRequest -URI $webSite   #  fetch web page
$h = $request.ParsedHtml.getElementsByTagName("table")  #  split page by tag  to isolate the required information
$results = ($h | where classname -eq "game-header-table" | select innerhtml) #  create an array of game results

foreach ( $result in  $Results )   # loop through each result , extracting the required information.
{
	$a = $result.innerhtml
	$teama = ($a -split "</A>")[0].split(">")[5]
	$scorea  = ($a  -split "</A>")[1].split("<")[4].split(">")[1]
	$teamb = (($a -split "</A>")[1] -split ">")[17]
	$scoreb = ($a -split "</A>")[2].split(">")[4].split("<").split("/")[0]
	write-output "$teama  $scorea    VS  $teamb  $scoreb "
}
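
As a rough illustration of how the split-and-index chain in the loop above picks values out of the markup, here is a minimal sketch. The fragment in `$a` is hypothetical and much shorter than the real ESPN markup, so the index positions differ from the ones used in the loop. The point is only that each index depends on the exact tag layout and will need adjusting if the page changes.

```powershell
# Hypothetical fragment; the real page markup is longer and laid out differently.
$a = '<TD><A href="x">Jets</A></TD><TD>4</TD>'

# -split is a case-insensitive regex split; this yields the text before and after the closing anchor tag.
$parts = $a -split "</A>"

# Team name: the text after the last ">" in the first piece.
$team = $parts[0].Split(">")[-1]

# Score: the third "<"-delimited chunk of the second piece is "TD>4"; take what follows the ">".
$score = $parts[1].Split("<")[2].Split(">")[1]

Write-Output "$team $score"   # Jets 4
```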


Author Comment

by:Leo Torres
ID: 40474411
Wow, indeed it works thank you!

Author Closing Comment

by:Leo Torres
ID: 40474414
thanks

Expert Comment

by:Qlemo
ID: 40481166
Coming late, but here it is. I had to use dummy variables to skip some content, as I was not able to filter it out appropriately via XPath:
Add-Type -path C:\temp\HtmlAgilityPack\Net40\htmlagilitypack.dll
CLS

$Website = "http://scores.espn.go.com/nhl/scoreboard?date=20141125"
$wc = New-Object System.Net.WebClient;
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($wc.DownloadString($Website))

$games = $doc.DocumentNode.SelectNodes('//*[@class="team-name"]|//*[@class="team-score"]') | select -Expand InnerText

while ($games)
{
  $Team1, $Score1, $dummy, $Team2, $Score2, $dummy, $dummy, $dummy, $games = $games
  Write-Host $Team1 $Score1', '$Team2 $Score2
}
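
The loop above relies on PowerShell's multiple assignment: each variable on the left takes one element, and the last variable receives whatever is left over. Below is a minimal sketch of that idiom on its own. The sample values (including `TeamC`, `TeamD`, and the `x` placeholders) are made up for illustration and stand in for the InnerText values scraped from the page, assuming eight values per game as in the loop above.

```powershell
# Hypothetical sample data: eight values per game, 'x' marks entries to be skipped.
$games = 'Jets', '4', 'x', 'Blue Jackets', '2', 'x', 'x', 'x',
         'TeamC', '1', 'x', 'TeamD', '0', 'x', 'x', 'x'

$lines = @()
while ($games)
{
  # The first eight variables take one element each; $games receives the remainder.
  # When nothing is left, $games becomes $null and the loop ends.
  $Team1, $Score1, $dummy, $Team2, $Score2, $dummy, $dummy, $dummy, $games = $games
  $lines += "$Team1 $Score1, $Team2 $Score2"
}

$lines
# Jets 4, Blue Jackets 2
# TeamC 1, TeamD 0
```

Collecting the lines in an array instead of calling Write-Host makes the result easy to sort, filter, or export later.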
