Eddie Shipman
asked on
Help parsing HTML using preg_match_all
I need some help parsing an HTML table with PHP.
The table below is from the ESPN College Football final Standings page at http://www.espn.com/college-football/rankings
I want to get the team NAMES into an array so that I can loop over them. It appears that they are in the nodes that look like this:
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
Clemson
</a>
Here is a partial table:
<table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
<colgroup span="7" class="Table2__colgroup">
<col class="Table2__col">
<col class="Table2__col">
<col class="Table2__col">
<col class="Table2__col">
<col class="Table2__col">
<col class="Table2__col">
<col class="Table2__col">
</colgroup>
<thead class="Table2__thead">
<tr class="Table2__header-row Table2__tr Table2__even">
<th title="" class="Table2__th">RK</th>
<th title="" class="tl Table2__th">
<div class="tl">
<!-- -->Team<!-- -->
</div>
</th>
<th title="" class="Table2__th">
<div>
<!-- -->REC<!-- -->
</div>
</th>
<th title="" class="Table2__th">
<div>
<!-- -->PTS<!-- -->
</div>
</th>
<th title="" class="tc Table2__th">
<div class="tc">
<!-- -->TREND<!-- -->
</div>
</th>
<th title="" class="tl Table2__th">
<div class="tl">
<!-- -->
</div>
</th>
<th title="" class="tl Table2__th">
<div class="tl">
<!-- -->
</div>
</th>
</tr>
</thead>
<tbody class="Table2__tbody">
<tr class="Table2__tr Table2__tr--sm Table2__even" data-idx="0">
<td class="tight-cell Table2__td">1</td>
<td class="Table2__td">
<div class="flex justify-start">
<span class="tc dib pr4" style="width:20px;height:14px">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
<img alt="Clemson" style="width:20px;height:20px" title="Clemson" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/228.png&w=40&h=40">
</a>
</span>
<span class="pl2 pr3 dn show-mobile clr-link underline-hover">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
<abbr style="text-decoration:none" title="Clemson">
CLEM
</abbr>
</a>
</span>
<span class="pl3 hide-mobile clr-link underline-hover">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
Clemson
</a>
</span>
<span class="ml2">
(<!-- -->64<!-- -->)
</span>
</div>
</td>
<td class="Table2__td">
<div class="">15-0</div>
</td>
<td class="Table2__td">
<div class="">1600</div>
</td>
<td class="Table2__td">
<div class="tc" style="color:#009444">
<svg class="icon__svg" style="height:14px;fill:#009444" viewBox="0 0 24 24">
<use xlink:href="#icon__arrow__up"></use>
</svg>
1
</div>
</td>
<td class="tl Table2__td">
<div class="tl "></div>
</td>
<td class="tl Table2__td">
<div class="tl "></div>
</td>
</tr>
</tbody>
</table>
I previously parsed their BCS standings using this code:
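<?php
// Got this code from someone on StackOverflow in 2014.
$str = file_get_contents('http://sports.espn.go.com/ncf/BCSStandings');
// Capture the link text (the team name) of every clubhouse link.
preg_match_all('#<a href="http://sports\.espn\.go\.com/ncf/clubhouse\?teamId=\d{1,2700}">([^<]+)</a>#', $str, $m, PREG_PATTERN_ORDER);
// With PREG_PATTERN_ORDER, $m[1] holds the first capture group: the team names.
$m = $m[1];
?>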
Now, of course, the page no longer exists because there is no more BCS. I want to update this parser for the new standings.
I KNOW I can use HTMLTidy or some other tool but I was hoping that I could do it using preg_match_all like before.
Note that if there are any team names with hyphens, apostrophes or other non-alpha characters then the pattern will need a little adjustment to handle that.
You do realize that you're asking to violate Disney's Terms of Service, right?
You may not circumvent or disable any content protection system or digital rights management technology used with any Disney Service; decompile, reverse engineer, disassemble or otherwise reduce any Disney Service to a human-readable form; remove identification, copyright or other proprietary notices; or access or use any Disney Service in an unlawful or unauthorized manner or in a manner that suggests an association with our products, services or brands.
ASKER
@kaufmed, this is just for a one-time presentation and I plan on attributing the data source.
@Julian, I'll test later today.
@Terry, Yes, it misses Texas A & M, can it be readjusted to catch the & sign?
Pattern adjusted to allow ampersands:
https://regex101.com/r/vRTAqj/2
<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<
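For anyone following along, here is a minimal sketch of how that pattern could be dropped into preg_match_all. The $html string is a trimmed copy of the sample markup from the question rather than the live page, and the variable names are just for illustration. The anchors that wrap the logo image and the abbreviation are skipped because the first thing after their opening tag is another tag, not a word character.

```php
<?php
// Trimmed sample markup from the question; the live page may differ.
$html = <<<'HTML'
<span class="tc dib pr4">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
<img alt="Clemson" title="Clemson" src="228.png">
</a>
</span>
<span class="pl2 pr3 dn show-mobile clr-link underline-hover">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
<abbr style="text-decoration:none" title="Clemson">
CLEM
</abbr>
</a>
</span>
<span class="pl3 hide-mobile clr-link underline-hover">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
Clemson
</a>
</span>
HTML;

// Pattern from this thread, wrapped in # delimiters.
$pattern = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<#';

preg_match_all($pattern, $html, $m, PREG_PATTERN_ORDER);

// $m[1] holds the captured team names.
$teams = $m[1];
print_r($teams);
```

To run it against the real page you would presumably swap the sample string for file_get_contents() on the rankings URL, as in the old BCS code.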
ASKER
@Terry, I tried several ways to get it but not that one ;-).
Had to adjust to this to pick up "Texas A&amp;M", which is how it is in the full HTML...
<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&amp;]+?)\s*<
@Julian, your solution works great, too.
However, there are 2 tables on the page, and both of them have the same class and their parents all have the same class names, too. I want the list from the second one only. The only thing that differentiates the 2 is a sibling element with a class of "Table2__Title" that has different text: the left one is "AP Top 25", the right one is "Coaches Poll". I want the team names from the Coaches Poll. Can that be done with the XPath query?
<section class="Table2__responsiveTable Table2__table-outer-wrap Table2__responsiveTable--hasFooter">
<div class="Table2__Title">AP Top 25</div>
<table class="Table2__table__wrapper">
<tbody>
<tr>
<td class="v-top">
<div class="Table2__shadow-container">
<div class="Table2__shadow-wrapper">
<div class="Table2__shadow--left" style="opacity:0"></div>
<div class="Table2__shadow-scroller">
<table cellpadding="0" cellspacing="0" class="Table2__table-scroll">
<tbody>
<tr>
<td>
<table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
<colgroup span="7" class="Table2__colgroup">
</colgroup>
<thead class="Table2__thead">
</thead>
<tbody class="Table2__tbody">
<!-- this tbody contains the team names in the left side -->
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
<div class="Table2__shadow--right" style="opacity:0"></div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</section>
<section class="Table2__responsiveTable Table2__table-outer-wrap Table2__responsiveTable--hasFooter">
<div class="Table2__Title">Coaches Poll</div>
<table class="Table2__table__wrapper">
<tbody>
<tr>
<td class="v-top">
<div class="Table2__shadow-container">
<div class="Table2__shadow-wrapper">
<div class="Table2__shadow--left" style="opacity:0"></div>
<div class="Table2__shadow-scroller">
<table cellpadding="0" cellspacing="0" class="Table2__table-scroll">
<tbody>
<tr>
<td>
<table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
<colgroup span="7" class="Table2__colgroup">
</colgroup>
<thead class="Table2__thead">
</thead>
<tbody class="Table2__tbody">
<!-- this tbody contains the team names in the right side -->
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
<div class="Table2__shadow--right" style="opacity:0"></div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</section>
"Can that be done with the XPath query"
It can, but if you need me to look at it I can only pick this up on Friday.
You will need to do something like
//table[2]/td/*/span ....
Find the second instance of table and then work relative to that.
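As a rough illustration of that idea, the sketch below uses PHP's DOMDocument and DOMXPath. Instead of counting tables, it selects the section whose Table2__Title div reads "Coaches Poll" and then takes the anchors inside the hide-mobile spans. The markup is a cut-down stand-in for the structure posted above, the team placement in it is made up for the example (not real rankings), and the class names are assumed to match the live page.

```php
<?php
// Cut-down stand-in for the two sections; team placement is illustrative only.
$html = <<<'HTML'
<section class="Table2__responsiveTable">
<div class="Table2__Title">AP Top 25</div>
<table><tbody><tr><td>
<span class="pl3 hide-mobile clr-link underline-hover">
<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">Clemson</a>
</span>
</td></tr></tbody></table>
</section>
<section class="Table2__responsiveTable">
<div class="Table2__Title">Coaches Poll</div>
<table><tbody><tr><td>
<span class="pl2 pr3 dn show-mobile clr-link underline-hover">
<a data-clubhouse-uid="example" href="#"><abbr title="Texas A&amp;M">TA&amp;M</abbr></a>
</span>
<span class="pl3 hide-mobile clr-link underline-hover">
<a data-clubhouse-uid="example" href="#">
Texas A&amp;M
</a>
</span>
</td></tr></tbody></table>
</section>
HTML;

// libxml may complain about HTML5 tags such as <section>; collect the warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Section whose title div says "Coaches Poll", then the full-name anchors inside it.
$query = '//section[div[contains(@class, "Table2__Title") and normalize-space() = "Coaches Poll"]]'
       . '//span[contains(@class, "hide-mobile")]/a';

$teams = [];
foreach ($xpath->query($query) as $node) {
    // textContent is already entity-decoded, so &amp; comes back as &.
    $teams[] = trim($node->textContent);
}
print_r($teams);
```

Selecting by the title text avoids depending on table order. Note that in XPath, //table[2] matches any table that is the second table child of its own parent, which is not necessarily the second table in the document, so a positional query may need to be written as (//table)[2] or adapted to the nesting.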
In case you're interested in learning a little more about regular expression patterns:
In the pattern that you tried, the \w within the square brackets matches any letter, digit, or underscore:
<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&amp;]+?)\s*<
So to match &amp; you only need to adjust the pattern to match the & and ; characters. You don't need to add the a, m and p characters (though it won't hurt), because \w already covers them. So this pattern would work too:
<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&;]+?)\s*<
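A small sketch of the difference, assuming the name appears entity-encoded in the raw HTML as the asker describes. The anchor's attribute values here are placeholders, not copied from the page. Without the semicolon in the character class nothing matches; with it, the encoded name is captured and html_entity_decode() turns it back into plain text.

```php
<?php
// Placeholder anchor: attribute values are made up; only the encoded name matters.
$html = <<<'HTML'
<a data-clubhouse-uid="example" href="#">
Texas A&amp;M
</a>
HTML;

$withoutSemicolon = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<#';
$withSemicolon    = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&;]+?)\s*<#';

// The ; in &amp; is not in the first character class, so this finds nothing.
$missed = preg_match_all($withoutSemicolon, $html, $a);

// Allowing ; lets the whole encoded name through.
preg_match_all($withSemicolon, $html, $b);

// Decode entities so the array holds "Texas A&M" rather than "Texas A&amp;M".
$teams = array_map('html_entity_decode', $b[1]);
print_r($teams);
```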
ASKER
Thanks guys, will post link to the "Playoff" scenario when I get it done...
Demo: https://regex101.com/r/vRTAqj/1