Help parsing HTML using preg_match_all

I need some help parsing an HTML table with PHP.

The table below is from the ESPN College Football final standings page at http://www.espn.com/college-football/rankings.
I want to get the team NAMES into an array so that I can loop over them. They appear to be in the nodes that look like this:

<a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
    Clemson
</a>

Here is a partial table:

<table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
    <colgroup span="7" class="Table2__colgroup">
        <col class="Table2__col">
        <col class="Table2__col">
        <col class="Table2__col">
        <col class="Table2__col">
        <col class="Table2__col">
        <col class="Table2__col">
        <col class="Table2__col">
    </colgroup>
    <thead class="Table2__thead">
        <tr class="Table2__header-row Table2__tr Table2__even">
            <th title="" class="Table2__th">RK</th>
            <th title="" class="tl Table2__th">
                <div class="tl">
                    <!-- -->Team<!-- --> 
                </div>
            </th>
            <th title="" class="Table2__th">
                <div>
                    <!-- -->REC<!-- --> 
                </div>
            </th>
            <th title="" class="Table2__th">
                <div>
                    <!-- -->PTS<!-- --> 
                </div>
            </th>
            <th title="" class="tc Table2__th">
                <div class="tc">
                    <!-- -->TREND<!-- --> 
                </div>
            </th>
            <th title="" class="tl Table2__th">
                <div class="tl">
                    <!-- --> 
                </div>
            </th>
            <th title="" class="tl Table2__th">
                <div class="tl">
                    <!-- --> 
                </div>
            </th>
        </tr>
    </thead>
    <tbody class="Table2__tbody">
        <tr class="Table2__tr Table2__tr--sm Table2__even" data-idx="0">
            <td class="tight-cell  Table2__td">1</td>
            <td class="Table2__td">
                <div class="flex justify-start">
                    <span class="tc dib pr4" style="width:20px;height:14px">
                        <a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
                            <img alt="Clemson" style="width:20px;height:20px" title="Clemson" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/228.png&amp;w=40&amp;h=40">
                        </a>
                    </span>
                    <span class="pl2 pr3 dn show-mobile clr-link underline-hover">
                        <a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
                            <abbr style="text-decoration:none" title="Clemson">
                                CLEM
                            </abbr>
                        </a>
                    </span>
                    <span class="pl3 hide-mobile clr-link underline-hover">
                        <a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
                            Clemson
                        </a>
                    </span>
                    <span class="ml2">
                        (<!-- -->64<!-- -->)
                    </span>
                </div>
            </td>
            <td class="Table2__td">
                <div class="">15-0</div>
            </td>
            <td class="Table2__td">
                <div class="">1600</div>
            </td>
            <td class="Table2__td">
                <div class="tc" style="color:#009444">
                    <svg class="icon__svg" style="height:14px;fill:#009444" viewBox="0 0 24 24">
                        <use xlink:href="#icon__arrow__up"></use>
                    </svg>
                    1
                </div>
            </td>
            <td class="tl  Table2__td">
                <div class="tl "></div>
            </td>
            <td class="tl  Table2__td">
                <div class="tl "></div>
            </td>
        </tr>
    </tbody>
</table>

I previously parsed their BCS standings using this code:

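<?php
  // Got this code from someone on StackOverflow in 2014.
  // Fetch the (now defunct) BCS standings page.
  $str = file_get_contents('http://sports.espn.go.com/ncf/BCSStandings');
  // Capture the link text of every team clubhouse link.
  preg_match_all('#<a href="http://sports\.espn\.go\.com/ncf/clubhouse\?teamId=\d{1,2700}">([^<]+)</a>#', $str, $m, PREG_PATTERN_ORDER);
  // $m[1] holds the first capture group, i.e. the team names.
  $m = $m[1];
?>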

Now, of course, the page no longer exists because there is no more BCS. I want to update this parser for the new standings.
I KNOW I can use HTMLTidy or some other tool, but I was hoping that I could do it using preg_match_all like before.

Terry Woods

Perhaps try this pattern:

<a data-clubhouse-uid[^>]*>\s*(\w[\w\s]+?)\s*<

Demo:
https://regex101.com/r/vRTAqj/1
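
For anyone following along, here is a minimal sketch of how the pattern could be dropped into preg_match_all, assuming markup like the snippet in the question. The $html sample is trimmed down and its second team row is made up for illustration; in practice the string would come from file_get_contents, as in the old BCS code.

```php
<?php
// Sample markup modeled on the question's snippet. The Ohio State row is hypothetical.
$html = <<<'HTML'
<span class="tc dib pr4">
    <a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
        <img alt="Clemson" title="Clemson" src="logo.png">
    </a>
</span>
<span class="pl3 hide-mobile clr-link underline-hover">
    <a data-clubhouse-uid="s:20~l:23~t:228" href="/college-football/team/_/id/228/clemson-tigers">
        Clemson
    </a>
</span>
<span class="pl3 hide-mobile clr-link underline-hover">
    <a data-clubhouse-uid="s:20~l:23~t:194" href="/college-football/team/_/id/194/ohio-state-buckeyes">
        Ohio State
    </a>
</span>
HTML;

// The anchor wrapping <img> is skipped: after the optional whitespace,
// the capture group must start with a word character, not "<".
preg_match_all('#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s]+?)\s*<#', $html, $m);

// $m[1] is the list of captured names, trimmed by the surrounding \s*.
$teams = $m[1];
print_r($teams);
```

The # delimiter is used only because the pattern contains no # characters; any other delimiter would do.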

Terry Woods

Note that if there are any team names with hyphens, apostrophes or other non-alpha characters then the pattern will need a little adjustment to handle that.
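
One possible adjustment, sketched here under assumptions rather than tested against the live page: instead of listing every allowed character, capture everything up to the next tag and decode entities afterwards. This is a different technique from the character-class pattern above. The team rows, IDs and hrefs below are hypothetical and chosen only to exercise punctuation.

```php
<?php
// Hypothetical rows; the IDs and hrefs are placeholders.
$html = <<<'HTML'
<a data-clubhouse-uid="s:20~l:23~t:1" href="/college-football/team/_/id/1/example-a">
    Texas A&amp;M
</a>
<a data-clubhouse-uid="s:20~l:23~t:2" href="/college-football/team/_/id/2/example-b">
    Hawai'i
</a>
<a data-clubhouse-uid="s:20~l:23~t:3" href="/college-football/team/_/id/3/example-c">
    <img alt="Logo only" src="logo.png">
</a>
HTML;

// Capture a run of non-"<" characters that starts with a non-space character.
// Anchors whose first child is another tag (img, abbr) produce no match.
preg_match_all('#<a data-clubhouse-uid[^>]*>\s*([^<\s][^<]*?)\s*<#', $html, $m);

// Decode entities such as &amp; so the array holds display text.
$teams = array_map(
    function ($name) {
        return html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    },
    $m[1]
);
print_r($teams);
```

The trade-off is that this version accepts any text inside the link, so it relies on the data-clubhouse-uid attribute alone to identify team links.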

ASKER CERTIFIED SOLUTION

Julian Hansen

This solution is only available to members.

kaufmed

You do realize that you're asking to violate Disney's Terms of Service, right?

You may not circumvent or disable any content protection system or digital rights management technology used with any Disney Service; decompile, reverse engineer, disassemble or otherwise reduce any Disney Service to a human-readable form; remove identification, copyright or other proprietary notices; or access or use any Disney Service in an unlawful or unauthorized manner or in a manner that suggests an association with our products, services or brands.

Eddie Shipman

ASKER

@kaufmed, this is just for a one-time presentation and I plan on attributing the data source.

@Julian, I'll test later today.

@Terry, yes, it misses Texas A & M. Can it be adjusted to catch the & sign?

Terry Woods

Pattern adjusted to allow ampersands:

<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<

https://regex101.com/r/vRTAqj/2

Eddie Shipman

ASKER

@Terry, I tried several ways to get it but not that one ;-).
Had to adjust to this to pick up "Texas A&M", which is how it is in the full HTML...

<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&amp;]+?)\s*<

@Julian, your solution works great, too.

However, there are 2 tables on the page, both with the same class, and their parents all have the same class name, too. I want the list from the second one only. The only thing that differentiates the two is a sibling with a class of "Table2__Title" that has a different value: the left one is "AP Top 25", the right one "Coaches Poll". I want the team names from the Coaches Poll. Can that be done with the XPath query?

<section class="Table2__responsiveTable Table2__table-outer-wrap Table2__responsiveTable--hasFooter">
    <div class="Table2__Title">AP Top 25</div>
    <table class="Table2__table__wrapper">
        <tbody>
            <tr>
                <td class="v-top">
                    <div class="Table2__shadow-container">
                        <div class="Table2__shadow-wrapper">
                            <div class="Table2__shadow--left" style="opacity:0"></div>
                            <div class="Table2__shadow-scroller">
                                <table cellpadding="0" cellspacing="0" class="Table2__table-scroll">
                                    <tbody>
                                        <tr>
                                            <td>
                                                <table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
                                                    <colgroup span="7" class="Table2__colgroup">
                                                    </colgroup>
                                                    <thead class="Table2__thead">
                                                    </thead>
                                                    <tbody class="Table2__tbody">
                                                        <!-- this tbody contains the team names in the left side -->
                                                    </tbody>
                                                </table>
                                            </td>
                                        </tr>
                                    </tbody>
                                </table>
                            </div>
                            <div class="Table2__shadow--right" style="opacity:0"></div>
                        </div>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</section>

<section class="Table2__responsiveTable Table2__table-outer-wrap Table2__responsiveTable--hasFooter">
    <div class="Table2__Title">Coaches Poll</div>
    <table class="Table2__table__wrapper">
        <tbody>
            <tr>
                <td class="v-top">
                    <div class="Table2__shadow-container">
                        <div class="Table2__shadow-wrapper">
                            <div class="Table2__shadow--left" style="opacity:0"></div>
                            <div class="Table2__shadow-scroller">
                                <table cellpadding="0" cellspacing="0" class="Table2__table-scroll">
                                    <tbody>
                                        <tr>
                                            <td>
                                                <table cellpadding="0" cellspacing="0" class="Table2__table-scroller Table2__right-aligned Table2__table">
                                                    <colgroup span="7" class="Table2__colgroup">
                                                    </colgroup>
                                                    <thead class="Table2__thead">
                                                    </thead>
                                                    <tbody class="Table2__tbody">
                                                        <!-- this tbody contains the team names in the right side -->
                                                    </tbody>
                                                </table>
                                            </td>
                                        </tr>
                                    </tbody>
                                </table>
                            </div>
                            <div class="Table2__shadow--right" style="opacity:0"></div>
                        </div>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</section>

Julian Hansen

"Can that be done with the XPath query?"

It can - but if you need me to look at it I can only pick this up on Friday.

You will need to do something like

//table[2]/td/*/span ....

Find the second instance of table and then work relative to that.
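
As a rough sketch of that idea, and not the accepted solution (which is not shown in this thread), the query could key on the title text instead of on position, since the page nests several wrapper tables inside each section. This assumes PHP's DOM extension is available; the markup is a trimmed-down version of the two sections in the question, and the team rows, uid values and hrefs are illustrative.

```php
<?php
// Trimmed-down markup modeled on the question; team rows are illustrative.
$html = <<<'HTML'
<html><body>
<section class="Table2__responsiveTable">
    <div class="Table2__Title">AP Top 25</div>
    <table class="Table2__table"><tbody class="Table2__tbody">
        <tr><td><span class="pl3 hide-mobile clr-link underline-hover">
            <a data-clubhouse-uid="x" href="/a">Clemson</a>
        </span></td></tr>
    </tbody></table>
</section>
<section class="Table2__responsiveTable">
    <div class="Table2__Title">Coaches Poll</div>
    <table class="Table2__table"><tbody class="Table2__tbody">
        <tr><td><span class="pl3 hide-mobile clr-link underline-hover">
            <a data-clubhouse-uid="y" href="/b">Alabama</a>
        </span></td></tr>
        <tr><td><span class="pl3 hide-mobile clr-link underline-hover">
            <a data-clubhouse-uid="z" href="/c">Texas A&amp;M</a>
        </span></td></tr>
    </tbody></table>
</section>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings about tags libxml does not know (e.g. section)
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pick the section whose title div reads "Coaches Poll",
// then the desktop-only name links inside it.
$query = '//section[.//div[contains(@class, "Table2__Title")]'
       . '[normalize-space(.) = "Coaches Poll"]]'
       . '//span[contains(@class, "hide-mobile")]/a';

$teams = [];
foreach ($xpath->query($query) as $a) {
    $teams[] = trim($a->textContent);   // textContent is already entity-decoded
}
print_r($teams);
```

Filtering on the hide-mobile span avoids picking up the logo link and the abbreviation link that share the same data-clubhouse-uid attribute.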

Terry Woods

In case you're interested in learning a little more about regular expression patterns:

In this pattern that you tried, the \w within the square brackets matches any letter, digit or underscore:

<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<

So to match &amp; you only need to add the & and ; characters to the character class. You don't need to list the a, m and p characters, because \w already matches them (though it won't hurt). So this pattern would work too:

<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&;]+?)\s*<
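
A small check of that claim, as a sketch: the input string is an illustrative one-row sample, not the real page, and the variable names are arbitrary.

```php
<?php
// Illustrative input: an entity-encoded name as it appears in raw page source.
$html = '<a data-clubhouse-uid="s:20~l:23~t:1" href="/x">' . "\n    Texas A&amp;M\n</a>";

$withAmp       = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&amp;]+?)\s*<#';
$withSemicolon = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&;]+?)\s*<#';
$withoutSemi   = '#<a data-clubhouse-uid[^>]*>\s*(\w[\w\s&]+?)\s*<#';

preg_match($withAmp, $html, $a);
preg_match($withSemicolon, $html, $b);

// Without ";" in the class the match cannot get past "&amp", so it fails.
$matchedWithoutSemi = preg_match($withoutSemi, $html, $c);

var_dump($a[1] === $b[1]);            // both classes capture the same text
var_dump($matchedWithoutSemi);        // number of matches for the class lacking ";"
echo html_entity_decode($a[1]), "\n"; // turn &amp; back into & for display
```

Inside square brackets each character stands on its own, so [\w\s&amp;] is simply the set \w, \s, & and ; with a, m and p listed redundantly.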

Eddie Shipman

ASKER

Thanks guys, will post link to the "Playoff" scenario when I get it done...