Solved

Grab URL from HTML Source using Regex in PHP

Posted on 2016-07-17
Last Modified: 2016-07-17
Hey Folks,

This should be an easy one for Regex junkies.

I'm looking to use PHP's preg_match_all to extract all URLs from HTML source that match a particular pattern.

Here's a snippet of the HTML in question:

<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;



You'll see in the snippet above, there are a bunch of URLs; the only ones I'm interested in are those with the "display.w3p" string inside.

The result I'm looking for after calling preg_match_all is an array of URLs as follows:
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22
etc...



If you can help me with the pattern to use, I'm all set!

Cheers,
Matt
Question by:mattratt
2 Comments
 
Accepted Solution

by:
Kim Walker earned 2000 total points
ID: 41716265
I think this might do the trick:
<?php
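// Capture any double-quoted value containing "display.w3p"; the dot is escaped so it matches a literal "."
$pattern = '/"([^"]*display\.w3p[^"]*)"/';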
$haystack  = <<<EOT
<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;
EOT;
preg_match_all($pattern, $haystack, $matches);
foreach ($matches[1] as $val) {
	echo "<p>$val</p>\n";
}
?>
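
For anyone reusing this: the pattern above grabs any double-quoted value containing display.w3p, not just href values. Below is a rough sketch of a slightly stricter variant that anchors on href=" so that other attributes are skipped. The shortened sample markup, the title attribute mentioning display.w3p, and the variable names $html and $urls are made up for illustration and are not from the original page.

```php
<?php
// Hypothetical, shortened sample input (not the full page from the question).
// The title attribute on the second link is invented to show a false positive
// that a quote-only pattern would pick up.
$html = <<<'EOT'
<a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes">back to 1901</a>
<a title="see display.w3p" href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>
<a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
<a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
EOT;

// href="   : only look inside double-quoted href attributes
// [^"]*    : any run of characters that are not a closing quote
// \.       : a literal dot, so "displayXw3p" would not match
// /i       : tolerate HREF= or Href=
$pattern = '/\bhref="([^"]*display\.w3p[^"]*)"/i';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches (including href="..."),
// $matches[1] holds just the captured URLs.
$urls = $matches[1];

foreach ($urls as $url) {
    echo $url, "\n";
}
```

This assumes the href values are always wrapped in double quotes, as they are in the snippet from the question. If the real page encodes ampersands as &amp; inside the URLs, running each result through html_entity_decode() would probably be needed before requesting them.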



Author Closing Comment

by:mattratt
ID: 41716273
I knew it'd be something simple .. keep a look out, I doubt it'll be the first time I need a hand with regexes!!

Thanks so much :-D