Grab URL from HTML Source using Regex in PHP

Hey Folks,

This should be an easy one for Regex junkies.

I'm looking to use PHP's preg_match_all to extract all URLs from HTML source that match a particular pattern.

Here's a snippet of the HTML in question:

<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;



You'll see in the snippet above, there are a bunch of URLs; the only ones I'm interested in are those with the "display.w3p" string inside.

The result I'm looking for after calling preg_match_all is an array of URLs as follows:
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22
etc...



If you can help me with the pattern to use, I'm all set!

Cheers,
Matt
Asked by: mattratt

Kim Walker (Web Programmer/Technician) commented:
I think this might do the trick:
<?php
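// Capture any double-quoted value containing the literal text "display.w3p".
// The dot is escaped so it matches "." only, not any character.
$pattern = '/"([^"]*display\.w3p[^"]*)"/';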
$haystack  = <<<EOT
<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;
EOT;
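// Search the heredoc defined above.
preg_match_all($pattern, $haystack, $matches);
// $matches[1] holds the first capture group: each URL without its surrounding quotes.
foreach ($matches[1] as $val) {
	echo "<p>" . htmlspecialchars($val) . "</p>\n";
}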
?>
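
The pattern above matches any double-quoted string containing "display.w3p", so it would also pick up a title or alt attribute that happens to mention it. If that ever matters, a slightly tighter variant could anchor the match to href attributes. This is only a sketch: the function name extractDisplayUrls and the $sample string are made up for illustration, and it assumes the href values are always double-quoted, as they are in the snippet from the question.

```php
<?php
// Hypothetical helper (not from the thread): return every double-quoted
// href value in $html that contains the literal text "display.w3p".
function extractDisplayUrls(string $html): array
{
    // \bhref\s*=\s*"  anchors the match to an href attribute
    // display\.w3p    escapes the dot so it matches a literal "."
    $pattern = '/\bhref\s*=\s*"([^"]*display\.w3p[^"]*)"/i';
    preg_match_all($pattern, $html, $matches);

    // Decode entities such as &amp; and drop duplicate URLs.
    $urls = array_map('html_entity_decode', $matches[1]);
    return array_values(array_unique($urls));
}

// Illustrative input: the first link only mentions display.w3p in its title
// attribute and should be skipped; the second link is taken from the question.
$sample = '<a title="see display.w3p" href="/other">x</a>'
        . '<a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>';

print_r(extractDisplayUrls($sample));
```

For anything messier than this (single-quoted or unquoted attributes, attributes split across lines), an HTML parser is usually a safer bet than a regex.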


 
mattratt (Author) commented:
I knew it'd be something simple. Keep a lookout, I doubt it'll be the last time I need a hand with regexes!!

Thanks so much :-D
Question has a verified solution.
