Solved

Grab URL from HTML Source using Regex in PHP

Posted on 2016-07-17
Last Modified: 2016-07-17
Hey Folks,

This should be an easy one for Regex junkies.

I'm looking to use PHP's preg_match_all to extract all URLs from HTML source that match a particular pattern.

Here's a snippet of the HTML in question:

<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;

As the snippet above shows, there are a bunch of URLs; the only ones I'm interested in are those containing the string "display.w3p".

The result I'm looking for after calling preg_match_all is an array of URLs as follows:
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22
http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22
etc...

If you can help me with the pattern to use, I'm all set!

Cheers,
Matt
Question by:mattratt
Accepted Solution

by: Kim Walker
I think this might do the trick:
<?php
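// Capture any double-quoted value containing the literal text "display.w3p".
// The dot is escaped so it matches only a real "." and not any character.
$pattern = '/"([^"]*display\.w3p[^"]*)"/';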
$haystack  = <<<EOT
<p>View HTML versions <a href="http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault">back to 1901</a></p>
<p><em>Note: The Proof Hansard is replaced by the Official Hansard when it becomes available.</em></p>
<p><em><em>Note: The results of divisions in the House of Representatives are recorded in the </em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/Live_Minutes"><em>Live Minutes</em></a><em>, where they are generally available within a few minutes of a division occurring. For previous sitting days they are available in the&nbsp;</em><a href="http://www.aph.gov.au/Parliamentary_Business/Chamber_documents/HoR/Votes_and_Proceedings"><em>Votes and Proceedings</em></a><em> which are available online shortly after the House adjourns each sitting day. For further information please contact the </em><a href="mailto:Table.Office.Reps@aph.gov.au?subject=HoR%20Divisions"><em>House of Representatives Table office</em></a><em>. </em>&nbsp;</em></p>
<p>You can also&nbsp;view <a title="Senate Hansards " href="/Parliamentary_Business/Hansard/Hanssen261110">Senate Hansard</a>.</p>

    <h3><a name="2016"></a>2016</h3>
  
                    <table width="60%" border="0" cellspacing="0" cellpadding="3">
                    <tbody>
                    <tr>
                      <th>
                        <p style="text-align: left;">Month</p>
                      </th>
                      <th>
                        <p style="text-align: left;">Date</p>
                      </th>
                    </tr><tr>
                      <td style="width: 25%;"><strong>February</strong></td>
                      <td>
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/246a269b-5745-4465-8d60-10707e9a72f2/0000%22" target="_blank">2</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/d39a0737-7c7a-4f79-8bb5-2c3ae841d1cb/0000%22" target="_blank">3</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/aaec4b82-2411-445e-9df4-a787c30c60c2/0000%22" target="_blank">4</a>
                      &nbsp;&nbsp;
                      <a href="http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id%3A%22chamber/hansardr/a8ddabc5-4310-4bda-a2fd-c9e6a54e12dc/0000%22" target="_blank">8</a>
                      &nbsp;&nbsp;
EOT;
// Search $haystack (the heredoc above); $matches[1] holds the captured URLs.
preg_match_all($pattern, $haystack, $matches);
foreach ($matches[1] as $val) {
	echo "<p>$val</p>\n";
}
?>
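
If the page might contain other quoted attributes that happen to include the search string (a title, for example), a slightly stricter variant is to anchor the pattern on href=. The sketch below is only an illustration under that assumption: the function name extractHrefsContaining and the example.com sample markup are made up for the example, and it assumes href values are always wrapped in double quotes, as they are in the snippet from the question. preg_quote() is used so that the dot in display.w3p is matched literally.

```php
<?php
// Hypothetical helper (not part of the original answer): returns every href
// value in $html that contains the literal substring $needle.
// Assumes href values are always double-quoted.
function extractHrefsContaining(string $html, string $needle): array
{
    // preg_quote() escapes regex metacharacters in $needle (such as the dot
    // in "display.w3p"); '/' is passed because it is the pattern delimiter.
    $quoted  = preg_quote($needle, '/');
    $pattern = '/href="([^"]*' . $quoted . '[^"]*)"/i';

    // preg_match_all() returns false on error; treat that as "no matches".
    if (preg_match_all($pattern, $html, $matches) === false) {
        return [];
    }

    // Index 1 holds the contents of the first capture group (the URL).
    return $matches[1];
}

// Made-up sample markup, loosely modelled on the snippet in the question.
$html = '<a href="http://example.com/summary/summary.w3p;x=1">a</a>'
      . '<a href="http://example.com/display/display.w3p;query=Id%3A%221%22" target="_blank">2</a>'
      . '<a title="display.w3p" href="/other">b</a>';

foreach (extractHrefsContaining($html, 'display.w3p') as $url) {
    echo $url, "\n";
}
```

For markup that is less regular than this (single-quoted or unquoted attributes, for instance), an HTML parser such as PHP's DOMDocument is likely to be more robust than any regex.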


Author Closing Comment

by:mattratt
ID: 41716273
I knew it'd be something simple... keep a look out, I doubt it'll be the last time I need a hand with regexes!!

Thanks so much :-D