Solved

Stripping out href links in an html page

Posted on 2004-09-21
Last Modified: 2008-03-17
Hi all,
A question if you please...

I have an HTML page, $result_html, retrieved by loading a static HTML page. The page has 3 main sections: a header, a footer and a body. Within the body section there are many hyperlinks, but one type of link is always preceded by a specific comment and followed by another specific comment, e.g.

eg1.1) <!--system_link_start--><a href="system.php"<!--system_link_end-->

The header also ends with a comment, <!--header_text_end-->, and the footer starts with <!--footer_text_start-->.

What I need to do is take the $result_html page and strip it down to the body content only, i.e. remove everything up to <!--header_text_end--> and everything after <!--footer_text_start-->. This should leave me with a variable containing just the core section of the original HTML page. I then want to extract (in order) all of the 'special' hyperlinks such as eg1.1 above, remembering that all of these links are surrounded by the <!--system_link_start--> and <!--system_link_end--> comments.

Once I have extracted all of these links I should have a variable containing something like

<a href="domain.ext/page1.php">link text 1</a>
<a href="domain.ext/page2.php">link text 2</a>
<a href="domain.ext/page3.php">link text 3</a>
<a href="domain.ext/page4.php">link text 4</a>
<a href="domain.ext/page5.php">link text 5</a>
<a href="domain.ext/page6.php">link text 6</a>

I then need to query this list to see which item number a particular link is, i.e. if I was profiling domain.ext/page4.php the result would be 4.

Cheers in advance, sorry but max 500 points as pre-set level by EE
Stu
Question by:08718712060

Assisted Solution

by:Roonaan
Roonaan earned 200 total points
<?php

$result_html = '
bstartlabla
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
<!--header_text_end-->bla
<!--system_link_start--><a href="zut">zut1</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut2</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut3</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut4</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut5</a><!--system_link_end-->blabla
<!--footer_text_start-->
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
endblabla';

/* select body part */
$token_body_start = '<!--header_text_end-->';
$token_body_end  = '<!--footer_text_start-->';
$body_start = strpos($result_html, $token_body_start) + strlen($token_body_start);
$body_end  = strpos($result_html, $token_body_end, $body_start);

$result_html = substr($result_html, $body_start, $body_end - $body_start);

/* find links */
$links = '';

$token_link_start = '<!--system_link_start-->';
$token_link_end  = '<!--system_link_end-->';

while(($link_start = strpos($result_html, $token_link_start)) !== false)
{
  $link_start += strlen($token_link_start);
  $link_end = strpos($result_html, $token_link_end, $link_start);
  if($link_end === false) break; // no closing marker: stop here, otherwise the loop would never end
  $links .= substr($result_html, $link_start, $link_end - $link_start);
  $result_html = substr($result_html, $link_end); // carry on after the link just found
}

echo htmlspecialchars($links);

?>
regards

-r-

Author Comment

by:08718712060
Roonaan,
That is mint. One thing though: how do I look up in $links which line my search link appears on?

Your script shows the links being echoed out. I need to be able to ask, for example, where ghghj.php is within the list of links.

Is this something you can also assist me with?

Cheers
S

Assisted Solution

by:Roonaan
Roonaan earned 200 total points
Instead of using $links as a string, you could use $links as an array:

$links = '';
      => $links = array();
$links .= substr($result_html, $link_start, $link_end - $link_start);
     =>  $links[] = substr($result_html, $link_start, $link_end - $link_start);

You then only have to take into account that the array index starts at 0 and not at 1, but that is easy to adjust, I suppose.
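
With the links held in an array, looking up which item a given link is could be sketched roughly as follows. This is only an illustration: the helper name link_position and the sample $links values (taken from the list in the question) are assumptions, and it presumes each array element holds one complete <a ...> tag.

```php
<?php
// Hypothetical helper (not part of the script above): returns the 1-based
// position of the first link whose markup contains $needle, or 0 if none does.
function link_position(array $links, string $needle): int
{
    foreach (array_values($links) as $index => $link) {
        // stripos() is a case-insensitive substring search
        if (stripos($link, $needle) !== false) {
            return $index + 1; // array indexes start at 0, item numbers at 1
        }
    }
    return 0;
}

// Sample data in the shape described in the question (illustrative only)
$links = array(
    '<a href="domain.ext/page1.php">link text 1</a>',
    '<a href="domain.ext/page2.php">link text 2</a>',
    '<a href="domain.ext/page3.php">link text 3</a>',
    '<a href="domain.ext/page4.php">link text 4</a>',
    '<a href="domain.ext/page5.php">link text 5</a>',
    '<a href="domain.ext/page6.php">link text 6</a>',
);

echo link_position($links, 'domain.ext/page4.php'), "\n"; // prints 4
```

Returning 0 for "not found" keeps the result an integer, so it can be stored directly; a caller that prefers true/false can test for a value greater than 0.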

As regards "where is ghghj.php", it is quite impossible to say anything on this matter because I/we don't have access to the data file you are using (or I accidentally overlooked it in your first post).

regards

-r-

Accepted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
Hi all,

:::08718712060:::

Try using RegEx:

<?php
$result_html = '
bstartlabla
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
<!--header_text_end-->bla
<!--system_link_start--><a href="zut">zut1</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut2</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut3</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut4</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut5</a><!--system_link_end-->blabla
<!--footer_text_start-->
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
endblabla';

$t = preg_match( "/<!--header_text_end-->(.*)<!--footer_text_start-->/is", $result_html, $res );  // Get the body contents

if( $t ){  // If we found content between <!--header_text_end--> and <!--footer_text_start-->, look for links in it

      // (.*?) is non-greedy, so several links on the same line are still matched one by one
      $t = preg_match_all( "/<!--system_link_start-->(.*?)<!--system_link_end-->/is", $res[1], $results, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE );

      for($i = 0; $i < $t; $i++){  // Every link found between <!--system_link_start--> and <!--system_link_end--> is now in the $results array
         echo "<strong>Link ".$i." -></strong> ".htmlspecialchars( $results[1][$i][0] )."\n<br>\n";
      }
}
else{ echo "No body content was matched!";}
?>
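
The same two-step idea (cut out the body, then collect the marked links) can also be wrapped in a function that returns a plain array. The following is a rough sketch rather than a drop-in replacement: the function name extract_system_links and the sample $html are made up for illustration, and the marker comments are assumed to appear exactly as written in the question.

```php
<?php
// Hypothetical helper: returns the marked links found in the body, in page order.
function extract_system_links(string $html): array
{
    // Step 1: keep only what lies between the header and footer markers
    if (!preg_match('/<!--header_text_end-->(.*)<!--footer_text_start-->/s', $html, $body)) {
        return array();
    }
    // Step 2: the lazy (.*?) stops at the nearest end marker, so two links
    // on the same line come back as two separate matches
    preg_match_all('/<!--system_link_start-->(.*?)<!--system_link_end-->/s', $body[1], $m);
    return $m[1];
}

$html = 'head <!--system_link_start--><a href="x.php">no show</a><!--system_link_end-->'
      . '<!--header_text_end-->'
      . '<!--system_link_start--><a href="a.php">one</a><!--system_link_end--> and '
      . '<!--system_link_start--><a href="b.php">two</a><!--system_link_end-->'
      . '<!--footer_text_start--> foot';

print_r(extract_system_links($html));
```

With the sample above, the header link is ignored and the two body links are returned at indexes 0 and 1.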

Hope that helps :)
Logan

Author Comment

by:08718712060
Hi Roonaan and nenufarloganx,
Both are great solutions, but being rubbish at all this, it is the final part I am having difficulty with: once the links are extracted to either a list or an array, I need to be able to look up a link's position within it.

--Roonaan Said Start--
As in regard to "where is ghghj.php" I need to say, it is quite impossible to say something on this matter because I/we don't have access to the datafile you are using. (or I overlooked it in your first post accidentaly)
--Roonaan Said End--

Hi Roonaan, the data source I need to look within is actually the array being compiled with links. If we go down the route of appending the found system links to an array, how can I ask where within the array link xyz.php is? The answer would be, for example, that xyz.php was found in array[5] (i.e. 5 + 1, allowing for the 0 index, means that the link I searched for is actually the 6th system link on the page).

What I aim to do is have a class which I can create and the class will be passed an array or vars of
[0] Page to be loaded and parsed (ie system_profile99.htm)
[1] link to be searched for (ie sdfhgdhfgsj.php)
The create function of the class will automatically load the page, parse it and return true or false depending on whether my link was found. If it was, it will populate $this->position, which I can then read out and store for historical analysis.

Does this make any more sense?

I appreciate your help on this guys.

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
Hi :)

Try this:

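<?php
class findLinks{
      // A method named after the class is no longer treated as a constructor in PHP 8, so it is declared as __construct
      function __construct( $page, $href, $css = "", $ces = "", $lss = "", $les = "" ){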
            $this->page = ( $page ) ? $page : false;
            $this->href = ( $href ) ? $href : false;
            if( $this->page && $this->href ){
                  $this->source = "";
                  $this->css = $css;
                  $this->ces = $ces;
                  $this->lss = $lss;
                  $this->les = $les;

                  if( $this->readFile() ){
                        $this->links = array( "Total" => 0, "Links" => array(), "Names" => array() );
                        if( $this->parseFile() ){
                              $this->result = $this->findHref();
                        }
                  }
            }
            else{ $this->error( "Missing argument!" ); }
      }
      
      function readFile(){
            $url = $this->page;  // Use the path or URL as given: urlencode() on everything after "//" would also encode the "/" separators and break remote URLs
            if( $this->source = @file_get_contents( $url ) ){
                  $RegEx = "/".preg_quote( $this->css, "/" )."(.*)".preg_quote( $this->ces, "/" )."/is";  // preg_quote() escapes any regex metacharacters in the separators
                  $this->source = ( preg_match( $RegEx, $this->source, $results ) ) ? $results[1] : "";
                  return true;
            }
            else{ $this->error( "$this->page could not be opened!" ); }
      }
      
      function parseFile(){
            if( $this->source != "" ){
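                  // [^\"']* and (.*?) keep each match inside a single link, even when several links share one line
                  $RegEx = "/".preg_quote( $this->lss, "/" )."<a\s[^>]*href=[\"']([^\"']*)[\"'][^>]*>(.*?)<\/a>".preg_quote( $this->les, "/" )."/is";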
                  $this->links["Total"] = preg_match_all( $RegEx, $this->source, $results, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE );
                  if( $this->links["Total"] > 0 ){
                        $this->links["Links"] = $results[1];
                        $this->links["Names"] = $results[2];
                        return true;
                  }
                  else{ $this->error( "I can't get links from $this->page" ); }
            }
            else{ $this->error( "I can't get file contents!" ); }
      }

      function findHref(){
            $a = "";
            $found = false;
            for( $i = 0; $i < count( $this->links["Links"] ); $i++ ){
                  // stripos() is a plain case-insensitive substring test, so characters such as "." in the href are not read as regex syntax
                  if( stripos( $this->links["Links"][$i][0], $this->href ) !== false ){
                        $found = true;
                        $a .= "<strong>$this->href</strong> found in <strong>$this->page</strong>:";
                        $a .= "<ul><li>link position = $i</li><li>link name = ".$this->links["Names"][$i][0]."</li><li>links to = ".$this->links["Links"][$i][0]."</li></ul>\n";
                  }
            }
            if( !$found ){ $this->error( "No match" ); }  // Report an error only after every link has been checked
            return $a;
      }

      function error( $strError ){
            die( "<p align=\"center\"><br /><br /><strong><font color=\"#FF0000\">Error<br />&lt;</font> ".$strError."<font color=\"#FF0000\"> &gt;</font></strong></p>" );
            exit();
      }
}

$css = "<!--header_text_end-->";      // Content Start Separator
$ces = "<!--footer_text_start-->";      // Content Ending Separator
$lss = "<!--system_link_start-->";      // Link Start Separator
$les = "<!--system_link_end-->";      // Link Ending Separator

$a = new findLinks( "myfile.php", "zut", $css, $ces, $lss, $les );      //  findLinks( Page to be parsed, link to be searched, [Separators] )
echo $a->result;
?>

:::: myfile.php START ::::

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>

<body>
bstartlabla
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
<!--header_text_end-->bla
<!--system_link_start--><a href="zut">zut1</a><!--system_link_end-->blabla
<!--system_link_start--><a href='zut'>zut2</a><!--system_link_end-->blabla
<!--system_link_start--><a href='zut'>zut3</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut4</a><!--system_link_end-->blabla
<!--system_link_start--><a href="zut">zut5</a><!--system_link_end-->blabla
<!--footer_text_start-->
<!--system_link_start--><a href="zut">no show</a><!--system_link_end-->
endblabla
</body>
</html>

:::: myfile.php END ::::

Hope that helps :)

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
Btw,

$a->links["Links"] contains the data for all matched links
$a->links["Links"][$i][0] contains the href of one matched link, where $i is the "link position" number shown on screen

And of course, the same goes for $a->links["Names"], which contains the anchor text
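
To make the shape of that data concrete, here is a small standalone sketch of what preg_match_all() hands back when PREG_OFFSET_CAPTURE is set: every captured value is a two-element array of the matched text and its byte offset. The sample $source string and the simplified pattern are assumptions for illustration, not the exact ones used by the class.

```php
<?php
// Illustrative input: two links on one line
$source = '<a href="one.php">First</a> <a href="two.php">Second</a>';

// With PREG_OFFSET_CAPTURE each match is a pair: array(text, byte offset)
preg_match_all(
    '/<a\s[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)<\/a>/i',
    $source,
    $results,
    PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE
);

$hrefs = $results[1]; // first capture group: what the class keeps in links["Links"]
$names = $results[2]; // second capture group: what the class keeps in links["Names"]

echo $hrefs[1][0], "\n"; // href text of the second link
echo $hrefs[1][1], "\n"; // byte offset of that href within $source
echo $names[1][0], "\n"; // anchor text of the second link
```

The trailing [0] is therefore always needed to get at the text itself; [1] is only useful if the location inside the page matters.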

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
That class works on both local and remote files (ie: http, https, ftp...)
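
One caveat worth sketching: file_get_contents() only reads URLs when the allow_url_fopen setting is enabled in php.ini, so remote loading can fail on some hosts. The helper below is a hypothetical wrapper (the name load_page is not from the class) that checks for this before trying.

```php
<?php
// Hypothetical helper: load a page from a local path or a URL.
// Returns the contents as a string, or false when the page cannot be read.
function load_page(string $page)
{
    // Anything starting with "scheme://" is treated as a remote URL
    $isUrl = (bool) preg_match('#^[a-z][a-z0-9+.-]*://#i', $page);

    if ($isUrl && !ini_get('allow_url_fopen')) {
        return false; // URL wrappers are switched off in php.ini
    }
    if (!$isUrl && !is_readable($page)) {
        return false; // local file is missing or not readable
    }
    return @file_get_contents($page);
}

$html = load_page('myfile.php');
if ($html === false) {
    echo "Page could not be loaded\n";
}
```

Comparing with === false matters here, because an empty page is a valid result and would otherwise be mistaken for a failure.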

Author Comment

by:08718712060
nenufarloganx you are just highlighting my lack of knowledge now! :-)

The class you have provided is fantastic and high marks are guaranteed along with points, however if I may be cheeky I would like to dip into your knowledge banks one more time if possible?

The class produces a list of links, which is fantastic, and the fact that it works with http:// etc. is even better, as it saves me passing relative links. However, is there any way that it can search for a partial match? All the links are formatted by content-ref, i.e. j-system-12kk.php or k-local-ttg.php etc. If I could search for a partial entry such as 'local' or 'system', that would be great. It is only the FIRST occurrence we need to note; the script can bail out after finding the first position of a matching link.

ie.) $a = new findLinks( "myfile.php", "system", $css, $ces, $lss, $les );

This would find the first link containing the string "system"; once its position is stored, that is all we need.

If you can help with this bit I promise there are no more additions :-)

Cheers in advance
Stu

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
No problem at all about matching full or a part of the link :)

>> Would find the first link containing the string system in it and once storing the position that is all we need.

umh... The first (partial)match will be always at position 1 (array[0])... Do you need the link's relative position in document?

ie:
::::::::::::::::::
$a = new findLinks( "myfile.php", "local", $css, $ces, $lss, $les );

i-system-12kk.php
j-system-12kk.php
k-local-ttg.php
l-system-12kk.php
m-local-ttg.php

RESULTS: First link containing "local" was matched at position 3 (array[2])
::::::::::::::::::
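
The example above can be worked through in a few lines. This is a sketch only: the $hrefs array simply repeats the five sample links, and the variable names are made up for illustration.

```php
<?php
// The five sample links from the example, in page order
$hrefs = array(
    'i-system-12kk.php',
    'j-system-12kk.php',
    'k-local-ttg.php',
    'l-system-12kk.php',
    'm-local-ttg.php',
);
$needle = 'local';

// Collect the 1-based position of every link that contains $needle
$positions = array();
foreach ($hrefs as $index => $href) {
    if (stripos($href, $needle) !== false) {
        $positions[] = $index + 1;
    }
}

// The first entry is the first occurrence; 0 means "not found"
$first = $positions ? $positions[0] : 0;

echo "First link containing \"$needle\" is at position $first (array[" . ($first - 1) . "])\n";
```

For this data $positions holds 3 and 5, and $first is 3, which matches the result stated above.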

Author Comment

by:08718712060
Yes,
That is exactly it. Please reassure me it can be done at a level that even my rubbish PHP capabilities will support?

Cheers hopefully in advance
S

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
Hi :)

Try this:

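<?php
class findLinks{
     // A method named after the class is no longer treated as a constructor in PHP 8, so it is declared as __construct
     function __construct( $page, $href, $css = "", $ces = "", $lss = "", $les = "", $first = true ){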
          $this->page = ( $page ) ? $page : false;
          $this->href = ( $href ) ? $href : false;
        $this->first = $first;
          if( $this->page && $this->href ){
               $this->source = "";
               $this->css = $css;
               $this->ces = $ces;
               $this->lss = $lss;
               $this->les = $les;

               if( $this->readFile() ){
                    $this->links = array( "Total" => 0, "Links" => array(), "Names" => array() );
                    if( $this->parseFile() ){
                         $this->result = $this->findHref();
                    }
               }
          }
          else{ $this->error( "Missing argument!" ); }
     }
     
     function readFile(){
          $url = $this->page;  // Use the path or URL as given: urlencode() on everything after "//" would also encode the "/" separators and break remote URLs
          if( $this->source = @file_get_contents( $url ) ){
               $RegEx = "/".preg_quote( $this->css, "/" )."(.*)".preg_quote( $this->ces, "/" )."/is";  // preg_quote() escapes any regex metacharacters in the separators
               $this->source = ( preg_match( $RegEx, $this->source, $results ) ) ? $results[1] : "";
               return true;
          }
          else{ $this->error( "$this->page could not be opened!" ); }
     }
     
     function parseFile(){
          if( $this->source != "" ){
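               // [^\"']* and (.*?) keep each match inside a single link, even when several links share one line
               $RegEx = "/".preg_quote( $this->lss, "/" )."<a\s[^>]*href=[\"']([^\"']*)[\"'][^>]*>(.*?)<\/a>".preg_quote( $this->les, "/" )."/is";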
               $this->links["Total"] = preg_match_all( $RegEx, $this->source, $results, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE );
               if( $this->links["Total"] > 0 ){
                    $this->links["Links"] = $results[1];
                    $this->links["Names"] = $results[2];
                    return true;
               }
               else{ $this->error( "I can't get links from $this->page" ); }
          }
          else{ $this->error( "I can't get file contents!" ); }
     }

     function findHref(){
          $a = "";
          $error = true;
          for( $i = 0; $i < count( $this->links["Links"] ); $i++ ){
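            if( stripos( $this->links["Links"][$i][0], $this->href ) !== false ){  // eregi() was removed in PHP 7; stripos() does the same case-insensitive partial match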
                  $a .= "<strong>$this->href</strong> found in <strong>$this->page</strong>:";
                          $a .= "<ul><li>link position = ".intval( $i + 1 )."</li><li>array position = $i</li><li>link name = ".$this->links["Names"][$i][0]."</li><li>links to = ".$this->links["Links"][$i][0]."</li></ul>\n";
                  $error = false;
                  if( $this->first ){
                        $a = "First occurrence of ".$a;
                        break;
                  }
            }
          }
        if( $error === true ){ $this->error( "No match" ); }
          return $a;
     }

     function error( $strError ){
          die( "<p align=\"center\"><br /><br /><strong><font color=\"#FF0000\">Error<br />&lt;</font> ".$strError."<font color=\"#FF0000\"> &gt;</font></strong></p>" );
          exit();
     }
}

$css = "<!--header_text_end-->";     // Content Start Separator
$ces = "<!--footer_text_start-->";     // Content Ending Separator
$lss = "<!--system_link_start-->";     // Link Start Separator
$les = "<!--system_link_end-->";     // Link Ending Separator

$a = new findLinks( "myfile.php", "m-local-ttg.php", $css, $ces, $lss, $les, true );     //  findLinks( Page to be parsed, link to be searched, [Separators], $tipeOfMatch )
echo $a->result;
?>

$tipeOfMatch (the $first argument of the constructor) is set to true by default, which means that only the first occurrence will be shown. If it is set to false, all matches will be shown :)

Note: By default, the class matches PARTIAL contents. If you need an option for full matches only, let me know.
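
For what it is worth, a full-match option could look something like the sketch below. It is not taken from the class: href_matches and its $exact flag are hypothetical names, and "full match" is assumed to mean a case-insensitive comparison of the whole href.

```php
<?php
// Hypothetical helper: decide whether $href matches $needle.
// $exact = false: $needle may appear anywhere in $href (partial match)
// $exact = true:  $href must equal $needle, ignoring case (full match)
function href_matches(string $href, string $needle, bool $exact = false): bool
{
    if ($exact) {
        return strcasecmp($href, $needle) === 0;
    }
    return stripos($href, $needle) !== false;
}

var_dump(href_matches('m-local-ttg.php', 'local'));                 // bool(true)
var_dump(href_matches('m-local-ttg.php', 'local', true));           // bool(false)
var_dump(href_matches('m-local-ttg.php', 'M-LOCAL-TTG.PHP', true)); // bool(true)
```

A test like this could stand in for the comparison inside findHref(), with the flag passed through the constructor in the same way as the separators.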

Hope that helps :)

Logan

Assisted Solution

by:nenufarloganx
nenufarloganx earned 300 total points
Hi again,

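Use this instead:

<?php
class findLinks{
     // A method named after the class is no longer treated as a constructor in PHP 8, so it is declared as __construct
     function __construct( $page, $href, $css = "", $ces = "", $lss = "", $les = "" ){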
          $this->page = ( $page ) ? $page : false;
          $this->href = ( $href ) ? $href : false;
          if( $this->page && $this->href ){
               $this->source = "";
               $this->css = $css;
               $this->ces = $ces;
               $this->lss = $lss;
               $this->les = $les;

               if( $this->readFile() ){
                    $this->links = array( "Total" => 0, "Links" => array(), "Names" => array() );
                    if( $this->parseFile() ){
                                     $this->first = false;
                                     $r = $this->findHref();
                         $this->all = $r[0];
                                     $this->first =  $r[1];
                    }
               }
          }
          else{ $this->error( "Missing argument!" ); }
     }
     
     function readFile(){
          $url = $this->page;  // Use the path or URL as given: urlencode() on everything after "//" would also encode the "/" separators and break remote URLs
          if( $this->source = @file_get_contents( $url ) ){
               $RegEx = "/".preg_quote( $this->css, "/" )."(.*)".preg_quote( $this->ces, "/" )."/is";  // preg_quote() escapes any regex metacharacters in the separators
               $this->source = ( preg_match( $RegEx, $this->source, $results ) ) ? $results[1] : "";
               return true;
          }
          else{ $this->error( "$this->page could not be opened!" ); }
     }
     
     function parseFile(){
          if( $this->source != "" ){
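               // [^\"']* and (.*?) keep each match inside a single link, even when several links share one line
               $RegEx = "/".preg_quote( $this->lss, "/" )."<a\s[^>]*href=[\"']([^\"']*)[\"'][^>]*>(.*?)<\/a>".preg_quote( $this->les, "/" )."/is";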
               $this->links["Total"] = preg_match_all( $RegEx, $this->source, $results, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE );
               if( $this->links["Total"] > 0 ){
                    $this->links["Links"] = $results[1];
                    $this->links["Names"] = $results[2];
                    return true;
               }
               else{ $this->error( "I can't get links from $this->page" ); }
          }
          else{ $this->error( "I can't get file contents!" ); }
     }

     function findHref(){
          $a = array( "", "" );  // [0] = all matches, [1] = first match only
          $error = true;
          $header = "<strong>$this->href</strong> found in <strong>$this->page</strong>:";
          for( $i = 0; $i < count( $this->links["Links"] ); $i++ ){
               // eregi() was removed in PHP 7; stripos() does the same case-insensitive partial match
               if( stripos( $this->links["Links"][$i][0], $this->href ) !== false ){
                    $a[0] .= "<ul><li>link position = ".intval( $i + 1 )."</li><li>array position = $i</li><li>link name = ".$this->links["Names"][$i][0]."</li><li>links to = ".$this->links["Links"][$i][0]."</li></ul>\n";
                    if( $this->first === false ){  // Still false means this is the first match
                         $this->first = true;
                         $a[1] = "First occurrence of ".$header.$a[0];
                    }
                    $error = false;
               }
          }
          if( $error === true ){ $this->error( "No match" ); }
          $a[0] = $header.$a[0];
          return $a;
     }

     function error( $strError ){
          die( "<p align=\"center\"><br /><br /><strong><font color=\"#FF0000\">Error<br />&lt;</font> ".$strError."<font color=\"#FF0000\"> &gt;</font></strong></p>" );
          exit();
     }
}

$css = "<!--header_text_end-->";     // Content Start Separator
$ces = "<!--footer_text_start-->";     // Content Ending Separator
$lss = "<!--system_link_start-->";     // Link Start Separator
$les = "<!--system_link_end-->";     // Link Ending Separator

$a = new findLinks( "myfile.php", "system", $css, $ces, $lss, $les );     //  findLinks( Page to be parsed, link to be searched, [Separators] )
echo $a->all;
echo $a->first;
?>

Without the $tipeOfMatch param you get two vars:
To show all matches: $a->all
To show first match: $a->first

Regards :)
Logan