Rewriting href, src and action

OK, here is what I am doing:

I am writing a page that fetches another page for caching purposes.

What I want to do is this:

Replace any <img src="/relative.link"> with <img src="http://mysite.com/image.php?site=http://theirdomain.com/relative.link">

I also want to do this with href and action.

There may be other HTML attributes between the start of the tag and the attribute.

The quoting may also differ: the value may be in double quotes, single quotes, or not quoted at all.

I also want href and action to go to their corresponding pages, /htmlpage.php and /formpage.php.

If possible, I would like to do this with regular expressions.

The domain will be stored in $domain, without a trailing '/'.

Thank you if you can help!
askanthonysAsked:
 
CaveyCoUkCommented:
Ok, if you look at

http://prdownloads.sourceforge.net/pageforward/pf1.5b2.zip?download 

this function -

function completeURLs($HTML, $url){
      $URI_PARTS = parseURL($url);
      $path = trim($URI_PARTS["path"], "/");
      $host_url = trim($URI_PARTS["host"], "/");
      
      //$host = $URI_PARTS["scheme"]."://".trim($URI_PARTS["host"], "/")."/".$path; //ORIGINAL
      $host = $URI_PARTS["scheme"]."://".$host_url."/".$path."/";
      $host_no_path = $URI_PARTS["scheme"]."://".$host_url."/";
      
      //Proxifies local META redirects
      $HTML = preg_replace('@<META HTTP-EQUIV(.*)URL=/@', "<META HTTP-EQUIV\$1URL=".$_SERVER['PHP_SELF']."?url=".$host_no_path, $HTML);
      
      //Make sure the host doesn't end in '//'
      $host = rtrim($host, '/')."/";
      
      //Replace '//' with 'http://'
      $pattern = "#(?<=\"|'|=)\/\/#"; //the '|=' is experimental as it's probably not necessary
      $HTML = preg_replace($pattern, "http://", $HTML);
      
      //Fully qualifies '"/'
      $HTML = preg_replace("#\"\/#", "\"".$host, $HTML);
      
      //Fully qualifies "'/"
      $HTML = preg_replace("#'\/#", "'".$host, $HTML);
      
      //Matches [src|href|background|action]="/ because in the following pattern the '/' shouldn't stay
      $HTML = preg_replace("#(src|href|background|action)(=\"|='|=(?!'|\"))\/#i", "\$1\$2".$host_no_path, $HTML);
      $HTML = preg_replace("#(href|src|background|action)(=\"|=(?!'|\")|=')(?!http|ftp|https|\"|'|javascript:|mailto:)#i", "\$1\$2".$host, $HTML);
      
      //Points all form actions back to the proxy
      $HTML = preg_replace('/<form.+?action=\s*(["\']?)([^>\s"\']+)\\1[^>]*>/i', "<form action=\"{$_SERVER['PHP_SELF']}\"><input type=\"hidden\" name=\"original_url\" value=\"$2\">", $HTML);
      
      //Matches '/[any assortment of chars or nums]/../'
      $HTML = preg_replace("#\/(\w*?)\/\.\.\/(.*?)>#ims", "/\$2>", $HTML);
      
      //Matches '/./'
      $HTML = preg_replace("#\/\.\/(.*?)>#ims", "/\$1>", $HTML);

      //Handles CSS2 imports
      if (strpos($HTML, "import url(\"http") === false && strpos($HTML, "import \"http") === false && strpos($HTML, "import url(\"www") === false && strpos($HTML, "import \"www") === false) {
            $pattern = "#import .(.*?).;#ims";
            $mainurl = substr($host, 0, strnpos($host, "/", 3));
            $replace = "import '".$mainurl."\$1';";
            $HTML = preg_replace($pattern, $replace, $HTML);
      }
            
      return $HTML;
}

takes the file contents in as $HTML and then changes all the links into fully qualified links such that their domain is always in the link.  Then this function -

function proxyURLs($HTML){
      $edited_tag = "PF"; //used to check if the link has already been modified by the proxy
      
      //BASE tag needs to be removed for sites like yahoo.com
      //OR make the proxy insert the FULL URL to itself
      $pattern = "#\<base(.*?)\>#ims";
      $replacement = "<!-- <base\$1> -->"; //comment it out for now//
      $HTML = preg_replace($pattern, $replacement, $HTML);
      
      //edit <link tags so that 'edited="$edit_tag" ' is just before 'href'
      $HTML = preg_replace("#\<link(.*?)(\shref=)#ims", "<link\$1 edited=\"".$edited_tag."\"\$2", $HTML);
      
      //matches everything with an </a> after it on the same line....fails to match when that is on another line.
      $pattern = "#(?<!edited=\"".$edited_tag."\"\s)(href='|href=\"|href=(?!'|\"))(?=(.+)\</a\>)(?!mailto:|http://ftp|ftp|javascript:|'|\")#ims";
      $HTML = preg_replace($pattern, "edited=\"".$edited_tag."\" \$1".$_SERVER['PHP_SELF'].'?url=', $HTML);
      
      return $HTML;
}

takes every link in the page (again, as $HTML) and prepends a link to the current proxy.  Finally, this section down the bottom -

      if ($proxify_media) {
            $pattern = '/src=\s*(["\']?)([^>\s"\']+)\\1[^>]*>/i';
            $replace = "src=\"{$_SERVER['PHP_SELF']}?url=$2\">";
            $HTML = preg_replace($pattern, $replace, $HTML);
      }

does the same thing for images and scripts (as their tags contain src= parts).  All you need to do is fetch the target page into a string, pass it through the top two functions, and then apply the last chunk of code to the result.
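If you only want the three attributes from the question rewritten, something along these lines may be enough. It is a rough sketch and not part of PageForward: absolutize() and rewriteAttributes() are made-up names, the ?site= parameter for htmlpage.php and formpage.php is assumed to match the one given for image.php, and links that are not root-relative are simply treated as relative to the site root. Links that already carry a domain only get the proxy prefix. It needs PHP 5.3 or later for the closures.

```php
<?php
// Hypothetical helper: resolve $url against $domain (e.g. "http://theirdomain.com",
// no trailing slash). Returns null for links that should be left untouched.
function absolutize($url, $domain)
{
    if (preg_match('#^https?://#i', $url)) {
        return $url;                       // already has the domain in it
    }
    if ($url === '' || $url[0] === '#' || preg_match('#^[a-z][a-z0-9+.-]*:#i', $url)) {
        return null;                       // empty, fragment, mailto:, javascript:, ...
    }
    if (strpos($url, '//') === 0) {
        return 'http:' . $url;             // protocol-relative
    }
    if ($url[0] === '/') {
        return $domain . $url;             // root-relative
    }
    // Assumption: everything else is treated as relative to the site root.
    return $domain . '/' . $url;
}

// Hypothetical helper: point src, href and action at the proxy pages on $mysite.
function rewriteAttributes($html, $domain, $mysite)
{
    $targets = array(
        'src'    => $mysite . '/image.php?site=',
        'href'   => $mysite . '/htmlpage.php?site=',   // parameter name assumed
        'action' => $mysite . '/formpage.php?site=',   // parameter name assumed
    );
    // Value may be double-quoted (group 3), single-quoted (4) or unquoted (5).
    $attr = '#(\s)(src|href|action)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>"\']+))#i';

    $rewriteAttr = function ($m) use ($targets, $domain) {
        $value = '';
        foreach (array(3, 4, 5) as $i) {   // groups that did not match are empty or missing
            if (isset($m[$i]) && $m[$i] !== '') {
                $value = $m[$i];
                break;
            }
        }
        $absolute = absolutize($value, $domain);
        if ($absolute === null) {
            return $m[0];                  // leave the attribute as it was
        }
        // The value is appended as-is to match the format in the question;
        // urlencode($absolute) would be safer if it contains & or #.
        return $m[1] . $m[2] . '="' . $targets[strtolower($m[2])] . $absolute . '"';
    };

    // Only look inside tags, so attributes before the one we want do not matter.
    return preg_replace_callback(
        '#<[a-z][^>]*>#i',
        function ($tag) use ($attr, $rewriteAttr) {
            return preg_replace_callback($attr, $rewriteAttr, $tag[0]);
        },
        $html
    );
}
```

You would call it as rewriteAttributes($HTML, $domain, 'http://mysite.com'). Being regex based, it will trip over a '>' inside an attribute value or over text such as src= inside another attribute, so treat it as a starting point only.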
 
askanthonysAuthor Commented:
Also, they may already have their domain in the link, in which case I would only want to prepend http://mysite.com/image.php?site=
 
CaveyCoUkCommented:
I am sure there was a quick script on php.net but I can't find it.  A quick Google search brings up something called PageForward

http://joshdick.net/index.php?section=creations

which led me to

http://sbp.sf.net/

It seems to do what you want but doesn't cache the pages.  Given it is written in PHP, it should be easily customisable using the output buffering functions: ob_start, ob_get_contents, ob_end_clean and ob_flush...
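As a rough idea of how output buffering could be used for that, here is a minimal sketch. renderWithCache() is a made-up name and not something from either project; it just captures whatever a callback echoes and keeps a copy on disk. file_put_contents needs PHP 5.

```php
<?php
// Hypothetical helper: run $render (any callable that echoes a page), return
// what it printed and store a copy of it in $cacheFile.
function renderWithCache($cacheFile, $render)
{
    ob_start();                        // start capturing everything that is echoed
    call_user_func($render);
    $page = ob_get_contents();         // copy the captured output
    ob_end_clean();                    // stop buffering and discard the buffer
    file_put_contents($cacheFile, $page);
    return $page;
}
```

The caller then decides whether to echo the returned string or serve the cached file on the next request.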
 
askanthonysAuthor Commented:
I looked through these files and it just seems to be an endless amount of includes and stuff

I couldn't easily find what I was looking for...

If you could tell me where exactly this code happens to be I would be grateful

and just ignore the cache thing... I decided that would put too much stress on my server
 
CaveyCoUkCommented:
I don't quite understand

PageForward is here - http://prdownloads.sourceforge.net/pageforward/pf1.5b2.zip?download 

It's one file; you need php4-curl installed (it's a PHP module) for it to work.  If you open the file, look for "$proxify_media = false" and change the false to true; it should then also do the rewriting for images and script files.

If you want it to cache files, find

//**END USER CONFIG**

and add

//Cache files
$cache_files = true;
$cache_dir = "./cache";
//**END USER CONFIG**

then find

if(!$form_submission){

and add

if ($cache_files){
      if ( is_file($cache_dir."/".urlencode($url)) ){
            if ( filemtime($cache_dir."/".urlencode($url)) > (time() - (5 * 60)) ){
                  readfile($cache_dir."/".urlencode($url));
                  exit();
            }
      }
}
if(!$form_submission){

(change the 5 to however many minutes you want to pass before files are fetched again; set it to 10 and it won't re-fetch for 10 minutes) and finally find

      echo ("\n<!-- PageForward v1.5b2 took $duration seconds to construct this page.-->");

and add

      if ($cache_files){
            $fp = fopen($cache_dir."/".urlencode($url),"w+b");
            fwrite($fp,$HTML);
            fclose($fp);
      }
      echo ("\n<!-- PageForward v1.5b2 took $duration seconds to construct this page.-->");

It should then cache all the files into that directory.  Note that this could soon become quite a collection and will probably need to be cleaned out on a regular basis.
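For the cleaning out, a small function like the one below could be run from cron or called every so often. This is only a sketch: pruneCache() is a made-up name, and it assumes the cache directory contains nothing but cached pages.

```php
<?php
// Hypothetical helper: delete files in $cacheDir that were last modified more
// than $maxAge seconds ago. Returns the number of files removed.
function pruneCache($cacheDir, $maxAge)
{
    $removed = 0;
    $files = glob(rtrim($cacheDir, '/') . '/*');
    if ($files === false) {
        return 0;                      // glob() failed, nothing to do
    }
    foreach ($files as $file) {
        if (is_file($file) && filemtime($file) < time() - $maxAge && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}
```

For example, pruneCache("./cache", 24 * 60 * 60) would drop anything older than a day.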
 
askanthonysAuthor Commented:
I am not worried about caching files...

I would just like to rewrite the <a href> <img src> and the <form action> tags
 
askanthonysAuthor Commented:
thank you!
Question has a verified solution.