Rewriting href, src and action

OK, here is what I am doing:

I am writing a page that fetches another page for caching purposes...

What I want to do is this:

replace any <img src="/"> with <img src="">

I also want to do this with href and action.

There may be other HTML attributes between the start of the tag and the attribute.

The quoting may also vary: double quotes, single quotes, or none at all.

I also want them to go to their corresponding pages, like /htmlpage.php and /formpage.php.

If possible, I would like to use regular expressions.

The domain will be stored in $domain without a trailing '/'.

Thank you if you can help!

askanthonys (Author) commented:
Also, some links may already have their domain in them, in which case I would only want to prepend
I am sure there was a quick script on but I can't find it.  A quick Google search brings up something called PageForward

which led me to

It seems to do what you want but doesn't cache the pages.  Since it is written in PHP, it should be easy to customise using the output buffering functions ob_start, ob_get_contents, ob_end_clean (or ob_end_flush) and ob_flush...
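As a rough illustration of the output buffering idea, here is a minimal sketch. The function name cached_output and its parameters are hypothetical and not part of PageForward; it assumes the page is produced by a callback that echoes its HTML.

```php
<?php
// Hypothetical helper, not part of PageForward.
// Returns the cached copy if it is younger than $maxAge seconds; otherwise
// runs $render, captures whatever it echoes, and stores that in $cacheFile.
function cached_output(string $cacheFile, int $maxAge, callable $render): string
{
    if (is_file($cacheFile) && filemtime($cacheFile) > time() - $maxAge) {
        return file_get_contents($cacheFile);
    }
    ob_start();              // start capturing output
    $render();               // the callback echoes the page
    $html = ob_get_clean();  // fetch the captured output and end the buffer
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}
```

Calling it twice in a row with the same cache file returns the stored copy the second time instead of running the callback again.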
askanthonys (Author) commented:
I looked through these files and it just seems to be an endless amount of includes and stuff.

I couldn't easily find what I was looking for...

If you could tell me exactly where this code is, I would be grateful.

And just ignore the cache thing... I decided that would put too much stress on my server.

I don't quite understand.

Page forward is here - 

It's one file; you need php4-curl installed (it's a PHP module) for it to work.  If you open the file, look for "$proxify_media = false" and change the false to true; it should then also do the rewriting for images and script files.
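Since the script depends on the cURL extension, fetching the remote page presumably looks something like the sketch below. fetch_page is a hypothetical name rather than a PageForward function, and the redirect and timeout settings are illustrative choices.

```php
<?php
// Hypothetical helper, not part of PageForward. Requires the cURL extension.
// Downloads $url and returns the response body as a string.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Fetch failed: ' . curl_error($ch));
    }
    return $body;
}
```

The returned string is what the rewriting functions would then receive as $HTML.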

If you want it to cache files, find


and add

//Cache files
$cache_files = true;
$cache_dir = "./cache";

then find


and add

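if ($cache_files && is_file($cache_dir."/".urlencode($url))){
      if ( filemtime($cache_dir."/".urlencode($url)) > (time() - (5 * 60)) ){
            readfile($cache_dir."/".urlencode($url)); exit; //assumed completion: the original snippet was cut off here
      }
}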

(Change the 5 to however many minutes you want to wait before it re-fetches a file; set it to 10 and it won't bother for 10 minutes.) Finally, find

      echo ("\n<!-- PageForward v1.5b2 took $duration seconds to construct this page.-->");

and add

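      if ($cache_files){
            $fp = fopen($cache_dir."/".urlencode($url),"w+b");
            fwrite($fp, $HTML); //assumed completion: the snippet was cut off here, and $HTML is assumed to hold the page
            fclose($fp);
      }
      echo ("\n<!-- PageForward v1.5b2 took $duration seconds to construct this page.-->");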

It should then cache all the files into that directory.  Note that this could soon become quite a collection and will probably need to be cleaned out on a regular basis.
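One way to do that clean-up is a small script run from cron. This is only a sketch: prune_cache is a hypothetical name, and it assumes the cache directory holds nothing but cached pages.

```php
<?php
// Hypothetical helper, not part of PageForward.
// Deletes files in $cacheDir last modified more than $maxAge seconds ago
// and returns how many files were removed.
function prune_cache(string $cacheDir, int $maxAge): int
{
    clearstatcache(); // make sure modification times are fresh
    $removed = 0;
    foreach (glob($cacheDir . '/*') ?: [] as $file) {
        if (is_file($file) && filemtime($file) < time() - $maxAge && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}
```

For example, prune_cache("./cache", 24 * 60 * 60) would remove everything older than a day.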
askanthonys (Author) commented:
I am not worried about caching files...

I would just like to rewrite the <a href>, <img src> and <form action> attributes.
Ok, if you look at 

this function -

function completeURLs($HTML, $url){
      //parseURL() is not a PHP built-in; presumably a helper defined elsewhere in the file
      $URI_PARTS = parseURL($url);
      $path = trim($URI_PARTS["path"], "/");
      $host_url = trim($URI_PARTS["host"], "/");
      //$host = $URI_PARTS["scheme"]."://".trim($URI_PARTS["host"], "/")."/".$path; //ORIGINAL
      $host = $URI_PARTS["scheme"]."://".$host_url."/".$path."/";
      $host_no_path = $URI_PARTS["scheme"]."://".$host_url."/";
      //Proxifies local META redirects
      $HTML = preg_replace('@<META HTTP-EQUIV(.*)URL=/@', "<META HTTP-EQUIV\$1URL=".$_SERVER['PHP_SELF']."?url=".$host_no_path, $HTML);
      //Make sure the host doesn't end in '//'
      $host = rtrim($host, '/')."/";
      //Replace '//' with 'http://'
      $pattern = "#(?<=\"|'|=)\/\/#"; //the '|=' is experimental as it's probably not necessary
      $HTML = preg_replace($pattern, "http://", $HTML);
      //Fully qualifies '"/'
      $HTML = preg_replace("#\"\/#", "\"".$host, $HTML);
      //Fully qualifies "'/" (the replacement must be a plain quote: "\'" would insert a literal backslash)
      $HTML = preg_replace("#\'\/#", "'".$host, $HTML);
      //Matches [src|href|background|action]="/ because in the following pattern the '/' shouldn't stay
      $HTML = preg_replace("#(src|href|background|action)(=\"|='|=(?!'|\"))\/#i", "\$1\$2".$host_no_path, $HTML);
      $HTML = preg_replace("#(href|src|background|action)(=\"|=(?!'|\")|=')(?!http|ftp|https|\"|'|javascript:|mailto:)#i", "\$1\$2".$host, $HTML);
      //Points all form actions back to the proxy
      $HTML = preg_replace('/<form.+?action=\s*(["\']?)([^>\s"\']+)\\1[^>]*>/i', "<form action=\"{$_SERVER['PHP_SELF']}\"><input type=\"hidden\" name=\"original_url\" value=\"$2\">", $HTML);
      //Matches '/[any assortment of chars or nums]/../'
      $HTML = preg_replace("#\/(\w*?)\/\.\.\/(.*?)>#ims", "/\$2>", $HTML);
      //Matches '/./'
      $HTML = preg_replace("#\/\.\/(.*?)>#ims", "/\$1>", $HTML);

      //Handles CSS2 imports (=== is needed because strpos() can return 0 for a match at the start)
      if (strpos($HTML, "import url(\"http") === false && (strpos($HTML, "import \"http") === false) && strpos($HTML, "import url(\"www") === false && (strpos($HTML, "import \"www") === false)) {
            $pattern = "#import .(.*?).;#ims";
            //strnpos() is not a PHP built-in either; presumably another helper from the same file
            $mainurl = substr($host, 0, strnpos($host, "/", 3));
            $replace = "import '".$mainurl."\$1';";
            $HTML = preg_replace($pattern, $replace, $HTML);
      }
      return $HTML;
}

takes the file contents in as $HTML and then changes all the links into fully qualified links such that their domain is always in the link.  Then this function -

function proxyURLs($HTML){
      $edited_tag = "PF"; //used to check if the link has already been modified by the proxy
      //BASE tag needs to be removed for sites like
      //OR make the proxy insert the FULL URL to itself
      $pattern = "#\<base(.*?)\>#ims";
      $replacement = "<!-- <base\$1> -->"; //comment it out for now//
      $HTML = preg_replace($pattern, $replacement, $HTML);
      //edit <link tags so that 'edited="$edit_tag" ' is just before 'href'
      $HTML = preg_replace("#\<link(.*?)(\shref=)#ims", "<link\$1 edited=\"".$edited_tag."\"\$2", $HTML);
      //matches everything with an </a> after it on the same line....fails to match when that is on another line.
      $pattern = "#(?<!edited=\"".$edited_tag."\"\s)(href='|href=\"|href=(?!'|\"))(?=(.+)\</a\>)(?!mailto:|http://ftp|ftp|javascript:|'|\")#ims";
      $HTML = preg_replace($pattern, "edited=\"".$edited_tag."\" \$1".$_SERVER['PHP_SELF'].'?url=', $HTML);
      return $HTML;
}

takes every link in the page (again, as $HTML) and prepends a link to the current proxy.  Finally, this section down the bottom -

      if ($proxify_media) {
            $pattern = '/src=\s*(["\']?)([^>\s"\']+)\\1[^>]*>/i';
            $replace = "src=\"{$_SERVER['PHP_SELF']}?url=$2\">";
            $HTML = preg_replace($pattern, $replace, $HTML);
      }

does the same thing for images and JavaScript files (as their tags contain src= parts).  All you need to do is fetch the page at the URL into a string, pass it through the top two functions, and then apply the last chunk of code to the result.
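For the narrower case in the original question, where only root-relative href, src and action values need the domain prepended, the two rewriting steps can be reduced to a minimal sketch like the one below. It is not PageForward's code: the function names absolutize_root_links and proxify_links and the /proxy.php path are hypothetical, and $domain is assumed to include the scheme (for example http://example.com) with no trailing slash. URL-encoding the target in the second function is an addition of this sketch.

```php
<?php
// Hypothetical helper: prefixes root-relative href/src/action values with $domain.
// Handles double-quoted, single-quoted and unquoted values, and leaves absolute
// and protocol-relative (//host/...) URLs untouched.
function absolutize_root_links(string $html, string $domain): string
{
    $pattern = '~(\s(?:href|src|action)\s*=\s*["\']?)/(?!/)~i';
    return preg_replace_callback($pattern, function (array $m) use ($domain) {
        return $m[1] . $domain . '/';
    }, $html);
}

// Hypothetical helper: routes quoted, absolute http(s) hrefs through a proxy script.
// Unquoted hrefs, mailto: and javascript: links are not matched.
function proxify_links(string $html, string $proxy): string
{
    $pattern = '~(\shref\s*=\s*)(["\'])(https?://[^"\']+)\2~i';
    return preg_replace_callback($pattern, function (array $m) use ($proxy) {
        // Decode &amp; and friends before encoding the URL as a query value
        $target = urlencode(html_entity_decode($m[3]));
        return $m[1] . $m[2] . $proxy . '?url=' . $target . $m[2];
    }, $html);
}
```

The order matters: call absolutize_root_links($html, $domain) first and proxify_links(..., '/proxy.php') second, otherwise the proxy path itself would be treated as a root-relative link. Regular expressions like these will also match attribute-like text outside of tags, so treat this as a starting point rather than a full HTML parser.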

askanthonys (Author) commented:
Thank you!