  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 273

String Manipulation in PHP - working with HTML and paths

I need to do quite a lot of string manipulation in PHP for an application that we are setting up.  I am familiar with PHP, but I've not used the string functions much, and after poring over some resources this afternoon, I decided that the experts would provide a quicker way for me to get up to speed.

The app is pulling entire web pages from other domains into a string and then delivering them inline as part of a locally hosted page.  Here is what I need to be able to do:

1) For images, stylesheet references, .js files, et al., I need to take all the non-explicit links (there is a proper term for this which escapes me...) and make them explicit, e.g.
               src="/fred/index.php"
becomes: src="http://www.theirserver.com/fred/index.php"

2) Manipulate the string so that links like <a href="/fred/index.php"> click me </a>
become: <a href="http://www.myserver.com/page.php?linkpath=http://www.theirserver.com/fred/index.php"> click me </a>
(The links get handled by the local app, and then the corresponding remote page is loaded inline as part of a local page.)

3) In both of these scenarios I also need to accommodate page-relative links, which will need to be converted to the full path, including the domain.

      Of course, the use of whitespace in the first two examples will vary as HTML allows (it can be src= " or src = " or src =" , et al.)

4) Finally (and I'll gladly put this up as a separate question with points if I'm asking too much here for my 500...), I need to trace the user's click path as they move from page to page.  Their requests will all be handled by the same PHP page on the local site (page.php in point 2 above), with the target page appearing after the ?.  I'm assuming that I cannot store that data in an array, as the array would be re-dimensioned each time that page.php is loaded, so my preference would be to write the data to an XML document so that I can manipulate it with other tools for reporting purposes.  I would like to be able to transform the XML document into HTML for display on the site as well.
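
Point 4 is not taken up in the replies, so here is a minimal sketch of one way it might be approached, using PHP's DOM extension to append each click to an XML log file. The function name logClick, the element names clickpath and click, and the session and time attributes are all assumptions made for illustration; nothing here comes from the application itself. The sketch does no file locking, so simultaneous requests could overwrite each other's entries.

```php
<?php
// Hypothetical helper: append one click to an XML log file.
// $logFile, the element names and the attributes are illustrative choices.
function logClick(string $logFile, string $sessionId, string $targetUrl): void
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Reuse the existing log if there is one, otherwise start a new root element.
    $xml = is_file($logFile) ? file_get_contents($logFile) : '';
    if ($xml !== '' && $xml !== false) {
        $doc->loadXML($xml);
    } else {
        $doc->appendChild($doc->createElement('clickpath'));
    }

    // One <click> element per request; the text node escapes "&" in URLs.
    $click = $doc->createElement('click');
    $click->setAttribute('session', $sessionId);
    $click->setAttribute('time', date('c'));
    $click->appendChild($doc->createTextNode($targetUrl));
    $doc->documentElement->appendChild($click);

    $doc->save($logFile);
}
```

Inside page.php this could be called as logClick('/path/to/clicks.xml', session_id(), $_GET['linkpath']), where the log path is a placeholder. Because the log is plain XML, it could later be turned into HTML with an XSLT stylesheet, for instance through PHP's XSLTProcessor class.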


Asked by: shotokai
1 Solution
 
caterham_www Commented:
Hi,

for 1) + 3) try this:

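<?php
// $i is never incremented, so ${"var".$i} is always $var0:
// it holds the string as it was before the current pass
$i=0;
//your string
$var='hello world <src ="/fred/index.php" img src= "hello/test.php"> ';

// (.[^:]*) only matches values without a colon, so http: URLs are skipped;
// repeat the replacement until a pass changes nothing
do {
      ${"var".$i}=$var;
      $var = eregi_replace ("src\ ?=\ ?\"(.[^:]*)\"","src=\"http://www.theirserver.com\\1\"",$var);
} while ($var != ${"var".$i});

?>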

and for 2) + 3)

<?php
// $i stays 0, so ${"var".$i} is $var0, the string before the current pass
$i=0;
//your string
$var='<a href ="/fred/index.php">hello world</a> <a href= "/hello/test.php">hello</a>';
// repeat the replacement until a pass changes nothing
do {
      ${"var".$i}=$var;
      $var = eregi_replace ("a href\ ?=\ ?\"(.[^:]*)\"","a href=\"http://www.myserver.com/page.php?linkpath=http://www.theirserver.com\\1\"",$var);
} while ($var != ${"var".$i});
?>
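
The ereg functions used above were removed in PHP 7.0, so on a current PHP the same idea has to be expressed with the PCRE functions. The following is only a sketch under stated assumptions: makeAbsolute and rewriteHtml are hypothetical names, the pattern handles double-quoted attribute values only and rewrites at most one src/href per tag, and page-relative paths are resolved in the simplest possible way (no handling of ../ segments or ports). It also URL-encodes the linkpath value, which the snippets above do not do.

```php
<?php
// Hypothetical helper: make a URL found in the page absolute.
// $pageUrl is the address the HTML was fetched from.
function makeAbsolute(string $url, string $pageUrl): string
{
    // Leave fragments and anything with a scheme (http:, mailto:, ...) alone.
    if ($url === '' || $url[0] === '#' || preg_match('~^[a-z][a-z0-9+.-]*:~i', $url)) {
        return $url;
    }
    $parts = parse_url($pageUrl);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (substr($url, 0, 2) === '//') {           // protocol-relative
        return $parts['scheme'] . ':' . $url;
    }
    if ($url[0] === '/') {                       // root-relative
        return $root . $url;
    }
    // Page-relative: resolve against the directory of the fetched page.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    return $root . substr($path, 0, strrpos($path, '/') + 1) . $url;
}

// Hypothetical helper covering points 1) to 3) of the question in one pass.
// $proxy is the local handler, e.g. http://www.myserver.com/page.php
function rewriteHtml(string $html, string $pageUrl, string $proxy)
{
    // tag name, anything up to the attribute, src|href, optional spaces around "="
    $pattern = '~<(\w+)(\s(?:[^>]*?\s)?)(src|href)\s*=\s*"([^"]*)"~i';
    return preg_replace_callback($pattern, function (array $m) use ($pageUrl, $proxy) {
        $abs = makeAbsolute($m[4], $pageUrl);
        $isLink = strtolower($m[1]) === 'a' && strtolower($m[3]) === 'href';
        // Only <a href> pointing at an http(s) page is routed through the proxy.
        if ($isLink && preg_match('~^https?://~i', $abs)) {
            $abs = $proxy . '?linkpath=' . urlencode(html_entity_decode($abs));
        }
        return '<' . $m[1] . $m[2] . $m[3] . '="' . $abs . '"';
    }, $html);
}
```

Because preg_replace_callback already replaces every match, no do/while loop is needed here, and PHP decodes the linkpath value automatically when page.php reads it from $_GET.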


Bob

shotokai (Author) Commented:
Thanks Bob:

this is kind of weird, but the eregi_replace isn't doing anything.  I used variables for the search and replace strings so that I could verify that nothing was happening.  Here is the code behind the page, including the part where the html content of the subject site gets pulled into a string:

<? if (! empty($testsite)) {
 
// first step is to determine the domain of the site being tested
//get rid of the http:// if it is used
$search = array("http://");
$replace = array("");
$domain = str_replace($search,$replace,$testsite);
//get rid of the trailing slash and anything there after
$trimright = strcspn($domain,"/");
$domain = substr($domain,"0",$trimright);
$domain = str_replace("/*","",$domain);
//$domain = substr($testsite, 7);
echo ($domain)."<br>";
$testsite = ("http://" . $domain) ;
echo ($testsite);
$incpage = include($testsite);

// next we run through the string (the html content of the test site) and
//      1) amend the src references to include the full domain of the site
$i=0;
do {
  //  ${"incpage".$i}=$incpage;
      $search = "a";
      //$search = "src\ ?=\ ?\"(.[^:]*)\""
      $replace= "apples";
      //$replace = "src=\"http://www.theirserver.com\\1\""
     $incpage = eregi_replace ($search,$replace,$incpage);
} while ($incpage != ${"incpage".$i});


//      2) next we need to change any link references to include the path to the usability application
$i=0;
do {
     ${"incpage".$i}=$incpage;
     $incpage = eregi_replace ("a href\ ?=\ ?\"(.[^:]*)\"","a href=\"http://www.myserver.com/page.php?linkpath=http://www.theirserver.com\\1\"",$incpage);
} while ($incpage != ${"incpage".$i});
echo $incpage;
}; //end of the first IF

?>
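
The domain-extraction step at the top of the code above can also be written with PHP's built-in parse_url(), which avoids the manual str_replace/strcspn/substr sequence. This is a sketch only; siteRoot is a hypothetical name, and unlike the code above it keeps an https scheme rather than forcing http.

```php
<?php
// Hypothetical helper mirroring the $domain / $testsite step above:
// returns scheme plus host (e.g. http://www.theirserver.com), or null.
function siteRoot(string $testsite): ?string
{
    // Accept input typed without a scheme, e.g. "www.theirserver.com/fred".
    if (!preg_match('~^https?://~i', $testsite)) {
        $testsite = 'http://' . $testsite;
    }
    $parts = parse_url($testsite);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }
    return $parts['scheme'] . '://' . $parts['host'];
}
```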

caterham_www Commented:
What does the string $testsite look like? Can you post an example for testing?
 
shotokai (Author) Commented:
In trying to post the content of the variable $incpage, I can see what is wrong - I think.  The source content of the page is not loading into the variable as I had thought.  Instead when I use include() or require(), the page source is output inline there and then.  The value of $incpage is 1.

Is there a function that will allow me to load the source of the remote page into a string?

caterham_www Commented:
Did you mean this line?
$incpage = include($testsite);

I think that to get the source of a remote page, you have to use fopen or fsockopen;

see http://www.php.net/fsockopen and http://www.php.net/fopen
But with the line
$domain = substr($domain,"0",$trimright);
you are destroying the complete string.
For example, 'src="/john/sa.html" test src = "/edde/ggt.css' becomes
src=" (everything from the first slash onwards is cut off)

or
'src="http://john/sa.html" test src = "http://edde/ggt.css' becomes
src="john (/sa.html... is missing)

shotokai (Author) Commented:
Ok, things are going much better now.

I needed to use the file_get_contents() function to load the source into the string.  PHP version was 4.1.1, and the function isn't available below 4.3, so I've been cursing and upgrading for a couple of days.  That is behind us now.

In the code above, $domain is a short string that holds the name of the site in the form xxx.server.xxx.  It isn't what I'm manipulating further down the page; that is $incpage, which is now populated through file_get_contents($testsite).
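
A minimal sketch of that loading step follows; fetchPage is a hypothetical wrapper name. file_get_contents() returns the source as a string, whereas include() executes and outputs the file and returns 1 on success, which is why $incpage held 1 earlier. Reading http:// URLs this way requires allow_url_fopen to be enabled in php.ini.

```php
<?php
// Hypothetical helper: return the page source as a string, or null on failure.
function fetchPage(string $url): ?string
{
    // @ suppresses the warning; failure is reported through the return value.
    $source = @file_get_contents($url);
    return $source === false ? null : $source;
}
```

A call such as $incpage = fetchPage($testsite); then leaves the remote HTML in $incpage without echoing it or running it as PHP.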

The two bits of code that caterham_www provided above are now working.  One issue, in the second bit of code:

$i=0;
do {
     ${"incpage".$i}=$incpage;
     $incpage = eregi_replace ("a href\ ?=\ ?\"(.[^:]*)\"","a href=\"http://www.myserver.com/page.php?linkpath=http://www.theirserver.com\\1\"",$incpage);
} while ($incpage != ${"incpage".$i});

shotokai (Author) Commented:
(My previous comment got truncated somehow.)

So if the link has anything between the 'a' and the 'href', the match isn't picked up.  I don't understand the matching syntax.  So if you can clear this up for me...

caterham_www Commented:
Hi,

try this one:

a ([^>]* )?href\ ?=\ ?\"(.[^:]*)\"
-->
$incpage = eregi_replace ("a ([^>]* )?href\ ?=\ ?\"(.[^:]*)\"","a href=\"http://www.myserver.com/page.php?linkpath=http://www.theirserver.com\\2\"",$incpage);

will match
a href=""
a target="" href="" etc.
([^>]* )? optionally matches the attributes between a and href without running past the end of the tag, so the path is now the second group, \\2.
If you would like to include e.g. target="" (the things between a and href) in your replaced string:

$incpage = eregi_replace ("a ([^>]* )?href\ ?=\ ?\"(.[^:]*)\"","a \\1href=\"http://www.myserver.com/page.php?linkpath=http://www.theirserver.com\\2\"",$incpage);

a target="as" href="/hhh" will become a target="as" href="http://www.myserver..."
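
Regular expressions tend to miss some real-world markup (single-quoted values, attributes in unusual order, line breaks inside tags). As an alternative sketch, not something proposed in this thread, the links can be rewritten through PHP's DOM extension, which parses the tags and therefore keeps other attributes such as target untouched. proxyLinks is a hypothetical name, only root-relative hrefs (starting with a single /) are handled, and loadHTML() wraps fragments in html/body elements, so the output is a whole document.

```php
<?php
// Hypothetical helper: route root-relative <a href> links through the proxy.
// $remoteRoot is e.g. http://www.theirserver.com,
// $proxy is e.g. http://www.myserver.com/page.php
function proxyLinks(string $html, string $remoteRoot, string $proxy): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Root-relative only: starts with "/" but not with "//".
        if ($href !== '' && $href[0] === '/' && substr($href, 0, 2) !== '//') {
            $a->setAttribute('href', $proxy . '?linkpath=' . urlencode($remoteRoot . $href));
        }
    }
    return $doc->saveHTML();
}
```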

shotokai (Author) Commented:
That worked well thanks.

There are still some instances that it is missing, but it isn't fair of me to ask you to deal with these.  I'm going to go ahead and accept your answer(s), and thanks very much for your time!  Can you point me towards an online resource that explains the syntax for the string matching and replacing as you've used it?

thanks again

shotokai (Author) Commented:
Hey - I'll save you the pain.  I found some really good reference material on phpfreaks:

http://www.phpfreaks.com/tutorials/63/3.php (specific to eregi_replace and the ERE POSIX syntax)

Thanks again

caterham_www Commented:
Hi,

Thanks. Another interesting site about regular expressions is http://www.regular-expressions.info
It isn't about eregi_replace itself but about regular expressions in general, which is of course the pattern part of eregi_replace.