Solved

PHP, remove filename from URL String

Posted on 2010-09-07
Last Modified: 2013-12-13
Hi, I've got a simple-sounding problem here: I want to remove the filename from a URL string.

Here's where it gets tricky. There are many URLs, all entered by different people. That part is fine, I just loop over them, but because they were entered by different people I don't know how they have formatted their URLs.
The only way I can think of to get rid of the filename would be to explode the URL on each '/', then remove the last array element if it contains a dot/full stop. However, I have seen many folders with full stops in their names, and if no file in such a folder was specified, the folder would be mistaken for a file, which would break the system later on.
All I can think of from here is to create a large array of file extensions to check against the last few letters of the string to find out whether it's a file or not, but I wondered if there was a better way than this.

Any suggestions?
<?php

$urls = array(
	'http://www.website.com', // domain only, no trailing slash, index file requested
	'http://www.website.com/', // domain only, inc. trailing slash, index file requested
	'http://www.website.com/file.php', // domain + file 
	'http://www.website.com/folder', // domain + folder, no trailing slash 
	'http://www.website.com/folder/', // domain + folder, inc. trailing slash 
	'http://www.website.com/folder/file.html', // domain + folder + file 
	'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
	'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the subfolder could contain dots/full stops, in the same way that the domain could contain dots, at any folder level)
	'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
	'http://www.website.com/folder/folder.with.dots/file.html' // and a file referenced inside the folder with the dots in its name
);

echo "<pre>\n";

foreach($urls as $url) {
	
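	// parse url
	$url_info = parse_url($url);
	$domain = $url_info['scheme'] .'://'. $url_info['host'];
	// 'path' is absent for a bare domain, so default to an empty string
	$path = isset($url_info['path']) ? $url_info['path'] : '';
	$url_minus_filename = $domain . $path;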
	
	// echo out the results 
	echo "Test URL: \t\t{$url}\n",
		 "Domain: \t\t{$domain}\n",
		 "Removed Filename: \t{$url_minus_filename}\n",
		 "\n";
}

?>
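
For reference, the explode idea described above can be sketched roughly as follows. This is only an illustration of why the dot test on its own is unreliable; the function name `strip_last_dotted_segment` is made up for this sketch and is not part of the original post.

```php
<?php
// Hypothetical helper: drop the last path segment if it contains a dot.
function strip_last_dotted_segment($url) {
    $parts = parse_url($url);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    // 'path' is missing for a bare domain, so fall back to an empty string.
    $path  = isset($parts['path']) ? $parts['path'] : '';

    $segments = explode('/', $path);
    $last     = end($segments);

    // Naive test: any dot in the last segment is treated as a file extension.
    if ($last !== false && strpos($last, '.') !== false) {
        array_pop($segments);
    }

    return $base . implode('/', $segments);
}

// A real file is stripped as intended...
echo strip_last_dotted_segment('http://www.website.com/folder/file.html'), "\n";
// ...but a dotted folder without a trailing slash is stripped too.
echo strip_last_dotted_segment('http://www.website.com/folder/folder.with.dots'), "\n";
```

Both calls return `http://www.website.com/folder`, so the dotted folder in the second URL is lost, which is exactly the failure described in the question.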

Question by:stilliard
16 Comments
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 500 total points
ID: 33616754
What about using a regex and restricting it to certain file extensions?
<?php

$urls = array(
     'http://www.website.com', // domain only, no trailing slash, index file requested
     'http://www.website.com/', // domain only, inc. trailing slash, index file requested
     'http://www.website.com/file.php', // domain + file 
     'http://www.website.com/folder', // domain + folder, no trailing slash 
     'http://www.website.com/folder/', // domain + folder, inc. trailing slash 
     'http://www.website.com/folder/file.html', // domain + folder + file 
     'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
     'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the subfolder could contain dots/full stops, in the same way that the domain could contain dots, at any folder level)
     'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
     'http://www.website.com/folder/folder.with.dots/file.html' // and a file referenced inside the folder with the dots in its name
);

echo "<pre>\n";

$pattern = '~^.+/([^/]+)\.(html|htm|php|js|gif|jpg|png)$~';

foreach($urls as $url) {

     if ( preg_match( $pattern, $url, $matches ) > 0 ) {
          echo "$url - filename part is {$matches[1]}.{$matches[2]}<br/>";
     }

}

?>
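
The script above only reports the filename part. To actually strip it from the URL, the same whitelist idea could be used with preg_replace(). A minimal sketch, assuming the same list of extensions; the function name `remove_filename` and the case-insensitive `i` modifier are additions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper built on the extension whitelist idea.
function remove_filename($url) {
    // Match a final "/name.ext" segment only when ext is in the whitelist.
    $pattern = '~/[^/]+\.(html|htm|php|js|gif|jpg|png)$~i';

    // Replace the matched segment with a single slash, keeping the folder.
    return preg_replace($pattern, '/', $url);
}

echo remove_filename('http://www.website.com/folder/file.html'), "\n";
echo remove_filename('http://www.website.com/folder/folder.with.dots'), "\n";
```

The first call gives `http://www.website.com/folder/`, while the dotted folder in the second URL is left alone because `dots` is not a listed extension.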

 
LVL 6

Author Comment

by:stilliard
ID: 33616828
@bportlock, cheers. It's similar to the explode idea I had, in that it requires an array or pattern of file extensions, but I like that the regex keeps it much shorter!
Now I just need to go through and add all the file extensions the URLs could have for this application.
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 33616957
Your only other option is to convert the URL paths into filesystem paths and then use is_file() but that seems a bit messy.

http://www.php.net/realpath
http://www.php.net/is_file

 
LVL 6

Author Comment

by:stilliard
ID: 33617090
The problem with that is that these files are not local; they're remote, so the URLs could not be converted into filesystem paths.
Luckily, on reflection about the app, the only files that should be requested are HTML or other web pages, or files that can reference other files: CSS may reference background images and so on, and HTML may contain links to CSS, JavaScript, images, etc.
So I think this would cover all these files. Can you see any I've missed?
$pattern = '~^.+/([^/]+)\.(html|htm|xml|php|php5|asp|net|pl|py|rb|js|css)$~';
Again, these would be HTML or other files which could reference images, CSS, etc.
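
One thing to watch with a pattern anchored by `$`: it only matches when the extension is the very last thing in the string, so a URL carrying a query string (for example `page.php?id=3`) would slip through. A possible workaround, sketched here under the assumption that only scheme, host and path matter to the application, is to run the whitelist against the path component from parse_url(). The function name `strip_web_document` is hypothetical.

```php
<?php
// Hypothetical helper: apply the extension whitelist to the path only.
function strip_web_document($url) {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL; leave it untouched
    }

    // A bare domain has no 'path' key.
    $path = isset($parts['path']) ? $parts['path'] : '';

    // Same whitelist as the pattern above, anchored to the end of the path.
    $pattern = '~/[^/]+\.(html|htm|xml|php|php5|asp|net|pl|py|rb|js|css)$~i';
    $path = preg_replace($pattern, '/', $path);

    // Note: port, credentials, query string and fragment are dropped here.
    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

echo strip_web_document('http://www.website.com/folder/page.php?id=3'), "\n";
```

Here the query string no longer hides the `.php` extension, and the call returns `http://www.website.com/folder/`.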
 
LVL 6

Expert Comment

by:NeoAshura
ID: 33617143
This is really simple. Just do the following using frames. I got this idea from phpPgAdmin, with a small correction.

Good luck. Any questions, just ask.

Mark
index.html
------------
///////////
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>YOUR TITLE HERE</title>
</head>
<frameset rows="*,0" frameborder="no" border="0" framespacing="0">
<frame src="index.php" name="mainFrame" id="mainFrame" />
<frame src="dummy.html" /></frameset>
<noframes><body>
</body>
</noframes></html>

///////////////////////
Then create a dummy HTML file (empty).
P.S. Make sure your web server serves index.html before index.php.
In Apache, this can be configured in httpd.conf.
------------------------
dd this script to the real pages (ex: index.php),
Better still, put it in one .js (JavaScript) file that can be included from all pages.
--------------------------------------
////////////////////////
function cekParent(){
     if(top.mainFrame==null){
         window.open("/","_top");
     }
 }
//////////////////////////
The script loads the main frame if the user tries to open index.php directly.
The address bar will only show www.yourdomain.com

 
LVL 6

Expert Comment

by:NeoAshura
ID: 33617155
On line 22 it was meant to say:

*add this script to the real pages (ex: index.php),


cheers

mark
 
LVL 10

Expert Comment

by:racmail2001
ID: 33617179
Please try this script as well.

It's not the prettiest method, but it works.

All you have to do is define the file extensions.

Hope this helps.
<?php

$urls = array(
     'http://www.website.com', // domain only, no trailing slash, index file requested
     'http://www.website.com/', // domain only, inc. trailing slash, index file requested
     'http://www.website.com/file.php', // domain + file
     'http://www.website.com/folder', // domain + folder, no trailing slash
     'http://www.website.com/folder/', // domain + folder, inc. trailing slash
     'http://www.website.com/folder/file.html', // domain + folder + file
     'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
     'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the subfolder could contain dots/full stops, in the same way that the domain could contain dots, at any folder level)
     'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
     'http://www.website.com/folder/folder.with.dots/file.html' // and a file referenced inside the folder with the dots in its name
);

echo "<pre>\n";
print_r($urls);
echo "<hr>";

$extensions = array("pdf", "jpg", "gif", "html", "php");
$finalurl = array();

foreach ($urls as $url) {
     $parse = parse_url($url);
     // a bare domain has no 'path' key, so fall back to an empty path
     $path = pathinfo(isset($parse['path']) ? $parse['path'] : '');

     // blank the basename when it ends in a dot followed by a listed extension
     foreach ($extensions as $value) {
          if (substr($path['basename'], -(strlen($value) + 1)) == "." . $value) {
               $path['basename'] = "";
          }
     }

     // prefix what is left of the basename with a slash
     if (strlen($path['basename']) > 0) {
          $path['basename'] = "/" . $path['basename'];
     }

     // pathinfo() reports the root as "/" ("\" on Windows); treat it as empty to avoid "//"
     $dirname = isset($path['dirname']) ? str_replace("\\", "", $path['dirname']) : "";
     if ($dirname == "/") {
          $dirname = "";
     }

     $finalurl[] = $parse['scheme'] . "://" . $parse['host'] . $dirname . $path['basename'];
}

print_r($finalurl);
echo "<hr>";

?>

 
LVL 6

Author Comment

by:stilliard
ID: 33617237
@NeoAshura, sorry, but you seem to have completely missed the point with this one; none of the code you gave has anything to do with this. I think you're trying to hide the fact that the filename changes while moving around a site, if I'm correct. Either way, it has nothing to do with what I'm working on. I'm attempting to remove the filename from a given string in PHP, which sounds very simple, but listing all possible web document file extensions is proving a long process. Nevertheless, thanks for the effort, however unrelated.



@bportlock,
here is my updated pattern for web documents (that may contain images, CSS, JavaScript, etc.):
$pattern = '~^.+/([^/]+)\.(html|htm|xhtml|xht|xml|mht|mhtml|asp|aspx|adp|bml|cfm|cgi|ihtml|jsp|las|lasso|lassoapp|pl|php|php3|php4|php5|phps|phtml|shtml|stm|atom|eml|metalink|met|rss|css|xslt|xsl|tpl)$~';

I think I have them all now. Do you know anywhere I can get a full list? I've taken these from Wikipedia (http://en.wikipedia.org/wiki/List_of_file_formats#Webpage), but I want to make sure this will not break for some other file type. I only care about file types that can contain other requestable elements.


 
LVL 6

Author Comment

by:stilliard
ID: 33617301
@racmail2001, thanks, but unfortunately this seems much longer than NeoAshura's method of a simple regex, although it is closer to the explode method I talked about.
It still requires a safe list of file extensions, so I'm going to stick with NeoAshura's reply, but thank you anyway.
 
LVL 6

Author Comment

by:stilliard
ID: 33617313
Sorry, I meant to say "bportlock's" in my last post, not "NeoAshura's".

:)
 
LVL 6

Expert Comment

by:NeoAshura
ID: 33617329
Apologies, I thought you were trying to hide all file extensions from being shown in the URL.
 
LVL 6

Expert Comment

by:NeoAshura
ID: 33617340
All mine does is this: for example, if the URL was www.yourdomain.com/myfolder/images/img1.jpeg

it would only show www.yourdomain.com

and nothing else. I must have read the question wrong.
 
LVL 6

Author Comment

by:stilliard
ID: 33617392
@NeoAshura, no problem. What I'm doing is removing the filename from a URL string; it has nothing to do with the browser or HTML.
 
LVL 6

Expert Comment

by:NeoAshura
ID: 33617397
Ah right, no worries then. At least you know for future reference :)
 
LVL 6

Author Comment

by:stilliard
ID: 33617438

@NeoAshura, yeah, if I were working on a project that needed that, then yes, it would be helpful.
@racmail2001, thanks for playing, better luck next time though.

@bportlock, I'm going to keep the file extensions I listed in that last pattern I posted. It's very long, so I think I have them all, but if you notice a file type I've forgotten, please let me know.
Other than that, thank you very much for your help with this.
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 33618343
If you think you will need more extensions, define them as a constant in a constants.php file and then build the regex pattern from it, like so:

define ('FILE_EXTENSION_REGEX', 'html|htm|xml|php|php5|asp|net|pl|py|rb|js|css' );
....
...code
.....

$pattern = '~^.+/([^/]+)\.(' .  FILE_EXTENSION_REGEX . ')$~';

That way, updating the list is a trivial matter.
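
Put together, the constant and the pattern might be used along these lines. This is only a sketch: the define() is inlined instead of living in constants.php so that the snippet runs on its own, and the function name `filename_from_url` is hypothetical.

```php
<?php
// In the suggestion above this would live in constants.php.
define('FILE_EXTENSION_REGEX', 'html|htm|xml|php|php5|asp|net|pl|py|rb|js|css');

// Hypothetical helper: return the trailing filename, or null if there is none.
function filename_from_url($url) {
    $pattern = '~^.+/([^/]+)\.(' . FILE_EXTENSION_REGEX . ')$~';

    if (preg_match($pattern, $url, $matches) > 0) {
        return $matches[1] . '.' . $matches[2];
    }
    return null; // no whitelisted filename at the end of the URL
}

var_dump(filename_from_url('http://www.website.com/folder/file.html'));
var_dump(filename_from_url('http://www.website.com/folder/folder.with.dots/'));
```

The first call yields `file.html` and the second yields null. Adding a new extension then only means editing the one constant.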

Cheers!