PHP, remove filename from URL String

Hi, I've got a simple-sounding problem here: I want to remove the filename from a URL string.

Here's where it gets tricky: there are many URLs, all entered by different people. Looping over them is fine, but because they were entered by different people, I don't know how they have formatted their URLs.
The only way I can think of to get rid of the filename is to explode the URL on each '/', then remove the last array element if it contains a dot (full stop). However, I have seen many folders with full stops in their names, and if no file in such a folder was specified, the folder would be mistaken for a file, which would break the system later on.
All I can think of from here is to create a large array of file extensions to check against the last few letters of the string to find out whether it's a file or not, but I wondered if there was a better way.

Any suggestions?
<?php

$urls = array(
	'http://www.website.com', // domain only, no trailing slash, index file requested
	'http://www.website.com/', // domain only, inc. trailing slash, index file requested
	'http://www.website.com/file.php', // domain + file 
	'http://www.website.com/folder', // domain + folder, no trailing slash 
	'http://www.website.com/folder/', // domain + folder, inc. trailing slash 
	'http://www.website.com/folder/file.html', // domain + folder + file 
	'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
	'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the sub folder could contain dots/full stops, in the same way that the domain could contain dots, or any level of folder)
	'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
	'http://www.website.com/folder/folder.with.dots/file.html' // the folder with the dots in it, with a file inside it referenced
);

echo "<pre>\n";

foreach($urls as $url) {
	
	// parse url
	$url_info = parse_url($url);
	$domain = $url_info['scheme'] .'://'. $url_info['host'];
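	// 'path' is missing for a bare domain, so fall back to an empty string
	$url_minus_filename = $domain . (isset($url_info['path']) ? $url_info['path'] : '');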
	
	// echo out the results 
	echo "Test URL: \t\t{$url}\n",
		 "Domain: \t\t{$domain}\n",
		 "Removed Filename: \t{$url_minus_filename}\n",
		 "\n";
}

?>
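
For reference, the explode-plus-extension-list idea described in the question could be sketched roughly as below. This is only a minimal sketch: the function name `strip_filename_by_explode` and the extension list passed to it are hypothetical, and it assumes absolute URLs with a scheme and host. It also drops any port, query string, or fragment.

```php
<?php
// Hypothetical helper (not from the original post): removes the last path
// segment only when it ends in one of the given extensions.
function strip_filename_by_explode($url, array $extensions)
{
    $parts = parse_url($url);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    // 'path' is absent for a bare domain such as http://www.website.com
    $path  = isset($parts['path']) ? $parts['path'] : '';

    $segments = explode('/', $path);
    $last     = end($segments);
    $dot      = strrpos($last, '.');

    if ($dot !== false) {
        $ext = strtolower(substr($last, $dot + 1));
        if (in_array($ext, $extensions, true)) {
            array_pop($segments);   // drop the filename
            $segments[] = '';       // keep the trailing slash
        }
    }

    return $base . implode('/', $segments);
}
```

A folder such as `folder.with.dots` is left alone because `dots` is not in the list, which is exactly why the list of extensions has to be maintained by hand.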

stilliardAsked:
 
Beverley PortlockCommented:
What about using a regex and restricting it to certain file extensions?
<?php

$urls = array(
     'http://www.website.com', // domain only, no trailing slash, index file requested
     'http://www.website.com/', // domain only, inc. trailing slash, index file requested
     'http://www.website.com/file.php', // domain + file 
     'http://www.website.com/folder', // domain + folder, no trailing slash 
     'http://www.website.com/folder/', // domain + folder, inc. trailing slash 
     'http://www.website.com/folder/file.html', // domain + folder + file 
     'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
     'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the sub folder could contain dots/full stops, in the same way that the domain could contain dots, or any level of folder)
     'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
     'http://www.website.com/folder/folder.with.dots/file.html' // the folder with the dots in it, with a file inside it referenced
);

echo "<pre>\n";

$pattern = '~^.+/([^/]+)\.(html|htm|php|js|gif|jpg|png)$~';

foreach($urls as $url) {
     
     if ( preg_match( $pattern, $url, $matches ) > 0 ) {
          
          echo "$url - filename part is {$matches[1]}.{$matches[2]}<br/>";
     }
     


}

?>
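
The snippet above only reports the filename part. To actually strip it, something like the following might work. It is a sketch under a few assumptions: the helper name `strip_filename_by_regex` is made up for illustration, the URLs are absolute, and the pattern is applied to the path returned by `parse_url()` rather than to the whole URL. The last point matters once the extension list grows, because a bare domain ending in something like `.pl` or `.net` would otherwise look like a filename.

```php
<?php
// Hypothetical helper: strips a trailing "name.ext" from the URL path when
// the extension is in the allowlist. Port, query string, and fragment are
// not preserved in this sketch.
function strip_filename_by_regex($url)
{
    $parts = parse_url($url);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '';

    // Same extension list as the pattern above; the "i" flag also accepts
    // upper-case extensions such as .HTML
    $pattern = '~/[^/]+\.(?:html|htm|php|js|gif|jpg|png)$~i';

    return $base . preg_replace($pattern, '/', $path);
}
```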

 
stilliardAuthor Commented:
@bportlock, cheers. It's similar to the explode idea I had, in that it requires an array or pattern of file extensions, but I like that the regex keeps it much shorter!
Now I just need to go through and add all the file extensions the URLs could have for this application.
 
Beverley PortlockCommented:
Your only other option is to convert the URL paths into filesystem paths and then use is_file() but that seems a bit messy.

http://www.php.net/realpath
http://www.php.net/is_file


 
stilliardAuthor Commented:
The problem with that is that these files are not local; they're remote, so they could not be converted into filesystem paths.
Luckily, on reflection, the only files that should be requested in this app are HTML or other web pages, or files that can reference other files: CSS may reference background images, for example, and HTML may contain links to CSS, JavaScript, images, etc.
So I think this would cover all of these files. Can you see any I've missed?
$pattern = '~^.+/([^/]+)\.(html|htm|xml|php|php5|asp|net|pl|py|rb|js|css)$~';
Again, these would be HTML or other files which could reference other images, CSS, etc.
 
NeoAshuraCommented:
This is really simple: just do the following using frames. I got this idea from phpPGadmin, with a small correction.



Good Luck Any questions just ask

Mark
index.html
------------
///////////
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>YOUR TITLE HERE</title>
</head>
<frameset rows="*,0" frameborder="no" border="0" framespacing="0">
<frame src="index.php" name="mainFrame" id="mainFrame" />
<frame src="dummy.html"></frameset>
<noframes><body>
</body>
</noframes></html>

///////////////////////
Then create a dummy HTML file (EMPTY).
P.S. Make sure your web server serves index.html before index.php;
in Apache, this can be configured in httpd.conf
------------------------
dd this script to the real pages (ex: index.php),
better put it in one .js file that can be called from all pages. by .js i mean javascript.
--------------------------------------
////////////////////////
function cekParent(){
     if(top.mainFrame==null){
         window.open("/","_top");
     }
 }
//////////////////////////
The function of that script is to load the main frame if the user tries to call index.php directly.
The address bar will then only show www.yourdomain.com

 
NeoAshuraCommented:
On line 22 it was meant to say...

*add this script to the real pages (ex: index.php),


cheers

mark
 
racmail2001Commented:
Please try this script as well.

It's not the prettiest method, but it works.

All you have to do is define the file extensions.

Hope this helps.
<?php

$urls = array(
     'http://www.website.com', // domain only, no trailing slash, index file requested
     'http://www.website.com/', // domain only, inc. trailing slash, index file requested
     'http://www.website.com/file.php', // domain + file 
     'http://www.website.com/folder', // domain + folder, no trailing slash 
     'http://www.website.com/folder/', // domain + folder, inc. trailing slash 
     'http://www.website.com/folder/file.html', // domain + folder + file 
     'http://www.website.com/folder/another_file.gif', // domain + folder + file (almost any file extension could be referenced)
     'http://www.website.com/folder/folder.with.dots', // domain + folder + another folder (the sub folder could contain dots/full stops, in the same way that the domain could contain dots, or any level of folder)
     'http://www.website.com/folder/folder.with.dots/', // again, but with a trailing slash
     'http://www.website.com/folder/folder.with.dots/file.html' // the folder with the dots in it, with a file inside it referenced
);

echo "<pre>\n";
  print_r($urls);
  echo"<hr>";

$extensions=array("pdf","jpg","gif","html","php");


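$finalurl = array();

foreach ($urls as $url) {

  $parse = parse_url($url);
  // 'path' is absent for a bare domain, so default to an empty string
  $path  = isset($parse['path']) ? $parse['path'] : '';
  // a path ending in "/" names a folder, so there is no filename to check
  $base  = (substr($path, -1) == '/') ? '' : basename($path);

  // treat the last segment as a file only if it ends in a known ".ext"
  foreach ($extensions as $value) {
    if (substr($base, -(strlen($value) + 1)) == '.' . $value) {
      $path = substr($path, 0, -strlen($base)); // cut the filename, keep the slash
      break;
    }
  }

  $finalurl[] = $parse['scheme'] . "://" . $parse['host'] . $path;
}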

  print_r($finalurl);
  echo"<hr>";  

?>
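
A shorter take on the same extension-list idea might use `pathinfo()` with the `PATHINFO_EXTENSION` flag and compare the lower-cased extension against the list. This is only a sketch: the function name `strip_filename_by_pathinfo` is hypothetical, and it assumes absolute URLs and ignores any port, query string, or fragment.

```php
<?php
// Hypothetical helper: uses pathinfo() to read the extension of the last
// path segment and removes that segment when the extension is allowlisted.
function strip_filename_by_pathinfo($url, array $extensions)
{
    $parts = parse_url($url);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '';

    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // A path ending in "/" is treated as a folder, even if its name has a dot
    $isFolder = (substr($path, -1) === '/');

    if (!$isFolder && $ext !== '' && in_array($ext, $extensions, true)) {
        // dirname() may return "\" on Windows, so normalise to "/"
        $dir  = str_replace('\\', '/', dirname($path));
        $path = rtrim($dir, '/') . '/';
    }

    return $base . $path;
}
```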

 
stilliardAuthor Commented:
@NeoAshura, sorry, but you seem to have completely missed the point with this one; none of the code you gave has anything to do with this. If I'm correct, you're trying to hide the fact that the filename changes while moving around a site. Either way, it has nothing to do with what I'm working on: I'm attempting to remove the filename from a given string in PHP. That sounds very simple, but listing all possible web document file extensions is proving a long process. Nevertheless, thanks for the effort, however unrelated.



@bportlock,
here is my updated pattern for web documents (that may contain images, CSS, JavaScript, etc.):
$pattern = '~^.+/([^/]+)\.(html|htm|xhtml|xht|xml|mht|mhtml|asp|aspx|adp|bml|cfm|cgi|ihtml|jsp|las|lasso|lassoapp|pl|php|php3|php4|php5|phps|phtml|shtml|stm|atom|eml|metalink|met|rss|css|xslt|xsl|tpl)$~';

I think I have selected them all now. Do you know anywhere I can get a full list? I've taken these from Wikipedia (http://en.wikipedia.org/wiki/List_of_file_formats#Webpage), but I want to make sure this will not break on some other file type. I only need the file types that can contain other requestable elements.

 
stilliardAuthor Commented:
@racmail2001, thanks, but unfortunately this seems much longer than NeoAshura's method of a simple regex, although it is closer to the explode method I talked about.
It still requires a safe list of file extensions, so I'm going to stick with NeoAshura's reply, but thank you anyway.
 
stilliardAuthor Commented:
sorry, i meant to say "bportlock's" in my last post, not "NeoAshura's"

:)
 
NeoAshuraCommented:
Apologies, I thought you were trying to hide all file extensions from being shown in the URL.
 
NeoAshuraCommented:
All that mine does is, for example, if the URL was www.yourdomain.com/myfolder/images/img1.jpeg

it would only show www.yourdomain.com

and nothing else. I must have read the question wrong.
 
stilliardAuthor Commented:
@NeoAshura, no problem. What I'm doing is removing the filename from a URL string; it has nothing to do with the browser or HTML.
 
NeoAshuraCommented:
Ah right, no worries then bud, at least you know for future reference :)
 
stilliardAuthor Commented:

@NeoAshura, yes, if I were working on a project that needed that, it would be helpful.
@racmail2001, thanks for playing, better luck next time though.

@bportlock, I'm going to keep the file extensions I listed in the last pattern I posted. It's very long, so I think I have them all, but in case you notice a file type I've forgotten, please let me know.
Other than that, thank you very much for your help with this.
 
Beverley PortlockCommented:
If you think that you will need more extensions, then define them as a constant in a constants.php file and alter the pattern in the regex like so:

define ('FILE_EXTENSION_REGEX', 'html|htm|xml|php|php5|asp|net|pl|py|rb|js|css' );
....
...code
.....

$pattern = '~^.+/([^/]+)\.(' .  FILE_EXTENSION_REGEX . ')$~';

That way updating the list is a trivial matter

Cheers!
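
As a variation on the constant approach above, the list could also be kept as an array and turned into the pattern with `implode()` and `preg_quote()`, so that an extension containing a regex metacharacter cannot break the pattern. This is a sketch only: the function name `build_extension_pattern` and the short extension list are hypothetical, and the `i` flag (case-insensitive matching) is an addition that the patterns in this thread do not use.

```php
<?php
// Hypothetical helper: builds the filename-matching pattern from an array
// of extensions, escaping each one for use inside a "~"-delimited regex.
function build_extension_pattern(array $extensions)
{
    $quoted = array();
    foreach ($extensions as $ext) {
        $quoted[] = preg_quote($ext, '~');
    }

    // Same shape as the pattern above: group 1 is the name, group 2 the extension
    return '~^.+/([^/]+)\.(' . implode('|', $quoted) . ')$~i';
}

// Illustrative list only; the real list would live in constants.php or similar
$pattern = build_extension_pattern(array('html', 'htm', 'php', 'css', 'js'));
```

Adding an extension then means adding one array element, and the pattern is rebuilt wherever it is needed.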
Question has a verified solution.
