Solved

Regex to find relative filenames in html code.

Posted on 2011-02-15
Last Modified: 2012-05-11
Hi,

Firstly, here's my main question. How do I define a regex to extract relative file names from html code?

If that interests you, here's my problem...

I'm developing a program that manages templating of html files, and I need a way to handle relative filenames within the html. For example, file C:\dir\index.html acts as a template file, containing html code to be used across other html files that depend upon it, eg. C:\dir\subdir\index.html. Now most of the html code can simply be copied across as is, however any relative filenames within C:\dir\index.html, will obviously no longer point to the right place if copied directly into C:\dir\subdir\index.html. So, I need to adjust any relative file names to suit the new file location.
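
To make the adjustment described above concrete, here is a minimal sketch in Python (the asker's program is in C#, so this only illustrates the idea). The `rebase` helper and the forward-slash directory names are hypothetical stand-ins for C:\dir and C:\dir\subdir; the approach is simply to resolve the relative path against the template's directory and then re-express it relative to the dependent file's directory.

```python
import posixpath

def rebase(rel_path, template_dir, target_dir):
    """Re-express rel_path (relative to template_dir) relative to target_dir.

    Hypothetical helper: both directories are given as absolute,
    forward-slash paths, as they would appear in a URL.
    """
    absolute = posixpath.normpath(posixpath.join(template_dir, rel_path))
    return posixpath.relpath(absolute, start=target_dir)

# "/dir" stands in for C:\dir and "/dir/subdir" for C:\dir\subdir.
print(rebase("images/logo.png", "/dir", "/dir/subdir"))  # ../images/logo.png
print(rebase("../css/site.css", "/dir", "/dir/subdir"))  # ../../css/site.css
```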

The current approach I'm working on is to use a regex to identify any filenames in the original html, then in the program code (C#), check that the retrieved file name exists and is relative. This is because I don't trust my regex skills to get this right perfectly, so I'm designing the regex to capture anything that might be a filename, and then double check it in the code.

Currently my regex looks like this:

([^"';(\?%*|<>]+(?=\.([a-zA-Z]+/?)[^a-zA-Z(=.])\.\2)

which seems to capture any filename or url so far in testing.

A few criteria:
- needs to handle a filename anywhere, not just neatly embedded within single or double quotes
- needs to cope with filenames with embedded white space that isn't a part of the file name (eg. file name across multiple lines, which, I think, is valid in html)
- can assume that the file has an extension (which is what I've done above)
- needs to capture the whole filename and path as written in the html, but nothing else
- can be a little too inclusive, as long as anything that it detects that isn't a relative file path can be easily checked for in C# code (eg. using Path.IsPathRooted)
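
The "capture broadly, then double check in code" idea from the last criterion could look roughly like the sketch below. It is written in Python rather than C#, and `looks_relative` is a hypothetical helper, not an equivalent of Path.IsPathRooted: it rejects anything that starts with a slash or backslash, a drive letter, or a URL scheme. It does not check that the file exists, which the asker's program would still need to do.

```python
import re

# Rooted if it starts with "scheme:" (this also covers drive letters
# such as "C:") or with a forward slash or backslash.
_ROOTED = re.compile(r'^(?:[A-Za-z][A-Za-z0-9+.-]*:|[\\/])')

def looks_relative(candidate):
    """Return True if candidate does not look rooted or like a full URL."""
    return bool(candidate) and _ROOTED.match(candidate) is None

candidates = ["images/logo.png", "/css/site.css", "C:\\dir\\a.png",
              "http://example.com/a.js", "../js/app.js"]
print([c for c in candidates if looks_relative(c)])
# ['images/logo.png', '../js/app.js']
```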

If anyone has any better regexes, or better solutions to my problem, that would be helpful.

Thanks.
Question by:crysallus
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34903319
Can you provide some example data to work with please?
 
LVL 8

Author Comment

by:crysallus
ID: 34903394
Well, it needs to work with any html, not any specific html, so I've just been doing view page source to get test data. Including on this website.

An update on my attempt above. After further testing I've now got:

([-\\\w/.:]+(?=\.([a-zA-Z]+)[^a-zA-Z0-9(=.[])\.\2)
 
LVL 35

Accepted Solution

by:Terry Woods (earned 1000 total points)
ID: 34903494
This seems to do what you want:
(?<=[^-\\\w/.:])(?!(?:ht|f)tps?://)[-\\\w/.:]+\.[A-Z]+(?=[^A-Z0-9(=.\[])

It's surprising actually how well your technique works, but it's very reliant upon a full stop (and extension) being included as part of the filename/url. With clever (and fairly standard) webserver URL rewriting, you don't necessarily get that, so the technique won't always work for you. For a more general tool, I think you would need to be looking at the surrounding html code such as src or href attributes, but if the above works for you then I guess that's all you need.
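
As a rough illustration of the attribute-based alternative mentioned above, the sketch below uses Python's standard html.parser module to collect src and href values. The `LinkAttrCollector` class and its two-attribute list are assumptions made for the example; real HTML has other URL-carrying attributes, and this approach does not look inside script contents.

```python
from html.parser import HTMLParser

class LinkAttrCollector(HTMLParser):
    """Collects src and href attribute values (an assumed, incomplete list)."""
    URL_ATTRS = {"src", "href"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.found.append(value)

collector = LinkAttrCollector()
collector.feed('<a href="about">About</a> <img src="img/logo.png">')
print(collector.found)  # ['about', 'img/logo.png']
```

Note that the extension-less link "about" is picked up here, which the extension-based regex would miss.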
 
LVL 8

Author Comment

by:crysallus
ID: 34903585
I tried yours, but it didn't work for me.

I've been testing using http://gskinner.com/RegExr/, and it doesn't seem to match anything.

Small typo perhaps?

Yeah, the point about requiring the file extension is a good one, which I am aware of. I'm just not sure how else to detect a file name in a manner that distinguishes it from the rest of the html otherwise. As you said, that would probably require knowing all possible contexts (i.e. tag attributes) within which filenames could be found. Problem is, I'm not sure exactly what all those contexts would be (others may though), and that wouldn't handle filenames within javascript embedded in the html either.

Also, the sorts of filenames that I'm typically thinking of are image, or css, or js files, which would be written with their extension. Relative links to other urls however, may not, which may be an issue.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34903613
Works for me in that tool once the ignore case option is checked.
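
The effect of the ignore case option can be reproduced outside that tool. The sketch below runs the accepted regex through Python's re module, which is an assumption on my part (the thread tested in RegExr and targets C#), on a made-up HTML snippet. Because the pattern spells the extension as [A-Z]+, lowercase extensions only match when case is ignored.

```python
import re

# The accepted solution's pattern, unchanged.
PATTERN = r'(?<=[^-\\\w/.:])(?!(?:ht|f)tps?://)[-\\\w/.:]+\.[A-Z]+(?=[^A-Z0-9(=.\[])'

# Made-up sample markup for illustration.
html = ('<img src="images/logo.png"> '
        '<a href="http://example.com/page.html">x</a> '
        '<link href="../css/site.css">')

case_sensitive = re.findall(PATTERN, html)
ignore_case = re.findall(PATTERN, html, re.IGNORECASE)

print(case_sensitive)  # []
print(ignore_case)     # ['images/logo.png', '../css/site.css']
```

The absolute http:// URL is skipped by the negative lookahead, which is the intended behaviour for finding relative names only.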
 
LVL 8

Author Comment

by:crysallus
ID: 34903697
Ah, I see.

Would it be safe to assume that any link that doesn't include the extension and relies on URL rewriting would end with '/'? From what I've seen, they tend to, but I'm guessing that isn't necessarily the case.

I amended yours to:

(?<=[^-\\\w/.:])(?!(?:ht|f)tps?://)[-\\\w/.:]+(\.[A-Z]+|/)(?=[^A-Z0-9(=.\[])

which does pick up filenames that end with /, as well as ones with extensions, but it also matches with the // at the start of a comment in javascript. Any way around that?
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34903729
> Would it be safe to assume that any link that doesn't include the extension and relies on URL rewriting would end with '/'?

No, rewritten URLs can be anything, essentially.

When run in multiline (and ignore case) mode, this will exclude javascript comments where they occur at the start of a line (leading spaces are allowed). The filename is now in the first capture group rather than being the whole match, so the result array will have a slightly different structure and you'll need to process the results differently.
(?!^\s*//)^.*((?<=[^-\\\w/.:])(?!(?:ht|f)tps?://)[-\\\w/.:]+(\.[A-Z]+|/)(?=[^A-Z0-9(=.\[]))
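
Here is a small sketch of how the results of this pattern might be processed, again using Python's re module as a stand-in for the asker's C# code, with made-up input. The filename is read from group 1 of each match instead of from the whole match.

```python
import re

# The pattern above, compiled in multiline and ignore-case mode.
PATTERN = re.compile(
    r'(?!^\s*//)^.*((?<=[^-\\\w/.:])(?!(?:ht|f)tps?://)'
    r'[-\\\w/.:]+(\.[A-Z]+|/)(?=[^A-Z0-9(=.\[]))',
    re.MULTILINE | re.IGNORECASE,
)

# Made-up script text: the middle line is a comment and should be skipped.
script = ('var a = "img/a.png";\n'
          '  // see docs/readme.txt \n'
          'var b = "js/app.js";\n')

paths = [m.group(1) for m in PATTERN.finditer(script)]
print(paths)  # ['img/a.png', 'js/app.js']

# Caveat: the greedy ^.* prefix means only the last candidate on a line
# is captured.
one_line = '<img src="a.png"><img src="b.png">'
print([m.group(1) for m in PATTERN.finditer(one_line)])  # ['b.png']
```

Because of the one-match-per-line behaviour shown in the second call, markup with several filenames on one line would need a different approach, such as stripping comment lines first and then applying the simpler pattern.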
 
LVL 8

Author Closing Comment

by:crysallus
ID: 34922962
I've actually gone with an expanded regex building on your suggestions, but thanks for the input, it definitely helped.


Question has a verified solution.