Removing hyperlinks from endnotes in a PDF-to-HTML conversion

We are using a program called JPDF to HTML from IDR solutions to convert over 500 PDFs into HTML. In the PDFs, the endnotes contain URLs. Some of those URLs have been converted into hyperlinks, but the converter links only the first line of each URL rather than the whole URL. Does anyone know the easiest way to strip the hyperlinks from the URLs?

Brandon Garnett Asked:

Julian Hansen Commented:
When you say strip, do you mean that the conversion produces this:
<a href=""></a> path/somefile.html

And you want to end up with

Some examples of what you are referring to would help me understand what you are asking.
Brandon Garnett (Author) Commented:
Yes exactly
Julian Hansen Commented:
Do you have a sample of a converted document (or part of one with a broken hyperlink)?

Specifically, I need to see how the lines are broken.

One solution is to use a regular expression:

/<a(.*?)>(.*?)<\/a>/g, \2
The exact expression and implementation will vary depending on the tool you use, but basically the expression matches every <a> tag and replaces everything from the opening tag to the closing tag with the text between them (the second capture group; the first holds the tag's attributes).
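
As a minimal sketch of that idea, here it is in Python (chosen only for illustration, since any language with regular expressions would do). The names ANCHOR_RE and strip_links and the sample endnote are hypothetical and are not taken from the converted documents. The pattern is slightly stricter than the one above: it uses \b so that tags such as <abbr> are left alone, and it lets the link text span line breaks.

```python
import re

# Non-greedy match of an <a ...> opening tag, its content, and the closing </a>.
# \b stops the pattern from matching other tags that start with "a" (e.g. <abbr>);
# DOTALL lets the link text span line breaks.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links(html: str) -> str:
    """Replace every <a ...>text</a> with just its text."""
    return ANCHOR_RE.sub(r"\1", html)


# Hypothetical endnote in which only the first part of the URL was linked.
sample = '1. See <a href="http://example.com/a">http://example.com/a</a>/b.html for details.'
print(strip_links(sample))
# prints: 1. See http://example.com/a/b.html for details.
```

A regular expression like this assumes the converter emits simple, non-nested <a> tags. If the real output is messier, an HTML parser would be the safer tool.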

What tools would you use to process the HTML files and do the replacement? Java?
Brandon Garnett (Author) Commented:
Our current solution is to open the converted web pages in Dreamweaver and use the find-and-replace function to go through them quickly, find the hyperlinks, and remove the <a> tags.
Julian Hansen Commented:
Understood, but I could recommend a solution in PHP only to find that you don't use PHP. Hence my question: for a scripted solution, which server-side scripting environment would you prefer to use?
Brandon Garnett (Author) Commented:
We can use whatever language. What do you recommend?
Julian Hansen Commented:
It does not really make a difference; anything that supports regular expressions will do.
Here is a PHP solution.
The script searches for all .html files in a folder, replaces each hyperlink with the URL from its href attribute, and then writes the result to a folder (output).
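<?php
// Process every .html file in the current folder.
$files = glob('*.html');
foreach ($files as $file) {
  $content = file_get_contents($file);
  // Replace each <a ... href="...">...</a> with the value of its href attribute.
  $fixed = preg_replace('/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/i', '\2', $content);
  // The output folder must already exist.
  file_put_contents("output/{$file}", $fixed);
}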
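
For comparison, here is a rough Python sketch of the same batch idea. It is not a port of the PHP above: it keeps the visible link text instead of substituting the href value, and it creates the output folder if it is missing. The function name strip_links_in_folder, the folder arguments, and the UTF-8 encoding are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches <a ...>text</a>; group 1 is the visible link text.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links_in_folder(src_dir: str, out_dir: str) -> int:
    """Write a link-free copy of every .html file in src_dir to out_dir.

    Returns the number of files processed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    count = 0
    for path in sorted(Path(src_dir).glob("*.html")):
        # Assumption: the converted files are UTF-8 encoded.
        content = path.read_text(encoding="utf-8")
        fixed = ANCHOR_RE.sub(r"\1", content)  # keep only the link text
        (out / path.name).write_text(fixed, encoding="utf-8")
        count += 1
    return count
```

Calling strip_links_in_folder(".", "output") would mirror the folder layout of the PHP script. The originals are left untouched, so the output can be compared against them before anything is replaced.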

To determine whether the script is correct, I would need to see a sample file.
Here is the test file I used:
<!doctype html>
	This is a test to see if ths <a href=""></a>path/somefile.html and some more
	text over here <a href=""></a> would go over here.

Brandon Garnett (Author) Commented:
Thanks for the help