ugeb

using sed to modify html

Hi,

I have a series of html files on which I need to do some group manipulation.  What would be the sed command to find all href's and replace the string with my own string?

For example,  replace all occurrences of:

<a href="this string could vary considerably">

with

<a href="http://www.mydomain.com" title="Call me!">

Note that I can't use a constant string for the first href because I don't know what will be between the quotes, so I'm guessing it needs some sort of pattern matching.

Thanks!




Unix OS, Regular Expressions


8/22/2022 - Mon
SOLUTION
omarfarid

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
ugeb

ASKER
Hi,

Thanks for the reply.  This gets close, but the * isn't matching the patterns.  No substitutions are being made.  BTW, I added the trailing apostrophe on the sed expression.

The following, for example, is not affected.
<a href="#">

It's not just this one, either; nothing is matched by the *.  Is there another variation? I'm on Cygwin, though I don't think that makes a difference.

thanks
nixfreak

sed -i.bak -r 's#(<a +href=)("[^"]*">)#\1"http://www.mydomain.com" title="Call me!">#'
nixfreak

sed -i.bak -r 's#href=("[^"]*")#href="http://www.mydomain.com" title="Call me!"#'  file.html
nixfreak

trivial beautification :-) :

sed -i.bak -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#'  file.html
ozo

are all the hrefs contained within a single line?
nixfreak

ah, thanks ozo

that would be:

sed -i.bak -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#g'  file.html
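
To see what the trailing g changes, here is a small sketch that runs the same expression on a made-up line holding two links (the sample line is an assumption, not taken from the asker's files). It pipes the line through sed instead of editing a file in place, so nothing on disk is touched:

```shell
# Hypothetical sample line with two links on the same line.
line='<a href="a.html">A</a> <a href="#">B</a>'

# Without g, only the first href on the line is replaced.
first_only=$(printf '%s\n' "$line" |
  sed -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#')

# With g, every href on the line is replaced.
all=$(printf '%s\n' "$line" |
  sed -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#g')

printf '%s\n%s\n' "$first_only" "$all"
```

The first output line still contains href="#" for the second link; the second output line has both links rewritten.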
ASKER CERTIFIED SOLUTION
nixfreak

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
ugeb

ASKER
Hi,

ozo:  No, they will be on multiple lines throughout the file.  However, there may be more than one per line.

nixfreak:  Some cool stuff. The syntax is different from what I've used, so a few things are throwing me off.  
Does the # replace / (the hash replace the slash)?  
Do you not need to escape the " because they're inside the ' (doubles inside singles)? There are a number of things I would normally escape, but the # gets around that ...?
Is the * going to match anything this time?  Why would it match this time but not last?

Thanks a bunch!

ozo

Maybe I should rephrase that:
Can any of the hrefs span more than one line?
nixfreak

I used the hash as the delimiter to avoid having to escape any /.

The regex is quite simple:

[Hh][Rr][Ee][Ff] *= *   matches href= in any letter case, with optional spaces around the =

"[^"]*"  matches the quoted link in the original HTML, which is then discarded

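Putting those two pieces together might look like the sketch below. The full command from the accepted answer is not shown in this thread, so this is only one plausible combination; the sample input line is made up for illustration:

```shell
# Hypothetical input: upper-case attribute name with spaces around "=".
line='<A HREF = "old.html">x</A>'

# ([Hh][Rr][Ee][Ff]) captures the attribute name in whatever case it was written,
#  *= * allows optional spaces around "=", and "[^"]*" matches the old quoted link.
out=$(printf '%s\n' "$line" |
  sed -r 's#([Hh][Rr][Ee][Ff]) *= *"[^"]*"#\1="http://www.mydomain.com" title="Call me!"#g')

printf '%s\n' "$out"
```

The attribute name keeps its original case because it is put back with \1; only the spacing and the quoted value change.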
ugeb

ASKER
I believe all the href's are contained on a single line -- certainly there aren't any newlines between < and >.  It's possible to have something like:

<a href="abcde.com">This is an  <br />
Amazing Title </a>

nixfreak, it looks like your solution is working on the files I've tried it on.  All the href's were lower case, so I was good there.  I was thinking sed could ignore case like grep could, but guess not ...
 
Thanks to all!
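
On the case question: GNU sed (which is what Cygwin normally provides) does accept an I flag on the s command for case-insensitive matching. It is a GNU extension, so other sed implementations may reject it. A minimal sketch, assuming GNU sed and a made-up input line:

```shell
# Hypothetical input with an upper-case attribute name.
line='<a HREF="old.html">x</a>'

# The trailing I makes the whole pattern case-insensitive (GNU sed only);
# g still replaces every match on the line.
out=$(printf '%s\n' "$line" |
  sed -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#gI')

printf '%s\n' "$out"
```

As with the bracket form, \1 puts the attribute name back exactly as it appeared in the input.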


ugeb

ASKER
Curiously, I also got omarfarid's solution to work.  I had to add a "." (dot) before the * and then it would match everything.  Two very different ways of achieving the same result.  Pretty interesting:)
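
The original command from that answer is not visible in this thread, so the patterns below are hypothetical stand-ins that only illustrate the difference the dot makes. In a regex, * repeats the character before it rather than acting like a shell wildcard, which is why a bare * after the quote matched nothing useful. The sketch also shows one caveat: .* is greedy, so on a line with more than one link it can swallow everything up to the last closing quote.

```shell
# Hypothetical sample line with two links.
line='<a href="a.html">A</a> <a href="b.html">B</a>'

# "*" means "zero or more quote characters", so this matches only href=""
# style links and leaves the sample line unchanged.
star=$(printf '%s\n' "$line" | sed 's#<a href="*">#<a href="NEW">#g')

# ".*" matches any run of characters, but greedily: it runs from the first
# opening quote to the last "> on the line, so the first link is lost.
dotstar=$(printf '%s\n' "$line" | sed 's#<a href=".*">#<a href="NEW">#g')

# [^"]* stops at the next quote, so each link is handled separately.
negated=$(printf '%s\n' "$line" | sed 's#<a href="[^"]*">#<a href="NEW">#g')

printf '%s\n%s\n%s\n' "$star" "$dotstar" "$negated"
```

If every line holds at most one link, the .* form and the [^"]* form give the same result, which would explain why both approaches worked on the files tested here.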