stakor
asked on
Remove links and text from list.
I am looking to remove everything (including the 'marker' text itself) between:
<a href="
and the first instance of:
</li>
So the following:
<li> blah blah <a href="blah.com"> blah </a> </li>
<li> blah blah <a href="blah2.com"> blah blah blah </a> </li>
Looks like:
<li> blah blah
<li> blah blah
Actually this is probably better:
$data =~ s/(<li>.*?)<a href=".*?<\/li>/$1/sg;
ASKER
Is there a way to tweak this, so that the file can be piped into it?
If the intended purpose of the (<li>.*?) was to prevent removal of things like
<a href="blah.com"> blah </a>
<li> blah blah </li>
you may want to note that it does not prevent removal of something like
<li> blah blah </li>
<a href="blah2.com"> blah blah blah </a>
<li> blah blah blah blah blah </li>
ASKER
The items to be removed would exist inside of the <li> </li> tags. More cleanup will probably be required, but at this stage that is fine. I appreciate the feedback.
Good point ozo. This might be better:
perl -0777 -i.bak -pe 's/(<li>(?:(?!<\/li>).)*?)<a href=".*?<\/li>/$1/sg' file.txt
But the substitution in http:#a39233593 is capable of removing items not inside of <li> </li> tags.
If your input file has no items outside of <li> </li> tags then it may not matter for your application.
But if your input file has no items outside of <li> </li> tags then the (<li>.*?) would seem to serve no purpose.
ASKER
The application that I am going to use this for only has data inside of <li> </li> tags. There is no other data. As long as the 'delete' only goes to the next </li> tag, it should be good. I just don't want to have it get greedy and delete an entire <li> ... </li> set.
s/(<li>(?:(?!<\/li>).)*?)<a href=".*?<\/li>/$1/sg
takes care of the case of
<li> blah blah </li>
<li> blah blah <a href="blah2.com"> blah blah blah </a> </li>
but not the case of
<li> blah blah
<li> blah blah <a href="blah.com"> blah </a>
<li> blah blah <a href="blah2.com"> blah blah blah </a>
</li>
ASKER
There will not be nested <li></li> sets. That I am comfortable with. There is a chance that there might be the occasional set that does not have a link. But I will have to see if that happens.
You'll need the latest version I posted to successfully handle data that contains a <li> tag without a link in it.