Solved

What regex will remove duplicate rel="nofollow" tags?

Posted on 2016-08-09
Last Modified: 2016-08-10
I had this question after viewing Python error - Need Help.

I created this regex to remove the duplicate rel="nofollow" tags using grep in TextWrangler but I am not clear how to add this into the Python regex code.

rel="nofollow"(\s|\n|\n\r)rel="nofollow"

replace with
rel="nofollow"

Question by:sharingsunshine

Accepted Solution

by:
pepr earned 500 total points
ID: 41749890
With respect to your previous question, you can use the following code. However, you should consider it a quick hack. It will not work if the original page contains the rel="nofollow" attribute in another location within the same tag (that is, when the duplicates are not adjacent). A proper, robust solution would need an HTML parser:
import urllib2
import re

website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

with open('original_document.html', 'w') as f:
    f.write(html)

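# Append rel="nofollow" after every quoted URL that points to theherbsplace.com.
rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
result = rexURL.sub(r'\1 rel="nofollow"', html)

# Collapse runs of adjacent rel="nofollow" attributes into a single one.
rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
result = rexDoubledNofollow.sub(r'\1', result)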

with open('new_document.html', 'w') as f:
    f.write(result)


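The \s* matches zero or more whitespace characters, including tabs and newlines. Together with rel="nofollow" it is enclosed in parentheses, so it forms a capture group that the replacement string in the second sub call refers to as \1. The + after the group means one or more occurrences, so a whole run of adjacent duplicates is matched at once and replaced by a single copy (the last repetition that was captured).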
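To see the second substitution in isolation, here is a minimal sketch that applies the same pattern to a made-up string. The function name dedupe_nofollow and the sample markup are assumptions for illustration only, and the snippet is written for Python 3 (the re calls are the same in Python 2.7).

```python
import re

# Same pattern as in the answer: one or more copies of rel="nofollow",
# each optionally followed by whitespace (spaces, tabs, newlines).
rex_doubled_nofollow = re.compile(r'(rel="nofollow"\s*)+')


def dedupe_nofollow(html):
    """Collapse adjacent repeated rel="nofollow" attributes into one.

    Hypothetical helper for illustration; it only handles duplicates
    that directly follow each other.
    """
    # \1 is the last repetition of the group, including its trailing whitespace.
    return rex_doubled_nofollow.sub(r'\1', html)


# Made-up sample: three adjacent copies separated by a newline and a space.
sample = '<a href="http://example.com/" rel="nofollow"\nrel="nofollow" rel="nofollow">x</a>'
print(dedupe_nofollow(sample))
```

Because \1 keeps the whitespace that followed the last copy, an attribute that comes after the run stays separated by a space. Duplicates with another attribute between them are left untouched, which is the limitation mentioned above.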

Expert Comment

by:pepr
ID: 41749898
I have noticed a bug in the original page:
<a 1="" href="http://www.theherbsplace.com/" imageanchor=" rel="nofollow" style="...


Notice the stray 1="" attribute and the imageanchor=" value, which is missing its closing double quote.
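Non-adjacent duplicates, which the regex approach cannot handle, are easier to find with a parser. The following is only a sketch, assuming Python 3 and its standard html.parser module; the class name DuplicateAttrFinder and the sample markup are made up for illustration. It reports start tags in which an attribute name repeats and does not rewrite the page.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Collect start tags in which some attribute name occurs more than once.

    Hypothetical helper for illustration; it reports, it does not fix.
    """

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (tag, [repeated attribute names])

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept.
        counts = Counter(name for name, _ in attrs)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            self.findings.append((tag, repeated))


# Made-up sample: the two rel attributes are not adjacent.
finder = DuplicateAttrFinder()
finder.feed('<a rel="nofollow" href="http://example.com/" rel="nofollow">x</a>')
print(finder.findings)
```

The findings list can then be used to decide which tags need manual attention. How the parser copes with badly broken markup, such as the unterminated imageanchor value above, would have to be checked separately.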

Author Closing Comment

by:sharingsunshine
ID: 41751387
Thanks for the help. As for the other exceptions you pointed out, I will just have to fix them as I find them.