Solved

What regex will remove duplicate rel="nofollow" tags?

Posted on 2016-08-09
Last Modified: 2016-08-10
I had this question after viewing Python error - Need Help.

I created this regex to remove the duplicate rel="nofollow" tags using grep in TextWrangler, but I am not sure how to add it to the Python regex code.

rel="nofollow"(\s|\n|\n\r)rel="nofollow"

replace with
rel="nofollow"

Question by:sharingsunshine

Accepted Solution

by:
pepr earned 500 total points
ID: 41749890
With respect to your previous question, you can use the following code, but consider it a quick hack. It will not work if the original page contains the rel="nofollow" attribute in another location (that is, if the duplicates are not adjacent). A proper, robust solution would require an HTML parser.
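import urllib2   # Python 2 only; Python 3 moved this to urllib.request
import re

# Download the page.
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

# Keep a copy of the unmodified page for comparison.
with open('original_document.html', 'w') as f:
    f.write(html)

# Append rel="nofollow" after every quoted URL that points to the shop.
rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
result = rexURL.sub(r'\1 rel="nofollow"', html)

# Collapse each run of adjacent rel="nofollow" attributes into one.
rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
result = rexDoubledNofollow.sub(r'\1', result)

with open('new_document.html', 'w') as f:
    f.write(result)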

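The \s* means zero or more whitespace characters, which include tabs and newlines. It follows the searched sequence, and both are captured together as one group (enclosed in parentheses and referred to as \1 in the following sub call). The + after the group means one or more occurrences. Because a repeated group keeps only its last repetition, \1 expands to a single rel="nofollow" with its trailing whitespace.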
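
To see the second substitution in isolation, here is a minimal sketch run on a made-up fragment. The sample string and the variable names are illustrative assumptions, not taken from the real page:

```python
import re

# Same pattern as in the answer: one or more repetitions of
# rel="nofollow" followed by optional whitespace.
rex_doubled_nofollow = re.compile(r'(rel="nofollow"\s*)+')

# Hypothetical input: an adjacent duplicate split across a newline.
sample = ('<a href="http://www.theherbsplace.com/" rel="nofollow"\n'
          'rel="nofollow" style="x">')

# \1 holds only the last repetition matched by the group, so the whole
# run collapses to one rel="nofollow" plus its trailing whitespace.
cleaned = rex_doubled_nofollow.sub(r'\1', sample)
print(cleaned)
```

Note that the whitespace kept is the one after the last duplicate, so a newline between the two attributes disappears while the space before the next attribute survives.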

Expert Comment

by:pepr
ID: 41749898
I have noticed a bug in the original page:
<a 1="" href="http://www.theherbsplace.com/" imageanchor=" rel="nofollow" style="...

Notice the 1="" and the imageanchor=" without the closing double quote.
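
One rough way to hunt for tags like this is to flag every <a> tag that contains an odd number of double quotes. The sketch below is only a heuristic under stated assumptions: the function name and the sample input are hypothetical, and a '>' inside an attribute value would cut the match short.

```python
import re

# Matches an opening <a ...> tag up to the first '>' (heuristic only).
rex_anchor_tag = re.compile(r'<a\b[^>]*>')

def find_suspicious_anchors(html):
    """Return the <a> tags that contain an odd number of double quotes."""
    return [tag for tag in rex_anchor_tag.findall(html)
            if tag.count('"') % 2 == 1]

# Hypothetical sample: the first tag mimics the broken one, the second is fine.
sample = ('<a 1="" href="http://www.theherbsplace.com/" '
          'imageanchor=" rel="nofollow">x</a>'
          '<a href="http://www.theherbsplace.com/" rel="nofollow">y</a>')

for tag in find_suspicious_anchors(sample):
    print(tag)
```

This only points at candidates to inspect by hand; it does not repair them, and an HTML parser remains the more reliable tool.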

Author Closing Comment

by:sharingsunshine
ID: 41751387
Thanks for the help. As for the other exceptions you pointed out, I will just have to fix them as I find them.