Solved

What regex will remove duplicate rel="nofolow" tags?

Posted on 2016-08-09
3
132 Views
Last Modified: 2016-08-10
I had this question after viewing Python error - Need Help.

I created this regex to remove the duplicate rel="nofollow" tags using grep in TextWrangler but I am not clear how to add this into the Python regex code.

rel="nofollow"(\s|\n|\n\r)rel="nofollow"

Open in new window


replace with
rel="nofollow"

Open in new window

0
Comment
Question by:sharingsunshine
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 29

Accepted Solution

by:
pepr earned 500 total points
ID: 41749890
With respect to your previous question, you can use the following code. However, you should consider a quick hack. It would not work if the original page contained the rel="nofollow" attribute in another location (that is the duplicates not being adjacent). The proper, robust solution would need the use of an HTML parser:
import urllib2
import re

website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

with open('original_document.html', 'w') as f:
    f.write(html)

rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
result = rexURL.sub(r'\1 rel="nofollow"', html)

rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
result = rexDoubledNofollow.sub(r'\1', result)

with open('new_document.html', 'w') as f:
    f.write(result)

Open in new window

The \s* means zero or more whitespace characters that include also tabs and newlines. It is added to the searched sequence and captured as a group of characters (enclosed in parentheses, later referred as \1 in the next sub call). The + after means one or more occurrences.
0
 
LVL 29

Expert Comment

by:pepr
ID: 41749898
I have noticed a bug in the original page:
<a 1="" href="http://www.theherbsplace.com/" imageanchor=" rel="nofollow" style="...

Open in new window


Notice the 1="" and the imageanchor=" without the enclosing double quote.
0
 

Author Closing Comment

by:sharingsunshine
ID: 41751387
Thanks for the help.  On the other exceptions you pointed out I will just have to fix them as I find them.
0

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A set of related code is known to be a Module, it helps us to organize our code logically which is much easier for us to understand and use it. Module is an object with arbitrarily named attributes which can be used in binding and referencing. …
Flask is a microframework for Python based on Werkzeug and Jinja 2. This requires you to have a good understanding of Python 2.7. Lets install Flask! To install Flask you can use a python repository for libraries tool called pip. Download this f…
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…

759 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question