Solved

What regex will remove duplicate rel="nofollow" tags?

Posted on 2016-08-09
Last Modified: 2016-08-10
I had this question after viewing Python error - Need Help.

I created this regex to remove the duplicate rel="nofollow" tags using grep in TextWrangler, but I am not clear on how to add it to the Python regex code.

rel="nofollow"(\s|\n|\n\r)rel="nofollow"



replace with
rel="nofollow"

Question by:sharingsunshine

Accepted Solution

by: pepr
With respect to your previous question, you can use the following code. However, you should consider it a quick hack. It will not work if the original page already contains the rel="nofollow" attribute at another position in the same tag (that is, when the duplicates are not adjacent). A proper, robust solution would need an HTML parser:
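import urllib2  # Python 2 standard library
import re

# Download the page.
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

# Keep a copy of the original for comparison.
with open('original_document.html', 'w') as f:
    f.write(html)

# Append rel="nofollow" after every quoted URL that points to the shop.
rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
result = rexURL.sub(r'\1 rel="nofollow"', html)

# Collapse each run of adjacent rel="nofollow" attributes into a single one.
rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
result = rexDoubledNofollow.sub(r'\1', result)

with open('new_document.html', 'w') as f:
    f.write(result)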


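The \s* means zero or more whitespace characters, which also covers tabs and newlines. It follows the rel="nofollow" text, and the two together are captured as a group (enclosed in parentheses, and referred to as \1 in the replacement string of the sub call on the next line). The + after the group means one or more occurrences, so a whole run of adjacent attributes is matched at once and replaced by the last occurrence captured.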
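To see the behaviour described above in isolation, here is a minimal sketch that applies the same pattern to two made-up tags. The function name and both sample strings are invented for illustration and do not come from the real page. It should run unchanged on Python 2 and Python 3.

```python
import re

# Same pattern as in the answer: one or more adjacent rel="nofollow"
# attributes, each optionally followed by whitespace.
rex_doubled_nofollow = re.compile(r'(rel="nofollow"\s*)+')


def collapse_adjacent_nofollow(html):
    """Replace each run of adjacent rel="nofollow" with a single one."""
    # \1 is the last repetition captured by the group.
    return rex_doubled_nofollow.sub(r'\1', html)


# Hypothetical input: duplicates separated only by whitespace (a newline here).
adjacent = '<a href="x" rel="nofollow"\n rel="nofollow" style="y">'
# Hypothetical input: duplicates separated by another attribute.
separated = '<a rel="nofollow" href="x" rel="nofollow">'

print(collapse_adjacent_nofollow(adjacent))
print(collapse_adjacent_nofollow(separated))
```

The first tag comes out with a single rel="nofollow". The second one is returned unchanged, which is the limitation mentioned in the answer: the regex only sees duplicates that are next to each other.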

Expert Comment

by:pepr
I have noticed a bug in the original page:
<a 1="" href="http://www.theherbsplace.com/" imageanchor=" rel="nofollow" style="...



Notice the stray 1="" attribute and the imageanchor=" value that is missing its closing double quote.
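Regarding the parser-based approach mentioned in the accepted answer, the following is only a rough sketch of how duplicates could be detected even when they are not adjacent. It assumes Python 3 and its standard html.parser module (the code in the thread is Python 2). The class name, the function name, and the sample tag are hypothetical. It only reports duplicated attributes and does not rewrite the page.

```python
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Collects start tags that carry the same attribute name more than once."""

    def __init__(self):
        super().__init__()
        self.duplicates = []  # list of (tag, [repeated attribute names])

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept.
        names = [name for name, _ in attrs]
        repeated = sorted({n for n in names if names.count(n) > 1})
        if repeated:
            self.duplicates.append((tag, repeated))


def find_duplicate_attrs(html):
    finder = DuplicateAttrFinder()
    finder.feed(html)
    finder.close()
    return finder.duplicates


# Hypothetical tag with two rel attributes separated by another attribute.
sample = ('<a href="http://www.theherbsplace.com/" rel="nofollow" '
          'style="color:red" rel="nofollow">link</a>')
print(find_duplicate_attrs(sample))
```

A report like this could be used to locate the tags that the regex misses. How the parser treats malformed markup such as the unclosed imageanchor=" value above would need to be checked against the real page.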

Author Closing Comment

by:sharingsunshine
Thanks for the help. As for the other exceptions you pointed out, I will just have to fix them as I find them.