Solved

Python Regex Problem

Posted on 2016-10-08
Last Modified: 2016-11-16
I have a large Selenium script (Python 3.5.2, 32-bit) running on Windows 7.  However, I only need help with the part that uses Ctrl+A to select the source code of the page, copies it via pyperclip, performs a regex substitution on it, and then pastes the changed text back.

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    

    html_source = str(pyperclip.paste())
    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')
 
                   # notice the new placement of the left parenthesis
    result = rex.sub(r'\1 rel="nofollow"', html_source)
                   # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste



Essentially, I need to do two things:
1.  add rel="nofollow" to any href link pointing back to theherbsplace.com
2.  make sure that if the href already has nofollow, I don't add a duplicate rel="nofollow"

here is what the original looks like
<br />
<a href="http://www.theherbsplace.com/onsale" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>



here it is after the regex is applied
<br />
<a href="http://www.theherbsplace.com/onsale" t rel="nofollow"arget="_blank"><img alt="http://www.theherbsplace.com/onsale" s rel="nofollow"rc="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>



Notice that it is clipping the t from target and also altering the image tag by clipping the s from src, which I don't want affected.

I know regex, and it seems the .*? is greedy, but I don't know how to make it non-greedy, e.g. (.*?)?, because I don't know Python regex.

Thanks,
0
Question by:sharingsunshine
 
LVL 34

Expert Comment

by:Dan Craciun
[^rel="nofollow"] means:
Match any single character NOT present in the list  'rel="nofw' (case sensitive).

You need a negative lookahead.
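
To make the difference concrete, here is a minimal sketch using the tag from the question (the variable names are only illustrative). A bracket expression such as [^rel="nofollow"] consumes exactly one character, which is why the t of target ends up inside group 1. A lookahead only inspects the text that follows and consumes nothing.

```python
import re

tag = '<a href="http://www.theherbsplace.com/onsale" target="_blank">'

# Original pattern: the trailing [^...] is a character class, so it eats one character.
char_class = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')
broken = char_class.sub(r'\1 rel="nofollow"', tag)
print(broken)

# A lookahead is zero-width: it checks what follows without consuming it.
zero_width = re.search(r'"\s(?!rel=)', tag)
print(repr(zero_width.group(0)))
```

The first print reproduces the clipped t from the question; the second shows that the match stops right after the space, leaving target untouched.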

Try this:
search:
(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))



replace:
\1rel="nofollow" 



HTH,
Dan

PS1: There is a space after the last " in the replace string
PS2: "I know regex" is a bit strong. I've been working with complex regex for a few years and I can only say I know a bit of regex.
0
 

Author Comment

by:sharingsunshine
You are correct, to say "know" is incorrect.  Thanks for the reminder and gentle rebuke.

Here is the code you gave me.

    time.sleep(5)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    # elem.send_keys(Keys.COMMAND, 'v') #paste

    html_source = str(pyperclip.paste())
    rex = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')
    
                   # notice the new placement of the left parenthesis
    result = rex.sub(r'\1rel="nofollow" ', html_source)
                   # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste



this link is fine because it had no rel="nofollow" beforehand
<a href="http://www.theherbsplace.com/onsale" rel="nofollow" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>



these links already had rel="nofollow" as you can see from the original code
<a href="http://www.theherbsplace.com/forwomen.html" rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html" rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html" rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html" rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html" rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html" rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html" rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html" rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf" rel="nofollow">Most Popular Products Brochure</a></span></b><



here it is after I ran the above regex against them
<a href="http://www.theherbsplace.com/forwomen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf"rel="nofollow"  rel="nofollow">Most Popular Products Brochure</a></span></b>



as you can see it is putting in a rel="nofollow" when there is one already.
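
A likely reason for the duplicates, sketched on one of the links above (the variable names are illustrative): because \s? is optional, the engine can backtrack and match zero spaces. The lookahead then sees a space followed by rel="nofollow" rather than rel="nofollow" itself, so it succeeds and the substitution goes ahead.

```python
import re

tag = '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="text-align: -webkit-auto;">'
rex = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')

# Group 1 stops at the closing quote: the optional space was given back on backtracking.
captured = rex.search(tag).group(1)
print(repr(captured))

# The replacement is therefore inserted in front of the existing attribute.
result = rex.sub(r'\1rel="nofollow" ', tag)
print(result)
```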
0
 
LVL 34

Expert Comment

by:Dan Craciun
As I said, not that easy :)

Try this for search:
(?:.*(?!rel="nofollow"))(a href="http://www\.theherbsplace\.com/.*?"\s+)



Replace remains the same.

Logic:
- (?:.*(?!rel="nofollow")) is meant to search the current line for the string rel="nofollow". If it finds one, the regex should fail.
- (a href="http://www\.theherbsplace\.com/.*?"\s+) searches for any link on the theherbsplace domain and stores it in group 1.
0
 
LVL 26

Expert Comment

by:skullnobrains
you can apply 2 successive regular expressions:
- your current one
- one that replaces rel="nofollow"  rel="nofollow" with rel="nofollow"

the second regex will replace nothing when there is no duplication
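
The two-pass idea could be sketched as follows. This is an illustration under assumptions, not the exact expressions from the thread: the names ADD, DEDUPE and add_then_dedupe are made up here, and the first pattern is anchored on href= so that the same URL inside an img alt attribute is left alone.

```python
import re

# Pass 1: add rel="nofollow" after every href that points at the site.
ADD = re.compile(r'(href="http://www\.theherbsplace\.com/[^"]*")')
# Pass 2: collapse two or more consecutive rel="nofollow" attributes into one.
DEDUPE = re.compile(r'(rel="nofollow"\s*){2,}')

def add_then_dedupe(html):
    added = ADD.sub(r'\1 rel="nofollow"', html)
    return DEDUPE.sub(r'\1', added)

sample = ('<a href="http://www.theherbsplace.com/a.html" rel="nofollow">A</a> '
          '<a href="http://www.theherbsplace.com/b.html">B</a>')
print(add_then_dedupe(sample))
```

Note the limitation: the second pass only collapses duplicates that sit next to each other. If another attribute stands between href and the existing rel="nofollow", the two copies are not adjacent and both survive.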

... or can't you use if/else constructs in selenium code ?
0
 

Author Comment

by:sharingsunshine
I changed to your code, Dan, and we are still getting the double nofollows.

Is there a way using Python regex to make two passes: one to put in rel="nofollow" and the other to take out the duplicate rel="nofollow" attributes?

Skullnobrains, I don't understand what you are getting at.  Python has if/else constructs, but not centered around regex.
0
 
LVL 26

Expert Comment

by:skullnobrains
replace dups :

html_source=string.replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ' ,html_source);


--

if/else constructs are needlessly complicated in comparison but something like this would work

def repl(matchobj):
    link = matchobj.group(0)
    if 'rel="nofollow"' in link:
        return link  # already has rel="nofollow": leave it unchanged
    else:
        return link + ' rel="nofollow"'  # missing: add it

and use unquoted "repl" as the replacement value
the function will be called on each captured link

this way the rel="nofollow" is only added when it is not found in the captured string
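
A fuller sketch of the if/else idea, assuming the pattern captures each opening <a ...> tag as a whole so the callback can inspect all of its attributes. The names LINK_TAG, add_nofollow and fix_links are hypothetical, not from the thread. re.sub accepts a function as the replacement and calls it once per match.

```python
import re

# Matches a complete opening <a ...> tag whose href points at the site.
LINK_TAG = re.compile(r'<a\b[^>]*\bhref="http://www\.theherbsplace\.com/[^"]*"[^>]*>')

def add_nofollow(match):
    tag = match.group(0)
    if 'rel="nofollow"' in tag:
        return tag                        # already present anywhere in the tag
    return tag[:-1] + ' rel="nofollow">'  # insert just before the closing >

def fix_links(html):
    return LINK_TAG.sub(add_nofollow, html)
```

Because the check runs over the whole tag, it does not matter where the existing rel="nofollow" sits among the attributes. Tags that already carry two copies are left as they are; removing the extra copy is a separate step.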
0
 
LVL 34

Expert Comment

by:Dan Craciun
Yup, looks like you will have to do it in 2 steps:

1. search for all the lines that contain a link and do not contain rel="nofollow"
^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").)*$


2. Use a regular replace to add the rel="nofollow" tag.
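
The two steps can be sketched line by line in Python. This is an assumption-laden illustration rather than the pattern above: in Python, ^ and $ only match at each line when re.MULTILINE is set, so the sketch splits the text into lines and uses a plain substring test for step 1. The names LINK and add_nofollow_by_line are hypothetical.

```python
import re

LINK = re.compile(r'(a href="http://www\.theherbsplace\.com/[^"]*")')

def add_nofollow_by_line(html):
    fixed = []
    for line in html.splitlines():
        # Step 1: only touch lines that contain no rel="nofollow" at all.
        if 'rel="nofollow"' not in line:
            # Step 2: a regular substitution adds the attribute after the href.
            line = LINK.sub(r'\1 rel="nofollow"', line)
        fixed.append(line)
    return "\n".join(fixed)
```

As with the line-based regex, this assumes a line never mixes links that have the attribute with links that lack it; such a line is skipped entirely.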
0
 

Author Comment

by:sharingsunshine
sorry for the delay in answering but my script quit working. Consequently, I haven't been able to test your answers.

Here is the error via ipython
(ff2-32) C:\Users\Randal J. Watkins\ff2>ipython
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %run -d expertsbrazil2.py
*** Blank or comment
*** Blank or comment
*** Blank or comment
NOTE: Enter 'c' at the ipdb>  prompt to continue execution.
> c:\users\randal j. watkins\ff2\expertsbrazil2.py(3)<module>()
      1
      2 #from selenium.webdriver.remote.remote_connection import logging
----> 3 from selenium import webdriver
      4 from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
      5 from selenium.webdriver.common.proxy import *

ipdb> c
Traceback (most recent call last):
  File "C:\Users\Randal J. Watkins\ff2\expertsbrazil2.py", line 3, in <module>
    from selenium import webdriver
  File "c:\users\randal~1.wat\envs\ff2-32\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
    at FirefoxDriver.prototype.findElementInternal_ (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/driver-component.js:10770)
    at FirefoxDriver.prototype.findElement (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/driver-component.js:10779)
    at DelayedCommand.prototype.executeInternal_/h (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at DelayedCommand.prototype.executeInternal_ (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
    at DelayedCommand.prototype.execute/< (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)



This is my script
#from selenium.webdriver.remote.remote_connection import logging
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import *
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.expected_conditions import element_to_be_clickable
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary





import traceback
import random
import os
import time
import re
import logging
import pyperclip
#import tkinter as Tk



#os.environ["SELENIUM_SERVER_JAR"] = "/Users/rjw/Documents/Python/selenium-server-standalone-3.0.0-beta2.jar"


logging.basicConfig(filename='blogger.log')  # basicConfig() returns None, so fetch the logger separately
logger = logging.getLogger(__name__)


browser = None
try:
    binary = FirefoxBinary('C:\\Program Files\\Mozilla Firefox\\firefox')

    # start a single Firefox instance; "browser" is the name the finally block uses for cleanup
    browser = driver = webdriver.Firefox(firefox_binary=binary)
  #  driver = webdriver.Safari()
#    driver = webdriver.Chrome(service_log_path="~/Documents/Python/log")
  #  driver = webdriver.Chrome("\\Users\\Randal J. Watkins\\chromedriver_win32\\")
    driver.wait = WebDriverWait(driver, 10)

    driver.get('https://www.blogger.com/about/')   # navigate to your blog
    time.sleep(5)

    SIGN_IN = driver.find_element(By.LINK_TEXT, "SIGN IN")
    SIGN_IN.click()

    time.sleep(15)

    inputElement = driver.find_element(By.NAME, "Email")
    inputElement.send_keys("name@gmail.com")
    driver.find_element(By.NAME, "signIn").click()
    time.sleep(12)
    #if driver == webdriver.Chrome():
    inputElement = driver.find_element(By.NAME, "Passwd")
    inputElement = driver.find_element(By.ID, "Passwd")
    inputElement.send_keys("'password")
    driver.find_element(By.ID, "signIn").click()
    time.sleep(5)
#    alert = driver.switch_to.alert

 #   alert.accept()
    silver = driver.find_element(By.LINK_TEXT, "Silver Sol - Silver Shield by Nature\x27s Sunshine - Immune Support and Fighter")
    silver.click()
    time.sleep(9)
    posts = driver.find_element(By.LINK_TEXT, "Posts")
    posts.click()

    element = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "OMGM5KC-e-i")))
    element.click()
    time.sleep(9)
    
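    # the value is a compound selector (tag plus two classes), so CSS_SELECTOR is used instead of CLASS_NAME
    button = driver.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.blogg-button.blogg-collapse-right")))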
   
    button.click()

  
    time.sleep(9)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "htmlBoxWrapper")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingComposeBox")))
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy

    

 

    html_source = str(pyperclip.paste())
   # html_source = pyperclip.paste()

   

    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rex.sub(r'\1 rel="nofollow"', html_source)

    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(5)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source2 = str(pyperclip.paste())
    
    rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
    result2 = rexDoubledNofollow.sub(r'\1', html_source2)
    pyperclip.copy(result2)
                   
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(8)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source3 = str(pyperclip.paste())

    #rexImageNoFollow = re.compile(r'(imageanchor="1" rel="nofollow")')
   # result3 = rexImageNoFollow.sub(r'imageanchor="1"', htmlsource3)
   # pyperclip.copy(result3)

   # element7.send_keys(Keys.CONTROL,'a') #highlight all in box
   # element7.send_keys(Keys.DELETE) #delete old text
   # element7.send_keys(Keys.CONTROL, 'v') #paste

    #time.sleep(9)
    #button = driver.wait.until(EC.visibility_of_element_located((By.XPATH, "//button[contains(.,'Update')]"))
    #button.click()

   # pyperclip.paste() #paste results to page


  

except:
    print(traceback.format_exc())
finally:
    if browser:
        browser.quit()



It never sends control a to select the source code of the page.

Do you see anything which would  stop it from executing the control a?
0
 
LVL 26

Expert Comment

by:skullnobrains
this question was answered by dan ( assuming the urls are never split across multiple lines ) in comment ID: 41837966 and by myself in 41839977

the last post from the author shows he got a timeout while authenticating on the site after modifying a different part of his script... which has no link with the initial problem whatsoever

i believe this question has value in the db and should be kept.
0
 

Author Comment

by:sharingsunshine
I am the primary caregiver for my wife, who has been on hospice and is now having heart surgery.  If I can have some more time, I can test these answers.  I went a different way due to the error I mentioned.  That's why I didn't respond earlier.
0

 
LVL 26

Expert Comment

by:skullnobrains
we all hope for the best. take care of what's important.
0
 

Author Comment

by:sharingsunshine
Thanks skullnobrains for your concern and your well wishes.  However, I want to get this question decided because both of you are invested in it.

I set this up as a test page.  The first link has double nofollow's and the last link has none.

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>



this is the code I used

 #     
   #    highlights all of text and then copies it into pyperclip
 
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy old text
    html_source = str(pyperclip.paste())

    rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
    
        #rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rexURL.sub(r'\1 rel="nofollow"', html_source)

        #rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
        #result = rexDoubledNofollow.sub(r'\1', result)

        #rexImage = re.compile(r'(rel="nofollow"\s*)(imageanchor="1"\s*)(rel="nofollow"\s*)')
        #result = rexImage.sub(r'\1\2', result)

    
        
    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste



this is the result

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"



I suspect my compile is incorrect in the placement of parentheses so if you could  give me the complete statement then I can test it more accurately.
0
 

Author Comment

by:sharingsunshine
skullnobrains if you can show me where and how to insert this code in relation to the code I am using then I can test it also

https://gyazo.com/f8f82e8d41c7290d4e12197dec7276ac
0
 
LVL 26

Expert Comment

by:skullnobrains
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')


sorry i misused "replace" in my previous post: in Python 3 it is a method of the string itself, as used above

--

btw the lookahead does not work because the double quotes precede it and the regex might accidentally grab more than expected

i'd use a simpler one

re.compile(r'("http://www\.theherbsplace\.com/[^"]*")')


or try this lookahead without the replace

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^"]*)")')
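
To illustrate the "grab more than expected" remark, here is a small comparison (a sketch; the sample tag and the names lazy and strict are illustrative). With a lazy .*? the engine is free to extend the match past the closing quote of the URL until the lookahead is satisfied, whereas [^"]* cannot cross a double quote.

```python
import re

tag = '<a href="http://www.theherbsplace.com/x.html" rel="nofollow" style="a">'

# Lazy dot: backtracking stretches the match across the existing attribute.
lazy = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s+)(?!rel="nofollow")')
grabbed = lazy.search(tag).group(1)
print(repr(grabbed))

# Negated class: the URL part must end at its own closing quote, so nothing matches here.
strict = re.compile(r'(a href="http://www\.theherbsplace\.com/[^"]*"\s+)(?!rel="nofollow")')
print(strict.search(tag))
```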


0
 

Author Comment

by:sharingsunshine
using this code
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')


this is what I get

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>



Using the simpler one I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"   target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"  imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>



it added the rel="nofollow" to the last link but didn't handle the two rel="nofollow" attributes in the first link.  Also, it gives a NoneType error message.

using your lookahead with the replace I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"    target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"   imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>



which yields the same result as your simpler one and the same error NoneType
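
The leftover duplicate in the first link is not adjacent to the other copy (target="_blank" sits between them), so a pattern for consecutive duplicates cannot catch it. One possible approach, sketched here with hypothetical names (A_TAG, NOFOLLOW, dedupe_nofollow, dedupe_all), is to look at each opening <a ...> tag on its own and keep only the first rel="nofollow" in it.

```python
import re

A_TAG = re.compile(r'<a\b[^>]*>')           # any opening <a ...> tag
NOFOLLOW = re.compile(r'\s*rel="nofollow"')  # the attribute plus the whitespace before it

def dedupe_nofollow(match):
    tag = match.group(0)
    hits = list(NOFOLLOW.finditer(tag))
    # Delete every occurrence after the first, working backwards so offsets stay valid.
    for hit in reversed(hits[1:]):
        tag = tag[:hit.start()] + tag[hit.end():]
    return tag

def dedupe_all(html):
    return A_TAG.sub(dedupe_nofollow, html)
```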
0
 
LVL 26

Accepted Solution

by:
skullnobrains earned 500 total points
the first one is not ok because the regex is wrong and adds the nofollow stuff at the end of the line rather than after the url

--

the "simple" solution will work with a few modifications ( i did not handle attributes other than nofollow after href )
but remember it needs the nofollow stuff to always appear in the same place in the existing source code.
here i assume it to be the last attribute of the link

re.compile(r'("http://www\.theherbsplace\.com/[^>]*)')



if the nofollow stuff appears in various places depending on the link, it is probably simpler to use the if...else construct

--

likewise the corrected lookahead would be

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^>]*))')
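
For comparison, here is one way the lookahead could be placed so that it inspects the remainder of the tag instead of the URL. This is a sketch under assumptions, not the expression given in this answer: the names MISSING_NOFOLLOW and add_missing_nofollow are hypothetical, and an existing rel="nofollow" is assumed to come after the href within the same tag.

```python
import re

MISSING_NOFOLLOW = re.compile(
    r'(href="http://www\.theherbsplace\.com/[^"]*")'  # group 1: the complete href attribute
    r'(?![^>]*rel="nofollow")'                         # fail if rel="nofollow" appears before the closing >
)

def add_missing_nofollow(html):
    return MISSING_NOFOLLOW.sub(r'\1 rel="nofollow"', html)
```

The [^>]* inside the lookahead lets it skip over any other attributes, so it does not matter whether rel="nofollow" is the last attribute or sits in the middle. A rel="nofollow" placed before the href would not be detected by this sketch.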


0
 

Author Closing Comment

by:sharingsunshine
thanks for the help.
0
 
LVL 26

Expert Comment

by:skullnobrains
hope you did get it straight.
feel free to ask in this thread if needed.

best hopes of recovery to your wife
1
 

Author Comment

by:sharingsunshine
You are so kind to remember her but actually she is now in the arms of Jesus.  She passed away Friday.  She is out of pain and jumping for joy and I will see her again.
0
 
LVL 26

Expert Comment

by:skullnobrains
sorry for your loss

see you around the threads
1
