• Status: Solved

Python Regex Problem

I have a large Selenium script (Python 3.5.2, 32-bit) running on Windows 7. However, I only need to focus on the part that uses Ctrl+A to select the source code of the page, copies it to the clipboard (read through pyperclip), performs a regex substitution on it, and then pastes the changed text back.

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    

    html_source = str(pyperclip.paste())
    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')
 
                   # notice the new placement of the left parenthesis
    result = rex.sub(r'\1 rel="nofollow"', html_source)
                   # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


Essentially, I need to do two things:
1. Add rel="nofollow" to any href link pointing back to theherbsplace.com.
2. Make sure that if the link already has rel="nofollow", I don't add a duplicate rel="nofollow".

Here is what the original looks like:
<br />
<a href="http://www.theherbsplace.com/onsale" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


here it is after the regex is applied
<br />
<a href="http://www.theherbsplace.com/onsale" t rel="nofollow"arget="_blank"><img alt="http://www.theherbsplace.com/onsale" s rel="nofollow"rc="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


Notice that it clips the t from target and also changes the image src by clipping the s, neither of which I want affected.

I know regex and it seems the .*? is greedy, but I don't know how to make it non-greedy because I don't know Python regex.

Thanks,
Asked by: sharingsunshine

Dan Craciun (IT Consultant) commented:
[^rel="nofollow"] means:
Match any single character NOT present in the list  'rel="nofw' (case sensitive).
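
To see the character class at work, here is a minimal sketch that runs the pattern from the question against a single tag. The tag is shortened from the one in the question, and the variable names are only illustrative.

```python
import re

# Pattern from the question: the trailing [...] is a character class, so it
# consumes ONE character that is not r, e, l, =, ", n, o, f or w.
rex = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')

tag = '<a href="http://www.theherbsplace.com/onsale" target="_blank">'

match = rex.search(tag)
captured = match.group(1)  # the capture swallows the "t" of target
result = rex.sub(r'\1 rel="nofollow"', tag)

print(repr(captured))
print(result)
```

The captured group ends in the letter t, which is why the replacement lands in the middle of the word target.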

You need a negative lookahead.

Try this:
search:
(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))


replace:
\1rel="nofollow" 


HTH,
Dan

PS1: There is a space after the last " in the replace string
PS2: "I know regex" is a bit strong. I've been working with complex regex for a few years and I can only say I know a bit of regex.

sharingsunshine (Author) commented:
You are correct, to say "know" is incorrect.  Thanks for the reminder and gentle rebuke.

Here is the code you gave me.

    time.sleep(5)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    # elem.send_keys(Keys.COMMAND, 'v') #paste

    html_source = str(pyperclip.paste())
    rex = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')

    # notice the new placement of the left parenthesis
    result = rex.sub(r'\1rel="nofollow" ', html_source)
    # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


This link is fine because it had no rel="nofollow" to begin with:
<a href="http://www.theherbsplace.com/onsale" rel="nofollow" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


these links already had rel="nofollow" as you can see from the original code
<a href="http://www.theherbsplace.com/forwomen.html" rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html" rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html" rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html" rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html" rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html" rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html" rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html" rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf" rel="nofollow">Most Popular Products Brochure</a></span></b><


here it is after I ran the above regex against them
<a href="http://www.theherbsplace.com/forwomen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf"rel="nofollow"  rel="nofollow">Most Popular Products Brochure</a></span></b>


as you can see it is putting in a rel="nofollow" when there is one already.
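
One possible explanation for the duplicate is backtracking: when the lookahead fails after \s? has consumed the space, the engine retries with \s? matching nothing, and the lookahead then sees a space rather than rel="nofollow". A minimal sketch on one of the links above (variable names are illustrative):

```python
import re

# The search pattern suggested above, unchanged.
rex = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')

tag = ('<a href="http://www.theherbsplace.com/formen.html" rel="nofollow" '
       'style="text-align: -webkit-auto;">Men</a>')

match = rex.search(tag)
captured = match.group(1)  # stops right after the closing quote of the URL
result = rex.sub(r'\1rel="nofollow" ', tag)

print(repr(captured))
print(result)
```

The match ends before the space, so the replacement text is inserted in front of the existing attribute, giving the doubled rel="nofollow" with two spaces in between.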

Dan Craciun (IT Consultant) commented:
As I said, not that easy :)

Try this for search:
(?:.*(?!rel="nofollow"))(a href="http://www\.theherbsplace\.com/.*?"\s+)


Replace remains the same.

Logic:
- (?:.*(?!rel="nofollow")) is meant to search the current line for any rel="nofollow" string. If it finds one, the regex should fail.
- (a href="http://www\.theherbsplace\.com/.*?"\s+) will search for any link on the domain theherbsplace and store it in group 1.
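
A small sketch comparing two ways of writing the "line must not contain rel="nofollow"" check may help when testing this. The sample lines are shortened and the names loose and strict are illustrative, not part of the suggested regex.

```python
import re

line_with_rel = '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow">Men</a>'
line_without = '<a href="http://www.theherbsplace.com/onsale" target="_blank">Sale</a>'

# .* followed by a lookahead: .* can run to the end of the line, where nothing
# follows, so the negative lookahead succeeds even if rel="nofollow" is present.
loose = re.compile(r'(?:.*(?!rel="nofollow"))')

# Lookahead checked before every character: matching stops where
# rel="nofollow" begins, so a full-line match is only possible without it.
strict = re.compile(r'(?:(?!rel="nofollow").)*')

loose_hit = loose.match(line_with_rel) is not None
strict_with = strict.fullmatch(line_with_rel) is not None
strict_without = strict.fullmatch(line_without) is not None

print(loose_hit, strict_with, strict_without)  # True False True
```

Only the second form rejects a line that already carries the attribute.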
 
skullnobrains commented:
You can apply two successive regular expressions:
- your current one
- one that replaces rel="nofollow"  rel="nofollow" with rel="nofollow"

The second regex will replace nothing when there is no duplication.

... or can't you use if/else constructs in selenium code ?
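
The two-pass idea could look roughly like the sketch below. It assumes double-quoted attributes and that the first pass anchors on href= (so that alt and src values are left alone); the names ADD, DUPES and add_then_dedupe are made up for illustration.

```python
import re

# Pass 1: append rel="nofollow" right after every matching href value.
ADD = re.compile(r'(href="http://www\.theherbsplace\.com/[^"]*")')
# Pass 2: collapse a run of adjacent rel="nofollow" attributes into one.
DUPES = re.compile(r'rel="nofollow"(?:\s+rel="nofollow")+')

def add_then_dedupe(html):
    added = ADD.sub(r'\1 rel="nofollow"', html)
    return DUPES.sub('rel="nofollow"', added)

had = '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="x">Men</a>'
bare = '<a href="http://www.theherbsplace.com/onsale" target="_blank">Sale</a>'

print(add_then_dedupe(had))
print(add_then_dedupe(bare))
```

This only collapses duplicates that sit next to each other; if another attribute separates them, both copies remain.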

sharingsunshine (Author) commented:
I changed to your code, Dan, and I am still getting the double nofollows.

Is there a way, using Python regex, to make two passes: one to put in rel="nofollow" and the other to take out the duplicate rel="nofollow" attributes?

Skullnobrains, I don't understand what you are getting at. Python has if/else constructs, but they are not centered around regex.

skullnobrains commented:
replace dups :

html_source=string.replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ' ,html_source);

--

if/else constructs are uselessly complicated in comparison but something like this would work

def repl(matchobj):
    link = matchobj.group(0)
    if 'rel="nofollow"' in link:
        return link  # already tagged: return the link unchanged
    else:
        return link.replace('rel="nofollow"', '')  # nothing to remove here; add the attribute in this branch instead

Then pass repl (unquoted) as the replacement value.
The function will be called on each captured link.

you may directly add the rel=nofollow in a similar way only when it is not found in the captured string
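
A fuller sketch of the callback approach, under some assumptions: each opening <a ...> tag is matched as a whole, attributes use double quotes, and no attribute value contains ">". The names A_TAG, HERBS_HREF, fix_tag and add_nofollow are hypothetical, and the sample is shortened from links quoted in this thread.

```python
import re

A_TAG = re.compile(r'<a\b[^>]*>', re.IGNORECASE)
HERBS_HREF = re.compile(r'href="http://www\.theherbsplace\.com/[^"]*"')
NOFOLLOW = 'rel="nofollow"'

def fix_tag(match):
    """Called by re.sub for every opening <a ...> tag."""
    tag = match.group(0)
    if not HERBS_HREF.search(tag):
        return tag  # link to another site: leave it alone
    if NOFOLLOW in tag:
        # Keep the first rel="nofollow" and drop any later duplicates.
        head, _, tail = tag.partition(NOFOLLOW)
        tail = re.sub(r'\s*rel="nofollow"', '', tail)
        return head + NOFOLLOW + tail
    # No rel="nofollow" yet: add it right after the href attribute.
    return HERBS_HREF.sub(lambda m: m.group(0) + ' ' + NOFOLLOW, tag, count=1)

def add_nofollow(html):
    return A_TAG.sub(fix_tag, html)

sample = (
    '<a href="http://www.theherbsplace.com/a.html" rel="nofollow" '
    'target="_blank" rel="nofollow"><b>Buy</b></a>\n'
    '<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a>'
)
print(add_nofollow(sample))
```

Because the decision is made per tag, the position of rel="nofollow" among the other attributes does not matter. Regex-based HTML editing stays fragile, so an HTML parser may be the safer choice for less regular markup.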

Dan Craciun (IT Consultant) commented:
Yup, looks like you will have to do it in 2 steps:

1. search for all the lines that contain a link and do not contain rel="nofollow"
^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").)*$

2. Use a regular replace to add the rel="nofollow" tag.

sharingsunshine (Author) commented:
Sorry for the delay in answering, but my script quit working. Consequently, I haven't been able to test your answers.

Here is the error via ipython
(ff2-32) C:\Users\Randal J. Watkins\ff2>ipython
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %run -d expertsbrazil2.py
*** Blank or comment
*** Blank or comment
*** Blank or comment
NOTE: Enter 'c' at the ipdb>  prompt to continue execution.
> c:\users\randal j. watkins\ff2\expertsbrazil2.py(3)<module>()
      1
      2 #from selenium.webdriver.remote.remote_connection import logging
----> 3 from selenium import webdriver
      4 from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
      5 from selenium.webdriver.common.proxy import *

ipdb> c
Traceback (most recent call last):
  File "C:\Users\Randal J. Watkins\ff2\expertsbrazil2.py", line 3, in <module>
    from selenium import webdriver
  File "c:\users\randal~1.wat\envs\ff2-32\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
    at FirefoxDriver.prototype.findElementInternal_ (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/driver-component.js:10770)
    at FirefoxDriver.prototype.findElement (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/driver-component.js:10779)
    at DelayedCommand.prototype.executeInternal_/h (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at DelayedCommand.prototype.executeInternal_ (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
    at DelayedCommand.prototype.execute/< (file:///C:/Users/RANDAL~1.WAT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)


This is my script
#from selenium.webdriver.remote.remote_connection import logging
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import *
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.expected_conditions import element_to_be_clickable
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary





import traceback
import random
import os
import time
import re
import logging
import pyperclip
#import tkinter as Tk



#os.environ["SELENIUM_SERVER_JAR"] = "/Users/rjw/Documents/Python/selenium-server-standalone-3.0.0-beta2.jar"


logger = logging.basicConfig(filename='blogger.log')


browser = None
try:
    browser = webdriver.Firefox()
    binary = FirefoxBinary('C:\\Program Files\\Mozilla Firefox\\firefox')

    driver = webdriver.Firefox(firefox_binary=binary)
  #  driver = webdriver.Safari()
#    driver = webdriver.Chrome(service_log_path="~/Documents/Python/log")
  #  driver = webdriver.Chrome("\\Users\\Randal J. Watkins\\chromedriver_win32\\")
    driver.wait = WebDriverWait(driver, 10)

    driver.get('https://www.blogger.com/about/')   # navigate to your blog
    time.sleep(5)

    SIGN_IN = driver.find_element(By.LINK_TEXT, "SIGN IN")
    SIGN_IN.click()

    time.sleep(15)

    inputElement = driver.find_element(By.NAME, "Email")
    inputElement.send_keys("name@gmail.com")
    driver.find_element(By.NAME, "signIn").click()
    time.sleep(12)
    #if driver == webdriver.Chrome():
    inputElement = driver.find_element(By.NAME, "Passwd")
    inputElement = driver.find_element(By.ID, "Passwd")
    inputElement.send_keys("'password")
    driver.find_element(By.ID, "signIn").click()
    time.sleep(5)
#    alert = driver.switch_to.alert

 #   alert.accept()
    silver = driver.find_element(By.LINK_TEXT, "Silver Sol - Silver Shield by Nature\x27s Sunshine - Immune Support and Fighter")
    silver.click()
    time.sleep(9)
    posts = driver.find_element(By.LINK_TEXT, "Posts")
    posts.click()

    element = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "OMGM5KC-e-i")))
    element.click()
    time.sleep(9)
    
    button = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "button.blogg-button.blogg-collapse-right")))
   
    button.click()

  
    time.sleep(9)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "htmlBoxWrapper")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingComposeBox")))
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy

    

 

    html_source = str(pyperclip.paste())
   # html_source = pyperclip.paste()

   

    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rex.sub(r'\1 rel="nofollow"', html_source)

    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(5)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source2 = str(pyperclip.paste())
    
    rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
    result2 = rexDoubledNofollow.sub(r'\1', html_source2)
    pyperclip.copy(result2)
                   
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(8)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source3 = str(pyperclip.paste())

    #rexImageNoFollow = re.compile(r'(imageanchor="1" rel="nofollow")')
   # result3 = rexImageNoFollow.sub(r'imageanchor="1"', htmlsource3)
   # pyperclip.copy(result3)

   # element7.send_keys(Keys.CONTROL,'a') #highlight all in box
   # element7.send_keys(Keys.DELETE) #delete old text
   # element7.send_keys(Keys.CONTROL, 'v') #paste

    #time.sleep(9)
    #button = driver.wait.until(EC.visibility_of_element_located((By.XPATH, "//button[contains(.,'Update')]"))
    #button.click()

   # pyperclip.paste() #paste results to page


  

except:
    print(traceback.format_exc())
finally:
    if browser:
        browser.quit()


It never sends Ctrl+A to select the source code of the page.

Do you see anything that would stop it from executing the Ctrl+A?

skullnobrains commented:
This question was answered by Dan (assuming the URLs are never split across multiple lines) in comment ID: 41837966, and by myself in 41839977.

The author's last post shows that he got a timeout while authenticating on the site after modifying a different part of his script, which has nothing to do with the initial problem.

I believe this question has value in the database and should be kept.

sharingsunshine (Author) commented:
I am the primary caregiver for my wife, who has been on hospice and is now having heart surgery.  If I can have some more time, I can test these answers.  I went a different way due to the error I mentioned.  That's why I didn't respond earlier.

skullnobrains commented:
we all hope for the best. take care of what's important.

sharingsunshine (Author) commented:
Thanks skullnobrains for your concern and your well wishes.  However, I want to get this question decided because both of you are invested in it.

I set this up as a test page.  The first link has double nofollow's and the last link has none.

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


this is the code I used

    # highlights all of the text and then copies it into pyperclip

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy old text
    html_source = str(pyperclip.paste())

    rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
    
        #rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rexURL.sub(r'\1 rel="nofollow"', html_source)

        #rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
        #result = rexDoubledNofollow.sub(r'\1', result)

        #rexImage = re.compile(r'(rel="nofollow"\s*)(imageanchor="1"\s*)(rel="nofollow"\s*)')
        #result = rexImage.sub(r'\1\2', result)

    
        
    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


this is the result

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"


I suspect my compile is incorrect in the placement of parentheses so if you could  give me the complete statement then I can test it more accurately.

sharingsunshine (Author) commented:
skullnobrains, if you can show me where and how to insert this code in relation to the code I am using, then I can test it too:

https://gyazo.com/f8f82e8d41c7290d4e12197dec7276ac

skullnobrains commented:
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')

Sorry, I misused "replace" in my previous post: in Python 3 it is a method of str objects, not a function in the string module.

--

By the way, the lookahead does not work because the double quotes precede it, and the regex might accidentally grab more than expected.

I'd use a simpler one:

re.compile(r'("http://www\.theherbsplace\.com/[^"]*")')

or try this lookahead without the replace

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^"]*)")')

sharingsunshine (Author) commented:
using this code
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')

this is what I get

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


Using the simpler one I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"   target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"  imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


It added rel="nofollow" to the last link but didn't handle the two rel="nofollow" attributes in the first link.  It also gives a NoneType error message.

using your lookahead with the replace I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"    target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"   imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


which yields the same result as your simpler one and the same error NoneType

skullnobrains commented:
The first one is not OK because the regex is wrong: it adds the nofollow attribute at the end of the line rather than after the URL.

--

The "simple" solution will work with small modifications (I did not handle attributes other than nofollow after href),
but remember that it needs the nofollow attribute to always appear in the same place in the existing source code.
Here I assume it to be the last attribute of the link:

re.compile(r'("http://www\.theherbsplace\.com/[^>]*)')


If the nofollow attribute appears in various places depending on the link, it is probably simpler to use the if...else construct.

--

likewise the corrected lookahead would be

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^>]*))')
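
As a variation on the lookahead idea, the check can be placed right after <a and made to scan the rest of the opening tag, so that the position of rel="nofollow" no longer matters. This is only a sketch under the same assumptions as the thread (double-quoted attributes, the exact text rel="nofollow", no ">" inside attribute values); the names rex and add_nofollow_lookahead are illustrative.

```python
import re

rex = re.compile(
    r'(<a\b(?![^>]*rel="nofollow")'  # fail if the tag has rel="nofollow" anywhere
    r'[^>]*?href="http://www\.theherbsplace\.com/[^"]*")'
)

def add_nofollow_lookahead(html):
    # Insert the attribute right after the href value of untagged links.
    return rex.sub(r'\1 rel="nofollow"', html)

untagged = '<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a>'
tagged = ('<a href="http://www.theherbsplace.com/" imageanchor="1" '
          'rel="nofollow" style="clear: left;">logo</a>')

print(add_nofollow_lookahead(untagged))
print(add_nofollow_lookahead(tagged))
```

Unlike a second de-duplication pass, this leaves links that already contain two rel="nofollow" attributes as they are.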

sharingsunshine (Author) commented:
thanks for the help.

skullnobrains commented:
hope you did get it straight.
feel free to ask in this thread if needed.

best hopes of recovery to your wife

sharingsunshine (Author) commented:
You are so kind to remember her but actually she is now in the arms of Jesus.  She passed away Friday.  She is out of pain and jumping for joy and I will see her again.

skullnobrains commented:
sorry for your loss

see you around the threads