Solved

Python Regex Problem

Posted on 2016-10-08
Last Modified: 2016-11-16
Hi, I have a large Selenium Python 3.5.2 32-bit script running on Windows 7. However, I only need to focus on the part that uses Ctrl+A to select the source code of the page, copies it to the clipboard (read with pyperclip), runs a regex on it, then pastes the changed text back.

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    

    html_source = str(pyperclip.paste())
    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')
 
                   # notice the new placement of the left parenthesis
    result = rex.sub(r'\1 rel="nofollow"', html_source)
                   # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


Essentially, I need to do two things:
1. Add rel="nofollow" to any href link pointing back to theherbsplace.com.
2. If the href already has rel="nofollow", make sure I don't add a duplicate.

here is what the original looks like
<br />
<a href="http://www.theherbsplace.com/onsale" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


here it is after the regex is applied
<br />
<a href="http://www.theherbsplace.com/onsale" t rel="nofollow"arget="_blank"><img alt="http://www.theherbsplace.com/onsale" s rel="nofollow"rc="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


Notice that it clips the t from target and also alters the image src by clipping an s, which I don't want affected.

I know regex and it seems the .*? is greedy, but I don't know how to make it non-greedy, e.g. (.*?)?, because I don't know Python regex.

Thanks,
Question by:sharingsunshine
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 41835550
[^rel="nofollow"] means:
Match any single character NOT present in the list  'rel="nofw' (case sensitive).
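
To see why the t of target gets clipped, here is a small sketch that runs the pattern from the question against one link. The sample string is shortened from the question's HTML; nothing else is assumed.

```python
import re

# Pattern from the question. The trailing [...] is a character class, not a
# "not followed by" test: it consumes one character that is none of
# r, e, l, =, ", n, o, f, w.
rex = re.compile(r'("http://www\.theherbsplace\.com/.*?"\s?[^rel="nofollow"])')

sample = '<a href="http://www.theherbsplace.com/onsale" target="_blank">'

captured = rex.search(sample).group(1)
result = rex.sub(r'\1 rel="nofollow"', sample)

print(captured)  # the capture ends with the "t" of target
print(result)
```

Because the t is inside group 1, the replacement text lands after it, which is the t rel="nofollow"arget output shown in the question.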

You need a negative lookahead.

Try this:
search:
(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))


replace:
\1rel="nofollow" 


HTH,
Dan

PS1: There is a space after the last " in the replace string
PS2: "I know regex" is a bit strong. I've been working with complex regex for a few years and I can only say I know a bit of regex.
 

Author Comment

by:sharingsunshine
ID: 41837598
You are correct, to say "know" is incorrect.  Thanks for the reminder and gentle rebuke.

Here is the code you gave me.

    time.sleep(5)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))

    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    # elem.send_keys(Keys.COMMAND, 'v') #paste

    html_source = str(pyperclip.paste())
    rex = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')
    
                   # notice the new placement of the left parenthesis
    result = rex.sub(r'\1rel="nofollow" ', html_source)
                   # double quotes are just chars -- the literal wrapped in single quotes
    pyperclip.copy(result) #copy results to clipboard
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


This link comes out fine because it had no rel="nofollow" to begin with:
<a href="http://www.theherbsplace.com/onsale" rel="nofollow" target="_blank"><img alt="http://www.theherbsplace.com/onsale" src="http://image.exct.net/lib/ff2c1c757166/i/7/58f9627e-a.jpg" style="border-width: 0px; display: block; height: auto; max-width: 600px; width: 100%;" /></a>


these links already had rel="nofollow" as you can see from the original code
<a href="http://www.theherbsplace.com/forwomen.html" rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html" rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html" rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html" rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html" rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html" rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html" rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html" rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf" rel="nofollow">Most Popular Products Brochure</a></span></b><


here it is after I ran the above regex against them
<a href="http://www.theherbsplace.com/forwomen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Women</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/formen.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Men</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/children.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Children</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/essential.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Essential Oils</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/cleansing.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Cleansing</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/weightloss.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Weight Loss</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Heartworm_sp_36.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Pets - Heartworms</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Mood_Support_page_1_c_130.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;" target="_blank">Mood Support</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><a href="http://www.theherbsplace.com/Multi_Vitamin_page_1_c_115.html"rel="nofollow"  rel="nofollow" style="text-align: -webkit-auto;">Multi-Vitamins</a><span style="text-align: -webkit-auto;">&nbsp;*&nbsp;</span><span style="text-align: -webkit-auto;"><a href="http://www.theherbsplace.com/pdf/brochure_website_2011.pdf"rel="nofollow"  rel="nofollow">Most Popular Products Brochure</a></span></b>


as you can see it is putting in a rel="nofollow" when there is one already.
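
A likely explanation for the doubling, sketched on a single link: \s? is optional, so when the lookahead fails right before rel="nofollow" the engine backtracks, matches zero whitespace, and retries the lookahead one character earlier, where the next character is a space and the test passes. The second pattern below is my own variation, not something posted in this thread; it moves the optional whitespace inside the lookahead so it cannot be given back, and it only covers the case where rel="nofollow" directly follows the href.

```python
import re

link = '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="x">Men</a>'

# Pattern as posted: the lookahead is retried after \s? gives up its space.
posted = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s?(?!rel="nofollow"))')
doubled = posted.sub(r'\1rel="nofollow" ', link)

# Variation (assumption): test "optional whitespace + rel" in one lookahead,
# and use [^"]* so the URL part cannot run past its closing quote.
variant = re.compile(r'(a href="http://www\.theherbsplace\.com/[^"]*")(?!\s*rel="nofollow")')
unchanged = variant.sub(r'\1 rel="nofollow"', link)

untagged = '<a href="http://www.theherbsplace.com/onsale" target="_blank">'
tagged = variant.sub(r'\1 rel="nofollow"', untagged)

print(doubled)
print(unchanged)
print(tagged)
```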
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 41837966
As I said, not that easy :)

Try this for search:
(?:.*(?!rel="nofollow"))(a href="http://www\.theherbsplace\.com/.*?"\s+)


Replace remains the same.

Logic:
- (?:.*(?!rel="nofollow")) is meant to scan the current line for the string rel="nofollow"; if it finds one, the regex should fail.
- (a href="http://www\.theherbsplace\.com/.*?"\s+) matches any link on the theherbsplace domain and stores it in group 1.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41838112
you can apply 2 successive regular expressions:
- your current one
- one that replaces rel="nofollow"  rel="nofollow" with rel="nofollow"

the second ereg will replace nothing when there is no duplication
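
The two-pass idea can be sketched as follows. The function name and the exact patterns are my assumptions rather than code from this thread, and the second pass only collapses copies of rel="nofollow" that sit next to each other.

```python
import re

# Pass 1: tag every link to the site, whether or not it is already tagged.
ADD = re.compile(r'(<a\s+href="http://www\.theherbsplace\.com/[^"]*")')
# Pass 2: collapse a run of adjacent rel="nofollow" attributes into one.
DEDUPE = re.compile(r'(rel="nofollow")(?:\s+rel="nofollow")+')


def add_nofollow_two_pass(html):
    """Add rel="nofollow" after each matching href, then remove adjacent duplicates."""
    added = ADD.sub(r'\1 rel="nofollow"', html)
    return DEDUPE.sub(r'\1', added)


html = (
    '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow" style="x">Men</a> '
    '<a href="http://www.theherbsplace.com/">Home</a>'
)
fixed = add_nofollow_two_pass(html)
print(fixed)
```

The second substitution leaves the text unchanged when nothing is duplicated, so running it unconditionally is safe.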

... or can't you use if/else constructs in selenium code ?
 

Author Comment

by:sharingsunshine
ID: 41839295
I changed to your code Dan and we are still getting the double nofollows.
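
One possible reason the second pattern still doubles the attribute: in (?:.*(?!rel="nofollow")) the .* simply backtracks to a position that is not immediately followed by rel="nofollow", so the group never vetoes the line. The text consumed by that group is also part of the match but not of group 1, so the substitution drops it. A short sketch, using one shortened link from the earlier post:

```python
import re

rex = re.compile(r'(?:.*(?!rel="nofollow"))(a href="http://www\.theherbsplace\.com/.*?"\s+)')

line = '<a href="http://www.theherbsplace.com/formen.html" rel="nofollow">Men</a>'

match = rex.search(line)
result = rex.sub(r'\1rel="nofollow" ', line)

print(match.group(0))  # includes the leading "<" swallowed by .*
print(match.group(1))
print(result)          # rel="nofollow" appears twice and the "<" is gone
```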

Is there a way using Python regex to make two passes: one to put in rel="nofollow" and a second to take out the duplicate rel="nofollow" attributes?

Skullnobrains, I don't know what you are getting at. Python has if/else constructs, but they aren't built around regex.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41839977
replace dups :

html_source=string.replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ' ,html_source);

--

if/else constructs are uselessly complicated in comparison but something like this would work

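def repl(matchobj):
  link = matchobj.group(0)  # the whole matched link text
  if 'rel="nofollow"' in link:
    return link  # already tagged: leave it unchanged
  else:
    return link.replace('rel="nofollow"', '')  # no-op here; this is the branch where rel="nofollow" could be added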

and use unquoted "repl" as the replacement value
the function will be called on each captured link

you may directly add the rel=nofollow in a similar way only when it is not found in the captured string
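
A runnable sketch of the callback approach, under my own assumptions: the regex matches a whole opening <a ...> tag, and the helper names (LINK_TAG, ensure_single_nofollow, fix_links) are hypothetical. Instead of testing with if/else, the callback strips every existing rel="nofollow" from the tag and then adds exactly one, which handles missing, single, and duplicated attributes alike.

```python
import re

LINK_TAG = re.compile(r'<a\s[^>]*href="http://www\.theherbsplace\.com/[^"]*"[^>]*>')
REL_NOFOLLOW = re.compile(r'\s+rel="nofollow"')


def ensure_single_nofollow(match):
    """Return the matched tag with exactly one rel="nofollow" as last attribute."""
    tag = match.group(0)
    stripped = REL_NOFOLLOW.sub('', tag)       # drop any existing copies
    return stripped[:-1] + ' rel="nofollow">'  # re-add one before the closing ">"


def fix_links(html):
    # Passing the function itself (unquoted) makes re call it once per match.
    return LINK_TAG.sub(ensure_single_nofollow, html)


before = (
    '<a href="http://www.theherbsplace.com/a.html" rel="nofollow" target="_blank" rel="nofollow">A</a>'
    '<a href="http://www.theherbsplace.com/">B</a>'
    '<img alt="http://www.theherbsplace.com/onsale" src="x.jpg" />'
)
after = fix_links(before)
print(after)
```

Anchoring on <a keeps the img tag's alt text, which also contains the site URL, out of the substitution.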
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 41840396
Yup, looks like you will have to do it in 2 steps:

1. search for all the lines that contain a link and do not contain rel="nofollow"
^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").)*$

2. Use a regular replace to add the rel="nofollow" tag.
 

Author Comment

by:sharingsunshine
ID: 41845329
sorry for the delay in answering but my script quit working. Consequently, I haven't been able to test your answers.

Here is the error via ipython
(ff2-32) C:\Users\Randal J. Watkins\ff2>ipython
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (In
tel)]
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %run -d expertsbrazil2.py
*** Blank or comment
*** Blank or comment
*** Blank or comment
NOTE: Enter 'c' at the ipdb>  prompt to continue execution.
> c:\users\randal j. watkins\ff2\expertsbrazil2.py(3)<module>()
      1
      2 #from selenium.webdriver.remote.remote_connection import logging
----> 3 from selenium import webdriver
      4 from selenium.webdriver.common.desired_capabilities import DesiredCapabi
lities
      5 from selenium.webdriver.common.proxy import *

ipdb> c
Traceback (most recent call last):
  File "C:\Users\Randal J. Watkins\ff2\expertsbrazil2.py", line 3, in <module>
    from selenium import webdriver
  File "c:\users\randal~1.wat\envs\ff2-32\lib\site-packages\selenium\webdriver\s
upport\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
    at FirefoxDriver.prototype.findElementInternal_ (file:///C:/Users/RANDAL~1.W
AT/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/
driver-component.js:10770)
    at FirefoxDriver.prototype.findElement (file:///C:/Users/RANDAL~1.WAT/AppDat
a/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/driver-co
mponent.js:10779)
    at DelayedCommand.prototype.executeInternal_/h (file:///C:/Users/RANDAL~1.WA
T/AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/c
ommand-processor.js:12661)
    at DelayedCommand.prototype.executeInternal_ (file:///C:/Users/RANDAL~1.WAT/
AppData/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/com
mand-processor.js:12666)
    at DelayedCommand.prototype.execute/< (file:///C:/Users/RANDAL~1.WAT/AppData
/Local/Temp/tmpv54nv171/extensions/fxdriver@googlecode.com/components/command-pr
ocessor.js:12608)


This is my script
#from selenium.webdriver.remote.remote_connection import logging
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import *
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.expected_conditions import element_to_be_clickable
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary





import traceback
import random
import os
import time
import re
import logging
import pyperclip
#import tkinter as Tk



#os.environ["SELENIUM_SERVER_JAR"] = "/Users/rjw/Documents/Python/selenium-server-standalone-3.0.0-beta2.jar"


logger = logging.basicConfig(filename='blogger.log')


browser = None
try:
    browser = webdriver.Firefox()
    binary = FirefoxBinary('C:\\Program Files\\Mozilla Firefox\\firefox')

    driver = webdriver.Firefox(firefox_binary=binary)
  #  driver = webdriver.Safari()
#    driver = webdriver.Chrome(service_log_path="~/Documents/Python/log")
  #  driver = webdriver.Chrome("\\Users\\Randal J. Watkins\\chromedriver_win32\\")
    driver.wait = WebDriverWait(driver, 10)

    driver.get('https://www.blogger.com/about/')   # navigate to your blog
    time.sleep(5)

    SIGN_IN = driver.find_element(By.LINK_TEXT, "SIGN IN")
    SIGN_IN.click()

    time.sleep(15)

    inputElement = driver.find_element(By.NAME, "Email")
    inputElement.send_keys("name@gmail.com")
    driver.find_element(By.NAME, "signIn").click()
    time.sleep(12)
    #if driver == webdriver.Chrome():
    inputElement = driver.find_element(By.NAME, "Passwd")
    inputElement = driver.find_element(By.ID, "Passwd")
    inputElement.send_keys("'password")
    driver.find_element(By.ID, "signIn").click()
    time.sleep(5)
#    alert = driver.switch_to.alert

 #   alert.accept()
    silver = driver.find_element(By.LINK_TEXT, "Silver Sol - Silver Shield by Nature\x27s Sunshine - Immune Support and Fighter")
    silver.click()
    time.sleep(9)
    posts = driver.find_element(By.LINK_TEXT, "Posts")
    posts.click()

    element = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "OMGM5KC-e-i")))
    element.click()
    time.sleep(9)
    
    button = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "button.blogg-button.blogg-collapse-right")))
   
    button.click()

  
    time.sleep(9)
    element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingHtmlBox")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "htmlBoxWrapper")))
    #element7 = driver.wait.until(EC.visibility_of_element_located((By.ID, "postingComposeBox")))
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy

    

 

    html_source = str(pyperclip.paste())
   # html_source = pyperclip.paste()

   

    rex = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rex.sub(r'\1 rel="nofollow"', html_source)

    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(5)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source2 = str(pyperclip.paste())
    
    rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
    result2 = rexDoubledNofollow.sub(r'\1', html_source2)
    pyperclip.copy(result2)
                   
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste

    time.sleep(8)
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy
    html_source3 = str(pyperclip.paste())

    #rexImageNoFollow = re.compile(r'(imageanchor="1" rel="nofollow")')
   # result3 = rexImageNoFollow.sub(r'imageanchor="1"', htmlsource3)
   # pyperclip.copy(result3)

   # element7.send_keys(Keys.CONTROL,'a') #highlight all in box
   # element7.send_keys(Keys.DELETE) #delete old text
   # element7.send_keys(Keys.CONTROL, 'v') #paste

    #time.sleep(9)
    #button = driver.wait.until(EC.visibility_of_element_located((By.XPATH, "//button[contains(.,'Update')]"))
    #button.click()

   # pyperclip.paste() #paste results to page


  

except:
    print(traceback.format_exc())
finally:
    if browser:
        browser.quit()


It never sends control a to select the source code of the page.

Do you see anything which would  stop it from executing the control a?
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41876111
this question was answered by dan ( assuming the urls are never split across multiple lines ) in comment ID: 41837966 and by myself in 41839977

the last post from the author shows he got a timeout while authenticating on the site after modifying a different part of his script... which has no link with the initial problem whatsoever

i believe this question has value in the db and should be kept.
 

Author Comment

by:sharingsunshine
ID: 41879066
I am the primary caregiver for my wife, who has been on hospice and is now having heart surgery. If I can have some more time, I can test these answers. I went a different way due to the error I mentioned. That's why I didn't respond earlier.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41880723
we all hope for the best. take care of what's important.
 

Author Comment

by:sharingsunshine
ID: 41882961
Thanks skullnobrains for your concern and your well wishes.  However, I want to get this question decided because both of you are invested in it.

I set this up as a test page.  The first link has double nofollow's and the last link has none.

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


this is the code I used

 #     
   #    highlights all of text and then copies it into pyperclip
 
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.CONTROL,'c') #copy old text
    html_source = str(pyperclip.paste())

    rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
    
        #rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
    result = rexURL.sub(r'\1 rel="nofollow"', html_source)

        #rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
        #result = rexDoubledNofollow.sub(r'\1', result)

        #rexImage = re.compile(r'(rel="nofollow"\s*)(imageanchor="1"\s*)(rel="nofollow"\s*)')
        #result = rexImage.sub(r'\1\2', result)

    
        
    pyperclip.copy(result)
    
    element7.send_keys(Keys.CONTROL,'a') #highlight all in box
    element7.send_keys(Keys.DELETE) #delete old text
    element7.send_keys(Keys.CONTROL, 'v') #paste


this is the result

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"


I suspect my compile is incorrect in the placement of parentheses so if you could  give me the complete statement then I can test it more accurately.
 

Author Comment

by:sharingsunshine
ID: 41882970
skullnobrains if you can show me where and how to insert this code in relation to the code I am using then I can test it also

https://gyazo.com/f8f82e8d41c7290d4e12197dec7276ac
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41883444
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')

sorry i misused "replace" in my previous post

--

btw the lookahead does not work because the double quotes precede it and the ereg might accidentally grab more than expected

i'd use a simpler one

re.compile(r'("http://www\.theherbsplace\.com/[^"]*")')
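
To illustrate the over-matching concern, this sketch compares the lazy .*? form used earlier in the thread with the [^"]* form. The sample HTML is made up for the demonstration.

```python
import re

sample = '<a href="http://www.theherbsplace.com/"><b>Home</b></a> <img alt="logo" src="x.jpg">'

# Lazy dot: keeps expanding until it finds a quote that is followed by whitespace.
lazy = re.compile(r'(a href="http://www\.theherbsplace\.com/.*?"\s+)')
# Negated class: cannot cross the quote that closes the href value.
bounded = re.compile(r'(a href="http://www\.theherbsplace\.com/[^"]*")')

lazy_capture = lazy.search(sample).group(1)
bounded_capture = bounded.search(sample).group(1)

print(lazy_capture)     # runs on into the following <img> tag
print(bounded_capture)  # stops at the end of the href value
```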

or try this lookahead without the replace

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^"]*)")')

 

Author Comment

by:sharingsunshine
ID: 41883967
using this code
rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')
result = rexURL.sub(r'\1 rel="nofollow"', html_source).replace(r'rel="nofollow" rel="nofollow"', 'rel="nofollow" ')

this is what I get

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow" target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
 rel="nofollow"
 rel="nofollow"
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


Using the simpler one I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"   target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"  imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


It added the rel="nofollow" to the last link but didn't handle the two rel="nofollow" in the first link. Also, it gives an error message NoneType.

using your lookahead with the replace I get this

<a href="http://www.theherbsplace.com/Silver_Shield_Aqua_Sol_Technology_Colloidal_Silver_18_ppm_p_730.html" rel="nofollow"    target="_blank" rel="nofollow"><b>Buy Silver Shield at wholesale prices every day</b></a>!
<br />
<br />
<div style="font-weight: bold; margin-bottom: 12px; text-align: left;">
<a href="http://www.theherbsplace.com/" rel="nofollow"   imageanchor="1" rel="nofollow" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XsXBsKXDZ34/UNscoQTM32I/AAAAAAAAHIY/nPqDNGVlU2M/s1600/test_half_THP+Gingerbread+Logo.jpg" /><b></b></a>Sponsored by&nbsp;<a href="http://www.theherbsplace.com/" rel="nofollow"><b>The Herbs Place</b></a> - Wholesale Prices Always</div>


which yields the same result as your simpler one and the same error NoneType
 
LVL 27

Accepted Solution

by:
skullnobrains earned 500 total points
ID: 41884494
the first is nok because the ereg is wrong and adds the nofollow stuff at the end of the line rather than after the url
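
A small sketch of that behaviour, with a sample shortened from the test page (the p_730.html URL is abbreviated). The outer parentheses make group 1 the whole matched line, so \1 followed by the new attribute appends it after the last character of the line.

```python
import re

rexURL = re.compile(r'(^.*(a href="http://www\.theherbsplace\.com/.*?"\s+)(?:(?!rel="nofollow").*))')

sample = (
    '<a href="http://www.theherbsplace.com/p_730.html" rel="nofollow" '
    'target="_blank" rel="nofollow"><b>Buy</b></a>!\n<br />'
)
result = rexURL.sub(r'\1 rel="nofollow"', sample)
print(result)
```

Since ^ is used without re.MULTILINE and . does not match a newline, only a link on the first line of the text can be matched.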

--

the "simple" solution will work with little modifications ( i did not handle attributes other than nofollow after href )
but remember it needs the nofollow stuff to always appear in the same place in the existing source code.
here i assume it to be the last attribute of the link

re.compile(r'("http://www\.theherbsplace\.com/[^>]*)')


if the nofollow stuff appear in various places depending on the links it is probably simpler to use an/the if...else construct

--

likewise the corrected lookahead would be

re.compile(r'("http://www\.theherbsplace\.com/(?:(?!rel="nofollow")[^>]*))')
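
In the pattern above the lookahead is evaluated only once, at the position right after .com/, so it does not see a rel="nofollow" that appears later in the tag. One alternative, offered as a sketch rather than as the accepted fix, is to let the lookahead scan the remainder of the tag. The names below are hypothetical, and the pattern assumes rel="nofollow" (if present) comes after the href and that tags contain no > inside attribute values.

```python
import re

UNTAGGED_LINK = re.compile(
    r'(<a\s[^>]*href="http://www\.theherbsplace\.com/[^"]*")'  # group 1: tag up to the href value
    r'(?![^>]*\srel="nofollow")'                               # no rel="nofollow" later in this tag
)


def add_missing_nofollow(html):
    """Insert rel="nofollow" after the href of links that do not carry it yet."""
    return UNTAGGED_LINK.sub(r'\1 rel="nofollow"', html)


page = (
    '<a href="http://www.theherbsplace.com/" imageanchor="1" rel="nofollow" style="x"><img src="logo.jpg" /></a>'
    'Sponsored by <a href="http://www.theherbsplace.com/"><b>The Herbs Place</b></a>'
)
updated = add_missing_nofollow(page)
print(updated)
```

This leaves links that already carry the attribute untouched, but it does not clean up tags that already contain two copies; those still need a separate dedupe step.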

 

Author Closing Comment

by:sharingsunshine
ID: 41889015
thanks for the help.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41889883
hope you did get it straight.
feel free to ask in this thread if needed.

best hopes of recovery to your wife
 

Author Comment

by:sharingsunshine
ID: 41889919
You are so kind to remember her but actually she is now in the arms of Jesus.  She passed away Friday.  She is out of pain and jumping for joy and I will see her again.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 41889923
sorry for your loss

see you around the threads


Question has a verified solution.
