sharingsunshine asked:

Using Python To Iterate Through All Website Pages And Paste Changed Content Back

I had this question after viewing What regex will remove duplicate rel="nofolow" tags?.

What came out of the previous questions works great for an individual page, as long as I manually paste the updated page back into the site. The site is on Blogger, and the API only allows a very limited number of pages to be changed per day. Doing this manually, I open the page, copy its source, update the links, paste the source back, and save the page. Is there a way to do this for all the pages in the site, pasting the changed content back into the blog as I just described?

By the way, I tried the Adam Lewis find and replace tool (http://www.adamwlewis.com/articles/blogger-find-replace), but since it uses the API it only handles a limited number of pages per day, and it starts at the top again each time it is run. I have thousands of pages and many more links to change, which is why I am hoping to get this to iterate through the website and make the appropriate changes.

Thanks,
Walter Ritzel

You can use Python in two ways here:
1) In the same way as the Blogger Find and Replace tool, but with some kind of control over which page to restart from;
2) Create a Python script that uses Selenium WebDriver to do what you want.
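
Option 1 hinges on remembering where the previous run stopped. A minimal sketch of that bookkeeping follows, assuming the posts can be listed in a stable order. The names `load_next_index`, `process_batch`, `handle_post` and the checkpoint file are hypothetical; none of them come from the Blogger API or from the Find and Replace tool.

```python
import json
import os


def load_next_index(checkpoint_path):
    """Return the index of the first post that has not been processed yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["next_index"]


def process_batch(post_ids, handle_post, checkpoint_path, batch_size):
    """Process at most `batch_size` posts, resuming after the last run.

    `handle_post` is a placeholder for whatever fetches, updates and
    saves a single post. The checkpoint is rewritten after every post,
    so an interrupted run resumes at the right place.
    """
    start = load_next_index(checkpoint_path)
    end = min(start + batch_size, len(post_ids))
    for index in range(start, end):
        handle_post(post_ids[index])
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
    return end
```

Setting `batch_size` to the daily limit of the API would let each run pick up where the previous one ended instead of starting at the top again.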

ASKER

I appreciate the idea to pursue Selenium WebDriver, but being new to Python and not seeing anything specific to my issue in the Selenium docs, I am still at a standstill.

Can you provide more specifics, some snippets of code, or a link to something similar?

Thanks,
With Selenium WebDriver, you control a web browser through Python code. This means you can write code that interacts with the blog on Blogger, does all the replacements, and puts the changed text back, without the limitation of the API and still automated.
Here is a small example:
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
# from pyvirtualdisplay import Display
import traceback
import random


def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2):
            continue
        line = aline
    return line

browser = None
try:
    #    display = Display(visible=0, size=(800, 600))
    #    display.start()

    proxy = None
    with open('../data/proxies.txt', 'r') as f:
        myProxy = '177.130.59.66:3128' #random_line(f).replace('\n','').replace('http://','').replace('https://','')
        proxy = Proxy({
            'proxyType': ProxyType.MANUAL,
            'httpProxy': myProxy,
            'ftpProxy': myProxy,
            'sslProxy': myProxy,
            'noProxy': ''})

    browser = webdriver.Firefox(proxy=proxy)
    browser.get('https://www.yell.com/connectscan')

    print(browser.page_source)
    elem = browser.find_element_by_name('company.name')  # Company name field
    elem.send_keys('Reconditioned Ranges Ltd')
    elem = browser.find_element_by_name('company.phoneNumber')  # Phone number field
    elem.send_keys('01209214774')
    elem = browser.find_element_by_name('company.email')  # Email field
    elem.send_keys('aaaaaa@gmail.com')
    # find_element_by_class_name rejects compound class names,
    # so both classes are matched with a CSS selector instead.
    elem = browser.find_element_by_css_selector('.js-show-manual-address.utils-btnLink')
    elem.click()
    elem = browser.find_element_by_name('company.address.buildingNumber')  # Building field
    elem.send_keys('Aga House')
    elem = browser.find_element_by_name('company.address.streetAddress')  # Street field
    elem.send_keys('Scorrier Road')
    elem = browser.find_element_by_name('company.address.locality')  # Locality field
    elem.send_keys('')
    elem = browser.find_element_by_name('company.address.town')  # Town field
    elem.send_keys('Redruth')
    elem = browser.find_element_by_name('company.address.county')  # County field
    elem.send_keys('Cornwall')
    elem = browser.find_element_by_name('company.address.postcode')  # Postcode field
    elem.send_keys('TR16 5AA')

    elem.submit()  # Submit the form that contains the last field

    print('no_errors')
except Exception:
    print(traceback.format_exc())
finally:
    if browser:
        browser.quit()
    # display.stop()  # enable together with the Display lines above
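
The repeated find-then-type pairs in the example could also be driven from a list. This is only a sketch: `fill_form` and `COMPANY_FIELDS` are hypothetical names, and it uses the generic `find_element(by, value)` call with the locator string `"name"` (the value behind `By.NAME`) rather than the `find_element_by_name` shortcut used above.

```python
def fill_form(browser, fields):
    """Type each text into the input whose name attribute matches.

    `browser` is a Selenium WebDriver; `fields` is a list of
    (name, text) pairs, processed in order. Returns the last element
    found (or None for an empty list) so the caller can submit the form.
    """
    element = None
    for name, text in fields:
        # "name" is the locator strategy string behind By.NAME.
        element = browser.find_element("name", name)
        element.send_keys(text)
    return element


# Field names and values taken from the example above.
COMPANY_FIELDS = [
    ("company.name", "Reconditioned Ranges Ltd"),
    ("company.phoneNumber", "01209214774"),
    ("company.email", "aaaaaa@gmail.com"),
]
```

With a live browser, `fill_form(browser, COMPANY_FIELDS).submit()` would type the three values and submit the surrounding form.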


Based on the Python code I have now:

import urllib2
import re

website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

with open('original_document3.html', 'w') as f:
    f.write(html)

rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
result = rexURL.sub(r'\1 rel="nofollow"', html)

rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')
result = rexDoubledNofollow.sub(r'\1', result)

with open('new_document3.html', 'w') as f:
    f.write(result)



I just need to open a webpage and paste this in as the source code instead of writing it to new_document3.html.

I tried changing the open statement to take an http URL, but I got an error saying no such file or directory:
Traceback (most recent call last):
  File "/Users/rjw/Documents/Python/expertsPepr2.py", line 15, in <module>
    with open('http://www.theherbsplacenews.com/2015/06/save-up-to-18-on-lbs-ii-aloe-vera-and.html', 'a') as f:
IOError: [Errno 2] No such file or directory: 'http://www.theherbsplacenews.com/2015/06/save-up-to-18-on-lbs-ii-aloe-vera-and.html'


I have seen several examples similar to the one you gave me, and I can tell it took some time to put together. But I am so close: as I indicated above, I just need to transfer the value back, as source code, to the same webpage I started with.

So can you tell me how to make that connection?
The only way to use your code is to add the API call to send the page back. Otherwise, anyone could hack any website.
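
The IOError above comes from the fact that the built-in open() only understands paths on the local disk. A small sketch of the distinction, using Python 3's urllib.request (the successor of urllib2): `read_source` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
from urllib.request import urlopen


def read_source(location, encoding="utf-8"):
    """Return the text at `location`, which is a URL or a local file path."""
    if location.startswith(("http://", "https://")):
        # A URL has to be fetched over HTTP; this is read-only.
        with urlopen(location) as response:
            return response.read().decode(encoding)
    # Anything else is treated as a path on the local disk.
    with open(location, encoding=encoding) as f:
        return f.read()
```

Neither branch can write to the site. Sending the changed source back still needs either the API or a logged-in browser session.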
I am trying to run your code and I am getting this error:

Traceback (most recent call last):
  File "/Users/rjw/Documents/Python/expertsBrazilwebdriver.py", line 31, in <module>
    browser = webdriver.Firefox(proxy=proxy)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/firefox/webdriver.py", line 80, in __init__
    self.binary, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/firefox/extension_connection.py", line 52, in __init__
    self.binary.launch_browser(self.profile, timeout=timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 68, in launch_browser
    self._wait_until_connectable(timeout=timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 108, in _wait_until_connectable
    % (self.profile.path))
selenium.common.exceptions.WebDriverException: Message: Can't load the profile. Profile Dir: /var/folders/vg/lzbgw_fx4k90zdjn3zy95qt80000gp/T/tmpf63fqcrk If you specified a log_file in the FirefoxBinary constructor, check it for details.


Traceback (most recent call last):
  File "/Users/rjw/Documents/Python/expertsBrazilwebdriver.py", line 64, in <module>
    display.stop()
NameError: name 'display' is not defined



This is the code:
from selenium import webdriver
from selenium.webdriver.common.proxy import *
# from pyvirtualdisplay import Display
import traceback
import random


def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2):
            continue
        line = aline
    return line

browser = None
try:
    #    display = Display(visible=0, size=(800, 600))
    #    display.start()

    proxy = None
    with open('new_document3wd.html', 'r') as f:
        myProxy = '177.130.59.66:3128' #random_line(f).replace('\n','').replace('http://','').replace('https://','')
        proxy = Proxy({
            'proxyType': ProxyType.MANUAL,
            'httpProxy': myProxy,
            'ftpProxy': myProxy,
            'sslProxy': myProxy,
            'noProxy': ''})

    browser = webdriver.Firefox(proxy=proxy)
    browser.get('https://www.yell.com/connectscan')

    print(browser.page_source)
    elem = browser.find_element_by_name('company.name')  # Find the search box
    elem.send_keys('Reconditioned Ranges Ltd')
    elem = browser.find_element_by_name('company.phoneNumber')  # Find the search box
    elem.send_keys('01209214774')
    elem = browser.find_element_by_name('company.email')  # Find the search box
    elem.send_keys('aaaaaa@gmail.com')
    elem = browser.find_element_by_class_name("js-show-manual-address utils-btnLink")
    elem.click()
    elem = browser.find_element_by_name('company.address.buildingNumber')  # Find the search box
    elem.send_keys('Aga House')
    elem = browser.find_element_by_name('company.address.streetAddress')  # Find the search box
    elem.send_keys('Scorrier Road')
    elem = browser.find_element_by_name('company.address.locality')  # Find the search box
    elem.send_keys('')
    elem = browser.find_element_by_name('company.address.town')  # Find the search box
    elem.send_keys('Redruth')
    elem = browser.find_element_by_name('company.address.county')  # Find the search box
    elem.send_keys('Cornwall')
    elem = browser.find_element_by_name('company.address.postcode')  # Find the search box
    elem.send_keys('TR16 5AA')

    elem.submit()

    print('no_errors')
except:
    print(traceback.format_exc())
finally:
    if browser:
        browser.quit()
display.stop()



Please advise.
I don't have much to advise, as you just put the code in the middle of yours without thinking about how to use it.
Anyway, the Selenium error is related to a mismatch between the Firefox version and the Selenium WebDriver version. You may need to install the appropriate WebDriver for your Firefox version or change Firefox to the version used by the WebDriver.

The last display.stop() line needs to be commented out, like all the other lines referring to display.
I disagree. The code above your comment is an exact copy of what you provided initially. I have no code of my own in there.

I am sure you know a lot about Selenium and WebDriver, but it seems we aren't communicating. I have asked how I can marry the two, my regex code and the WebDriver, and you have yet to provide an answer.

I have no issue using the API, but there must be some way to bridge the Python regex code to the WebDriver.

We have been going at this since the 11th and we are not any closer to a solution, so I am going to request that the moderators get more experts involved.
Well, in fact, if you don't see a problem with using the API, then Selenium isn't needed and there is no complication for you.
I can show you what to change in your code. Please see the comments below:
import re
# import the Google API client here --- please check its documentation for that.

# -----
# Here you'll initialize the Google API so it can be used below.
# -----

# Compile the patterns once, outside the loops.
rexURL = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
rexDoubledNofollow = re.compile(r'(rel="nofollow"\s*)+')

# list_of_blogs = call the method to retrieve the list of blogs
# for blog in list_of_blogs:
#     pages_blog = get the list of pages of that blog
#     for page in pages_blog:
#         website = get the page content
#         result = rexURL.sub(r'\1 rel="nofollow"', website)
#         result = rexDoubledNofollow.sub(r'\1', result)
#         save the page back.



Now, if you can use this, try to write some code and show us where the problem is; then we can help you more.
I thought you were calling Selenium and WebDriver the API. My mistake for not clarifying.

However, the Google API only allows 50 blog posts a day to be changed. I have over 1,700 posts in one blog alone, and I have 4 blogs to change, each with as many posts or more.

This is why I wanted to use webdriver and selenium after you mentioned it.

I will be logged into the blogger dashboard and I have every right to copy and paste content into each post.  I just wanted to automate it (even a portion of it) rather than having to do it manually which will take a long time.
ASKER CERTIFIED SOLUTION
Walter Ritzel
(The text of this answer is not included here.)
I don't have the correct version of Python for Selenium 3.0, which is the version I need. So I will post another question to find out how to downgrade my Python 3.5 to 3.3. Then I will be back to work on this.
Please let me know your Firefox version.
FireFox version 48.
OK. So, let's get rid of Firefox in this code and use the more obvious browser for Mac.

Replace Firefox with Safari on this line:

browser = webdriver.Safari()


Thanks for sticking with me on this

Traceback (most recent call last):
  File "/Users/rjw/.pyenv/versions/test_env/lib/python3.3/site-packages/selenium/webdriver/safari/webdriver.py", line 50, in __init__
    executable_path = os.environ["SELENIUM_SERVER_JAR"]
  File "/Users/rjw/.pyenv/versions/3.3.6/lib/python3.3/os.py", line 656, in __getitem__
    raise KeyError(key) from None
KeyError: 'SELENIUM_SERVER_JAR'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "expertsBrazil2webdriver.py", line 9, in <module>
    browser = webdriver.Safari()
  File "/Users/rjw/.pyenv/versions/test_env/lib/python3.3/site-packages/selenium/webdriver/safari/webdriver.py", line 53, in __init__
    'SELENIUM_SERVER_JAR'")
Exception: No executable path given, please add one to Environment Variable                 'SELENIUM_SERVER_JAR'

(test_env) rjw python -V
Python 3.3.6


OK, so you need to download the jar driver. Go to
http://docs.seleniumhq.org/download/ and find the line that says:

Download version 3.0.0-beta2

Click on the link and download the file to your computer.

As a last step, add the following lines to your script, after the last import line:
import os

os.environ["SELENIUM_SERVER_JAR"] = "<path to your download jar with the file name>"


I am getting this error:

Error: Unable to access jarfile <'/Users/rjw/Downloads/selenium-server-standalone-3.0.0-beta2.jar'>
Traceback (most recent call last):
  File "expertsBrazil2webdriver.py", line 12, in <module>
    browser = webdriver.Safari()
  File "/Users/rjw/.pyenv/versions/test_env/lib/python3.3/site-packages/selenium/webdriver/safari/webdriver.py", line 55, in __init__
    self.service.start()
  File "/Users/rjw/.pyenv/versions/test_env/lib/python3.3/site-packages/selenium/webdriver/safari/service.py", line 69, in start
    raise WebDriverException("Can not connect to the SafariDriver")
selenium.common.exceptions.WebDriverException: Message: Can not connect to the SafariDriver



I see this but I am not clear which option to pick

https://gyazo.com/0a02f733c36e1b780686c7ae1baa6dd7
Since this is turning out to be so difficult and time consuming for both of us, is there a way (after I have manually opened the URL in Blogger and clicked on a post) to have Python copy the source code, run it through the regex routine I have, paste it back, press the Update button, and iterate to the next post?

I have 177,000 links that need to be changed so any automation would be helpful.
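
The copy, regex, paste, update cycle described above might look roughly like the sketch below once a post is open in a WebDriver-controlled browser. It rests on assumptions that are not confirmed anywhere in this thread: that the editor is in HTML view, that the post source sits in a plain textarea, and that both CSS selectors (passed in by the caller) have been found by inspecting the page. `add_nofollow` and `update_open_post` are hypothetical names; the two patterns are the ones from the regex code earlier in the thread.

```python
import re

URL_RE = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
DOUBLED_RE = re.compile(r'(rel="nofollow"\s*)+')


def add_nofollow(html):
    """Add rel="nofollow" after matching links, then collapse duplicates."""
    result = URL_RE.sub(r'\1 rel="nofollow"', html)
    return DOUBLED_RE.sub(r'\1', result)


def update_open_post(browser, textarea_css, update_button_css):
    """Rewrite the source of the post that is currently open.

    Returns True if the post was changed and the update button clicked,
    False if the source already had the desired form.
    """
    # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
    box = browser.find_element("css selector", textarea_css)
    original = box.get_attribute("value")
    updated = add_nofollow(original)
    if updated == original:
        return False
    box.clear()
    box.send_keys(updated)
    browser.find_element("css selector", update_button_css).click()
    return True
```

Skipping posts that need no change keeps the run shorter and avoids touching posts unnecessarily. Moving on to the next post would be a separate step built from the same find and click calls.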
The error now seems to be with permissions on the folder. Can you copy the jar file to your script folder, adjust the environment variable, and try again?
Also, please check that the jar is accessible by anyone. If OS X is similar to Linux, you can type this command in the terminal:
chmod 777 <jar file name>


it opened -  Hurray!

(test_env) rjw python expertsBrazil2webdriver.py
12:25:09.171 INFO - Selenium build info: version: '3.0.0-beta2', revision: '2aa21c1'
12:25:09.173 INFO - Launching a standalone Selenium Server
2016-08-18 12:25:09.335:INFO::main: Logging initialized @5823ms
12:25:09.692 INFO - Driver provider org.openqa.selenium.ie.InternetExplorerDriver registration is skipped:
registration capabilities Capabilities [{ensureCleanSession=true, browserName=internet explorer, version=, platform=WINDOWS}] does not match the current platform MAC
12:25:09.694 INFO - Driver provider org.openqa.selenium.edge.EdgeDriver registration is skipped:
registration capabilities Capabilities [{browserName=MicrosoftEdge, version=, platform=WINDOWS}] does not match the current platform MAC
12:25:09.695 INFO - Driver class not found: com.opera.core.systems.OperaDriver
12:25:09.695 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
2016-08-18 12:25:11.491:INFO:osjs.Server:main: jetty-9.2.15.v20160210
2016-08-18 12:25:12.028:INFO:osjsh.ContextHandler:main: Started o.s.j.s.ServletContextHandler@3a82f6ef{/,null,AVAILABLE}
2016-08-18 12:25:12.899:INFO:osjs.ServerConnector:main: Started ServerConnector@4cc0edeb{HTTP/1.1}{0.0.0.0:56842}
2016-08-18 12:25:12.900:INFO:osjs.Server:main: Started @9388ms
12:25:12.901 INFO - Selenium Server is up and running
12:25:15.640 INFO - SessionCleaner initialized with insideBrowserTimeout 0 and clientGoneTimeout 1800000 polling every 180000
12:25:16.583 INFO - Executing: [new session: Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]])
12:25:17.221 INFO - Creating a new session for Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]
12:25:18.637 INFO - Server started on port 46981
12:25:18.684 INFO - Launching Safari
12:25:18.758 INFO - Waiting for SafariDriver to connect
12:25:25.816 INFO - Connection opened
12:25:25.878 INFO - Driver connected in 7119 ms
12:25:26.146 INFO - Done: [new session: Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]]
12:25:26.209 INFO - Executing: [get: http://www.theherbsplacenews.com/])
12:25:35.139 INFO - Done: [get: http://www.theherbsplacenews.com/]
12:25:35.209 INFO - Executing: [delete session: 52897159-dfd1-4df5-a75f-0f0fd3029a31])
12:25:35.211 INFO - Shutting down
12:25:35.211 INFO - Closing connection
12:25:35.219 INFO - Stopping Safari
12:25:35.296 INFO - Stopping server
12:25:35.296 INFO - Stopping server
12:25:35.375 INFO - Shutdown complete
12:25:35.375 INFO - Done: [delete session: 52897159-dfd1-4df5-a75f-0f0fd3029a31]



The problem wasn't what you said; it was my ignorance. When I saw your command
chmod 777 <jar file name>

it occurred to me that you were using the <> to mark a placeholder, not as part of the syntax. I moved the jar to the script folder, removed the <>, and it worked.

Now where do I go from here?
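
A mix-up like the angle brackets can be caught earlier by checking the path before Selenium tries to use it. A minimal sketch, where `set_selenium_server_jar` is a hypothetical helper; only the `SELENIUM_SERVER_JAR` variable name comes from the traceback above.

```python
import os


def set_selenium_server_jar(path):
    """Point SELENIUM_SERVER_JAR at `path`, failing early if it is not a file."""
    if not os.path.isfile(path):
        # A leftover placeholder such as <...> or stray quotes ends up here.
        raise FileNotFoundError("Selenium server jar not found: " + repr(path))
    os.environ["SELENIUM_SERVER_JAR"] = path
    return path
```

Calling it right after the imports, in place of the bare os.environ assignment, turns a confusing "Unable to access jarfile" message into an error that shows the exact string that was passed.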
That's good!
Ok, let's move on.

The next step will depend on how the page is displayed in your Safari.
If the page shows that you are already logged in, your next step will be to identify the link that goes to the list of posts and click on it. The code below does exactly that:
elem = browser.find_element_by_class_name("btn_list_posts")
elem.click()



You'll write a pair of commands like that for each step of your task.

To know which commands to use, please check the documentation at:
http://www.seleniumhq.org/docs/03_webdriver.jsp#selenium-webdriver-api-commands-and-operations
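
When several such find-and-click pairs run one after another, they could be written as a loop. A sketch only: `click_in_order` is a hypothetical helper, and the class names passed to it have to come from inspecting the real page. It uses the generic `find_element(by, value)` call with the locator string `"class name"` (the value behind `By.CLASS_NAME`).

```python
def click_in_order(browser, class_names):
    """Find each element by class name and click it, one step at a time."""
    for class_name in class_names:
        # The "class name" strategy accepts a single class,
        # not a space-separated list of classes.
        element = browser.find_element("class name", class_name)
        element.click()
```

For the single step above this would be `click_in_order(browser, ["btn_list_posts"])`.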
To get to my dashboard I changed the link to
https://www.blogger.com/blogger.g?blogID=2213276582068581739#allposts

It never opened Safari, but it showed the Safari launcher briefly. Does something need to be different for https?

(test_env) rjw python expertsBrazil2webdriver.py
13:35:54.471 INFO - Selenium build info: version: '3.0.0-beta2', revision: '2aa21c1'
13:35:54.473 INFO - Launching a standalone Selenium Server
2016-08-18 13:35:54.517:INFO::main: Logging initialized @611ms
13:35:54.644 INFO - Driver provider org.openqa.selenium.ie.InternetExplorerDriver registration is skipped:
registration capabilities Capabilities [{ensureCleanSession=true, browserName=internet explorer, version=, platform=WINDOWS}] does not match the current platform MAC
13:35:54.644 INFO - Driver provider org.openqa.selenium.edge.EdgeDriver registration is skipped:
registration capabilities Capabilities [{browserName=MicrosoftEdge, version=, platform=WINDOWS}] does not match the current platform MAC
13:35:54.645 INFO - Driver class not found: com.opera.core.systems.OperaDriver
13:35:54.645 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
2016-08-18 13:35:54.756:INFO:osjs.Server:main: jetty-9.2.15.v20160210
2016-08-18 13:35:54.805:INFO:osjsh.ContextHandler:main: Started o.s.j.s.ServletContextHandler@3a82f6ef{/,null,AVAILABLE}
2016-08-18 13:35:54.897:INFO:osjs.ServerConnector:main: Started ServerConnector@35e71ec3{HTTP/1.1}{0.0.0.0:57784}
2016-08-18 13:35:54.898:INFO:osjs.Server:main: Started @992ms
13:35:54.899 INFO - Selenium Server is up and running
13:36:04.049 INFO - SessionCleaner initialized with insideBrowserTimeout 0 and clientGoneTimeout 1800000 polling every 180000
13:36:04.096 INFO - Executing: [new session: Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]])
13:36:04.122 INFO - Creating a new session for Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]
13:36:04.211 INFO - Server started on port 24036
13:36:04.221 INFO - Launching Safari
13:36:04.238 INFO - Waiting for SafariDriver to connect
13:36:07.139 INFO - Connection opened
13:36:07.143 INFO - Driver connected in 2904 ms
13:36:07.260 INFO - Done: [new session: Capabilities [{browserName=safari, javascriptEnabled=true, version=, platform=MAC}]]
13:36:07.278 INFO - Executing: [get: https://www.blogger.com/blogger.g\?blogID=2213276582068581739#allposts])
13:36:07.870 INFO - Done: [get: https://www.blogger.com/blogger.g\?blogID=2213276582068581739#allposts]
13:36:07.901 INFO - Executing: [delete session: a8558bd3-3176-4f5b-994b-8a9b6c805477])
13:36:07.906 INFO - Shutting down
13:36:07.906 INFO - Closing connection
13:36:07.907 INFO - Stopping Safari
13:36:07.974 INFO - Stopping server
13:36:07.975 INFO - Stopping server
13:36:07.985 INFO - Shutdown complete
13:36:07.986 INFO - Done: [delete session: a8558bd3-3176-4f5b-994b-8a9b6c805477]


From the logs, I'm not seeing any errors, so let's try this: first open the http address, then add a small delay (time.sleep(5) waits 5 seconds; you may need to add import time to the import section of the script), and then run the get command for the https address.
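
That suggestion could be sketched as follows. `open_in_two_steps` and `strip_shell_escapes` are hypothetical names. The second helper is there because the log above shows the address as `blogger.g\?blogID=...`; the backslash may have come from pasting a shell-escaped URL into the script, though that is only a guess.

```python
import time


def strip_shell_escapes(url):
    """Remove backslashes that a shell-escaped URL may carry (e.g. '\\?')."""
    return url.replace("\\", "")


def open_in_two_steps(browser, http_url, https_url, delay_seconds=5):
    """Open the plain http page, wait, then open the https page."""
    browser.get(strip_shell_escapes(http_url))
    time.sleep(delay_seconds)
    browser.get(strip_shell_escapes(https_url))
```

A fixed sleep is the simplest form of waiting; it makes the script slower but needs nothing beyond the standard library.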
I logged out and then put in the main blogger page
https://www.blogger.com/about/

and it worked fine. Looking at the docs, is this how to do the login?
https://gyazo.com/e733118cd9a108c3ff99b992821dc1ec - but substituting the word login in place of the word cheese?
Yes. You'll need to inspect the HTML code to discover the names of the objects on the page, the classes you can use to identify each element, and so on. It is boring, but it is an effort that you make only once.
You have been a great help and if I run into any more issues I will post another question.