Solved

Python error - Need Help

Posted on 2016-08-06
Last Modified: 2016-08-09
I am new to Python and I am trying to add rel="nofollow" to all of the links in http://www.theherbsplacenews.com that match the regex pattern below. When I run my script, I get this output, which I take to be an error.

============== RESTART: /Users/rjw/Documents/Python/getLinks.py ==============
<_sre.SRE_Match object at 0x1021a0f30>


this is my code
# http://www.theherbsplacenews.com/

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')

#read html code
html = website.read()

#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)

prog = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
result = prog.match('"http://www.theherbsplace.com/\1" rel="nofollow"')



print result

Question by:sharingsunshine
 
LVL 35

Expert Comment

by:Terry Woods
I'm not a Python coder, but I have experience with regex... try replacing the code:
prog = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
result = prog.match('"http://www.theherbsplace.com/\1" rel="nofollow"')

with the code:
result = re.sub(r'"http://www\.theherbsplace\.com/(.*?)?"', r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)

LVL 35

Expert Comment

by:Terry Woods
What you were getting was not an error but a regex match object; print just showed its default representation. result.group(0) would have given you the matched text as a string.

You shouldn't need to use compile as far as I can tell, though it might be more efficient if you use the same pattern multiple times.

Minor point: the . characters in the pattern should be escaped so the regex engine doesn't treat them as wildcards. Adding the r before the string literal makes sure the backslash reaches the regex engine. The same goes for the \1 in the replacement, as I understand it.
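
To illustrate the point about escaping the dots, here is a minimal sketch. The lookalike string is made up for the demonstration, and the code is written for Python 3 even though the thread uses Python 2.

```python
import re

# Unescaped dots: each "." matches any single character.
loose = re.compile(r'"http://www.theherbsplace.com/')
# Escaped dots: each "\." matches only a literal dot.
strict = re.compile(r'"http://www\.theherbsplace\.com/')

# Hypothetical lookalike URL with an X where the first dot should be.
lookalike = '"http://wwwXtheherbsplace.com/page"'

print(loose.match(lookalike) is not None)   # the loose pattern accepts it
print(strict.match(lookalike) is not None)  # the strict pattern rejects it
```

In practice the loose pattern rarely causes harm on a real page, but the escaped form states the intent exactly.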
LVL 28

Expert Comment

by:pepr
To add, you may try:
#!python2

# http://www.theherbsplacenews.com/

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')

#read html code
html = website.read()

rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')

#use re.findall to get all the links
links = rex.findall(html)

with open('links.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')

As Terry wrote, you should (probably always) use r'raw string literals' for regular expression patterns. Otherwise, you would have to double every backslash that should not be interpreted as the start of a string escape sequence. Regular expressions also use the backslash for special purposes, and the two usages together make it confusing. This is the reason why languages with syntactically built-in regular expressions write patterns differently from normal string literals (say, enclosed in slashes).

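I uncommented your .findall and wrote the links to a file. The .findall is a high-level method. It returns a list of the extracted substrings (here, the text captured by the parentheses, that is, the part of each URL after the domain). In this sense it differs from re.match, which only tries to match at the beginning of the string and returns a single match object (or None).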
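
A small sketch of that difference, using a made-up two-link sample instead of the live page. The pattern is the one from the thread with the redundant trailing ? dropped, and the code targets Python 3.

```python
import re

rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)"')

# Hypothetical sample input standing in for the downloaded HTML.
sample = ('<a href="http://www.theherbsplace.com/a.html">A</a> '
          '<a href="http://www.theherbsplace.com/b.html">B</a>')

links = rex.findall(sample)  # list with the group(1) text of every match
m = rex.match(sample)        # only tries position 0, so None for this sample
first = rex.search(sample)   # scans forward and returns the first match object

print(links)
print(m)
print(first.group(0))
print(first.group(1))
```

Note that findall returns the captured group rather than the whole quoted URL, because the pattern contains exactly one capturing group.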

As for "to compile or not to compile", I am for compilation almost always. The reason is not only that using the precompiled object is faster in loops. A regular expression that is not precompiled has to be compiled at least once anyway. The syntax that uses a precompiled object is simpler. When the object is given a good name, the code is more readable. When the name is not important, or is used only locally, I suggest always using rex as your convention (Regular EXpression compiled object).

Similarly, I suggest using m for the match object, as that pattern is used often.
#!python2

import re
rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')

m = rex.match('"http://www.theherbsplace.com/1" rel="nofollow"')
if m:
    print m.group(0)
    print m.group(1)

I am not sure what you mean by \1 in your example, but you probably know now what you want to do ;) Otherwise, feel free to ask.
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 50 total points
@pepr, the \1 says to insert the first captured group from the pattern into the replacement. A capturing group is the part of the pattern enclosed in unescaped parentheses, which in this case is the .*? part. The backslash that you've removed from the replacement needs to go back in (and it needs to be a double backslash if the string isn't a raw one).
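
As a rough illustration of the backreference, the sketch below runs re.sub on a single made-up link, once with a raw replacement string and once with a doubled backslash. It is written for Python 3; the sample HTML is an assumption.

```python
import re

# Hypothetical one-link sample.
html = '<a href="http://www.theherbsplace.com/tea.html">Tea</a>'
pattern = r'"http://www\.theherbsplace\.com/(.*?)"'

# Raw replacement: \1 reaches the regex engine as backslash + 1.
raw_result = re.sub(pattern, r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)

# Non-raw replacement: the backslash has to be doubled to survive string parsing.
doubled_result = re.sub(pattern, '"http://www.theherbsplace.com/\\1" rel="nofollow"', html)

print(raw_result)
print(raw_result == doubled_result)
```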
LVL 28

Expert Comment

by:pepr
@Terry: Oh, my fault! I did not read the question correctly. ;)
LVL 28

Expert Comment

by:pepr
There is a minor flaw in the solution. If the original already contained the rel="nofollow" attribute, it would be duplicated.

The more robust solution should probably parse the HTML, modify the element and dump the modified data structure to the file. Anyway, the simpler solution may win in the special case.
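
Short of parsing the HTML, one possible regex-only mitigation is a negative lookahead that skips URLs already followed by rel="nofollow". This is only a sketch under assumptions not stated in the thread: the attribute must come directly after the quoted URL, and [^"]* replaces .*? so that backtracking cannot run past the closing quote. The sample HTML is made up, and the code targets Python 3.

```python
import re

# [^"]* keeps the match inside one quoted URL.
# (?!\s*rel="nofollow") rejects URLs that are already followed by the attribute.
rex = re.compile(r'("http://www\.theherbsplace\.com/[^"]*")(?!\s*rel="nofollow")')

# Hypothetical sample: the second link already carries rel="nofollow".
html = ('<a href="http://www.theherbsplace.com/a.html">A</a> '
        '<a href="http://www.theherbsplace.com/b.html" rel="nofollow">B</a>')

result = rex.sub(r'\1 rel="nofollow"', html)

print(result)
print(result.count('rel="nofollow"'))
```

A link whose rel attribute sits before the href, or is separated from it by other attributes, would still be doubled, which is why a real HTML parser remains the more robust route.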

Author Comment

by:sharingsunshine
Hi Pepr & Terry,

I tried the following code from you Pepr.  Admittedly, I may have the code sequence wrong because I was working off of both of your examples and trying to combine them into one.

#!python2

# http://www.theherbsplacenews.com/

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')

#read html code
html = website.read()

rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')

#use re.findall to get all the links
links = rex.findall(html)

with open('links.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')
#rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')

m = rex.match('"http://www.theherbsplace.com///\1" rel="nofollow"')
if m:
    print m.group(0)
    print m.group(1)


and I get this output

https://gyazo.com/cb6e0e9b3d29b8af05ae63344e8e8815 and I am not clear what the A in the box means.

Author Comment

by:sharingsunshine
Terry I tried your code and this is what I used

# http://www.theherbsplacenews.com/

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')

#read html code
html = website.read()

#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)

#prog = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
#result = prog.match('"http://www.theherbsplace.com/\1" rel="nofollow"')

result = re.sub(r'"http://www\.theherbsplace\.com/(.*?)?"', r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)

print result


and the result I got is too big to include
LVL 28

Expert Comment

by:pepr
Firstly, the \1 does not make (practical) sense when used in rex.match. The reason you see the picture of an "A" in a box is that your console is displaying the character with ordinal number 1. The escape sequence \1 is converted to that character because you did not suppress escape sequence interpretation by using an r'raw string' (notice the r in front of the first single quote).
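
The effect can be checked without a console by looking at repr and len. This is a small sketch for Python 3; the short strings are made up for the demonstration.

```python
plain = '/\1"'   # \1 is parsed as an octal escape: one character with ordinal 1
raw = r'/\1"'    # raw literal: the backslash and the digit 1 stay separate

print(len(plain), repr(plain))
print(len(raw), repr(raw))
print(ord(plain[1]))  # ordinal of the character that the console drew as a boxed "A"
```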

For your later code, replace the last print result by
with open('new_document.html', 'w') as f:
    f.write(result)

and then look inside the generated file. Similarly, you can save the website content to the original_document.html, and then you can use the tool of your choice for comparing (diff) the two files (see http://alternativeto.net/software/winmerge/?platform=mac).
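
If no diff tool is at hand, the standard library's difflib can do a rough comparison. The sketch below works on two made-up in-memory strings rather than the saved files; only the file names used as labels come from the thread. It is written for Python 3.

```python
import difflib

# Hypothetical stand-ins for the contents of the two saved files.
original = ('<a href="http://www.theherbsplace.com/tea.html">Tea</a>\n'
            '<p>No link here</p>\n')
modified = ('<a href="http://www.theherbsplace.com/tea.html" rel="nofollow">Tea</a>\n'
            '<p>No link here</p>\n')

diff = list(difflib.unified_diff(
    original.splitlines(True),   # keep line endings so the output joins cleanly
    modified.splitlines(True),
    fromfile='original_document.html',
    tofile='new_document.html',
))

print(''.join(diff))
```

Lines starting with - come from the original and lines starting with + from the modified text, so each changed link shows up as a -/+ pair.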

Author Comment

by:sharingsunshine
Pepr,

It's obvious you know your Python. However, I can't really understand what you are suggesting I do. So please put the entire code together, because my piecing your suggestions together isn't working, as you can see.

# http://www.theherbsplacenews.com/

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')

#read html code
html = website.read()

#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)

pattern = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
#result = prog.match(r'"http://www.theherbsplace.com/\1" rel="nofollow"')

#result = re.sub(r'"http://www\.theherbsplace\.com/(.*?)?"', r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)

result = pattern.sub(r"http://www.theherbsplace.com/\1" rel="nofollow", 
#print result

with open('new_document.html', 'w') as f:
    f.write(result)


here is what I get as a result
============== RESTART: /Users/rjw/Documents/Python/getLinks.py ==============
>>> 


I know this is my ignorance so please guide someone that is new to python.

Thanks,
LVL 28

Accepted Solution

by:
pepr earned 450 total points
Here is the working example:
import urllib2
import re

website = urllib2.urlopen('http://www.theherbsplacenews.com/')
html = website.read()   # the content of the page

with open('original_document.html', 'w') as f:
    f.write(html)

rex = re.compile(r'("http://www\.theherbsplace\.com/.*?")')
                   # notice the new placement of the left parenthesis
result = rex.sub(r'\1 rel="nofollow"', html)
                   # double quotes are just chars -- the literal wrapped in single quotes

with open('new_document.html', 'w') as f:
    f.write(result)

The open('new_document.html', 'w') call returns a file object opened for writing. The with xxx as f: construct names the file object f and wraps the block below so that the file is closed automatically when the block ends.

I have renamed the pattern variable to rex because the name pattern suits the string literal, not the compiled object. Notice that I have placed the parentheses differently so that they wrap the complete URL string (including its double quotes). This way, the first part of the URL need not be retyped in the rex.sub, which makes it less error prone. The \1 will include everything, including the double quotes. (That is because they are inside the parentheses of the compiled pattern.)
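
To see that the two parenthesis placements give the same output, here is a brief sketch on a single made-up link (Python 3; the sample HTML is an assumption).

```python
import re

# Hypothetical one-link sample.
html = '<a href="http://www.theherbsplace.com/tea.html">Tea</a>'

# Group around the path only: the URL prefix has to be retyped in the replacement.
path_only = re.sub(r'"http://www\.theherbsplace\.com/(.*?)"',
                   r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)

# Group around the whole quoted URL: \1 already carries the prefix and the quotes.
whole_url = re.sub(r'("http://www\.theherbsplace\.com/.*?")',
                   r'\1 rel="nofollow"', html)

print(whole_url)
print(path_only == whole_url)
```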

After running the code, look inside the working directory. You should see there the files original_document.html and new_document.html. These are the files that you can pass to the diff tool to see what has changed. You will also see that some of the rel="nofollow" are doubled in the new document -- that is, the rel="nofollow" was already there.

Author Comment

by:sharingsunshine
Pepr, this is great. I will post another question on how to address the doubled rel="nofollow". For now, I just ran a duplicate-finder regex in TextWrangler to remove them.

Terry, thanks for the clarification  to Pepr on the \1