sharingsunshine
asked on
Python error - Need Help
I am new to python and I am trying to add rel="nofollow" to all of the links in http://www.theherbsplacenews.com that match the regex pattern below. When I do that I am getting this error.
============== RESTART: /Users/rjw/Documents/Python/getLinks.py ==============
<_sre.SRE_Match object at 0x1021a0f30>
this is my code
# http://www.theherbsplacenews.com/
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
#read html code
html = website.read()
#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)
prog = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
result = prog.match('"http://www.theherbsplace.com/\1" rel="nofollow"')
print result
What you were getting was not an error, but a match object, which is what a successful match returns. Looking at result.group(0) would have given you the matched text as a string.
You shouldn't need to use compile as far as I can tell, though it might be more efficient if you use the same pattern multiple times.
Minor point: The . characters in the pattern should be escaped for the regex engine so they aren't treated as wildcards. Adding the r before the string literal makes sure the backslash reaches the regex engine. The same goes for the \1 in the replacement, as I understand it.
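To illustrate the points above, here is a minimal sketch. The sample string and the variable names are made up for the example; it is written so that it should run under Python 2 and Python 3 alike.

```python
import re

# Raw string: the backslashes reach the regex engine untouched.
pattern = r'"http://www\.theherbsplace\.com/(.*?)"'

# Hypothetical sample text; re.match only matches at the very start.
sample = '"http://www.theherbsplace.com/page.html" class="x"'

m = re.match(pattern, sample)      # a match object, or None
matched_text = m.group(0)          # the whole matched text
captured_path = m.group(1)         # what the parentheses captured
print(matched_text)
print(captured_path)

# An unescaped dot is a wildcard, so it also accepts other characters.
loose = re.match(r'www.theherbsplace', 'wwwXtheherbsplace')
strict = re.match(r'www\.theherbsplace', 'wwwXtheherbsplace')
print(loose is not None)
print(strict is None)
```

Printing m itself shows the object representation, which is what appeared in the output of the question.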
To add, you may try:
#!python2
# http://www.theherbsplacenews.com/
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
#read html code
html = website.read()
rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')
#use re.findall to get all the links
links = rex.findall(html)
with open('links.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')
As Terry wrote, you should (probably always) use r'raw string literals' for regular expression patterns. Otherwise, you would have to double every backslash that should not be interpreted as the start of a string escape sequence. Regular expressions also use the backslash for special purposes, and the two usages together make it confusing. This is the reason why languages with syntactically built-in regular expressions write patterns differently from normal string literals (say, enclosed in slashes).
I uncommented your .findall and wrote the links to a file. .findall is a high-level method: it returns a list of the extracted substrings. In this sense it differs from re.match in how it should be used.
On "to compile or not to compile", I am for compilation almost always. The reason is not only that a precompiled object is faster in loops. A regular expression that is not precompiled has to be compiled at least once anyway. The syntax that uses a precompiled object is simpler. When the object is given a good name, the code is more readable. When the name is not important, or is used only locally, I suggest always using rex as your convention (Regular EXpression compiled object).
Similarly, I suggest using m for the match object, as that pattern is used often.
#!python2
import re
rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')
m = rex.match('"http://www.theherbsplace.com/1" rel="nofollow"')
if m:
    print m.group(0)
    print m.group(1)
I am not sure what you mean by \1 in your example, but you probably know now what you want to do ;) Otherwise, feel free to ask.
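The difference between findall and match can be seen on a short inline sample, without downloading anything. This is only a sketch: the html string is invented, and rex.search is not used elsewhere in this thread but is the standard method for finding the first match anywhere in the text.

```python
import re

# Hypothetical two-link sample standing in for the downloaded page.
html = ('<a href="http://www.theherbsplace.com/a.html">A</a>'
        '<a href="http://www.theherbsplace.com/b.html">B</a>')

rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)"')

links = rex.findall(html)    # list with the captured group of every match
m_start = rex.match(html)    # only tries at the start of the string
m_any = rex.search(html)     # first match anywhere in the string

print(links)
print(m_start)
print(m_any.group(1))
```

Because the page starts with a tag and not with the quoted URL, match finds nothing here, while findall and search do.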
SOLUTION
@Terry: Oh, my fault! I did not read the question correctly. ;)
There is a minor flaw in the solution. If the original already contained the rel="nofollow" attribute, it would be duplicated.
The more robust solution should probably parse the HTML, modify the element and dump the modified data structure to the file. Anyway, the simpler solution may win in the special case.
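Staying with the simpler regex approach, one way to avoid the duplication is a negative lookahead that skips links already followed by rel="nofollow". This is a sketch under assumptions: it only notices the attribute when it comes directly after the URL's closing quote, the sample html is invented, and it is not a substitute for really parsing the HTML.

```python
import re

# Hypothetical sample: the second link already carries rel="nofollow".
html = ('<a href="http://www.theherbsplace.com/a.html">A</a> '
        '<a href="http://www.theherbsplace.com/b.html" rel="nofollow">B</a>')

# (?!...) refuses the match when rel="nofollow" follows the closing quote.
rex = re.compile(
    r'"http://www\.theherbsplace\.com/([^"]*)"(?!\s*rel="nofollow")')
replacement = r'"http://www.theherbsplace.com/\1" rel="nofollow"'

once = rex.sub(replacement, html)
twice = rex.sub(replacement, once)   # a second run changes nothing

print(once)
print(once == twice)
```

The [^"]* in place of .*? keeps the captured part from running past the closing quote, so the lookahead is always checked right after the URL.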
ASKER
Hi Pepr & Terry,
I tried the following code from you Pepr. Admittedly, I may have the code sequence wrong because I was working off of both of your examples and trying to combine them into one.
#!python2
# http://www.theherbsplacenews.com/
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
#read html code
html = website.read()
rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')
#use re.findall to get all the links
links = rex.findall(html)
with open('links.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')
#rex = re.compile(r'"http://www\.theherbsplace\.com/(.*?)?"')
m = rex.match('"http://www.theherbsplace.com///\1" rel="nofollow"')
if m:
    print m.group(0)
    print m.group(1)
and I get this output:
https://gyazo.com/cb6e0e9b3d29b8af05ae63344e8e8815
I am not clear on what the A in the box is saying.
ASKER
Terry I tried your code and this is what I used
# http://www.theherbsplacenews.com/
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
#read html code
html = website.read()
#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)
#prog = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
#result = prog.match('"http://www.theherbsplace.com/\1" rel="nofollow"')
result = re.sub(r'"http://www\.theherbsplace\.com/(.*?)?"', r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)
print result
and the result I got is too big to include
Firstly, the \1 does not make (practical) sense when used in rex.match. The reason you see the picture of an "A" in a box is that your console displays the character with ordinal number 1. The escape sequence \1 is converted to that character because you did not suppress escape-sequence interpretation by using an r'raw string' (notice the r in front of the first single quote).
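A small sketch of that difference, using a made-up one-letter pattern so the effect is easy to see:

```python
import re

plain = '\1'     # escape sequence: a single character with ordinal 1
raw = r'\1'      # raw string: a backslash followed by the digit 1

print(repr(plain))
print(repr(raw))

# In a replacement, only the raw form is a back-reference to group 1.
with_raw = re.sub(r'(b)', r'[\1]', 'abc')
with_plain = re.sub(r'(b)', '[\1]', 'abc')

print(repr(with_raw))
print(repr(with_plain))
```

The non-raw version inserts the control character into the text, which is what the console drew as a boxed symbol.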
For your later code, replace the last print result by
with open('new_document.html', 'w') as f:
    f.write(result)
and then look inside the generated file. Similarly, you can save the website content to the original_document.html, and then you can use the tool of your choice for comparing (diff) the two files (see http://alternativeto.net/software/winmerge/?platform=mac).
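If you prefer to stay inside Python for the comparison, the standard difflib module can produce the diff. The following is only a sketch: the original string is a tiny invented stand-in for the downloaded page, the two file names are used purely as labels in the diff header, and nothing is read from or written to disk.

```python
import difflib
import re

# Hypothetical stand-in for the html read from the website.
original = ('<a href="http://www.theherbsplace.com/a.html">A</a>\n'
            '<p>unchanged text</p>\n')

pattern = re.compile(r'"http://www\.theherbsplace\.com/(.*?)"')
replacement = r'"http://www.theherbsplace.com/\1" rel="nofollow"'

# Compiled pattern: replacement first, then the text to work on.
modified = pattern.sub(replacement, original)

# Compare line by line; True keeps the line endings.
diff_lines = list(difflib.unified_diff(
    original.splitlines(True),
    modified.splitlines(True),
    fromfile='original_document.html',
    tofile='new_document.html'))

print(''.join(diff_lines))
```

Lines starting with - come from the original and lines starting with + from the modified text, so only the changed links show up.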
ASKER
Pepr,
It's obvious you know your Python. However, I can't really understand what you are suggesting I do. So please put the entire code together, because my piecing your suggestions together isn't working, as you can see.
# http://www.theherbsplacenews.com/
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.theherbsplacenews.com/')
#read html code
html = website.read()
#use re.findall to get all the links
# links = re.findall('"(http://www.theherbsplace.com/.*?)"', html)
pattern = re.compile('"http://www\.theherbsplace\.com/(.*?)?"')
#result = prog.match(r'"http://www.theherbsplace.com/\1" rel="nofollow"')
#result = re.sub(r'"http://www\.theherbsplace\.com/(.*?)?"', r'"http://www.theherbsplace.com/\1" rel="nofollow"', html)
result = pattern.sub(r"http://www.theherbsplace.com/\1" rel="nofollow",
#print result
with open('new_document.html', 'w') as f:
    f.write(result)
here is what I get as a result
============== RESTART: /Users/rjw/Documents/Python/getLinks.py ==============
>>>
I know this is my ignorance so please guide someone that is new to python.
Thanks,
ASKER CERTIFIED SOLUTION
ASKER
Pepr, this is great, and I will post another question on how to address the double rel="nofollow". For now, I just ran a duplicate-finder regex in TextWrangler to remove them.
Terry, thanks for the clarification to Pepr on the \1