sharingsunshine
asked on
Python Encoding Problem \u2013
I am getting this error from a python script
print(driver.page_source)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 424837: ordinal not in range(128)
Here is a part of the code
html_source = driver.page_source
print(driver.page_source)
I have opened a page using webdriver and I am trying to get the page source of that page. While debugging my script, I wanted to check that the source was being read in correctly. The error appeared when I used print().
I have thousands of pages to read in like this, and the en dash (U+2013) will come up frequently. So what is the best way to handle this in the script so I don't have to stop and restart it all the time?
The gibberish one is a different portion of the text -- it is for some encoded form of an image.
You still have driver.page_source as the original Unicode string. ascii() is only used to convert the content into a form that print can display. print has to encode the string using the encoding of the console that shows the output. If that encoding cannot represent every character in the string (ASCII cannot represent '\u2013'), the encoding step fails and the exception is raised -- the one that you have observed.
Instead of printing, you can write driver.page_source to a file and check its contents in a Unicode-capable editor:
with open('check_me.txt', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
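If you still want to print while debugging, one option is to escape whatever the console cannot represent instead of letting print raise. This is only a minimal sketch: page_source is a stand-in string for driver.page_source, and safe_for_console is a hypothetical helper name, not part of Selenium or the standard library.

```python
# Stand-in for driver.page_source (assumed sample text containing U+2013).
page_source = "Pages 10\u201320 of the archive"


def safe_for_console(text, encoding="ascii"):
    """Escape characters that `encoding` cannot represent instead of raising."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Encoding strictly to ASCII fails the same way print() does on an ASCII console.
try:
    page_source.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)

# The escaped version contains only ASCII characters, so it prints on any console.
print(safe_for_console(page_source))
```

The original string is not modified; only the printed copy is escaped. On Python 3.7 or later, calling sys.stdout.reconfigure(errors="backslashreplace") once at the start of the script should have a similar effect for every print, assuming stdout is a regular text stream.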
ASKER
You answered my question, but on looking more closely at the output I realized that I don't need the page source after all. I just need the HTML of the post, which I can get via a button on the Blogger post page.
Thanks for your help and it's always a pleasure to have your input.
ASKER
Here is the same page via copy and paste
I need the source so I can parse it with some regexes to change the links.
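For the link rewriting, something along these lines might be a starting point. It is a rough sketch under stated assumptions: the two URL prefixes are hypothetical placeholders, rewrite_links is an illustrative name, and the pattern only handles double-quoted href attributes.

```python
import re

# Hypothetical prefixes, used only for illustration.
OLD_PREFIX = "http://old.example.com/"
NEW_PREFIX = "https://new.example.com/"

# Match href="OLD_PREFIX<rest>" and capture the rest of the URL.
HREF_PATTERN = re.compile(r'href="' + re.escape(OLD_PREFIX) + r'([^"]*)"')


def rewrite_links(html):
    """Return html with links under OLD_PREFIX moved to NEW_PREFIX."""
    return HREF_PATTERN.sub(
        lambda m: 'href="' + NEW_PREFIX + m.group(1) + '"', html
    )


sample = '<a href="http://old.example.com/post\u2013one">Post</a>'
# ascii() keeps the output printable even on a console that cannot show U+2013.
print(ascii(rewrite_links(sample)))
```

Regexes work on Unicode strings directly, so the en dash needs no special handling here; it only matters when the result is printed or written out.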