Python: extracting text from HTML - coding issue

Hi,

I'm using Python.
I'm trying to extract the text (the content without tags) from this HTML page: http://thevlog.co.il/vegan-kebabs/

I was able to download the HTML, but I can't extract the text from it (that is, remove the tags).
It seems that the page encoding is ISO-8859-2.

When I tried BeautifulSoup, I got unreadable characters: שקדים,
I also tried regular expressions and lxml, but nothing worked for me.

It seems like an encoding issue.

Thanks.
omer d Asked:

Walter Ritzel, Senior Software Engineer, Commented:
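Here is an example, assuming the HTML file is already saved on your system:
import re
import codecs
# BeautifulSoup 3 (Python 2); with bs4 the import is: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

def visible(element):
    # Skip text nodes that live inside tags the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments (DOTALL so multi-line comments match too).
    elif re.match('<!--.*-->', str(element), re.DOTALL):
        return False
    return True

def main():
    # Read the saved page, decoding it with an explicit encoding.
    with codecs.open('page.htm', 'r', encoding='ISO-8859-2') as f:
        tree = BeautifulSoup(f.read())
    # Collect every text node, then keep only the visible ones.
    texts = tree.findAll(text=True)
    visible_texts = filter(visible, texts)
    # Write the result with the same encoding that was used for reading.
    with codecs.open('saved_text.txt', 'w+', encoding='ISO-8859-2') as out:
        for r in visible_texts:
            out.write(r)

if __name__ == '__main__':
    main()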

omer d (Author) Commented:
Hi Walter,

Thanks, the idea seems right, but for that specific site it doesn't work for me.

Did it work for you?
Walter Ritzel, Senior Software Engineer, Commented:
Yes, it worked, but only after I read the HTML from a file and specified the encoding.
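
As a rough illustration of why the encoding passed when reading matters, here is a minimal Python 3 sketch. It assumes the page bytes are actually UTF-8 (the garbled sample in the question looks like UTF-8 Hebrew shown with a single-byte codec); the variable names are hypothetical. It also suggests one possible reason a file read and written with the same single-byte codec can still look fine afterwards: the bytes survive the round trip unchanged.

```python
# Assumption: the page bytes are UTF-8. "שקדים" (almonds) is the word
# that appears garbled in the question.
word = "שקדים"
raw = word.encode("utf-8")           # bytes as they would sit in the saved file

# Decoding with the wrong codec yields unreadable characters...
misread = raw.decode("iso-8859-2")

# ...but encoding them again with the same codec restores the original
# bytes, so a file written this way is still valid UTF-8.
round_tripped = misread.encode("iso-8859-2")

# Decoding with the right codec gives readable text directly.
decoded = raw.decode("utf-8")
```

If the text is only copied from one file to another, the round trip hides the problem; once the decoded string is printed or processed, the codec has to match the real encoding of the page.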

omer d (Author) Commented:
Are you talking about the HTML from http://thevlog.co.il/vegan-kebabs/
or just some HTML with that encoding?
I ask because I'm not sure about the encoding; this is just what I got from BeautifulSoup.
Walter Ritzel, Senior Software Engineer, Commented:
The HTML from http://thevlog.co.il/vegan-kebabs/
I tried to download it using urllib2, but I had issues with a proxy.
So I opened the URL in a browser and saved the page.
Then I used the code above.
You can see the saved file below, opened in Notepad.
[Attachment: file.jpg]
omer d (Author) Commented:
Thanks. It didn't work for me because I hadn't converted the text to Unicode.
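
For later readers, a minimal sketch of "decode to Unicode first, then strip the tags". It uses Python 3 and the standard-library html.parser rather than the BeautifulSoup code from the answer; `VisibleTextParser`, `extract_text`, and the sample markup are hypothetical names made up for illustration, and UTF-8 is an assumed default that should be replaced with the page's real encoding.

```python
from html.parser import HTMLParser

# Tags whose contents are not visible page text (same idea as the
# tag list in the answer's visible() function).
SKIPPED_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside a skipped tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(raw_bytes, encoding="utf-8"):
    # Decode first, so the parser only ever sees Unicode text.
    html = raw_bytes.decode(encoding)
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>ignored</title><style>p {}</style></head>"
    "<body><!-- a comment --><p>שקדים</p>"
    "<script>var x = 1;</script></body></html>"
).encode("utf-8")

text = extract_text(sample)   # text == "שקדים"
```

The point of the sketch is the order of operations: the raw bytes are decoded once, with a known encoding, before any tag handling happens.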