Python: extracting text from HTML - encoding issue


I'm using Python.
I'm trying to extract the text (the content without tags) from an HTML page:

I was able to download the HTML, but I can't extract the text from it (that is, remove the tags).
The page encoding seems to be ISO-8859-2.

When I tried BeautifulSoup, I got unreadable characters: שקדים,
I also tried regular expressions and lxml, but nothing worked for me.

It looks like an encoding issue.

omer d Asked:

Walter Ritzel, Senior Software Engineer, Commented:
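Example, assuming the HTML file is already on your system (Python 2, BeautifulSoup 3):
import sys
import re
from BeautifulSoup import BeautifulSoup
import codecs

def visible(element):
    # Skip text nodes that live inside non-displayed tags.
    # (element.parent.name is reconstructed; the posted line lost it.)
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

def main():
    # Read the saved page, decoding it with the page's encoding.
    # (The codecs.open calls and the parse step are reconstructed.)
    with codecs.open('page.htm','r',encoding='ISO-8859-2') as f:
        tree = BeautifulSoup(f.read())
        texts = tree.findAll(text=True)
        visible_texts = filter(visible,texts)
        fWrite = codecs.open('saved_text.txt','w+',encoding='ISO-8859-2')
        for r in visible_texts:
            fWrite.write(r)  # assumed loop body: write each visible text node
        fWrite.close()

if __name__ == '__main__':
    main()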
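
For readers on Python 3, where the old `BeautifulSoup` module and `codecs.open` are no longer the usual tools, the same filtering idea can be sketched with only the standard library. This is a rough equivalent under stated assumptions, not the code from the answer: `VisibleTextParser`, `visible_text` and `HIDDEN_TAGS` are hypothetical names, and the input is assumed to be an already decoded string.

```python
from html.parser import HTMLParser

# Tags whose text is not shown to the reader (same idea as visible() above).
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside a hidden tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment(), so they never arrive here.
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `visible_text(page)` on a decoded page returns the text outside `style`, `script`, `head` and `title`, with comments dropped.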

omer d, Author, Commented:
Hi Walter,

Thanks, the idea seems right, but it doesn't work for me on that specific site.

Did it work for you?
Walter Ritzel, Senior Software Engineer, Commented:
Yes, it worked, but only after I read the HTML from a file and specified the encoding explicitly.
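
To illustrate why specifying the encoding matters, here is a small Python 3 sketch that writes a sample ISO-8859-2 file and reads it back twice, once with the right encoding and once with a wrong one. The sample text, the temporary file and the helper name `read_html` are assumptions made for the demonstration; they are not taken from the page discussed in this thread.

```python
import os
import tempfile


def read_html(path, encoding):
    """Read an HTML file as text, decoding it with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical sample: Polish text, which ISO-8859-2 can represent.
sample = "<p>zażółć gęślą jaźń</p>"

# Save the sample as ISO-8859-2 bytes, like a page saved from a browser.
with tempfile.NamedTemporaryFile(suffix=".htm", delete=False) as tmp:
    tmp.write(sample.encode("iso-8859-2"))
    demo_path = tmp.name

right = read_html(demo_path, "iso-8859-2")  # decoded as it was written
wrong = read_html(demo_path, "iso-8859-1")  # same bytes, wrong table
os.remove(demo_path)

print(right == sample)  # True
print(wrong == sample)  # False: several letters come out as other symbols
```

The bytes on disk are identical in both reads; only the decoding table differs, which is enough to turn readable text into garbage.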

omer d, Author, Commented:
Are you talking about the HTML from:
or just some HTML with that encoding?
I ask because I'm not sure about the encoding; this is just what I got from BeautifulSoup.
Walter Ritzel, Senior Software Engineer, Commented:
The HTML from
I tried to download it using urllib2, but I had issues with a proxy.
So I opened the URL in a browser and saved the page.
Then I used the code above.
You can see the saved file below, opened in Notepad.
omer d, Author, Commented:
Thanks. It didn't work for me because I didn't convert the text to unicode.
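
The step described here, converting the downloaded bytes to unicode before stripping tags, might look roughly like this in Python 3. It is only a sketch: `to_unicode` and `CHARSET_RE` are hypothetical names, the UTF-8 default is an assumption, and the charset is taken from the page's own `<meta>` tag because the thread never confirms the real encoding of the page.

```python
import re

# Finds charset=... inside a <meta> tag, for example
# <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">.
CHARSET_RE = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)",
                        re.IGNORECASE)


def to_unicode(raw, default="utf-8"):
    """Decode raw HTML bytes using the charset the page declares, if any."""
    match = CHARSET_RE.search(raw[:4096])  # declarations sit near the top
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong label: fall back and replace undecodable bytes.
        return raw.decode(default, errors="replace")
```

The result is a regular `str`, which can then be passed to whatever tag-stripping code is in use. A declared charset can itself be wrong, so the output is still worth checking by eye.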