Python: extracting text from HTML - coding issue


I'm using Python.
I'm trying to extract the text (content without tags) from an HTML page:

I was able to download the HTML, but I was unable to extract the text from it (removing the tags...).
It seems that the page encoding is ISO-8859-2.

When I tried BeautifulSoup, I got unreadable characters: שקדים,
I also tried regular expressions and lxml, but nothing worked for me.

It seems like an encoding issue...

omer d Asked:
Walter Ritzel (Senior Software Engineer) Commented:
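Example, assuming the HTML file is already saved on your system (Python 2, BeautifulSoup 3):
import sys
import re
from BeautifulSoup import BeautifulSoup
import codecs

def visible(element):
    # Skip text nodes that sit inside tags the browser does not render
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

def main():
    # Open the file with an explicit encoding so the content is decoded to unicode
    with codecs.open('page.htm', 'r', encoding='ISO-8859-2') as f:
        tree = BeautifulSoup(f.read())
        texts = tree.findAll(text=True)
        visible_texts = filter(visible, texts)
        fWrite = codecs.open('saved_text.txt', 'w+', encoding='ISO-8859-2')
        for r in visible_texts:
            fWrite.write(r)
        fWrite.close()

if __name__ == '__main__':
    main()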
omer d (Author) Commented:
Hi Walter,

Thanks, the idea seems right, but for that specific site it doesn't work for me.

Did it work for you?
Walter Ritzel (Senior Software Engineer) Commented:
Yes, it worked, but only after I read the HTML from a file while specifying the encoding.
omer d (Author) Commented:
Are you talking about the HTML from:
or just some HTML with that encoding?
I'm asking because I'm not sure about the encoding; this is just what I got from BeautifulSoup.
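When the encoding is in doubt, one option is to look at what the page itself declares before decoding it. The following is only a minimal sketch in Python 3 using the standard library; the helper name declared_charset and the utf-8 fallback are assumptions made for illustration, and a declared charset can still be wrong or be overridden by the HTTP Content-Type header.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)

def declared_charset(raw_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found.

    Hypothetical helper: it only inspects the first 4096 bytes and does not
    validate that the name is a codec Python knows.
    """
    match = CHARSET_RE.search(raw_bytes[:4096])
    if match:
        return match.group(1).decode("ascii")
    return default
```

If the declared name and the actual bytes disagree, decoding will still produce unreadable characters, so it is worth comparing the result against what a browser shows for the same page.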
Walter Ritzel (Senior Software Engineer) Commented:
The html from
I tried to download it using urllib2, but I had issues with the proxy.
So I opened the URL in a browser and saved the page.
Then I used the code above.
You can see the saved file below in Notepad.
omer d (Author) Commented:
Thanks, it didn't work for me because I didn't convert the text to unicode...
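The fix described here, converting the downloaded bytes to unicode before stripping tags, can be sketched as follows. This is not the code used in the thread: it is a Python 3 illustration built on the standard library's html.parser instead of BeautifulSoup, and the names VisibleTextParser, extract_text and sample are made up for the example.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text that is not inside <script>, <style>, <head> or <title>."""

    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a non-rendered tag
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are dropped for free
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(raw_bytes, encoding):
    """Decode the raw bytes to unicode first, then strip the tags."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Assumed sample page, encoded as ISO-8859-2 the way a download would arrive
sample = (
    "<html><head><title>t</title><style>p{}</style></head>"
    "<body><p>łódź</p><script>var x = 1;</script><!-- note --><p>ok</p></body></html>"
).encode("iso-8859-2")
```

Calling extract_text(sample, "iso-8859-2") returns the two visible paragraphs separated by a newline. Passing the wrong encoding name does not raise an error here; it silently yields the kind of unreadable characters described in the question.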