Language detection and translation.

Hi everyone!

I was given a specific list of URLs that I need to spider with a VB.NET program. The problem is that many of the sites are in a foreign language.
The question is: is there a way for my spider to detect the page's language? Maybe through the server's response headers? Also, what is a good option for automatic translation? I thought about Systran, but I really don't know if it is a good, cost-effective option.

Thanks a lot!


Some webpages contain META tags that specify the language being used. An example of a META tag making this declaration would be:

    <meta http-equiv="Content-Language" content="en-US" />

This means: English, as used in the US.
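I can't give you VB, but here is a rough Python sketch of the check, which should port to .NET without much trouble. It looks for the META tag shown above, and since you asked about response headers, it first checks for a `Content-Language` HTTP header, which some servers send (many don't). The names `declared_language` and `_LanguageMetaParser` are made up for this example, and `headers` is assumed to be a plain dictionary of header names to values from whatever HTTP client you use.

```python
from html.parser import HTMLParser


class _LanguageMetaParser(HTMLParser):
    """Collects the language declared by a Content-Language META tag, if any."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching META tag is used.
        if tag != "meta" or self.language is not None:
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-language":
            declared = attributes.get("content", "").strip()
            if declared:
                self.language = declared


def declared_language(html, headers=None):
    """Return the declared language tag (e.g. "en-US"), or None if absent."""
    # Check the HTTP response header first; header names are case-insensitive.
    for name, value in (headers or {}).items():
        if name.lower() == "content-language" and value.strip():
            return value.strip()
    # Fall back to the META tag inside the page itself.
    parser = _LanguageMetaParser()
    parser.feed(html)
    return parser.language
```

A result of `None` just means the page declares nothing, so you would still need some kind of fallback.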

However, most pages don't actually include this tag, in which case there is no way to determine the language with 100% certainty.

What you could do is build a database of the 5 (or 10, for more accuracy) most common words in each language. Then, for each page you spider, scan the text for those words and keep a tally per language; at the end, the language with the highest count is the most likely match.
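To make the tally idea concrete, here is a minimal Python sketch (again, not VB, but the logic is simple enough to translate). The word lists in `COMMON_WORDS` are short illustrative samples I picked, not a vetted database, and `guess_language` is a hypothetical name. It expects plain text, so you would strip the HTML tags from the page first.

```python
import re
from collections import Counter

# Illustrative samples only: a real list would hold the 5-10 most common
# words per language, taken from a proper frequency list.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "spanish": {"el", "la", "de", "que", "y"},
    "french": {"le", "la", "et", "les", "des"},
    "german": {"der", "die", "und", "das", "ist"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    tally = Counter()
    for word in words:
        for language, common in COMMON_WORDS.items():
            if word in common:
                tally[language] += 1
    if not tally:
        return None  # no known word seen, so no guess
    return tally.most_common(1)[0][0]
```

Some words belong to more than one language ("la" is both Spanish and French here), so the result is only a best guess, and short pages will be less reliable than long ones.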

Other than that, I don't do VB, so I couldn't really contribute on a coding level.

Question has a verified solution.
