• Status: Solved

Language detection and translation.

Hi everyone!

I was given a specific list of URLs that I need to spider with a VB.NET program. The problem is that many of the sites are in a foreign language.
My question: is there a way for my spider to detect a page's language, maybe through the server's response headers? Also, what is a good option for automatic translation? I thought about Systran, but I really don't know whether it is a good, cost-effective option.

Thanks a lot!


1 Solution
Some web pages contain META tags that specify the language being used. An example of a META tag making this declaration would be:

    <meta http-equiv="Content-Language" content="en-US" />

Which obviously means: English, US.

However, most pages don't actually contain these; in those cases, there's no way of determining the language with 100% certainty.
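For what it's worth, here is a rough sketch of the META tag check. It is written in Python rather than VB (purely for illustration), and the names `declared_language` and `_MetaLanguageParser` are made up for this example. It only looks at the `Content-Language` META tag shown above and returns nothing when the tag is missing:

```python
from html.parser import HTMLParser
from typing import Optional


class _MetaLanguageParser(HTMLParser):
    """Remembers the first <meta http-equiv="Content-Language"> value seen."""

    def __init__(self) -> None:
        super().__init__()
        self.language: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first declaration is kept.
        if tag != "meta" or self.language is not None:
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() == "content-language":
            content = attr_map.get("content", "").strip()
            if content:
                self.language = content


def declared_language(html: str) -> Optional[str]:
    """Return the declared language (e.g. "en-US"), or None if absent."""
    parser = _MetaLanguageParser()
    parser.feed(html)
    return parser.language


if __name__ == "__main__":
    page = '<html><head><meta http-equiv="Content-Language" content="en-US" /></head></html>'
    print(declared_language(page))
```

Servers can also send the same value as a `Content-Language` HTTP response header, so it may be worth checking the response headers before parsing the page, though many servers don't set it either.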

What you could do is create a database with the 5 (or 10, for more accuracy) most common words for each language. Then, for each page that you spider, scan the text for these words and keep a tally per language; at the end, the language with the highest tally is the one the page most likely uses.
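The tally idea could look something like the sketch below, again in Python only as an illustration. The word lists are a small hand-picked sample and not a vetted database, the function name `guess_language` and the `min_hits` threshold are assumptions of mine, and the input is assumed to be page text with the HTML markup already stripped:

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Illustrative sample only: ten very common words per language.
COMMON_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "un", "por", "con"},
    "fr": {"le", "la", "de", "et", "les", "des", "un", "une", "est", "pour"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "ein", "zu", "den"},
}


def guess_language(text: str, min_hits: int = 3) -> Optional[str]:
    """Return the language code with the most common-word hits, or None."""
    # Split into runs of letters (accented letters included), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())

    tally: Counter = Counter()
    for word in words:
        for lang, common in COMMON_WORDS.items():
            if word in common:
                tally[lang] += 1

    if not tally:
        return None
    lang, hits = tally.most_common(1)[0]
    # Too few hits means there is not enough evidence to make a guess.
    return lang if hits >= min_hits else None


if __name__ == "__main__":
    print(guess_language("El perro de la casa que está en el jardín y los gatos"))
```

Some words appear in several lists ("la" and "de" are common in both Spanish and French), so short pages can come out ambiguous; longer lists and a minimum hit count help, but the result is still only a guess.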

Other than that, I don't do VB, so I couldn't really contribute on a coding level.

