Determine the language of a text file.

I have 50,000+ text files that I need to determine the language for.  So far, I can see German, French and English.  I am looking for some automated way to determine this.  This can be done through a third-party tool, the command line, an API, etc.  My only restriction is that I can't use any web services.

I do have experience in .NET and could write an application wrapped around an existing API.  Any input would be greatly appreciated.
rye004Asked:

Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
If you want to do this without having reference words to search for, you can get a fairly accurate result by calculating the index of coincidence.

This article goes a little too deep into it for your needs and the formula looks horrible. I wrote a small script to do this some years ago and it was much simpler.

http://en.m.wikipedia.org/wiki/Index_of_coincidence

Basically, you count the occurrences of each letter, multiply by a coefficient and get a number. You don't need much more than a single sentence to return a number very close to those shown in the article. It's even accurate enough to distinguish between different authors.
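For anyone who wants to try this, here is a minimal Python sketch of the calculation as the Wikipedia article defines it (the sum of n(n-1) over all letter counts, divided by N(N-1) for N letters in total). The function name `index_of_coincidence` is my own, and treating every Unicode letter as its own symbol (so "é" and "e" count separately) is an assumption, not necessarily how the published per-language values were derived.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Return the (unnormalised) index of coincidence of the letters in text."""
    # Keep letters only and fold case; digits, spaces and punctuation are ignored.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total < 2:
        # The formula divides by N * (N - 1), so it is undefined for fewer than 2 letters.
        return 0.0
    counts = Counter(letters)
    # Chance that two letters drawn at random (without replacement) are the same.
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog and keeps on running."
    print(round(index_of_coincidence(sample), 4))
```

Longer samples should give steadier values, so it is probably worth feeding it a whole file rather than a single line when the files allow it.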

Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
I'm looking for a simpler, easy-to-implement version...
Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
I wrote mine in VB, but it is all lost now.
This contains one in C++; look down the headers in the code block for index of coincidence.

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=8607&lngWId=3

Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
This has a Java version which looks a bit clearer:
https://evilzone.org/java/%28java-source%29-cryptanalysis-frequency-and-coincidence-counting/

Still looking for where I developed my code from.
Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
I found my old project. I can't run it any more, but I'll attach a picture, as I can't fit a floppy disc in my phone. The picture contains the meat of the calculation: it takes a string, counts the occurrences of each character, performs a simple calculation, and returns the index of coincidence, which can be checked against this table. It was written to identify the language of encrypted texts; it still works if the letters are interchanged by a cipher.

Expected Index of Coincidence
'English    0.0661
'French     0.0778
'German     0.0762
'Italian    0.0738
'Japanese   0.0819
'Portuguese 0.0791
'Russian    0.0529
'Spanish    0.0775
[Attached image: code segment]
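Checking a computed value against the table could be sketched like this in Python. The `EXPECTED_IC` values are copied from the table above; the function name `rank_languages` is hypothetical. Note that the French, German and Spanish entries (0.0778, 0.0762, 0.0775) sit very close together, so a nearest-value match between those languages should be treated as a hint rather than a verdict.

```python
# Expected values copied from the table in the comment above.
EXPECTED_IC = {
    "English": 0.0661,
    "French": 0.0778,
    "German": 0.0762,
    "Italian": 0.0738,
    "Japanese": 0.0819,
    "Portuguese": 0.0791,
    "Russian": 0.0529,
    "Spanish": 0.0775,
}


def rank_languages(ic: float) -> list[tuple[str, float]]:
    """Return (language, distance) pairs, closest expected value first."""
    distances = [(lang, abs(expected - ic)) for lang, expected in EXPECTED_IC.items()]
    return sorted(distances, key=lambda pair: pair[1])


if __name__ == "__main__":
    # 0.0670 is a made-up measurement used only to show the output shape.
    for language, distance in rank_languages(0.0670)[:3]:
        print(f"{language}: off by {distance:.4f}")
```

Looking at the gap between the first and second entries gives a rough idea of how much to trust the top match.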
Thibault St john Cholmondeley-ffeatherstonehaugh the 2ndCommented:
Ah sorry. Posted from phone and hit the button twice.
ozoCommented:
Here are some packages using this and other methods:  http://en.wikipedia.org/wiki/Language_identification#Libraries
BillDLCommented:
Being something of a simpleton when it comes to programming, I would be inclined to make a list of the most commonly used conjunctions in each language you expect to find in your text files. A batch search program or script could then search the target files for those words and either move or copy the files into different folders, or rename them with a prefix or suffix that identifies the language.

In English, the conjunctives are: and, but, or, nor, for, yet, and so.
If matching only complete words, "and" and "but" are likely to be the most distinctive of English.  It would be a case of picking out similarly distinctive words or conjunctions in the other languages that you expect to discover, and making lists.

French conjunctions: et, mais, ou, ni, pour, alors (various including encore), and si.
German conjunctions: und, aber doch, oder, noch, für, doch, so (also damit).

http://en.wiktionary.org/wiki/Category:Conjunctions_by_language

It would be reasonably easy to create 3 "DOS" batch files (probably calling FINDSTR) that separately walk through a set of folders and sub-folders finding *.txt files and searching each of them for exact instances of specific words.  You could make it so that if more than 3 exact word matches are found, the process moves on to the next file and renames or moves the last one.  Searching 50,000 text files for a few words using a DOS batch file would take a long time though.
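As an alternative to batch files, the same folder walk could be sketched in Python. Everything named here (`MARKERS`, `sort_by_language`, the `threshold` parameter, the "unknown" folder) is hypothetical, the marker words are a cut-down selection from the lists above, and the files are assumed to be readable as UTF-8. It copies rather than moves, so the originals stay put while you check the results.

```python
import re
import shutil
from pathlib import Path

# A few marker words per language; extend these from the lists above.
MARKERS = {
    "English": {"and", "but"},
    "French": {"et", "mais"},
    "German": {"und", "aber"},
}


def sort_by_language(source: Path, dest: Path, threshold: int = 3) -> dict[str, str]:
    """Copy every *.txt under source into dest/<language>/ and return {file name: language}."""
    result = {}
    # sorted() collects the file list up front, so copies written under dest are not re-scanned.
    for path in sorted(source.rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        words = re.findall(r"[^\W\d_]+", text)
        hits = {lang: sum(w in vocab for w in words) for lang, vocab in MARKERS.items()}
        language = max(hits, key=hits.get)
        # Require more than `threshold` whole-word matches before trusting the guess.
        if hits[language] <= threshold:
            language = "unknown"
        target = dest / language
        target.mkdir(parents=True, exist_ok=True)
        # Note: files with the same name in different sub-folders would overwrite each other.
        shutil.copy2(path, target / path.name)
        result[path.name] = language
    return result
```

Calling `sort_by_language(Path("input"), Path("sorted"))` would leave one sub-folder per detected language under `sorted`, plus an `unknown` folder for files with too few matches.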

An alternative for me would be a "Search and Replace" program used only in "Search" mode.  For this kind of thing, although not on the same scale as you need, I use an old registered version of the shareware Search and Replace.
http://www.funduc.com/search_replace.htm
There will be loads of similar freeware offerings out there, including Windows' own search feature.

There would also be the option to search files for accents that are unique to different languages, like the circumflex in French, and the umlauts in German.
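The accent idea could look something like this in Python. The two character sets are my own rough selection (for example, I have put "ü" only on the German side even though it turns up occasionally in French), and `accent_profile` is a hypothetical name. Plain English text would simply score zero on both.

```python
import unicodedata

# Rough, assumed character sets; adjust them to suit the files at hand.
GERMAN_MARKS = set("äöüß")
FRENCH_MARKS = set("àâçéèêëîïôùûœ")


def accent_profile(text: str) -> dict[str, int]:
    """Count characters that are typical of German and of French spelling."""
    # NFC turns "u" + combining diaeresis into the single character "ü" before counting.
    normalised = unicodedata.normalize("NFC", text).lower()
    return {
        "German": sum(ch in GERMAN_MARKS for ch in normalised),
        "French": sum(ch in FRENCH_MARKS for ch in normalised),
    }


if __name__ == "__main__":
    print(accent_profile("Über die Straße"))
    print(accent_profile("Être élève à la forêt"))
```

This only works if the files are decoded with the right encoding; accented characters read with the wrong code page will not match either set.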