
Determine the language of a text file.

I have 50,000+ text files that I need to determine the language for. So far, I can see German, French and English. I am looking for some automated way to determine this. This can be done through a third-party tool, command line, API, etc. My only restriction is I can’t use any Web services.

I do have experience in .NET and could write an application wrapped around an existing API. Any input would be greatly appreciated.

ASKER CERTIFIED SOLUTION

Enabbar Ocap


Enabbar Ocap

I'm looking for a simpler, easy-to-implement version...

Enabbar Ocap

I wrote mine in VB, but it is all lost now.
This link contains one in C++; look down the headers in the code block for "index of coincidence".

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=8607&lngWId=3

Enabbar Ocap

This has a Java version, which looks a bit clearer:
https://evilzone.org/java/%28java-source%29-cryptanalysis-frequency-and-coincidence-counting/

Still looking for where I developed my code from.

Enabbar Ocap

I found my old project. I can't run it any more, but I'll attach a pic, as I can't fit a floppy disc in my phone. The pic contains the meat of the calculation: it takes a string, counts the incidence of all characters, performs a simple calculation and returns the index of coincidence, which can be checked against this table. It was written to identify the language of encrypted texts; it still works if the letters are interchanged by a cypher.

Expected Index of Coincidence
English     0.0661
French      0.0778
German      0.0762
Italian     0.0738
Japanese    0.0819
Portuguese  0.0791
Russian     0.0529
Spanish     0.0775
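
Since the picture with the original VB code isn't readable here, this is a minimal Python sketch of the usual index of coincidence formula, sum of n_i * (n_i - 1) over all letters divided by N * (N - 1), and not necessarily the exact calculation in that project. The function names are made up for illustration, and counting every alphabetic character (accented letters as their own symbols) is an assumption. Note that the expected values for French and German in the table are close together, so this probably won't separate those two reliably on short files.

```python
from collections import Counter

# Expected values copied from the table in this thread.
EXPECTED_IC = {
    "English": 0.0661,
    "French": 0.0778,
    "German": 0.0762,
    "Italian": 0.0738,
    "Japanese": 0.0819,
    "Portuguese": 0.0791,
    "Russian": 0.0529,
    "Spanish": 0.0775,
}


def index_of_coincidence(text: str) -> float:
    """Chance that two letters picked at random from the text are the same."""
    # Assumption: only alphabetic characters count, case is ignored.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def closest_language(text: str) -> str:
    """Return the table entry whose expected value is nearest to the text's IC."""
    ic = index_of_coincidence(text)
    return min(EXPECTED_IC, key=lambda lang: abs(EXPECTED_IC[lang] - ic))
```

Because only the letter counts matter, swapping letters with a simple substitution cypher leaves the result unchanged, which matches the behaviour described above.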

Enabbar Ocap

Ah sorry. Posted from phone and hit the button twice.

ozo

Here are some packages using this and other methods: http://en.wikipedia.org/wiki/Language_identification#Libraries

BillDL

Being something of a simpleton when it comes to programming, I would be inclined to make a list of the most commonly used conjunctions in each language you expect to find in your text files. A batch search program or script could then search the target files for instances of those words and either move or copy the files into different folders, or rename them with a prefix or suffix that identifies the language.

In English, the conjunctions are: and, but, or, nor, for, yet, and so.
If matching only complete words, "and" and "but" are likely to be the most distinctive of English. It would be a case of picking out similarly distinctive words or conjunctions in the other languages that you expect to discover, and making lists.

French conjunctions: et, mais, ou, ni, pour, alors (various including encore), and si.
German conjunctions: und, aber doch, oder, noch, für, doch, so (also damit).

http://en.wiktionary.org/wiki/Category:Conjunctions_by_language

It would be reasonably easy to create 3 "DOS" batch files (probably calling FINDSTR) that separately walk through a set of folders and sub-folders finding *.txt files and searching each of them for exact instances of specific words. You could make it so that if more than 3 exact word matches are found, the process moves on to the next file and renames or moves the last one. Searching 50,000 text files for a few words using a DOS batch file would take a long time though.
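
For anyone who would rather script it than use batch files, here is a rough Python sketch of the same idea (Python swapped in for FINDSTR, not a batch file). It counts whole-word hits from the conjunction lists above and copies each *.txt file into a folder named after the language with the most hits. The function names, the "unknown" folder, the UTF-8 assumption and reading "more than 3 matches" as `min_hits=4` are all my own choices for illustration.

```python
import re
import shutil
from pathlib import Path

# Word lists taken from the conjunctions quoted in this thread.
CONJUNCTIONS = {
    "English": {"and", "but", "or", "nor", "for", "yet", "so"},
    "French": {"et", "mais", "ou", "ni", "pour", "alors", "encore", "si"},
    "German": {"und", "aber", "doch", "oder", "noch", "für", "so", "damit"},
}

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones


def guess_language(text: str, min_hits: int = 4) -> str:
    """Pick the language whose word list matches most often, else 'unknown'."""
    words = WORD_RE.findall(text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in CONJUNCTIONS.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    # Too few hits or a tie between languages: refuse to guess.
    if top < min_hits or top == second:
        return "unknown"
    return best


def sort_by_language(root: Path, dest: Path) -> None:
    """Copy every *.txt under root into dest/<language>/ (dest outside root)."""
    for path in sorted(root.rglob("*.txt")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        target = dest / guess_language(text)
        target.mkdir(parents=True, exist_ok=True)
        # Note: files with the same name in different subfolders overwrite.
        shutil.copy2(path, target / path.name)
```

Something like `sort_by_language(Path("C:/texts"), Path("C:/sorted"))` would do a full run (both paths are placeholders). Files that land in "unknown" can then be checked by hand or with one of the other methods.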

An alternative for me would be a "Search and Replace" program used only in "Search" mode. For this kind of thing, although not on the same scale as you need, I use an old registered version of the shareware Search and Replace.
http://www.funduc.com/search_replace.htm
There will be loads of similar freeware offerings out there, including Windows' own search feature.

There would also be the option to search files for accents that are unique to different languages, like the circumflex in French, and the umlauts in German.
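
A quick sketch of that accent check in Python, in case it helps. The character sets only cover the two examples mentioned (circumflexed vowels for French, umlauted vowels for German), the function name is made up, and it assumes the file has already been decoded correctly; with the wrong encoding the accented letters come out as junk and nothing matches.

```python
import unicodedata

# Only the marks mentioned above; extend the sets as needed.
MARKS = {
    "French": set("âêîôû"),  # circumflex
    "German": set("äöü"),    # umlaut
}


def accent_counts(text: str) -> dict:
    """Count language-specific accented letters in the text."""
    # NFC turns "u" + combining diaeresis into the single character "ü".
    text = unicodedata.normalize("NFC", text.lower())
    return {
        lang: sum(text.count(ch) for ch in chars)
        for lang, chars in MARKS.items()
    }
```

A file with zero hits in both sets is not necessarily English, so this is probably best used as a tie-breaker alongside the word lists rather than on its own.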