• Status: Solved

index and search a HUGE text file w/Java

I have a huge flat text file in which each row is a record. It holds about 300 MILLION records and lives in a UNIX environment.

I want to be able to index this file and then search for a particular string of info...all done by Java.  

We can do the same with a Sybase database, but we want to see the performance using a Java-based search.

Does anyone know of any Java apps that will do this? How about non-Java apps for simply indexing and searching a file?

We want to keep the methods as close to the methods used to index the WWW as possible.


Much thanks to anyone who can help!

daniel_garrison@hotmail.com
Asked by: dgarrison
1 Solution
 
jpk041897 commented:
There are a couple of options you could use: either a hash table, or JDBC to index the records in a database.

Java is certainly not designed as a number-crunching language, so 300 million records will probably not even be supported by Java's in-memory hashing structures.

Your performance using Java will certainly be slower than Sybase's, since Sybase searches its indexes with B+ tree algorithms. (The average time for one index lookup is roughly the height of the tree, which grows with the logarithm of the number of records, times the average disk seek time.)
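To put a rough number on a B+ tree lookup, the sketch below estimates the tree height for a given record count. The fanout of 200 keys per node and the 10 ms seek time are assumed values for illustration only, not Sybase figures, and the class and method names are made up for this example.

```java
/** Rough estimate of B+ tree lookup cost. Fanout and seek time are assumed values. */
final class BTreeLookupEstimate {

    /** Smallest number of levels h such that fanout^h >= records. */
    static int height(long records, int fanout) {
        if (records < 1 || fanout < 2) {
            throw new IllegalArgumentException("need records >= 1 and fanout >= 2");
        }
        int levels = 0;
        long capacity = 1; // records reachable through `levels` levels
        while (capacity < records) {
            // Saturate instead of overflowing for very large inputs.
            capacity = (capacity > Long.MAX_VALUE / fanout) ? Long.MAX_VALUE : capacity * fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long records = 300_000_000L; // record count from the question
        int fanout = 200;            // assumed keys per index node
        double seekMs = 10.0;        // assumed average disk seek time
        int h = height(records, fanout);
        System.out.printf("height=%d, uncached lookup ~ %.0f ms%n", h, h * seekMs);
    }
}
```

With these assumed values the height comes out at 4 levels, so an uncached lookup costs about four seeks. In practice the top levels of the tree tend to stay in cache, which lowers the number of real disk reads.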

Using JNI, you could make calls to a C-based indexing database, such as Mix Software's C Database Toolchest or an equivalent product, and obtain performance close to Sybase's.

You could also port such a library, fairly painlessly, from C to Java and obtain slightly slower performance (provided you use a good native-code Java compiler such as Asymetrix's SuperCede).

Any of these solutions, though, would probably be a lot more work than it's worth if you already have a DB app that does the work for you.

My suggestion would be to use Java and JDBC, either as an applet or as a CGI app, to create the front end, and to use your Sybase DB as the back end for optimal performance.
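A minimal sketch of what the JDBC side of that suggestion might look like follows. The table name `records` and the columns `record_key` and `record_text` are hypothetical, and obtaining the `Connection` (driver, URL, credentials) is left to the caller, since none of that is given in this thread.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a keyed lookup through JDBC. Table and column names are hypothetical. */
final class RecordSearch {

    // Assumes the database has an index on record_key, so the search is an index lookup.
    static final String FIND_SQL =
            "SELECT record_text FROM records WHERE record_key = ?";

    /** Returns the text of every record whose key equals the given key. */
    static List<String> findByKey(Connection conn, String key) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(FIND_SQL)) {
            stmt.setString(1, key); // bind the key rather than concatenating it into the SQL
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString(1));
                }
            }
        }
        return rows;
    }
}
```

The Java code only sends the query and reads the result, so the indexing and searching work stays in the database.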

Now, if what you are looking for is a way to improve the access time of the seeks, there was an article in Dr. Dobb's Journal some years ago that contained a disk hashing algorithm in Pascal that could access records in a file of up to 1 GB in 2 disk accesses, and in 3 accesses for anything larger. Such an approach would yield no significant difference between C++ and Java, since processing time is kept to a minimum. Of course, porting from Pascal to Java is a bit more painful than porting from C++ to Java.
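To illustrate the general idea of disk hashing, here is a minimal sketch in Java. It is not the algorithm from the Dr. Dobb's article. It is a plain linear-probing hash table stored in a second file, in which each slot holds the byte offset of a record in the data file. The record layout (key, then a tab, then the rest of the line), the ASCII encoding, and all class and method names are assumptions made for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Sketch of an on-disk hash index. Slot i holds (record offset + 1); 0 means empty. */
final class DiskHashIndex {
    private static final int SLOT_BYTES = Long.BYTES;

    /** Assumed record layout: the key is the text before the first tab. */
    static String keyOf(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    private static long slotFor(String key, long slots) {
        return Math.floorMod((long) key.hashCode(), slots);
    }

    /** Scans the data file once and writes a table of `slots` entries to indexPath. */
    static void build(String dataPath, String indexPath, long slots) throws IOException {
        if (slots < 1) throw new IllegalArgumentException("slots must be positive");
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r");
             RandomAccessFile index = new RandomAccessFile(indexPath, "rw")) {
            index.setLength(0);
            byte[] emptySlot = new byte[SLOT_BYTES];
            for (long i = 0; i < slots; i++) index.write(emptySlot); // zero = empty
            long offset = 0;
            String line;
            // readLine() is unbuffered and maps each byte to one char: fine for ASCII, slow at scale.
            while ((line = data.readLine()) != null) {
                long slot = slotFor(keyOf(line), slots);
                long probes = 0;
                while (true) { // linear probing for a free slot
                    index.seek(slot * SLOT_BYTES);
                    if (index.readLong() == 0) break;
                    if (++probes == slots) throw new IllegalStateException("index is full");
                    slot = (slot + 1) % slots;
                }
                index.seek(slot * SLOT_BYTES);
                index.writeLong(offset + 1);
                offset = data.getFilePointer(); // start of the next record
            }
        }
    }

    /** Returns the first record whose key equals `key`, or null if there is none. */
    static String lookup(String dataPath, String indexPath, String key) throws IOException {
        try (RandomAccessFile data = new RandomAccessFile(dataPath, "r");
             RandomAccessFile index = new RandomAccessFile(indexPath, "r")) {
            long slots = index.length() / SLOT_BYTES;
            if (slots == 0) return null;
            long slot = slotFor(key, slots);
            for (long probes = 0; probes < slots; probes++) {
                index.seek(slot * SLOT_BYTES);
                long stored = index.readLong();
                if (stored == 0) return null; // an empty slot ends the probe chain
                data.seek(stored - 1);
                String line = data.readLine();
                if (line != null && keyOf(line).equals(key)) return line;
                slot = (slot + 1) % slots; // collision: try the next slot
            }
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("records", ".txt");
        Path index = Files.createTempFile("records", ".idx");
        Files.write(data, Arrays.asList("alice\t30", "bob\t41", "carol\t27"),
                StandardCharsets.US_ASCII);
        build(data.toString(), index.toString(), 8);
        System.out.println(lookup(data.toString(), index.toString(), "bob"));
        System.out.println(lookup(data.toString(), index.toString(), "dave"));
    }
}
```

When a key lands in its home slot, a lookup costs one read of the index file and one read of the data file. Each collision adds another pair of reads, so the table should be sized well above the record count. For 300 million records the build step would also want buffered I/O rather than the byte-at-a-time reads used here.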

If you're interested in such an approach, I could dig up the article and send you the reference.

