Solved

index and search a HUGE text file w/Java

Posted on 1997-11-20
Last Modified: 2006-11-17
I have a huge flat (text) file in which each row is a record. It holds about 300 MILLION records and lives in a UNIX environment.

I want to be able to index this file and then search it for a particular string, all done in Java.

We can do the same with a Sybase database, but we want to see the performance using a Java-based search.

Does anyone know of any Java apps that will do this? How about non-Java apps for simply indexing and searching a file?

We want to keep the methods as close to the methods used to index the WWW as possible.


Much thanks to anyone who can help!

daniel_garrison@hotmail.com
Question by:dgarrison

Accepted Solution

by:
jpk041897 earned 50 total points
ID: 1231016
There are a couple of options you could use: either a hash table, or JDBC to index in a database.

Java is certainly not designed as a number-crunching language, so 300 million records will probably not even be supported by Java's hashing structures.
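To make the hash table option concrete, here is a minimal sketch of an in-memory index that maps each record's key to its byte offset in the file. The record layout is not given in the question, so the tab-delimited leading key and the class name `OffsetIndex` are assumptions for illustration only. It shows the idea on small files; at 300 million records the map would likely need tens of gigabytes of heap, which is the practical limit mentioned above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: key -> byte offset of the first record with that key.
class OffsetIndex {
    private final Map<String, Long> offsets = new HashMap<>();
    private final String path;

    // Scans the file once and remembers where each record starts.
    // RandomAccessFile.readLine is unbuffered and treats each byte as one
    // character, so this is only suitable for ASCII data and small files.
    OffsetIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = file.getFilePointer();
            String line;
            while ((line = file.readLine()) != null) {
                // Assumption: the key is everything before the first tab.
                int tab = line.indexOf('\t');
                String key = tab >= 0 ? line.substring(0, tab) : line;
                offsets.putIfAbsent(key, start);
                start = file.getFilePointer();
            }
        }
    }

    // Returns the full record for the key, or null if the key is unknown.
    String lookup(String key) throws IOException {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            return file.readLine();
        }
    }
}
```

Usage would be along the lines of `new OffsetIndex("records.txt").lookup("bob")`, where the file name is a placeholder.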

Your performance using Java will certainly be slower than Sybase's, since Sybase uses B+ tree search algorithms on its indexes. (The average lookup time for one index is roughly the height of the tree, which grows with the logarithm of the number of records, times the average disk seek time.)
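As a rough check on what an index lookup costs, a B+ tree (the usual structure behind database indexes) with fan-out f needs about as many levels as it takes for f multiplied by itself to reach the record count. The sketch below computes that; the fan-out values are illustrative assumptions, not Sybase's actual page layout, and `BTreeHeight` is a hypothetical name.

```java
// Hypothetical helper for estimating B+ tree depth.
class BTreeHeight {
    // Smallest number of levels h such that fanOut^h >= records.
    static int levels(long records, int fanOut) {
        if (records < 1 || fanOut < 2) {
            throw new IllegalArgumentException("need records >= 1 and fanOut >= 2");
        }
        int h = 0;
        long capacity = 1;
        while (capacity < records) {
            // If the next multiplication would overflow a long, the product
            // certainly exceeds the record count, so one more level suffices.
            if (capacity > Long.MAX_VALUE / fanOut) {
                return h + 1;
            }
            capacity *= fanOut;
            h++;
        }
        return h;
    }
}
```

For 300 million records, `BTreeHeight.levels(300_000_000L, 100)` gives 5 and `BTreeHeight.levels(300_000_000L, 256)` gives 4. Each level is at most one disk read, and the top levels are usually cached, so a lookup typically costs only a handful of seeks.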

Using JNI, you could make calls to a C-based indexing database such as Mix Software's C Database Toolchest or equivalent products and obtain performance close to Sybase's.

You could also port such a library, fairly painlessly, from C to Java and obtain slightly slower performance (provided you use a good native-code Java compiler such as Asymetrix's SuperCede).

That said, any of these solutions would probably be a lot more work than it's worth if you already have a DB app that does the work for you.

My suggestion would be to use Java and JDBC, either as an applet or as a CGI app, to create the front end, and to use your Sybase DB as the back end for optimal performance.

Now, if what you are looking for is a way to improve the access time of the seeks, there was an article in Dr. Dobb's some years ago that contained a disk hashing algorithm in Pascal that could access records in a file of up to 1 GB in 2 disk accesses, and in 3 accesses for anything larger. Such an approach would yield no significant difference between C++ and Java, since processing time is kept to a minimum. Of course, porting from Pascal to Java is a bit more painful than from C++ to Java.
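To give a feel for disk hashing in general, here is a minimal sketch of an on-disk hash index with linear probing. It is not the algorithm from the Dr. Dobb's article, which I have not reproduced here; the class name `DiskHashIndex`, the tab-delimited key, and the 8-byte slot layout are all assumptions made for illustration. Each slot in the index file stores a record's byte offset plus one (zero means empty), so a lookup with no collision costs one index read and one data read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: fixed-size slot file pointing into a flat data file.
class DiskHashIndex implements AutoCloseable {
    private static final int SLOT_BYTES = 8;
    private final RandomAccessFile data;
    private final RandomAccessFile index;
    private final long slots;

    // Rebuilds the index file from scratch; slots must exceed the record count.
    DiskHashIndex(String dataPath, String indexPath, long slots) throws IOException {
        this.data = new RandomAccessFile(dataPath, "r");
        this.index = new RandomAccessFile(indexPath, "rw");
        this.slots = slots;
        index.setLength(0);
        byte[] zeros = new byte[8192];
        long remaining = slots * SLOT_BYTES;
        while (remaining > 0) { // write explicit zeros: 0 marks an empty slot
            int n = (int) Math.min(zeros.length, remaining);
            index.write(zeros, 0, n);
            remaining -= n;
        }
        build();
    }

    private void build() throws IOException {
        long start = 0;
        long count = 0;
        String line;
        data.seek(0);
        while ((line = data.readLine()) != null) {
            // Keep at least one slot empty so failed lookups terminate.
            if (++count >= slots) {
                throw new IllegalStateException("index too small for data file");
            }
            long slot = slotFor(keyOf(line));
            while (readSlot(slot) != 0) {
                slot = (slot + 1) % slots; // linear probing on collision
            }
            writeSlot(slot, start + 1);
            start = data.getFilePointer();
        }
    }

    // Returns the matching record, or null if the key is not present.
    String lookup(String key) throws IOException {
        long slot = slotFor(key);
        long stored;
        while ((stored = readSlot(slot)) != 0) {
            data.seek(stored - 1);
            String line = data.readLine();
            if (line != null && keyOf(line).equals(key)) {
                return line;
            }
            slot = (slot + 1) % slots;
        }
        return null;
    }

    // Assumption: the key is everything before the first tab.
    private static String keyOf(String line) {
        int tab = line.indexOf('\t');
        return tab >= 0 ? line.substring(0, tab) : line;
    }

    private long slotFor(String key) {
        return Math.floorMod((long) key.hashCode(), slots);
    }

    private long readSlot(long slot) throws IOException {
        index.seek(slot * SLOT_BYTES);
        return index.readLong();
    }

    private void writeSlot(long slot, long value) throws IOException {
        index.seek(slot * SLOT_BYTES);
        index.writeLong(value);
    }

    @Override
    public void close() throws IOException {
        data.close();
        index.close();
    }
}
```

Because nearly all the time goes to the two seeks rather than to computation, a structure like this should behave much the same in Java as in C++. Sizing the index at roughly twice the record count keeps probe chains short, at the cost of a larger index file.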

If you're interested in such an approach, I could dig up the article and send you the reference.

