Solved

index and search a HUGE text file w/Java

Posted on 1997-11-20
Last Modified: 2006-11-17
I have a huge flat text file in which each row is a record. It holds about 300 MILLION records and lives in a UNIX environment.

I want to be able to index this file and then search for a particular string of info...all done by Java.  

We can do the same with a Sybase database, but we want to see the performance using a Java-based search.

Does anyone know of any Java apps that will do this? How about non-Java apps for simply indexing and searching a file?

We want to keep the methods as close to the methods used to index the WWW as possible.


Much thanks to anyone who can help!

daniel_garrison@hotmail.com
Question by:dgarrison
1 Comment
 

Accepted Solution

by:
jpk041897 earned 50 total points
ID: 1231016
There are a couple of options you could use: either a hash table, or JDBC to index the data in a database.

Java is certainly not designed as a number-crunching language, so 300 million records will probably not even be supported by Java's hashing structures: an in-memory Hashtable with that many entries would need many gigabytes of RAM.
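As a rough illustration of the hash-table option, the index does not have to hold the records themselves, only a map from each key to the byte offset of its line in the flat file. The sketch below assumes one record per line, ASCII text, and that the key is whatever precedes the first tab; the post does not give the record format, so `OffsetIndex` and `keyOf` are hypothetical names. At 300 million records the map itself would still be too large for memory, so treat this as a small-scale demonstration of the idea rather than a solution.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maps each record's key to the byte offset of its line.
// Assumes one record per line and that the key is the text before the first
// tab character (an assumption; the original post does not give the format).
class OffsetIndex {
    private final Map<String, Long> offsets = new HashMap<>();
    private final String path;

    // Scans the file once and remembers where every line starts.
    OffsetIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long offset = file.getFilePointer();
            String line;
            // readLine() here is unbuffered and slow; acceptable for a sketch only.
            while ((line = file.readLine()) != null) {
                offsets.put(keyOf(line), offset); // a duplicate key keeps the last offset
                offset = file.getFilePointer();
            }
        }
    }

    // Extracts the assumed key: everything before the first tab, or the whole line.
    static String keyOf(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    // Returns the full record for the key, or null if the key is not indexed.
    String lookup(String key) throws IOException {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset); // one seek straight to the record
            return file.readLine();
        }
    }
}
```

Usage would be along the lines of `new OffsetIndex("records.txt").lookup("b2")`, where the file name and key are placeholders. Each lookup costs one hash probe in memory plus one seek on disk.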

Your performance using Java will certainly be slower than Sybase, since Sybase uses B+ tree search algorithms on its indexes. The average lookup time for one index is roughly the height of the tree (the logarithm of the number of records, taken to the base of the node fan-out) times the average disk seek time.
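To put a number on the index lookup cost: a B+ tree lookup reads one node per level of the tree, so the worst-case number of disk reads is the number of levels. The sketch below computes that for a given record count. The fan-out of 100 keys per node used in the usage note is an assumed figure for illustration, not something taken from Sybase documentation, and `BTreeHeight` is a hypothetical name.

```java
// Hypothetical sketch: estimates how many index levels (node reads) a
// B+ tree lookup touches for a given record count and node fan-out.
class BTreeHeight {
    // Returns the smallest h such that fanOut^h >= records.
    // Requires fanOut >= 2; uses integer arithmetic to avoid rounding issues.
    static int levels(long records, int fanOut) {
        int h = 0;
        long reachable = 1; // number of records addressable with h levels
        while (reachable < records) {
            reachable *= fanOut;
            h++;
        }
        return h;
    }
}
```

With the assumed fan-out, `BTreeHeight.levels(300_000_000L, 100)` returns 5, so a lookup touches at most five nodes, and fewer of those reads hit the disk when the upper levels of the tree stay cached in memory.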

Using JNI, you could make calls to a C-based indexing database such as Mix Software's C Database Toolchest or equivalent products, and obtain performance close to Sybase's.

You could also port such a library, fairly painlessly, from C to Java and obtain slightly slower performance (provided you use a good native-code Java compiler such as Asymetrix's SuperCede).

However, any of these solutions would probably be a lot more work than it's worth if you already have a DB app that does the work for you.

My suggestion would be to use Java and JDBC, either as an applet or as a CGI app, to create the front end, and use your Sybase DB as the back end for optimal performance.
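A minimal sketch of what the JDBC side of that front end could look like follows. The table name `records` and the columns `rec_key` and `rec_data` are hypothetical, as is the class name `RecordLookup`; the `Connection` would come from whichever Sybase JDBC driver you have installed, and obtaining it is left out here. The search itself runs inside the database, using its index on the key column.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: looks up one record by key through JDBC.
// Table and column names are placeholders, not taken from the original post.
class RecordLookup {
    static final String QUERY = "SELECT rec_data FROM records WHERE rec_key = ?";

    // Returns the record data for the key, or null if no row matches.
    static String find(Connection conn, String key) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setString(1, key); // bind the key instead of concatenating it into the SQL
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```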

Now, if what you are looking for is a way to improve the access time of the seeks, there was an article in Dr. Dobb's some years ago that contained a disk hashing algorithm in Pascal that could access any record in a file of up to 1 GByte in 2 disk accesses, and in 3 accesses for anything larger. Such an approach would yield no significant difference between C++ and Java, since processing time is kept to a minimum. Of course, porting from Pascal to Java is a bit more painful than from C++ to Java.
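To give a feel for disk hashing in general, here is a minimal sketch of a hashed index file with fixed-size slots and linear probing. It is not the algorithm from the Dr. Dobb's article, which is not reproduced here; the slot layout, the 16-byte key limit, and the class name `HashedFile` are all assumptions made for illustration. Each slot stores a zero-padded ASCII key plus the byte offset of the record in the data file, so a lookup in a lightly loaded table costs about one read of the index file plus one seek into the data file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of an on-disk hash table: fixed-size slots, linear probing.
// Slot layout (assumed): 16 key bytes, zero-padded, then an 8-byte record offset.
class HashedFile implements AutoCloseable {
    static final int KEY_BYTES = 16;
    static final int SLOT_BYTES = KEY_BYTES + 8;
    private final RandomAccessFile file;
    private final long slots;

    // Creates a fresh, empty table; any existing content at path is discarded.
    HashedFile(String path, long slots) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.slots = slots;
        file.setLength(0);
        byte[] empty = new byte[SLOT_BYTES];
        for (long i = 0; i < slots; i++) {
            file.write(empty); // a slot whose first key byte is 0 counts as empty
        }
    }

    private static byte[] pad(String key) {
        byte[] raw = key.getBytes(StandardCharsets.US_ASCII);
        if (raw.length == 0 || raw.length > KEY_BYTES) {
            throw new IllegalArgumentException("key must be 1.." + KEY_BYTES + " bytes");
        }
        return Arrays.copyOf(raw, KEY_BYTES);
    }

    // Slot where probing starts for this key.
    private long home(byte[] key) {
        return Math.floorMod((long) Arrays.hashCode(key), slots);
    }

    // Stores or overwrites the record offset for a key.
    void put(String key, long recordOffset) throws IOException {
        byte[] k = pad(key);
        byte[] slot = new byte[KEY_BYTES];
        for (long i = 0, s = home(k); i < slots; i++, s = (s + 1) % slots) {
            file.seek(s * SLOT_BYTES);
            file.readFully(slot);
            if (slot[0] == 0 || Arrays.equals(slot, k)) {
                file.seek(s * SLOT_BYTES);
                file.write(k);
                file.writeLong(recordOffset);
                return;
            }
        }
        throw new IllegalStateException("table full");
    }

    // Returns the stored record offset, or -1 if the key is absent.
    long get(String key) throws IOException {
        byte[] k = pad(key);
        byte[] slot = new byte[KEY_BYTES];
        for (long i = 0, s = home(k); i < slots; i++, s = (s + 1) % slots) {
            file.seek(s * SLOT_BYTES);
            file.readFully(slot);
            if (slot[0] == 0) return -1;                        // empty slot ends the probe
            if (Arrays.equals(slot, k)) return file.readLong(); // offset follows the key
        }
        return -1;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The table has to be sized well above the number of keys, otherwise probe sequences grow and the access count stops being small. Because nearly all the time goes into the seeks, the language doing the arithmetic matters little, which is the point made above.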

If you're interested in such an approach, I could dig up the article and send you the reference.

