Solved

Binary Search a File

Posted on 2004-09-17
18
380 Views
Last Modified: 2013-11-15
I have got a file and it serves as a dictionary with many word entries in it. An example of such file can be downloaded from ftp://puzzlers.org/pub/wordlists/soed.txt . I need a program that will perform a BINARY search on the file and return 1 if word is found and return a 0 if otherwise.

My function prototype is this
    int binary_search_dictionary(char *);
whereby the passed in value is the word to be searched.

I NEED SUGGESTIONS on how to solve it and need not be a complete program!
0
Comment
Question by:ansi_c
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 4
  • 4
  • +4
18 Comments
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12086835
First suggestion:
Your sample file has just 1 Mb. If this is your definitive case, then I recommend you to store full dictionary in memory. This will give you a far better performance.
0
 

Author Comment

by:ansi_c
ID: 12086938
I wish to check against the file and not wish to use arrays of strings.

I have got 1 problem and that is how to set the file pointer to point to the middle word. How?
0
 
LVL 9

Expert Comment

by:ankuratvb
ID: 12086979
You can use fseek to set the file pointer to the middle word.
0
Simplifying Server Workload Migrations

This use case outlines the migration challenges that organizations face and how the Acronis AnyData Engine supports physical-to-physical (P2P), physical-to-virtual (P2V), virtual to physical (V2P), and cross-virtual (V2V) migration scenarios to address these challenges.

 
LVL 9

Expert Comment

by:ankuratvb
ID: 12086990
But since the words are of variable lengths,you'd have trouble locating the start of the word i.e. the start of the line.
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087000
Well, you need an extra "index" file, you cannot point to the "middle" word without it.
You can do the following when calling the binary_search_dictionary() function:

- Check the existence of your index file, if not, built it containing offset and size of each word
- At least you have to load the index file in memory (in a static). If not, your binary search will be extremely slow. Index size could have 3 bytes (up to 16 million words) for offset and 1 byte for word size (up to 256 chars per word).
- To jump to an specific word (by order number), just go to proper index array element to obtain offset and size, the go to dictionary file and use fseek/fread to read the word.
- Proceed with the rest as usual.
 
0
 

Author Comment

by:ansi_c
ID: 12087004
I am aware of the fseek function.
fseek moves to the nth byte in the file as specified but then I am talking of words in a file.
How to move the file pointer to the middle word? Note that length of words are not fixed.
I can have this

Words
=====
Apple
Bananna
Cheese
Yam
Zebra


There are 5 words in the file and so file pointer should point to the third word Cheese and of course has got to point to the first character which is C.

How?
0
 
LVL 9

Expert Comment

by:ankuratvb
ID: 12087018
0
 

Author Comment

by:ansi_c
ID: 12087036
jaime,

To set the index, does this mean I have to go through the entire file line by line and build the index before I can do the binary search?
0
 
LVL 9

Expert Comment

by:ankuratvb
ID: 12087048
Once the index is created and loaded into memory,it can be used repeatedly till the program is running.

SO at the expense of going thru the file once,you gain for all the next times.
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087056
Yes, but this will be done once. Remember when you make some word search in Windows help. The index is built only for the first time access to an specific help file.
0
 

Author Comment

by:ansi_c
ID: 12087109
But what if I wish to run the search only once?
That's the index is not needed.
Is it still possible to go get to the middle word using functions like ftell, fscanf, and fseek?
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087175
Only one word? Once in the life? I guess that is not a real application.
Anyway, you can construct and distribute the dictionary and index together. Don't need to construct when calling your search function, you can make it before, outside your application.
0
 
LVL 30

Expert Comment

by:Axter
ID: 12090478
You could just use file mapping, and then search for the word as you would search a memory buffer.

File mapping example for Win32:

if (INVALID_HANDLE_VALUE != (m_FileHandle = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)))
{
      if ((m_MapViewHandle = CreateFileMapping(m_FileHandle, NULL, PAGE_READONLY, 0,0, NULL)) != NULL)
      {
            m_Begin = (LPCTSTR)MapViewOfFile(m_MapViewHandle, FILE_MAP_READ, 0, 0, 0);
            char* FoundPos = strstr(m_Begin, "'sblood");
      }
}
0
 
LVL 30

Expert Comment

by:Axter
ID: 12090483
If you're programming in Linux/UNIX you can use POSIX file mapping functions.
0
 
LVL 2

Expert Comment

by:sneeuw_chan
ID: 12094865
If you really want to make a function that *only* uses fseek, and does everything from the file, that's not too hard.

Just use a standard binary search, but skip past the next newline before reading the word to be compared.

An example: (Looking for 'Beer')  The digits from 0-9 show the positions of the 'start' and 'end' pointers of the bsearch.
Step 1: skip past 'heese ' and compare against 'Yam'.  Result: smaller.
Step 2: skip past 'anana ' and compare against 'Cheese'.  Result: smaller.
Step 3: skip past 'le ' and compare against 'Banana'.  Result: greater.
Step 4: skip past ' ' and compare against 'Banana'.  Result: greater.
Step 5: skip past 'Banana ' and compare against 'Cheese'.  Result: smaller.
difference between pointers is 1, so the word is not found.

A - 012
p
p
l -        3
e
\n         45
B-          5
a -   234
n
a
n
a
\n
C
h - 1
e
e
s
e
\n
Y
a
m
\n
Z
e
b
r
a
EOF - 0
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12099699
Are you sure you want to use binary search only ?
How many searches do you need to perform? Is it worth building an index for guided search? Have you considered grep as an option?

If all you want is to position FILE * to the middle line, you can do something like...

stat() the file
get the size
divide by 2 and fseek to the result
keep advancing FILE * until you find a \n ...

Now you are approximately in the middle of the file but it may not necessarily be middle line

Alternatively, you might have to read in file in a loop and count all \n and then fseek to the coount/2 number of \n s

something like

count=0;
while ( fread(..buffer..) != 0 )
{
         temp1=buffer;
         while(1)
         {
                  if (*temp1=='\n')
                       count++;
                  if ( end of buffer/data)
                       break;
           }
}
and then again reread the file to (count/2) '\n' ... not worth the effort for one search. If you have mutiple searches, then may be you can build a sort of cache or index as you search the long way.
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 100 total points
ID: 12100026
You have three choices:

1. Index the file. Remember that you could always add the index to the beginning of the file instyead of making a sacond file.
2. Fixed-Length the records. I.E. Make a copy of the file where every word is padded out to a known fixed length.
3. Binary search reading blocks.

Let me expand a little on '3'.

The problem with a raw list of words is that they are different lengths so you cant specifically find the 'n'th word. But your question demands only that we guarantee to detect a word if it is there so we only have to either find it or prove that it is not there and we've done our job.

The algorithm therefore reads as follows:

1. Read a block from the middle of the area you are looking.
2. If your word should appear before the first word in the block, set the end of your area to before the first word in the block.
Otherwise, If your word should appear after the last word in the block, set the start of your area to after the last word in the block.
Otherwise, the word should be in the block.
3. Repeat from 1.

This is simple to describe but you must be careful with your file offsets as mistakes can hide for a long time here. Your test data should at least search for the first and last word in the file. You might find it better to test with the whole of the original word list.

Paul

0
 
LVL 2

Expert Comment

by:sneeuw_chan
ID: 12100347
You have a fourth choice: Binary search reading characters. Like I described before.
It's actually quite simple:  You have to map each charater in the file to a word in the file, so that all words are mapped, and then do a simple binary search on the character positions.
Mapping a character to a word is simple: just skip forward until you read a newline.  That way, all words are mapped by a character (except the first).  To map the first, you make position 0 a special case and don't skip from there.  The only way this doesn't work is if the file starts with an empty line.

This algorithm is equivalent to binary searching a list where each word is copied a number of times, so you can be sure it works, as long as you can prove that all words can be reached from a position.
0

Featured Post

Salesforce Has Never Been Easier

Improve and reinforce salesforce training & adoption using WalkMe's digital adoption platform. Start saving on costly employee training by creating fast intuitive Walk-Thrus for Salesforce. Claim your Free Account Now

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Cognitive Services 1 73
downlod failures 6 91
getting movie maker for windows 10 8 170
how to export quicken data 2 25
If your app took Google’s lash recently, here are the 5 most likely reasons.
All of the resources available today make learning a new digital media easier than ever-- if you know where to begin. This is a clear, simple guide to a few of the basic digital art mediums and how to begin learning them on your own.
The viewer will learn how to set up a document for the web and print and the recommended PPI for printing.
An overview on how to enroll an hourly employee into the employee database and how to give them access into the clock in terminal.

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question