Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 399
  • Last Modified:

Binary Search a File

I have got a file and it serves as a dictionary with many word entries in it. An example of such file can be downloaded from ftp://puzzlers.org/pub/wordlists/soed.txt . I need a program that will perform a BINARY search on the file and return 1 if word is found and return a 0 if otherwise.

My function prototype is this
    int binary_search_dictionary(char *);
whereby the passed in value is the word to be searched.

I NEED SUGGESTIONS on how to solve it and need not be a complete program!
0
ansi_c
Asked:
ansi_c
  • 4
  • 4
  • 4
  • +4
1 Solution
 
Jaime OlivaresCommented:
First suggestion:
Your sample file has just 1 Mb. If this is your definitive case, then I recommend you to store full dictionary in memory. This will give you a far better performance.
0
 
ansi_cAuthor Commented:
I wish to check against the file and not wish to use arrays of strings.

I have got 1 problem and that is how to set the file pointer to point to the middle word. How?
0
 
ankuratvbCommented:
You can use fseek to set the file pointer to the middle word.
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
ankuratvbCommented:
But since the words are of variable lengths,you'd have trouble locating the start of the word i.e. the start of the line.
0
 
Jaime OlivaresCommented:
Well, you need an extra "index" file, you cannot point to the "middle" word without it.
You can do the following when calling the binary_search_dictionary() function:

- Check the existence of your index file, if not, built it containing offset and size of each word
- At least you have to load the index file in memory (in a static). If not, your binary search will be extremely slow. Index size could have 3 bytes (up to 16 million words) for offset and 1 byte for word size (up to 256 chars per word).
- To jump to an specific word (by order number), just go to proper index array element to obtain offset and size, the go to dictionary file and use fseek/fread to read the word.
- Proceed with the rest as usual.
 
0
 
ansi_cAuthor Commented:
I am aware of the fseek function.
fseek moves to the nth byte in the file as specified but then I am talking of words in a file.
How to move the file pointer to the middle word? Note that length of words are not fixed.
I can have this

Words
=====
Apple
Bananna
Cheese
Yam
Zebra


There are 5 words in the file and so file pointer should point to the third word Cheese and of course has got to point to the first character which is C.

How?
0
 
ansi_cAuthor Commented:
jaime,

To set the index, does this mean I have to go through the entire file line by line and build the index before I can do the binary search?
0
 
ankuratvbCommented:
Once the index is created and loaded into memory,it can be used repeatedly till the program is running.

SO at the expense of going thru the file once,you gain for all the next times.
0
 
Jaime OlivaresCommented:
Yes, but this will be done once. Remember when you make some word search in Windows help. The index is built only for the first time access to an specific help file.
0
 
ansi_cAuthor Commented:
But what if I wish to run the search only once?
That's the index is not needed.
Is it still possible to go get to the middle word using functions like ftell, fscanf, and fseek?
0
 
Jaime OlivaresCommented:
Only one word? Once in the life? I guess that is not a real application.
Anyway, you can construct and distribute the dictionary and index together. Don't need to construct when calling your search function, you can make it before, outside your application.
0
 
AxterCommented:
You could just use file mapping, and then search for the word as you would search a memory buffer.

File mapping example for Win32:

if (INVALID_HANDLE_VALUE != (m_FileHandle = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)))
{
      if ((m_MapViewHandle = CreateFileMapping(m_FileHandle, NULL, PAGE_READONLY, 0,0, NULL)) != NULL)
      {
            m_Begin = (LPCTSTR)MapViewOfFile(m_MapViewHandle, FILE_MAP_READ, 0, 0, 0);
            char* FoundPos = strstr(m_Begin, "'sblood");
      }
}
0
 
AxterCommented:
If you're programming in Linux/UNIX you can use POSIX file mapping functions.
0
 
sneeuw_chanCommented:
If you really want to make a function that *only* uses fseek, and does everything from the file, that's not too hard.

Just use a standard binary search, but skip past the next newline before reading the word to be compared.

An example: (Looking for 'Beer')  The digits from 0-9 show the positions of the 'start' and 'end' pointers of the bsearch.
Step 1: skip past 'heese ' and compare against 'Yam'.  Result: smaller.
Step 2: skip past 'anana ' and compare against 'Cheese'.  Result: smaller.
Step 3: skip past 'le ' and compare against 'Banana'.  Result: greater.
Step 4: skip past ' ' and compare against 'Banana'.  Result: greater.
Step 5: skip past 'Banana ' and compare against 'Cheese'.  Result: smaller.
difference between pointers is 1, so the word is not found.

A - 012
p
p
l -        3
e
\n         45
B-          5
a -   234
n
a
n
a
\n
C
h - 1
e
e
s
e
\n
Y
a
m
\n
Z
e
b
r
a
EOF - 0
0
 
sunnycoderCommented:
Are you sure you want to use binary search only ?
How many searches do you need to perform? Is it worth building an index for guided search? Have you considered grep as an option?

If all you want is to position FILE * to the middle line, you can do something like...

stat() the file
get the size
divide by 2 and fseek to the result
keep advancing FILE * until you find a \n ...

Now you are approximately in the middle of the file but it may not necessarily be middle line

Alternatively, you might have to read in file in a loop and count all \n and then fseek to the coount/2 number of \n s

something like

count=0;
while ( fread(..buffer..) != 0 )
{
         temp1=buffer;
         while(1)
         {
                  if (*temp1=='\n')
                       count++;
                  if ( end of buffer/data)
                       break;
           }
}
and then again reread the file to (count/2) '\n' ... not worth the effort for one search. If you have mutiple searches, then may be you can build a sort of cache or index as you search the long way.
0
 
PaulCaswellCommented:
You have three choices:

1. Index the file. Remember that you could always add the index to the beginning of the file instyead of making a sacond file.
2. Fixed-Length the records. I.E. Make a copy of the file where every word is padded out to a known fixed length.
3. Binary search reading blocks.

Let me expand a little on '3'.

The problem with a raw list of words is that they are different lengths so you cant specifically find the 'n'th word. But your question demands only that we guarantee to detect a word if it is there so we only have to either find it or prove that it is not there and we've done our job.

The algorithm therefore reads as follows:

1. Read a block from the middle of the area you are looking.
2. If your word should appear before the first word in the block, set the end of your area to before the first word in the block.
Otherwise, If your word should appear after the last word in the block, set the start of your area to after the last word in the block.
Otherwise, the word should be in the block.
3. Repeat from 1.

This is simple to describe but you must be careful with your file offsets as mistakes can hide for a long time here. Your test data should at least search for the first and last word in the file. You might find it better to test with the whole of the original word list.

Paul

0
 
sneeuw_chanCommented:
You have a fourth choice: Binary search reading characters. Like I described before.
It's actually quite simple:  You have to map each charater in the file to a word in the file, so that all words are mapped, and then do a simple binary search on the character positions.
Mapping a character to a word is simple: just skip forward until you read a newline.  That way, all words are mapped by a character (except the first).  To map the first, you make position 0 a special case and don't skip from there.  The only way this doesn't work is if the file starts with an empty line.

This algorithm is equivalent to binary searching a list where each word is copied a number of times, so you can be sure it works, as long as you can prove that all words can be reached from a position.
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 4
  • 4
  • 4
  • +4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now