Solved

Binary Search a File

Posted on 2004-09-17
18
338 Views
Last Modified: 2013-11-15
I have got a file and it serves as a dictionary with many word entries in it. An example of such file can be downloaded from ftp://puzzlers.org/pub/wordlists/soed.txt . I need a program that will perform a BINARY search on the file and return 1 if word is found and return a 0 if otherwise.

My function prototype is this
    int binary_search_dictionary(char *);
whereby the passed in value is the word to be searched.

I NEED SUGGESTIONS on how to solve it and need not be a complete program!
0
Comment
Question by:ansi_c
  • 4
  • 4
  • 4
  • +4
18 Comments
 
LVL 55

Expert Comment

by:Jaime Olivares
Comment Utility
First suggestion:
Your sample file has just 1 Mb. If this is your definitive case, then I recommend you to store full dictionary in memory. This will give you a far better performance.
0
 

Author Comment

by:ansi_c
Comment Utility
I wish to check against the file and not wish to use arrays of strings.

I have got 1 problem and that is how to set the file pointer to point to the middle word. How?
0
 
LVL 9

Expert Comment

by:ankuratvb
Comment Utility
You can use fseek to set the file pointer to the middle word.
0
 
LVL 9

Expert Comment

by:ankuratvb
Comment Utility
But since the words are of variable lengths,you'd have trouble locating the start of the word i.e. the start of the line.
0
 
LVL 55

Expert Comment

by:Jaime Olivares
Comment Utility
Well, you need an extra "index" file, you cannot point to the "middle" word without it.
You can do the following when calling the binary_search_dictionary() function:

- Check the existence of your index file, if not, built it containing offset and size of each word
- At least you have to load the index file in memory (in a static). If not, your binary search will be extremely slow. Index size could have 3 bytes (up to 16 million words) for offset and 1 byte for word size (up to 256 chars per word).
- To jump to an specific word (by order number), just go to proper index array element to obtain offset and size, the go to dictionary file and use fseek/fread to read the word.
- Proceed with the rest as usual.
 
0
 

Author Comment

by:ansi_c
Comment Utility
I am aware of the fseek function.
fseek moves to the nth byte in the file as specified but then I am talking of words in a file.
How to move the file pointer to the middle word? Note that length of words are not fixed.
I can have this

Words
=====
Apple
Bananna
Cheese
Yam
Zebra


There are 5 words in the file and so file pointer should point to the third word Cheese and of course has got to point to the first character which is C.

How?
0
 
LVL 9

Expert Comment

by:ankuratvb
Comment Utility
0
 

Author Comment

by:ansi_c
Comment Utility
jaime,

To set the index, does this mean I have to go through the entire file line by line and build the index before I can do the binary search?
0
 
LVL 9

Expert Comment

by:ankuratvb
Comment Utility
Once the index is created and loaded into memory,it can be used repeatedly till the program is running.

SO at the expense of going thru the file once,you gain for all the next times.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 55

Expert Comment

by:Jaime Olivares
Comment Utility
Yes, but this will be done once. Remember when you make some word search in Windows help. The index is built only for the first time access to an specific help file.
0
 

Author Comment

by:ansi_c
Comment Utility
But what if I wish to run the search only once?
That's the index is not needed.
Is it still possible to go get to the middle word using functions like ftell, fscanf, and fseek?
0
 
LVL 55

Expert Comment

by:Jaime Olivares
Comment Utility
Only one word? Once in the life? I guess that is not a real application.
Anyway, you can construct and distribute the dictionary and index together. Don't need to construct when calling your search function, you can make it before, outside your application.
0
 
LVL 30

Expert Comment

by:Axter
Comment Utility
You could just use file mapping, and then search for the word as you would search a memory buffer.

File mapping example for Win32:

if (INVALID_HANDLE_VALUE != (m_FileHandle = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)))
{
      if ((m_MapViewHandle = CreateFileMapping(m_FileHandle, NULL, PAGE_READONLY, 0,0, NULL)) != NULL)
      {
            m_Begin = (LPCTSTR)MapViewOfFile(m_MapViewHandle, FILE_MAP_READ, 0, 0, 0);
            char* FoundPos = strstr(m_Begin, "'sblood");
      }
}
0
 
LVL 30

Expert Comment

by:Axter
Comment Utility
If you're programming in Linux/UNIX you can use POSIX file mapping functions.
0
 
LVL 2

Expert Comment

by:sneeuw_chan
Comment Utility
If you really want to make a function that *only* uses fseek, and does everything from the file, that's not too hard.

Just use a standard binary search, but skip past the next newline before reading the word to be compared.

An example: (Looking for 'Beer')  The digits from 0-9 show the positions of the 'start' and 'end' pointers of the bsearch.
Step 1: skip past 'heese ' and compare against 'Yam'.  Result: smaller.
Step 2: skip past 'anana ' and compare against 'Cheese'.  Result: smaller.
Step 3: skip past 'le ' and compare against 'Banana'.  Result: greater.
Step 4: skip past ' ' and compare against 'Banana'.  Result: greater.
Step 5: skip past 'Banana ' and compare against 'Cheese'.  Result: smaller.
difference between pointers is 1, so the word is not found.

A - 012
p
p
l -        3
e
\n         45
B-          5
a -   234
n
a
n
a
\n
C
h - 1
e
e
s
e
\n
Y
a
m
\n
Z
e
b
r
a
EOF - 0
0
 
LVL 45

Expert Comment

by:sunnycoder
Comment Utility
Are you sure you want to use binary search only ?
How many searches do you need to perform? Is it worth building an index for guided search? Have you considered grep as an option?

If all you want is to position FILE * to the middle line, you can do something like...

stat() the file
get the size
divide by 2 and fseek to the result
keep advancing FILE * until you find a \n ...

Now you are approximately in the middle of the file but it may not necessarily be middle line

Alternatively, you might have to read in file in a loop and count all \n and then fseek to the coount/2 number of \n s

something like

count=0;
while ( fread(..buffer..) != 0 )
{
         temp1=buffer;
         while(1)
         {
                  if (*temp1=='\n')
                       count++;
                  if ( end of buffer/data)
                       break;
           }
}
and then again reread the file to (count/2) '\n' ... not worth the effort for one search. If you have mutiple searches, then may be you can build a sort of cache or index as you search the long way.
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 100 total points
Comment Utility
You have three choices:

1. Index the file. Remember that you could always add the index to the beginning of the file instyead of making a sacond file.
2. Fixed-Length the records. I.E. Make a copy of the file where every word is padded out to a known fixed length.
3. Binary search reading blocks.

Let me expand a little on '3'.

The problem with a raw list of words is that they are different lengths so you cant specifically find the 'n'th word. But your question demands only that we guarantee to detect a word if it is there so we only have to either find it or prove that it is not there and we've done our job.

The algorithm therefore reads as follows:

1. Read a block from the middle of the area you are looking.
2. If your word should appear before the first word in the block, set the end of your area to before the first word in the block.
Otherwise, If your word should appear after the last word in the block, set the start of your area to after the last word in the block.
Otherwise, the word should be in the block.
3. Repeat from 1.

This is simple to describe but you must be careful with your file offsets as mistakes can hide for a long time here. Your test data should at least search for the first and last word in the file. You might find it better to test with the whole of the original word list.

Paul

0
 
LVL 2

Expert Comment

by:sneeuw_chan
Comment Utility
You have a fourth choice: Binary search reading characters. Like I described before.
It's actually quite simple:  You have to map each charater in the file to a word in the file, so that all words are mapped, and then do a simple binary search on the character positions.
Mapping a character to a word is simple: just skip forward until you read a newline.  That way, all words are mapped by a character (except the first).  To map the first, you make position 0 a special case and don't skip from there.  The only way this doesn't work is if the file starts with an empty line.

This algorithm is equivalent to binary searching a list where each word is copied a number of times, so you can be sure it works, as long as you can prove that all words can be reached from a position.
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Storage devices are generally used to save the data or sometime transfer the data from one computer system to another system. However, sometimes user accidentally erased their important data from the Storage devices. Users have to know how data reco…
In this article, you will read about the trends across the human resources departments for the upcoming year. Some of them include improving employee experience, adopting new technologies, using HR software to its full extent, and integrating artifi…
The goal of this video is to provide viewers with basic examples to understand and use switch statements in the C programming language.
Viewers will learn how to use the Hootsuite Dashboard.

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now