?
Solved

Binary Search a File

Posted on 2004-09-17
18
Medium Priority
?
393 Views
Last Modified: 2013-11-15
I have got a file and it serves as a dictionary with many word entries in it. An example of such file can be downloaded from ftp://puzzlers.org/pub/wordlists/soed.txt . I need a program that will perform a BINARY search on the file and return 1 if word is found and return a 0 if otherwise.

My function prototype is this
    int binary_search_dictionary(char *);
whereby the passed in value is the word to be searched.

I NEED SUGGESTIONS on how to solve it and need not be a complete program!
0
Comment
Question by:ansi_c
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 4
  • 4
  • +4
18 Comments
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12086835
First suggestion:
Your sample file has just 1 Mb. If this is your definitive case, then I recommend you to store full dictionary in memory. This will give you a far better performance.
0
 

Author Comment

by:ansi_c
ID: 12086938
I wish to check against the file and not wish to use arrays of strings.

I have got 1 problem and that is how to set the file pointer to point to the middle word. How?
0
 
LVL 9

Expert Comment

by:ankuratvb
ID: 12086979
You can use fseek to set the file pointer to the middle word.
0
Build and deliver software with DevOps

A digital transformation requires faster time to market, shorter software development lifecycles, and the ability to adapt rapidly to changing customer demands. DevOps provides the solution.

 
LVL 9

Expert Comment

by:ankuratvb
ID: 12086990
But since the words are of variable lengths,you'd have trouble locating the start of the word i.e. the start of the line.
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087000
Well, you need an extra "index" file, you cannot point to the "middle" word without it.
You can do the following when calling the binary_search_dictionary() function:

- Check the existence of your index file, if not, built it containing offset and size of each word
- At least you have to load the index file in memory (in a static). If not, your binary search will be extremely slow. Index size could have 3 bytes (up to 16 million words) for offset and 1 byte for word size (up to 256 chars per word).
- To jump to an specific word (by order number), just go to proper index array element to obtain offset and size, the go to dictionary file and use fseek/fread to read the word.
- Proceed with the rest as usual.
 
0
 

Author Comment

by:ansi_c
ID: 12087004
I am aware of the fseek function.
fseek moves to the nth byte in the file as specified but then I am talking of words in a file.
How to move the file pointer to the middle word? Note that length of words are not fixed.
I can have this

Words
=====
Apple
Bananna
Cheese
Yam
Zebra


There are 5 words in the file and so file pointer should point to the third word Cheese and of course has got to point to the first character which is C.

How?
0
 

Author Comment

by:ansi_c
ID: 12087036
jaime,

To set the index, does this mean I have to go through the entire file line by line and build the index before I can do the binary search?
0
 
LVL 9

Expert Comment

by:ankuratvb
ID: 12087048
Once the index is created and loaded into memory,it can be used repeatedly till the program is running.

SO at the expense of going thru the file once,you gain for all the next times.
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087056
Yes, but this will be done once. Remember when you make some word search in Windows help. The index is built only for the first time access to an specific help file.
0
 

Author Comment

by:ansi_c
ID: 12087109
But what if I wish to run the search only once?
That's the index is not needed.
Is it still possible to go get to the middle word using functions like ftell, fscanf, and fseek?
0
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 12087175
Only one word? Once in the life? I guess that is not a real application.
Anyway, you can construct and distribute the dictionary and index together. Don't need to construct when calling your search function, you can make it before, outside your application.
0
 
LVL 30

Expert Comment

by:Axter
ID: 12090478
You could just use file mapping, and then search for the word as you would search a memory buffer.

File mapping example for Win32:

if (INVALID_HANDLE_VALUE != (m_FileHandle = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)))
{
      if ((m_MapViewHandle = CreateFileMapping(m_FileHandle, NULL, PAGE_READONLY, 0,0, NULL)) != NULL)
      {
            m_Begin = (LPCTSTR)MapViewOfFile(m_MapViewHandle, FILE_MAP_READ, 0, 0, 0);
            char* FoundPos = strstr(m_Begin, "'sblood");
      }
}
0
 
LVL 30

Expert Comment

by:Axter
ID: 12090483
If you're programming in Linux/UNIX you can use POSIX file mapping functions.
0
 
LVL 2

Expert Comment

by:sneeuw_chan
ID: 12094865
If you really want to make a function that *only* uses fseek, and does everything from the file, that's not too hard.

Just use a standard binary search, but skip past the next newline before reading the word to be compared.

An example: (Looking for 'Beer')  The digits from 0-9 show the positions of the 'start' and 'end' pointers of the bsearch.
Step 1: skip past 'heese ' and compare against 'Yam'.  Result: smaller.
Step 2: skip past 'anana ' and compare against 'Cheese'.  Result: smaller.
Step 3: skip past 'le ' and compare against 'Banana'.  Result: greater.
Step 4: skip past ' ' and compare against 'Banana'.  Result: greater.
Step 5: skip past 'Banana ' and compare against 'Cheese'.  Result: smaller.
difference between pointers is 1, so the word is not found.

A - 012
p
p
l -        3
e
\n         45
B-          5
a -   234
n
a
n
a
\n
C
h - 1
e
e
s
e
\n
Y
a
m
\n
Z
e
b
r
a
EOF - 0
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12099699
Are you sure you want to use binary search only ?
How many searches do you need to perform? Is it worth building an index for guided search? Have you considered grep as an option?

If all you want is to position FILE * to the middle line, you can do something like...

stat() the file
get the size
divide by 2 and fseek to the result
keep advancing FILE * until you find a \n ...

Now you are approximately in the middle of the file but it may not necessarily be middle line

Alternatively, you might have to read in file in a loop and count all \n and then fseek to the coount/2 number of \n s

something like

count=0;
while ( fread(..buffer..) != 0 )
{
         temp1=buffer;
         while(1)
         {
                  if (*temp1=='\n')
                       count++;
                  if ( end of buffer/data)
                       break;
           }
}
and then again reread the file to (count/2) '\n' ... not worth the effort for one search. If you have mutiple searches, then may be you can build a sort of cache or index as you search the long way.
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 300 total points
ID: 12100026
You have three choices:

1. Index the file. Remember that you could always add the index to the beginning of the file instyead of making a sacond file.
2. Fixed-Length the records. I.E. Make a copy of the file where every word is padded out to a known fixed length.
3. Binary search reading blocks.

Let me expand a little on '3'.

The problem with a raw list of words is that they are different lengths so you cant specifically find the 'n'th word. But your question demands only that we guarantee to detect a word if it is there so we only have to either find it or prove that it is not there and we've done our job.

The algorithm therefore reads as follows:

1. Read a block from the middle of the area you are looking.
2. If your word should appear before the first word in the block, set the end of your area to before the first word in the block.
Otherwise, If your word should appear after the last word in the block, set the start of your area to after the last word in the block.
Otherwise, the word should be in the block.
3. Repeat from 1.

This is simple to describe but you must be careful with your file offsets as mistakes can hide for a long time here. Your test data should at least search for the first and last word in the file. You might find it better to test with the whole of the original word list.

Paul

0
 
LVL 2

Expert Comment

by:sneeuw_chan
ID: 12100347
You have a fourth choice: Binary search reading characters. Like I described before.
It's actually quite simple:  You have to map each charater in the file to a word in the file, so that all words are mapped, and then do a simple binary search on the character positions.
Mapping a character to a word is simple: just skip forward until you read a newline.  That way, all words are mapped by a character (except the first).  To map the first, you make position 0 a special case and don't skip from there.  The only way this doesn't work is if the file starts with an empty line.

This algorithm is equivalent to binary searching a list where each word is copied a number of times, so you can be sure it works, as long as you can prove that all words can be reached from a position.
0

Featured Post

Use Case: Protecting a Hybrid Cloud Infrastructure

Microsoft Azure is rapidly becoming the norm in dynamic IT environments. This document describes the challenges that organizations face when protecting data in a hybrid cloud IT environment and presents a use case to demonstrate how Acronis Backup protects all data.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this article, you will read about the trends across the human resources departments for the upcoming year. Some of them include improving employee experience, adopting new technologies, using HR software to its full extent, and integrating artifi…
There are many software programs on offer that will claim to magically speed up your computer. The best advice I can give you is to avoid them like the plague, because they will often cause far more problems than they solve. Try some of these "do it…
An overview on how to enroll an hourly employee into the employee database and how to give them access into the clock in terminal.
This is used to tweak the memory usage for your computer, it is used for servers more so than workstations but just be careful editing registry settings as it may cause irreversible results. I hold no responsibility for anything you do to the regist…

719 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question