Solved

Sorting and searching in c++

Posted on 2009-04-02
17
808 Views
Last Modified: 2013-11-17
I am fascinated by the sort and search method of Dbase files.  In this files with extension .idx are created which

store information about the records in the main file.  Indexing may be done on one field or more than one fields.

For e.g. a record may contain name and salary.  Records can be indexed on name or salary or both.  If one  is asked to search names begining with A, all records with name starting with A are displayed or if the query is AC, all records starting with names AC are displayed.  In this example the primary index is name and secoddary is salary.

Through this process retrieval is also fast.

If a new record is added the index file is updated automatically.

Kindly tell me the steps so that I may develop the algorithm using C++.


Regards,


Abdul Mujeeb
0
Comment
Question by:abdul_mujeeb
  • 8
  • 7
17 Comments
 
LVL 40

Expert Comment

by:mrjoltcola
ID: 24048886
Database indexing like this is implemented using B-trees and Hashes. The most common is the B-tree

http://en.wikipedia.org/wiki/B-tree

You might start with buying a book on File Structures. This is one of the core classes in Computer Science, and there are some good books out there for this. I actually have "File Structures: An Object-Oriented Approach with C++"

http://www.amazon.com/File-Structures-Object-Oriented-Approach-C/dp/0201874016

And also a more traditional C one that I actually used when in college, that is highly recommendec:

http://www.amazon.com/File-Structures-Michael-J-Folk/dp/0201557134/ref=pd_sim_b_2
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24049233
>>>> so that I may develop the algorithm using C++.

What's your purpose? Developing a database? Or use indexing mechanisms for transient data in arrays of structures?
0
 

Author Comment

by:abdul_mujeeb
ID: 24056860
hello itsmeandnobodyelse

My purpose is to use indexing mechanism.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24057228
If you don't have to store the data into physical files, things are easier cause you can operate with pointers in memory no matter how far the used pointer addresses were spreaded. If writing to database files things are different cause the index data and real data must be compact.

A good container to start with is a sorted array. Here you can search with a binary search (take the middle index and check whether your search value is in the upper or lower part, then take that and again check the middle, and so on).

Then go on with a hash container, where you calculate a hash value from the keys thus finding the begin of a linked list where all entries with the same hash belong to.

Next would be a balanced binary tree, e. g. an AVL tree where none of the branches is much shorter or longer than that of the other sibling branch.  
0
 

Author Comment

by:abdul_mujeeb
ID: 24059044
I have to keep the data files and index files seperate.  Please tell me the steps to store data, let us say in .dat file and indexed records in .idx file
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24100289
struct HeadRecord
{
     int recsize;        // recordsize in bytes
     int maxrecs;      // max records in db (including head and bitmap)
     int maxbmrecs; // max bitmap records
     int offsetbm;     // offset to bitmap records
     int offsetdata;   // offset to data records
     int nextbmrec;   // next bitmap record
 
 };

struct IndexHeadRecord
{
     int recsize;
     int maxrecs;
};

struct IndexRecord
{
     unsigned int     hash;         // hash code for that record
     unsigned int     used;         // used dwords in record
     unsigned int     prev;          // back pointer to previous of same hash
     unsigned int     buf[253];   // buffer for entries
     unsigned int     next;          // forward pointer to next of same hash
};

// note, if the prev is 0 , it is the begin of a list to same hash
// note, a record can be used for other hash codes as well (if prev is full)

struct IndexEntry
{
    unsigned short lenEntry;    // total length of entry in dwords
    unsigned short indexId;     // index id
    unsigned int     rec;            // record number in DB
    unsigned int     buf[1];       // buffer for key data (variable size)
}


struct Record
{
     int id;
     int age;
     char firstname[32];
     char lastname[32];
 };

Record recs[] = { { 1, 15, "Tom", "Sawyer", },
                            { 2, 35, "Mary", "Higgins", },
                            { 3, 20, "Lewis", "Hamilton", },
                            { 4,  67, "Alan", "Greenspan", },
                          };

// Assuming there is an index on column 1 and a second index on both names.

int createDb()
{
  ofstream ofdb("db.dat", ios::binary | ios:out);
  ofstream ofidx("db.idx", ios::binary | ios:out);
  // write head record
  HeadRecord head = { 0 };
  head.recsize  = sizeof(Record);
  head.maxrecs = 1024;
  head.maxbmrecs  =  (head.maxrecs /  (head.recsize*8) ) + 1  ;
  head.offsetbm = 1;
  head.offsetbm = 1 + head.maxbmrecs;
  head.nextbmrec = 2;

  Record  hrec = { 0 };
  memcpy(&hrec, &head, sizeof(HeadRecord));
  // write head
  if (!ofdb.write((char*)&hrec, sizeof(Record))) return GetLastError();
  // write all other records as zeros
  memset(&hrec, 0, sizeof(Record));
  for (int i = 1; i < head.maxrecs; ++i)
     if (!ofdb.write((char*)&hrec, sizeof(Record))) return GetLastError();
  ofdb.close();

  // the index file was organised via hash
  IndexHeadRecord ihr = { 0 };
  ihr.recsize    = sizeof(IndexRecord);
  ihr.maxRecs = 100;  

  IndexRecord  ihrec = { 0 };
  memcpy(&ihrec, &ihr, sizeof(IndexHeadRecord));
  // write head
  if (!ofidx.write((char*)&ihrec, sizeof(IndexRecord))) return GetLastError();
  // write all other records as zeros
  memset(&ihrec, 0, sizeof(IndexRecord));
  for (int j = 1; j < ihr.maxrecs; ++j)
     if (!ofidx.write((char*)&ihrec, sizeof(IndexRecord))) return GetLastError();
  ofidx.close();

 
  return 0;
}

// ok the above has created the database and the index database

// to write a record with data now, you need to read the head records
// of both files (you would keep them in memory once read)
// get the nextbmrec from head and check for the next free bit
// if none in the record, increment nextbmrec and update head record
// read next bitmap record
// if found free bit, calculate record number from bit.
// set file position to begin of record and write data record binary to db
// for each index in record  
//    - get key (of that index) from Record.  
//    - calculate hash (record) number from key in the range of (2, ihrec.maxrecs)
//    - read the index record the hash is pointing to
//    - check if record has enough space
//      if not, search next free record sequentially and do the chaining
//    - add new entry to index record and store updated index record



// to search for a record via index
// calculate the hash for the index key
// read the index records to that hash and search for entries
// where the key(s) were matching exactly
// read the data records these index entries were pointing to



 
 
 
0
 

Author Comment

by:abdul_mujeeb
ID: 24177845
You are really a genius.  No one has given a reply like this.  I have been successful in creating both the dat and idx files.  However, I have not been able to follow the logic of the code.  Please explain the concept of head record , bitmap record, offset to head record an offset to bitmap record..  if you can explain the logic of indexing, it will help me.  I will be able to understand better if you explain it through a drawing or a chart mentioning the steps taken when a records is added to a dat file and its updation in the idx file.  similarly when a record is to be searched, the steps to be performed.

Apart from your logic what I have in mind is as follows:
suppose we have a dat file of 100 names and telephone numbers and this file is to be indexed on name.
1. read the dat file and load it in vector using STL
2. sort the data in memory
3. write the address of records in idx file
4. when required to display, read the idx file and display the names from the dat file by using the address mentioned in the idx file
5. however i do not know what to do when a new single record is to be added to dat file and how to update the idx file.  Similarly when required to display the names begining with "A" or "AD", what is to be done.
 I know that I may have confused you, but please help me as I am working on a project which requires indexing to be done as it is done by database packages like foxpro etc.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24178966
>>>> Please explain the concept of head record , bitmap record, offset to head record an offset to bitmap record.

Ok a head record simply gives basic information about the file where it is head record of, such as size, offset to known sections such as bitmaps, such as data records, frist free bitmap record (that is the first bitmap record which not has all bits set)

A bitmap record is a record where each bit says whether the corresponding data record is free. The bitmap record normally has same size as data records (not necessarily but easier to handle) So, if you have a data record of 4096 bytes it is 32768 bits for one record and you can manage up to 32768 data records with that one record (is 134,217,728 bytes). For your tests you probably have only one bitmap record therefore.

>>>> offset to head record

No. There is no offset to headrecord. head record is always first record.

>>>> offset to bitmap record
offset to bitmap record  is 1 cause the bitmap records follow immediately the (one and lonely) head record.. It is redundant information therefore. But it doesn't make much efforts to handle that and - maybe - if you have to increase the data base you simply would copy the old datafile into a new greater one *and* put the the (new) bitmap records *after* the old data records. Then update the head record and copy old bitmap records to the new place - mark the bits for the new bitmap records as used and all is done.

>>>> offset to data record
We need an offset to the first data record in order to be able to compute which bit belongs to which record. You always need to make relative pointers so that you could increase bitmap area without  having to change any chaining pointers within the data (record) area. The only thing which might change is the offset to the first data record.

>>>> if you can explain the logic of indexing
Note, the way I described above is only one of many ways to accomplish indexing. Most DBMS use Bayer trees (balanced sorted trees) for indexes and often they use one (physical) index file  for any single index (not one index file for all indexes as I made it above). Nevertheless I described a valid and proofed way of indexing. My first DBMS has one index database for all indexes and there are also modern OO DBMS, e. g. poet DBMS which exactly have an architecture of two physical files, object.dat and object.idx, similar to that I described.

Generaly, indexing is made *after* adding the new database record. That way, you always can thro off index files and newly generate them from the data records.

The method described above is using a hash. Assume you have 10000 index slots. Then you need an algorithm where the given index key is input, that outputs a number between 1 and 10000. The hash algorithm is ideal if any of the 10000 slots has equal probability to be chosen by hash. Nevertheless, even if a slot never was compted by hash, it also could take index data. That is if one of the first records computed has an overflow, you simply would read the next records (or use bitmap records for that) to find a yet unused record. Then the previous record and the new unused one can be 'chained'  (the new one points backwards to the old one and the old one forward to the new one). There are different concepts what to do, if a hash was computed but the record it was pointing to was used as a following record by another hash. Mostly, they use further hash computations until an unused slot was reached. Another method is that each index record contains an additional member where the pointer for the start of the original hash code can be stored.

To be continued ...
0
What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 

Author Comment

by:abdul_mujeeb
ID: 24221337
I am waiting for the continued part.
0
 

Author Comment

by:abdul_mujeeb
ID: 24371615
I am still waiting for the continued part
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24371851
>>>> I am still waiting for the continued part
I'll will find some time this evening.

Sorry for the delay but I am currently involved in some busy project.
0
 

Author Comment

by:abdul_mujeeb
ID: 24373074
It is OK
0
 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 500 total points
ID: 24379666
1. read the dat file and load it in vector using STL
2. sort the data in memory
3. write the address of records in idx file

If you could read all data records to memory, there is no need of a index file. Searching in linear time in memory is a hundred times faster than any indexed read from file regardless of the number of records (say if less than 1 million) . Even if you think linear time in memory is not fast enough you should consider using a std::map for indexing in memory rather than using an index file. Or, after sorting the records read in, use a binary search algorithm on the key (sort criteria) which would give logarithmic time access. But probably with 100 records the linear search is faster ...

4. when required to display, read the idx file and display the names from the dat file by using the address mentioned in the idx file
as told the idx file is only if there is too much data to read them all into memory. If so, the approach is correct.

5. however i do not know what to do when a new single record is to be added to dat file and how to update the idx file.  
You would write the new (1) record by getting the first empty bit of the bitmap records (which were in memory all time) and calculate the next free record  from that. Then write the data to that record, update the bitmap record and perhaps the head record if the bitmap record was full. Extract key from record and look for the key in index file by computing the first record to look for using the hash algorithm. Read that record and check if either it is empty - than you could take it as the first record for the new key - or it is used by the the same key - then it is a following entry to the same key - or if it belongs to some other key. In the latter case most algorithms simply would read sequentially from the current position (and continuing from first index record in case of reaching the end of file) until an empty record was found. Then, this record was *hijacked* and makes up the first record of the new chain. In the record where the hash pointed to, you would add a pointer to the found record so that next access reads two records a maximum to find the begin of the chain.
   
Similarly when required to display the names begining with "A" or "AD", what is to be done.

Hmmm. A hash access is not suitable for wildcard access with a given prefix string. If the fast access via prefixes is a requirement you need a sorted index rather than a hashed index. With a sorted index you would get the first key beginning with "AD" using a binary search algorithm (look for the middle of all key entries. If it is greater than the given key, take the lower part, else take the upper part for the next lookup, and so on until either the key was found or the range to look for contains no keys. Then you simply go on 'sequentially' from that key until the given prefix doesn't match. The problem with sorted index files id that you hardly can move all keys greater than a new one to insert as it would take too much time. Therefore the records containing keys (with pointers to data records) not really can be read sequentially to get all keys but you have a some kind of tree where the root (record) was pointed from head record and where each record has pointers to roots of a left subtree and right subtree where the left subtree contains only keys less than the lowest key of root and the right subtree only keys greatewr than the maximum of the root. Each subtree is following the same rules. The problem of such trees is not to build them (that is rather simple as you could easily find the record to insert by following a simple binary algorithm) but to keep it balanced. Balanced key files are often so-called Bayer trees which - in case of inserting a new record which would spoil the balance of the total tree -  would make additional balancing operations to finally have a balanced tree again.

Assume the following tree:

the root has keys from MM to OX and has a left child with keys from AB to LN.

             [MM  ...  OX]
     [AB ... LN]

If you want to add key GG and find out the child record is full you could split the left child record and get

             [MM  ...  OX]
       [GG ... LN]
  [AA ... FI]

But that tree was not balanced. So you do

             [GG ... LN]
  [AA ... FI]     [MM  ...  OX]

and again have a wonderful balanced tree.


>>>  I know that I may have confused you, but please help me as I am working on a project which requires indexing to be done as it is done by database packages like foxpro etc.

Unfortunately, I don't know what indexing concept FoxPro has. But it is an elder DBMS and probably was using a key file like the last one described.

Instead of programming an own key file you might check your file system whether it wasn't providing a key-indexed file. AFAIK most file systems do provide one where you simply can access records  by key and have a pointer to your data records stored with the key.

0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24393136
Why did you give a B grade?

Is that the way you want to honour my efforts?
0
 

Author Comment

by:abdul_mujeeb
ID: 24393223
I am sorry for that.  any way to change it.  Beleive me u r a genius
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 24393830
>>>> any way to change it.
Yes, you can check the 'Request Attention' link which is in your initial question at top of that thread and as the moderators to change the grade.

I am glad that it happened by accident.

Regards, Alex
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Introduction This article is a continuation of the C/C++ Visual Studio Express debugger series. Part 1 provided a quick start guide in using the debugger. Part 2 focused on additional topics in breakpoints. As your assignments become a little more …
Jaspersoft Studio is a plugin for Eclipse that lets you create reports from a datasource.  In this article, we'll go over creating a report from a default template and setting up a datasource that connects to your database.
This tutorial covers a step-by-step guide to install VisualVM launcher in eclipse.
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now