asked on

Sorting and searching in c++

I am fascinated by the sort and search method of Dbase files. In this files with extension .idx are created which

store information about the records in the main file. Indexing may be done on one field or more than one fields.

For e.g. a record may contain name and salary. Records can be indexed on name or salary or both. If one is asked to search names begining with A, all records with name starting with A are displayed or if the query is AC, all records starting with names AC are displayed. In this example the primary index is name and secoddary is salary.

Through this process retrieval is also fast.

If a new record is added the index file is updated automatically.

Kindly tell me the steps so that I may develop the algorithm using C++.

Regards,

Abdul Mujeeb

mrjoltcola

Database indexing like this is implemented using B-trees and Hashes. The most common is the B-tree

http://en.wikipedia.org/wiki/B-tree

You might start with buying a book on File Structures. This is one of the core classes in Computer Science, and there are some good books out there for this. I actually have "File Structures: An Object-Oriented Approach with C++"

http://www.amazon.com/File-Structures-Object-Oriented-Approach-C/dp/0201874016

And also a more traditional C one that I actually used when in college, that is highly recommendec:

http://www.amazon.com/File-Structures-Michael-J-Folk/dp/0201557134/ref=pd_sim_b_2

itsmeandnobodyelse

>>>> so that I may develop the algorithm using C++.

What's your purpose? Developing a database? Or use indexing mechanisms for transient data in arrays of structures?

abdul_mujeeb

ASKER

hello itsmeandnobodyelse

My purpose is to use indexing mechanism.

itsmeandnobodyelse

If you don't have to store the data into physical files, things are easier cause you can operate with pointers in memory no matter how far the used pointer addresses were spreaded. If writing to database files things are different cause the index data and real data must be compact.

A good container to start with is a sorted array. Here you can search with a binary search (take the middle index and check whether your search value is in the upper or lower part, then take that and again check the middle, and so on).

Then go on with a hash container, where you calculate a hash value from the keys thus finding the begin of a linked list where all entries with the same hash belong to.

Next would be a balanced binary tree, e. g. an AVL tree where none of the branches is much shorter or longer than that of the other sibling branch.

abdul_mujeeb

ASKER

I have to keep the data files and index files seperate. Please tell me the steps to store data, let us say in .dat file and indexed records in .idx file

itsmeandnobodyelse

struct HeadRecord
{
int recsize; // recordsize in bytes
int maxrecs; // max records in db (including head and bitmap)
int maxbmrecs; // max bitmap records
int offsetbm; // offset to bitmap records
int offsetdata; // offset to data records
int nextbmrec; // next bitmap record

};

struct IndexHeadRecord
{
int recsize;
int maxrecs;
};

struct IndexRecord
{
unsigned int hash; // hash code for that record
unsigned int used; // used dwords in record
unsigned int prev; // back pointer to previous of same hash
unsigned int buf[253]; // buffer for entries
unsigned int next; // forward pointer to next of same hash
};

// note, if the prev is 0 , it is the begin of a list to same hash
// note, a record can be used for other hash codes as well (if prev is full)

struct IndexEntry
{
unsigned short lenEntry; // total length of entry in dwords
unsigned short indexId; // index id
unsigned int rec; // record number in DB
unsigned int buf[1]; // buffer for key data (variable size)
}

struct Record
{
int id;
int age;
char firstname[32];
char lastname[32];
};

Record recs[] = { { 1, 15, "Tom", "Sawyer", },
{ 2, 35, "Mary", "Higgins", },
{ 3, 20, "Lewis", "Hamilton", },
{ 4, 67, "Alan", "Greenspan", },
};

// Assuming there is an index on column 1 and a second index on both names.

int createDb()
{
ofstream ofdb("db.dat", ios::binary | ios:out);
ofstream ofidx("db.idx", ios::binary | ios:out);
// write head record
HeadRecord head = { 0 };
head.recsize = sizeof(Record);
head.maxrecs = 1024;
head.maxbmrecs = (head.maxrecs / (head.recsize*8) ) + 1 ;
head.offsetbm = 1;
head.offsetbm = 1 + head.maxbmrecs;
head.nextbmrec = 2;

Record hrec = { 0 };
memcpy(&hrec, &head, sizeof(HeadRecord));
// write head
if (!ofdb.write((char*)&hrec, sizeof(Record))) return GetLastError();
// write all other records as zeros
memset(&hrec, 0, sizeof(Record));
for (int i = 1; i < head.maxrecs; ++i)
if (!ofdb.write((char*)&hrec, sizeof(Record))) return GetLastError();
ofdb.close();

// the index file was organised via hash
IndexHeadRecord ihr = { 0 };
ihr.recsize = sizeof(IndexRecord);
ihr.maxRecs = 100;

IndexRecord ihrec = { 0 };
memcpy(&ihrec, &ihr, sizeof(IndexHeadRecord));
// write head
if (!ofidx.write((char*)&ihrec, sizeof(IndexRecord))) return GetLastError();
// write all other records as zeros
memset(&ihrec, 0, sizeof(IndexRecord));
for (int j = 1; j < ihr.maxrecs; ++j)
if (!ofidx.write((char*)&ihrec, sizeof(IndexRecord))) return GetLastError();
ofidx.close();

return 0;
}

// ok the above has created the database and the index database

// to write a record with data now, you need to read the head records
// of both files (you would keep them in memory once read)
// get the nextbmrec from head and check for the next free bit
// if none in the record, increment nextbmrec and update head record
// read next bitmap record
// if found free bit, calculate record number from bit.
// set file position to begin of record and write data record binary to db
// for each index in record
// - get key (of that index) from Record.
// - calculate hash (record) number from key in the range of (2, ihrec.maxrecs)
// - read the index record the hash is pointing to
// - check if record has enough space
// if not, search next free record sequentially and do the chaining
// - add new entry to index record and store updated index record

// to search for a record via index
// calculate the hash for the index key
// read the index records to that hash and search for entries
// where the key(s) were matching exactly
// read the data records these index entries were pointing to

abdul_mujeeb

ASKER

You are really a genius. No one has given a reply like this. I have been successful in creating both the dat and idx files. However, I have not been able to follow the logic of the code. Please explain the concept of head record , bitmap record, offset to head record an offset to bitmap record.. if you can explain the logic of indexing, it will help me. I will be able to understand better if you explain it through a drawing or a chart mentioning the steps taken when a records is added to a dat file and its updation in the idx file. similarly when a record is to be searched, the steps to be performed.

Apart from your logic what I have in mind is as follows:
suppose we have a dat file of 100 names and telephone numbers and this file is to be indexed on name.
1. read the dat file and load it in vector using STL
2. sort the data in memory
3. write the address of records in idx file
4. when required to display, read the idx file and display the names from the dat file by using the address mentioned in the idx file
5. however i do not know what to do when a new single record is to be added to dat file and how to update the idx file. Similarly when required to display the names begining with "A" or "AD", what is to be done.
I know that I may have confused you, but please help me as I am working on a project which requires indexing to be done as it is done by database packages like foxpro etc.

itsmeandnobodyelse

>>>> Please explain the concept of head record , bitmap record, offset to head record an offset to bitmap record.

Ok a head record simply gives basic information about the file where it is head record of, such as size, offset to known sections such as bitmaps, such as data records, frist free bitmap record (that is the first bitmap record which not has all bits set)

A bitmap record is a record where each bit says whether the corresponding data record is free. The bitmap record normally has same size as data records (not necessarily but easier to handle) So, if you have a data record of 4096 bytes it is 32768 bits for one record and you can manage up to 32768 data records with that one record (is 134,217,728 bytes). For your tests you probably have only one bitmap record therefore.

>>>> offset to head record

No. There is no offset to headrecord. head record is always first record.

>>>> offset to bitmap record
offset to bitmap record is 1 cause the bitmap records follow immediately the (one and lonely) head record.. It is redundant information therefore. But it doesn't make much efforts to handle that and - maybe - if you have to increase the data base you simply would copy the old datafile into a new greater one *and* put the the (new) bitmap records *after* the old data records. Then update the head record and copy old bitmap records to the new place - mark the bits for the new bitmap records as used and all is done.

>>>> offset to data record
We need an offset to the first data record in order to be able to compute which bit belongs to which record. You always need to make relative pointers so that you could increase bitmap area without having to change any chaining pointers within the data (record) area. The only thing which might change is the offset to the first data record.

>>>> if you can explain the logic of indexing
Note, the way I described above is only one of many ways to accomplish indexing. Most DBMS use Bayer trees (balanced sorted trees) for indexes and often they use one (physical) index file for any single index (not one index file for all indexes as I made it above). Nevertheless I described a valid and proofed way of indexing. My first DBMS has one index database for all indexes and there are also modern OO DBMS, e. g. poet DBMS which exactly have an architecture of two physical files, object.dat and object.idx, similar to that I described.

Generaly, indexing is made *after* adding the new database record. That way, you always can thro off index files and newly generate them from the data records.

The method described above is using a hash. Assume you have 10000 index slots. Then you need an algorithm where the given index key is input, that outputs a number between 1 and 10000. The hash algorithm is ideal if any of the 10000 slots has equal probability to be chosen by hash. Nevertheless, even if a slot never was compted by hash, it also could take index data. That is if one of the first records computed has an overflow, you simply would read the next records (or use bitmap records for that) to find a yet unused record. Then the previous record and the new unused one can be 'chained' (the new one points backwards to the old one and the old one forward to the new one). There are different concepts what to do, if a hash was computed but the record it was pointing to was used as a following record by another hash. Mostly, they use further hash computations until an unused slot was reached. Another method is that each index record contains an additional member where the pointer for the start of the original hash code can be stored.

To be continued ...

abdul_mujeeb

ASKER

I am waiting for the continued part.

abdul_mujeeb

ASKER

I am still waiting for the continued part

itsmeandnobodyelse

>>>> I am still waiting for the continued part
I'll will find some time this evening.

Sorry for the delay but I am currently involved in some busy project.

abdul_mujeeb

ASKER

It is OK

ASKER CERTIFIED SOLUTION

itsmeandnobodyelse

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

itsmeandnobodyelse

Why did you give a B grade?

Is that the way you want to honour my efforts?

abdul_mujeeb

ASKER

I am sorry for that. any way to change it. Beleive me u r a genius

itsmeandnobodyelse

>>>> any way to change it.
Yes, you can check the 'Request Attention' link which is in your initial question at top of that thread and as the moderators to change the grade.

I am glad that it happened by accident.

Regards, Alex