Solved

Problem to search

Posted on 2015-02-06
Last Modified: 2015-02-10
Hi,
Searching a file like

https://dl.dropboxusercontent.com/u/40211031/flout_w.bin

works fine using the exe file generated from
// 
//

#pragma warning (disable: 4996) 
#include "stdafx.h"
#include <set>
#include <sys/stat.h>
#include <string>
#include <fstream>
#include <sstream>
#include <atlbase.h>
#include <ctype.h>
#include <process.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include "..\..\include\nameval.h"
#include <iomanip>
#include <Windows.h>
struct stat fs = { 0 };
int ret; //
int numRecords;
nameval binrec;
bool LessComp(const nameval& a1, const nameval& a2)
{
	if (strcmp(a1.fld_nm, a2.fld_nm) < 0) return true;
	if (strcmp(a1.fld_nm, a2.fld_nm) > 0) return false;
	if (a1.fld_val < a2.fld_val) return true;
	return false;
}
int _tmain(int argc, _TCHAR* argv[])
{
	if (argc < 1)
	{
		return ERROR;
	}
	unsigned int nbegin = 0;
	unsigned int nend = numRecords - 1;
	unsigned int nmid;
	unsigned int nstop = 0;
	char nm_got[100];
	unsigned int val_got;
	time_t timev, currtime;
	float sec;
	timev = time(0);
	std::ifstream inputfiles;
	nameval names = { 0 };
		std::ostringstream filename;
		filename << "c:\\dp4\\flout_w.bin";
		std::set<nameval> records;
		std::set<nameval>::iterator iter;
		inputfiles.open(filename.str().c_str(), std::ios::binary | std::ios::in);
		if (!inputfiles.is_open())
			return -3; //
		if (!inputfiles.read((char*)&names, sizeof(nameval)))
			return -4; //
		ret = stat(filename.str().c_str(), &fs);
		if (ret != 0)
			return -4;
		numRecords = (int)(fs.st_size / sizeof(nameval));
		nbegin = 0;
		nend = numRecords - 1;
		nstop = 0;
		char szArgv2[512] = { 0 };
		size_t ncharsConverted = 0;
		wcstombs(szArgv2, argv[1], sizeof(szArgv2));
		while (nbegin <= nend && nstop != -1)
		{
			nmid = (nbegin + nend) / 2;
			nameval rec = { 0 };
			inputfiles.seekg(nmid* sizeof(nameval));
			inputfiles.read((char*)&rec, sizeof(nameval));
			if (strcmp(szArgv2, rec.fld_nm)<0)
			{
				nend = nmid - 1;
				nmid = (nbegin + nend) / 2;
			}
			else
			{
				if (strcmp(szArgv2, rec.fld_nm)>0)
				{
					nbegin = nmid + 1;
					nmid = (nbegin + nend) / 2;
				}
				else
				{
					nstop = -1;
					strcpy(rec.fld_nm, nm_got);
					val_got = rec.fld_val;
				}
			}
		}
		if (nstop == -1)
		{
			std::cout << "\nFound it!\n";
			std::cout << "(From vector record: " << nm_got
				<< ' ' << val_got << ")\n";
		}
		else std::cout << "\nDidn't find it within file -'" << filename.str().c_str() << "'!\n";
	return 0;
}



and here is what I get

C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzwBdUCSIZpiPajxmVV"

Found it!
(From vector record: Éï? 588081)



but when I search a big file that has the same structure, I get this
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfXEJ"



even though the string does exist within the file.

Here is the .h file
// nameval.h
#ifndef NAME_VAL_H
#define NAME_VAL_H

struct nameval
{
     char fld_nm[21];
     long long    fld_val;

     int  get_len() { return (int)min(strlen(fld_nm), sizeof(fld_nm) ) ; }
     void get_uni_nm(wchar_t nm_uni[], int sizfld)
     {
            //mbstowcs_s(nm_uni, fld_nm, min(sizfld-1, strlen(fld_nm)));
			size_t ncharsConverted = 0;
			mbstowcs_s(&ncharsConverted, nm_uni, sizfld, fld_nm, min(sizfld-1, (int)strlen(fld_nm)));
     }
     bool operator< (const nameval & a2) const
     {
           if(strcmp(fld_nm, a2.fld_nm) < 0) return true;
           if(strcmp(fld_nm, a2.fld_nm) > 0) return false;
           if (fld_val < a2.fld_val) return true;
           return false;
     }
};

#endif

Question by:HuaMinChen
37 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 40595354
Is the file you are searching sorted on fld_nm?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40595355
Yes, sorted already.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40595474
It looks like strcpy(rec.fld_nm, nm_got);
should be strcpy(nm_got, rec.fld_nm);
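
The argument order matters because strcpy takes the destination first and the source second. A minimal sketch of the corrected copy, assuming a hypothetical helper named copyName (not part of the posted code) and using a bounded copy so that the destination is always terminated:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper for illustration: copies a record's name into a
// local buffer, destination first and source second.
std::string copyName(const char* recordName)
{
    char nm_got[100] = { 0 };   // destination, zero-initialized
    // Copy at most 99 characters so the last byte stays '\0'.
    std::strncpy(nm_got, recordName, sizeof(nm_got) - 1);
    return std::string(nm_got);
}
```

With the arguments reversed, the uninitialized nm_got would be copied over the record that was just read, which is what the garbage in the output suggests.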
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40595617
No, it works fine when searching a smaller file. Why does the problem arise with a bigger file?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40595657
(From vector record: Éï? 588081)
does not seem to be working fine.  Shouldn't it have been
(From vector record: zzzwBdUCSIZpiPajxmVV 588081)

strcpy from an uninitialized variable would give undefined behavior
which may manifest in inconsistent ways.

If a file is large enough for 2*numRecords to exceed the size of an int, or for st_size to exceed the size of off_t, that could cause problems to arise, but it should still report Didn't find it within file -
which I don't see in your output.
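
The 2*numRecords concern comes from computing (nbegin + nend) / 2, where the sum can exceed the index type before the division. A small sketch of a midpoint that avoids the sum, assuming a hypothetical helper name safeMid and 64-bit unsigned indices with begin <= end:

```cpp
#include <cstdint>

// Midpoint of the closed range [begin, end] without computing begin + end.
// Assumes begin <= end; the difference always fits in the index type.
std::uint64_t safeMid(std::uint64_t begin, std::uint64_t end)
{
    return begin + (end - begin) / 2;
}
```

Note also that with unsigned indices, nend = nmid - 1 wraps around to a huge value when nmid is 0, so a half-open range or a signed type is the safer choice.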
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40596592
strcpy from an uninitialized variable would give undefined behavior
which may manifest in inconsistent ways.

Thanks. What should I adjust in the code above?

If a file is large enough for 2*numRecords to exceed the size of an int, or for st_size to exceed the size of off_t, that could cause problems to arise, but it should still report Didn't find it within file -
which I don't see in your output.

How can I enhance the code to read a big file?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40596616
strcpy(rec.fld_nm, nm_got); should be strcpy(nm_got, rec.fld_nm);

For very large files, you may need
long numRecords,nbegin,nend,nmid;
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40596765
Sorry, if there is a problem with the strcpy line, why does searching the file mentioned above work fine? Thanks.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40596796
Searching the file mentioned above does not work fine.
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzwBdUCSIZpiPajxmVV"

Found it!
(From vector record: Éï? 588081)
should be
(From vector record: zzzwBdUCSIZpiPajxmVV 588081)

Also, with undefined behavior, anything at all can happen, including failing randomly, or accidentally appearing fine.
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40596814
Thanks a lot.

I've made the change below
// 
//

#pragma warning (disable: 4996) 
#include "stdafx.h"
#include <set>
#include <sys/stat.h>
#include <string>
#include <fstream>
#include <sstream>
#include <atlbase.h>
#include <ctype.h>
#include <process.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include "..\..\include\nameval.h"
#include <iomanip>
#include <Windows.h>
struct stat fs = { 0 };
int ret; //
long numRecords;
nameval binrec;
bool LessComp(const nameval& a1, const nameval& a2)
{
	if (strcmp(a1.fld_nm, a2.fld_nm) < 0) return true;
	if (strcmp(a1.fld_nm, a2.fld_nm) > 0) return false;
	if (a1.fld_val < a2.fld_val) return true;
	return false;
}
int _tmain(int argc, _TCHAR* argv[])
{
	if (argc < 1)
	{
		return ERROR;
	}
	long nbegin = 0;
	long nend = numRecords - 1;
	long nmid;
	unsigned int nstop = 0;
	char nm_got[100];
	unsigned int val_got;
	time_t timev, currtime;
	float sec;
	timev = time(0);
	std::ifstream inputfiles;
	nameval names = { 0 };
		std::ostringstream filename;
		filename << "c:\\dp4\\flout_w.bin";
		std::set<nameval> records;
		std::set<nameval>::iterator iter;
		inputfiles.open(filename.str().c_str(), std::ios::binary | std::ios::in);
		if (!inputfiles.is_open())
			return -3; //
		if (!inputfiles.read((char*)&names, sizeof(nameval)))
			return -4; //
		ret = stat(filename.str().c_str(), &fs);
		if (ret != 0)
			return -4;
		numRecords = (int)(fs.st_size / sizeof(nameval));
		nbegin = 0;
		nend = numRecords - 1;
		nstop = 0;
		char szArgv2[512] = { 0 };
		size_t ncharsConverted = 0;
		wcstombs(szArgv2, argv[1], sizeof(szArgv2));
		while (nbegin <= nend && nstop != -1)
		{
			nmid = (nbegin + nend) / 2;
			nameval rec = { 0 };
			inputfiles.seekg(nmid* sizeof(nameval));
			inputfiles.read((char*)&rec, sizeof(nameval));
			if (strcmp(szArgv2, rec.fld_nm)<0)
			{
				nend = nmid - 1;
				nmid = (nbegin + nend) / 2;
			}
			else
			{
				if (strcmp(szArgv2, rec.fld_nm)>0)
				{
					nbegin = nmid + 1;
					nmid = (nbegin + nend) / 2;
				}
				else
				{
					nstop = -1;
					strcpy(nm_got, rec.fld_nm);
					val_got = rec.fld_val;
				}
			}
		}
		if (nstop == -1)
		{
			std::cout << "\nFound it!\n";
			std::cout << "(From vector record: " << nm_got
				<< ' ' << val_got << ")\n";
		}
		else std::cout << "\nDidn't find it within file -'" << filename.str().c_str() << "'!\n";
	time(&currtime);
	sec = difftime(currtime, timev);
	std::cout << "Search finishes with only " << sec << " seconds";
	system("pause>null");
	return 0;
}



but I still get this
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfX
EJ"

C:\ReadBinaryFile\x64\Release>



when searching the big file. I can also show you the big file, if possible.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40596831
"zzzzzOMoXmtyPzuCfX
EJ"
looks like 21 characters. If it is really contained in the file, it would either overflow char fld_nm[21]; or be unterminated, either of which would again cause undefined behavior.

Also, if the file is too big, you are still casting to (int).
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40596849
Yes, the string definitely does exist within the 5 GB file. I'm afraid I may not be able to upload it, as it is still 2.8 GB after zipping.

Also, if the file is too big, you are still casting to (int).

What do you mean by this?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40596867
2.8GB/sizeof(nameval) should not overflow a signed 32 bit int.

But a 21 character string existing in a fld_nm[21] would cause undefined behavior
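
One way to test this concern directly is to check whether a name field read from the file contains a terminator at all. A minimal sketch, assuming a hypothetical helper isTerminated (not in the posted code); a 20-character name plus its '\0' exactly fills the 21 bytes of fld_nm:

```cpp
#include <cstddef>
#include <cstring>

// Returns true when the buffer of the given size contains a '\0',
// i.e. when it is safe to pass to strcmp or strcpy.
bool isTerminated(const char* buf, std::size_t size)
{
    return std::memchr(buf, '\0', size) != nullptr;
}
```

Calling isTerminated(rec.fld_nm, sizeof(rec.fld_nm)) right after each read would flag records that were read from a wrong file position.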
0
 
LVL 84

Expert Comment

by:ozo
ID: 40596872
Does your ifstream handle >32 bit streampos?  What is tellg after the seekg?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40596882
Does your ifstream handle >32 bit streampos?  What is tellg after the seekg?

It is an x64 project. What do I need to show you to check it? Thanks.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597480
What return value did you get from the program?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597530
I get nothing, like this:
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfX
EJ"

C:\ReadBinaryFile\x64\Release>


0
 
LVL 84

Expert Comment

by:ozo
ID: 40597542
Did you get a return value from one of your
return ERROR;
return -3;
 return -4;
return 0;
statements?
If not, what was the system return value?
If you check it with
echo $?
or
echo %errorlevel%
that should give you a clue to where your program is failing.
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597552
I did not get it.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597589
Are you saying the exit status was 0 ?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597611
Sorry, the question is: when running it against the big file, it does not show any output. What should I adjust in the code?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597614
What was the exit status?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597620
Sorry, I run the exe file to do the search. How do I adjust the code to show the exit status?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597624
echo %errorlevel%
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597632
I get

-4
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597635
So it looks like it came from one of
            if (!inputfiles.read((char*)&names, sizeof(nameval)))
                  return -4; //
            ret = stat(filename.str().c_str(), &fs);
            if (ret != 0)
                  return -4;
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597636
How to identify the problem?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597638
It would seem that either the read failed, or the stat failed.
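
Because two different failures share the value -4, the exit status alone cannot tell them apart. A small sketch of one distinct code per failure point; the enum and the helper describeExit are hypothetical names for illustration, and only the values 0, -3 and -4 come from the posted code:

```cpp
#include <string>

// Hypothetical exit codes: one distinct value per failure point.
enum ExitCode
{
    kOk         =  0,
    kBadArgs    = -1,
    kOpenFailed = -3,
    kReadFailed = -4,
    kStatFailed = -5
};

// Maps an exit code to a short message that can be printed before returning.
std::string describeExit(int code)
{
    switch (code)
    {
    case kOk:         return "success";
    case kBadArgs:    return "missing search argument";
    case kOpenFailed: return "could not open file";
    case kReadFailed: return "could not read first record";
    case kStatFailed: return "stat failed";
    default:          return "unknown exit code";
    }
}
```

Printing describeExit(code) to std::cout just before each return would make the failing step visible without checking %errorlevel%.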
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597643
How do I correct the code (regarding -4) so that it works?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597644
If the read failed, you might check the state flags eofbit, failbit, badbit
If the stat failed, you might check errno
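
A short sketch of how the stream state flags could be turned into text. The helper name describeState is an assumption made for illustration; it only uses the standard iostate bitmask:

```cpp
#include <ios>
#include <string>

// Lists the flags that are set in a stream state value.
std::string describeState(std::ios::iostate state)
{
    if (state == std::ios::goodbit)
        return "goodbit";
    std::string text;
    if (state & std::ios::eofbit)  text += "eofbit ";
    if (state & std::ios::failbit) text += "failbit ";
    if (state & std::ios::badbit)  text += "badbit ";
    return text;
}
```

For example, printing describeState(inputfiles.rdstate()) before the first return -4 would show whether the read hit the end of the file or failed for another reason.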
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597647
Can I have more details on how to check these? Thanks.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40597650
stat can fail due to
EOVERFLOW
    path or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
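
To find out whether stat failed with EOVERFLOW or with some other error, errno can be read right after the call. A minimal sketch, assuming a hypothetical helper statErrorText; with Visual C++, strerror may trigger warning 4996, which the posted code already disables:

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>

// Returns an empty string when stat succeeds, otherwise the message
// that belongs to the errno value set by the failed call.
std::string statErrorText(const std::string& path)
{
    struct stat info;
    if (stat(path.c_str(), &info) == 0)
        return std::string();
    const int err = errno;          // save errno before any other call
    return std::string(std::strerror(err));
}
```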
0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 50 total points
ID: 40597653
Do you have a stat64 function?
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40597661
The project is already an x64 project.
0
 
LVL 32

Accepted Solution

by:sarabande
sarabande earned 450 total points
ID: 40598727
The project is already an x64 project.
Yes, and std::streambuf is capable of 64-bit positions as well.

Nevertheless, ozo is right: you have to use the stat64 function if your file is larger than 4 GB.

Also, all the 'unsigned int' variables that are used for file positions (nbegin, nmid, nend) must be changed to a 64-bit integer type like 'long long', size_t, fpos_t or __int64.

 return -4;
I told you to add better error handling instead of only returning a non-zero error code. Look into the latest code of savebinaryfile, where you will find a code sample showing how to add an informative error message before returning from main.

But a 21 character string existing in a fld_nm[21] would cause undefined behavior
For your information: the struct nameval is used in two projects, savebinaryfile and readbinaryfile. The first creates a lot of binary files, each containing 1 million records defined by the nameval structure. The records are sorted by name, and each name has exactly 20 randomly generated letters. At the end of the program, all the 'small' files are merged into one huge sorted file.

As long as this file was smaller than 4 GB, the second program, readbinaryfile, was still able to do a binary search on it. But once the 4 GB boundary was exceeded, the binary search could no longer work, because the stat function used fails when trying to read file information from a 6 GB file. If reading records from the file returns names that are 21 characters long or contain non-printable characters, then it is due to a wrong file position from a calculation that requires 64-bit integer variables but used 32-bit variables instead.

when running it against the big file, it does not show any output. What should I adjust in the code?
First, use stat64 instead of stat, like this:

struct __stat64 fs64 = { 0 };
ret = _stat64(filename.str().c_str(), &fs64);
if (ret != 0)
{
        // here add an error message to std::cout
        ....
        return -5;
}
numRecords = (fs64.st_size / sizeof(nameval));   // make numRecords a size_t 
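
The same idea as a self-contained function, as a sketch only. The name fileSize64 is hypothetical; the Windows branch uses the _stat64 call shown above, while the other branch assumes a platform where plain stat already reports 64-bit sizes:

```cpp
#include <cstdint>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>

// Returns the file size in bytes as a 64-bit value, or -1 on failure.
std::int64_t fileSize64(const std::string& path)
{
#ifdef _WIN32
    struct __stat64 info = {};
    if (_stat64(path.c_str(), &info) != 0)
        return -1;
#else
    struct stat info = {};   // assumption: off_t is 64 bits wide here
    if (stat(path.c_str(), &info) != 0)
        return -1;
#endif
    return static_cast<std::int64_t>(info.st_size);
}
```

The record count then follows as fileSize64(path) / sizeof(nameval), stored in a 64-bit variable and without the (int) cast.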



Note: you are using some global variables above the main function, which makes little sense as you only have one function. Move the variables into the main function or, better, define them locally where they are needed.

For very large files, you may need
 long numRecords,nbegin,nend,nmid;
'long' might be a 32-bit integer even in a 64-bit project for compatibility reasons. You could output sizeof(long) to find out. But to keep simple things simple, I would use size_t wherever a 64-bit integer may be appropriate. That is the case for the variables ozo mentioned, but also for local variables used in a calculation where (interim) results could exceed the 32-bit signed integer boundary (about 2.1 billion).
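
Putting these points together, the search loop could look roughly like the sketch below. It is not the code from readbinaryfile: Record is a simplified stand-in for nameval, findRecord and demoSearch are hypothetical names, and the demonstration runs on an in-memory stream instead of the real file. All indices and byte offsets are fixed-width 64-bit types, and the half-open range [lo, hi) never computes mid - 1.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Simplified stand-in for nameval (assumption: same 21-byte name field).
struct Record
{
    char         fld_nm[21];
    std::int64_t fld_val;
};

// Binary search over `count` sorted records stored back to back in `in`.
// Returns the index of the match, or -1 if the key is absent or a read fails.
std::int64_t findRecord(std::istream& in, std::uint64_t count,
                        const std::string& key, Record& found)
{
    std::uint64_t lo = 0, hi = count;          // half-open range [lo, hi)
    while (lo < hi)
    {
        const std::uint64_t mid = lo + (hi - lo) / 2;
        Record rec = {};
        in.clear();                            // reset flags from earlier reads
        in.seekg(static_cast<std::streamoff>(mid * sizeof(Record)), std::ios::beg);
        if (!in.read(reinterpret_cast<char*>(&rec), sizeof(Record)))
            return -1;
        rec.fld_nm[sizeof(rec.fld_nm) - 1] = '\0';   // guard against unterminated names
        const int cmp = std::strcmp(key.c_str(), rec.fld_nm);
        if (cmp == 0)
        {
            found = rec;
            return static_cast<std::int64_t>(mid);
        }
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return -1;
}

// In-memory demonstration with three sorted records.
bool demoSearch()
{
    const char* names[] = { "apple", "mango", "zebra" };
    std::stringstream file(std::ios::in | std::ios::out | std::ios::binary);
    for (std::int64_t i = 0; i < 3; ++i)
    {
        Record rec = {};
        std::strncpy(rec.fld_nm, names[i], sizeof(rec.fld_nm) - 1);
        rec.fld_val = 100 + i;
        file.write(reinterpret_cast<const char*>(&rec), sizeof(Record));
    }
    Record hit = {};
    return findRecord(file, 3, "mango", hit) == 1 && hit.fld_val == 101
        && findRecord(file, 3, "kiwi", hit) == -1;
}
```

For the real program, the stream would be the std::ifstream opened in binary mode and count would come from the 64-bit file size divided by the record size.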

Sara
0
 
LVL 10

Author Comment

by:HuaMinChen
ID: 40602508
OK. Thanks.
0
