Problem searching a large binary file

Hi,
Searching a file like

https://dl.dropboxusercontent.com/u/40211031/flout_w.bin

works fine with the exe file built from this code:
// 
//

#pragma warning (disable: 4996) 
#include "stdafx.h"
#include <set>
#include <sys/stat.h>
#include <string>
#include <fstream>
#include <sstream>
#include <atlbase.h>
#include <ctype.h>
#include <process.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include "..\..\include\nameval.h"
#include <iomanip>
#include <Windows.h>
struct stat fs = { 0 };
int ret; //
int numRecords;
nameval binrec;
bool LessComp(const nameval& a1, const nameval& a2)
{
	if (strcmp(a1.fld_nm, a2.fld_nm) < 0) return true;
	if (strcmp(a1.fld_nm, a2.fld_nm) > 0) return false;
	if (a1.fld_val < a2.fld_val) return true;
	return false;
}
int _tmain(int argc, _TCHAR* argv[])
{
	if (argc < 1)
	{
		return ERROR;
	}
	unsigned int nbegin = 0;
	unsigned int nend = numRecords - 1;
	unsigned int nmid;
	unsigned int nstop = 0;
	char nm_got[100];
	unsigned int val_got;
	time_t timev, currtime;
	float sec;
	timev = time(0);
	std::ifstream inputfiles;
	nameval names = { 0 };
		std::ostringstream filename;
		filename << "c:\\dp4\\flout_w.bin";
		std::set<nameval> records;
		std::set<nameval>::iterator iter;
		inputfiles.open(filename.str().c_str(), std::ios::binary | std::ios::in);
		if (!inputfiles.is_open())
			return -3; //
		if (!inputfiles.read((char*)&names, sizeof(nameval)))
			return -4; //
		ret = stat(filename.str().c_str(), &fs);
		if (ret != 0)
			return -4;
		numRecords = (int)(fs.st_size / sizeof(nameval));
		nbegin = 0;
		nend = numRecords - 1;
		nstop = 0;
		char szArgv2[512] = { 0 };
		size_t ncharsConverted = 0;
		wcstombs(szArgv2, argv[1], sizeof(szArgv2));
		while (nbegin <= nend && nstop != -1)
		{
			nmid = (nbegin + nend) / 2;
			nameval rec = { 0 };
			inputfiles.seekg(nmid* sizeof(nameval));
			inputfiles.read((char*)&rec, sizeof(nameval));
			if (strcmp(szArgv2, rec.fld_nm)<0)
			{
				nend = nmid - 1;
				nmid = (nbegin + nend) / 2;
			}
			else
			{
				if (strcmp(szArgv2, rec.fld_nm)>0)
				{
					nbegin = nmid + 1;
					nmid = (nbegin + nend) / 2;
				}
				else
				{
					nstop = -1;
					strcpy(rec.fld_nm, nm_got);
					val_got = rec.fld_val;
				}
			}
		}
		if (nstop == -1)
		{
			std::cout << "\nFound it!\n";
			std::cout << "(From vector record: " << nm_got
				<< ' ' << val_got << ")\n";
		}
		else std::cout << "\nDidn't find it within file -'" << filename.str().c_str() << "'!\n";
	return 0;
}



and here is what I get:

C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzwBdUCSIZpiPajxmVV"

Found it!
(From vector record: Éï? 588081)



but when I search a big file with the same structure, I get this:
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfXEJ"



even though the string does exist in the file.

Here is the .h file:
// nameval.h
#ifndef NAME_VAL_H
#define NAME_VAL_H

struct nameval
{
     char fld_nm[21];
     long long    fld_val;

     int  get_len() { return (int)min(strlen(fld_nm), sizeof(fld_nm) ) ; }
     void get_uni_nm(wchar_t nm_uni[], int sizfld)
     {
            //mbstowcs_s(nm_uni, fld_nm, min(sizfld-1, strlen(fld_nm)));
			size_t ncharsConverted = 0;
			mbstowcs_s(&ncharsConverted, nm_uni, sizfld, fld_nm, min(sizfld-1, (int)strlen(fld_nm)));
     }
     bool operator< (const nameval & a2) const
     {
           if(strcmp(fld_nm, a2.fld_nm) < 0) return true;
           if(strcmp(fld_nm, a2.fld_nm) > 0) return false;
           if (fld_val < a2.fld_val) return true;
           return false;
     }
};

#endif


Asked by: HuaMin Chen (System Analyst)

sarabande commented:
> The project is already an x64 project.

Yes, and std::streambuf can handle 64-bit positions as well.

Nevertheless, ozo is right: you have to use the stat64 function if your file is larger than 4 GB.

Also, all the 'unsigned int' variables that are used for file positions (nbegin, nmid, nend) must be changed to a 64-bit integer type such as 'long long', size_t, fpos_t or _int64.

>  return -4;

I told you to add better error handling instead of only returning a non-zero error code. Look at the latest code of savebinaryfile, where you will find a code sample showing how to print an informative error message before returning from main.
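To illustrate the idea (this is not the savebinaryfile code, only a minimal sketch), a small helper can print the reason and hand back the exit code in one step. The name fail and the message wording are hypothetical:

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: prints a readable reason to std::cerr and returns
// the exit code, so main() can write: return fail(-4, "...");
int fail(int code, const std::string& reason)
{
    std::cerr << "ReadBinaryFile failed (exit code " << code << "): "
              << reason << '\n';
    return code;
}
```

With distinct codes and messages for the read and for the stat call, the two places that currently both return -4 can be told apart without guessing.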

> But a 21 character string existing in a fld_nm[21] would cause undefined behavior

For your information: the struct nameval is used in two projects, savebinaryfile and readbinaryfile. The first creates many binary files, each containing 1 million records defined by the nameval structure. The records are sorted by name, and each name has exactly 20 randomly generated letters. At the end of the program all the 'small' files are merged into one huge sorted file.

As long as this file was smaller than 4 GB, the second program, readbinaryfile, could still do a binary search on it. But once the 4 GB boundary was exceeded, the binary search could no longer work, because the stat function used fails when trying to read file information from a 6 GB file.

If reading records from the file yields names that are 21 characters long or contain non-printable characters, then it is due to a wrong file position, resulting from a calculation that requires 64-bit integer variables but was done with 32-bit variables instead.
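The point about 20-letter names fitting into fld_nm[21] can be checked with a short sketch. nameval_sketch is a hypothetical, simplified copy of nameval (data members only), and fits_in_fld_nm is an illustrative name:

```cpp
#include <cstddef>
#include <cstring>

// Simplified copy of nameval: data members only.
struct nameval_sketch
{
    char      fld_nm[21];   // room for 20 letters plus the terminating '\0'
    long long fld_val;
};

// True if `name` fits into fld_nm together with its terminating '\0'.
bool fits_in_fld_nm(const char* name)
{
    return std::strlen(name) < sizeof(nameval_sketch::fld_nm);
}
```

A 20-letter name such as "zzzzzOMoXmtyPzuCfXEJ" passes this check. A name of 21 letters would not, because there would be no room left for the terminator.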

> when running it against big file, it is not showing any output. What to adjust to the codes?

First, use stat64 instead of stat, like this:

struct __stat64 fs64 = { 0 };
ret = _stat64(filename.str().c_str(), &fs64);
if (ret != 0)
{
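        // here add an error message to std::cout, for example
        // (errno needs <errno.h>):
        std::cout << "_stat64 failed for " << filename.str() << ", errno = " << errno << "\n";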
        return -5;
}
numRecords = (fs64.st_size / sizeof(nameval));   // make numRecords a size_t 



Note that you are using some global variables above the main function, which makes no sense as you only have one function. Move the variables into main, or better, define them locally where they are needed.

> For very large files, you may need
>  long numRecords,nbegin,nend,nmid;

With Visual C++, 'long' is a 32-bit integer even in a 64-bit project, for compatibility reasons. You can output sizeof(long) to check. To keep simple things simple, I would use size_t wherever a 64-bit integer may be appropriate. That is the case for the variables ozo mentioned, but also for local variables used in calculations whose (interim) results could exceed the limit of a signed 32-bit integer (about 2.1 billion).
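Putting the advice together, here is a minimal sketch of a binary search over fixed-size records that uses 64-bit values for all indices and offsets. It makes several assumptions that are not in the original code: the file size is taken with seekg/tellg instead of _stat64 so the sketch stays portable, signed long long is used rather than size_t so that the upper bound may become -1 without wrapping around, and record_sketch and find_record are hypothetical names.

```cpp
#include <cstring>
#include <fstream>
#include <string>

// Simplified copy of nameval: data members only.
struct record_sketch
{
    char      fld_nm[21];
    long long fld_val;
};

// Returns the zero-based index of the record whose fld_nm equals `key`,
// or -1 if there is no such record or a seek/read fails.
long long find_record(const std::string& path, const char* key, record_sketch& found)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return -1;

    // File size via seekg/tellg; std::streamoff is a 64-bit type on x64.
    in.seekg(0, std::ios::end);
    const long long file_size = static_cast<std::streamoff>(in.tellg());
    if (file_size < 0) return -1;
    const long long rec_size = static_cast<long long>(sizeof(record_sketch));

    long long lo = 0;
    long long hi = file_size / rec_size - 1;      // -1 for an empty file
    while (lo <= hi)
    {
        const long long mid = lo + (hi - lo) / 2; // avoids overflow of lo + hi
        record_sketch rec = {};
        in.seekg(static_cast<std::streamoff>(mid * rec_size), std::ios::beg);
        if (!in.read(reinterpret_cast<char*>(&rec), sizeof(rec))) return -1;
        rec.fld_nm[sizeof(rec.fld_nm) - 1] = '\0'; // guard against unterminated data

        const int cmp = std::strcmp(key, rec.fld_nm);
        if (cmp == 0) { found = rec; return mid; }
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}
```

A call such as find_record("c:\\dp4\\flout_w.bin", szArgv2, rec) would replace the loop in main. The caller prints rec.fld_nm and rec.fld_val when the result is not -1.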

Sara

ozo commented:
Is the file you are searching sorted on fld_nm?

HuaMin Chen (System Analyst, author) commented:
Yes, sorted already.

ozo commented:
It looks like strcpy(rec.fld_nm, nm_got);
should be strcpy(nm_got, rec.fld_nm);
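As a small illustration of the argument order (strcpy takes the destination first and the source second), here is a sketch with the buffer sizes from the question. copy_name is a hypothetical helper name:

```cpp
#include <cstring>

// Copies the name of the matching record into the caller's buffer.
// std::strcpy(destination, source): nm_got receives, fld_nm is read.
void copy_name(char (&nm_got)[100], const char (&fld_nm)[21])
{
    std::strcpy(nm_got, fld_nm);
}
```

With the arguments the other way round, the uninitialized nm_got would be read and the record's name overwritten, which is undefined behavior.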

HuaMin Chen (System Analyst, author) commented:
No, it works fine when searching a smaller file. Why does the problem arise with a bigger file?

ozo commented:
(From vector record: Éï? 588081)
does not seem to be working fine.  Shouldn't it have been
(From vector record: zzzwBdUCSIZpiPajxmVV 588081)

strcpy from an uninitialized variable would give undefined behavior
which may manifest in inconsistent ways.

If a file is large enough for 2*numRecords to exceed the size of an int, or for st_size to exceed the size of off_t, that could cause problems to arise, but it should still report "Didn't find it within file -", which I don't see in your output.

HuaMin Chen (System Analyst, author) commented:
> strcpy from an uninitialized variable would give undefined behavior
> which may manifest in inconsistent ways.

Thanks. What should I change in the code above?

> If a file is large enough for 2*numRecords to exceed the size of an int, or for st_size to exceed the size of off_t, that could cause problems to arise, but it should still report Didn't find it within file -
> which I don't see in your output.

How can I enhance the code to read a big file?

ozo commented:
strcpy(rec.fld_nm, nm_got); should be strcpy(nm_got, rec.fld_nm);

For very large files, you may need
long numRecords,nbegin,nend,nmid;

HuaMin Chen (System Analyst, author) commented:
Sorry, if there is a problem with the strcpy line, why does searching the file mentioned above work fine? Thanks.

ozo commented:
Searching the file mentioned above does not work fine either.
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzwBdUCSIZpiPajxmVV"

Found it!
(From vector record: Éï? 588081)
should be
(From vector record: zzzwBdUCSIZpiPajxmVV 588081)

Also, with undefined behavior, anything at all can happen, including failing randomly, or accidentally appearing fine.

HuaMin Chen (System Analyst, author) commented:
Thanks a lot.

I've made the change below:
// 
//

#pragma warning (disable: 4996) 
#include "stdafx.h"
#include <set>
#include <sys/stat.h>
#include <string>
#include <fstream>
#include <sstream>
#include <atlbase.h>
#include <ctype.h>
#include <process.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include "..\..\include\nameval.h"
#include <iomanip>
#include <Windows.h>
struct stat fs = { 0 };
int ret; //
long numRecords;
nameval binrec;
bool LessComp(const nameval& a1, const nameval& a2)
{
	if (strcmp(a1.fld_nm, a2.fld_nm) < 0) return true;
	if (strcmp(a1.fld_nm, a2.fld_nm) > 0) return false;
	if (a1.fld_val < a2.fld_val) return true;
	return false;
}
int _tmain(int argc, _TCHAR* argv[])
{
	if (argc < 1)
	{
		return ERROR;
	}
	long nbegin = 0;
	long nend = numRecords - 1;
	long nmid;
	unsigned int nstop = 0;
	char nm_got[100];
	unsigned int val_got;
	time_t timev, currtime;
	float sec;
	timev = time(0);
	std::ifstream inputfiles;
	nameval names = { 0 };
		std::ostringstream filename;
		filename << "c:\\dp4\\flout_w.bin";
		std::set<nameval> records;
		std::set<nameval>::iterator iter;
		inputfiles.open(filename.str().c_str(), std::ios::binary | std::ios::in);
		if (!inputfiles.is_open())
			return -3; //
		if (!inputfiles.read((char*)&names, sizeof(nameval)))
			return -4; //
		ret = stat(filename.str().c_str(), &fs);
		if (ret != 0)
			return -4;
		numRecords = (int)(fs.st_size / sizeof(nameval));
		nbegin = 0;
		nend = numRecords - 1;
		nstop = 0;
		char szArgv2[512] = { 0 };
		size_t ncharsConverted = 0;
		wcstombs(szArgv2, argv[1], sizeof(szArgv2));
		while (nbegin <= nend && nstop != -1)
		{
			nmid = (nbegin + nend) / 2;
			nameval rec = { 0 };
			inputfiles.seekg(nmid* sizeof(nameval));
			inputfiles.read((char*)&rec, sizeof(nameval));
			if (strcmp(szArgv2, rec.fld_nm)<0)
			{
				nend = nmid - 1;
				nmid = (nbegin + nend) / 2;
			}
			else
			{
				if (strcmp(szArgv2, rec.fld_nm)>0)
				{
					nbegin = nmid + 1;
					nmid = (nbegin + nend) / 2;
				}
				else
				{
					nstop = -1;
					strcpy(nm_got, rec.fld_nm);
					val_got = rec.fld_val;
				}
			}
		}
		if (nstop == -1)
		{
			std::cout << "\nFound it!\n";
			std::cout << "(From vector record: " << nm_got
				<< ' ' << val_got << ")\n";
		}
		else std::cout << "\nDidn't find it within file -'" << filename.str().c_str() << "'!\n";
	time(&currtime);
	sec = difftime(currtime, timev);
	std::cout << "Search finishes with only " << sec << " seconds";
	system("pause>null");
	return 0;
}



but I still get this
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfX
EJ"

C:\ReadBinaryFile\x64\Release>



when searching the big file. I can also show you the big file, if possible.

ozo commented:
"zzzzzOMoXmtyPzuCfX
EJ"
looks like 21 characters.  If it is really contained in the file, it would either overflow  char fld_nm[21]; or be unterminated, either of which would again cause undefined behavior.

Also, if the file is too big, you are still casting to (int)

HuaMin Chen (System Analyst, author) commented:
Yes, the string definitely does exist in the 5 GB file. I'm afraid I may not be able to upload it, as it is still 2.8 GB after zipping.

> Also, if the file is too big, you are still casting to (int)

What do you mean by this?

ozo commented:
2.8GB/sizeof(nameval) should not overflow a signed 32 bit int.

But a 21 character string existing in a fld_nm[21] would cause undefined behavior
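The arithmetic behind the first remark can be checked with a few lines. This sketch assumes a record size of 32 bytes (21 name bytes, 3 padding bytes and an 8-byte fld_val, which is the usual x64 layout but is not guaranteed) and the 5 GB file size mentioned above. kRecordSize, record_count and fits_in_int32 are hypothetical names:

```cpp
#include <cstdint>

// Assumed record size: 21 name bytes + 3 padding bytes + 8-byte fld_val.
constexpr std::int64_t kRecordSize = 32;

// Number of whole records in a file of `file_bytes` bytes.
constexpr std::int64_t record_count(std::int64_t file_bytes)
{
    return file_bytes / kRecordSize;
}

// True if `value` can be stored in a signed 32-bit integer.
constexpr bool fits_in_int32(std::int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

Under these assumptions a 5 GB file (5,368,709,120 bytes) holds 167,772,160 records, which fits in a signed 32-bit int. The byte size itself does not, so the size in bytes is what needs a 64-bit type, not the record count.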

ozo commented:
Does your ifstream handle >32 bit streampos?  What is tellg after the seekg?
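One way to answer this is to seek to a known offset and print what tellg reports afterwards. A minimal sketch, with seek_and_tell as a hypothetical helper name:

```cpp
#include <fstream>

// Seeks to the absolute byte position `offset` and returns the position
// the stream reports afterwards, or -1 if the seek failed.
long long seek_and_tell(std::ifstream& in, long long offset)
{
    in.clear();   // reset eofbit/failbit left over from earlier reads
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    if (!in) return -1;
    return static_cast<std::streamoff>(in.tellg());
}
```

Calling it on the big file with an offset above 4,294,967,295 and comparing the result with the requested offset shows whether positions beyond 32 bits survive the round trip.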

HuaMin Chen (System Analyst, author) commented:
> Does your ifstream handle >32 bit streampos?  What is tellg after the seekg?

It is an x64 project. What do I need to show you so that you can check it? Thanks.

ozo commented:
What return value did you get from the program?

HuaMin Chen (System Analyst, author) commented:
I get nothing, like this:
C:\ReadBinaryFile\x64\Release>ReadBinaryFile "zzzzzOMoXmtyPzuCfX
EJ"

C:\ReadBinaryFile\x64\Release>


ozo commented:
Did you get a return value from one of your
return ERROR;
return -3;
 return -4;
return 0;
statements?
If not, what was the system return value?
If you check it with
echo $?
or
echo %errorlevel%
that should give you a clue to where your program is failing.

HuaMin Chen (System Analyst, author) commented:
I did not get it.

ozo commented:
Are you saying the exit status was 0 ?

HuaMin Chen (System Analyst, author) commented:
Sorry, the question is: when I run it against the big file, it does not show any output. What should I change in the code?

ozo commented:
What was the exit status?

HuaMin Chen (System Analyst, author) commented:
Sorry, I run the exe file to do the search. How do I adjust the code to show the exit status?

ozo commented:
echo %errorlevel%

HuaMin Chen (System Analyst, author) commented:
I get

-4

ozo commented:
So it looks like it came from one of
            if (!inputfiles.read((char*)&names, sizeof(nameval)))
                  return -4; //
            ret = stat(filename.str().c_str(), &fs);
            if (ret != 0)
                  return -4;

HuaMin Chen (System Analyst, author) commented:
How can I identify the problem?

ozo commented:
It would seem that either the read failed, or the stat failed.

HuaMin Chen (System Analyst, author) commented:
How do I correct the code (regarding the -4) so that it works?

ozo commented:
If the read failed, you might check the state flags eofbit, failbit, badbit
If the stat failed, you might check errno
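A minimal sketch of both checks follows. describe_stream_state and describe_errno are hypothetical helper names. The first reports which of eofbit, failbit and badbit are set on a stream, and the second turns an errno value into readable text with std::strerror:

```cpp
#include <cerrno>
#include <cstring>
#include <fstream>
#include <string>

// Names the state flags that are set on a stream, e.g. "eofbit failbit".
std::string describe_stream_state(const std::ios& in)
{
    const std::ios::iostate st = in.rdstate();
    if (st == std::ios::goodbit) return "goodbit";
    std::string text;
    if (st & std::ios::eofbit)  text += "eofbit ";
    if (st & std::ios::failbit) text += "failbit ";
    if (st & std::ios::badbit)  text += "badbit ";
    text.pop_back();            // drop the trailing space
    return text;
}

// Turns an errno value (for example after a failed stat call) into text.
std::string describe_errno(int err)
{
    return std::strerror(err);
}
```

Printing describe_stream_state(inputfiles) after the failed read, or describe_errno(errno) right after the failed stat call, would show which of the two return -4 paths was taken and why. errno should be read immediately after the failing call, before any other library call can change it.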

HuaMin Chen (System Analyst, author) commented:
Can I have more details on how to check these? Thanks.

ozo commented:
stat can fail due to
EOVERFLOW
    path or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.

ozo commented:
Do you have a stat64 function?

HuaMin Chen (System Analyst, author) commented:
The project is already an x64 project.

HuaMin Chen (System Analyst, author) commented:
OK. Thanks.
Question has a verified solution.
