Reading files : line breaks and EOF

TwentyFourSeven
I'm trying to read delimited text files into a C++ program.

There are two issues that are bugging me :

(1) Line breaks

My code is currently as below.  If I do not specify "\r" in the getline function, the whole thing stops working.  

The issue I've got is that I cannot assume that all files inputted will terminate with "\r", some might be "\n" and some might be "\r\n".

How do I handle that issue ?  I did some research that suggested opening the file in binary mode would help, but this has had no positive effect.

(2) EOF

There seems to me something wrong with my EOF detection routine but I can't figure out what.

The output from sscanf is fed into a vector of vectors (myData).

The inner vector operates as expected, there are 10 delimited fields and therefore inner vector size is 10 and everyone is happy.

The outer vector does not operate as expected.  There are only 40 lines in the file, however the program will crash and burn if I specify vector size of 40.  If I add a magical extra vector element to the outer vector, everyone is happy again !

Have I missed something obvious here ?   And I can't figure out how to count the number of lines in the file first due to the line break issue above !

Over to you experts !

vector<vector<string> > myData(41, vector<string> (10));
ifstream infile(fileName, ios_base::in | ios_base::binary);
if (!infile.is_open()) {
		cerr << "Unable to open input file !" << endl;
		return 0;
	}
	while (!infile.eof()) {
		getline(infile, strLine,'\r');
		if (!strLine.length())
			continue;
		sscanf(find stuff........)
}

evilrix, Senior Software Engineer (Avast)

Commented:
>> My code is currently as below.  If I do not specify "\r" in the getline function, the whole thing stops working.
You are opening the file as binary. If you open it as text you won't need to do this (although see below for more info on this).

>> The issue I've got is that I cannot assume that all files inputted will terminate with "\r", some might be "\n" and some might be "\r\n".
If you open this as text rather than binary it'll make parsing a little simpler, since CRLF and LF will both be represented as LF, so all you'll need to handle as a special case is the CR. I had this exact problem when parsing PDF files :(

>> How do I handle that issue ?
Well, getline is designed to read a text file, so you'll have to code something to parse this yourself.

>> There are only 40 lines in the file, however the program will crash and burn if I specify vector size of 40.
Heh, you didn't provide the important part of the sscanf, which is the format specifier. That said, why don't you use a stringstream to extract your values? It'll be far simpler and safer.

>> And I can't figure out how to count the number of lines in the file first due to the line break issue above !
The line ending inconsistencies mean you'll really have to code your own parser, since the file isn't consistent. Read in each line and then parse each line for embedded CRs that might make it multiple lines.
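
A rough sketch of what such a hand-rolled reader could look like, assuming the file is opened in binary mode so the runtime does not translate line endings first. The names getlineAnyEol and readRows are made up for illustration and are not from the thread. The loop tests the result of the read itself rather than eof(), and rows are appended with push_back, so the number of lines does not need to be known in advance. Reading past the end before testing is one common cause of an unexpected extra row, though it may not be the only thing going on in the original code.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one line, accepting "\n", "\r" or "\r\n"
// as the terminator.
std::istream& getlineAnyEol(std::istream& is, std::string& line)
{
    line.clear();
    char c;
    while (is.get(c)) {
        if (c == '\n')
            return is;
        if (c == '\r') {
            if (is.peek() == '\n')   // CRLF: swallow the LF as well
                is.get(c);
            return is;
        }
        line += c;
    }
    // get() hit end of input. A final line with no terminator still counts,
    // so drop failbit but keep eofbit; the next call will then fail cleanly.
    if (is.eof() && !line.empty())
        is.clear(std::ios_base::eofbit);
    return is;
}

// Hypothetical reader: one inner vector per non-empty line, split on ','.
std::vector<std::vector<std::string> > readRows(std::istream& is)
{
    std::vector<std::vector<std::string> > rows;
    std::string line;
    while (getlineAnyEol(is, line)) {    // test the read, not eof()
        if (line.empty())
            continue;
        std::vector<std::string> fields;
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))
            fields.push_back(field);
        rows.push_back(fields);          // grows as needed, no preset size
    }
    return rows;
}
```

With a std::ifstream opened using ios_base::in | ios_base::binary, readRows(infile).size() then gives the number of non-empty lines, whichever line ending the file uses.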

Author

Commented:
Ah, the evil one, we meet again !

Thanks for your help on yesterday's question by the way.

I'm sure I did try this in text mode, but will try again now and report back....

stringstream ....not another C++ function to learn !!!!   ;-(

I'll go play around with one or two of your suggestions and will be back with an update.....

Author

Commented:
By the way, nothing special in my boring old scanf specifier... seems to work fine if I add that extra magic vector element...

sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);

evilrix, Senior Software Engineer (Avast)

Commented:
>> Ah, the evil one, we meet again !
Muhahahahahahah!!!! :)

>> Thanks for your help on yesterday's question by the way
Anytime.

>> stringstream ....not another C++ function to learn !!!!   ;-(
A class, that is a memory bound iostream... very useful for converting and parsing data

>> sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);
Using stringstream...

std::stringstream ss(strLine);
int nTempID;
int nTemp1;
std::string nTemp2;

ss >> nTempID;
ss >> nTemp1;
ss >> nTemp2;

if(ss.fail()) { /* something wicked this way comes */ }

More verbose but a lot safer than sscanf.

http://www.cplusplus.com/reference/iostream/stringstream/

By using stringstream in this way you can parse the line and check it parsed ok without setting the original stream bad if the input is not as you'd expect.
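
As a minimal sketch of that check: a failed extraction sets failbit, which fail() and the stream's boolean conversion both report, whereas bad() only reports badbit. The Record type and the parseRecord name are hypothetical, and the sketch assumes whitespace-separated fields.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type, for illustration only.
struct Record {
    int id;
    int value;
    std::string label;
};

// Returns true only if all three whitespace-separated fields were extracted.
// The caller's Record is left untouched when parsing fails.
bool parseRecord(const std::string& line, Record& out)
{
    std::istringstream ss(line);
    Record tmp;
    if (!(ss >> tmp.id >> tmp.value >> tmp.label))
        return false;   // failbit set: a field was missing or not a number
    out = tmp;
    return true;
}
```

Because the parsing happens on a temporary istringstream, a malformed line leaves the file stream itself in a good state and the caller can simply skip that line.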

Author

Commented:
A quick Google also turned up a blog post with the title....

"stringstream is completely useless "

Although it seems the author is more worried about using stringstream for output (http://www.flamingspork.com/blog/2009/08/04/stringstream-is-completely-useless-and-why-c-should-have-a-snprintf/)


However you do make stringstream look so much neater and easy .... I guess you don't have a "sage" title for nothing..... ;-)

Humpf .... back to my IDE to delete half my code ..... ;-(

evilrix, Senior Software Engineer (Avast)

Commented:
>> A quick Google also turned up a blog post with the title....
Well, I'm sure we can all find reasons to state something is useless :)

I'd hardly call 1 or 2 compelling arguments, but even if they were they don't really affect your use case here.

>> I guess you don't have a "sage" title for nothing...
Heh, oh so close to Genius (by about 8000 pts) it hurts now :)

Author

Commented:
>> Well, I'm sure we can all find reasons to state something is useless :)

You can call my C++ code that any day.  Be my guest.  ;-)

>> Heh, oh so close to Genius (by about 8000 pts) it hurts now :)

Pah don't worry, with me around it will probably only take about a couple of days of my useless coding to come up with enough questions to earn you a title of God.  ;-)

Author

Commented:
hmmmm.... ;-(

Making this one simple change......

// sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);
std::stringstream ss(strLine);
ss >> nTempID;
ss >> nTemp1;
ss >> nTemp2;

and cout'ing one of the vector elements for debug results in.....

-1073743052
0
0
0
0
etc. etc.

Switching back and the same cout works fine ?

evilrix, Senior Software Engineer (Avast)

Commented:
See if this helps.
#include <sstream>
#include <string>
#include <iostream>
 
int main()
{
   std::string strLine = "999 444 hello";
   std::stringstream ss(strLine);
   int nTempID;
   int nTemp1;
   std::string nTemp2;
 
   ss >> nTempID;
   ss >> nTemp1;
   ss >> nTemp2;
 
   std:: cout
      << nTempID << " "
      << nTemp1 << " "
      << nTemp2 << std::endl;
}

Author

Commented:
That compiles and runs, so are you saying stringstream is only good for TSV and not CSV ?

evilrix, Senior Software Engineer (Avast)

Commented:
>> That compiles and runs, so are you saying stringstream is only good for TSV and not CSV ?
My bad... for CSV something like this will work..

#include <sstream>
#include <string>
#include <iostream>
 
int main()
{
   std::string strLine = "999,444,hello world";
   std::stringstream ssl(strLine);
   int nTempID = 0;
   int nTemp1 = 0;
   std::string nTemp2;
   std::string item;
 
   if(std::getline(ssl, item, ','))
   {
      std::stringstream ssi(item);
      ssi >> nTempID;
   }
 
   if(std::getline(ssl, item, ','))
   {
      std::stringstream ssi(item);
      ssi >> nTemp1;
   }
 
   std::getline(ssl, nTemp2, ',');
 
   std:: cout
      << nTempID << " "
      << nTemp1 << " "
      << nTemp2 << std::endl;
}
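
The repeated getline-then-convert step above could be folded into a small helper. This is only a sketch; readField is a hypothetical name, not something from the thread. It pulls one delimited item from the line stream and converts it with a second, temporary stringstream, returning false if the item is missing or does not convert.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: extracts the next `delim`-terminated item from `line`
// and converts it to T. Intended for numeric fields; a free-text field that
// may contain spaces is better read with std::getline directly.
template <typename T>
bool readField(std::istream& line, T& out, char delim = ',')
{
    std::string item;
    if (!std::getline(line, item, delim))
        return false;                 // nothing left on the line
    std::istringstream ss(item);
    return !(ss >> out).fail();       // false if the item did not convert
}
```

Typical use would be two readField calls for nTempID and nTemp1, followed by a plain std::getline for the trailing text field.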

Author

Commented:
Now that does look quite neat.

Will try it out.

Author

Commented:
Works nicely.

If you don't mind, I'll leave this Q open a little longer whilst I hack away at sorting out the CR CRLF LF business ... ;-)

evilrix, Senior Software Engineer (Avast)

Commented:
>> If you don't mind, I'll leave this Q open a little longer whilst I hack away at sorting out the CR CRLF LF business ... ;-)
No worries.

Author

Commented:
Would it be terribly inefficient and silly to use something like peek to solve the CR CRLF LF issue ?

evilrix, Senior Software Engineer (Avast)

Commented:
>> Would it be terribly inefficient and silly to use something like peek to solve the CR CRLF LF issue ?
I'd be tempted to suggest that reading a line at a time into a string and then using string.find('\r') to parse out sub-strings might be a better (simpler) way.
http://www.cplusplus.com/reference/string/string/find/

However, peek is there for this kind of purpose, but it means you'll end up reading the file 1 char at a time right (or do you have a different plan)?
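
For reference, the string.find('\r') idea mentioned above could look roughly like this. It is a sketch only, and splitOnCR is a hypothetical name. It takes one chunk as returned by getline and breaks it into sub-strings at each embedded CR.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one chunk returned by getline() on embedded '\r'.
std::vector<std::string> splitOnCR(const std::string& chunk)
{
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = chunk.find('\r', start)) != std::string::npos) {
        lines.push_back(chunk.substr(start, pos - start));
        start = pos + 1;                 // continue after the CR
    }
    if (start < chunk.size())            // text after the last CR, if any
        lines.push_back(chunk.substr(start));
    return lines;
}
```

A chunk with no CR comes back as a single element, so the same call works for LF-terminated files too.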

evilrix, Senior Software Engineer (Avast)

Commented:
BTW: If you think you're having a hard time, I have spent ALL day trying to get a test harness I've written linking correctly to the client library for this...

http://www.mongodb.org/

...and all I've managed to achieve is various different ways to get nasty linker errors with boost_threads dependencies :(

aaarrrggggg! :S

Author

Commented:
>> but it means you'll end up reading the file 1 char at a time right (or do you have a different plan)?

Yup, that was unfortunately the "plan".  Hence the wording of the question as I couldn't believe there was not a better way to do things.

>> BTW: If you think you're having a hard time, I have spent ALL day

Never heard of MongoDB, and if it took you ALL day, I think I might leave it that way ;-)

(The layout of their documentation also leaves a lot to be desired !)

BerkeleyDB doesn't have any nasty linker errors ..... not even with my coding ;-)

Author

Commented:
Hey Evil,

I've been thinking.

How about reading character by character until you know what the line break is and then stopping ?  i.e. one "line".

Or would that be inefficient too ?

Something along the lines of the code I've come up with below (which I think is a bit buggy !)
	char endType;
	int charnum;
	char chr;
	while (infile.get(chr).good()) {
		if (chr == '\r') {
			endType = '\r';
		} else if (chr == '\n') {
			endType = '\n';
 
		}
		if (!endType) break;
	}
 
cout << int(endType) << endl;

evilrix, Senior Software Engineer (Avast)

Commented:
>> How about reading character by character until you know what the line break is and then stopping ?  i.e. one "line".
Ah, your problem was slightly different from mine in so far as my files had a mix of line endings in one file (PDF files -- pah!).

I think I misunderstood your problem, you are saying they will have consistent line endings but each file will be different?

If so, I'd probably do this like below, but that's just because I like using STL algorithms, and your way should work just fine also.
#include <iostream>
#include <sstream>
#include <algorithm>
#include <iterator>
 
 
char GetLineEnding(std::istream & is)
{
	char const eol[] = { '\r','\n' };
	char const * end = eol + sizeof(eol);
	char const * itr = end;
	char c;
 
	while(is.get(c) && itr == end)
		itr = std::find(eol, end, c);
		
	is.clear();
	is.seekg(0); // If you want to preserve the original position use tellg to get it first
	
	return itr == end ? ~0 : *itr;
}
 
int main()
{
	// These string streams are pretending to be text files streams
	std::stringstream ss1("qqq\rwww\reee\rttt\ryyy");
	std::stringstream ss2("qqq\nwww\neee\nttt\nyyy");
	std::stringstream ss3("ham and eggs");
	
	std::cout 
		<< "ss1 = " << (int)GetLineEnding(ss1) << std::endl
		<< "ss2 = " << (int)GetLineEnding(ss2) << std::endl
		<< "ss3 = " << (int)GetLineEnding(ss3) << std::endl;
}

Author

Commented:
>> Ah, your problem was slightly different from mine in so far as my files had a mix of line endings in one file (PDF files -- pah!).

I misunderstood your problem too.  Your problem is (was ?) so much more interesting !

I did not realise PDF files had different line endings !  What sort of person designed those specs ??

However, I am now even more curious to find out how you solved your problem !    To help you get those 8000 points you wanted, how about I open a new question with 500 points on offer ?  ;-)



>> I think I misunderstood your problem, you are saying they will have consistent line endings but each file will be different?

Yes .... for example the wonderful Microsoft Excel saves with "\r" whilst other sources output with "\n" or whatever.  

Your suggestion sounds interesting, and I am guessing it is probably more efficient to use the native C++ STL rather than C ?

Hey TwentyFour...

How about this... perhaps the evil one has solved all your questions, but take this as a gift... I'm opening CSV files as the last time... it's working... nothing fancy... only the opening and the print of all the values. This is pure C++ :D.

#include <string>
#include <vector>
#include <sstream>
#include <iostream>
#include <fstream>
#include <typeinfo>   // std::bad_cast

#define ARGS                                                               (2u)

// Converts between types via a stringstream; throws std::bad_cast if the
// whole input cannot be converted.
template <class out_type, class in_value>
out_type type_cast(in_value input)
{
   out_type result;
   std::stringstream converter;   // automatic object: nothing leaks on throw

   if(!(converter << input)  ||
      !(converter >> result) ||
      !(converter >> std::ws).eof())
   {
      throw std::bad_cast();
   }

   return (result);
}

int main(int argc, char *argv[])
{
   unsigned long x;
   std::string input;
   std::string number;
   std::ifstream* infile = new std::ifstream();
   std::stringstream* parse = new std::stringstream();

   if(ARGS == argc)
   {
      infile->open(argv[1]);

      if (infile->is_open())
      {
         while(!infile->eof())
         {
            std::getline(*infile, input);
            parse->str(input);

            while(std::getline((*parse),number,','))
            {
               try
               {
                  x = type_cast<unsigned long,std::string>(number);
                  std::cout << x;
                  std::cout << std::endl;
               }
               catch (...)
               {
                  /* Do Nothing Here */
               }
            }

            parse->clear();
         }
      }
      else
      {
          std::cout << "File '" << argv[1] << "' doesn't exist";
      }
   }
   else
   {
       std::cout << "Usage is: csv 'filename'";
   }

   delete parse;

   return 0;
}
you should add also:

   infile->close();

   delete infile;

before the return :D...
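
An alternative to the manual close() and delete calls is to avoid new altogether. The following is a sketch under that assumption, with readNumbers as a hypothetical name. It keeps the same idea as the listing above (split each line on ',' and keep the fields that convert to unsigned long) but uses only automatic objects, which clean up after themselves when they go out of scope.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the reader above that needs no new/delete.
std::vector<unsigned long> readNumbers(std::istream& in)
{
    std::vector<unsigned long> values;
    std::string line;
    while (std::getline(in, line)) {          // loop ends when a read fails
        std::istringstream parse(line);       // fresh stream for each line
        std::string item;
        while (std::getline(parse, item, ',')) {
            std::istringstream conv(item);
            unsigned long x;
            // Keep the field only if all of it converted.
            if (conv >> x && (conv >> std::ws).eof())
                values.push_back(x);          // other fields are skipped
        }
    }
    return values;
}
```

Called with a local std::ifstream infile(argv[1]);, the file is closed by the ifstream destructor when the variable goes out of scope, so there is nothing to release by hand.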

Author

Commented:
>> Hey TwentyFour...

Hello again phoenix.  

>> perhaps the evil one has solved all your question

For this particular question it would appear so ... however there will always be plenty more questions up for grabs.  

I think both you and evilrix are both fantastic and have been very helpful (and patient) in answering my terrible newbie questions !


>> How about this... perhaps the evil one has solved all your questions, but take this as a gift..

That looks very interesting too !

I can see I'm going to have to come up with a stupidly big input file to see which one is quicker !  ;-)

What an interesting experience C++ is turning out to be !  So many good answers to the same question !

Author

Commented:
evil,

Just a quick thought (have not tried the code yet)....

re:  char const eol[] = { '\r','\n' };

It looks like your suggestion as presently coded would need a little more work to cope with "\r\n" ?

I'm guessing it's easily done with a quick "peek" or something, but just thought I would ask whilst I remember before closing off the question !
Well, this one took  2.041 s to read the measure-100k.csv without printing out the results to the console.
>> BTW: If you think you're having a hard time, I have spent ALL day trying to get a test harness I've written linking correctly to the client library for this...

aaarrrggggg! :S

Even a tiger loses its prey sometimes my evil one friend... I hope you solve your problem ;)

evilrix, Senior Software Engineer (Avast)
Commented:
>> However, I am now even more curious to find out how you solved your problem !
It wasn't that complex... I just treated it as a text file, used getline and then sub-parsed each line for '\r'.

The PDF format is pretty messy unfortunately... and far too complex for me to explain here (mainly cos I've forgotten a lot of it).

>> I think both you and evilrix are both fantastic
Thanks TF7, the feeling is mutual I assure you. It makes a big difference when you are helping someone who is present and has a SOH ;)

>> It looks like your suggestion as presently coded would need a little more work to cope with "\r\n" ?
On Windows (for some reason I thought that was your platform) this will work fine since CRLF (the system default line ending) is converted to LF. If you are working on a different platform it may not; that being the case, the small modification below should definitely work on all platforms. Note that I leave error handling to you :)

>> Even a tiger loses its prey sometimes my evil one friend... I hope you solve your problem ;)
Small baby steps closer -- but it is proving to be a right pain in the mongo's :D
#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>
#include <vector>
#include <iterator>
 
 
char GetLineEnding(std::istream & is) 
{
   char const eol[] = { '\r','\n' };
   char const * end = eol + sizeof(eol);
   char const * itr = end;
   char c;
 
   while(itr == end && is.get(c))
      itr = std::find(eol, end, c);
 
   if(itr != end && *itr == '\r' && is.peek() == '\n')
      c = '\n';
   else
      c = itr == end ? ~0 : *itr;
 
   is.clear();
   is.seekg(0);
   
   return c;
}
 
int main()
{
   typedef std::vector<std::string> vs_t;
   vs_t v;
   v.push_back("qqq\rwww\reee\rttt\ryyy\r");
   v.push_back("qqq\nwww\neee\nttt\nyyy\n");
   v.push_back("qqq\r\nwww\r\neee\r\nttt\r\nyyy\r\n");
 
   for(vs_t::const_iterator itr = v.begin() ; itr != v.end() ; ++itr)
   {   
      {   
         std::ofstream ofs("data.txt", std::ios::binary);
         ofs.write(itr->c_str(), itr->size());
      }   
     
      {   
         std::ifstream ifs("data.txt", std::ios::binary);
         std::cout 
            << (int)GetLineEnding(ifs) << std::endl;
      }   
   }   
}

Author

Commented:
phoenix

>> Well, this one took  2.041 s to read the measure-100k.csv without printing out the results to the console.

Heh..... I almost forgot about those test files !

This little project is actually doing something with a few log files, so I will probably soon be trying to see whether it's possible to implement regex in C++ without going completely crazy !  

I know I should probably learn C++ in a more structured fashion, but learning with a "project" goal in mind is so much more realistic than textbook learning.


evil.......

>> Windows (for some reason I thought that was your platform)

heh .... I'm pleased to say Windows is my secondary platform as far as my personal IT goes ... ;-)

As I learnt from phoenix's solution to the number comparison question, C++/CLI does have its benefits.... a nice GUI without too many pages of code ...... ;-)


I'll leave this question open over the weekend ... but I think it's probably all sorted now.

Author

Commented:
x

Author

Commented:
Nice work Evil, and good luck with your present challenge.

phoenix_   , although your answer was a gift, I have given you a small token number of points just so that your answer gets marked out as a helpful one !
