TwentyFourSeven

Reading files : line breaks and EOF

I'm trying to read delimited text files into a C++ program.

There are two issues that are bugging me :

(1) Line breaks

My code is currently as below.  If I do not specify "\r" in the getline function, the whole thing stops working.

The issue I've got is that I cannot assume that all files inputted will terminate with "\r", some might be "\n" and some might be "\r\n".

How do I handle that issue ?  I did some research that suggested opening the file in binary mode would help, but this has had no positive effect.

(2) EOF

There seems to me something wrong with my EOF detection routine but I can't figure out what.

The output from sscanf is fed into a vector of vectors (myData).

The inner vector operates as expected, there are 10 delimited fields and therefore inner vector size is 10 and everyone is happy.

The outer vector does not operate as expected.  There are only 40 lines in the file, however the program will crash and burn if I specify vector size of 40.  If I add a magical extra vector element to the outer vector, everyone is happy again !

Have I missed something obvious here ?   And I can't figure out how to count the number of lines in the file first due to the line break issue above !

Over to you experts !

vector<vector<string> > myData(41, vector<string> (10));
ifstream infile(fileName, ios_base::in | ios_base::binary);
if (!infile.is_open()) {
		cerr << "Unable to open input file !" << endl;
		return 0;
	}
	while (!infile.eof()) {
		getline(infile, strLine,'\r');
		if (!strLine.length())
			continue;
		sscanf(find stuff........)
}

evilrix

>> My code is currently as below.  If I do not specify "\r" in the getline function, the whole thing stops working.
You are opening the file as binary; if you open it as text you won't need to do this (although see below for more info on this).

>> The issue I've got is that I cannot assume that all files inputted will terminate with "\r", some might be "\n" and some might be "\r\n".
If you open this as text rather than binary it'll make parsing a little simpler, since on Windows CRLF and LF will both be read as LF, so all you'll need to handle as a special case is the CR. I had this exact problem when parsing PDF files :(

>> How do I handle that issue ?
Well, getline is designed to read a text file so you'll have to code something to parse this yourself.

>> There are only 40 lines in the file, however the program will crash and burn if I specify vector size of 40.
Heh, you didn't provide the important part of the sscanf, which is the format specifier. That said, why don't you use a stringstream to extract your values? It'll be far simpler and safer.

>> And I can't figure out how to count the number of lines in the file first due to the line break issue above !
The line ending inconsistencies mean you'll really have to code your own parser, since the file isn't consistent. Read in each line and then parse each line for embedded CRs that might make it multiple lines.
TwentyFourSeven

ASKER

Ah, the evil one, we meet again !

Thanks for your help on yesterday's question by the way.

I'm sure I did try this in text mode, but will try again now and report back....

stringstream ....not another C++ function to learn !!!!   ;-(

I'll go play around with one or two of your suggestions and will be back with an update.....
By the way, nothing special in my boring old scanf specifier... seems to work fine if I add that extra magic vector element...

sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);
>> Ah, the evil one, we meet again !
Muhahahahahahah!!!! :)

>> Thanks for your help on yesterday's question by the way
Anytime.

>> stringstream ....not another C++ function to learn !!!!   ;-(
A class, that is a memory bound iostream... very useful for converting and parsing data

>> sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);
Using stringstream...

std::stringstream ss(strLine);
int nTempID;
int nTemp1;
std::string nTemp2;

ss >> nTempID;
ss >> nTemp1;
ss >> nTemp2;

if(ss.fail()) { /* something wicked this way comes: a field did not parse */ }

More verbose but a lot safer than sscanf.

http://www.cplusplus.com/reference/iostream/stringstream/

By using stringstream in this way you can parse the line and check it parsed ok without setting the original stream bad if the input is not as you'd expect.
A quick Google also turned up a blog post with the title....

"stringstream is completely useless "

Although it seems the author is more worried about using stringstream for output (http://www.flamingspork.com/blog/2009/08/04/stringstream-is-completely-useless-and-why-c-should-have-a-snprintf/)


However you do make stringstream look so much neater and easy .... I guess you don't have a "sage" title for nothing..... ;-)

Humpf .... back to my IDE to delete half my code ..... ;-(
>> A quick Google also turned up a blog post with the title....
Well, I'm sure we can all find reasons to state something is useless :)

I'd hardly call 1 or 2 compelling arguments, but even if they were they don't really affect your use case here.

>> I guess you don't have a "sage" title for nothing...
Heh, oh so close to Genius (by about 8000 pts) it hurts now :)
>> Well, I'm sure we can all find reasons to state something is useless :)

You can call my C++ code that any day.  Be my guest.  ;-)

>> Heh, oh so close to Genius (by about 8000 pts) it hurts now :)

Pah don't worry, with me around it will probably only take about a couple of days of my useless coding to come up with enough questions to earn you a title of God.  ;-)

hmmmm.... ;-(

Making this one simple change......

// sscanf(strLine.c_str(), "%d,%d,%s", &nTempID, &nTemp1,&nTemp2);
std::stringstream ss(strLine);
ss >> nTempID;
ss >> nTemp1;
ss >> nTemp2;

and cout'ing one of the vector elements for debug results in.....

-1073743052
0
0
0
0
etc. etc.

Switching back and the same cout works fine ?
See if this helps.
#include <sstream>
#include <string>
#include <iostream>
 
int main()
{
   std::string strLine = "999 444 hello";
   std::stringstream ss(strLine);
   int nTempID;
   int nTemp1;
   std::string nTemp2;
 
   ss >> nTempID;
   ss >> nTemp1;
   ss >> nTemp2;
 
   std::cout
      << nTempID << " "
      << nTemp1 << " "
      << nTemp2 << std::endl;
}


That compiles and runs, so are you saying stringstream is only good for TSV and not CSV ?

>> That compiles and runs, so are you saying stringstream is only good for TSV and not CSV ?
My bad... for CSV something like this will work..

#include <sstream>
#include <string>
#include <iostream>
 
int main()
{
   std::string strLine = "999,444,hello world";
   std::stringstream ssl(strLine);
   int nTempID = 0;
   int nTemp1 = 0;
   std::string nTemp2;
   std::string item;
 
   if(std::getline(ssl, item, ','))
   {
      std::stringstream ssi(item);
      ssi >> nTempID;
   }
 
   if(std::getline(ssl, item, ','))
   {
      std::stringstream ssi(item);
      ssi >> nTemp1;
   }
 
   std::getline(ssl, nTemp2, ',');
 
   std::cout
      << nTempID << " "
      << nTemp1 << " "
      << nTemp2 << std::endl;
}
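
The same getline-with-delimiter idea can be generalised to fill one inner vector per line, which fits the ten-field rows described in the question. This is only a sketch: `splitFields` is a hypothetical name, and it assumes plain comma-separated values with no quoted fields.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line on a delimiter.
// It does not understand quoted fields, so a comma inside quotes
// would still be treated as a separator.
std::vector<std::string> splitFields(const std::string& line, char delim = ',')
{
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string item;
    while (std::getline(ss, item, delim))
        fields.push_back(item);
    return fields;
}
```

Each row can then be stored with something like `myData.push_back(splitFields(strLine));`. Note that an empty field at the very end of a line (a trailing comma) is dropped by this approach.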


Now that does look quite neat.

Will try it out.

Works nicely.

If you don't mind, I'll leave this Q open a little longer whilst I hack away at sorting out the CR CRLF LF business ... ;-)
>> If you don't mind, I'll leave this Q open a little longer whilst I hack away at sorting out the CR CRLF LF business ... ;-)
No worries.
Would it be terribly inefficient and silly to use something like peek to solve the CR CRLF LF issue ?
>> Would it be terribly inefficient and silly to use something like peek to solve the CR CRLF LF issue ?
I'd be tempted to suggest that reading a line of the file into a string and then using string.find('\r') to parse out sub-strings might be a better (simpler) way.
http://www.cplusplus.com/reference/string/string/find/

However, peek is there for this kind of purpose, but it means you'll end up reading the file 1 char at a time right (or do you have a different plan)?
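
A minimal sketch of the string.find('\r') approach, assuming the text has already been read into a std::string (for example by getline with the default '\n' delimiter); `splitOnCr` is a made-up name for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a chunk of text into pieces at each '\r'
// using std::string::find. The '\r' characters themselves are dropped.
std::vector<std::string> splitOnCr(const std::string& text)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = text.find('\r', start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    if (start < text.size())
        parts.push_back(text.substr(start));   // tail after the last '\r'
    return parts;
}
```

This works on whole strings at a time, so the file itself is never read one character at a time.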
BTW: If you think you're having a hard time, I have spent ALL day trying to get a test harness I've written linking correctly to the client library for this...

http://www.mongodb.org/

...and all I've managed to achieve is various different ways to get nasty linker errors with boost_threads dependencies :(

aaarrrggggg! :S
>> but it means you'll end up reading the file 1 char at a time right (or do you have a different plan)?

Yup, that was unfortunately the "plan".  Hence the wording of the question as I couldn't believe there was not a better way to do things.

>> BTW: If you think you're having a hard time, I have spent ALL day

Never heard of MongoDB, and if it took you ALL day, I think I might leave it that way ;-)

(The layout of their documentation also leaves a lot to be desired !)

BerkeleyDB doesn't have any nasty linker errors ..... not even with my coding ;-)
Hey Evil,

I've been thinking.

How about reading character by character until you know what the line break is and then stopping ?  i.e. one "line".

Or would that be inefficient too ?

Something along the lines of the code I've come up with below (which I think is a bit buggy !)
	char endType;
	int charnum;
	char chr;
	while (infile.get(chr).good()) {
		if (chr == '\r') {
			endType = '\r';
		} else if (chr == '\n') {
			endType = '\n';
 
		}
		if (!endType) break;
	}
 
cout << int(endType) << endl;


>> How about reading character by character until you know what the line break is and then stopping ?  i.e. one "line".
Ah, your problem was slightly different from mine in so far as my files had a mix of line endings in one file (PDF files -- pah!).

I think I misunderstood your problem, you are saying they will have consistent line endings but each file will be different?

If so, I'd probably do this like below but that's just because I like using STL algorithms and your way should work just fine also..
#include <iostream>
#include <sstream>
#include <algorithm>
#include <iterator>
 
 
char GetLineEnding(std::istream & is)
{
	char const eol[] = { '\r','\n' };
	char const * end = eol + sizeof(eol);
	char const * itr = end;
	char c;
 
	while(is.get(c) && itr == end)
		itr = std::find(eol, end, c);
		
	is.clear();
	is.seekg(0); // If you want to preserve the original position use tellg to get it first
	
	return itr == end ? ~0 : *itr;
}
 
int main()
{
	// These string streams are pretending to be text files streams
	std::stringstream ss1("qqq\rwww\reee\rttt\ryyy");
	std::stringstream ss2("qqq\nwww\neee\nttt\nyyy");
	std::stringstream ss3("ham and eggs");
	
	std::cout 
		<< "ss1 = " << (int)GetLineEnding(ss1) << std::endl
		<< "ss2 = " << (int)GetLineEnding(ss2) << std::endl
		<< "ss3 = " << (int)GetLineEnding(ss3) << std::endl;
}


>> Ah, your problem was slightly different from mine in so far as my files had a mix of line endings in one file (PDF files -- pah!).

I misunderstood your problem too.  Your problem is (was ?) so much more interesting !

I did not realise PDF files had different line endings !  What sort of person designed those specs ??

However, I am now even more curious to find out how you solved your problem !    To help you get those 8000 points you wanted, how about I open a new question with 500 points on offer ?  ;-)



>> I think I misunderstood your problem, you are saying they will have consistent line endings but each file will be different?

Yes .... for example the wonderful Microsoft Excel saves with "\r" whilst other sources output with "\n" or whatever.  

Your suggestion sounds interesting, and I am guessing it is probably more performance effective to use native C++  STL rather than C. ?
SOLUTION
_phoenix_

you should add also:

   infile->close();

   delete infile;

before the return :D...
>> Hey TwentyFour...

Hello again phoenix.  

>> perhaps the evil one has solved all your question

For this particular question it would appear so ... however there will always be plenty more questions up for grabs.  

I think both you and evilrix are both fantastic and have been very helpful (and patient) in answering my terrible newbie questions !


>> How about this... perhaps the evil one has solved all your questions, but take this as a gift..

That looks very interesting too !

I can see I'm going to have to come up with a stupidly big input file to see which one is quicker !  ;-)

What an interesting experience C++ is turning out to be !  So many good answers to the same question !
evil,

Just a quick thought (have not tried the code yet)....

re:  char const eol[] = { '\r','\n' };

It looks like your suggestion as presently coded would need a little more work to cope with "\r\n" ?

I'm guessing it's easily done with a quick "peek" or something, but just thought I would ask whilst I remember before closing off the question !
Well, this one took  2.041 s to read the measure-100k.csv without printing out the results to the console.
>> BTW: If you think you're having a hard time, I have spent ALL day trying to get a test harness I've written linking correctly to the client library for this...

aaarrrggggg! :S

Even a tiger loses its prey sometimes my evil one friend... I hope you solve your problem ;)
ASKER CERTIFIED SOLUTION
phoenix

>> Well, this one took  2.041 s to read the measure-100k.csv without printing out the results to the console.

Heh..... I almost forgot about those test files !

This little project is actually doing something with a few log files, so I will probably soon be trying to see whether it's possible to implement regex in C++ without going completely crazy !

I know I should probably learn C++ in a more structured fashion, but learning with a "project" goal in mind is so much more realistic than textbook learning.


evil.......

>> Windows (for some reason I thought that was your platform)

heh .... I'm pleased to say Windows is my secondary platform as far as my personal IT goes ... ;-)

As I learnt from phoenix's solution to the number comparison question, C++/CLI does have its benefits.... a nice GUI without too many pages of code ...... ;-)


I'll leave this question open over the weekend ... but I think it's probably all sorted now.
Nice work Evil, and good luck with your present challenge.

phoenix_   , although your answer was a gift, I have given you a small token number of points just so that your answer gets marked out as a helpful one !