Link to home
Start Free TrialLog in
Avatar of allmer
allmerFlag for Türkiye

asked on

ifstream get parts of a formatted textfile

I am trying to extract to parts of unknown size from a database (textfile).
It is composed as follows
>Title\n                      //The '>' preceedes a section of the database and \n terminates this header
                                //That line thus contains the header or title for the following section
AACCCTAGCTAGCATCAGCACGACG\n
ACGACTAGCACTNACGGCGACCTCG\n
ACACGAGCTGCGCCATAGCAGCAGG\n
                               //Each line is terminated after some charackters with a linefeed '\n'
                               //Can be between some and some hundred thousands chaarackters.
                               //Only ACGNT however
>Next section\n
....
>Third section\n
...
>and so on for 100 MB\n
I would like to extract the header to a variable called scaffold
and the corresponding section of the database to a variable called code.
Boh are char*.
I use the following approach, but it only works for the first section thereafter it creates wierd
resutlts.
bool CDna2Protein::GetNextPart(const char * _filePath)
{
      if(!databasein.is_open())                         //Declared as class variable so I do not have
                                                                            //to keep opening and closing a file.
                                                                            //ifstream databasein;
            databasein.open(_filePath,ios_base::in);
      char          ch,
            test = '>';
      int pos = 0;
      char *line = new char[256];
      //Read in header for this part of the database
      databasein.getline(line,256,'\n');
      scaffold = new char[256];
      sprintf(scaffold,"%s",line);
      delete line;
      //Now get the current position in the file
      int before = databasein.tellg();
      before--; //Does seem to give the next position so account for that
      int after = 0;
      while(databasein.get(ch)) {
            if(ch == test) {    //test = '>';
                  databasein.putback(ch);
                           //I would like the next read starting with the charackter I just putback.
                  after = databasein.tellg();
                  after--;    //see above
                  break;
            }
      }
      int len = after-before;
                //Knowing the size of the database section create a buffer to hold it
      code = new char[len+1];
      databasein.seekg(before);
      databasein.get(code,len,test);   //test='>';
      databasein.seekg(after);
                //should hopefully end up at desired position within file
      code[len]='\0';

                //do some output for testing
      len = int(strlen(code));
      char *temp = new char[10];
      sprintf(temp,"%d",len);
      CString str = "Laenge von "+CString(scaffold)+": "+CString(temp);
      delete temp;
      StringToFile("c:\\gpfc++\\len.log",str);
      trim();
                //Need to figure out how to test for termination.
      return(true);
}
Besides the problem of getting wrong parts of the database posing for scaffold,
I also need to return false on reaching feof.
Thank you for any suggestions
Jens
Avatar of Karl Heinz Kremer
Karl Heinz Kremer
Flag of United States of America image

I don't have the rest of your class, so I rewrote this as just a function, you should be able to retrofit this into your class again. All member variables are globals in my example. I think I came up with a better way of handling this (no need to calculate the size of the buffer anymore, this is all done with a string):

#include <string>
#include <fstream>
#include <iostream>

using namespace std;

ifstream input;
string scaffold;
string data;

bool GetNextPart(const char * _filePath)
{
      bool done = false;
      bool inRecord = false;

      if (!input.is_open())
      {
            input.open(_filePath);
      }

      while (!input.eof() && !done)
      {
            char buf[256];
            char beginOfLine;
            input >> beginOfLine;

            input.putback(beginOfLine);

            if (beginOfLine == '>')
            {
                  inRecord = !inRecord;
                  done = (!inRecord);


                  if (inRecord)
                  {
                        input.getline(buf, 256, '\n');
                        scaffold = buf;
                        data.erase();
                  }
            }
            else
            {
                  if (inRecord)
                  {
                        input.getline(buf, 256, '\n');
                        data += buf;
                  }
            }
      }

      cout << "scaffold: " << scaffold << endl;
      cout << "data: " << data << endl;

}

int main()
{
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");

      input.close();

      return 0;
}
Avatar of allmer

ASKER

Nice try khkremer.
I tried dong it with adding up CStrings, too; but it turned out
to be much slower than the char* approach.
The process of loading, trimming and translating to Protein ;-)
is about 10 times faster when not done with cstring.
When I originally wrote this thing in JAVA it actually crashed all
the time when using operator += on strings.
Well, I would like to see the approach above working. I don't see
a flaw in it so it really bugs me that it doesn't do what I want it
to.
Do tellg() and seekg() have any problems or did I call the functions
at the wrong moment.


Another approach I could think of would be:
Readline(theHeader);
Use a function to get position of the following '>'
read anything in between into char *buffer.


Somehow the function crashed on reaching the end of the database.
bool CDna2Protein::GetNextPart(const char * _filePath, long _fP)
{
     bool done = false;
     bool inRecord = false;
      CString data;
     if (!databasein.is_open())
     {
          databasein.open(_filePath);
     }

     while (!databasein.eof() && !done)
     {
          char buf[256];
          char beginOfLine;
          databasein >> beginOfLine;

          databasein.putback(beginOfLine);

          if (beginOfLine == '>')
          {
               inRecord = !inRecord;
               done = (!inRecord);


               if (inRecord)
               {
                    databasein.getline(buf, 256, '\n');
          scaffold = new char[256];
                    sprintf(scaffold,"%s",buf);         //had to really copy the data.
                    //data.erase();       //Didn't know what this is good for.
               }
          }
          else
          {
               if (inRecord)
               {
                    databasein.getline(buf, 256, '\n');
                    data += buf;
               }
          }
     }
    code = data.GetBuffer();    //Conversion back to my charackter array
    trim();
  //Needs to return something
  //false when done and true otherwise
}
I'll take a second look tomorrow.
Avatar of Salte
Salte

For one thing, be careful with those new char[] things.....

Especially the

char * scaffold = new char[256];

Looks suspicious to me.

This allocates a char array with room for 256 characters. However, if you want to store a C-style strings into it you must leave room for the NUL char at end so there is really only room for 255 characters.

Another thing that looks out of place is the sprintf there. What you are really doing is a simple strcpy and so I don't know why you bother to allcoate the line buffer in the first place. sprintf( dest, "%s", source) is exactly equivalent to strcpy(dest,source) so use strcpy and not sprintf in this situation.

Also, you seem to be doing things slightly inefficient, especially if this tends to be long buffers, since you essentially read the file twice - once for finding the length and then once again to actually read the data.

I suggest:

1. Have ONE buffer or string to hold the current data read. This buffer should grow but never shrink. Put the code in that buffer and once you get the end, copy the code from that buffer to a new string buffer and re-use the same buffer for next code/scaffold. Always read the file forward, never go back and read again. :-)

2. What happens to scaffold and code later? I never see them deleted in your code so I assume they are placed in some mapping or collection unless you read only one.

3. To return false upon eof or error is easy, just have your function return a reference to the ifstream you are reading. Oh, wait, you open the file in your function. Why would you want to return true/false depending on eof when you open the file in the function? Usually that is only done if you are looping on reading the same file many times and your code does not do that.

The "standard" way to read an istream and return eof/error/etc indication is this:

istream & read_data(istream & is, destination_type & dest)
{
      // read data from is and place result in dest.
      return is;
}

Then you can do something like this in your program:

void read_all_data(const char * filename, destination_type & dest)
{
    dest.clear(); // only if you want to clear it before reading the file.
    ifstream file(filename);
    while (read_data(file,dest))
        ;
}

Now, this assumes that destination_type is a collection which can hold an "infinite" amount of data and read_data will push the next set. Alternatively the read_data will only read one set and the destination type of read_all_data is a colelction of those data:

void read_all_data(const char * filename, destination_collection_type & dest)
{
    dest.clear(); // only if you want to clear it before reading the file.
    ifstream file(filename);
    destination_type elem;

    while (read_data(file,elem))
        dest.store(elem);
}

is the way to go if the read_data handle a single element while the read_all_data handle a collection type.

Of course, instead of read_data you might want to contemplate using overload of operator >>:

istream & operator >> (istream & is, destination_type & dest);

The code is the same but caller can now use:

while (file >> elem)
    dest.store(elem);

instead. This is pure syntax, the code is essentially the same.

Of course, if you have two different types for the single element and the collection of elements you can also overload operator >> for the collection as well.

The single element destination type should then simply be something that can hold a single code no matter how long and possibly also the scaffold - or is that scaffold shared among several codes?

If there is one scaffold for each code then simply make a pair. You can then either use STL pair type or make a separate class, give it a reasonable name code_and_scaffold doesn't quite cut it. I can see that this has to do with genetics but I am not that much into genetics that I can tell exactly what you would name such a thing.

If the scaffold is shared among several codes you should have a vector or list of codes associated with each scaffold title, if so you might want to use:

class scaffold {
private:
    string M_title;
    vector<string> M_codes;
public:
    .....
};

istream & operator >> (istream & is, scaffold & s);

Something like that. If you declare the operator as friend it has access to the members M_title and M_codes.

Also, I think you make heavy use of the allocator:

char * temp = new char[10];

// use temp here...

delete temp;

This is both wrong and bad. It is wrong because you use new [] and then use delete when you should have used delete [] temp; You make the same mistake with line above as well. Note that many platforms will still get it right but some platforms will choke on code like you have above. Using delete [] when you used new [] is always correct.

However, what is wrong with simply:

char temp[10];

// use temp here...

No new/delete at all, faster run time.

Alf
Try this instead - it now uses char* to store the data. The memory to hold the data is dynamically allocated and reallocated when more space is needed. I'm allocating twice the amount that is currently used, and reallocating at the end again (to not waste any memory):

#include <string>
#include <fstream>
#include <iostream>

const unsigned int initialDataSize = 1024;

using namespace std;

ifstream input;
string scaffold;
char * data = NULL;
unsigned int dataSize = 0;  // currently allocated for data
unsigned int validData = 0; // valid characters in data

bool GetNextPart(const char * _filePath)
{
  bool done = false;
  bool inRecord = false;

  if (!input.is_open())
  {
    input.open(_filePath);
  }

  while (!input.eof() && !done)
  {
    char buf[256];
    char beginOfLine;
    input >> beginOfLine;

    input.putback(beginOfLine);

    if (beginOfLine == '>')
    {
      inRecord = !inRecord;
      done = (!inRecord);


      if (inRecord)
      {
        input.getline(buf, 256, '\n');
        scaffold = buf;
        free(data);

        // initialize the data area
        data = (char *) malloc(initialDataSize);
        dataSize = initialDataSize;
        validData = 0;
      }
    }
    else
    {
      if (inRecord)
      {
        input.getline(buf, 256, '\n');
        int charCount = strlen(buf);

        if (validData + charCount > dataSize -1)
        {
          // need to allocate more space
          dataSize = (validData + charCount) * 2;
          data = (char *) realloc(data, dataSize);
          if (data == NULL)
          {
            // handle error
          }
        }
          // append the new line to the existing data
        strcpy(&(data[validData]), buf);
        validData += charCount;
      }
    }
  }

  // adjust the size of data
  data = (char *) realloc(data, validData + 1);

  cout << "scaffold: " << scaffold << endl;
  cout << "data: " << data << endl;
}

int main()
{
  GetNextPart("./protein.txt");
  GetNextPart("./protein.txt");
  GetNextPart("./protein.txt");

  input.close();

  return 0;
}

// end

You of course have to put this into your class again (and make sure, that you free the memory again once you are done with this record).
Avatar of allmer

ASKER

Thank you for your great answers.
I am very busy until tomorrow morning and cannot
try it out at the moment.
I will do it first thing tomorrow morning.
Have a nice day.
I gotta bake some cakes now ;-)

Avatar of allmer

ASKER

@Salte
this is intended as a function to fill up
scaffold with the header part of the database and code with the corresponding
data inbetween two headers or feof.
code and scaffold are two class variables both are used in this class and
also in a derived class.
The data read is always directly processed and the results if any are then stored
whereafter the next part of the database is loaded and then processed.
The class destructor takes care of deleting these values.
There is no need of storing the complete database in memory using a windows
system for the size of these textbased databases (FASTA) can have several 100 MB.
 
I am cosidering this for a version of my software I will be writing for
a server running linux as soon as this pc/windows version is done.

Running twice through the file to first get the positions actually seems more
efficient than any other thing I tried.
This makes it feasable to allocate just the right amount of buffer for code.
This is crucial for performance because the length of code varies greatly.
For about 7000 entries in the "database" the sizes range from 1 KB
to several hundred KB or even a couple MB. But I only now this specific "database"
Others may contain larger chunks of data.

I tried to use a fixed size buffer just bigger than the largest scaffold and ended up
with an outrageous slow runtime behaviour.
 
I tried the first function from khkremer which is as I now measured close to what
I experienced with my function but yet somewhat slower doesn't have the bugs, though ;-)

You are very right about new delete I didn't get a warning or error because
vstudio .net seems to be able to delete whatever was newed before.  
About not using new and delete I haven't tried, but will later today.

You are saying I should only read forward in a file, but why do all the file classes provide
a means of seeking within the file.
I would like these functions to work as I would assume they would. Isn't that one of the major
points in using classes anyway?
I am wondering why the functions tellg(), seekg() do not behave as I assume they should.
I am probably doing something wrong there so that could be a problem.
I will need random file access later on, too so understanding why it doesn't work would
be of great help.
With JAVA the RandomFileAccess class and the seeking and telling worked fine for me, so
resolving this matter would be great.

Thank you for the insights, Salte, some of your suggestions will definetly
go into source.


@khkremer
Wouldn't all this malloc/calloc/realloc ..or.. new/delete operations consume a lot of runtime?

The performance of the JAVA program (http://hippler.bio.upenn.edu/gpf/gpfeng) I designed
runs for hours if using a 100 MB database and a short query of about 10-20 chars.
I have to run through the whole database read in the strings do some string comparisions and other operations.
Of course JAVA is alot slower (100 times some say) when compared to c++.

So while translating from JAVA to c++ I would like to always have the function with
the best performance
I guess I am down to something like 10 min for a comparision like the above but there are
still functions to be optimized. The loding of the code is one, that is quite time consuming
something like 360ms for larger chunks of data (so called scaffolds). Whereas trimming the code and translating it to protein into six reading Frames takes something lik 90 ms.
This means I take three chars from code (code[x],x1,x2) translate that to one of 125 posibilities 3 ^ 5 (ACGTN) and then writing it into  char *readingFrame[6] where each char array would have a length of one third of the code charackter array.
The actual comprison of a strings (query) for exact match within these six readingFrame[]s clocks actually at 0ms. But will become measurable as soon as I add some more functions to it.

As soon as I have this program running on a biiiiger server I will of course load the complete
database(s) into memory and do the searching for a couple hours so the time for loading
would not matter at all in that case, but for this pc/windows version I do have to optimize
this loading routine.  

So what I would like to understand is why the function I wrote did not work for me.
It ,however, actually loads the code in something like 70ms (larger chunks) having some offset at the beginning and end of the individual parts but still loading about 99% of each scaffold.
For the speed matter, I would like to see it running correctly.

So where is the problem with the random access file approach (tellg(),seekg()).
Thank you so much for your answers so far.


Current version of the function:
bool CDna2Protein::GetNextPart(const char * _filePath)
{
     if(!databasein.is_open())                         //Declared as class variable so I do not have
                                                                 //to keep opening and closing a file.
                                                                 // private: ifstream databasein;
          databasein.open(_filePath,ios_base::in);
     char          ch,
                     test = '>';
     int             pos = 0;
     char          line[256];
     //Read in header for this part of the database
     databasein.getline(line,256,'\n');
     scaffold = new char[256];
     strcpy(scaffold,line);               //Have to copy to the class variable

     //Now get the current position in the file
     int before = databasein.tellg();    //This should get the first position within the code after
                                                     //the header for each section (>Header\nACCGT...)
                                                     //Should point to A for the next read.
     before--;                      //Does, however, seem to give the next position so account for that
     int after = 0;
     while(databasein.get(ch)) {
          if(ch == test) {    //test = '>';
               databasein.putback(ch);
                           //I would like the next read starting with the charackter I just putback.
               after = databasein.tellg();  //Should now point to charackter before >.
                                                      //>Header1\nACCAAT....AACGT\n>Header2....
                                                      //The next read should thus return '>'.
               after--;    //see above
               break;
          }
     }
     int len = after-before;
                //Knowing the size of the database section create a buffer to hold it
     code = new char[len+1];
     databasein.seekg(before);
     databasein.get(code,len,test);   //test='>';
     databasein.seekg(after);
                //should hopefully end up at desired position within file
     code[len]='\0';
                //Need to figure out how to test for termination.

     return(true);
     if((databasein.eof()) && ((data.GetLength() < MINSCAFFOLDSIZE)))
             return(false);
     else
             return(true);
}

Thank you for any further suggestions.

I'll look into your tellg()/seekg() problems later today.

I'm always allocating twice as much memory as is currently used, this cuts down on the number of reallocations, but you are right, this will have an impact on performance, but I think it will be less than reading the complete file twice. I don't have any hard numbers, but file io is a lot slower than memory operations, and we are talking a large amount of data that needs to be re-read, so OS caching will not help.

How's the cake? :-)
It took me a bit longer than I tought, but try this:

#include <string>
#include <fstream>
#include <iostream>

const unsigned int initialDataSize = 1024;

using namespace std;

ifstream databasein;
int fileSize;                  // put this into your class

char * scaffold;
char * data = NULL;
char * code = NULL;
unsigned int dataSize = 0;      // currently allocated for data
unsigned int validData = 0;      // valid characters in data


bool GetNextPart(const char * _filePath)
{
      if(!databasein.is_open())
      {
            databasein.open(_filePath,ios_base::in);
            int currentPos = databasein.tellg();
            databasein.seekg (0, ios::end);
            fileSize = databasein.tellg();
            databasein.seekg (currentPos);
      }

      char          ch,
      test = '>';
      int             pos = 0;
      char          line[256];


      databasein.getline(line,256,'\n');
      scaffold = new char[256];
      strcpy(scaffold,line);               //Have to copy to the class variable

      int before = databasein.tellg();    //This should get the first position within the code after

      int after = 0;
      while(!databasein.eof())
      {
            databasein.get(ch);
            if(ch == test) {    //test = '>';
                  databasein.putback(ch);
                  after = databasein.tellg();
                  after--;    //see above
                  break;
            }
      }

      int len;

      if (after == 0)      // this is the last data segment
      {
            // we hit the eof state, need to clear it
            databasein.clear();
            len = fileSize-before-2;
      }
      else
      {
            len = after-before + 1;
      }
      //Knowing the size of the database section create a buffer to hold it
      code = new char[len+1];
      databasein.seekg(before);
      databasein.read(code, len);
      code[len]='\0';

      if (after != 0)
      {
            databasein.seekg(after+1);
      }

cout << scaffold << endl;
cout << code << endl;

      return(true);
}


int main()
{
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");

      databasein.close();

      return 0;
}
Avatar of allmer

ASKER

Hello KHKremer,
Is the KH for Karl Heinz?
the cake was good we had some red wine along with that.
Pretty nice party, indeed. Thanks for asking.
Well the function still produces wierd results:
It should print out:
>scaffold_1
>scaffold_2
...
>scaffold_6000

what it actually outputs is:
>scaffold_1     //Just fine
CACTGT          //Should be >scaffold_2 The offset from >scaffold_2
ffold_125         //And many variations like that.
...
d_3000

The offset for scaffold 2 is 55 charackters:
>scaffold_2
AGTGAGGGGCACGTGGCATGCGTTGGTGTGCGTTGAACGGATGGCACTGT
CACTGT//This is than taken as scaffold
I am very sure that neither you nor me assigned these 55 extra charackters in any place ;-)
So where do they come from.
I haven't checked if that is a constant offset which wouldn't be too bad.
I will look into that on the morrow.
Cheers
Jens
Hi Jens, yes, KH stands for Karl Heinz.

I made up a small data file with three records, and that gets parsed without any problems. Could you make a (small) file available that contains a few data records (or tell me where I can download one).
Avatar of allmer

ASKER

Hi Karl Heinz,
I was away for the weekend. I chose to rather have a good time than wrestling with my source.
The dataset which I am using is located here:
http://hippler.bio.upenn.edu/gpf/downloads/chlre.fasta
the documentation on how to create and use these fasta files as far as I know about it can be found here:
http://hippler-data.bio.upenn.edu/gpf/gpfeng/files.shtml

The funny thing about it is that the code you wrote above should work, but doesn't.
I even think that I tried the code I wrote on a different machine in a Linux environement and
I am not sure but I think it worked.
The dataset has 100 MB so I think I will upload a differetn testset of around 1 or 2 MB later today.
It should be here:
http://hippler.bio.upenn.edu/gpf/downloads/test.fasta
Hope to solve this problem soon
and best regards,
Jens
ASKER CERTIFIED SOLUTION
Avatar of Karl Heinz Kremer
Karl Heinz Kremer
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of allmer

ASKER

Great job khkremer!
Sorry, that I did't have much time lately, so I couldn't respond
faster to changes in this thread. I hope the problem is resolved now
and I can turn to something else. The next question, an easy one this time,
will be up in a couple minutes.
If you like to try give it a shot ;-)
Anyway, thanks for your help and patience.
Sheers Jens
Jens,

let me guess... Wir haetten das ganze auch auf Deutsch machen konnen, oder? :-)