Question

ifstream get parts of a formatted textfile

Asked by: allmer

I am trying to extract two parts of unknown size from a database (text file).
It is composed as follows
>Title\n                      //The '>' precedes a section of the database and \n terminates this header
                                //That line thus contains the header or title for the following section
AACCCTAGCTAGCATCAGCACGACG\n
ACGACTAGCACTNACGGCGACCTCG\n
ACACGAGCTGCGCCATAGCAGCAGG\n
                               //Each line is terminated after some characters with a linefeed '\n'
                               //A section can be between a few and some hundred thousand characters long.
                               //Only ACGNT, however
>Next section\n
....
>Third section\n
...
>and so on for 100 MB\n
I would like to extract the header to a variable called scaffold
and the corresponding section of the database to a variable called code.
Both are char*.
I use the following approach, but it only works for the first section; thereafter it produces weird
results.
bool CDna2Protein::GetNextPart(const char * _filePath)
{
      if(!databasein.is_open())                         //Declared as class variable so I do not have
                                                                            //to keep opening and closing a file.
                                                                            //ifstream databasein;
            databasein.open(_filePath,ios_base::in);
      char          ch,
            test = '>';
      int pos = 0;
      char *line = new char[256];
      //Read in header for this part of the database
      databasein.getline(line,256,'\n');
      scaffold = new char[256];
      sprintf(scaffold,"%s",line);
      delete line;
      //Now get the current position in the file
      int before = databasein.tellg();
      before--; //Does seem to give the next position so account for that
      int after = 0;
      while(databasein.get(ch)) {
            if(ch == test) {    //test = '>';
                  databasein.putback(ch);
                           //I would like the next read starting with the character I just put back.
                  after = databasein.tellg();
                  after--;    //see above
                  break;
            }
      }
      int len = after-before;
                //Knowing the size of the database section create a buffer to hold it
      code = new char[len+1];
      databasein.seekg(before);
      databasein.get(code,len,test);   //test='>';
      databasein.seekg(after);
                //should hopefully end up at desired position within file
      code[len]='\0';

                //do some output for testing
      len = int(strlen(code));
      char *temp = new char[10];
      sprintf(temp,"%d",len);
      CString str = "Laenge von "+CString(scaffold)+": "+CString(temp);
      delete temp;
      StringToFile("c:\\gpfc++\\len.log",str);
      trim();
                //Need to figure out how to test for termination.
      return(true);
}
Besides the problem of wrong parts of the database posing as scaffold,
I also need to return false on reaching feof.
Thank you for any suggestions
Jens


Asked on: 2004-04-05 at 08:35:16 (ID 20943958)
Tags: get, ifstream
Topic: C++ Programming Language
Participating experts: 2
Points: 50
Comments: 15


Answers

 

by: khkremer, posted on 2004-04-05 at 11:19:16, ID: 10759727

I don't have the rest of your class, so I rewrote this as just a function; you should be able to retrofit it into your class. All member variables are globals in my example. I think I came up with a better way of handling this (no need to calculate the size of the buffer anymore; the string takes care of that):

#include <string>
#include <fstream>
#include <iostream>

using namespace std;

ifstream input;
string scaffold;
string data;

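bool GetNextPart(const char * _filePath)
{
      bool done = false;
      bool inRecord = false;

      if (!input.is_open())
      {
            input.open(_filePath);
      }

      string line;
      char beginOfLine;

      // operator>> skips whitespace and fails at end of file,
      // so test the read itself instead of input.eof()
      while (!done && input >> beginOfLine)
      {
            input.putback(beginOfLine);

            if (beginOfLine == '>')
            {
                  if (inRecord)
                  {
                        // next header reached: leave it in the stream for the next call
                        done = true;
                  }
                  else
                  {
                        inRecord = true;
                        getline(input, scaffold);   // header line, any length
                        data.erase();               // drop the previous record
                  }
            }
            else
            {
                  // getline into a string copes with lines of any length
                  getline(input, line);
                  if (inRecord)
                  {
                        data += line;
                  }
            }
      }

      if (!inRecord)
      {
            return false;     // no further record in the file
      }

      cout << "scaffold: " << scaffold << endl;
      cout << "data: " << data << endl;

      return true;
}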

int main()
{
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");

      input.close();

      return 0;
}

 

by: allmer, posted on 2004-04-05 at 12:41:10, ID: 10760420

Nice try, khkremer.
I tried doing it by adding up CStrings, too, but it turned out
to be much slower than the char* approach.
The process of loading, trimming and translating to protein ;-)
is about 10 times faster when not done with CString.
When I originally wrote this thing in Java it actually crashed all
the time when using operator += on strings.
Well, I would like to see the approach above working. I don't see
a flaw in it, so it really bugs me that it doesn't do what I want it
to.
Do tellg() and seekg() have any problems, or did I call the functions
at the wrong moment?


Another approach I could think of would be:
Readline(theHeader);
Use a function to get position of the following '>'
read anything in between into char *buffer.


Somehow the function crashed on reaching the end of the database.
bool CDna2Protein::GetNextPart(const char * _filePath, long _fP)
{
     bool done = false;
     bool inRecord = false;
      CString data;
     if (!databasein.is_open())
     {
          databasein.open(_filePath);
     }

     while (!databasein.eof() && !done)
     {
          char buf[256];
          char beginOfLine;
          databasein >> beginOfLine;

          databasein.putback(beginOfLine);

          if (beginOfLine == '>')
          {
               inRecord = !inRecord;
               done = (!inRecord);


               if (inRecord)
               {
                    databasein.getline(buf, 256, '\n');
          scaffold = new char[256];
                    sprintf(scaffold,"%s",buf);         //had to really copy the data.
                    //data.erase();       //Didn't know what this is good for.
               }
          }
          else
          {
               if (inRecord)
               {
                    databasein.getline(buf, 256, '\n');
                    data += buf;
               }
          }
     }
    code = data.GetBuffer();    //Conversion back to my character array
    trim();
  //Needs to return something
  //false when done and true otherwise
}

 

by: khkremer, posted on 2004-04-05 at 19:08:02, ID: 10762547

I'll take a second look tomorrow.

 

by: Salte, posted on 2004-04-06 at 00:57:26, ID: 10763786

For one thing, be careful with those new char[] things.....

Especially the

char * scaffold = new char[256];

Looks suspicious to me.

This allocates a char array with room for 256 characters. However, if you want to store a C-style string in it you must leave room for the NUL char at the end, so there is really only room for 255 characters.

Another thing that looks out of place is the sprintf there. What you are really doing is a simple strcpy, and I don't know why you bother to allocate the line buffer in the first place. sprintf(dest, "%s", source) has the same effect as strcpy(dest, source), so use strcpy and not sprintf in this situation.

Also, you seem to be doing things slightly inefficiently, especially if these tend to be long buffers, since you essentially read the file twice: once to find the length and then once again to actually read the data.

I suggest:

1. Have ONE buffer or string to hold the current data read. This buffer should grow but never shrink. Put the code in that buffer and once you get the end, copy the code from that buffer to a new string buffer and re-use the same buffer for next code/scaffold. Always read the file forward, never go back and read again. :-)

2. What happens to scaffold and code later? I never see them deleted in your code so I assume they are placed in some mapping or collection unless you read only one.

3. To return false upon eof or error is easy, just have your function return a reference to the ifstream you are reading. Oh, wait, you open the file in your function. Why would you want to return true/false depending on eof when you open the file in the function? Usually that is only done if you are looping on reading the same file many times and your code does not do that.

The "standard" way to read an istream and return eof/error/etc indication is this:

istream & read_data(istream & is, destination_type & dest)
{
      // read data from is and place result in dest.
      return is;
}

Then you can do something like this in your program:

void read_all_data(const char * filename, destination_type & dest)
{
    dest.clear(); // only if you want to clear it before reading the file.
    ifstream file(filename);
    while (read_data(file,dest))
        ;
}

Now, this assumes that destination_type is a collection which can hold an "infinite" amount of data and that read_data will push the next set. Alternatively, read_data reads only one set and the destination type of read_all_data is a collection of those data:

void read_all_data(const char * filename, destination_collection_type & dest)
{
    dest.clear(); // only if you want to clear it before reading the file.
    ifstream file(filename);
    destination_type elem;

    while (read_data(file,elem))
        dest.store(elem);
}

is the way to go if read_data handles a single element while read_all_data handles a collection type.

Of course, instead of read_data you might want to contemplate using overload of operator >>:

istream & operator >> (istream & is, destination_type & dest);

The code is the same but caller can now use:

while (file >> elem)
    dest.store(elem);

instead. This is pure syntax, the code is essentially the same.

Of course, if you have two different types for the single element and the collection of elements you can also overload operator >> for the collection as well.

The single element destination type should then simply be something that can hold a single code no matter how long and possibly also the scaffold - or is that scaffold shared among several codes?

If there is one scaffold for each code then simply make a pair. You can then either use the STL pair type or make a separate class and give it a reasonable name; code_and_scaffold doesn't quite cut it. I can see that this has to do with genetics, but I am not that much into genetics that I can tell exactly what you would name such a thing.

If the scaffold is shared among several codes you should have a vector or list of codes associated with each scaffold title, if so you might want to use:

class scaffold {
private:
    string M_title;
    vector<string> M_codes;
public:
    .....
};

istream & operator >> (istream & is, scaffold & s);

Something like that. If you declare the operator as friend it has access to the members M_title and M_codes.

Also, I think you make heavy use of the allocator:

char * temp = new char[10];

// use temp here...

delete temp;

This is both wrong and bad. It is wrong because you use new [] and then use delete when you should have used delete [] temp. You make the same mistake with line above as well. Note that many platforms will still get it right, but some platforms will choke on code like you have above. Using delete [] when you used new [] is always correct.

However, what is wrong with simply:

char temp[10];

// use temp here...

No new/delete at all, faster run time.
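
As an illustration of the stack-buffer variant applied to the length logging in the question, a small sketch follows. lengthToText is a hypothetical helper name, and std::snprintf (C++11) is swapped in for sprintf so that the write can never run past the buffer.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a length as text using a stack buffer,
// with no new/delete involved.
std::string lengthToText(int len)
{
    char temp[16];                                // ample room for a 32-bit int plus NUL
    std::snprintf(temp, sizeof temp, "%d", len);  // bounded by the buffer size
    return std::string(temp);
}
```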

Alf

 

by: khkremer, posted on 2004-04-06 at 05:33:12, ID: 10765182

Try this instead: it now uses char* to store the data. The memory to hold the data is dynamically allocated and reallocated when more space is needed. When the buffer is full I allocate twice the amount that is currently used, and I reallocate once more at the end to trim the buffer (so no memory is wasted):

#include <cstdlib>
#include <cstring>
#include <string>
#include <fstream>
#include <iostream>

const unsigned int initialDataSize = 1024;

using namespace std;

ifstream input;
string scaffold;
char * data = NULL;
unsigned int dataSize = 0;  // currently allocated for data
unsigned int validData = 0; // valid characters in data

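bool GetNextPart(const char * _filePath)
{
  bool done = false;
  bool inRecord = false;
  char buf[256];
  char beginOfLine;

  if (!input.is_open())
  {
    input.open(_filePath);
  }

  // operator>> skips whitespace and fails at end of file,
  // so test the read itself instead of input.eof()
  while (!done && input >> beginOfLine)
  {
    input.putback(beginOfLine);

    if (beginOfLine == '>' && inRecord)
    {
      // next header reached: leave it in the stream for the next call
      done = true;
    }
    else if (beginOfLine == '>')
    {
      inRecord = true;
      input.getline(buf, 256, '\n');
      scaffold = buf;
      free(data);

      // initialize the data area
      data = (char *) malloc(initialDataSize);
      if (data == NULL)
      {
        return false;   // out of memory
      }
      data[0] = '\0';
      dataSize = initialDataSize;
      validData = 0;
    }
    else
    {
      input.getline(buf, 256, '\n');
      if (input.fail() && !input.eof())
      {
        // the line is longer than buf: clear the error,
        // the rest of the line is read on the next pass
        input.clear();
      }

      if (inRecord)
      {
        unsigned int charCount = (unsigned int) strlen(buf);

        if (validData + charCount > dataSize - 1)
        {
          // need to allocate more space
          unsigned int newSize = (validData + charCount) * 2;
          char * bigger = (char *) realloc(data, newSize);
          if (bigger == NULL)
          {
            return false;   // out of memory; data is still valid
          }
          data = bigger;
          dataSize = newSize;
        }
        // append the new line to the existing data
        strcpy(&(data[validData]), buf);
        validData += charCount;
      }
    }
  }

  if (!inRecord)
  {
    return false;   // no further record in the file
  }

  // adjust the size of data
  char * exact = (char *) realloc(data, validData + 1);
  if (exact != NULL)
  {
    data = exact;
  }

  cout << "scaffold: " << scaffold << endl;
  cout << "data: " << data << endl;

  return true;
}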

int main()
{
  GetNextPart("./protein.txt");
  GetNextPart("./protein.txt");
  GetNextPart("./protein.txt");

  input.close();

  return 0;
}

// end

You of course have to put this into your class again (and make sure that you free the memory once you are done with this record).

 

by: allmer, posted on 2004-04-06 at 14:11:52, ID: 10769804

Thank you for your great answers.
I am very busy until tomorrow morning and cannot
try it out at the moment.
I will do it first thing tomorrow morning.
Have a nice day.
I gotta bake some cakes now ;-)

 

by: allmer, posted on 2004-04-07 at 10:16:42, ID: 10776634

@Salte
This is intended as a function to fill
scaffold with the header part of the database and code with the corresponding
data in between two headers (or up to the end of the file).
code and scaffold are two class variables; both are used in this class and
also in a derived class.
The data read is always processed directly and the results, if any, are then stored,
whereafter the next part of the database is loaded and processed.
The class destructor takes care of deleting these values.
There is no need to store the complete database in memory on a Windows
system, because these text-based databases (FASTA) can be several hundred MB in size.

I am considering this for a version of my software I will be writing for
a server running Linux as soon as this PC/Windows version is done.

Running twice through the file to first get the positions actually seems more
efficient than anything else I tried.
This makes it feasible to allocate just the right amount of buffer for code.
This is crucial for performance because the length of code varies greatly.
For about 7000 entries in the "database" the sizes range from 1 KB
to several hundred KB or even a couple of MB. But I only know this specific "database".
Others may contain larger chunks of data.

I tried to use a fixed-size buffer just bigger than the largest scaffold and ended up
with outrageously slow runtime behaviour.

I tried the first function from khkremer which, as I have now measured, is close to what
I experienced with my function, yet somewhat slower. It doesn't have the bugs, though ;-)

You are very right about new and delete. I didn't get a warning or error because
VStudio .NET seems to be able to delete whatever was newed before.
About not using new and delete: I haven't tried it, but will later today.

You are saying I should only read forward in a file, but why do all the file classes provide
a means of seeking within the file?
I would like these functions to work as I would assume they would. Isn't that one of the major
points in using classes anyway?
I am wondering why the functions tellg() and seekg() do not behave as I assume they should.
I am probably doing something wrong there, so that could be the problem.
I will need random file access later on, too, so understanding why it doesn't work would
be of great help.
With Java, the RandomAccessFile class and the seeking and telling worked fine for me, so
resolving this matter would be great.

Thank you for the insights, Salte; some of your suggestions will definitely
go into the source.


@khkremer
Wouldn't all these malloc/calloc/realloc or new/delete operations consume a lot of runtime?

The Java program (http://hippler.bio.upenn.edu/gpf/gpfeng) I designed
runs for hours when using a 100 MB database and a short query of about 10-20 chars.
I have to run through the whole database, read in the strings, and do some string comparisons and other operations.
Of course Java is a lot slower (100 times, some say) when compared to C++.

So while translating from Java to C++ I would like to always have the function with
the best performance.
I guess I am down to something like 10 min for a comparison like the above, but there are
still functions to be optimized. The loading of the code is one that is quite time consuming,
something like 360 ms for larger chunks of data (so-called scaffolds), whereas trimming the code and translating it to protein in six reading frames takes something like 90 ms.
This means I take three chars from code (code[x], x+1, x+2), translate them to one of 125 possibilities, 5^3 (ACGTN), and then write the result into char *readingFrame[6], where each char array has a length of one third of the code character array.
The actual comparison of a string (query) for an exact match within these six readingFrame[]s actually clocks in at 0 ms, but will become measurable as soon as I add some more functions to it.
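
The 125 possibilities come from three positions with five letters each (5 x 5 x 5). A minimal sketch of how such a triplet could be turned into a table index is shown below; baseIndex, codonIndex and the ordering A, C, G, T, N are assumptions made for illustration, and the actual amino-acid table is not part of this thread.

```cpp
// Maps a base to 0..4; returns -1 for anything outside ACGTN.
// The ordering A, C, G, T, N is an arbitrary choice for this sketch.
int baseIndex(char base)
{
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        case 'N': return 4;
        default:  return -1;
    }
}

// Index of a three-letter codon in a 125-entry lookup table,
// or -1 if any of the three characters is not one of ACGTN.
int codonIndex(const char* codon)
{
    const int a = baseIndex(codon[0]);
    const int b = baseIndex(codon[1]);
    const int c = baseIndex(codon[2]);
    if (a < 0 || b < 0 || c < 0)
        return -1;
    return (a * 5 + b) * 5 + c;     // base-5 number with three digits
}
```

A translation step would then be a single array lookup per triplet, for example `table[codonIndex(code + x)]` with a hypothetical 125-entry table.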

As soon as I have this program running on a biiiiger server I will of course load the complete
database(s) into memory and do the searching for a couple of hours, so the time for loading
would not matter at all in that case. But for this PC/Windows version I do have to optimize
this loading routine.

So what I would like to understand is why the function I wrote did not work for me.
It does, however, load the code in something like 70 ms (larger chunks), with some offset at the beginning and end of the individual parts, but still loading about 99% of each scaffold.
For the sake of speed, I would like to see it running correctly.

So where is the problem with the random access file approach (tellg(), seekg())?
Thank you so much for your answers so far.


Current version of the function:
bool CDna2Protein::GetNextPart(const char * _filePath)
{
     if(!databasein.is_open())                         //Declared as class variable so I do not have
                                                                 //to keep opening and closing a file.
                                                                 // private: ifstream databasein;
          databasein.open(_filePath,ios_base::in);
     char          ch,
                     test = '>';
     int             pos = 0;
     char          line[256];
     //Read in header for this part of the database
     databasein.getline(line,256,'\n');
     scaffold = new char[256];
     strcpy(scaffold,line);               //Have to copy to the class variable

     //Now get the current position in the file
     int before = databasein.tellg();    //This should get the first position within the code after
                                                     //the header for each section (>Header\nACCGT...)
                                                     //Should point to A for the next read.
     before--;                      //Does, however, seem to give the next position so account for that
     int after = 0;
     while(databasein.get(ch)) {
          if(ch == test) {    //test = '>';
               databasein.putback(ch);
                           //I would like the next read starting with the character I just put back.
               after = databasein.tellg();  //Should now point to character before >.
                                                      //>Header1\nACCAAT....AACGT\n>Header2....
                                                      //The next read should thus return '>'.
               after--;    //see above
               break;
          }
     }
     int len = after-before;
                //Knowing the size of the database section create a buffer to hold it
     code = new char[len+1];
     databasein.seekg(before);
     databasein.get(code,len,test);   //test='>';
     databasein.seekg(after);
                //should hopefully end up at desired position within file
     code[len]='\0';
                //Need to figure out how to test for termination.

     return(true);
     if((databasein.eof()) && ((data.GetLength() < MINSCAFFOLDSIZE)))
             return(false);
     else
             return(true);
}

Thank you for any further suggestions.

 

by: khkremer, posted on 2004-04-07 at 11:12:13, ID: 10777055

I'll look into your tellg()/seekg() problems later today.

I'm always allocating twice as much memory as is currently used, which cuts down on the number of reallocations. You are right that this will have an impact on performance, but I think it will be less than that of reading the complete file twice. I don't have any hard numbers, but file I/O is a lot slower than memory operations, and we are talking about a large amount of data that needs to be re-read, so OS caching will not help.
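
To put a rough number on "cuts down on the number of reallocations", here is a simplified model, not a measurement: it only counts how often a buffer has to grow to reach a target size. Both function names are hypothetical, and the doubling rule is a simplification of the (validData + charCount) * 2 rule in the code above.

```cpp
#include <cstddef>

// Number of reallocations needed to reach `target` bytes when the buffer
// starts at `initial` bytes (must be > 0) and doubles whenever it is full.
std::size_t reallocationsWhenDoubling(std::size_t initial, std::size_t target)
{
    std::size_t count = 0;
    for (std::size_t size = initial; size < target; size *= 2)
        ++count;
    return count;
}

// The same count when the buffer grows by a fixed `step` (> 0) bytes each time.
std::size_t reallocationsWithFixedStep(std::size_t initial, std::size_t step,
                                       std::size_t target)
{
    if (target <= initial)
        return 0;
    return (target - initial + step - 1) / step;   // rounded up
}
```

Starting at 1024 bytes, a 1 MB section needs 10 doublings, whereas growing by one 256-byte line buffer at a time needs 4092 reallocations.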

How's the cake? :-)

 

by: khkremer, posted on 2004-04-08 at 05:41:15, ID: 10782883

It took me a bit longer than I thought, but try this:

#include <cstring>
#include <string>
#include <fstream>
#include <iostream>

const unsigned int initialDataSize = 1024;

using namespace std;

ifstream databasein;
int fileSize;                  // put this into your class

char * scaffold;
char * data = NULL;
char * code = NULL;
unsigned int dataSize = 0;      // currently allocated for data
unsigned int validData = 0;      // valid characters in data


bool GetNextPart(const char * _filePath)
{
      if(!databasein.is_open())
      {
            databasein.open(_filePath,ios_base::in);
            int currentPos = databasein.tellg();
            databasein.seekg (0, ios::end);
            fileSize = databasein.tellg();
            databasein.seekg (currentPos);
      }

      char          ch,
      test = '>';
      int             pos = 0;
      char          line[256];


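      if (!databasein.getline(line,256,'\n'))
      {
            return(false);                 //No further header: end of file or read error
      }
      delete [] scaffold;                  //Free the buffer from the previous call (must start out as NULL)
      scaffold = new char[256];
      strcpy(scaffold,line);               //Have to copy to the class variable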

      int before = databasein.tellg();    //This should get the first position within the code after

      int after = 0;
      while(!databasein.eof())
      {
            databasein.get(ch);
            if(ch == test) {    //test = '>';
                  databasein.putback(ch);
                  after = databasein.tellg();
                  after--;    //see above
                  break;
            }
      }

      int len;

      if (after == 0)      // this is the last data segment
      {
            // we hit the eof state, need to clear it
            databasein.clear();
            len = fileSize-before-2;
      }
      else
      {
            len = after-before + 1;
      }
      //Knowing the size of the database section create a buffer to hold it
      code = new char[len+1];
      databasein.seekg(before);
      databasein.read(code, len);
      code[len]='\0';

      if (after != 0)
      {
            databasein.seekg(after+1);
      }

cout << scaffold << endl;
cout << code << endl;

      return(true);
}


int main()
{
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");
      GetNextPart("./protein.txt");

      databasein.close();

      return 0;
}

 

by: allmer Posted on 2004-04-08 at 13:03:33 ID: 10786710

Hello KHKremer,
Is the KH for Karl Heinz?
The cake was good; we had some red wine along with it.
Pretty nice party, indeed. Thanks for asking.
Well, the function still produces weird results.
It should print out:
>scaffold_1
>scaffold_2
...
>scaffold_6000

what it actually outputs is:
>scaffold_1     //Just fine
CACTGT          //Should be >scaffold_2 The offset from >scaffold_2
ffold_125         //And many variations like that.
...
d_3000

The offset for scaffold 2 is 55 characters:
>scaffold_2
AGTGAGGGGCACGTGGCATGCGTTGGTGTGCGTTGAACGGATGGCACTGT
CACTGT//This is then taken as scaffold
I am very sure that neither you nor I assigned these 55 extra characters anywhere ;-)
So where do they come from?
I haven't checked whether that is a constant offset, which wouldn't be too bad.
I will look into that on the morrow.
Cheers
Jens

 

by: khkremer Posted on 2004-04-08 at 13:34:12 ID: 10786938

Hi Jens, yes, KH stands for Karl Heinz.

I made up a small data file with three records, and that gets parsed without any problems. Could you make a (small) file available that contains a few data records (or tell me where I can download one)?

 

by: allmer Posted on 2004-04-13 at 12:29:37 ID: 10816638

Hi Karl Heinz,
I was away for the weekend. I chose to have a good time rather than wrestle with my source.
The dataset which I am using is located here:
http://hippler.bio.upenn.edu/gpf/downloads/chlre.fasta
The documentation on how to create and use these fasta files, as far as I know about it, can be found here:
http://hippler-data.bio.upenn.edu/gpf/gpfeng/files.shtml

The funny thing about it is that the code you wrote above should work, but doesn't.
I even think that I tried the code I wrote on a different machine in a Linux environment, and
I am not sure, but I think it worked.
The dataset is 100 MB, so I think I will upload a different test set of around 1 or 2 MB later today.
It should be here:
http://hippler.bio.upenn.edu/gpf/downloads/test.fasta
Hope to solve this problem soon
and best regards,
Jens

 

by: khkremer Posted on 2004-04-13 at 13:28:59 ID: 10817211

I noticed that there was a problem with recognizing the end of the file, so I made one small adjustment. GetNextPart() now returns false when the EOF is encountered.

#include <cstring>      // strcpy
#include <string>
#include <fstream>
#include <iostream>
const unsigned int initialDataSize = 1024;
using namespace std;
ifstream databasein;
int fileSize;                  // put this into your class
char *scaffold;
char *data = NULL;
char *code = NULL;
unsigned int dataSize = 0;      // currently allocated for data
unsigned int validData = 0;      // valid characters in data
bool
GetNextPart (const char *_filePath)
{
  bool retVal = true;
  if (!databasein.is_open ())
    {
      databasein.open (_filePath, ios_base::in);
      int currentPos = databasein.tellg ();
      databasein.seekg (0, ios::end);
      fileSize = databasein.tellg ();
      databasein.seekg (currentPos);
    }
  char ch, test = '>';
  int pos = 0;
  char line[256];
  databasein.getline (line, 256, '\n');
  scaffold = new char[256];
  strcpy (scaffold, line);      //Have to copy to the class variable
  int before = databasein.tellg ();      //This should get the first position within the code after
  int after = 0;
  while (databasein.get (ch))      // stops as soon as a read fails (EOF)
    {
      if (ch == test)
      {                  //test = '>';
        databasein.putback (ch);
        after = databasein.tellg ();
        after--;            //see above
        break;
      }
    }
  int len;
  if (after == 0)            // this is the last data segment
    {
      // we hit the eof state, need to clear it
      databasein.clear ();
      len = fileSize - before - 2;
      retVal = false;
    }
  else
    {
      len = after - before + 1;
    }
  //Knowing the size of the database section create a buffer to hold it
  code = new char[len + 1];
  databasein.seekg (before);
  databasein.read (code, len);
  code[len] = '\0';
  if (after != 0)
    {
      databasein.seekg (after + 1);
    }
  cout << scaffold << endl;
  cout << code << endl;
  return (retVal);
}

int
main (int argc, char * argv[])
{
  if (argc != 2)
     return -1;
  while (GetNextPart (argv[1]))
  {
  }
  databasein.close ();
  return 0;
}


It works correctly with the smaller file you just posted. I am running this on a Linux system (SuSE 8.2 to be exact).
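
For comparison, here is a minimal sketch of a different approach that avoids tellg()/seekg() arithmetic altogether by reading line by line. That may matter on some platforms, since offsets on a text-mode stream are only guaranteed to be usable for seeking when they come straight from tellg(). This is an illustration and not the code used in this thread: readNextRecord and stripCarriageReturn are made-up names, std::string replaces the char* buffers, and the sequence lines are joined without line breaks, whereas the listing above keeps them.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on in-memory text
#include <string>

// Hypothetical helper (not from the thread): drop a trailing '\r' so that
// files with Windows line endings are handled as well.
static void stripCarriageReturn(std::string& line)
{
    if (!line.empty() && line[line.size() - 1] == '\r')
        line.erase(line.size() - 1);
}

// Hypothetical helper (not from the thread): read one record from `in`.
// scaffold receives the header line (including the leading '>'),
// code receives all sequence lines up to the next header, concatenated.
// Returns false when no further record is available.
bool readNextRecord(std::istream& in, std::string& scaffold, std::string& code)
{
    scaffold.clear();
    code.clear();

    // Skip blank lines until a non-empty line is found.
    std::string line;
    while (std::getline(in, line)) {
        stripCarriageReturn(line);
        if (!line.empty())
            break;
    }
    if (!in || line.empty() || line[0] != '>')
        return false;               // end of input, or not a header line
    scaffold = line;
    assert(!scaffold.empty());      // a record always carries a header

    // Collect sequence lines until the next header or the end of the input.
    while (in.peek() != '>' && std::getline(in, line)) {
        stripCarriageReturn(line);
        code += line;
    }
    return true;
}
```

A caller would open a std::ifstream once and loop with while (readNextRecord(file, scaffold, code)) { ... }; the strings manage their own memory, so nothing has to be freed between records.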

 

by: allmer Posted on 2004-04-16 at 11:53:31 ID: 10845124

Great job khkremer!
Sorry that I didn't have much time lately, so I couldn't respond
faster to changes in this thread. I hope the problem is resolved now
and I can turn to something else. The next question, an easy one this time,
will be up in a couple of minutes.
If you'd like to try, give it a shot ;-)
Anyway, thanks for your help and patience.
Cheers, Jens

 

by: khkremer Posted on 2004-04-16 at 15:29:50 ID: 10846547

Jens,

let me guess... we could have done this whole thing in German as well, couldn't we? :-)
