Solved

Text File Read Question

Posted on 2013-05-17
Medium Priority
507 Views
Last Modified: 2013-05-23
Hello,
I have a huge text file with elements like the following. (This is an example only and NOT the real file.) Note that in each item, the first half before the underscore is the major item name and the other half is the minor description.

item1_data1
item1_data2
item1_data3
.............
item2_data1
item2_data2
item2_data3
.............
item3_data1
item3_data2
item3_data3
............

There are about 75000 items in this file.

I am writing a C++ class to pick up only the major name of each item, i.e. my result from the above huge file should be:

item1
item2
item3
.......

I know there are tons of techniques out there. What is the really efficient method I should use so that my result is produced in nanoseconds? :-) (Seriously, efficiency is extremely important for me.)
Question by:prain
3 Comments
 
LVL 86

Expert Comment

by:jkr
ID: 39175364
You'll have to read the entire file anyway, so there's not much room for improvement. The simplest way I can imagine would be to

#include <fstream>
#include <string>
#include <list>

using namespace std;

//...

string line;
size_t pos;
list<string> items;
ifstream is("file.txt");

if (!is.is_open()) {

  // error, no such file
}

while (getline(is,line)) {

  if (string::npos == (pos = line.find('_'))) {

    // error, malformed line w/o underscore
    continue;
  }

  items.push_back(line.substr(0,pos));
}


 
LVL 35

Accepted Solution

by:
sarabande earned 800 total points
ID: 39185173
the nanoseconds is not realistic. even with a fast ssd or flash storage, each access to a file not currently in the cache needs milliseconds to position at file-begin, read all its blocks into memory, manage the cache, handle the overhead of the filesystem, schedule the thread, and so on. if the file is not stored contiguously, all these times must be multiplied by the number of file fragments. the same applies to debug mode, which also adds extra time.

generally, if you read a file into memory in one piece in binary mode, you can normally halve the reading time for a file of significant size.

#include <sys/stat.h>
#include <fstream>
#include <string>
...
struct stat filestatus = { 0 };
if (stat(szfilepath, &filestatus) == 0)
{
     std::ifstream file(szfilepath, std::ios::binary | std::ios::in);
     if (file)
     {
           std::string buf(filestatus.st_size, '\0');
           if (file.read(&buf[0], filestatus.st_size))
           {
                  std::string crlf = "\r\n";
                  std::string line;
                  buf += crlf; // add carriagereturn-linefeed for easier parsing
                  size_t pos, lpos = 0;
                  while ((pos = buf.find(crlf, lpos)) != std::string::npos)
                  {
                      if (pos > lpos)
                      {
                           line = buf.substr(lpos, pos - lpos);
                           // here you have one line extracted
                           ...
                      }
                      lpos = pos + crlf.length();
                  }
           }
     }
}



the above code needs contiguous memory of 75,000 times (average line length + 2) bytes. that can be a problem on a low-memory or busy system. if that could be a problem, you might think of reading the file in - say - 64k chunks.
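A minimal sketch of that chunked approach, assuming '\n' (or "\r\n") line endings; the 64k buffer size, file name, and function name are illustrative only. The key detail is carrying the unfinished last line of each chunk over to the next chunk so lines straddling a chunk boundary are not lost:

```cpp
#include <fstream>
#include <set>
#include <string>

// Read the file in fixed-size chunks, carrying any partial line over to
// the next chunk, and collect the part before '_' of each line.
std::set<std::string> collect_major_names(const char* path)
{
    std::set<std::string> items;
    std::ifstream file(path, std::ios::binary | std::ios::in);
    if (!file)
        return items;                      // no such file

    const size_t chunk_size = 64 * 1024;   // 64k chunks as suggested above
    std::string carry;                     // unfinished line from last chunk
    std::string chunk(chunk_size, '\0');

    while (file.read(&chunk[0], chunk_size) || file.gcount() > 0)
    {
        carry.append(chunk, 0, static_cast<size_t>(file.gcount()));
        size_t lpos = 0, pos;
        while ((pos = carry.find('\n', lpos)) != std::string::npos)
        {
            std::string line = carry.substr(lpos, pos - lpos);
            if (!line.empty() && line[line.size() - 1] == '\r')
                line.erase(line.size() - 1);          // tolerate CRLF
            size_t undl = line.find('_');
            if (undl != std::string::npos)
                items.insert(line.substr(0, undl));
            lpos = pos + 1;
        }
        carry.erase(0, lpos);              // keep the partial last line
    }
    // handle a final line without a trailing newline
    size_t undl = carry.find('_');
    if (undl != std::string::npos)
        items.insert(carry.substr(0, undl));
    return items;
}
```

This trades the one big allocation for a fixed 64k working buffer, at the cost of a little extra copying into `carry`.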

 
At the ... you could extract the first part of the line and add it to a std::set<std::string> container. A set adds a new string only if it is not already in the set.

std::set<std::string> items;
...
       size_t undl = line.find('_');
       if (undl != std::string::npos)
       {
           line.resize(undl);   // truncate line string
           items.insert(line);
       }


                   

finally, you can iterate the set to get all entries (sorted alphabetically).

std::set<std::string>::iterator i;
for (i = items.begin(); i != items.end(); ++i)
{
      const std::string & item = *i;  // set elements are const; use a const reference
      // ... use item ...
}


   

note, a std::set or a std::map handles duplicates; a std::list or std::vector does not. if using one of the latter, you would need to sort after filling and remove or skip duplicates after sorting. that method can be faster if there are only few duplicates.
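The sort-then-deduplicate alternative can be sketched like this (function and variable names are illustrative); std::unique moves the duplicates to the end and the erase call drops them:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort the collected names, then strip adjacent duplicates in place.
// With few duplicates this can beat inserting into a std::set, because
// the vector is filled with cheap push_backs and sorted only once.
void deduplicate(std::vector<std::string>& items)
{
    std::sort(items.begin(), items.end());
    items.erase(std::unique(items.begin(), items.end()), items.end());
}
```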

Sara
 

Author Closing Comment

by:prain
ID: 39192341
Thanks
