Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Read a large file line by line into a map

Posted on 2012-04-03
7
Medium Priority
?
890 Views
Last Modified: 2012-06-21
Hi

I have a large file, that cannot be loaded into the memory all at once.
Say the file has the following format

A   1
A   2
B   2
A   4
B   4
A   8
B   8
A   4
I want to read the file line by line, and dynamically create a "map?" in C++ such that

map user, count

where user = A, B ....
Count = a list of sets? (not sure what the best data structure is) to hold
 {{1,2,4} {2,4,8} {4,8,4} }
basically the flow is

1) read file line by line
2) initialize the key of some map by the first column of first row (user) example A
3) take the value of this and add it into a list? so it would be {{1}}
4) look at the next line.The map would now contain

map ={
key:A value{{1,2}}
}


5) look at third line
map={
key=A value :{{1,2}}
key =B value :{{2}}
}

6) 4th line

map={
key=A value :{{1,2,4}}
key =B value :{{2}}
}


Now that there are 3 elements are reached {1,2,4} for A, i want to create a new map with this triplet as the key

map2 = {
key = {1,2,4} and value = 1}

5th line =

map={
key=A value :{{1,2,4}}
key =B value :{{2,9}}
}

and 6th line we find the 4th value for A so

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4}}
}

and also create a new entry in map2 for {2,4,8}

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 1
}
for the next line

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4, 8}}
}

so we increment map2

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 2
}

After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

how do i do this in c++, is reading tab delimited entries simple?
0
Comment
Question by:Vlearns
7 Comments
 

Author Comment

by:Vlearns
ID: 37804324
i want to use ifstream
0
 
LVL 53

Accepted Solution

by:
Infinity08 earned 668 total points
ID: 37804833
>> how do i do this in c++, is reading tab delimited entries simple?

It's not hard ;)

You can use getline to read one line of input from the file (and doing so in a loop until the end of the file is reached will read all lines).

        http://www.cplusplus.com/reference/string/getline/

That will give you the line in a std::string, which you can then parse to get whatever you need from it.

If a line contains two tab-separated items, you could either use a substr-based approach (look for the tab, and then take two substrings - one left and one right of the tab),

        http://www.cplusplus.com/reference/string/string/find/
        http://www.cplusplus.com/reference/string/string/find_first_of/
        http://www.cplusplus.com/reference/string/string/substr/

or use a stringstream, and simply "stream" the two items out of the string.

        http://www.cplusplus.com/reference/iostream/istream/operator%3E%3E/
0
 
LVL 35

Expert Comment

by:sarabande
ID: 37805278
for the container you could use std::map<std::string, std::set<int> >.

if you have parsed key and value from line you would add that pair by

mymap[key].insert(value);

Open in new window


the mymap[key] would return a reference to a set<int> which is either empty when the key was new or was the set already stored to the key. the insert would add the value to the set unless it already contains it.

Sara
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 
LVL 35

Expert Comment

by:sarabande
ID: 37805668
if the file is too large for loading it into memory, it also could give memory problems with the map unless there are many duplicates for keys or for key-value pairs.

did you try to load the whole text file into memory? if yes, did you use one big char buffer for all or did you use a dynamic container like std::vector<std::string> where you added line by line? for the second container you should be able to allocate much more memory cause it allocates memory in (small) chunks and doesn't need one huge contiguous free memory space.

Sara
0
 
LVL 22

Expert Comment

by:ambience
ID: 37806638
>> After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

I dont think you need a map for that, that is just a simple search for the maximum and given the nature of your algorithm you should be able to do it easily.

1) I suggest to use std::tuple<> if your STL supports it. That would save you from have to write a lot of basic operators. See the example here: http://msdn.microsoft.com/en-us/library/bb982837.aspx

2) Looking at the data and your approach - the way I understood the problem

A   1
A   2
A   4
A   8
A   4

You want to find the max contiguous triplet 4,8,4 from the sequence. This should be done this way

// pseudocode

typedef std::tuple<int, int, int> triplet_t;

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
      triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
      if( temp > tmax)
         tmax = temp;
}

Of course this is for just one user 'A', and you'd need a map but just one map.
0
 
LVL 22

Assisted Solution

by:ambience
ambience earned 664 total points
ID: 37806674
I think I missed one point. You havent defined the comparison rules for triplets, like which is greater 4,8,4 or 8,1,1. But - then all you need is define the comparison operators like

bool operator<(const triplet_t& t1, const triplet_t& t2) {
 ...
}

Just in case, if theres a conflict with std::tuple then just define a replica or even do it manually like

int compare(const triplet_t& t1, const triplet_t& t2) {
 ... return -1 if <, 0 for equal and 1 for t1 > t2
}

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
      triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
      if( compare(temp, tmax) < 0 )
         tmax = temp;
}
0
 
LVL 35

Assisted Solution

by:sarabande
sarabande earned 668 total points
ID: 37808080
sorry for my wrong suggestion using std::set. i didn't understand the requirements.

i think now the triplet is the key and the required result is the triplet with the maximum count for different keys.

if so, it could not be made with one map if the same triple sequence should not be counted twice for a key of the first map.

i would suggest std::map<std::string, std::string > for map1 and std::map<std::string, int> for map2 . don't think that we need a vector of triples for first map cause by storing the sequence as comma separated list of integers we could use std::find to check whether the next triple for a key already had a count.

for the sample sequence we would have

1) map1:  "A"   ",1,"
2) map1:  "A"   ",1,2,"
3) map1:  "A"   ",1,2,"
                 "B"   ",2,"
4) map1:  "A"   ",1,2,4,"
                 "B"   ",2,"
    map2:  "1,2,4"    1
5) map1:  "A"   ",1,2,4,"
                 "B"   ",2,4,"
    map2:  "1,2,4"    1
6) map1:  "A"   ",1,2,4,8,"
                 "B"   ",2,4,"
    map2:  "1,2,4"    1  
                 "2,4,8"    1
7) map1:  "A"   ",1,2,4,8,"
                 "B"   ",2,4,8,"
    map2:  "1,2,4"    1  
                 "2,4,8"    2
8) map1:  "A"   ",1,2,4,8,4,"
                 "B"   ",2,4,8,"
    map2:  "1,2,4"    1  
                 "2,4,8"    2
                 "4,8,4"    1

we would parse a line end extract key and value both as strings.

then we could get a reference of a map entry of map1 by

std::string & seq = map1[key];

Open in new window


the 'seq' string is empty for new keys and would contain a comma separated sequence of value(s) if the key already existed in map1.

you now would check whether the seq already contains at least two values. you could do that by calling seq.rfind(',', pos) up to 3 times where pos starts with seq.length()-1 and then is pos-1. of course you need to break whenever you reached the begin of the string before you counted 3 commas. if you found two values you extract a temporary substring from seq and build a a temporary triplet string with 3 values where the last is the new value (for example ",1,2,4,"). then you could check whether the current seq already contains the new triplet by calling seq.find(triplet). if the result is std::string::npos you have a new triplet for the key.
 
in any case you would now add value + comma to a non-empty seq string or comma + value + comma to an empty seq. as seq is a reference to a string of a map1 entry  you would already edit in the map and don't have to store a new value.

map2 only would be filled in case of a new triplet. that would be done by stripping off the leading and trailing commas and do

if ((m = map2[triplet]++) > lastmax)
{
     lastmax = m;
     maxtriplet = triplet;
}

Open in new window


again we would use that std::map::operator[] returns a reference to an already allocated and initialized value, which would be 0 if the triplet didn't exist and is the count so far if the triplet already existed. the operator++ then would increment the map entry value.

finally the maximum count is in lastmax and the maxtriplet would contain the associated triplet.

Sara
0

Featured Post

Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In real business world data are crucial and sometimes data are shared among different information systems. Hence, an agreeable file transfer protocol need to be established.
Article by: evilrix
Looking for a way to avoid searching through large data sets for data that doesn't exist? A Bloom Filter might be what you need. This data structure is a probabilistic filter that allows you to avoid unnecessary searches when you know the data defin…
The viewer will learn how to user default arguments when defining functions. This method of defining functions will be contrasted with the non-default-argument of defining functions.
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …

927 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question