# Read a large file line by line into a map

Hi

I have a large file, that cannot be loaded into the memory all at once.
Say the file has the following format

A   1
A   2
B   2
A   4
B   4
A   8
B   8
A   4
I want to read the file line by line, and dynamically create a "map?" in C++ such that

map user, count

where user = A, B ....
Count = a list of sets? (not sure what the best data structure is) to hold
{{1,2,4} {2,4,8} {4,8,4} }
basically the flow is

1) read file line by line
2) initialize the key of some map by the first column of first row (user) example A
3) take the value of this and add it into a list? so it would be {{1}}
4) look at the next line.The map would now contain

map ={
key:A value{{1,2}}
}

5) look at third line
map={
key=A value :{{1,2}}
key =B value :{{2}}
}

6) 4th line

map={
key=A value :{{1,2,4}}
key =B value :{{2}}
}

Now that there are 3 elements are reached {1,2,4} for A, i want to create a new map with this triplet as the key

map2 = {
key = {1,2,4} and value = 1}

5th line =

map={
key=A value :{{1,2,4}}
key =B value :{{2,9}}
}

and 6th line we find the 4th value for A so

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4}}
}

and also create a new entry in map2 for {2,4,8}

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 1
}
for the next line

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4, 8}}
}

so we increment map2

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 2
}

After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

how do i do this in c++, is reading tab delimited entries simple?
Asked:
###### Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Author Commented:
i want to use ifstream
Commented:
>> how do i do this in c++, is reading tab delimited entries simple?

It's not hard ;)

You can use getline to read one line of input from the file (and doing so in a loop until the end of the file is reached will read all lines).

http://www.cplusplus.com/reference/string/getline/

That will give you the line in a std::string, which you can then parse to get whatever you need from it.

If a line contains two tab-separated items, you could either use a substr-based approach (look for the tab, and then take two substrings - one left and one right of the tab),

http://www.cplusplus.com/reference/string/string/find/
http://www.cplusplus.com/reference/string/string/find_first_of/
http://www.cplusplus.com/reference/string/string/substr/

or use a stringstream, and simply "stream" the two items out of the string.

http://www.cplusplus.com/reference/iostream/istream/operator%3E%3E/

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Commented:
for the container you could use std::map<std::string, std::set<int> >.

if you have parsed key and value from line you would add that pair by

``````mymap[key].insert(value);
``````

the mymap[key] would return a reference to a set<int> which is either empty when the key was new or was the set already stored to the key. the insert would add the value to the set unless it already contains it.

Sara
Commented:
if the file is too large for loading it into memory, it also could give memory problems with the map unless there are many duplicates for keys or for key-value pairs.

did you try to load the whole text file into memory? if yes, did you use one big char buffer for all or did you use a dynamic container like std::vector<std::string> where you added line by line? for the second container you should be able to allocate much more memory cause it allocates memory in (small) chunks and doesn't need one huge contiguous free memory space.

Sara
Commented:
>> After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

I dont think you need a map for that, that is just a simple search for the maximum and given the nature of your algorithm you should be able to do it easily.

1) I suggest to use std::tuple<> if your STL supports it. That would save you from have to write a lot of basic operators. See the example here: http://msdn.microsoft.com/en-us/library/bb982837.aspx

2) Looking at the data and your approach - the way I understood the problem

A   1
A   2
A   4
A   8
A   4

You want to find the max contiguous triplet 4,8,4 from the sequence. This should be done this way

// pseudocode

typedef std::tuple<int, int, int> triplet_t;

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
if( temp > tmax)
tmax = temp;
}

Of course this is for just one user 'A', and you'd need a map but just one map.
Commented:
I think I missed one point. You havent defined the comparison rules for triplets, like which is greater 4,8,4 or 8,1,1. But - then all you need is define the comparison operators like

bool operator<(const triplet_t& t1, const triplet_t& t2) {
...
}

Just in case, if theres a conflict with std::tuple then just define a replica or even do it manually like

int compare(const triplet_t& t1, const triplet_t& t2) {
... return -1 if <, 0 for equal and 1 for t1 > t2
}

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
if( compare(temp, tmax) < 0 )
tmax = temp;
}
Commented:
sorry for my wrong suggestion using std::set. i didn't understand the requirements.

i think now the triplet is the key and the required result is the triplet with the maximum count for different keys.

if so, it could not be made with one map if the same triple sequence should not be counted twice for a key of the first map.

i would suggest std::map<std::string, std::string > for map1 and std::map<std::string, int> for map2 . don't think that we need a vector of triples for first map cause by storing the sequence as comma separated list of integers we could use std::find to check whether the next triple for a key already had a count.

for the sample sequence we would have

1) map1:  "A"   ",1,"
2) map1:  "A"   ",1,2,"
3) map1:  "A"   ",1,2,"
"B"   ",2,"
4) map1:  "A"   ",1,2,4,"
"B"   ",2,"
map2:  "1,2,4"    1
5) map1:  "A"   ",1,2,4,"
"B"   ",2,4,"
map2:  "1,2,4"    1
6) map1:  "A"   ",1,2,4,8,"
"B"   ",2,4,"
map2:  "1,2,4"    1
"2,4,8"    1
7) map1:  "A"   ",1,2,4,8,"
"B"   ",2,4,8,"
map2:  "1,2,4"    1
"2,4,8"    2
8) map1:  "A"   ",1,2,4,8,4,"
"B"   ",2,4,8,"
map2:  "1,2,4"    1
"2,4,8"    2
"4,8,4"    1

we would parse a line end extract key and value both as strings.

then we could get a reference of a map entry of map1 by

``````std::string & seq = map1[key];
``````

the 'seq' string is empty for new keys and would contain a comma separated sequence of value(s) if the key already existed in map1.

you now would check whether the seq already contains at least two values. you could do that by calling seq.rfind(',', pos) up to 3 times where pos starts with seq.length()-1 and then is pos-1. of course you need to break whenever you reached the begin of the string before you counted 3 commas. if you found two values you extract a temporary substring from seq and build a a temporary triplet string with 3 values where the last is the new value (for example ",1,2,4,"). then you could check whether the current seq already contains the new triplet by calling seq.find(triplet). if the result is std::string::npos you have a new triplet for the key.

in any case you would now add value + comma to a non-empty seq string or comma + value + comma to an empty seq. as seq is a reference to a string of a map1 entry  you would already edit in the map and don't have to store a new value.

map2 only would be filled in case of a new triplet. that would be done by stripping off the leading and trailing commas and do

``````if ((m = map2[triplet]++) > lastmax)
{
lastmax = m;
maxtriplet = triplet;
}
``````

again we would use that std::map::operator[] returns a reference to an already allocated and initialized value, which would be 0 if the triplet didn't exist and is the count so far if the triplet already existed. the operator++ then would increment the map entry value.

finally the maximum count is in lastmax and the maxtriplet would contain the associated triplet.

Sara
###### It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
C++

From novice to tech pro — start learning today.