Solved

# Read a large file line by line into a map

Posted on 2012-04-03
863 Views
Hi

I have a large file, that cannot be loaded into the memory all at once.
Say the file has the following format

A   1
A   2
B   2
A   4
B   4
A   8
B   8
A   4
I want to read the file line by line, and dynamically create a "map?" in C++ such that

map user, count

where user = A, B ....
Count = a list of sets? (not sure what the best data structure is) to hold
{{1,2,4} {2,4,8} {4,8,4} }
basically the flow is

1) read file line by line
2) initialize the key of some map by the first column of first row (user) example A
3) take the value of this and add it into a list? so it would be {{1}}
4) look at the next line.The map would now contain

map ={
key:A value{{1,2}}
}

5) look at third line
map={
key=A value :{{1,2}}
key =B value :{{2}}
}

6) 4th line

map={
key=A value :{{1,2,4}}
key =B value :{{2}}
}

Now that there are 3 elements are reached {1,2,4} for A, i want to create a new map with this triplet as the key

map2 = {
key = {1,2,4} and value = 1}

5th line =

map={
key=A value :{{1,2,4}}
key =B value :{{2,9}}
}

and 6th line we find the 4th value for A so

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4}}
}

and also create a new entry in map2 for {2,4,8}

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 1
}
for the next line

map={
key=A value :{{1,2,4} {2,4,8 }
key =B value :{{2,4, 8}}
}

so we increment map2

map2 = {
key = {1,2,4} and value = 1
key = {2,4,8} and value = 2
}

After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

how do i do this in c++, is reading tab delimited entries simple?
0
Question by:Vlearns

Author Comment

i want to use ifstream
0

LVL 53

Accepted Solution

Infinity08 earned 167 total points
>> how do i do this in c++, is reading tab delimited entries simple?

It's not hard ;)

You can use getline to read one line of input from the file (and doing so in a loop until the end of the file is reached will read all lines).

http://www.cplusplus.com/reference/string/getline/

That will give you the line in a std::string, which you can then parse to get whatever you need from it.

If a line contains two tab-separated items, you could either use a substr-based approach (look for the tab, and then take two substrings - one left and one right of the tab),

http://www.cplusplus.com/reference/string/string/find/
http://www.cplusplus.com/reference/string/string/find_first_of/
http://www.cplusplus.com/reference/string/string/substr/

or use a stringstream, and simply "stream" the two items out of the string.

http://www.cplusplus.com/reference/iostream/istream/operator%3E%3E/
0

LVL 32

Expert Comment

for the container you could use std::map<std::string, std::set<int> >.

if you have parsed key and value from line you would add that pair by

``````mymap[key].insert(value);
``````

the mymap[key] would return a reference to a set<int> which is either empty when the key was new or was the set already stored to the key. the insert would add the value to the set unless it already contains it.

Sara
0

LVL 32

Expert Comment

if the file is too large for loading it into memory, it also could give memory problems with the map unless there are many duplicates for keys or for key-value pairs.

did you try to load the whole text file into memory? if yes, did you use one big char buffer for all or did you use a dynamic container like std::vector<std::string> where you added line by line? for the second container you should be able to allocate much more memory cause it allocates memory in (small) chunks and doesn't need one huge contiguous free memory space.

Sara
0

LVL 22

Expert Comment

>> After i finish reading the file, i want to be able to read map2 and pullout the triplet key with the maximum value

I dont think you need a map for that, that is just a simple search for the maximum and given the nature of your algorithm you should be able to do it easily.

1) I suggest to use std::tuple<> if your STL supports it. That would save you from have to write a lot of basic operators. See the example here: http://msdn.microsoft.com/en-us/library/bb982837.aspx

2) Looking at the data and your approach - the way I understood the problem

A   1
A   2
A   4
A   8
A   4

You want to find the max contiguous triplet 4,8,4 from the sequence. This should be done this way

// pseudocode

typedef std::tuple<int, int, int> triplet_t;

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
if( temp > tmax)
tmax = temp;
}

Of course this is for just one user 'A', and you'd need a map but just one map.
0

LVL 22

Assisted Solution

ambience earned 166 total points
I think I missed one point. You havent defined the comparison rules for triplets, like which is greater 4,8,4 or 8,1,1. But - then all you need is define the comparison operators like

bool operator<(const triplet_t& t1, const triplet_t& t2) {
...
}

Just in case, if theres a conflict with std::tuple then just define a replica or even do it manually like

int compare(const triplet_t& t1, const triplet_t& t2) {
... return -1 if <, 0 for equal and 1 for t1 > t2
}

triple_t tmax = make_tuple(1,2,4)
while(!eof)
{
triplet_t temp = make_tuple(get<1>(tmax), get<2>(tmax), new value for A);
if( compare(temp, tmax) < 0 )
tmax = temp;
}
0

LVL 32

Assisted Solution

sarabande earned 167 total points
sorry for my wrong suggestion using std::set. i didn't understand the requirements.

i think now the triplet is the key and the required result is the triplet with the maximum count for different keys.

if so, it could not be made with one map if the same triple sequence should not be counted twice for a key of the first map.

i would suggest std::map<std::string, std::string > for map1 and std::map<std::string, int> for map2 . don't think that we need a vector of triples for first map cause by storing the sequence as comma separated list of integers we could use std::find to check whether the next triple for a key already had a count.

for the sample sequence we would have

1) map1:  "A"   ",1,"
2) map1:  "A"   ",1,2,"
3) map1:  "A"   ",1,2,"
"B"   ",2,"
4) map1:  "A"   ",1,2,4,"
"B"   ",2,"
map2:  "1,2,4"    1
5) map1:  "A"   ",1,2,4,"
"B"   ",2,4,"
map2:  "1,2,4"    1
6) map1:  "A"   ",1,2,4,8,"
"B"   ",2,4,"
map2:  "1,2,4"    1
"2,4,8"    1
7) map1:  "A"   ",1,2,4,8,"
"B"   ",2,4,8,"
map2:  "1,2,4"    1
"2,4,8"    2
8) map1:  "A"   ",1,2,4,8,4,"
"B"   ",2,4,8,"
map2:  "1,2,4"    1
"2,4,8"    2
"4,8,4"    1

we would parse a line end extract key and value both as strings.

then we could get a reference of a map entry of map1 by

``````std::string & seq = map1[key];
``````

the 'seq' string is empty for new keys and would contain a comma separated sequence of value(s) if the key already existed in map1.

you now would check whether the seq already contains at least two values. you could do that by calling seq.rfind(',', pos) up to 3 times where pos starts with seq.length()-1 and then is pos-1. of course you need to break whenever you reached the begin of the string before you counted 3 commas. if you found two values you extract a temporary substring from seq and build a a temporary triplet string with 3 values where the last is the new value (for example ",1,2,4,"). then you could check whether the current seq already contains the new triplet by calling seq.find(triplet). if the result is std::string::npos you have a new triplet for the key.

in any case you would now add value + comma to a non-empty seq string or comma + value + comma to an empty seq. as seq is a reference to a string of a map1 entry  you would already edit in the map and don't have to store a new value.

map2 only would be filled in case of a new triplet. that would be done by stripping off the leading and trailing commas and do

``````if ((m = map2[triplet]++) > lastmax)
{
lastmax = m;
maxtriplet = triplet;
}
``````

again we would use that std::map::operator[] returns a reference to an already allocated and initialized value, which would be 0 if the triplet didn't exist and is the count so far if the triplet already existed. the operator++ then would increment the map entry value.

finally the maximum count is in lastmax and the maxtriplet would contain the associated triplet.

Sara
0

## Featured Post

Displaying an arrayList in a listView using the default adapter is rarely the best solution. To get full control of your display data, and to be able to refresh it after editing, requires the use of a custom adapter.
Whether you've completed a degree in computer sciences or you're a self-taught programmer, writing your first lines of code in the real world is always a challenge. Here are some of the most common pitfalls for new programmers.
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…