asked on

STL map question and example required.

Hi,

I'm wanting to use something similar to a hash table where I can insert data into the table and if it already exists then the number of occurences increases by one; BUT where the occurences are sorted so that the highest occuring is at one end and the lowest occuring at the other end of the table.

Is this possible with a map and if so could a simple example be provided...

Many thanks,
Uni.

pepr

How many items do you expect to process? How big (say in bytes) is the data element? If there is only few of them and they are relatively small and you insert the item once and then you want to get a histogram or such things, you may want to use a dynamic array (std::vector) of pairs (std::pair) od [key, counter] and build the sorted array using slightly modified insertion sort (http://en.wikipedia.org/wiki/Insertion_sort).

How often do you want to search for the items in other cases than insertion of another one? If getting the sorted sequence by frequency (the histogram) is only occasional, you may still want to use the map (i.e. fast insertion and fast search), get the pairs when needed and sort them -- insertion sort can also be used here.

If the number of data is somehow huge and if the data are large records where only some attribute is the key--the data may possibly be stored already in a database--then some SQL engine can do the job.

You should tell more details about the data...

efn

No, this is not possible with a map, at least not efficiently. To count occurrences of data values efficiently, you would want to have counts as map values and data values as map keys. That would mean that the map would be sorted by data values, not by counts, as you wanted.

There are various ways to do what you want with either more complexity or low efficiency. The choice would depend on your specific needs for count-sorted access: do you just want to be able to read the data in count order after reading all the data, or do you want to be able to read in count order at any time? Do you want to read all the data in count order, or do you only care about the highest or the lowest or some other subset?

Unimatrix_001

ASKER

Hi,

- The data element is 4 bytes.
- The number of them is unknown and cannot be predicted, but what I can say is that the data element is an array and every element is the same size.

>>then some SQL engine can do the job.
I'd rather not have the program rely on anything.

>>do you just want to be able to read the data in count order after reading all the data
Correct, only once all the data is inputted will I get the sorted data back out.

>>Do you want to read all the data in count order
Correct, it doesn't matter if it is from the least occuring to the most, just as long as it is accessed in order.

lucky_james

Yes, i feel you can make use of it.

make map of no. of occurrence - Data pairs.
Then whenever you get any data, you need to iterate thru the map pairs.

a more optimized implementation would be to use your own custom structures.
get a struct1 which has number of occurrence and a pointer to struct2.
struct2 would have data value and a pointer to struct1.

now you can sort the struct1. when you want the data to be in order.
and you can get the number of occurrences from data directly by accessing the struct1 pointer.

This second approach would be very gud to use.

Infinity08

>> >>do you just want to be able to read the data in count order after reading all the data
>> Correct, only once all the data is inputted will I get the sorted data back out.

In that case, a good approach would be to use a map with the data as the key, and you can simply do :

map<Data, int> myMap;

// insert data :
Data data;
(myMap[data])++; // <--- will increment the counter if it exists, and create one if it doesn't exist yet

And then when you want to retrieve the data ordered by counter, you simply move the data to a different container (a vector should be sufficient) that orders on the counter ...

Unimatrix_001

ASKER

lucky_james: So you're saying that the map would be delcared as follows:
map<Struct1, Struct2> mappings;
and once all the data is sorted, I then use some STL sort algorithm on it by as you said getting the occurences from one pointer or the other?

Infinity - long time no speak! ;) Looks a nice clean solution, although wouldn't moving the data to a different container really hamper performance in comparison to the solution offered by lucky_james? Again, once shifted to the vector, I then use some STL sort on it?

Cheers,
Uni

pepr

Well, my above mentioned idea of building sorted array is bad. The reason is that it is fast to search in the sorted array only when you search for the key that was used also for sorting. Here the search is for a key, but the sort is by count.

Infinity08

>> Infinity - long time no speak! ;)

:) Good to see you too ...

>> although wouldn't moving the data to a different container really hamper performance in comparison to the solution offered by lucky_james?

I'm not sure I understand lucky_james's solution, but it seems to involve sorting twice ... First on the data, then on the counter. That's the same as what I proposed, just using more complicated structures. But I could be mistaken ... lucky_james, can you clarify your solution ?

Btw, a small update on what I proposed : the second container should probably also be a map instead of a vector.

If you need the application to be more performant than that, then you'll have to somehow be able to order on two keys within the same container. No STL container does that, so you'll have to create your own container.

lucky_james

map<Struct1, Struct2> mappings would lead to moving of data and thus, may hamper performance.

your can have a list,array or vector (any 1D structure) of struct1. struct2 would be linked to struct1 thru pointers.
While sorting you need not to move data then.

lucky_james

>>>>>I'm not sure I understand lucky_james's solution, but it seems to involve sorting twice ... First on the data, then on the counter.

No infinity, all you need to sort on the counter. No need to move the data. The second chain would get the updated pointer.

My basic idea lies as to keep counters in one chain and data in other.
Now sort one chain but you need not to move the corresponding places of the second chain as the chains are linked thru pointers. of course, you need to update the pointers in the second chain so that i can have a valid pointer to reference back.
So, it has two advantages:
1. sorting on counter only.
2. No moving of data.

pepr

I suspect the solution of lucky_james is somehow confused. Why would you keep map of key+counter and one more structure that sorts the pairs by the counter? I can see no optimization here.

You probably have to move the pairs into a different container. The map cannot be sorted and the vector is not suitable for insertion here.

Infinity08

>> No need to move the data. The second chain would get the updated pointer.

So, you're still sorting twice, and still have two containers ... I don't see the optimization compared to what I suggested ...

lucky_james

you need not to have a map...once you are going for your custom structures then you can go for two 1-D structures (i have already stated that!)
Confusion may arise due to use of map and my proposed custom structure.
My funda of increasing performance is to sort on one Dimensional chain and then reference the second chain from it. It also has the facility to reference back to prevent reiterating.

I hope it should be clear by now. Let me know if it is not clear.

<<<<The map cannot be sorted and the vector is not suitable for insertion here.
back to top

Thats why my design was more towards 1D structures.

Unimatrix_001

ASKER

lucky_james: Could you possibly put down some pseudo-code... I'm not grasping what you mean.

lucky_james

<<So, you're still sorting twice, and still have two containers ... I don't see the optimization compared to what I suggested ...

no infinity, it simply means to add another instruction while sorting.
There would be one sorting only, and that would be one first chain.

Unimatrix_001

ASKER

lucky_james: How do you count the occurences in a fast manner without the map?

Infinity08

>> you need not to have a map...once you are going for your custom structures then you can go for two 1-D structures (i have already stated that!)

A map is also a 1D container ...

What you propose is to duplicate the data from the beginning into two containers, and sort both containers as you insert data, correct ?

How's that better than inserting in one container, while sorting on the first key, and at the end moving everything to a second container, and sorting on the second key. You're still using two containers, and you're still sorting each container on a different key.

Your solution just involves more complicated structures, and a custom container (with more chance for bugs).

pepr

Try this:

=======================================================
#include <iostream>
#include <utility>
#include <map>
#include <vector>
#include <algorithm>

using namespace std;

typedef map<int, int> tMap;
typedef pair<int, int> tPair;
typedef vector<tPair> tVec;

bool gt(const tPair & a, const tPair & b)
{
return a.second > b.second;
}

int main()
{
tMap m;

++m[1];
++m[1];
++m[1];
++m[1];
++m[1];
++m[1];
++m[1];

++m[2];
++m[2];
++m[2];
++m[2];

++m[0];

++m[3];
++m[3];
++m[3];
++m[3];
++m[3];

for (tMap::const_iterator it = m.begin(); it != m.end(); ++it)
{
cout << it->first
<< " occurs " << it->second << " times.\n";
}

cout << "Now copy the pairs to the vector...\n";
tVec vec;
for (tMap::const_iterator it = m.begin(); it != m.end(); ++it)
{
vec.push_back(*it);
}

cout << "Show the content of the vector...\n";
for (tVec::const_iterator it = vec.begin(); it != vec.end(); ++it)
{
cout << it->first
<< " occurs " << it->second << " times.\n";
}

cout << "Sort and show the result...\n";
sort(vec.begin(), vec.end(), gt);
for (tVec::const_iterator it = vec.begin(); it != vec.end(); ++it)
{
cout << it->first
<< " occurs " << it->second << " times.\n";
}

return 0;
}
=======================================================

It produces the following:

C:\tmp\___cpp>main.exe
0 occurs 1 times.
1 occurs 7 times.
2 occurs 4 times.
3 occurs 5 times.
Now copy the pairs to the vector...
Show the content of the vector...
0 occurs 1 times.
1 occurs 7 times.
2 occurs 4 times.
3 occurs 5 times.
Sort and show the result...
1 occurs 7 times.
3 occurs 5 times.
2 occurs 4 times.
0 occurs 1 times.

Infinity08

>> Try this:

That illustrates my suggestion, yes ;) >Except that it might be easier to use a map for the second container instead of a vector - that way you don't have to call sort manually ... Just a matter of preference really.

Unimatrix_001

ASKER

Hm, I get the impression from the code that it's not too fast...?

lucky_james

ok.....

struct 1 :
counter.
pointer to struct2.

struct2:
data.
pointer to struct1.

now have struct1 in array1

Now while sorting:

sort array1.
while sorting update pointers to struct1 in struct2.

yup that would have o(n) while updating counters.

pepr

lucky_james: "you need not to have a map...once you are going for your custom structures then you can go for two 1-D structures"

The problem is that you need to search for the existing element to increase the counter. It is not efficient in any combination of vectors.

Unimatrix_001

ASKER

Actually, could this be done something similar to this:

MyStruct{
char data;
int occurences;
};

struct MyStructEqual{
bool operator()(const MyStruct* a, const MyStruct* b) const{
return a.occurences==b.occurences;
}
};

hash_map<MyStruct, MyStruct*, hash<MyStruct>, MyStructEqual> myMap;

I'm unsure if value can be a pointer to the key, but if it can - perhaps this could be a solution?

pepr

lucky_james: "yup that would have o(n) while updating counters."

This is a complexity of updating one counter in one insertion step. So, when inserting n elements, you have O(n^2). And the final sort has O(n log n) in the best case. (Well, mixture of theory and praxis as the result of both steps combined cannod be summed, but the real times can.) The overall complexity is about O(n^2).

The map+vector solution has about O(n log n) for insertion, copying O(n), and sort also O(n log n). The final complexity is O(n log n), i.e. better than in your solution.

ASKER CERTIFIED SOLUTION

Infinity08

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Unimatrix_001

ASKER

How about this:

MyStruct{
char data;
int occurences;
};

vector<MyStruct> myVector;
hash_map<char, MyStruct*> myMap;

MyStruct myStructObj1;
//Populate myStructObj1
myVector.push_back(myStructObj1);
myMap[myStructObj1.data]->occurences++;

MyStruct myStructobj2;
//Populate myStructobj2
myVector.push_back(myStructobj2);
myMap[myStructobj2.data]->occurences++;

MyStruct myStructobj3;
//Populate myStructobj3
myVector.push_back(myStructobj3);
myMap[myStructobj3.data]->occurences++;

//Sort the vector based on the occurences...

What would that be like for speed?

Unimatrix_001

ASKER

Sorry, a bit of a mistake in that... I think because of the mistake that idea doesn't work... :(

pepr

Unimatrix_001: The counter should not be the part of the key. It does not make sense.

Corrections to my notes about the complexity. The problem is that the size of map/vector is smaller than the number of inserted elements because only unique occurences are captured... plus the counter.

Infinity08: Please, explain how the second map would be used? The problem is that map does not guarantee iteration in the sorted sequence.

SOLUTION

pepr

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Unimatrix_001

ASKER

Thank you. :)

Infinity08

>> Infinity08: Please, explain how the second map would be used? The problem is that map does not guarantee iteration in the sorted sequence.

?? Of course it does : a map is a sorted container, and has iterators. (btw, notice that I used a multimap to be able to store duplicate counter values).

>> Infinity08: Well but the insertion to the second map and traversal is at least as time/resources consuming as insertion into the vector and sorting.

Not necessarily. Insertion in a map already sorts on the key, so it inserts it in the correct location. Sorting a vector requires moving around the elements in the array ... I don't have performance measurements, but I think the map solution will be faster. Maybe something you can test, Unimatrix_001 ?

>> Moreover, it may not be as readable as:

I don't see why ... compare your code to mine - how's mine not readable ?

pepr

Infinity08: "a map is a sorted container"
You are right. I thoutght that it is not stated in the reference.

"Insertion in a map already sorts on the key, so it inserts it in the correct location."
Right. It is probably always implemented as an ordered binary tree in C++ (i.e. not a linear structure, probably always red-black tree). It means that insertion of all elements has complexity about O(n log n). Then iterators can traverse the tree in the in-order way (i.e. process the left subtree, process the node, process the right subtree) -- time complexity about O(n). It probably will be faster.

When using hash_map<>, the result may be even faster but it does not guarantee ordering.

For readability -- forget my comments ;)