• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 239
  • Last Modified:

Grouping Columns and Sorting - For performance

Hi guys,

I'm reading 1000 records each from a file. Each of the record are shown below in [examples]. How do I group records based on similiarity of the first 3 columns.. For example, records such as below are same group..

5010 |90 |110|TGE93I+E93I+E39W|3923902932325
5010 |90 |110|F893K3+40+3R0OE2|3923902932322
5010 |90 |110|OUKKGD+R+4EJDDJ2 |3923902932321

And assign them to a unique group ID.. and within the group ID, sort the 4th column, note the 4th column is made up of 3 internal fields, delimited by +. By the end of operation, every record would belong to a group id. Then output them as

Group ID|5th Column
..
1|3923902932322
1|3923902932321
1|3923902932325
2|..
2|..
3|..

I using C# .NET for performance issues as I'm sorting close to ten thousands of lines. I was thinking of using hashtable as compared to ArrayList.

field name:

customer_id|store_id|store_type|reference_code|transaction_id

field type:

decimal(10)|smallint(2)|smallint(2)|char(15)|decimal(10)

[examples] (unsorted):

20330|2  |140|HRR8353904    |1600000104661
5010 |90 |110|F893K3403R0OE2|3923902932324
1020 |3  |350|LS34FKF8DD    |1493893040343
5010 |90 |110|OUKKGDR4EJDDJ |3923902932324
3125 |503|100|UFKO3DDFE930F |1493893040343
1020 |4  |150|HGKO9FJEF3    |3829483984843
5010 |90 |101|TGE93IE93IE39W|3923902932324


-
Any help & suggestions for the above will be appreciated.

Thanks! ;-)
0
jedistar
Asked:
jedistar
  • 21
  • 18
  • 2
  • +1
1 Solution
 
Bob LearnedCommented:
I would suggest using a SortedList, with the first 3 column values as the key, and then an ArrayList of objects that contain the information for each row.  If you have 2005, then you can use Generics to make this strong-typed.

Bob
0
 
jedistarAuthor Commented:
Problem is i'm using 1.1 and 2003. Do you have some sample codes that can help me?

Thanks!
0
 
jedistarAuthor Commented:
How do i use the first columns as key and sort according to the 4th col.
And the 4 column consist of internal fields that has "," delimiter to be
sorted as well..
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
Bob LearnedCommented:
Something like this:

1) Create a class:

public class Example
{
    private string _name = "";
    private string _address = "";
    public string Name { get { return _name; } set { _name = value; } }
    public string Address { get { return _address ; } set { _address = value; } }
}

2) Create a SortedList to sort the values by a key:
     private SortedList _sortedExample = new SortedList();

3) Create an instance of a class:
    Example example = new Example();
    example.Name = "Bob";
    example.Address = "Cincinnati, OH, USA";

4) Add the instance to an ArrayList:
    ArrayList list = new ArrayList();
    list.Add(example);

5) Add the ArrayList to the SortedList:
    _sortedExample.Add(example.Name, list);

Bob
0
 
Jose ParrotGraphics ExpertCommented:
Hi,

If you're looking for "easy to code", C# is a good choice. If you are looking for performance, C# is a very bad choice. It is a poor speeding guy...

I see you are to run a hand made code... not high level powerful commands.
If you have to do this task many times a day, then chose C++ for performance.
If you have to do it few times a month, then the performance isn't an issue.

Anyway, the singlest way I see it is:
1. Create 10 new empty files:
    0.file, 1.file,...,9.file
2. Read each original file and copy each register (or line) to the new files, according to its 1st field:
    1???|???... to 1.file, 2???|???... to 2.file, so on
3. Order each new file individually by field 1
4. Order each new file again, this time by field 2
5. Order for the 3rd time, ordering by field 3
6. Again, for field 4 then
7. Again for field 5. Now 0.file,...,9.file are full ordered
8. If you need just one consolideted file, just chain them.

This so simple approach can perform better than database tools, mainly if the original files are flat files, as seem to be. Anyway, you have C# functions to help you.

Jose
0
 
JimBrandleyCommented:
jedistar - Did you want the groupID columns added to the five you already have?

Jim
0
 
JimBrandleyCommented:
Also, is this a different list than the one we worked last night?
0
 
jedistarAuthor Commented:
Thanks guys, I will evaluate the methods into methods now.

Hey Jim ;)

PS: In the way the dataset is the same, I'm trying a new approach to sort better to fit some business rules. You can ref to the same question answered by you and Greg earlier.

I added a 6th col called Group ID, groupID to group records who has the same first 3 cols. i.e:

20000|10|20|DFE+FE...
20000|10|20|JFK+J83...
20000|10|20|F83+JD...

A group must contain same first 3 cols, as this is a criteria... It must be sorted according to the 4th col. However I need to limit each grp to max of 300 records and it has to go into a new group ID if the current group already stores 300 records.. However it must be sorted according to the 4th col before splitting them up into different groups if it exceeds 300.

The 4th column is tricky.. as it could be like HF7+38UFF+MEKF+J323+GR. Each of these internal fields are delimited by "+" sign. I was thinking of doing icompare in each of them but not too sure how.. The 5th col doesnt matter.. The 6th col is the group ID to store which grp it belongs to..

All I actually need is to group the records based on first 3 columns and sorting them according to 4th col.. then limit them to 300 records per group.. new records should be in new group id.
0
 
JimBrandleyCommented:
Either way, if you use the sort following the load, the list is already in the order you want. Then:
1. Add a new member to your Proposal class say ProposalID, and initialize it to zero.

2.Add another method to your Form class
public void GroupProposals(ArrayList proposals)
{
   int proposalCount = proposals.Count;
   if (proposals.Count > 0)
   {
      int groupID = 1;
      int index = 1;
      Proposal pCurrent, pPrevious;

      ((Proposal)proposals[0]).GroupID = groupID;
      while( index < proposalCount)
      {
         pCurrent = (Proposal)proposals[index];
         pPrevious = (Proposal)proposals[index -1];
         if ((pCurrent.p1 != pPrevious.p1) || (pCurrent.p2 != pPrevious.p2) || (pCurrent.p2 != pPrevious.p2))
            groupID++;
         ((Proposal)proposals[index]).GroupID = groupID;
      }
   }
}

3. After
   recordList.Sort(new Sorter());
add
   GroupProposals(recordList);
0
 
JimBrandleyCommented:
That will still work. The plus symbol sorts lower than alphas or numerics, so it's not a problem. Make one change to GroupProposals.
Add int groupCount = 1; (For the first we do outside the loop. Then increment it each time you write. Change:
        if ((pCurrent.p1 != pPrevious.p1) || (pCurrent.p2 != pPrevious.p2) || (pCurrent.p2 != pPrevious.p2))
            groupID++;
 to:
if ((groupCount == 300) || (pCurrent.p1 != pPrevious.p1) || (pCurrent.p2 != pPrevious.p2) || (pCurrent.p2 != pPrevious.p2))
{
     groupID++;
     groupCount = 0;
}
Then after:
((Proposal)proposals[index]).GroupID = groupID;
groupCount++;
0
 
jedistarAuthor Commented:
Thanks Jim

just a query.. would that be more efficient or sorting them from the start when i'm processing each record from reading the input file.. ?

How do i sort them based on the 4th col
0
 
JimBrandleyCommented:
I think it would be difficult to beat the performance you were getting last night. I do not know of anyone who knows .Net performance better than Greg Young, so if he says that's a good way to go, it is likely about the best there is.

The 4th col is just a string, since '+' sorts lower than alphas or numerics, it will do exactly as you want. So,
GX88+....
will sort earlier than
GX88B+...
This way , you still have a pretty fast sort, and only have to iterate through the list one time to set the group ID.
0
 
JimBrandleyCommented:
One other point, you cannot set group ID till the sort is complete. The sort you have is good, so this reduces the change necessary to support the group.
0
 
JimBrandleyCommented:
I also noticed in GroupProposals, I was comparing p2 twice - second should be p3.
0
 
jedistarAuthor Commented:
oh no, don't mistaken me. i'm not doing this to beat the performance. am doing this bcos of a business rules to fulfill:

(a) inside recordList.Sort(new...); now.. how do i sort so that it breaks the 4th column into internal fields and sorts according to them as + as delimiter. I could use , as well if its needed. The only sort i need now is sorting within each grp in ascending order based on 4th col.

2. this is how i group and sort currently..

                  recordList = new ArrayList(size);
                  Hashtable h1 = new Hashtable(); // store group id
                  Hashtable h2 = new Hashtable(); // store count


                  try
                  {
                        if (File.Exists("1.txt"))
                        {
                              // open input file
                              sr = new StreamReader("1.txt");

                              while (sLine != null)
                              {
                                    sLine = sr.ReadLine();

                                    if (sLine != null)
                                    {
                                          string [] temp = sLine.Split('|');
                                          // key is first 3 cols eg: 300+2+105
                                          key = temp[0] + "," + temp[1] + "," + temp[2];
                                          val = "";

                                          // Check if key is in HashTable
                                          if (h1.ContainsKey(key))
                                                      val = h1[key].ToString(); // group id
                                                      h2[key] = Convert.ToInt16(h2[key]) + 1;
                                          else
                                          {
                                                groupID++;
                                                val = groupID.ToString();
                                                h1.Add(key, val);
                                                h2.Add(key, 1);
                                          }

                                          Proposal p = new Proposal();

                                          p.ID = recordList.Count;
                                          p.p1 = Convert.ToDecimal(temp[0]);
                                          p.p2 = Convert.ToInt16(temp[1]);
                                          p.p3 = Convert.ToInt16(temp[2]);
                                          p.p4 = AddZero(temp[3]);
                                          p.p5 = Convert.ToDecimal(temp[4]);
                                          p.p6 = Convert.ToInt16(val);
                                          // add record
                                          recordList.Add(p);
                                    }                                    
                              }

                              recordList.Sort(new Sorter());

            private class Sorter : IComparer  
            {
                  public int Compare(Object x, Object y)  
                  {
                        Proposal pX = (Proposal)x;
                        Proposal pY = (Proposal)y;

                        if (pX.p6 < pY.p6)
                              return -1;
                        else if (pX.p6 > pY.p6)
                              return 1;
                        else                        
                              return string.Compare(pX.p4, pY.p4);
                  }
            }            
0
 
jedistarAuthor Commented:
removed: p.p4 = AddZero(temp[3]);
added: p.p4 = temp[3];
0
 
jedistarAuthor Commented:
hope you get my idea what i mean..
0
 
JimBrandleyCommented:
I think I do. Consider this list:
TGE93I+E93I+E38W
TGE93I+E93I+E39W
TGE93I+E93I+E39Z
TGE93I+E93I+F39W
F893K3+40+3R0OE2
F893K3+41+3R0OE2
F893K3+42+3R0OE2
F893K3+42+3R0OE3
OUKKGD+R+4EJDDJ2
OUKKGD+S+4EJDDJ2
OUKKGD+T+4EJDDJ2

Just sorting them as strings produces:
F893K3+40+3R0OE2
F893K3+41+3R0OE2
F893K3+42+3R0OE2
F893K3+42+3R0OE3
OUKKGD+R+4EJDDJ2
OUKKGD+S+4EJDDJ2
OUKKGD+T+4EJDDJ2
TGE93I+E93I+E38W
TGE93I+E93I+E39W
TGE93I+E93I+E39Z
TGE93I+E93I+F39W

Isn't that the order you want? If it is, you can just leave it that way in the Icomparer.

I see you are only comparing p6 (twice) before p4. Didn't you want to compare:
5010 |90 |110
So there should be three different compares before you get to p4, assuming p4 is:
TGE93I+E93I+F39W
0
 
JimBrandleyCommented:
Fix the Icomparer so it does the first three columns of data (I get confused by the pn), plus the one with the plus symbols, then take it for a test drive. I think you'll find it's just what you want.
0
 
jedistarAuthor Commented:
Yes thats the order! :)

What should I have for the comparer now?

How do i integrate everything? seems messy now sorry!
0
 
JimBrandleyCommented:
I would comment out the hash as well. You will find it runs faster without it.
0
 
jedistarAuthor Commented:
what happens if the length of inetrnal fields of 4th col is inconsistent, seems like it will affect the sort:

T93I+E93I+E38W
TGE93I+E3I+E39W
TGE93I++E39Z
TGE93I+E93I+F39W
F893K3+40+3R0OE2
F893K3+41+3RE2
F893K3+42+3R0OE2
F893K3+42+3R03
OUKGD+R+4EJDDJ2
OUKKGD+S+4EJDDJ2
OUKKGD+T+4EJDDJ2
0
 
JimBrandleyCommented:
Nope - Try it with a simple test program. I'm gluing things together now - will post as soon as I finish the comparer...
0
 
JimBrandleyCommented:
     recordList = new ArrayList(size);
      //Hashtable h1 = new Hashtable(); // store group id
      //Hashtable h2 = new Hashtable(); // store count


      try
      {
            if (File.Exists("1.txt"))
            {
                  // open input file
                  sr = new StreamReader("1.txt");

                  while (sLine != null)
                  {
                        sLine = sr.ReadLine();

                        if (sLine != null)
                        {
                              // (I needed a program to identify the players)
                              // P1    P2   P3        P4           P5
                              // 5010 |90 |110|TGE93I+E93I+E39W|3923902932325
                              string [] temp = sLine.Split('|');
                              // key is first 3 cols eg: 300+2+105
                              //key = temp[0] + "," + temp[1] + "," + temp[2];
                              //val = "";

                              // Check if key is in HashTable
                              //if (h1.ContainsKey(key))
                              //            val = h1[key].ToString(); // group id
                              //            h2[key] = Convert.ToInt16(h2[key]) + 1;
                              //else
                              //{
                              //      groupID++;
                              //      val = groupID.ToString();
                              //      h1.Add(key, val);
                              //      h2.Add(key, 1);
                              //}

                              Proposal p = new Proposal();

                              p.ID = recordList.Count;
                              p.p1 = Convert.ToDecimal(temp[0]);
                              p.p2 = Convert.ToInt16(temp[1]);
                              p.p3 = Convert.ToInt16(temp[2]);
                              p.p4 = temp[3];
                              p.p5 = Convert.ToDecimal(temp[4]);
                              //p.p6 = Convert.ToInt16(val);
                              // add record
                              recordList.Add(p);
                        }                                    
                  }

                  recordList.Sort(new ProposalComparer());
                  GroupProposals(recordList);
            }
      }
}
public void GroupProposals(ArrayList proposals)
{
   int proposalCount = proposals.Count;
   if (proposals.Count > 0)
   {
      int groupCount = 1;
      int groupID = 1;
      int index = 1;
      Proposal pCurrent, pPrevious;

      ((Proposal)proposals[0]).GroupID = groupID;
      while( index < proposalCount)
      {
         pCurrent = (Proposal)proposals[index];
         pPrevious = (Proposal)proposals[index -1];
         if ((groupCount == 300) || (pCurrent.p1 != pPrevious.p1) || (pCurrent.p2 != pPrevious.p2) || (pCurrent.p3 != pPrevious.p3))
         {
            groupID++;
            groupCount = 0;
         }
         ((Proposal)proposals[index]).GroupID = groupID;
         groupCount++;
      }
   }
}


    class ProposalComparer : IComparer<Proposal>
    {

        #region IComparer<Proposal> Members

        public int Compare(Proposal x, Proposal y)
        {
            if (x.p1 != y.p1)
            {
                return x.p1.CompareTo(y.p1);
            }
            if (x.p2 != y.p2)
            {
                return x.p2.CompareTo(y.p2);
            }
            if (x.p3 != y.p3)
            {
                return x.p3.CompareTo(y.p3);
            }
            return string.Compare(x.p4, y.p4);
        }

        #endregion
    }
0
 
JimBrandleyCommented:
OOps - I forgot this is not 2.0 - Get rid of <Proposal> in the Icomparer class definition.
0
 
jedistarAuthor Commented:
no worries..
quite slow, still processing 10000 records
0
 
jedistarAuthor Commented:
taking over 5 mins now, how is it for u
0
 
JimBrandleyCommented:
Yikes!

Can you post your current code?
0
 
jedistarAuthor Commented:
same as the above you posted
0
 
JimBrandleyCommented:
Stop the debugger. I forgoot to increment index in the grouping loop. Should be:
((Proposal)proposals[index++]).GroupID = groupID;
0
 
jedistarAuthor Commented:
By the way.. the sorter im using isnt sorting well for example this is an output i got:

312|2|101|9900+8+100+10002+8+102+5309+4659|57021533|2
312|2|101|9900+8+99+10001+9+100+5292+8053|81377863|2
312|2|101|9900+9+100+9998+8+98+5191+532|66982241|2
312|2|101|9901+10+99+10000+8+100+5391+5454|94312748|2

the 2nd row should precede the first row?


            private class Sorter : IComparer  
            {
                  public int Compare(Object x, Object y)  
                  {
                        Proposal pX = (Proposal)x;
                        Proposal pY = (Proposal)y;

                        if (pX.p6 < pY.p6)
                              return -1;
                        else if (pX.p6 > pY.p6)
                              return 1;
                        else                        
                              return string.Compare(pX.p4, pY.p4);
                  }
            }            

These are the kinda problems i mean..
Thanks!
0
 
JimBrandleyCommented:
I guess this needs more thought - These are all numeric - all the examples had primarily alpha characters in these columns, and all started with alpha characters. So the column with the + symbols can contain entirely numeric strings as well? Is there any pattern or limit on what they can contain?

For example:
FGF+12+GG
1+FTF+3
100+FTF+300
9+FTF+90

None of the three columns can be sorted as numeric or alpha. There needs to be some sort of a pattern before a solution is possible.

0
 
jedistarAuthor Commented:
both alphanumerics n num
0
 
jedistarAuthor Commented:
it will sort according to the first field.. then 2nd field.. then 3rd field so on..
0
 
JimBrandleyCommented:
If you have some control over the content of these columns, and some come in as all numeric characters, the only way to sort them (mixed with alpha characters in other rows) is if you add leading zeros to make the numeric ones all the same length. Then:

312|2|101|9900+08+099+10001+9+100+5292+8053|81377863|2
312|2|101|9900+08+100+10002+8+102+5309+4659|57021533|2
312|2|101|9900+08+099+10001+9+100+5292+8053|81377863|2
312|2|101|9900+09+100+09998+8+098+5191+0532|66982241|2
312|2|101|9901+10+099+10000+8+100+5391+5454|94312748|2
312|2|101|GX337+ etc.

Will sort correctly. The most efficient time to do the padding is likely when the file we are reading here is constructed. On the other hand, if you have no control over them, this will be difficult - and likely pretty slow, because it will take a lot of string manipulation.
0
 
JimBrandleyCommented:
"it will sort according to the first field.. then 2nd field.. then 3rd field so on.."
How? If one column contains:
FFGX
1
1000
2
9

The only way I can think of to make numbers sort like numbers, mixed in with non-numbers is to:
Scan the entire list once just to find the length.
Go over the entire list again, padding with zero on the left till all are the same length.
Then you can perform an alpha sort.
0
 
JimBrandleyCommented:
I need to hit the sack - 01:00 where I am. I'll check back with you in the morning.

Jim
0
 
jedistarAuthor Commented:
Oh okay, yes my mistake, there will be rules to prevent mixture.

Thanks again! nites
0
 
jedistarAuthor Commented:
Will padding it with zero work just fine
0
 
JimBrandleyCommented:
Yes. If they are padded with leading zeros, the alpha sort will work correctly.
0
 
jedistarAuthor Commented:
Thanks Jim
0
 
JimBrandleyCommented:
My pleasure. Glad that worked for you.
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

  • 21
  • 18
  • 2
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now