Link to home
Start Free TrialLog in
Avatar of labradorchik
labradorchikFlag for United States of America

asked on

In SAS how to create sum of values of counts for each variable?

I am trying to sum (sumcnt) the values of counts (cnt) for each variable (name).
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?

Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.

Example of how "sumcnt" variable should look like:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C

My current code:

data combined (keep = name cnt location);
    length location $ 1;    

    set data1 (in=a)
          data2 (in=b)
          data3 (in=c);

location = "*";              

if a then location = "A";
if b then location = "B";
if c then location = "C";
run;

Open in new window

Avatar of theartfuldazzler
theartfuldazzler
Flag of South Africa image

Hi

Try:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from combined
group by name;
RUN;QUIT;

Open in new window

Avatar of labradorchik

ASKER

Thank you very much!!
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?
ASKER CERTIFIED SOLUTION
Avatar of theartfuldazzler
theartfuldazzler
Flag of South Africa image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Ian
Hi there labradorchik,

Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work.  Note there is a sort but generally sorts have been well optimized and go quite fast.

First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the   cnt  values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt  will be just partial sums for the other locations for each name.

Next we sort it all so that the names and locations are now in increasing order.

Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total).  We drop the variable (sumC) which had all the previous partial sums.

NOTE
====

proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.

data combinedTemp1 (keep = name cnt sumC location);
    length location $ 1;    
    retain sumC;

    set data3 (in=c)
               data2 (in=b)
               data1 (in=a);
    by name;
** This set (with 3 data sets and the following by statement produces an interleaving **;


    ** statement below may look a bit funny, but if does NOT use
     conditionals (with the associated branches which can slow us down).
     note that only one of a, b and c can be true (= 1) at any one time, the others
     are false (= 0).
    **;
    location = substr("ABC", a + 2*b + 3*c, 1);              

   if first.name then sumC = 0;

   sumC = sumC + cnt;

run;

proc sort data=combinedTemp1 out=combinedTemp2;
by  name location;
run;

data combined(drop=sumC);

    ******************************************************************
    ** of the following 4 statements we really only need the retain **
    ** sumcnt, but listing all these names in order results in the  **
    ** output dataset having them in in the required order          **
    ******************************************************************;
    length name $ 12;    
    length cnt 8;
    retain sumcnt;
    length location $ 1;
       

   set combinedTemp2;
   by name location;

   if first.name then sumcnt = sumC;

run;

Open in new window


Hope this helps.

Ian
Thank you very much to both of you!!
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.

One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".
Yes:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from
  (select "A" as location, name, cnt from data1
   union all
   select "B" as location, name, cnt from data2
   union all
   select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;
for some reason it is still comes out by "name". Is there a way that dataset can come out as by decreasing "sumcnt"?  

for example:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C
Adam            01            12              A
Adam            04            12              B
Adam            03            12              C
Adam            04            12              B
Anyone have any comments or suggestions?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Ian, thank you very much!
Sort worked fine! I just had to change it to "descending" instead of "decreasing".

I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :)  It would take forever to process all those records.