asked on

In SAS how to create sum of values of counts for each variable?

I am trying to sum (sumcnt) the values of counts (cnt) for each variable (name).
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?

Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.

Example of how "sumcnt" variable should look like:

name cnt sumcnt location

John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C

My current code:

data combined (keep = name cnt location);
    length location $ 1;    

    set data1 (in=a)
          data2 (in=b)
          data3 (in=c);

location = "*";              

if a then location = "A";
if b then location = "B";
if c then location = "C";
run;

Open in new window

theartfuldazzler

Hi

Try:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from combined
group by name;
RUN;QUIT;

Open in new window

labradorchik

ASKER

Thank you very much!!
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?

ASKER CERTIFIED SOLUTION

theartfuldazzler

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Ian

Hi there labradorchik,

Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work. Note there is a sort but generally sorts have been well optimized and go quite fast.

First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the cnt values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt will be just partial sums for the other locations for each name.

Next we sort it all so that the names and locations are now in increasing order.

Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total). We drop the variable (sumC) which had all the previous partial sums.

NOTE
====
proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.

data combinedTemp1 (keep = name cnt sumC location);
    length location $ 1;    
    retain sumC;

    set data3 (in=c)
               data2 (in=b)
               data1 (in=a);
    by name;
** This set (with 3 data sets and the following by statement produces an interleaving **;


    ** statement below may look a bit funny, but if does NOT use
     conditionals (with the associated branches which can slow us down).
     note that only one of a, b and c can be true (= 1) at any one time, the others
     are false (= 0).
    **;
    location = substr("ABC", a + 2*b + 3*c, 1);              

   if first.name then sumC = 0;

   sumC = sumC + cnt;

run;

proc sort data=combinedTemp1 out=combinedTemp2;
by  name location;
run;

data combined(drop=sumC);

    ******************************************************************
    ** of the following 4 statements we really only need the retain **
    ** sumcnt, but listing all these names in order results in the  **
    ** output dataset having them in in the required order          **
    ******************************************************************;
    length name $ 12;    
    length cnt 8;
    retain sumcnt;
    length location $ 1;
       

   set combinedTemp2;
   by name location;

   if first.name then sumcnt = sumC;

run;

Open in new window

Hope this helps.

Ian

labradorchik

ASKER

Thank you very much to both of you!!
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.

One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".

theartfuldazzler

Yes:

PROC SQL;
Create table combined2 as
select name, cnt, sum(cnt) as sumcnt, location from
(select "A" as location, name, cnt from data1
union all
select "B" as location, name, cnt from data2
union all
select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;

labradorchik

ASKER

for some reason it is still comes out by "name". Is there a way that dataset can come out as by decreasing "sumcnt"?

for example:

name cnt sumcnt location

John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C
Adam 01 12 A
Adam 04 12 B
Adam 03 12 C
Adam 04 12 B

labradorchik

ASKER

Anyone have any comments or suggestions?

SOLUTION

Ian

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

labradorchik

ASKER

Ian, thank you very much!
Sort worked fine! I just had to change it to "descending" instead of "decreasing".

I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :) It would take forever to process all those records.