Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 642
  • Last Modified:

In SAS how to create sum of values of counts for each variable?

I am trying to sum (sumcnt) the values of counts (cnt) for each variable (name).
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?

Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.

Example of how "sumcnt" variable should look like:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C

My current code:

data combined (keep = name cnt location);
    length location $ 1;    

    set data1 (in=a)
          data2 (in=b)
          data3 (in=c);

location = "*";              

if a then location = "A";
if b then location = "B";
if c then location = "C";
run;

Open in new window

0
labradorchik
Asked:
labradorchik
  • 5
  • 3
  • 2
2 Solutions
 
theartfuldazzlerCommented:
Hi

Try:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from combined
group by name;
RUN;QUIT;

Open in new window

0
 
labradorchikAuthor Commented:
Thank you very much!!
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?
0
 
theartfuldazzlerCommented:
One could combine the two steps into one step:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from
  (select "A" as location, name, cnt from data1
   union all
   select "B" as location, name, cnt from data2
   union all
   select "C" as location, name, cnt from data3)
group by name;
RUN;QUIT;

Open in new window


Otherwise, one needs to look at other options like hardware, or changing the location of your work directory.
0
Microsoft Certification Exam 74-409

VeeamĀ® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

 
ShannonEECommented:
Hi there labradorchik,

Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work.  Note there is a sort but generally sorts have been well optimized and go quite fast.

First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the   cnt  values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt  will be just partial sums for the other locations for each name.

Next we sort it all so that the names and locations are now in increasing order.

Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total).  We drop the variable (sumC) which had all the previous partial sums.

NOTE
====

proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.

data combinedTemp1 (keep = name cnt sumC location);
    length location $ 1;    
    retain sumC;

    set data3 (in=c)
               data2 (in=b)
               data1 (in=a);
    by name;
** This set (with 3 data sets and the following by statement produces an interleaving **;


    ** statement below may look a bit funny, but if does NOT use
     conditionals (with the associated branches which can slow us down).
     note that only one of a, b and c can be true (= 1) at any one time, the others
     are false (= 0).
    **;
    location = substr("ABC", a + 2*b + 3*c, 1);              

   if first.name then sumC = 0;

   sumC = sumC + cnt;

run;

proc sort data=combinedTemp1 out=combinedTemp2;
by  name location;
run;

data combined(drop=sumC);

    ******************************************************************
    ** of the following 4 statements we really only need the retain **
    ** sumcnt, but listing all these names in order results in the  **
    ** output dataset having them in in the required order          **
    ******************************************************************;
    length name $ 12;    
    length cnt 8;
    retain sumcnt;
    length location $ 1;
       

   set combinedTemp2;
   by name location;

   if first.name then sumcnt = sumC;

run;

Open in new window


Hope this helps.

Ian
0
 
labradorchikAuthor Commented:
Thank you very much to both of you!!
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.

One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".
0
 
theartfuldazzlerCommented:
Yes:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from
  (select "A" as location, name, cnt from data1
   union all
   select "B" as location, name, cnt from data2
   union all
   select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;
0
 
labradorchikAuthor Commented:
for some reason it is still comes out by "name". Is there a way that dataset can come out as by decreasing "sumcnt"?  

for example:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C
Adam            01            12              A
Adam            04            12              B
Adam            03            12              C
Adam            04            12              B
0
 
labradorchikAuthor Commented:
Anyone have any comments or suggestions?
0
 
ShannonEECommented:
Hi there labradorchik,

Do you really want all the possible rows across the locations, or just the one sumcnt for each Name?  If that is the case then the whole excercise drops out simply! ((For the same price you can get the number of cnt values make up the sumcnt value - ie insert another column, drop the location column and drop all but one row for each name. And in that case a SQL solution will be quite good and easy to write and understand.


Assuming that you really want all those other records with individual cnt values, then change the sort to -

proc sort data=combinedTemp1 out=combinedTemp2;
by  decreasing sumC name location;
run;

Open in new window

and then change the later data step to

   ...
   set combinedTemp2;
   by decreasing sumC name location;
   ...

Open in new window



-----

Alternatively you can try the SQL way, but sometimes using "calculated" in other than select values gives problems.  Hence try slight change to the theartfuldazzler solution -

proc sql;

  Create table combined2 as
    select *
    from
    (
        select name, cnt, sum(cnt) as sumcnt, location from
               (select "A" as location, name, cnt from data1
                union all
                select "B" as location, name, cnt from data2
                union all
                select "C" as location, name, cnt from data3
               )
        group by name
    )
order by sumcnt desc name location;

quit;

Open in new window


In the end, with a small dataset I'm not surprised that the sql ran faster - it has only one set up whereas the datastep code has 3 set ups, which will produce relative big overheads for small processing.

I would be interested in knowing how you go with 100's of millions of records.

Ian
0
 
labradorchikAuthor Commented:
Ian, thank you very much!
Sort worked fine! I just had to change it to "descending" instead of "decreasing".

I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :)  It would take forever to process all those records.
0

Featured Post

Get free NFR key for Veeam Availability Suite 9.5

Veeam is happy to provide a free NFR license (1 year, 2 sockets) to all certified IT Pros. The license allows for the non-production use of Veeam Availability Suite v9.5 in your home lab, without any feature limitations. It works for both VMware and Hyper-V environments

  • 5
  • 3
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now