Solved

In SAS how to create sum of values of counts for each variable?

Posted on 2013-05-20
10
578 Views
Last Modified: 2013-05-22
I am trying to sum (sumcnt) the values of counts (cnt) for each variable (name).
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?

Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.

Example of how "sumcnt" variable should look like:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C

My current code:

data combined (keep = name cnt location);
    length location $ 1;    

    set data1 (in=a)
          data2 (in=b)
          data3 (in=c);

location = "*";              

if a then location = "A";
if b then location = "B";
if c then location = "C";
run;

Open in new window

0
Comment
Question by:labradorchik
  • 5
  • 3
  • 2
10 Comments
 
LVL 11

Expert Comment

by:theartfuldazzler
ID: 39181907
Hi

Try:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from combined
group by name;
RUN;QUIT;

Open in new window

0
 

Author Comment

by:labradorchik
ID: 39182724
Thank you very much!!
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?
0
 
LVL 11

Accepted Solution

by:
theartfuldazzler earned 250 total points
ID: 39183455
One could combine the two steps into one step:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from
  (select "A" as location, name, cnt from data1
   union all
   select "B" as location, name, cnt from data2
   union all
   select "C" as location, name, cnt from data3)
group by name;
RUN;QUIT;

Open in new window


Otherwise, one needs to look at other options like hardware, or changing the location of your work directory.
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 39183464
Hi there labradorchik,

Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work.  Note there is a sort but generally sorts have been well optimized and go quite fast.

First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the   cnt  values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt  will be just partial sums for the other locations for each name.

Next we sort it all so that the names and locations are now in increasing order.

Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total).  We drop the variable (sumC) which had all the previous partial sums.

NOTE
====

proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.

data combinedTemp1 (keep = name cnt sumC location);
    length location $ 1;    
    retain sumC;

    set data3 (in=c)
               data2 (in=b)
               data1 (in=a);
    by name;
** This set (with 3 data sets and the following by statement produces an interleaving **;


    ** statement below may look a bit funny, but if does NOT use
     conditionals (with the associated branches which can slow us down).
     note that only one of a, b and c can be true (= 1) at any one time, the others
     are false (= 0).
    **;
    location = substr("ABC", a + 2*b + 3*c, 1);              

   if first.name then sumC = 0;

   sumC = sumC + cnt;

run;

proc sort data=combinedTemp1 out=combinedTemp2;
by  name location;
run;

data combined(drop=sumC);

    ******************************************************************
    ** of the following 4 statements we really only need the retain **
    ** sumcnt, but listing all these names in order results in the  **
    ** output dataset having them in in the required order          **
    ******************************************************************;
    length name $ 12;    
    length cnt 8;
    retain sumcnt;
    length location $ 1;
       

   set combinedTemp2;
   by name location;

   if first.name then sumcnt = sumC;

run;

Open in new window


Hope this helps.

Ian
0
 

Author Comment

by:labradorchik
ID: 39184129
Thank you very much to both of you!!
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.

One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".
0
Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 11

Expert Comment

by:theartfuldazzler
ID: 39184134
Yes:

PROC SQL;
  Create table combined2 as
    select name, cnt, sum(cnt) as sumcnt, location from
  (select "A" as location, name, cnt from data1
   union all
   select "B" as location, name, cnt from data2
   union all
   select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;
0
 

Author Comment

by:labradorchik
ID: 39184655
for some reason it is still comes out by "name". Is there a way that dataset can come out as by decreasing "sumcnt"?  

for example:

name            cnt            sumcnt     location

John              13             53             A
John              20             53             B
John              20             53             C
Mike             03             45             A
Mike             30             45             B
Mike             12             45             C
Adam            01            12              A
Adam            04            12              B
Adam            03            12              C
Adam            04            12              B
0
 

Author Comment

by:labradorchik
ID: 39185061
Anyone have any comments or suggestions?
0
 
LVL 8

Assisted Solution

by:ShannonEE
ShannonEE earned 250 total points
ID: 39186800
Hi there labradorchik,

Do you really want all the possible rows across the locations, or just the one sumcnt for each Name?  If that is the case then the whole excercise drops out simply! ((For the same price you can get the number of cnt values make up the sumcnt value - ie insert another column, drop the location column and drop all but one row for each name. And in that case a SQL solution will be quite good and easy to write and understand.


Assuming that you really want all those other records with individual cnt values, then change the sort to -

proc sort data=combinedTemp1 out=combinedTemp2;
by  decreasing sumC name location;
run;

Open in new window

and then change the later data step to

   ...
   set combinedTemp2;
   by decreasing sumC name location;
   ...

Open in new window



-----

Alternatively you can try the SQL way, but sometimes using "calculated" in other than select values gives problems.  Hence try slight change to the theartfuldazzler solution -

proc sql;

  Create table combined2 as
    select *
    from
    (
        select name, cnt, sum(cnt) as sumcnt, location from
               (select "A" as location, name, cnt from data1
                union all
                select "B" as location, name, cnt from data2
                union all
                select "C" as location, name, cnt from data3
               )
        group by name
    )
order by sumcnt desc name location;

quit;

Open in new window


In the end, with a small dataset I'm not surprised that the sql ran faster - it has only one set up whereas the datastep code has 3 set ups, which will produce relative big overheads for small processing.

I would be interested in knowing how you go with 100's of millions of records.

Ian
0
 

Author Comment

by:labradorchik
ID: 39188208
Ian, thank you very much!
Sort worked fine! I just had to change it to "descending" instead of "decreasing".

I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :)  It would take forever to process all those records.
0

Featured Post

Netscaler Common Configuration How To guides

If you use NetScaler you will want to see these guides. The NetScaler How To Guides show administrators how to get NetScaler up and configured by providing instructions for common scenarios and some not so common ones.

Join & Write a Comment

Since upgrading to Office 2013 or higher installing the Smart Indenter addin will fail. This article will explain how to install it so it will work regardless of the Office version installed.
Creating and Managing Databases with phpMyAdmin in cPanel.
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

9 Experts available now in Live!

Get 1:1 Help Now