labradorchik
asked on
In SAS how to create sum of values of counts for each variable?
I am trying to sum (sumcnt) the values of counts (cnt) for each variable (name).
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?
Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.
Example of how "sumcnt" variable should look like:
My current code:
How can I create in my code variable "sumcnt" to sum all counts for each name for all locations?
Note: I am using three different datasets (data1, data2, data3) with different locations for names to create one dataset (combined) with all names and locations.
Example of how "sumcnt" variable should look like:
name cnt sumcnt location
John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C
John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C
My current code:
data combined (keep = name cnt location);
length location $ 1;
set data1 (in=a)
data2 (in=b)
data3 (in=c);
location = "*";
if a then location = "A";
if b then location = "B";
if c then location = "C";
run;
ASKER
Thank you very much!!
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?
This way works fine but it is very slow with large SAS datasets. Is there a different way (more efficient way) that might take faster for the program to process large datasets?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Hi there labradorchik,
Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work. Note there is a sort but generally sorts have been well optimized and go quite fast.
First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the cnt values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt will be just partial sums for the other locations for each name.
Next we sort it all so that the names and locations are now in increasing order.
Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total). We drop the variable (sumC) which had all the previous partial sums.
NOTE
====
proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.
Hope this helps.
Ian
Assuming that data1, data2 and data3 are sorted by name (this assumptions allows us to skip a step) and you want the combined dataset in sorted order by name and location then the following code will work. Note there is a sort but generally sorts have been well optimized and go quite fast.
First we combine the datasets in the reverse order (C, then B, then A) which saves a sort step. While we are combining we sum up the cnt values so that we will have the total in the last found location in our sequence but really the "first" location (probably location A but the code will work if A is missing for some names). In this dataset sumcnt will be just partial sums for the other locations for each name.
Next we sort it all so that the names and locations are now in increasing order.
Lastly we process the dataset making sumcnt for all records in a name equal to the first found sum (which will be the total). We drop the variable (sumC) which had all the previous partial sums.
NOTE
====
proc sql is very useful (and what theartfuldazzler has shown is spot on). I use it quite a lot, but for some cases proc sql can run significantly slower than coding using data steps.
data combinedTemp1 (keep = name cnt sumC location);
length location $ 1;
retain sumC;
set data3 (in=c)
data2 (in=b)
data1 (in=a);
by name;
** This set (with 3 data sets and the following by statement produces an interleaving **;
** statement below may look a bit funny, but if does NOT use
conditionals (with the associated branches which can slow us down).
note that only one of a, b and c can be true (= 1) at any one time, the others
are false (= 0).
**;
location = substr("ABC", a + 2*b + 3*c, 1);
if first.name then sumC = 0;
sumC = sumC + cnt;
run;
proc sort data=combinedTemp1 out=combinedTemp2;
by name location;
run;
data combined(drop=sumC);
******************************************************************
** of the following 4 statements we really only need the retain **
** sumcnt, but listing all these names in order results in the **
** output dataset having them in in the required order **
******************************************************************;
length name $ 12;
length cnt 8;
retain sumcnt;
length location $ 1;
set combinedTemp2;
by name location;
if first.name then sumcnt = sumC;
run;
Hope this helps.
Ian
ASKER
Thank you very much to both of you!!
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.
One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".
Ian, I just ran both codes (yours and theartfuldazzler's code). Actually, theartfuldazzler's code runs significantly faster at the moment but I am still testing both codes to see if results are the exactly the same.
One question:
I am not sure if this even possible but is there a way to have my output dataset (combined) sorted in order of decreasing "sumcnt"? I just realized that I don't need to sort by "name".
Yes:
PROC SQL;
Create table combined2 as
select name, cnt, sum(cnt) as sumcnt, location from
(select "A" as location, name, cnt from data1
union all
select "B" as location, name, cnt from data2
union all
select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;
PROC SQL;
Create table combined2 as
select name, cnt, sum(cnt) as sumcnt, location from
(select "A" as location, name, cnt from data1
union all
select "B" as location, name, cnt from data2
union all
select "C" as location, name, cnt from data3)
group by name
order by calculated sumcnt desc;
RUN;QUIT;
ASKER
for some reason it is still comes out by "name". Is there a way that dataset can come out as by decreasing "sumcnt"?
for example:
for example:
name cnt sumcnt location
John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C
Adam 01 12 A
Adam 04 12 B
Adam 03 12 C
Adam 04 12 B
John 13 53 A
John 20 53 B
John 20 53 C
Mike 03 45 A
Mike 30 45 B
Mike 12 45 C
Adam 01 12 A
Adam 04 12 B
Adam 03 12 C
Adam 04 12 B
ASKER
Anyone have any comments or suggestions?
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Ian, thank you very much!
Sort worked fine! I just had to change it to "descending" instead of "decreasing".
I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :) It would take forever to process all those records.
Sort worked fine! I just had to change it to "descending" instead of "decreasing".
I am testing with about 100,000 records and it is kind of fast. I agree with you, if I was testing with 100's millions of records I should probably not use proc SQL. :) It would take forever to process all those records.
Try:
Open in new window