How do I find the distribution of common values between 2 datasets?

SASnewbie
SASnewbie used Ask the Experts™
on
Hi,

I have 2 datasets (1 & 2) that have 4 variables in common (A, B. C, D). How do I find the frequency (or count) of the common values found in both datasets?

Variable A is the merge by variable.

For example:
Dataset1:
Variable B has 50 records containing one of these values (10, 15, 25, 45...)

Dataset2:
Variable B has 150 records containing one of these values (10, 15, 25, 45, 65 80...)

How can I produce an output like this?

Frequency of common values between dataset1 and dataset2:

VariableB      
Value    Dataset1 Freq       Dataset2 Freq   Common Freq     % of Total      Total (combined)
    10               75                                80                        50                25.0%                 200
    15               25                                40                        25                12.5%                 200
    25               50                                80                        10                   5.0%                200

And so on for Variables C and D...
Thanks in advance
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Most Valuable Expert 2012
Distinguished Expert 2018

Commented:
I'm not a SAS person.  I'm only responding to a neglected question alert.

From looking in the docs it looks like COMPARE can get you most of what you want if not all of it.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#compare-overview.htm
what do you mean by common frequency? i'm not understanding how (using your example above) common frequency is 50 for value 10 when dataset1 frequency is 75 and dataset2 frequency is 80.
SASnewbieData Analyst - Project Management Office

Author

Commented:
hi lowaloysius,

The frequency is 75 when looking for the frequency of variableB value 10 in dataset1 alone.
The frequency is 80 when looking for the frequency of variableB value 10 in dataset2 alone.
The frequency is 50 when looking for the frequency of variableB value 10 where it is found in BOTH dataset1 AND dataset2.

Something like this:
rsubmit;proc freq data = dataset1;tables variableB   / out = freq_results_dataset1;run;endrsubmit;
rsubmit;proc freq data = dataset2;tables  variableB   / out = freq_results_dataset2 ;run;endrsubmit;
rsubmit;proc freq data = datasets_combined;tables variableB  / out = freq_results_combined2;run;endrsubmit;

Thanks,
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

SASnewbieData Analyst - Project Management Office

Author

Commented:
I want to change the combined proc freq to the below. It is created by merging dataset1 and dataset2 by VariableA:

rsubmit;proc freq data =datasets_merged;tables  VariableB / out = freq_results_combined;run;endrsubmit;
and if i understand you correctly, you want 1 proc freq step to accomplish this? are you able to accept other solutions like calculating separately and creating a summary table before using proc print?
SASnewbieData Analyst - Project Management Office

Author

Commented:
hi lowaloysius,

I am open to any method.
i'm thinking of performing a few proc sqls :-
1. get the frequency of each value of variable B in dataset1, group by variable B
2. get the frequency of each value of variable B in dataset2, group by variable B
3. get the frequency of each value of variable B in the merged dataset, group by variable B
4. get the total count of values of variable B in the merged dataset (or row count?), and store into a macro variable
5. perform a full join of the results of 1, 2, and 3 above, at the same time create 2 new columns - the % and the total count, using the macro variable from 4 above.

then do a proc print of the result of 5 above

does this sound feasible to you?
SASnewbieData Analyst - Project Management Office

Author

Commented:
Hi loyaloysius,

I have worked 1-3. Could you provide examples of how to do 4 and 5?
for #4 (i would assuming that every record has a value in variable B, and even if there's no value, it should still be counted):
proc sql;
select count(1) into :total from merged dataset name
quit;

for #5 (must run #4 to initialize the macro variable total first):
proc sql;
  create table final_frequency as
    select coalesce(a.variableB, b.variableB, c.variableB) as variableB,
      a.frequency column name as dataset1_freq,
      b.frequency column name as dataset2_freq,
      c.frequency column name as common_freq,
      c.frequency column name / &total as percent_of_total,
      &total as total_combined
    from summarized step 1 dataset name a
    full join summarized step 2 dataset name b
    on a.variableB = b.variableB
    full join summarized step 3 dataset name c
    on a.variableB = c.variableB;
quit;

then do proc print:
proc print data = final_frequency noobs; run;

note that the text in italics are tables names which you'll need to replace accordingly with what you have currently

note also i wrote this code blindly i.e. no testing is done as i don't have an environment to test to ensure the syntax etc is correct. however, if there are any errors, it shouldn't be too far off and if you need help, post the log for the section of code in question for troubleshooting :)
SASnewbieData Analyst - Project Management Office

Author

Commented:
thank you. I'll give this a run!

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial