Link to home
Start Free TrialLog in
Avatar of SASnewbie
SASnewbie

asked on

How do I find the distribution of common values between 2 datasets?

Hi,

I have 2 datasets (1 & 2) that have 4 variables in common (A, B. C, D). How do I find the frequency (or count) of the common values found in both datasets?

Variable A is the merge by variable.

For example:
Dataset1:
Variable B has 50 records containing one of these values (10, 15, 25, 45...)

Dataset2:
Variable B has 150 records containing one of these values (10, 15, 25, 45, 65 80...)

How can I produce an output like this?

Frequency of common values between dataset1 and dataset2:

VariableB      
Value    Dataset1 Freq       Dataset2 Freq   Common Freq     % of Total      Total (combined)
    10               75                                80                        50                25.0%                 200
    15               25                                40                        25                12.5%                 200
    25               50                                80                        10                   5.0%                200

And so on for Variables C and D...
Thanks in advance
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

I'm not a SAS person.  I'm only responding to a neglected question alert.

From looking in the docs it looks like COMPARE can get you most of what you want if not all of it.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#compare-overview.htm
what do you mean by common frequency? i'm not understanding how (using your example above) common frequency is 50 for value 10 when dataset1 frequency is 75 and dataset2 frequency is 80.
Avatar of SASnewbie

ASKER

hi lowaloysius,

The frequency is 75 when looking for the frequency of variableB value 10 in dataset1 alone.
The frequency is 80 when looking for the frequency of variableB value 10 in dataset2 alone.
The frequency is 50 when looking for the frequency of variableB value 10 where it is found in BOTH dataset1 AND dataset2.

Something like this:
rsubmit;proc freq data = dataset1;tables variableB   / out = freq_results_dataset1;run;endrsubmit;
rsubmit;proc freq data = dataset2;tables  variableB   / out = freq_results_dataset2 ;run;endrsubmit;
rsubmit;proc freq data = datasets_combined;tables variableB  / out = freq_results_combined2;run;endrsubmit;

Thanks,
I want to change the combined proc freq to the below. It is created by merging dataset1 and dataset2 by VariableA:

rsubmit;proc freq data =datasets_merged;tables  VariableB / out = freq_results_combined;run;endrsubmit;
and if i understand you correctly, you want 1 proc freq step to accomplish this? are you able to accept other solutions like calculating separately and creating a summary table before using proc print?
hi lowaloysius,

I am open to any method.
ASKER CERTIFIED SOLUTION
Avatar of Aloysius Low
Aloysius Low
Flag of Singapore image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi loyaloysius,

I have worked 1-3. Could you provide examples of how to do 4 and 5?
for #4 (i would assuming that every record has a value in variable B, and even if there's no value, it should still be counted):
proc sql;
select count(1) into :total from merged dataset name
quit;

for #5 (must run #4 to initialize the macro variable total first):
proc sql;
  create table final_frequency as
    select coalesce(a.variableB, b.variableB, c.variableB) as variableB,
      a.frequency column name as dataset1_freq,
      b.frequency column name as dataset2_freq,
      c.frequency column name as common_freq,
      c.frequency column name / &total as percent_of_total,
      &total as total_combined
    from summarized step 1 dataset name a
    full join summarized step 2 dataset name b
    on a.variableB = b.variableB
    full join summarized step 3 dataset name c
    on a.variableB = c.variableB;
quit;

then do proc print:
proc print data = final_frequency noobs; run;

note that the text in italics are tables names which you'll need to replace accordingly with what you have currently

note also i wrote this code blindly i.e. no testing is done as i don't have an environment to test to ensure the syntax etc is correct. however, if there are any errors, it shouldn't be too far off and if you need help, post the log for the section of code in question for troubleshooting :)
thank you. I'll give this a run!