Avatar of SASnewbie
 asked on

How do I find the distribution of common values between 2 datasets?


I have 2 datasets (1 & 2) that have 4 variables in common (A, B. C, D). How do I find the frequency (or count) of the common values found in both datasets?

Variable A is the merge by variable.

For example:
Variable B has 50 records containing one of these values (10, 15, 25, 45...)

Variable B has 150 records containing one of these values (10, 15, 25, 45, 65 80...)

How can I produce an output like this?

Frequency of common values between dataset1 and dataset2:

Value    Dataset1 Freq       Dataset2 Freq   Common Freq     % of Total      Total (combined)
    10               75                                80                        50                25.0%                 200
    15               25                                40                        25                12.5%                 200
    25               50                                80                        10                   5.0%                200

And so on for Variables C and D...
Thanks in advance

Avatar of undefined
Last Comment

8/22/2022 - Mon
slightwv (䄆 Netminder)

I'm not a SAS person.  I'm only responding to a neglected question alert.

From looking in the docs it looks like COMPARE can get you most of what you want if not all of it.

Aloysius Low

what do you mean by common frequency? i'm not understanding how (using your example above) common frequency is 50 for value 10 when dataset1 frequency is 75 and dataset2 frequency is 80.

hi lowaloysius,

The frequency is 75 when looking for the frequency of variableB value 10 in dataset1 alone.
The frequency is 80 when looking for the frequency of variableB value 10 in dataset2 alone.
The frequency is 50 when looking for the frequency of variableB value 10 where it is found in BOTH dataset1 AND dataset2.

Something like this:
rsubmit;proc freq data = dataset1;tables variableB   / out = freq_results_dataset1;run;endrsubmit;
rsubmit;proc freq data = dataset2;tables  variableB   / out = freq_results_dataset2 ;run;endrsubmit;
rsubmit;proc freq data = datasets_combined;tables variableB  / out = freq_results_combined2;run;endrsubmit;

Your help has saved me hundreds of hours of internet surfing.

I want to change the combined proc freq to the below. It is created by merging dataset1 and dataset2 by VariableA:

rsubmit;proc freq data =datasets_merged;tables  VariableB / out = freq_results_combined;run;endrsubmit;
Aloysius Low

and if i understand you correctly, you want 1 proc freq step to accomplish this? are you able to accept other solutions like calculating separately and creating a summary table before using proc print?

hi lowaloysius,

I am open to any method.
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
Aloysius Low

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question

Hi loyaloysius,

I have worked 1-3. Could you provide examples of how to do 4 and 5?
Aloysius Low

for #4 (i would assuming that every record has a value in variable B, and even if there's no value, it should still be counted):
proc sql;
select count(1) into :total from merged dataset name

for #5 (must run #4 to initialize the macro variable total first):
proc sql;
  create table final_frequency as
    select coalesce(a.variableB, b.variableB, c.variableB) as variableB,
      a.frequency column name as dataset1_freq,
      b.frequency column name as dataset2_freq,
      c.frequency column name as common_freq,
      c.frequency column name / &total as percent_of_total,
      &total as total_combined
    from summarized step 1 dataset name a
    full join summarized step 2 dataset name b
    on a.variableB = b.variableB
    full join summarized step 3 dataset name c
    on a.variableB = c.variableB;

then do proc print:
proc print data = final_frequency noobs; run;

note that the text in italics are tables names which you'll need to replace accordingly with what you have currently

note also i wrote this code blindly i.e. no testing is done as i don't have an environment to test to ensure the syntax etc is correct. however, if there are any errors, it shouldn't be too far off and if you need help, post the log for the section of code in question for troubleshooting :)

thank you. I'll give this a run!
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck