labradorchik (United States of America) asked:

How to compute the count of duplicates found in both Data1 and Data2 and in Data3, and in one or the other?

Hello,
For testing and verification purposes, I am trying to compute the count of good duplicates found in both Data1 and Data2 and in Data3, and in one or the other. These are the steps I am trying to accomplish to get to my final results:

#1.  Create a list (Data4) of unique records from Data1 and Data2 where variable rec_flag=A, B, C, D.
#2.  Build a duplicates file (Data5) from the list in #1 by generating one record that links each two records whose Var3 and Var4 are tied to the same Var6 and Var7 (see the sketch just after this list).
#3.  Match the Data5 file with Data3 where variable rec_flag=B, C, D, using key variables Var3, Var4, Var6, and Var7.
#4.  Display in the output .lis the number of records in:
     #4.1 Both Data5 and Data3
     #4.2 Data5 only
     #4.3 Data3 only
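
To make #2 concrete, the pairing I have in mind would look roughly like the PROC SQL self-join below. This is only a sketch and is not tested; the Data4/Data5 names simply follow the step numbering above (in my code further down, the list from #1 ends up being called n_links).

/* Sketch only: pair every two records from the unique list (Data4)
   that share the same Var6 and Var7, so that each Data5 record links
   the two originals.  Each pair would come out twice (once in each
   order), so some de-duplication would still be needed afterward. */
proc sql;
   create table Data5 as
   select a.Var3, a.Var4, a.Var6, a.Var7,
          b.Var3 as Var3_link, b.Var4 as Var4_link
   from Data4 as a, Data4 as b
   where a.Var6 = b.Var6 and a.Var7 = b.Var7
     and (a.Var3 ne b.Var3 or a.Var4 ne b.Var4);
quit;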
 
The issue that I am having is with #2, where I need to build the duplicate file Data5; I am not sure how to turn the pairing idea sketched above into working code.

Here is my SAS code below with missing step #2:

libname mydir "/home/myprofile/links"; 
%macro cnt_lnks;

 /*  #1 */
proc sort data=mydir.Data1
          out=Data1out(where=(rec_flag in ('A','B','C','D')) keep=Var3 Var4 Var6 Var7 rec_flag);
          by Var3 Var4 Var6 Var7;
run;

proc sort data=mydir.Data2
          out=Data2out(where=(rec_flag in ('A','B','C','D')) keep=Var3 Var4 Var6 Var7 rec_flag);
          by Var3 Var4 Var6 Var7;
run;

/*********************************************
 Datastep: merge Data1 & Data2
           by Var3 Var4 Var6 Var7
 ********************************************/
data all_links1;
  merge Data2out (in=ina)
        Data1out (in=inb);
  by Var3 Var4 Var6 Var7;
  if ina and inb;
run;

/***********************************
 Remove duplicate records
 In: all_links1
 out: n_links
 ************************************/
proc sort data=all_links1 nodupkey out=n_links;
     by Var3 Var4 Var6 Var7;
run;

/* #2  -  how can I build this one?  */



/******************************************
#3
 Create Data3out: keep recs where rec_flag=B,C,D
 in: Data3
 out: Data3out
 *******************************************/
data Data3out(where=(rec_flag in('B','C','D')) keep=Var3 Var4 Var6 Var7 rec_flag);
 set mydir.Data3;
run;

/*******************************************
 sort Data3out by Var3 Var4 Var6 Var7
 *******************************************/
proc sort data= Data3out; by Var3 Var4 Var6 Var7; run;

/****************************************
 Datastep: merge n_links & Data3out
 in: n_links & Data3out
 out: all_links2
 *****************************************/

data all_links2;
 merge n_links (in=ina) Data3out (in=inb);
 by Var3 Var4 Var6 Var7;
 if ina and inb;
run;

/*****************************************
 #4 
Counts:
 1. Total recs: Combined Data2out & Data1out, where rec_flag=A,B,C,D & nodups
 2. Total recs: Data3out, where rec_flag=B,C,D
 3. Total recs: combined [Data2out & Data1out] & Data3out
 ***********************************************/

PROC SQL NOPRINT;
    SELECT COUNT(*)
    INTO:combo1
    FROM n_links;

    SELECT COUNT(*)
    INTO:Data3out2
    FROM Data3out;

    SELECT COUNT(*)
    INTO:combo2
    FROM all_links2;
quit;

 /**********************************************
  Output counts
  out: Counts.lis
  **********************************************/

  data _null_;
   file "Counts.lis";
   put 50*'*';
   put " ";
   put "Additional counts for Data3out and Data1out/Data2out duplicates";
   put " ";
   put "Data1out and Data2out where REC_FLAG=A: &combo1";
   put "Data3out & Data1out and Data2out: &combo2";
   put "Data3out where REC_FLAG= B,C,D: &Data3out1";
  run;
%mend cnt_lnks;
%cnt_lnks;

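For #4, I am assuming that once Data5 exists the three counts can come out of a single merge with IN= flags, along the lines of the sketch below (again untested; dataset and key names follow my code above):

/* Sketch only: split the Data5 / Data3out match into the three groups
   needed for #4 and count each one.  Assumes both datasets are already
   sorted by the key variables. */
data both_ds d5_only d3_only;
  merge Data5 (in=in5) Data3out (in=in3);
  by Var3 Var4 Var6 Var7;
  if in5 and in3 then output both_ds;
  else if in5 then output d5_only;
  else if in3 then output d3_only;
run;

proc sql noprint;
  select count(*) into :cnt_both   from both_ds;
  select count(*) into :cnt_d5only from d5_only;
  select count(*) into :cnt_d3only from d3_only;
quit;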

d-glitch (United States of America) commented:

What is your data?  How big are your lists?
Why are there three lists?  And how do some records wind up in more than one list?
How do you define a good duplicate?  Are there also bad duplicates?

I don't know anything about SAS, but the most efficient way to identify duplicates in multiple lists is to sort the lists and go through them in a linear fashion.
labradorchik (asker) replied:

Hi,
Thank you for your comments and questions!

Each SAS dataset in my example contains hundreds of thousands of records. Yes, these are very large datasets, and sorting/merging the data in them is not easy.

I am trying to match these records on specific keys and count 100% duplicates (good duplicates).  

A bad duplicate, at least in my experience, is a match between two similar records that is not a 100% match, meaning a human error or system error may have occurred when a record was originally created. In my counts, I would only like to get 100% matches.

Thank you again for your questions!
Maybe your questions will also help someone with more SAS experience.
ASKER CERTIFIED SOLUTION by labradorchik (United States of America). The solution text is available only to Experts Exchange members.