Link to home
Start Free TrialLog in
Avatar of labradorchik
labradorchikFlag for United States of America

asked on

SAS Data Sets Manipulations

Hello,
I would like to very that one SAS dataset "Data1" has the same amount of records/counts that my second SAS dataset "Data2". For testing purposes I placed exactly 20 records in "Data1" dataset and 20 records in "Data2" dataset, so they should match.
This is what I have so far:

data temp1;
 length count1 10;
 retain count1 0;
  set mylab.Data1;
count1 + 1;
run;

data temp2;
 length count2 10;
 retain count2 0;
  set mylab.Data2;
count2 + 1;
run;


How would I compare both datasets' records/counts (count1 and count2) automatically in my program?  I know how to compare records/counts within one dataset but what if I have two different datasets?? Should I use macro, if so, how should I use it?
Note: records/counts in my both datasets should equal, if not I would have write a message to an another file saying "There is discrepancy in records count! " 

Thanks in advance!
Avatar of Aloysius Low
Aloysius Low
Flag of Singapore image

i would use proc compare... however i'm at home now, so i don't have the full syntax (sorry it isn't something i use frequently, so i'm not able to recall the syntax offhand)...

you could see http://support.sas.com/techsup/technote/ts440.pdf for hints on how to do it...

before you use proc compare, you should sort both datasets according to the variables you would like to compare.. typically when i do a 'exact match' comparison, i would sort them by all the variables and compare by observation ID (aka row number)... alternatively, you could also compare by key fields...

to implement the last step, the idea would be to check the output dataset... however, i'll need to be in office to write some code and test so that i can advise further on this so that's going to be another 9 hours or so (it's night time here now...)
Avatar of labradorchik

ASKER

lowaloysius: I am also looking into this problem myself, but please do post an example of this verification whenever you do have a chance!
Note: I can probably do this on just one key variable.
lowaloysius: From your link above I see PROC COMPARE is used to compare variables but not records. In my case I just need to compare two datasets and make sure they have the same amount of records written.
I was thinking to use MACROs to store 'count1' and 'count2' and then compare two MACRO counts together, but I do not know how to do that yet.
I use Proc Compare quite a bit and it can be used to compare records.  Here's some log and output content generated comparing a test read of an external file to the SAS data set used to create it.  The proc compares all observations and all variables.  Record comparison is done on a 1:1 basis so the datasets must have the same number of obs and they must be in the same sort order.

1741  proc compare
1742      base=trigger.demand_call_file_&_fileDate
1743      compare=testRead;
1744  run;

<some output not shown>
Number of Observations in Common: 293305.
Total Number of Observations Read from TRIGGER.DEMAND_CALL_FILE_20110315: 293305.
Total Number of Observations Read from WORK.TESTREAD: 293305.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 293305.

If any records are not exactly the same then the Number of Obs with Some Compared Vars Unequal will not be zero.  Proc Compare has lots of options for identifying differences so you'll want to read up on them in the on-line documentation.  The code examples are very good.

As might be expected, the larger the data sets being compared the slower Proc Compare runs.  This example has 293,305 records each with 7 fields and it takes about 6 minutes to run.  2 fields are numeric and 6 are character but I don't know if the mix makes a difference in performance.
ASKER CERTIFIED SOLUTION
Avatar of d507201
d507201
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
d507201 Thanks!!  This this exactly what I was looking for!
Questions: Are "_dsid"  and  "_rc" stand for counter names in your code?  Did you place "_" before these two names for a reason or its just part of the name of those counter names?

Also, If I need to output message to file "File1" after verifying my counts should my code look like below?

  data _null_;
      if "&_numRecs1" ne "&_numRecs2" then
       file File1 mod;
              put "Number of records is not equal";
      else put "Number of records is equal";
run;

Of course, I would have to define "File1" location directory prior to this data step.
are you trying to compare records and the count or just the count? the way i read it is compare records... if all you need is the count comparison, you can refer to SASHELP.VTABLE. This is a view containing metadata information about tables currently registered with the SAS session.

if you need to compare records (contents of a table), then you'll need PROC COMPARE.

proc sql;
    select nobs into :count1 from SASHELP.VTABLE where LIBNAME = "[library name for table 1]" and MEMNAME = "[table name 1]";
    select nobs into :count2 from SASHELP.VTABLE where LIBNAME = "[library name for table 2]" and MEMNAME = "[table name 2]";
quit;


for your code to output to file, you have placed the file statement in the wrong place - need to swap lines with the if statement:
data _null_;
  file File1 mod;
    if &count1 ne &count2 then
      put "Number of records is not equal";
    else put "Number of records is equal";
run;

also, i would not recommend putting the variables within quotation marks. it's better to compare the numbers as numbers instead of text.
I am sorry for the confusion! I just need to verify counts of records in both datastes by comparing the counts.

Note: there also more things to consider in my code that I forgot to mention before:
1. "Data1" has many records with 'Var5' equal from 01 to 99 in those records. So, for example 'Var5' may be found with the same value of 01 or other values in two or three records within "Data1".    
2. "Data2" only has two variables:
   a. 'Var5' which has values from 01 to 99.
   b. 'Var4' which has a value or a number of counts for each value in 'Var5'.  


So will this code below work or should I better use PROQ SQL with SASHELP.VTABLE?  I think I can leave first three lines of code and the last five, but in the middle section of code (with "Data2") I would probably have to change a few things. Since 'Var4' in "Data2" has already number of counts for each value in 'Var5' I can probably just use that number to compare my counts with "Data1".  
Any help would be greatly appreciated!

%let _dsid=%sysfunc(open(mylab.Data1));
%let _numRecs1=%sysfunc(attrn(%eval(&_dsid-0),nobs));
%let _rc=%sysfunc(close(%eval(&_dsid-0)));

%let _dsid=%sysfunc(open(mylab.Data2));
%let _numRecs2=%sysfunc(attrn(%eval(&_dsid-0),nobs));    
%let _rc=%sysfunc(close(%eval(&_dsid-0)));


data _null_;
   file File1 mod;
      if &_numRecs1 ne &_numRecs2 then put "Number of records is not equal";
      else put "Number of records is equal";
run;

Thanks!
from your description, it appears that Var4 in Data2 is not useful and meant to add confusion...

also, it's not clear from your post whether duplicate records in Data1 should be counted as 1 or the actual number of times the records appear

finally, it's also not clear if you are supposed to match the record count by the value in Var5. what i mean is if in Data1, you have 4 records of Var5 = "01", and in Data2, you have 1 record where Var5 = "01" and Var4 = 3, are you supposed to prompt that there's a mismatch, even though the final record count (count of rows in Data1 and Data2) tallies? if not, it adds further to my first statement that Var4 serves to confuse only. this means that either way, the original suggestion and the alternative i have provided will work.

whether or not to use the PROC SQL provided by me, or the system functions is up to you - both serve the same function. the SASHELP.VTABLE presents a much easier way to understand the metadata of the table.
lowaloysius:
You question was:
"also, it's not clear from your post whether duplicate records in Data1 should be counted as 1 or the actual number of times the records appear"

I just doubled check this, - there will be no duplicate records in Data1, so it is actual number of times records appear in the Data1 dataset.
Both datastes designed that way that they should both match.  There should be no mismatches.

Thank you in advance!
For the code I sent, "_dsid"  holds the data set identifier that SAS assigns when it opens the data set and  "_rc" is a sort of dummy variable--return code--that's needed for the close function to work.  I usually start my macro variable names with "_" as another visual cue.  

If you're comparing record counts regardless of duplicatest or the record contents then either will work.  If you need to exclude dups or apply conditions then use a select count () statement in proc sql so that you can use the unique constraint or apply where criteria.
if that's the case, Var4 in Data2 serves no purpose but to add confusion, unless there's specific requirement to use it... given this, it does not matter which method you use - SASHELP.VTABLES, or to run the system functions to read the number of observations
Hello,
After running below code I received the following messages:

%let _dsid=%sysfunc(open(mylab.Data1));
%let _numRecs1=%sysfunc(attrn(%eval(&_dsid-0),nobs));
%let _rc=%sysfunc(close(%eval(&_dsid-0)));

%let _dsid=%sysfunc(open(mylab.Data2));
%let _numRecs2=%sysfunc(attrn(%eval(&_dsid-0),nobs));    
%let _rc=%sysfunc(close(%eval(&_dsid-0)));

data _null_;
   file File1 mod;
      if &_numRecs1 ne &_numRecs2 then
         put  "Current Data1 count is                     "  &_numRecs1;
         put "Expected records count in Data2 is  "  &_numRecs2;
 else put "Number of records is equal";
run;

**************************************************************************************
546           put "Current Data1 count is           " &_numRecs1;
NOTE: Line generated by the macro variable "_NUMRECS1".
546   3
         -
        22
ERROR 22-322: Syntax error, expecting one of the following: a name, #, +, @.

547           put "Expected records count in Data2 is " &_numRecs2;
NOTE: Line generated by the macro variable "_NUMRECS2".
547   11
         --
        22
ERROR 22-322: Syntax error, expecting one of the following: a name, #, +, @.

548
549            else put "Number of Records is Equal";
                  ----
                  160
ERROR 160-185: No matching IF-THEN clause.
***********************************************************************************************************

Any suggestions?  
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
In this case the macro variables won't resolve proberly unless they're in double quotes.  SAS is looking for a variables named 3 and 11 and obviously those are invalid names.

put  "Current Data1 count is                       &_numRecs1";
put "Expected records count in Data2 is    &_numRecs2";
Thanks to both of you!!  Both of your comments helped me a lot!!