We help IT Professionals succeed at work.

SAS Data Sets Manipulations

621 Views
Last Modified: 2012-06-27
I am trying to do some matching of data on a few different variables from two datasets.
Input Files
Data1.sas7bdat has variables: Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, and Var9.
Data2.sas7bdat has variables: Var10, Var11, Var12, and Var13.

Output Files
Data3.sas7bdat  
Counts.log will have all counts for variables:  Var10a, Var10b, Var10c, + input Data1 counts and output Data3 counts.


I am trying to read Data1 data set and output all records to Data3 data set with Var10a, Var10b, and Var10c by using Var10 from Data2.
Here is the process of creation Data3:
1.      Look up Var10 from Data2 and call it Var10a. Use Var1, Var2, Var3 from Data1 to match records value for variables Var11, Var12, and Var13 in the Data2.
2.       Look up Var10 from Data2 and call it Var10b. Use Var4, Var5, Var6 from Data1 to match records value for variables Var11, Var12, and Var13 in the Data2.
3.      Look up Var10 from Data2 and call it Var10c. Use Var7, Var8, Var9 from Data1 to match records value for variables Var11, Var12, and Var13 in the Data2.
4.      Then I need to choose between Var10 values as following:
 if Var10a ne blank, then Var10 = Var10a
else if Var10b ne blank, Var10 = Var10b
else if Var10c ne blank, Var10 = Var10c
else Var10 = blank

5.      Count how many times Var10 was assigned from Var10a, Var10b, and Var10c and write into Counts.Log .
6.      Count all input and output records and write into Counts.Log
Note: values in variables Var11, Var12, and Var13 from Data2 have similar values as these 3 sets of variables in Data1:  
set 1 (Var1, Var2, Var3)
set2 (Var4, Var5, Var6)
set3 (Var7, Var8, Var9)

This is how I started, but I am not sure how should I proceed next to match my data from both datasets??  Any help will be greatly appreciated!  Thank you!

data Data3;
 set Data1 end=last;






  
  if last then do;
   file 'Counts' mod;               
    put ' '; 
    put '*****************************************************************';
    put '* Total Input Records                        ' _n_;
    put '* Total Output Records                       ' outrec; 
    put '*  Total Var10s came from Var10a             ' acnt;
    put '*  Total Var10s came from Var10b             ' bcnt;
    put '*  Total Var10s came from Var10c             ' ccnt;
    put '*****************************************************************';
    put ' ';
  end;  
run;

Open in new window

Comment
Watch Question

This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION

Author

Commented:
Wow, proc SQL as always looks so completed..., but code runs with no errors at all!! Thank you very much!

A question:
At what point in this program should I output Data3 (my final output dataset)?  Is it the same as table "final3"?  Should I just place "Data3" instead of "final3"?

I am still making some corrections in my testing data, so will run again tomorrow with all records and see if it works as it is now. If any problems I will let you know tomorrow.

Thank you again!!
Hi

Your final datastep that you wanted would be Step4 I'm guessing.

PS - PROC SQL is really handy for some data manipulations.  I would suggest (if yo are able) to go on the PROC SQL course that SAS offers.  I've found its one of the better ones for someone wanting to be more proficient in SAS.

Author

Commented:
I had to add one thing to my requirements section 5:
5. Count how many times Var10 was assigned from Var10a, Var10b, Var10c or left blank and write into Counts.Log .


When I test this code I am not getting any errors, but my counters for SumA, SumB, and SumC are all apear as zeros, but BlankC counter has 4 times more counts than records written into the file. My guess is that counter BlankC is also counting SumA, SumB, and SumC counts, but how that be if they all zeros?

Note: SumA, SumB, and SumC were all zeros before I even added BlankC counter.

Please see below where I added counter "BlankC" in the code:

PROC SQL;
  Create table step1 as
   select a.*, b.Var10 as Var10a from  Data1 as a
left join Data2 as b
  on a.Var1 = b.Var11
   and a.Var2 = b.Var12
   and a.Var3 = b.Var13;

  Create table step2 as
   select a.*, b.Var10 as Var10b from  Step1 as a
left join Data2 as b
  on a.Var4 = b.Var11
   and a.Var5 = b.Var12
   and a.Var6 = b.Var13;


  Create table step3 as
   select a.*, b.Var10 as Var10c from  Step2 as a
left join Data2 as b
  on a.Var8 = b.Var11
   and a.Var9 = b.Var12
   and a.Var10 = b.Var13;

QUIT;

DATA Data3;
 set Step3;
retain countA countB countC 0;
 if Var10a ne "" then do;
              Var10 = Var10a;
              CountA = CountA + 1;
end;
               
else if Var10b ne "" then do;
           Var10 = Var10b;
           CountB = CountB + 1;
end;
else if Var10c ne "" then do;
          Var10 = Var10c;
           CountC = CountC + 1;
end;
else Var10 = "";
BlankC = BlankC + 1;
RUN;

PROC SQL;
  create table final as
  select count(*) as freq, sum(CountA) as SumA, sum(CountB) as SumB, 
                                         sum(CountC) as SumC, sum(BlankC) as Blank from Data3;

 create table final2 as
  select count(*) as InputObs from Data1;

create table final3 as
  select count(*) as OutPutObs from Data3;
QUIT;

DATA _NULL_;
Merge final final2 Final3;
filename clog "c:\counts.log";
file clog mod;
    put ' '; 
    put '*****************************************************************';
    put '* Total Input Records                               ' inputObs;
    put '* Total Output Records                            ' outPutObs; 
    put '*  Total Var10s came from Var10a             ' SumA;
    put '*  Total Var10s came from Var10b             ' SumB;
    put '*  Total Var10s came from Var10c             ' SumC;
    put '*  Total Var10s =Blank                               ' Blank;
    put '*****************************************************************';
    put ' ';
run;

Open in new window



Note: Data1 has only 7 records with variables Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, and Var9. Also, (I doubled checked) all data in Data1 dataset is matched with data in Data2 with variables Var11, Var12, and Var13,  so I am puzzled why Var10a, Var10b, and Var10c are empty as well as their counters?

This is how my output in counts.log looks like:
********************************************************************
* Total Input Records                                   7
* Total Output Records                                7
*  Total Var10s come from Var10a                 0
*  Total Var10s came from Var10b                 0
*  Total Var10s came from Var10c                 0
*  Total Var10s = Blank                                  28
********************************************************************

I am trying to finish this tomorrow if possible, so may be you know another way to tackle this issue? May be somehow with MACROS?

I will be testing this code tomorrow first thing in the morning and let you how it goes.

BTW: Thank you for your advice regarding the SQL class, - I will see if I can sign up for one.

Author

Commented:
Thank you very much! Program worked fine after a few small changes.
Unlock the solution to this question.
Join our community and discover your potential

Experts Exchange is the only place where you can interact directly with leading experts in the technology field. Become a member today and access the collective knowledge of thousands of technology experts.

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.