Link to home
Start Free TrialLog in
Avatar of labradorchik
labradorchikFlag for United States of America

asked on

SAS Data Sets Manipulations


Hello, I am trying to append or put together numerous SAS datasets (note: every dataset has its own number) into one dataset "FinalData". Here is example of how my SAS datasets look like:
DataSet_<653>
DataSet_<352>
DataSet_<193>
DataSet_<number> and so on...

The problem that I am experiencing is that my program has to somehow look for all those SAS datasets (with different numbers) in my directory folder first before appending them together into one dataset "FinalData". First, my program has to search all datasets look like "DataSet_<number>" available in the directory and than append them into "FinalData". There can be 5 or 500 datasets with the name like "DataSet_<number>" in my directory.
Note: all these datasets have the same variables.

I realized that I can not just use "PROC APPEND" function since I want my program to search for those datasets (with those individual numbers) automatically and than append them into "FinalData" dataset.

Should I use "%sysfunc" within my macro statement? How do I search for an existence of datasets in my directory?

Any examples will really help!
Thank you!


Avatar of Aloysius Low
Aloysius Low
Flag of Singapore image

to check for existence of datasets in a library, you could use the following statement (which will also output into a sas dataset (WORK.TEMP) which you'll need to use it later:

proc contents data = [source library]._ALL_ memtype=data out=work.temp;
run;

now that you have output the data into a temporary dataset, you'll need the following code to concatenate the datasets together. basically what this does is to read WORK.TEMP (the dataset produced earlier) for columns LIBNAME (the library which you have specified ealier) and MEMNAME (the dataset names in this example, because of the MEMTYPE option specified earlier) and use the call execute function to produce the code required:

%macro concatall;
      data _null_;      
            set work.temp end=last;
            length alldata $2000;
            retain alldata;
            if _N_ eq 1 then do;
                  call execute("data finalout;");
                  alldata = "";
            end;
            
            alldata = trim(left(alldata)) || " " || trim(left(libname))||"."||trim(left(memname));

            if last then do;
                  call execute("set " || trim(left(alldata)) || ";");
                  call execute("run;");
            end;
      run;
%mend;

%concatall;
to elaborate on what the macro is doing:

the first data step (data _null_) and the set statement (set work.temp) is to initiate the reading of the dataset WORK.TEMP which contains your dataset listing which you would like to process.

the definition of alldata is just to have a super long character string to contain the list of all datasets read out from WORK.TEMP. this will be used in the data step (data finalout) within the call execute statement.

the call execute statement is a very powerful statement, which often i use it to optimize processing as it can help to process data in 1 pass rather than multiple passes. however, it is also harder to debug. read up on it :)

essentially, it will store the commands to be executed into memory, and only executing it when it encounters the run statement (the one matching data _null_ in this case). hence, the final code generated by the call execute statement would be something you would write (as if it's hard coding), if you see exactly what is in your library.
Hi


There might be two easier methods:

1.  Using DICTIONARY tables


PROC SQL noprint;
    select memname into: datasets separated by " "
  from dictionary.tables where LIBNAME = 'WORK' and memname like 'DATASET_%';

DATA New;
  SET &datasets;
RUN;

/*Note - the libname in the Dictionary tables and the memname variable values must be in CAPS.*/

Method 2:  Using colon modifier;

In SAS 9.2, you can use the colon modifier as a type of "wildcard" for datasets:


OPTIONS NODSNFERR;
DATA New;
   SET Work,DataSets_:;
RUN;
Avatar of labradorchik

ASKER

lowaloysius, thank you very much for your comments!

I am not sure if I can use your example since my directory also contains other datasets (maybe a few thousands of other datastes)  that I do not need to use in my program. I only need to use those datastes with the following structure "DataSet_<number>" (<number> represents a specific number for that particular dataset).
I am still learning macros, so I am not sure I can figure out if your code only searches for particular datasets (with structure of "DataSet_<number>")  in my directory.

This is what I am trying to do:
  Search in my directory "dir2011" for datasets "DataSet_<number>" (<number> represents a specific number for that particular dataset) and than read all records from all "DataSet_<number>" datasets into one dataset "FinalData".

Note: I already know what numbers will appear in the names of those datasets (with names "DataSet_<number>") and what numbers will not appear. To my knowledge, datasets with a name "DataSet_<number>" should be not more than 10 or 20.

I was thinking may I can use "DO WHILE LOOP" for this?

my macro as it is above will process all datasets residing in the directory. if there may be other datasets which may reside in the same directory, then you could take an additional step to filter the output of proc contents (work.temp in my example above) to do something like
proc sql;
  create table filtered_temp as
    select * from work.temp
    where memname like "DataSet_%";
quit;

this will not work though, if there may be datasets which contains names like dataset_[alphabets] in the same directory...is this a possibility?

what you are thinking is definitely not wrong. yes you may use a do while loop, but again enclose in a macro. however, you will be restricting yourself to that many iterations.
I do not think there are names like dataset_[alphabets] in the same directory. I think I want to restrict myself to a specific pool of numbers.

OK, this is what I came up with for my "do loop", does this look correct or am I missing something?
Note: "dir2011" is directory where all those datasets are located. Also, I know for sure that those 10 or 20 datasets will have a number between 190 and 700 in their dataset name.

%macro combine;
  data FinalData ;
    set
    %do i = 190 %to 700;
      dir2011.DataSet&i
    %end;
  run;
%mend;

Will this pull all needed datasets from "dir2011" directory and than append them into one "FinalData" dataset?
Ooops! I forgot to enclose the following at the end of my code:

options mprint;
%combine
yes and no

it will only work if the dataset you are looking for exists. since you said it's going to be 10-20 out of a range of 711 datasets here, it's not going to work. it'll be better if you could check which are the datasets that are currently present, which is what i have posted earlier that you do a filter to pick up only those that you really needed before calling the macro to concatenate them

just a short note on your code - you said the dataset names will be like DataSet_<number>, however, your code is DataSet&i. The underscore is missing.

also, you are missing a final semi-colon for your set statement, which you'll need after the %end statement
I think I would like to try code this without involving PROC SQL, if this is possible of course.
For testing purposes I placed only three datasets (DataSet_03, DataSet_08, and DataSet_10) in my directory "dir2011".

After I run my code I get the following errors:

ERROR: File dir2011.DataSet_1.DATA does not exist.
ERROR: File dir2011.DataSet_2.DATA does not exist.
ERROR: File dir2011.DataSet_3.DATA does not exist.
ERROR: File dir2011.DataSet_4.DATA does not exist.
ERROR: File dir2011.DataSet_5.DATA does not exist.
ERROR: File dir2011.DataSet_6.DATA does not exist.
ERROR: File dir2011.DataSet_7.DATA does not exist.
ERROR: File dir2011.DataSet_8.DATA does not exist.
ERROR: File dir2011.DataSet_9.DATA does not exist.
ERROR: File dir2011.DataSet_11.DATA does not exist.

So, my program for some reason only recognizes DataSet_10.DATA, but does not find the other two datasets (DataSet_03.DATA and DataSet_08.DATA). Looking at the ERROR messages I can see that my code does not recognize zero in front of a number.  Also, program just stops because of those errors.
How can I code my program so that when a dataset is not found in my directory, my program just skips that particular dataset and just keep searching for a next available dataset?
Also, how can I search and append to "FinalData" only those records where "Var5" is equal to 2?  Where should I place Var5='2' in my code?

%macro combine;
  data FinalData ;
    set
    %do i = 01 %to 11;  
      dir2011.DataSet_&i
    %end;
   ;
  run;
%mend;
options mprint;
%combine
this is exactly why i said that your method will not work - even if there's 1 missing dataset out of 1000, or 1 million, it will still not work because the dataset don't exist.

the best and probably the only way is to check for the dataset existence, like the first example i (or theartfuldazzler) have given...
SOLUTION
Avatar of theartfuldazzler
theartfuldazzler
Flag of South Africa image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you, theartfuldazzler!
"NODSNFERR" seems to work fine for me!  :)

Any recommendations on how can include Var5=2 in my code?
Note:  I just need to have this, so only records with condition "Var5=2" will be written to "FinalData" dataset.  Where exactly should I place condition Var5='2' in my macro code?

%macro combine;
  data FinalData ;
    set
    %do i = 01 %to 11;  
      dir2011.DataSet_&i
       while Var5='2';                   /*  <= is this correct?  */
    %end;
   ;
  run;
%mend;
options mprint nodsnferr;
%combine
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Everything works fine! :)
Thank you both for all your help!!