Link to home
Start Free TrialLog in
Avatar of labradorchik
labradorchikFlag for United States of America

asked on

How to efficiently create one SAS dataset from multiple large SAS datasets?

Hi everyone, 


In SAS, I am trying create one SAS dataset (datafinal.sas7bdat) from a few thousands of large SAS datasets (data_<statecounty>_2015.sas7bdat). Processing time or processing memory should not be an issue but still efficiency is the key. All these SAS datasets have the same 5 variables and the only difference is that each dataset has its unique name and of course data from each state and county.  Since I am dealing with thousands of SAS datasets and trying to run this as efficient (and possibly faster) as possible, I am not sure what is the best to use a PROC SQL or a DATA step with macros? Any suggestions or examples would be greatly appreciated.  


<statecounty> in the each dataset's name is 5 digits number, for example, 02001.

Avatar of labradorchik
labradorchik
Flag of United States of America image

ASKER

Also, I am trying to understand what be an example of a wild card used within each input SAS dataset to process thousands of <statecounty> SAS datasets (data_<statecounty>_2015.sas7bdat) and create just one final SAS dataset (datafinal.sas7bdat)?


Examples of State and Counties within each input SAS dataset:

state can be equal to from 01 to 50

and county can be equal from 001 to 999


Maybe I need to create a bash script first to create an exported variable for <statecounty> in the script for each input  SAS dataset and then pass that exported variable to the SAS program? 

Avatar of noci
noci

In Linux the wildcards are processed by the Shell (or programs itself).

There is no OS function for expanding wildcards.

There is a wildcard expression handler in the glibc library 

glob()   see man 3 glob for it's use.

Use of that function requires writing some programs.

In perl there is the opendir()/readdir() pair of functions.  man perlfunc then search for "readdir DIRHANDLE"

Readdir returns filenames, filtering is done by some normal regex like: /data_[0-9]{5}_2015.sas7bdat/ comparison.

In python there is the glob modules : https://docs.python.org/3/library/glob.html


In Bash you can use the shell's glob handling 

So if you want to generate a list of names use f.e.

LIST=$( echo data_?????_2015.sas7bdat)

After that the LIST shell-variable holds the names....  in the current directory.


Another way to generate names... 

find /the-data/dir -name 'data_?????_2015.sas7bdat -print

Thank you, Noci!


I will also try "LIST" option in Bash scripting and will use that list of files in a SAS program to loop through. Since all input files are SAS datasets, I really need to use SAS for appending all datasets (data_<statecounty>_2015.sas7bdat) and their records into one SAS dataset (datafinal.sas7bdat). So, maybe combination Bash scripting and SAS would work. 


I am also thinking to use file exist function in SAS to check if each input file (data_<statecounty>_2015.sas7bdat) exists before appending to final output file (datafinal.sas7bdat)


Please note:  I have created statecounty.sas7bdat dataset - it has one variable/column with all state/county codes/digits (for example: 01001, 01002, 01003 and so on ...). This SAS dataset can be sued to loop through data_<statecounty>_2015.sas7bdat datasets


For example, something like this: 

libname input "/data/dir/input";
libname output "/data/dir/output";


/* statecounty dataset */
data _null_;
   set input.statecounty end=last;  
     call symput('stcnty'||srtip(_n_), fips);
run;


%do k=1 %to 999; 
  %if %sysfunc(exist(input.data_&&stcnty&k._2015)) %then %do;
     data output.datafinal;
       set input.data_&&stcnty&k._2015;
     run;
%end;

Open in new window


I would really appreciate if anyone can correct me if I am wrong with the above logic. I am also open to any options as long as SAS is used for appending these datasets into one. 


Thank you! 


If your have a LOT of files the LIST variable length limit may be is too small to contain all data.


If you can read from stdin (Standard input) in SAS you can feed the data with:


find /the-data/dir -name 'data_?????_2015.sas7bdat -print | sas sas-job... 

Open in new window

The SAS job then gets a filename relative to current location, one / line.
ASKER CERTIFIED SOLUTION
Avatar of labradorchik
labradorchik
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial

Thank you very much, Ian!!  I actually tested this a while back and just recently used this code in production. Everything was working as expected! Thank you again for your suggestions and comments!