asked on
How to efficiently create one SAS dataset from multiple large SAS datasets?
Hi everyone,
In SAS, I am trying create one SAS dataset (datafinal.sas7bdat) from a few thousands of large SAS datasets (data_<statecounty>_2015.sas7bdat). Processing time or processing memory should not be an issue but still efficiency is the key. All these SAS datasets have the same 5 variables and the only difference is that each dataset has its unique name and of course data from each state and county. Since I am dealing with thousands of SAS datasets and trying to run this as efficient (and possibly faster) as possible, I am not sure what is the best to use a PROC SQL or a DATA step with macros? Any suggestions or examples would be greatly appreciated.
<statecounty> in the each dataset's name is 5 digits number, for example, 02001.
In Linux the wildcards are processed by the Shell (or programs itself).
There is no OS function for expanding wildcards.
There is a wildcard expression handler in the glibc library
glob() see man 3 glob for it's use.
Use of that function requires writing some programs.
In perl there is the opendir()/readdir() pair of functions. man perlfunc then search for "readdir DIRHANDLE"
Readdir returns filenames, filtering is done by some normal regex like: /data_[0-9]{5}_2015.sas7bdat/ comparison.
In python there is the glob modules : https://docs.python.org/3/library/glob.html
In Bash you can use the shell's glob handling
So if you want to generate a list of names use f.e.
LIST=$( echo data_?????_2015.sas7bdat)
After that the LIST shell-variable holds the names.... in the current directory.
Another way to generate names...
find /the-data/dir -name 'data_?????_2015.sas7bdat -print
ASKER
Thank you, Noci!
I will also try "LIST" option in Bash scripting and will use that list of files in a SAS program to loop through. Since all input files are SAS datasets, I really need to use SAS for appending all datasets (data_<statecounty>_2015.sas7bdat) and their records into one SAS dataset (datafinal.sas7bdat). So, maybe combination Bash scripting and SAS would work.
I am also thinking to use file exist function in SAS to check if each input file (data_<statecounty>_2015.sas7bdat) exists before appending to final output file (datafinal.sas7bdat).
Please note: I have created statecounty.sas7bdat dataset - it has one variable/column with all state/county codes/digits (for example: 01001, 01002, 01003 and so on ...). This SAS dataset can be sued to loop through data_<statecounty>_2015.sas7bdat datasets.
For example, something like this:
libname input "/data/dir/input";
libname output "/data/dir/output";
/* statecounty dataset */
data _null_;
set input.statecounty end=last;
call symput('stcnty'||srtip(_n_), fips);
run;
%do k=1 %to 999;
%if %sysfunc(exist(input.data_&&stcnty&k._2015)) %then %do;
data output.datafinal;
set input.data_&&stcnty&k._2015;
run;
%end;
I would really appreciate if anyone can correct me if I am wrong with the above logic. I am also open to any options as long as SAS is used for appending these datasets into one.
Thank you!
If your have a LOT of files the LIST variable length limit may be is too small to contain all data.
If you can read from stdin (Standard input) in SAS you can feed the data with:
find /the-data/dir -name 'data_?????_2015.sas7bdat -print | sas sas-job...
The SAS job then gets a filename relative to current location, one / line.
ASKER
Thank you very much, Ian!! I actually tested this a while back and just recently used this code in production. Everything was working as expected! Thank you again for your suggestions and comments!
ASKER
Also, I am trying to understand what be an example of a wild card used within each input SAS dataset to process thousands of <statecounty> SAS datasets (data_<statecounty>_2015.sas7bdat) and create just one final SAS dataset (datafinal.sas7bdat)?
Examples of State and Counties within each input SAS dataset:
state can be equal to from 01 to 50
and county can be equal from 001 to 999
Maybe I need to create a bash script first to create an exported variable for <statecounty> in the script for each input SAS dataset and then pass that exported variable to the SAS program?