SAS: Create a new variable from a character variable's value

Posted on 2015-02-23
Last Modified: 2016-02-13
I am using SAS (statistical analysis software).

I have a SAS dataset with 50 variables called ID1-ID50 and 50 variables called Score1-Score50. This dataset has test item scores from a 50 item test. There is one record per examinee. However, the dataset contains data from dozens of different test forms. The items are from an item bank. Each item is identified in the item bank with a 6-digit master ID number. For the first examinee, ID1 might be item 143576; for another examinee, ID1 might be 437892, etc.

I want to create a dataset that scores the item scores (from the Score1-Score50 variables) in which the variable names are the master Item ID numbers. For example, the ID numbers might be Item_143576, item_437892, etc. There are 2,000 different ID numbers, so there will be about 2,000 item score variables in the new dataset.

The simplest way would be something like CALL SYMPUT for data step variables rather than macro variables.

I could use PROC FORMAT to create a huge lookup table of the ID numbers and their index in a 2.000 element array. Then I could use a PUT statement to grab the appropriate array index value for the stored ID value. I was hoping there is something more elegant so that I don't have to create thousands of lines of code to store my proc format and my array statement (with 2,000 variable names).
Question by:gordonwwaugh
  • 3
  • 2

Expert Comment

ID: 40627240
I'm not exactly sure if you want to -
use the examinee dataset to update the item database where you are adding performance information into the item database
use the item database to find out which was the correct answer for each item for each examinee and produce a total score for each examinee.

Assuming the first, =>
Two completely different ways to tackle this.

Method 1
Use proc SQL to join your test dataset (the one with ID1-ID50 and Score1-Score50) to the dataset for your item bank, to retrive / or set the data value for each answered item.

You also could use a data step using keyed access to do much the same thing. This requires a
     SET statement identifying the examinee dataset
and another
    SET statement identifying the item database using KEYed access determined by current question ID.
where the second SET statement is within a DO loop which loops over all 50 questions.  To be able to do this neatly use two ARRAY declarations one for ID1-ID50 and the other for SCORE1-SCORE50

(using SQL could get messy doing the update if there is more than one examinee answering any particular item.)

Method 2
Makes sure the test dataset has an identification number for each subject (ie each record)
Transpose the test dataset (BY SubjectID) to produce a new dataset with 4 variables and 50 times the number of records as before as there will be one record for each question for each subject
SubjectID, Question,  ID, Score
where -
    Question contains values 1 to 50
    ID contains the item number and score
    Score contains the associated score.

Next sort by item ID  (either proc SQL combined with next step or proc SORT)
Sum all scores for each item ID (either data step, proc MEANS, or proc SQL)
merge new dataset into database adding sum of scores for each item (data step using merge or keyed access; OR proc SQL)


If you could provide information on the columns in the database that you want to use (and how you want to modify them) then we could talk specifics.


Author Comment

ID: 40627320
My current SAS datasat looks like this. There is one record per examinee. The ID values are test item ID numbers.

ExamineeNum  ID1      Score1       ID2          Score2       ID3      Score3    ...
1                          143         1             578               0             351         1
2                          297         1             325               0             426         1

I want it to look like this--still with one record per examinee.

ExamineeNum  s_143  s_297  s_325  s_351 s_426 s_578 ...
1                               1          .           .           1         .            0
2                               .          1           0          .          1           .

Obviously, the data above is simplified. My actual dataset has 100,000 examinees and 2,000 unique test item ID numbers.

I just finished doing this using a 2000-element proc format and a 2000-variable array. However, those 4,000 lines of code make the program a bit unwieldy to read--and I had to write a data step and proc freq to generate the 2000 item ID numbers.

Accepted Solution

ShannonEE earned 500 total points
ID: 40629595
Hi there

Thanks for the update.

Provided that you have sequentially numbered questions with variables ID1 ID2 ... IDx and variables Score1 Score2 ... Scorex where x is the numeric limit (in your case 50) the following code should do the trick.

%let lastQ=4;

	BY ExamineeNum;
	VAR ID1-ID&lastQ;

PROC TRANSPOSE DATA=work.test1	OUT=WORK.test3	PREFIX=newScore	NAME=SourceScore;
	BY ExamineeNum;
	VAR Score1-Score&lastQ ;

data work.test4;
merge work.test2 work.test3;
newName = cats("Q_", newID1);

PROC TRANSPOSE DATA=WORK.test4	OUT=WORK.test5(drop=_NAME_ LABEL="Twice transposed questions");
	BY ExamineeNum;
	ID newName;
	VAR newScore1 ;

proc print; run;

Open in new window

Make sure you set the macro variable lastQ.  I have used 4 to give the following results

Examin	ID1	Score1	ID2	Score2	ID3	Score3	ID4	Score4
10001	143	1	578	0	351	1	325	1
10002	297	1	325	0	426	1	143	1
10003	143	0	325	1	352	1	578	1
10004	351	1	143	1	297	0	578	0
10005	351	0	297	0	326	0	143	0

Open in new window

transferred into

Examin	Q_143	Q_578	Q_351	Q_325	Q_297	Q_426	Q_352	Q_326
10001	1	0	1	1	.	.	.	.
10002	1	.	.	0	1	1	.	.
10003	0	1	.	1	.	.	1	.
10004	1	0	1	.	0	.	.	.
10005	0	.	0	.	0	.	.	0	

Open in new window

You have the flexability to have any prefix on the question names.

The first 2 transposes just get a number of records for each examinee First for the question numbers and then the corresponding scores.

The data step with the merge then puts them together.

Lastly the transpose uses the new name created in the merge as a variable name into which to put the scores. This occurs at the time of putting all the records for each examinee back into a single record.

Note by default where an examinee did not have a qarticular question the transpose puts missing   (shows up as a dot ).

With a small amout of trickery it is possible to get the questions in ascending order.

Sort test2 by newID1,
make a variable ExamineeNum set to missing.
 get only one of each question (in variable newID1,
make up newScore1 for each newID1 and set to missing.
Transpose this data set  - because it is in sorted order the transposed questions will be in order.

Then append your final dataset to this artificial data set (this will pick up variables from the artificial single record data set first and keep in the correct order.  When combining the datasets just delete the first record (which was our dummy record.

You could probably get the questions sorted using proc datasets or onother utility procedure.
Is this what you want?


Author Comment

ID: 40630812
Perfect! That's exactly what I wanted!

Author Comment

ID: 40630937
Here is the full solution. I added the code for putting the variable names in ascending order in the dataset. The steps involving datasets dummy1, dummy2, etc., sort the variable names.

data test1;
infile cards;
      ID1    Score1         
      ID2    Score2          
      ID3    Score3          
      ID4    Score4          
10001      143      1      578      0      351      1      325      1
10002      297      1      325      0      426      1      143      1
10003      143      0      325      1      352      1      578      1
10004      351      1      143      1      297      0      578      0
10005      351      0      297      0      326      0      143      0

%let lastQ=4;

PROC TRANSPOSE DATA=work.test1      OUT=WORK.test2      PREFIX=newID      NAME=SourceID;
      BY ExamineeNum;
      VAR ID1-ID&lastQ;

proc sort data=work.test2 out=work.dummy1;
      by newID1;

data work.dummy2;
set work.dummy1(DROP=SourceID);
      by newID1;
if first.newID1;
examineeNum = .;
NewScore1 = .;
NewName = cats('Q_',newID1);

PROC TRANSPOSE DATA=work.dummy2 out=work.dummy3;
ID NewName;

data work.dummy4(DROP=_NAME_);
set work.dummy3;
if _N_=1;

PROC TRANSPOSE DATA=work.test1      OUT=WORK.test3      PREFIX=newScore      NAME=SourceScore;
      BY ExamineeNum;
      VAR Score1-Score&lastQ ;

data work.test4;
merge work.test2 work.test3;
newName = cats('Q_', newID1);

PROC TRANSPOSE DATA=WORK.test4      OUT=WORK.test5(drop=_NAME_ LABEL="Twice transposed questions");
      BY ExamineeNum;
      ID newName;
      VAR newScore1 ;

data work.test6;
set work.dummy4 work.test5;
if _N_ = 1 then delete;

proc print; run;

Featured Post

Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

If your app took Google’s lash recently, here are the 5 most likely reasons.
This guide will walk you through the essential considerations and tech stack for building scalable websites. Know how to grow your business the smart way!
Video by: Tony
This video teaches viewers how to export a project from Adobe Premiere Pro and the various file types involved.
The viewer will learn how to successfully download and install the SARDU utility on Windows 8, without downloading adware.

920 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now