PROC SURVEYSELECT IN SAS EXPLAINED

PROC SURVEYSELECT in SAS selects random samples from a dataset. It supports simple random sampling and stratified sampling, and it can also be used to split a dataset into train and test sets. Let’s see an example of each.

Syntax of PROC SURVEYSELECT in SAS:

PROC SURVEYSELECT options;
STRATA variable;
CONTROL variable;
SIZE variable;
ID variable;

The examples below use a CARS table with 16 rows, including a Luxury variable that takes the values 0 and 1.

 

 

Simple Random Sampling: PROC SURVEYSELECT

Select N% samples

Selecting a random N% sample in SAS is done with the PROC SURVEYSELECT procedure by specifying method=srs and samprate=N, as shown below. A samprate value greater than 1 is read as a percentage, so samprate=60 requests 60% of the rows.


/* Type 1: proc survey select n percentage sample*/ 
proc surveyselect data=cars 
out = cars_sample_60perc 
method=srs 
samprate=60; 
run; 

The resultant table, cars_sample_60perc, holds a random 60% of the rows of CARS.
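Because no seed is given in the call above, each run can select a different 60%. A minimal sketch of a reproducible variant follows; the seed value 2024 and the output name cars_sample_60perc_seeded are assumptions chosen only for illustration.

```sas
/* Same 60% simple random sample, but repeatable across runs */
proc surveyselect data=cars
    out=cars_sample_60perc_seeded  /* hypothetical output name */
    method=srs
    samprate=60
    seed=2024;                     /* arbitrary illustrative seed */
run;
```

Rerunning this step with the same seed and the same input order should return the same rows.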

 

Select N samples

Selecting N random samples in SAS is done with the PROC SURVEYSELECT procedure by specifying method=srs and sampsize=N, as shown below.


/* Type 2: proc survey select n samples*/ 

proc surveyselect data=cars 
out = cars_sample_n 
method=srs 
sampsize=10; 
run;

The resultant table, cars_sample_n, holds 10 randomly selected rows of CARS.

 

 

Simple Random Sampling with replacement – proc survey select

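Simple random sampling with replacement in SAS is done with the PROC SURVEYSELECT procedure by specifying method=urs (unrestricted random sampling) together with sampsize=N, as shown below. Note that method=srs always samples without replacement, and rep= only sets the number of replicate samples, so neither of them produces repetition. The outhits option writes one row per draw, so a row that is selected more than once appears more than once in the output.


/* simple random sampling with replacement - proc surveyselect */ 

proc surveyselect data=cars method=urs sampsize=10 
seed=12345 outhits out=cars_rep_n; 
run; 

The resultant table, cars_rep_n, holds 10 draws from CARS, and the same row can appear more than once.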
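A with-replacement sample can also be stored more compactly. The sketch below assumes METHOD=URS without the OUTHITS option, in which case the output has one row per distinct selected unit plus a NumberHits variable that records how many times it was drawn. The output name cars_rep_hits is hypothetical.

```sas
/* With-replacement sample, one row per distinct unit selected */
proc surveyselect data=cars method=urs sampsize=10
    seed=12345 out=cars_rep_hits;   /* hypothetical output name */
run;

/* List only the units that were drawn more than once */
proc print data=cars_rep_hits;
    where NumberHits > 1;
run;
```

If no unit happens to be drawn twice for a given seed, the PROC PRINT step simply lists nothing.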

 

 

Stratified Sampling in SAS : PROC SURVEYSELECT

Note: PROC SURVEYSELECT expects the input dataset to be sorted by the strata variable(s).

Luxury is the strata variable. 4 samples are selected for each stratum (i.e. 4 samples are selected for Luxury=1 and 4 samples are selected for Luxury=0).


proc sort data=cars; 
by Luxury; 
run; 
 
/** sample size of 4 for each strata */ 

proc surveyselect data=cars 
out = strat_sample_n 
method=srs  
sampsize=4; 
strata Luxury; 
run;

The resultant stratified sample, strat_sample_n, has 4 rows from each stratum, 8 rows in total.
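The sample size does not have to be the same in every stratum. As a hedged sketch, SAMPSIZE= also accepts a list of values in parentheses, matched to the strata in the order they appear in the sorted input (Luxury=0 first, then Luxury=1). The sizes 6 and 2, the seed, and the output name strat_sample_uneven are assumptions made for illustration.

```sas
proc sort data=cars;
    by Luxury;
run;

/* 6 rows from Luxury=0 and 2 rows from Luxury=1 */
proc surveyselect data=cars
    out=strat_sample_uneven   /* hypothetical output name */
    method=srs
    sampsize=(6 2)            /* one value per stratum, in sorted order */
    seed=12345;
    strata Luxury;
run;
```

Each value must not exceed the number of rows in its stratum, since method=srs samples without replacement.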

 

 

Total N Samples split proportionately according to distribution of strata

In the example below we draw 4 samples in total, with Luxury as the strata variable and the split proportionate to the stratum sizes. Luxury=1 has 5 entries and Luxury=0 has 11 entries, so the exact shares of the 4 samples are 4 × 5/16 = 1.25 and 4 × 11/16 = 2.75. After rounding, 3 of the 4 samples will have Luxury=0 and 1 will have Luxury=1.


proc sort data=cars; 
by Luxury; 
run; 
 
/** total sample size of 4 with allocation proportionate to strata*/ 

proc surveyselect data=cars  
out = strat_sample_n 
method=srs 
sampsize=4; 
strata Luxury / alloc=proportional; 
run;

The resultant sample table, strat_sample_n, has 3 rows with Luxury=0 and 1 row with Luxury=1.
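The proportional split can be checked by hand. The small sketch below uses only the stratum counts quoted in the text (11 and 5); the dataset name alloc_check and its variable names are made up for illustration and are not produced by PROC SURVEYSELECT.

```sas
/* Exact proportional share of a total sample of 4 for each stratum */
data alloc_check;
    input Luxury stratum_size;
    total_n     = 4;     /* total sample size requested */
    pop_total   = 16;    /* 11 + 5 rows in CARS */
    exact_share = total_n * stratum_size / pop_total;
    datalines;
0 11
1 5
;
run;

proc print data=alloc_check noobs;
run;
```

The exact shares come out as 2.75 for Luxury=0 and 1.25 for Luxury=1, which is consistent with the 3 and 1 rows in the sample.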

 

 

Split Train and Test Data set in SAS  –  PROC SURVEYSELECT

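Step 1: Use PROC SURVEYSELECT and specify the sampling rate for the train data (70% in our case, which leaves 30% for test) along with the method, which is SRS (Simple Random Sampling) in our case. The outall option keeps every input row in the output and flags the sampled ones.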


proc surveyselect data=cars samprate=0.7 
out= cars_select outall 
method=srs; 
run;

The procedure prints a summary of the selection, including the selection method, the sampling rate and the sample size.

Resultant table “cars_select” will have a column “Selected” with value 1 for the sampled rows and 0 for the rest.

Step 2: Write all rows with Selected=1 to the train data set and all rows with Selected=0 to the test data set, as shown below


data cars_train cars_test; 
set cars_select; 
if selected =1 then output cars_train; 
else output cars_test; 
run;
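Before relying on the split, it can help to confirm how many rows ended up on each side. A minimal sketch, assuming the cars_select table from Step 1 with its Selected flag:

```sas
/* Count rows flagged for train (Selected=1) and test (Selected=0) */
proc freq data=cars_select;
    tables Selected;
run;
```

The percentages in the output should be close to the requested 70/30 split; with a small table they will not match it exactly.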


The table cars_train now holds the training data and cars_test holds the testing data.

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.