PROC SURVEYSELECT in SAS selects samples from a dataset. It supports simple random sampling and stratified sampling, and it can also be used to split data into train and test sets. Let’s see an example of each.
Syntax of PROC SURVEYSELECT in SAS:
PROC SURVEYSELECT options;
STRATA variable;
CONTROL variable;
SIZE variable;
ID variable;
We will use the CARS table in our examples.
Simple Random Sampling : PROC SURVEYSELECT
Select N% samples
Selecting a random N% sample in SAS is accomplished with the PROC SURVEYSELECT procedure by specifying method=srs and samprate=N, as shown below. A value greater than 1 is read as a percentage, so samprate=60 means 60%.
/* Type 1: proc survey select n percentage sample */
proc surveyselect data=cars out=cars_sample_60perc method=srs samprate=60;
run;
The resulting table, cars_sample_60perc, contains a random 60% of the rows in cars.
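Each run of the step above draws a different sample unless the random number seed is fixed. A minimal sketch of a reproducible version, assuming the same cars table; the output name cars_sample_seeded is a hypothetical choice and 12345 is an arbitrary seed value:

```sas
/* Same 60% simple random sample, but repeatable:            */
/* the seed= option fixes the random number stream, so       */
/* rerunning this step selects the same rows each time.      */
/* cars_sample_seeded is a hypothetical output name.         */
proc surveyselect data=cars
                  out=cars_sample_seeded
                  method=srs
                  samprate=60
                  seed=12345;
run;
```

Fixing the seed is useful whenever the sample feeds later steps that need to be rerun and compared.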
Select N samples
Selecting N random samples in SAS is accomplished with the PROC SURVEYSELECT procedure by specifying method=srs and sampsize=N, as shown below.
/* Type 2: proc survey select n samples */
proc surveyselect data=cars out=cars_sample_n method=srs sampsize=10;
run;
The resulting table, cars_sample_n, contains 10 randomly selected rows of cars.
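Simple Random Sampling with replacement – PROC SURVEYSELECT
Simple random sampling with replacement in SAS is accomplished with PROC SURVEYSELECT by specifying method=urs (unrestricted random sampling) and sampsize=N, as shown below. method=srs always samples without replacement, and rep= only sets the number of replicate samples, so neither of them produces repetition. The outhits option writes one row per selection, so a row drawn more than once appears more than once in the output.
/* simple random sampling with replacement - proc survey select */
proc surveyselect data=cars method=urs sampsize=10 outhits seed=12345 out=cars_rep_n;
run;
The resulting table, cars_rep_n, contains 10 rows, some of which may be repeats of the same car.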
Stratified Sampling in SAS : PROC SURVEYSELECT
Note : PROC SURVEYSELECT expects the dataset to be sorted by the strata variable(s).
Luxury is the strata variable. 4 samples are selected for each stratum (i.e. 4 samples are selected for Luxury=1 and 4 samples are selected for Luxury=0).
proc sort data=cars;
    by Luxury;
run;
/** sample size of 4 for each strata */
proc surveyselect data=cars out=strat_sample_n method=srs sampsize=4;
    strata Luxury;
run;
The resulting stratified sample, strat_sample_n, contains 4 rows for each stratum, 8 rows in total.
Total N Samples split proportionately according to distribution of strata
In the example below we take 4 samples in total, with Luxury as the strata variable and the split proportionate to the distribution of the strata. Luxury=1 has 5 entries and Luxury=0 has 11 entries, so Luxury=0 makes up about two thirds of the data. After rounding to whole rows, 3 of the 4 samples will have Luxury=0 and 1 will have Luxury=1.
proc sort data=cars;
    by Luxury;
run;
/** total sample size of 4 with allocation proportionate to strata */
proc surveyselect data=cars out=strat_sample_n method=srs sampsize=4;
    strata Luxury / alloc=proportional;
run;
The resulting table, strat_sample_n, contains 4 rows in total: 3 with Luxury=0 and 1 with Luxury=1.
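The allocation can be checked by hand. Under proportional allocation, a stratum holding N_h of the N rows receives roughly n · N_h / N of the n samples. The rounding shown here is an assumption about how the fractional targets resolve to whole rows, chosen to be consistent with the 1 and 3 stated above:

```latex
n_h \approx n \cdot \frac{N_h}{N}, \qquad
n_{\text{Luxury}=1} = 4 \cdot \frac{5}{16} = 1.25 \rightarrow 1, \qquad
n_{\text{Luxury}=0} = 4 \cdot \frac{11}{16} = 2.75 \rightarrow 3
```

The two rounded sizes add up to the requested total of 4.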
Split Train and Test Data set in SAS – PROC SURVEYSELECT
Step 1: Use PROC SURVEYSELECT and specify the split ratio for train and test data (70% and 30% in our case) with samprate=0.7, along with the method, which is SRS (simple random sampling) in our case. The outall option keeps every input row in the output and flags the sampled ones.
proc surveyselect data=cars samprate=0.7 out=cars_select outall method=srs;
run;
The procedure output summarizes the selection, including the method, the sampling rate and the sample size.
The resultant table “cars_select” has a column “Selected” with value 1 for sampled rows and 0 for the rest.
Step 2: Split all the 1s as Train data set and all 0s as Test data set as shown below
data cars_train cars_test;
    set cars_select;
    if selected = 1 then output cars_train;
    else output cars_test;
run;
Training Data:
Testing Data:
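A quick way to confirm the split is to tabulate the Selected flag before separating the rows. A minimal sketch, assuming cars_select was created with the outall option as in Step 1:

```sas
/* Count how many rows were flagged for train (Selected=1)   */
/* versus test (Selected=0) in the output of Step 1.         */
proc freq data=cars_select;
    tables Selected;
run;
```

The share of rows with Selected=1 should be close to 70%. It will not be exact when 70% of the row count is not a whole number.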