Simple random sampling and stratified sampling in pyspark – sample(), sampleBy()

In simple random sampling, individuals are selected at random, so every individual is equally likely to be chosen. Simple random sampling in pyspark is done with the sample() function, and this article gives an example both with and without replacement. In stratified sampling, the population is divided into homogeneous subgroups and a sample is drawn from each subgroup. Stratified sampling in pyspark is done with the sampleBy() function. Let's look at an example of each.

Simple random sampling in pyspark with example using sample() function
Stratified sampling in pyspark with example

We will be using the dataframe df_cars


Simple random sampling in pyspark with example

In simple random sampling, individuals are selected at random, so every individual is equally likely to be chosen.


Simple random sampling without replacement in pyspark

Syntax:

sample(False, fraction, seed=None)

Returns a sampled subset of the DataFrame without replacement.

Note: the sample is not guaranteed to contain exactly the specified fraction of the DataFrame's rows.
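The note above can be illustrated with a small pure-Python model. This is only a sketch, not Spark's implementation: it assumes each row is kept independently with probability equal to fraction, and bernoulli_sample is a hypothetical helper written for this illustration. Under that assumption the sample size varies from run to run, while a fixed seed reproduces the same sample.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (illustrative model)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(32))  # stand-in for 32 DataFrame rows

# Different seeds give samples of different sizes, not always exactly 16.
sizes = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(5)]
print(sizes)

# The same seed reproduces the same sample.
first = bernoulli_sample(rows, 0.5, seed=42)
second = bernoulli_sample(rows, 0.5, seed=42)
print(first == second)
```

Because every row is considered at most once, a sample drawn this way can never contain duplicates.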

### Simple random sampling in pyspark

df_cars_sample = df_cars.sample(False, 0.5, 42)
df_cars_sample.show()

The resulting sample contains roughly half of the rows of df_cars, and no row appears more than once.

Simple random sampling with replacement

Syntax:

sample(True, fraction, seed=None)

Returns a sampled subset of the DataFrame with replacement.
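To see what "with replacement" changes, here is a pure-Python sketch. It rests on an assumption that the article does not state: each row is emitted a random number of times whose expected value is fraction, modelled here as a Poisson draw. Both poisson_draw and sample_with_replacement are hypothetical helpers, not part of pyspark.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(rows, fraction, seed=None):
    """Emit each row k times, where k is a Poisson draw with mean `fraction`."""
    rng = random.Random(seed)
    sample = []
    for row in rows:
        sample.extend([row] * poisson_draw(rng, fraction))
    return sample

rows = list(range(32))  # stand-in for 32 DataFrame rows
sample = sample_with_replacement(rows, 0.5, seed=42)

print(len(sample))                     # size of the sample for this seed
print(len(sample) - len(set(sample)))  # number of repeated picks, if any
```

In this model the same row can be picked more than once, which cannot happen when sampling without replacement.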

### Simple random sampling in pyspark with replacement

df_cars_sample = df_cars.sample(True, 0.5, 42)
df_cars_sample.show()

The resulting sample again has roughly half as many rows as df_cars, but the same row can appear more than once.


Stratified sampling in pyspark

In stratified sampling, the population is divided into homogeneous subgroups called strata, and a sample is drawn from each stratum.

Syntax:

sampleBy(col, fractions, seed=None)



### Stratified sampling in pyspark

sampled = df_cars.sampleBy("cyl", fractions={4: 0.2, 6: 0.4, 8: 0.2}, seed=0)
sampled.show()

The cyl column defines three strata (4, 6 and 8), which are sampled at fractions of 0.2, 0.4 and 0.2 respectively. With sampleBy() used as shown above, the resulting sample contains roughly 20% of the rows with cyl = 4, 40% of the rows with cyl = 6 and 20% of the rows with cyl = 8.
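The per-stratum behaviour of sampleBy() can be sketched in plain Python. This is a model under stated assumptions, not Spark's code: each row is kept independently with the fraction assigned to its stratum, and a stratum missing from fractions is treated as fraction zero. The helper sample_by and the cars rows, including the extra cyl value 5, are made up for this illustration.

```python
import random

def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum (illustrative model)."""
    rng = random.Random(seed)
    sample = []
    for row in rows:
        # Strata missing from `fractions` get fraction 0.0 and are never kept.
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:
            sample.append(row)
    return sample

# Made-up data: 10 rows for each of cyl = 4, 6, 8, plus 10 rows with cyl = 5.
cars = [{"id": i, "cyl": cyl} for i, cyl in enumerate([4, 6, 8, 5] * 10)]

sampled = sample_by(cars, "cyl", {4: 0.2, 6: 0.4, 8: 0.2}, seed=0)

# Count how many rows were kept per stratum; cyl = 5 never appears.
counts = {}
for row in sampled:
    counts[row["cyl"]] = counts.get(row["cyl"], 0) + 1
print(counts)
```

As with sample(), the fractions are probabilities per row, so the number of rows kept from each stratum is approximate rather than exact.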


Author

Sridhar Venkatachalam

Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.