In simple random sampling, individuals are selected at random, so every individual is equally likely to be chosen. Simple random sampling in pyspark is achieved by using the sample() function. Here we give an example of simple random sampling in pyspark both with replacement and without replacement. In stratified sampling, every member of the population is assigned to a homogeneous subgroup (a stratum) and a representative sample is drawn from each subgroup. Stratified sampling in pyspark is achieved by using the sampleBy() function, and we give an example of that as well.

- Simple random sampling in pyspark with example
- Stratified sampling in pyspark with example

We will be using the dataframe **df_cars**

**Simple random sampling in pyspark with example**

In simple random sampling, individuals are selected at random, so every individual is equally likely to be chosen.

**Simple random sampling without replacement in pyspark**

**Syntax:**

*sample(False, fraction, seed=None)*

Returns a sampled subset of the DataFrame without replacement.

**Note:** fraction is an expected proportion; the sample is not guaranteed to contain exactly that fraction of the DataFrame's rows.

    # Simple random sampling in pyspark without replacement
    df_cars_sample = df_cars.sample(False, 0.5, 42)
    df_cars_sample.show()

The result is a sample of df_cars drawn without replacement, which show() prints to the console.
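The note that fraction is not exact is easier to see with a small pure-Python sketch. It assumes that sampling without replacement works row by row: each row is kept independently with probability fraction. The helper name `bernoulli_sample` and the 32 stand-in rows are hypothetical and are not part of pyspark; this is an illustration of the idea, not Spark's implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (no row repeats)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(32))  # hypothetical stand-in for 32 DataFrame rows

# Same fraction, different seeds: the sample size is close to 16 but not fixed.
sizes = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(5)]
print(sizes)
```

Because every row gets its own independent coin flip, the sample size is only 0.5 × 32 = 16 on average, which is why sample() cannot promise an exact row count.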


**Simple random sampling with replacement**

**Syntax:**

*sample(True, fraction, seed=None)*

Returns a sampled subset of the DataFrame with replacement.

    # Simple random sampling in pyspark with replacement
    df_cars_sample = df_cars.sample(True, 0.5, 42)
    df_cars_sample.show()

The result is a sample of df_cars drawn with replacement, so the same row may appear more than once when show() prints it.
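What "with replacement" changes can be sketched in plain Python as well. The sketch assumes that each row is emitted a random number of times, drawn from a Poisson distribution whose mean is fraction; that is our understanding of how distributed samplers usually handle replacement, and it should be treated as an assumption rather than a description of Spark's internals. The helpers `poisson_draw` and `sample_with_replacement` are hypothetical names.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(rows, fraction, seed=None):
    """Emit each row 0, 1, 2, ... times; the expected count per row is `fraction`."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled

rows = list(range(32))  # hypothetical stand-in for 32 DataFrame rows
sample = sample_with_replacement(rows, 0.5, seed=42)

# Total rows drawn versus distinct rows drawn; a gap means some rows repeated.
print(len(sample), len(set(sample)))
```

Under this model a row can be drawn several times, which cannot happen when sampling without replacement.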

**Stratified sampling in pyspark**

In stratified sampling, every member of the population is assigned to a homogeneous subgroup, called a stratum, and a representative sample is drawn from each stratum.

**Syntax:**

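*sampleBy(col, fractions, seed=None)*

Returns a stratified sample without replacement, based on the fraction given for each stratum.

    # Stratified sampling in pyspark
    # fractions maps each value of cyl (the stratum) to its sampling fraction
    sampled = df_cars.sampleBy("cyl", fractions={4: 0.2, 6: 0.4, 8: 0.2}, seed=0)
    sampled.show()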

The cyl column defines three subgroups, or strata (4, 6 and 8), which are sampled at fractions of 0.2, 0.4 and 0.2 respectively. Applying sampleBy() as shown above returns a sample in which each stratum contributes approximately its requested fraction of rows.
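The per-stratum behaviour of sampleBy() can be mimicked in plain Python. This is a minimal sketch, assuming each row is kept with the probability listed for its stratum, and that a stratum missing from fractions is treated as fraction 0 (which is how the pyspark documentation describes sampleBy()). The `sample_by` helper and the `cars` data, including the extra 5-cylinder stratum, are hypothetical and only serve the illustration.

```python
import random
from collections import Counter

def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability given for its stratum (0 if unlisted)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

# Hypothetical data: 250 cars in each of four cylinder groups.
cars = [{"car": f"car_{i}", "cyl": cyl} for i, cyl in enumerate([4, 6, 8, 5] * 250)]

sampled = sample_by(cars, "cyl", {4: 0.2, 6: 0.4, 8: 0.2}, seed=0)

# Rows kept per stratum: about 50, 100 and 50 are expected for 4, 6 and 8,
# and none for 5 because it has no entry in fractions.
print(Counter(row["cyl"] for row in sampled))
```

Here too the fractions are only expected proportions, so the per-stratum counts come out close to the requested shares rather than exactly equal to them.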