Mean, Variance and Standard Deviation of Column in PySpark

The mean, variance and standard deviation of a column in PySpark can be computed with the agg() function, which takes a dictionary mapping the column name to 'mean', 'variance' or 'stddev' as needed. The same statistics for each group can be computed by using groupby() along with agg(). Each case is shown with an example:

  • Mean of the column in pyspark with example
  • Variance of the column in pyspark with example
  • Standard deviation of column in pyspark with example
  • Mean of each group of dataframe in pyspark with example
  • Variance of each group of dataframe in pyspark with example
  • Standard deviation of each group of dataframe in pyspark with example

We will be using a dataframe named df_basket1.

Mean of the column in pyspark with example:

The mean of a column in PySpark is calculated with the agg() function. agg() takes a dictionary that maps the column name to the 'mean' keyword and returns the mean value of that column.

## Mean value of the column in pyspark

df_basket1.agg({'Price': 'mean'}).show()

The mean value of the Price column is calculated.

Variance of the column in pyspark with example:

The variance of a column in PySpark is calculated with the agg() function. agg() takes a dictionary that maps the column name to the 'variance' keyword and returns the variance of that column.


## Variance of the column in pyspark

df_basket1.agg({'Price': 'variance'}).show()

The variance of the Price column is calculated.

Standard Deviation of the column in pyspark with example:

The standard deviation of a column in PySpark is calculated with the agg() function. agg() takes a dictionary that maps the column name to the 'stddev' keyword and returns the standard deviation of that column.

## Standard deviation of the column in pyspark

df_basket1.agg({'Price': 'stddev'}).show()

The standard deviation of the Price column is calculated.
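
Note that Spark's 'variance' and 'stddev' aggregates are aliases for the sample statistics (var_samp and stddev_samp), which divide by n - 1 rather than n. The plain-Python sketch below illustrates the difference using the standard statistics module; the price values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical prices, used only to illustrate the two definitions.
prices = [100.0, 40.0, 30.0, 20.0]

mean_price = statistics.mean(prices)

# Sample statistics divide the sum of squared deviations by n - 1.
sample_variance = statistics.variance(prices)
sample_stddev = statistics.stdev(prices)

# Population statistics divide by n instead.
population_variance = statistics.pvariance(prices)
population_stddev = statistics.pstdev(prices)

print("mean:", mean_price)
print("sample variance:", sample_variance, "sample stddev:", sample_stddev)
print("population variance:", population_variance, "population stddev:", population_stddev)
```

When the population versions are needed in Spark, the var_pop and stddev_pop aggregate functions are the counterparts to look at.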

Mean of each group in pyspark with example:

The mean value of each group in PySpark is calculated using groupby() along with agg(). groupby() takes the name of the grouping column, and agg() takes a dictionary that maps the column name to the 'mean' keyword; together they return the mean value of that column for each group.

# Mean value of each group

df_basket1.groupby('Item_group').agg({'Price': 'mean'}).show()

The mean price of each “Item_group” is calculated.

Variance of each group in pyspark with example:

The variance of each group in PySpark is calculated using groupby() along with agg(). groupby() takes the name of the grouping column, and agg() takes a dictionary that maps the column name to the 'variance' keyword; together they return the variance of that column for each group.

# Variance of each group

df_basket1.groupby('Item_group').agg({'Price': 'variance'}).show()

The variance of the price for each “Item_group” is calculated.

Standard deviation of each group in pyspark with example:

The standard deviation of each group in PySpark is calculated using groupby() along with agg(). groupby() takes the name of the grouping column, and agg() takes a dictionary that maps the column name to the 'stddev' keyword; together they return the standard deviation of that column for each group.

# Standard deviation of each group

df_basket1.groupby('Item_group').agg({'Price': 'stddev'}).show()

The standard deviation of the price for each “Item_group” is calculated.

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, Sridhar has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.