The mean, variance and standard deviation of a column in pyspark can be calculated using the agg() function, passing the column name along with the ‘mean’, ‘variance’ or ‘stddev’ keyword as needed. The mean, variance and standard deviation of each group in pyspark can be calculated by combining groupby() with agg(). We will see an example for each:

- Mean of the column in pyspark with example
- Variance of the column in pyspark with example
- Standard deviation of column in pyspark with example
- Mean of each group of dataframe in pyspark with example
- Variance of each group of dataframe in pyspark with example
- Standard deviation of each group of dataframe in pyspark with example

We will be using the dataframe named **df_basket1**.

**Mean of the column in pyspark with example:**

The mean of a column in pyspark is calculated using the aggregate function agg(). The agg() function takes the column name and the ‘mean’ keyword, which returns the mean value of that column.

```python
# Mean value of the column in pyspark
df_basket1.agg({'Price': 'mean'}).show()
```

The mean value of the Price column is calculated.

**Variance of the column in pyspark with example:**

The variance of a column in pyspark is also calculated using agg(). The agg() function takes the column name and the ‘variance’ keyword, which returns the variance of that column.

```python
# Variance of the column in pyspark
df_basket1.agg({'Price': 'variance'}).show()
```

The variance of the Price column is calculated.

**Standard Deviation of the column in pyspark with example:**

The standard deviation of a column in pyspark is likewise calculated using agg(). The agg() function takes the column name and the ‘stddev’ keyword, which returns the standard deviation of that column.

```python
# Standard deviation of the column in pyspark
df_basket1.agg({'Price': 'stddev'}).show()
```

The standard deviation of the Price column is calculated.

**Mean of each group in pyspark with example:**

The mean value of each group in pyspark is calculated using agg() along with groupby(). groupby() takes the grouping column name, and agg() takes the column name and the ‘mean’ keyword, which returns the mean value of that column for each group.

```python
# Mean value of each group
df_basket1.groupby('Item_group').agg({'Price': 'mean'}).show()
```

The mean price of each “Item_group” is calculated.


**Variance of each group in pyspark with example:**

The variance of each group in pyspark is calculated using agg() along with groupby(). groupby() takes the grouping column name, and agg() takes the column name and the ‘variance’ keyword, which returns the variance of that column for each group.

```python
# Variance of each group
df_basket1.groupby('Item_group').agg({'Price': 'variance'}).show()
```

The variance of Price within each “Item_group” is calculated.

**Standard deviation of each group in pyspark with example:**

The standard deviation of each group in pyspark is calculated using agg() along with groupby(). groupby() takes the grouping column name, and agg() takes the column name and the ‘stddev’ keyword, which returns the standard deviation of that column for each group.

```python
# Standard deviation of each group
df_basket1.groupby('Item_group').agg({'Price': 'stddev'}).show()
```

The standard deviation of Price within each “Item_group” is calculated.