Descriptive Statistics or Summary Statistics of a dataframe in pyspark

To calculate descriptive statistics (also called summary statistics) of a dataframe in pyspark, we use the describe() function. The same function can also calculate the summary statistics of a single column. Let's get clarity with an example.

  • Descriptive statistics or summary statistics of a dataframe in pyspark
  • Descriptive statistics or summary statistics of a numeric column in pyspark
  • Descriptive statistics or summary statistics of a character column in pyspark
  • Mean, Min and Max of a column in pyspark using select() function.

Descriptive statistics in pyspark generally give the following for each column:

    • Count – Number of non-null values in each column
    • Mean – Mean value of each column
    • Stddev – Sample standard deviation of each column
    • Min – Minimum value of each column
    • Max – Maximum value of each column
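As a rough illustration of what these five values mean, the following plain-Python sketch computes them for a small list with the standard library. It does not use Spark; the scores are made up, and statistics.stdev is chosen because it is the sample standard deviation (dividing by n - 1), which is what the stddev row of describe() reports.

```python
import statistics

# Hypothetical scores, used only to illustrate the five statistics.
scores = [45, 68, 80, 57]

summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),
    # Sample standard deviation: divides by (n - 1), not n.
    "stddev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
}

for stat_name, value in summary.items():
    print(stat_name, value)
```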

Syntax:

df.describe()

    df – dataframe

We will use the dataframe named df.

Figure: the dataframe df

Descriptive statistics or summary statistics of dataframe in pyspark:

dataframe.describe() gives the descriptive statistics of each column. The descriptive statistics include

  • Count – Number of non-null values in each column
  • Mean – Mean value of each column
  • Stddev – Sample standard deviation of each column
  • Min – Minimum value of each column
  • Max – Maximum value of each column

## summary statistics or descriptive statistics of dataframe
df.describe().show()

Figure: output of df.describe().show()


Descriptive statistics or summary statistics of a numeric column in pyspark: Method 1

dataframe.select('column_name').describe() gives the descriptive statistics of a single column.

## summary statistics of a column (numeric column)
df.select('science_score').describe().show()

Figure: summary statistics of the science_score column


Descriptive statistics or summary statistics of a numeric column in pyspark: Method 2

The columns for which the summary statistics are needed are passed as arguments to the describe() function, which then gives the descriptive statistics of just those columns (two columns in this example).

## summary statistics of two columns (numeric column)
df.describe('science_score', 'mathematics_score').show()

So the summary statistics of the "science_score" and "mathematics_score" columns will be

Figure: summary statistics of the science_score and mathematics_score columns

 

 


Descriptive statistics or summary statistics of a character column in pyspark: Method 1

dataframe.select('column_name').describe() gives the descriptive statistics of a single column. For a character (string) column holding non-numeric text, mean and stddev are reported as null, so the meaningful statistics are

  • Count – Number of non-null values in the character column
  • Min – Minimum value of the character column (first in string order)
  • Max – Maximum value of the character column (last in string order)

## summary statistics of a column (character column)
df.select('name').describe().show()

Figure: summary statistics of the name column


Descriptive statistics or summary statistics of a character column in pyspark: Method 2

dataframe.describe() gives the descriptive statistics of a single column when the column name is passed to the describe() function. For a character column, the useful statistics are the count, minimum and maximum value of the column.

## summary statistics of a column (character column)
df.describe('name').show()

So the resultant summary statistics of the "name" column will be

Figure: summary statistics of the name column
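The minimum and maximum of a character column are decided by string ordering, not by length or by any numeric meaning. The plain-Python sketch below shows the same idea without Spark, using made-up names; Python's built-in min() and max() compare strings character by character, which matches how the default string comparison behaves for simple ASCII names like these.

```python
# Hypothetical names, used only to illustrate string ordering.
names = ["Cathy", "Alex", "Dinesh", "Bala"]

name_count = len(names)
smallest_name = min(names)  # first in lexicographic order
largest_name = max(names)   # last in lexicographic order

print(name_count, smallest_name, largest_name)
```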

 


Extract Mean, Min and Max of a column in pyspark using select() function:

Inside the select() function we use the mean(), min() and max() functions from pyspark.sql.functions, which calculate the average value, minimum value and maximum value of the column.

  • Average value of the numeric column – mean()
  • Minimum value of the numeric column – min()
  • Maximum value of the numeric column – max()

## mean, min and max of a column (numeric column)
from pyspark.sql.functions import mean, min, max
df.select([mean('science_score'), min('science_score'), max('science_score')]).show()

So the resultant summary statistics of the "science_score" column will be

Figure: mean, minimum and maximum of the science_score column

 



Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS and Pyspark.