Descriptive statistics or Summary Statistics of dataframe in pyspark

To calculate descriptive statistics (also called summary statistics) of a dataframe in pyspark, we use the describe() function. The same function also gives the summary statistics of a single column. Let's work through an example.

  • Descriptive statistics or summary statistics of dataframe in pyspark
  • Descriptive statistics or summary statistics of a numeric column in pyspark
  • Descriptive statistics or summary statistics of a character column in pyspark

Descriptive statistics in pyspark generally give the following:

    • Count – Count of non-null values of each column
    • Mean – Mean value of each column
    • Stddev – standard deviation of each column
    • Min – Minimum value of each column
    • Max – Maximum value of each column
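To make these five statistics concrete, here is a minimal plain-Python sketch of what each one computes. The scores are made-up illustration data, and `None` stands in for a null value. It assumes, as Spark's describe() does, that nulls are skipped and that the standard deviation is the sample standard deviation (dividing by n - 1).

```python
import statistics

# Hypothetical scores; None stands in for a null value in a Spark column.
scores = [85, 92, None, 70, 63]

# Nulls are ignored, so drop them before computing anything.
values = [s for s in scores if s is not None]

summary = {
    "count": len(values),                # number of non-null values
    "mean": statistics.mean(values),     # arithmetic mean
    "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": min(values),
    "max": max(values),
}
print(summary)
```

Note that the count here is 4, not 5, because the null entry is not counted.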

Syntax:

df.describe()

    df – dataframe

We will use the dataframe named df.


Descriptive statistics or summary statistics of dataframe in pyspark

dataframe.describe() gives the descriptive statistics of each numeric and string column. The descriptive statistics include:

  • Count – Count of values of each column
  • Mean – Mean value of each column
  • Stddev – standard deviation of each column
  • Min – Minimum value of each column
  • Max – Maximum value of each column

## summary statistics or descriptive statistics of dataframe
df.describe().show()


Descriptive statistics or summary statistics of a numeric column in pyspark

dataframe.select('column_name').describe() gives the descriptive statistics of a single column.

## summary statistics of a column (numeric column)
df.select('science_score').describe().show()


Descriptive statistics or summary statistics of a character column in pyspark

dataframe.select('column_name').describe() gives the descriptive statistics of a single column. For a character column, the mean and stddev rows are still listed but are null when the values are not numeric, so the useful statistics are:

  • Count – Count of values of a character column
  • Min – Minimum value of a character column
  • Max – Maximum value of a character column

## summary statistics of a column (character column)
df.select('name').describe().show()
