Summary or Descriptive statistics in R

Descriptive Statistics of the dataframe in R can be calculated by 3 different methods. Let’s see how to calculate summary statistics of each column of dataframe  in R with an example for each method. summary() function in R is used to get the summary statistics of the column

  • Descriptive statistics with summary function in R
  • Summary statistics in R using stat.desc() function from “pastecs” package
  • Descriptive statistics with describe() function from “Hmisc” package
  • summarise() function of the dplyr package in R

Let’s first create the dataframe.

### Create Data Frame
df1 = data.frame(Name = c('George','Andrea', 'Micheal','Maggie','Ravi','Xien','Jalpa'), 
                 Grade_score=c(4,6,2,9,5,7,8),
                 Mathematics1_score=c(45,78,44,89,66,49,72),
                 Science_score=c(56,52,45,88,33,90,47))
df1

So the resultant dataframe will be

Descriptive or summary statistics in R 0

 

Descriptive statistics in R (Method 1):

summary statistic is computed using  summary() function in R. summary() function is automatically applied to each column. The format of the result depends on the data type of the column.

  • If the column is a numeric variable, mean, median, min, max and quartiles are returned.
  • If the column is a factor variable, the number of observations in each group is returned.

Descriptive statistics in R with simple summary function calculates

  • minimum value of each column
  • maximum value of each column
  • mean value of each column
  • median value of each column
  • 1st quartile  of each column (25th percentile)
  • 3rd quartile of each column (75th percentile)

as shown below

# Summary statistics of dataframe in R

summary(df1)

summary statistics is

Descriptive or summary statistics in R 1

summary statistics of a single column in R:

Five values of a specified column is returned: the mean, median, 25th and 75th quartiles, min and max in one single line call:

 

# Summary statistics of a column in R

 summary(df1$Science_score)

so the summary statistics of the “Science_score” column will be

summary statistics of the column in R 11

 

Summary / Descriptive statistics in R (Method 2):

Descriptive statistics in R with pastecs package does bit more than simple describe () function. It also Calculates

  • number of missing values and null of each column in R
  • number of non missing values of each column
  • sum , range ,variance and standard deviation etc for each column
# descripive statistics of dataframe in R 

install.packages("pastecs")  
library(pastecs)
stat.desc(df1)

summary statistics is

Descriptive or summary statistics in R 2

 

Summary statistics in R (Method 3):

Descriptive statistics in R with Hmisc package calculates the  distinct value of each column, frequency of each value and proportion of that value in that column. as shown below

# Summary statistics of dataframe in R 

install.packages("Hmisc")
library(Hmisc)
describe(df1)

summary statistics is

Descriptive or summary statistics in R 3

 

Summarise using dplyr() package in R

We will be using mtcars data to depict the example of summarise function.

library(dplyr)
mydata = mtcars

# summarise the columns of dataframe
summarise(mydata, mpg_mean=mean(mpg),mpg_median=median(mpg))

summarise() function that gets the mean and median of mpg.

Get the summary of dataset in R using Dplyr summarise function in R dplyr 1

 

summarise_all()

The summarise_all() function allows you to summarise all the variables.

library(dplyr)
mydata = mtcars

# summarise all the column of dataframe
summarise_all(mydata,funs(n(),mean,median))

summarise_all() function that gets the number of rows, mean and median of  all the columns.

Get the summary of dataset in R using Dplyr summarise function in R dplyr 4

 

Summarize categorical or factor Variable:

We will be summarizing the number of levels/categories and count of missing observations in a categorical (factor) variable. Let’s use iris dataset for example

library(dplyr)

mydata2 = iris
summarise_all(mydata2["Species"], funs(nlevels(.), nmiss=sum(is.na(.))))

In the iris dataset “Species” column has three distinct levels and zero missing values as shown below.

Get the summary of dataset in R using Dplyr summarise function in R dplyr 5

 

For further understanding of  summary statistics using dplyr package in R refer the dplyr documentation


Other Related Topics:

                                                                                                   

Author

  • Sridhar Venkatachalam

    With close to 10 years on Experience in data science and machine learning Have extensively worked on programming languages like R, Python (Pandas), SAS, Pyspark.