Mean of two or more columns in pyspark

This tutorial shows how to calculate the mean of two or more columns in pyspark. Both methods add the columns with the + operator and divide the sum by the number of columns. The first method does this inside select(), which returns only the computed mean column; the second uses withColumn(), which appends the mean to the existing dataframe. Let’s see an example of each.

  • mean of two or more columns in pyspark using + inside select() and dividing by the number of columns
  • mean of multiple columns in pyspark, divided by the number of columns and appended to the dataframe

We will be using the dataframe df_student_detail.

Mean of two or more columns in pyspark: Method 1

  • In Method 1 we use the + operator to calculate the mean of multiple columns in pyspark: adding the columns gives the row-wise sum, and dividing that sum by the number of columns gives the mean.

### Mean of two or more columns in pyspark

from pyspark.sql.functions import col, lit

df1=df_student_detail.select(((col("mathematics_score") + col("science_score")) / lit(2)).alias("mean"))
df1.show()

This method computes the row-wise mean of the selected columns and returns it as a single column named “mean”.
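
The same sum-and-divide pattern extends to any number of columns. The sketch below is one possible way to write it; the helper name mean_of is hypothetical and not part of pyspark. It only relies on + and /, so it is demonstrated here with plain numbers standing in for columns, which lets it run without a Spark session.

```python
from functools import reduce
from operator import add


def mean_of(terms):
    """Add the operands with + and divide by how many there are.

    Hypothetical helper: it works on anything that supports + and /,
    which includes plain numbers and pyspark Column objects.
    """
    terms = list(terms)
    if not terms:
        raise ValueError("need at least one operand")
    return reduce(add, terms) / len(terms)


# Plain numbers stand in for column values here.
print(mean_of([80, 90, 70]))  # 80.0
```

With pyspark, the operands would be built as [col(c) for c in column_names] and the returned expression passed to select(...) with .alias("mean"), where column_names is an assumed list such as ["mathematics_score", "science_score"].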

Mean of multiple columns in pyspark and appending to dataframe: Method 2

In Method 2 we again add the columns with the + operator and divide the result by the number of columns, but this time we use withColumn() so that the mean is appended to the dataframe as a new column.

### Mean of two or more columns in pyspark

from pyspark.sql.functions import col

df1=df_student_detail.withColumn("mean_of_col", (col("mathematics_score")+col("science_score"))/2)
df1.show()

Here we find the mean of the two columns “mathematics_score” and “science_score” and store the result in a new column named “mean_of_col”, which is appended to the resultant dataframe.
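
To make the per-row arithmetic concrete, here is a small plain-Python sketch (not pyspark) that mirrors what the appended column holds. The sample rows and the with_mean helper are invented for illustration and are not taken from df_student_detail.

```python
# Hypothetical rows standing in for a dataframe; the values are made up.
rows = [
    {"name": "A", "mathematics_score": 80, "science_score": 90},
    {"name": "B", "mathematics_score": 65, "science_score": 70},
]


def with_mean(row, columns, out="mean_of_col"):
    """Return a copy of the row with the mean of the given columns appended."""
    total = sum(row[c] for c in columns)
    return {**row, out: total / len(columns)}


result = [with_mean(r, ["mathematics_score", "science_score"]) for r in rows]
for r in result:
    print(r["name"], r["mean_of_col"])
# A 85.0
# B 67.5
```

Each original field is kept and one new field is added, which is the same shape of result that withColumn() produces on a dataframe.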
