Mean of two or more columns in pyspark

To calculate the mean of two or more columns in pyspark, we add the columns with the + operator and divide the sum by the number of columns. The first method does this inside select(), which returns the mean as a new dataframe containing only that column. The second method does the same arithmetic inside withColumn(), which appends the mean to the existing dataframe. Let's see an example of each.

  • mean of two or more columns in pyspark using + and select(), dividing by the number of columns
  • mean of two or more columns in pyspark appended to the dataframe using + and withColumn(), dividing by the number of columns

We will be using the dataframe df_student_detail.

Mean of two or more columns in pyspark : Method 1

  • In Method 1 we use the + operator to sum the columns and then divide the sum by the number of columns, which gives the mean. Because the expression is passed to select(), the result contains only the computed column.
### Mean of two or more columns in pyspark

from pyspark.sql.functions import col, lit

# Sum the two score columns, divide by 2 (the number of columns), and name the result "mean"
df1 = df_student_detail.select(
    ((col("mathematics_score") + col("science_score")) / lit(2)).alias("mean")
)
df1.show()

Mean of two or more columns in pyspark and appending to dataframe: Method 2

In Method 2 we again sum the columns with the + operator and divide the result by the number of columns, but we pass the expression to withColumn() so that the mean is appended to the dataframe as a new column, mean_of_col.

### Mean of two or more columns in pyspark

from pyspark.sql.functions import col

# Keep every existing column and append the row-wise mean as "mean_of_col"
df1 = df_student_detail.withColumn(
    "mean_of_col", (col("mathematics_score") + col("science_score")) / 2
)
df1.show()
