Calculate Percentage and cumulative percentage of column in pyspark

In order to calculate the percentage and cumulative percentage of a column in pyspark we will be using the sum() function and partitionBy(). We will explain how to get the percentage and cumulative percentage of a column by group in pyspark with an example.

  • Calculate percentage of column in pyspark: express each value of the column as a percentage of the column total
  • Calculate cumulative percentage of column in pyspark
  • Calculate cumulative percentage of column by group

We will use the dataframe named df_basket1.
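The original post shows df_basket1 only as a screenshot. As a stand-in, here is a minimal sketch that builds a comparable dataframe; the rows and the Item_name column are illustrative assumptions, and only the Item_group and Price columns are actually used by the code that follows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative rows (assumed); only Item_group and Price are required below
df_basket1 = spark.createDataFrame(
    [('Fruit', 'Apple', 100), ('Fruit', 'Banana', 200),
     ('Vegetable', 'Carrot', 150), ('Vegetable', 'Potato', 50)],
    ['Item_group', 'Item_name', 'Price'])
df_basket1.show()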


Calculate percentage of column in pyspark

The sum() function and partitionBy() are used to calculate the percentage of a column in pyspark:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# divide each Price by the total of the Price column; a window built with
# partitionBy() and no arguments spans the whole dataframe
df_percent = df_basket1.withColumn('price_percent', f.col('Price')/f.sum('Price').over(Window.partitionBy())*100)
df_percent.show()

We use the sum() function to total the Price column, with partitionBy() given no arguments so that the window spans the entire dataframe; dividing each Price by this total and multiplying by 100 gives the percentage of the column, as shown below.

[Output: df_percent showing the new price_percent column]
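If you prefer to avoid a window function for this step, the column total can be aggregated first and divided through as a literal. A minimal sketch (the total_price and df_percent_alt names are assumptions, not from the original post):

import pyspark.sql.functions as f

# aggregate the grand total once, then divide each Price by it
total_price = df_basket1.agg(f.sum('Price')).collect()[0][0]
df_percent_alt = df_basket1.withColumn('price_percent', f.col('Price')/f.lit(total_price)*100)
df_percent_alt.show()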


Calculate cumulative percentage of column in pyspark


The sum() function over a window frame running from the first row up to the current row is used to calculate the cumulative percentage of a column in pyspark:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# first compute each row's share of the total Price
df_percent = df_basket1.withColumn('price_percent', f.col('Price')/f.sum('Price').over(Window.partitionBy())*100)

# then running-sum that share from the first row to the current row;
# with an empty orderBy() the accumulation follows the dataframe's current row order
df_cum_percent = df_percent.withColumn('cum_percent', f.sum(df_percent.price_percent).over(Window.partitionBy().orderBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)))
df_cum_percent.show()

We use the sum() function over a partitionBy() window to express the Price column as a percentage, and name the result price_percent. We then take a running sum of the price_percent column to calculate the cumulative percentage of the column, as shown below.

[Output: df_cum_percent showing the price_percent and cum_percent columns]
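One caveat: with an empty orderBy() the running total accumulates in whatever order the rows happen to arrive, which is not guaranteed to be reproducible. A minimal sketch of a deterministic variant, assuming you want to accumulate in ascending Price order (the ordering column is an assumption):

# order the window explicitly so cum_percent is reproducible run to run
w_ordered = Window.partitionBy().orderBy('Price').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_cum_ordered = df_percent.withColumn('cum_percent', f.sum('price_percent').over(w_ordered))
df_cum_ordered.show()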


Calculate cumulative percentage of column by group in pyspark


The sum() function, with partitionBy() on the grouping column, is used to calculate the cumulative percentage of a column by group:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# each row's share of its Item_group total
df_percent = df_basket1.withColumn('price_percent', f.col('Price')/f.sum('Price').over(Window.partitionBy('Item_group'))*100)

# running total of that share within each Item_group
df_cum_percent_grp = df_percent.withColumn('cum_percent_grp', f.sum(df_percent.price_percent).over(Window.partitionBy('Item_group').orderBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)))
df_cum_percent_grp.show()

We use the sum() function over a window partitioned by Item_group to express the Price column as a percentage within its group, and name the result price_percent. We then take a running sum of price_percent within each group to calculate the cumulative percentage of the column by group, as shown below.

[Output: df_cum_percent_grp showing the price_percent and cum_percent_grp columns]
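As a quick sanity check (not part of the original post), the cumulative percentage should reach 100 on the last row of each group:

# the maximum cum_percent_grp per group should be 100.0, up to floating-point rounding
df_cum_percent_grp.groupBy('Item_group').agg(f.max('cum_percent_grp').alias('final_cum_percent')).show()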


Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, has worked extensively with programming languages like R, Python (Pandas), SAS, and Pyspark.