Frequency table or cross table in pyspark – 2 way cross table

A frequency table in pyspark can be calculated in a roundabout way, using groupBy() followed by count(). A cross table (two-way frequency table) can be calculated using the crosstab() function, or again with groupBy() on two columns. Let's get clarity with an example.

  • Calculate Frequency table in pyspark with example
  • Compute Cross table in pyspark with example – two way cross table / frequency table
  • Compute Cross table in pyspark using groupBy() function

We will be using the dataframe df_basket1.

[Figure 1: the df_basket1 dataframe]

Frequency table in pyspark:

A frequency table in pyspark can be calculated in a roundabout way using groupBy() and count(). The dataframe is grouped by the column “Item_group” and the number of rows in each group is counted, which gives the frequency of each “Item_group” value.

## Frequency table in pyspark
df_basket1.groupBy("Item_group").count().show()

The column name is passed to the groupBy() function, followed by the count() function as shown, which gives the frequency table.

[Figure 2: frequency table of Item_group]

Cross table in pyspark: Method 1

A cross table in pyspark can be calculated using the crosstab() function. crosstab() takes two column names as arguments and returns the two-way frequency table (cross table) of those two columns.

## Cross table in pyspark

df_basket1.crosstab('Item_group', 'price').show()

The cross table of “Item_group” and “price” is shown below.

[Figure 3: cross table of Item_group and price using crosstab()]

Cross table in pyspark: Method 2

A cross table in pyspark can also be calculated using the groupBy() function. groupBy() takes the two column names as arguments, and count() then gives the frequency of each combination of values, i.e. the two-way frequency table.

## Cross table in pyspark
df_basket1.groupBy("Item_group","price").count().show()

The cross table of the “Item_group” and “price” columns is shown below.

[Figure 4: cross table of Item_group and price using groupBy()]

Comparison of Two way cross table in Method 1 and Method 2:

[Figure: cross table output of Method 1 and Method 2]

Method 1 returns a wide format: the values of one column run along the rows, the values of the other run along the columns, and each cell holds the frequency. Method 2 returns a long format: each combination of the two values is represented as a row, with its frequency in the count column.


Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS, and PySpark.