PySpark

In this section we will explain PySpark concepts one by one. This set of topics is designed to make learning PySpark quick and easy. Let's get started.

PySpark tutorial

1) Simple random sampling and stratified sampling in pyspark – sample(), sampleBy()

 

Simple random sampling without replacement in pyspark

Syntax:

 sample(False, fraction, seed=None)

Returns a sampled subset of Dataframe without replacement.

Note: the fraction argument is not guaranteed to return exactly the specified fraction of rows of the dataframe

### Simple random sampling in pyspark

df_cars_sample = df_cars.sample(False, 0.5, 42)
df_cars_sample.show()

The resultant sample without replacement contains roughly 50% of the rows, with each row appearing at most once.

 

 

Simple random sampling with replacement

Syntax:

 sample(True, fraction, seed=None)

Returns a sampled subset of Dataframe with replacement.

### Simple random sampling in pyspark with replacement

df_cars_sample = df_cars.sample(True, 0.5, 42)
df_cars_sample.show()

The resultant sample with replacement contains roughly 50% of the rows, and the same row may appear more than once.

 

 

2) Join in pyspark (Merge) inner , outer, right , left join in pyspark

 

Inner join in pyspark with example

Inner join in pyspark is the simplest and most common type of join. It is also known as a simple join or natural join. An inner join returns the rows for which the matching condition is met.


### Inner join in pyspark

df_inner = df1.join(df2, on=['Roll_No'], how='inner')
df_inner.show()

The result of the inner join contains only the rows whose Roll_No appears in both dataframes.

 

 

Outer join in pyspark with example

An outer join in pyspark combines the results of both left and right outer joins. The joined table contains all records from both tables.


### Outer join in pyspark

df_outer = df1.join(df2, on=['Roll_No'], how='outer')
df_outer.show()

The result of the outer join contains all Roll_No values from both dataframes, with nulls where a match is missing.

 

 

Left join in pyspark with example

The LEFT JOIN in pyspark returns all records from the left dataframe (A), and the matched records from the right dataframe (B)


### Left join in pyspark

df_left = df1.join(df2, on=['Roll_No'], how='left')
df_left.show()

The result of the left join contains every row of df1, with the df2 columns filled in where Roll_No matches and null otherwise.

 

 

Right join in pyspark with example

The RIGHT JOIN in pyspark returns all records from the right dataframe (B), and the matched records from the left dataframe (A)


### Right join in pyspark

df_right = df1.join(df2, on=['Roll_No'], how='right')
df_right.show()

The result of the right join contains every row of df2, with the df1 columns filled in where Roll_No matches and null otherwise.

 

3) Get duplicate rows in pyspark

### Get Duplicate rows in pyspark

df1=df_basket1.groupBy("Item_group","Item_name","price").count().filter("count >1")
df1.drop('count').show()
  • First we do a groupBy().count() on all the columns, i.e. "Item_group", "Item_name" and "price".
  • Secondly we filter the rows with count greater than 1 and drop the helper count column.

4) Quantile rank, decile rank & n tile rank in pyspark – Rank by Group

In order to calculate the quantile rank, decile rank and n tile rank in pyspark we use the ntile() function. Passing 4 to ntile() gives the quantile rank of the column, and passing 10 gives the decile rank.
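
A minimal sketch of the idea, assuming a dataframe df with a mathematics_score column (both names are illustrative); ntile() is applied over a window ordered by that column:

### Quantile rank and decile rank of a column (sketch)

from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

w = Window.orderBy("mathematics_score")   # window ordered by the column to rank
df1 = df.withColumn("quantile_rank", ntile(4).over(w)) \
        .withColumn("decile_rank", ntile(10).over(w))
df1.show()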

5) Populate row number in pyspark – Row number by Group

In order to populate row number in pyspark we use the row_number() function. row_number() along with partitionBy() on another column populates the row number by group. Let's see an example on how to populate row number in pyspark.

  • Populate row number in pyspark
  • Populate row number in pyspark by group
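
A minimal sketch, assuming a dataframe df with name and grade columns (illustrative names); row_number() is applied over a window, with partitionBy() added for the by-group variant:

### Populate row number, overall and by group (sketch)

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df1 = df.withColumn("row_num", row_number().over(Window.orderBy("name")))
df2 = df.withColumn("row_num_by_group",
                    row_number().over(Window.partitionBy("grade").orderBy("name")))
df2.show()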

6) Percentile Rank of the column in pyspark

In order to calculate the percentile rank of a column in pyspark we use the percent_rank() function. percent_rank() along with partitionBy() on another column calculates the percentile rank of the column by group. Let's see an example on how to calculate the percentile rank of a column in pyspark.

  • Percentile Rank of the column in pyspark
  • Percentile Rank of the column by group in pyspark
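
A minimal sketch along the same lines, again assuming illustrative column names mathematics_score and grade:

### Percentile rank of a column, overall and by group (sketch)

from pyspark.sql.functions import percent_rank
from pyspark.sql.window import Window

df1 = df.withColumn("percent_rank",
                    percent_rank().over(Window.orderBy("mathematics_score")))
df2 = df.withColumn("percent_rank_by_group",
                    percent_rank().over(Window.partitionBy("grade").orderBy("mathematics_score")))
df2.show()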

 

7) Mean of two or more columns in pyspark

We use the simple + operator and divide the result by the number of columns to calculate the mean of two or more columns in pyspark, appending the result to the dataframe.

### Mean of two or more columns in pyspark

from pyspark.sql.functions import col

df1=df_student_detail.withColumn("mean_of_col", (col("mathematics_score")+col("science_score"))/2)
df1.show()


8) Sum of two or more columns in pyspark

We use the simple + operator to calculate the sum of two or more columns in pyspark, appending the result to the dataframe as a column named sum.

### Sum of two or more columns in pyspark

from pyspark.sql.functions import col

df1=df_student_detail.withColumn("sum", col("mathematics_score")+col("science_score"))
df1.show()


 

9) Row wise mean, sum, minimum and maximum in pyspark

In order to calculate the row wise mean, sum, minimum and maximum in pyspark, we use different functions. Row wise mean in pyspark is calculated in a roundabout way (summing the columns and dividing by their number). Row wise sum in pyspark is calculated using the sum() function, row wise minimum (min) using the least() function, and row wise maximum (max) using the greatest() function.
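
A minimal sketch using the df_student_detail dataframe from the previous sections; the row wise sum and mean are built with the + operator as in sections 7 and 8, and least()/greatest() give the row wise minimum and maximum:

### Row wise mean, sum, minimum and maximum (sketch)

from pyspark.sql.functions import col, least, greatest

df1 = df_student_detail \
    .withColumn("row_sum",  col("mathematics_score") + col("science_score")) \
    .withColumn("row_mean", (col("mathematics_score") + col("science_score")) / 2) \
    .withColumn("row_min",  least(col("mathematics_score"), col("science_score"))) \
    .withColumn("row_max",  greatest(col("mathematics_score"), col("science_score")))
df1.show()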

 

10) Rename column name in pyspark – Rename single and multiple column

Rename single column in pyspark

Syntax:

df.withColumnRenamed('old_name', 'new_name')

old_name – old column name
new_name – new column name to be replaced.
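
A minimal sketch, with the column names purely illustrative; several withColumnRenamed() calls can be chained to rename multiple columns:

### Rename single and multiple columns (sketch)

df1 = df.withColumnRenamed("mathematics_score", "math_score")

df2 = df.withColumnRenamed("mathematics_score", "math_score") \
        .withColumnRenamed("science_score", "sci_score")
df2.show()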

 

 

11) Typecast Integer to Decimal and Integer to float in Pyspark

In order to type cast an integer to decimal in pyspark we use the cast() function with DecimalType() as argument. To type cast an integer to float in pyspark we use the cast() function with FloatType() as argument. Let's see an example of type conversion, or casting, of an integer column to a decimal column and of an integer column to a float column in pyspark.

  • Type cast an integer column to decimal column in pyspark
  • Type cast an integer column to float column in pyspark
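
A minimal sketch, assuming an integer column mathematics_score (illustrative name); cast() takes a DecimalType() or FloatType() instance:

### Typecast integer column to decimal and to float (sketch)

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, FloatType

df1 = df.withColumn("score_decimal", col("mathematics_score").cast(DecimalType(10, 2))) \
        .withColumn("score_float",   col("mathematics_score").cast(FloatType()))
df1.printSchema()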

 

 

12) Get number of rows and number of columns of dataframe in pyspark

Count the number of rows in pyspark – Get number of rows

Syntax:

 df.count()

df – dataframe

dataframe.count() function counts the number of rows of dataframe.

 

Syntax:

 df.distinct().count()

df – dataframe

dataframe.distinct().count() counts the number of distinct rows of the dataframe.
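
A minimal sketch covering rows, distinct rows and number of columns (the column count uses len(df.columns), which is not shown in the syntax above):

### Number of rows, distinct rows and columns (sketch)

print(df.count())             # number of rows
print(df.distinct().count())  # number of distinct rows
print(len(df.columns))        # number of columns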

 

 

14) Get Absolute value of column in Pyspark

Get Absolute value in Pyspark:

abs() function in pyspark gets the absolute value

abs(-113.5)
abs(113.5)

So the Absolute value will be

113.5
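
For a dataframe column, the same idea is expressed with pyspark.sql.functions.abs; a minimal sketch, where the column name score_diff is hypothetical:

### Absolute value of a column (sketch)

from pyspark.sql.functions import abs, col   # note: this import shadows Python's built-in abs

df1 = df.withColumn("abs_score_diff", abs(col("score_diff")))   # score_diff is a hypothetical column
df1.show()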

 

 

 

 

15) Set Difference in Pyspark – Difference of two dataframe

Set difference in pyspark returns the rows that are in one dataframe but not in the other, i.e. the difference of two dataframes.


Syntax:

 df1.subtract(df2)

df1 – dataframe1
df2 – dataframe2

dataframe1.subtract(dataframe2) returns the rows of dataframe1 that are not present in dataframe2.

 

 

16) Union and union all of two dataframe in pyspark (row bind)

Union of two dataframes in pyspark concatenates the rows of the two dataframes (row bind).

pyspark union all: union all also concatenates the rows and does not remove duplicates. In the DataFrame API, unionAll() is a deprecated alias of union(), so duplicates have to be dropped explicitly with distinct() when a set-style union is wanted.

Let's discuss with an example.
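
A minimal sketch, assuming two dataframes df1 and df2 with identical schemas:

### Union and union all of two dataframes (sketch)

# union() appends the rows of df2 below df1; duplicates are kept
df_union_all = df1.union(df2)
df_union_all.show()

# drop duplicates explicitly if a set-style union is wanted
df_union = df1.union(df2).distinct()
df_union.show()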

 

 

 

17) Intersect of two dataframe in pyspark (two or more)

Intersect of two dataframes in pyspark can be accomplished using the intersect() function. Intersection in pyspark returns the common rows of two or more dataframes, removing duplicates after combining.

Intersect of two dataframes in pyspark performs a DISTINCT on the result set and returns the common rows of the two tables.

  • Intersect of two dataframe in pyspark
  • Intersect of two or more dataframe in pyspark – (more than two dataframe)

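
A minimal sketch, assuming dataframes df1, df2 and df3 with identical schemas; intersect() calls can be chained for more than two dataframes:

### Intersect of two or more dataframes (sketch)

df_intersect = df1.intersect(df2)                     # common rows of df1 and df2
df_intersect.show()

df_intersect_3 = df1.intersect(df2).intersect(df3)    # common rows of all three
df_intersect_3.show()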

 

 

18) Round up, Round down and Round off in pyspark – (Ceil & floor pyspark)

 

Round up or Ceil in pyspark using ceil() function

Syntax:

 ceil('colname1')

colname1 – Column name

ceil() Function in pyspark takes up the column name as argument and rounds up the column

 

 

Round down or Floor in pyspark using floor() function

Syntax:

 floor('colname1')

colname1 – Column name

floor() Function in pyspark takes up the column name as argument and rounds down the column
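
A minimal sketch, assuming a numeric column price (as in section 3); ceil() and floor() each return a new column that can be attached with withColumn():

### Round up and round down a column (sketch)

from pyspark.sql.functions import ceil, floor, col

df1 = df.withColumn("price_ceil",  ceil(col("price"))) \
        .withColumn("price_floor", floor(col("price")))
df1.show()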

 

19) Sort the dataframe in pyspark – Sort on single column & Multiple column

In order to sort the dataframe in pyspark we use the orderBy() function. orderBy() sorts the dataframe by a single column or by multiple columns, in either ascending or descending order. Let's see an example of each.

  • Sort the dataframe in pyspark by single column – ascending order
  • Sort the dataframe in pyspark by single column – descending order
  • Sort the dataframe in pyspark by multiple columns – ascending order
  • Sort the dataframe in pyspark by multiple columns – descending order

Syntax:

df.orderBy('colname1', 'colname2', ascending=False)

df – dataframe
colname1 – Column name
ascending = False  – sort by descending order
ascending= True – sort by ascending order

 

 

20) Drop rows in pyspark – drop rows with condition

In order to drop rows in pyspark we use different functions in different circumstances. Dropping rows with conditions in pyspark is accomplished by dropping NA rows, dropping duplicate rows and dropping rows that match specific conditions in a where clause. Let's see an example of each way of dropping rows in pyspark.

  • Drop rows with NA or missing values in pyspark
  • Drop duplicate rows in pyspark
  • Drop rows with conditions using where clause
  • Drop duplicate rows by a specific column
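
A minimal sketch of each case, assuming illustrative column names mathematics_score and name:

### Drop rows with missing values, duplicates and conditions (sketch)

from pyspark.sql.functions import col

df_no_na    = df.dropna()                                # drop rows with NA or missing values
df_dedup    = df.dropDuplicates()                        # drop duplicate rows
df_filtered = df.where(col("mathematics_score") > 50)    # keep only rows matching the condition
df_by_col   = df.dropDuplicates(["name"])                # drop duplicates by a specific column
df_by_col.show()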

 

 

 

 

 

 

21) Distinct value of a column in pyspark

In order to get the distinct values of a column in pyspark we use the select() and distinct() functions. Another way to get the distinct values of a column in pyspark is the dropDuplicates() function. Let's see an example of both.

  • Distinct value of a column in pyspark using distinct() function
### Get distinct value of column

df.select("name").distinct().show()
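
The dropDuplicates() alternative mentioned above looks much the same; a minimal sketch:

### Get distinct value of column using dropDuplicates()

df.select("name").dropDuplicates().show()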

 

 

 

22) Distinct value of dataframe in pyspark – drop duplicates

In order to get the distinct value of dataframe in pyspark we will be using distinct() function.

 

Get distinct value of dataframe in pyspark – distinct rows – Method 1

Syntax:

 df.distinct()

df – dataframe

dataframe.distinct() gets the distinct value of the dataframe in pyspark

 

 

23) Count of Missing (NaN,Na) and null values in Pyspark

Count of missing (NaN, Na) and null values in pyspark can be accomplished using the isnan() function and the isNull() function respectively, combined with count(). isnan() flags the missing (NaN) values of a column, and isNull() flags the null values; counting the flagged rows gives the number of missing and null values per column.
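
A minimal sketch, assuming a numeric column mathematics_score (illustrative name); nulls are counted per column, and isnan() is applied to the numeric column:

### Count of null and NaN values (sketch)

from pyspark.sql.functions import col, count, isnan, when

# null count for every column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# NaN count for a numeric column
df.select(count(when(isnan(col("mathematics_score")), True)).alias("nan_count")).show()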

 

24) Mean, Variance and standard deviation of column in Pyspark

Mean, variance and standard deviation of a column in pyspark can be accomplished using the agg() function with the column name followed by mean, variance or stddev, according to our need. Mean, variance and standard deviation of each group in pyspark can be calculated by using groupBy() along with agg(). We will see an example of each:

  • Mean of the column in pyspark with example
  • Variance of the column in pyspark with example
  • Standard deviation of column in pyspark with example
  • Mean of each group of dataframe in pyspark with example
  • Variance of each group of dataframe in pyspark with example
  • Standard deviation of each group of dataframe in pyspark with example
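
A minimal sketch, with mathematics_score and grade as illustrative column names:

### Mean, variance and standard deviation, overall and by group (sketch)

from pyspark.sql.functions import mean, variance, stddev

df.agg(mean("mathematics_score"), variance("mathematics_score"), stddev("mathematics_score")).show()

df.groupBy("grade").agg(mean("mathematics_score"),
                        variance("mathematics_score"),
                        stddev("mathematics_score")).show()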

 

25) Maximum or Minimum value of column in Pyspark

Maximum and minimum value of a column in pyspark can be accomplished using the agg() function with the column name followed by max or min, according to our need. Maximum or minimum value of each group in pyspark can be calculated by using groupBy() along with agg(). We will see an example of each:

  • Maximum value of the column in pyspark with example
  • Minimum value of the column in pyspark with example
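
A minimal sketch, again with illustrative column names; note that importing max and min from pyspark.sql.functions shadows the Python built-ins:

### Maximum and minimum of a column, overall and by group (sketch)

from pyspark.sql.functions import max, min   # shadows Python's built-in max/min

df.agg(max("mathematics_score"), min("mathematics_score")).show()
df.groupBy("grade").agg(max("mathematics_score"), min("mathematics_score")).show()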

 

 

 

 

26) Raised to power of column in pyspark – square, cube , square root and cube root in pyspark

 

Raising a column to a power in pyspark can be accomplished using the pow() function, with the column name as the first argument and the exponent as the second. We will see raised to power of a column in pyspark with an example:

  • Raised to power n of the column in pyspark with example
  • Square of the column in pyspark with example
  • Cube of the column in pyspark with example
  • Square root of the column in pyspark with example
  • Cube root of the column in pyspark with example

Syntax:

 pow(col1,n)

col1 – Column name
n – Raised power
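
A minimal sketch, with mathematics_score as an illustrative column; square root and cube root are obtained with fractional exponents:

### Square, cube, square root and cube root of a column (sketch)

from pyspark.sql.functions import pow, col   # shadows Python's built-in pow

df1 = df.withColumn("square",      pow(col("mathematics_score"), 2)) \
        .withColumn("cube",        pow(col("mathematics_score"), 3)) \
        .withColumn("square_root", pow(col("mathematics_score"), 0.5)) \
        .withColumn("cube_root",   pow(col("mathematics_score"), 1.0 / 3))
df1.show()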

 

 

27) Drop column in pyspark – drop single & multiple columns

Deleting or dropping a column in pyspark can be accomplished using the drop() function. drop() with a column name as argument drops that column from the dataframe. We will see how to:

  • Drop single column in pyspark with example
  • Drop multiple column in pyspark with example
  • Drop column like function in pyspark – drop similar column
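
A minimal sketch of each case, with illustrative column names; the "drop similar columns" case is sketched with a list comprehension over df.columns:

### Drop single, multiple and similarly named columns (sketch)

df1 = df.drop("science_score")                           # drop a single column
df2 = df.drop("science_score", "grade")                  # drop multiple columns
df3 = df.drop(*[c for c in df.columns if "score" in c])  # drop columns whose name contains "score"
df3.show()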

 

28) Subset or Filter data with multiple conditions in pyspark

In order to filter data with conditions in pyspark we will be using filter() function. filter() function  subsets or filters the data with single or multiple conditions in pyspark. Let’s get clarity with an example.

  • Subset or filter data with single condition
  • Subset or filter data with multiple conditions (multiple or condition in pyspark)
  • Subset or filter data with multiple conditions (multiple and condition in pyspark)
  • Subset or filter data with conditions using sql functions
  • Filter using Regular expression in pyspark
  • Filter starts with and ends with keyword in pyspark
  • Filter with null and non null values in pyspark
  • Filter with LIKE% and in operator in pyspark

 

Subset or filter data with single condition in pyspark

Subsetting or filtering data with a single condition in pyspark can be done using the filter() function with the condition inside it.

### subset with single condition

df.filter(df.mathematics_score > 50).show()
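
Sketches of the remaining cases listed above, assuming illustrative columns mathematics_score, science_score and name:

### Subset with multiple and other conditions (sketch)

df.filter((df.mathematics_score > 50) & (df.science_score > 50)).show()   # multiple AND conditions
df.filter((df.mathematics_score > 50) | (df.science_score > 50)).show()   # multiple OR conditions
df.filter(df.name.like("A%")).show()                                      # LIKE-style pattern
df.filter(df.name.rlike("^[A-C]")).show()                                 # regular expression
df.filter(df.science_score.isNotNull()).show()                            # non-null values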

 

 

 

29) Frequency table or cross table in pyspark – 2 way cross table

In order to calculate a frequency table or cross table in pyspark we use the crosstab() function. A frequency table in pyspark can be calculated in a roundabout way using groupBy().count(). A cross table in pyspark is calculated using crosstab(). Let's get clarity with an example.

  • Calculate Frequency table in pyspark with example
  • Compute Cross table in pyspark with example
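
A minimal sketch, with grade and gender as hypothetical categorical columns:

### Frequency table and 2 way cross table (sketch)

df.groupBy("grade").count().show()      # frequency table via groupBy().count()
df.crosstab("grade", "gender").show()   # 2 way cross table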

 

30) Groupby functions in pyspark (Aggregate functions) – Groupby count, Groupby sum, Groupby mean, Groupby min and Groupby max

Groupby functions in pyspark, also known as aggregate functions, are calculated using groupBy(). Groupby on a single column and on multiple columns is shown with an example of each. We will be using aggregate functions to get the groupby count, groupby mean, groupby sum, groupby min and groupby max of a dataframe in pyspark. Let's get clarity with an example.

  • Groupby count of dataframe in pyspark – Groupby single and multiple column
  • Groupby sum of dataframe in pyspark – Groupby single and multiple column
  • Groupby mean of dataframe in pyspark – Groupby single and multiple column
  • Groupby min of dataframe in pyspark – Groupby single and multiple column
  • Groupby max of dataframe in pyspark – Groupby single and multiple column
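
A minimal sketch that computes all five aggregates in one agg() call, with grade and mathematics_score as illustrative columns:

### Groupby count, sum, mean, min and max (sketch)

from pyspark.sql.functions import count, sum, mean, min, max   # these shadow Python built-ins

df.groupBy("grade").agg(
    count("mathematics_score").alias("count"),
    sum("mathematics_score").alias("sum"),
    mean("mathematics_score").alias("mean"),
    min("mathematics_score").alias("min"),
    max("mathematics_score").alias("max")
).show()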

 

 

 

31) Descriptive statistics or Summary Statistics of dataframe in pyspark

In order to calculate descriptive statistics or summary statistics of a dataframe in pyspark we use the describe() function. Descriptive statistics or summary statistics of a single column can also be calculated with describe(). Let's get clarity with an example.

  • Descriptive statistics or summary statistics of dataframe in pyspark
  • Descriptive statistics or summary statistics of a numeric column in pyspark
  • Descriptive statistics or summary statistics of a character column in pyspark

Descriptive statistics in pyspark generally give the following:

    • Count – Count of values of each column
    • Mean – Mean value of each column
    • Stddev – standard deviation of each column
    • Min – Minimum value of each column
    • Max – Maximum value of each column

Syntax:

df.describe()

    df – dataframe

 

 

32) Re arrange or re order column in pyspark

In order to re arrange or re order the columns in pyspark we use the select() function. To reorder the columns in ascending order of their names we use Python's sorted() function; to reorder them in descending order we pass reverse=True to sorted(). Let's get clarity with an example.

  • Re arrange the column in pyspark
  • Re order the column in pyspark in ascending order
  • Re order the column in pyspark in descending order
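
A minimal sketch; the explicit select() order is illustrative, and sorted() works on the plain Python list df.columns:

### Re arrange or re order columns (sketch)

df_explicit = df.select("name", "grade", "mathematics_score")   # explicit column order
df_asc      = df.select(*sorted(df.columns))                    # columns in ascending name order
df_desc     = df.select(*sorted(df.columns, reverse=True))      # columns in descending name order
df_desc.show()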

 

 

 

 

33) cumulative sum of column and group in pyspark

In order to calculate the cumulative sum of a column in pyspark we use the sum() function over a window defined with orderBy() (and partitionBy() for groups). To calculate the cumulative sum within a group in pyspark we mention the grouping column in partitionBy(). Let's get clarity with an example.

  • Calculate cumulative sum of column in pyspark
  • Calculate cumulative sum of group in pyspark
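
A minimal sketch using a window that runs from the first row up to the current row; name, grade and mathematics_score are illustrative columns:

### Cumulative sum of a column and within a group (sketch)

from pyspark.sql.functions import sum, col
from pyspark.sql.window import Window

w = Window.orderBy("name").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1 = df.withColumn("cumsum", sum(col("mathematics_score")).over(w))

w_grp = Window.partitionBy("grade").orderBy("name") \
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2 = df.withColumn("cumsum_by_group", sum(col("mathematics_score")).over(w_grp))
df2.show()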

 

 

 

34) Calculate Percentage and cumulative percentage of column in pyspark

In order to calculate the percentage and cumulative percentage of a column in pyspark we use the sum() window function together with partitionBy() and orderBy(). We will explain how to get the percentage and cumulative percentage of a column in pyspark with an example.

  • Calculate Percentage of column in pyspark
  • Calculate cumulative percentage of column in pyspark
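
A minimal sketch: the percentage divides each value by the window total, and the cumulative percentage is a running sum of those percentages (column names illustrative):

### Percentage and cumulative percentage of a column (sketch)

from pyspark.sql.functions import sum, col
from pyspark.sql.window import Window

w_all = Window.partitionBy()   # a single window over the whole dataframe
df1 = df.withColumn("percent",
                    col("mathematics_score") * 100 / sum(col("mathematics_score")).over(w_all))

w_cum = Window.orderBy("name").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2 = df1.withColumn("cum_percent", sum(col("percent")).over(w_cum))
df2.show()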

 

 

35) Select column in Pyspark (Select single & Multiple columns)

In order to select columns in pyspark we use the select() function. select() is used to select a single column or multiple columns in pyspark, and columns can also be selected by name pattern (like) or by regular expression. We will explain how to select columns in pyspark with an example.

  • Select single column in pyspark
  • Select multiple column in pyspark
  • Select column name like in pyspark
  • Select column name using regular expression in pyspark

Syntax:

df.select('colname1', 'colname2')

df – dataframe
colname1..n – column name

 

 

36) Get data type of column in Pyspark (single & Multiple columns)

In order to get the data type of a column in pyspark we use the dtypes attribute and the printSchema() function. We will explain how to get the data type of single and multiple columns in pyspark with an example.

  • Get data type of single column in pyspark
  • Get data type of multiple column in pyspark
  • Get data type of all the column in pyspark
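
A minimal sketch; dict(df.dtypes) gives a convenient name-to-type lookup for a single column (mathematics_score is illustrative):

### Get data type of all columns and of a single column (sketch)

df.printSchema()                              # data types of all columns
print(dict(df.dtypes)["mathematics_score"])   # data type of a single column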

 

37) Get List of columns and its data type in Pyspark

In order to get the list of columns and their data types in pyspark we use the dtypes attribute and the printSchema() function. We will explain how to get the list of column names and their data types in pyspark with an example.

  • Get List of columns in pyspark dataframe.
  • Get List of columns and its datatype
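
A minimal sketch:

### Get list of columns and their data types (sketch)

print(df.columns)   # list of column names
print(df.dtypes)    # list of (column name, data type) tuples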

 

 

 

38) Read CSV file in Pyspark and Convert to dataframe

In order to read a CSV file in pyspark and convert it to a dataframe, we import SQLContext (or, in recent Spark versions, use the SparkSession entry point). We will explain step by step how to read a CSV file and convert it to a dataframe in pyspark with an example.

We have used two methods to convert CSV to dataframe in Pyspark
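
A minimal sketch of one such method, using the SparkSession entry point (the current replacement for SQLContext); the file name cars.csv is hypothetical:

### Read a CSV file into a dataframe (sketch)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_csv_example").getOrCreate()

df_cars = spark.read.csv("cars.csv", header=True, inferSchema=True)
df_cars.show()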

 

 

 

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages and tools such as R, Python (Pandas), SAS and PySpark.