To drop rows in pyspark we use different functions for different circumstances. Dropping rows with a condition in pyspark is accomplished by dropping NA rows, dropping duplicate rows, or dropping rows that match specific conditions in a where clause. Let's see an example of each way of dropping rows in pyspark, including dropping with multiple conditions.
- Drop rows with NA or missing values in pyspark
- Drop rows with Null values using where condition in pyspark
- Drop duplicate rows in pyspark
- Drop Duplicate rows by keeping the first occurrence in pyspark
- Drop duplicate rows by keeping the last occurrence in pyspark
- Drop rows with conditions using where clause
- Drop duplicate rows by a specific column
We will be using the dataframe df_orders.
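For readers who want to follow along, here is a minimal sketch that builds a df_orders with the same columns used throughout this post; the values are assumed purely for illustration and will differ from the original dataset.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop_rows_example").getOrCreate()

# Assumed sample data -- the real df_orders has the same columns
# (order_no, cust_no, eno, received_date, Shipped_date) but different
# values; None marks a missing Shipped_date.
df_orders = spark.createDataFrame(
    [
        ("o1", 23510, 101, "2020-01-01", "2020-01-05"),
        ("o2", 23512, 102, "2020-01-02", None),
        ("o3", 23512, 102, "2020-01-02", None),
        ("o4", 23515, 103, "2020-01-03", "2020-01-07"),
    ],
    ["order_no", "cust_no", "eno", "received_date", "Shipped_date"],
)
df_orders.show()
```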
Drop rows with NA or missing values in pyspark: Method 1
Dropping rows with NA or missing values in pyspark is accomplished by using the dropna() function.
```
### Drop rows with NA or missing values in pyspark
df_orders1 = df_orders.dropna()
df_orders1.show()
```
NA or missing values in pyspark are dropped using the dropna() function.
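dropna() also takes optional how, thresh and subset arguments that control which rows count as droppable; a short sketch (the subset example uses the Shipped_date column from the example dataframe):

```
# Drop a row only when every column is null
df_orders.dropna(how="all").show()

# Keep only rows that have at least 4 non-null values
df_orders.dropna(thresh=4).show()

# Consider only Shipped_date when deciding which rows to drop
df_orders.dropna(subset=["Shipped_date"]).show()
```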
Drop rows with NA or missing values in pyspark: Method 2
Dropping rows with NA or missing values in pyspark is also accomplished by using the na.drop() function.
```
### Drop rows with NA or missing values in pyspark
df_orders.na.drop().show()
```
NA or missing values in pyspark are dropped using the na.drop() function.
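na.drop() accepts the same how, thresh and subset arguments as dropna(), so, for example, dropping on a single column can equally be written as:

```
# Drop rows where Shipped_date is null, via the na.drop() spelling
df_orders.na.drop(subset=["Shipped_date"]).show()
```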
Drop NULL rows with where condition in pyspark :
Dropping rows with null values in pyspark is accomplished by using the isNotNull() function along with the where condition; rows with non-null values are filtered using the where condition as shown below.
```
### Drop rows with Null values with where condition in pyspark
from pyspark.sql.functions import col

df_orders1 = df_orders.where(col('Shipped_date').isNotNull())
df_orders1.show()
```
Null values in pyspark are dropped using the isNotNull() function.
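The where condition extends to several columns by combining isNotNull() checks with &; a sketch using the example dataframe's two date columns:

```
from pyspark.sql.functions import col

# Keep only rows where both date columns are populated
df_orders2 = df_orders.where(
    col("received_date").isNotNull() & col("Shipped_date").isNotNull()
)
df_orders2.show()
```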
Drop duplicate rows in pyspark:
Duplicate rows of a dataframe in pyspark are dropped using the dropDuplicates() function.
```
#### Drop rows in pyspark – drop duplicate rows
df_orders1 = df_orders.dropDuplicates()
df_orders1.show()
```
dataframe.dropDuplicates() removes the duplicate rows of the dataframe.
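distinct() is equivalent to dropDuplicates() with no arguments, so the same result can also be obtained with:

```
# distinct() also removes fully duplicated rows
df_orders.distinct().show()
```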
Drop duplicate rows by a specific column
Duplicate rows are dropped by a specific column of the dataframe in pyspark using the dropDuplicates() function. dropDuplicates() with a column name passed as an argument will remove duplicate rows by that specific column.
```
#### Drop duplicate rows in pyspark by a specific column
df_orders.dropDuplicates(['cust_no']).show()
```
dataframe.dropDuplicates(['colname']) removes duplicate rows of the dataframe by a specific column.
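dropDuplicates() also accepts several column names, deduplicating on that combination of columns; for example:

```
# Remove rows that repeat the same (cust_no, eno) combination
df_orders.dropDuplicates(["cust_no", "eno"]).show()
```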
Drop duplicate rows by keeping the first duplicate occurrence in pyspark:
Dropping duplicates while keeping the first occurrence is accomplished by adding a new incremental column row_num and keeping the minimum row number after partitioning on the columns you are interested in (include all the columns used to identify duplicates, but not the row_num column itself).
```
### drop duplicates and keep first occurrence
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Number the rows within each received_date group; note that ordering by
# the partition column itself leaves the order inside each group arbitrary
df_orders1 = df_orders.select(
    "order_no", "cust_no", "eno", "received_date", "Shipped_date",
    F.row_number().over(
        Window.partitionBy("received_date").orderBy(df_orders["received_date"])
    ).alias("row_num"),
)

# Keep only the first numbered row of each group
df_orders1.filter(df_orders1.row_num == 1).show()
```
This drops the duplicates and keeps the first occurrence of each group.
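A common variant of this pattern, sketched below, partitions by the duplicate-defining column and orders by a separate tiebreaker column so that "first" is deterministic; here received_date is assumed to define the duplicate groups and order_no is an assumed tiebreaker:

```
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Assumption: rows sharing a received_date are duplicates, and order_no
# gives a deterministic "first" within each group.
w = Window.partitionBy("received_date").orderBy("order_no")
df_first = (
    df_orders
    .withColumn("row_num", F.row_number().over(w))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)
df_first.show()
```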
Drop duplicate rows by keeping the last duplicate occurrence in pyspark:
Dropping duplicates while keeping the last occurrence is accomplished by adding a new incremental column row_num and keeping the maximum row number after grouping on the columns you are interested in (include all the columns used to identify duplicates, but not the row_num column itself).
```
### drop duplicates and keep last occurrence
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Number the rows within each received_date group
df_orders1 = df_orders.select(
    "order_no", "cust_no", "eno", "received_date", "Shipped_date",
    F.row_number().over(
        Window.partitionBy("received_date").orderBy(df_orders["received_date"])
    ).alias("row_num"),
)

# The max row number per group corresponds to the last occurrence
df_orders1.groupBy("order_no", "cust_no", "eno", "received_date", "Shipped_date").max("row_num").show()
```
This drops the duplicates and keeps the last occurrence of each group.
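The same window pattern keeps the last occurrence directly if the tiebreaker is ordered descending, so that row_num == 1 lands on the last row of each group; as before, order_no as the tiebreaker is an assumption for illustration:

```
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Same assumptions as above; descending order makes row_num == 1
# the last occurrence within each received_date group.
w = Window.partitionBy("received_date").orderBy(F.col("order_no").desc())
df_last = (
    df_orders
    .withColumn("row_num", F.row_number().over(w))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)
df_last.show()
```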
Drop rows with conditions using where clause
Dropping rows with conditions in pyspark is accomplished by using the where() function. The condition identifying the rows to be dropped is specified inside the where clause.
```
#### Drop rows with conditions – where clause
df_orders1 = df_orders.where("cust_no != 23512")
df_orders1.show()
```
The resulting dataframe contains only the rows where cust_no is not 23512.
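where() also takes column expressions, so several drop conditions can be combined in one call; the second condition below is assumed purely for illustration:

```
from pyspark.sql.functions import col

# Drop rows where cust_no is 23512 or eno is below 102 (assumed cutoff)
df_orders1 = df_orders.where((col("cust_no") != 23512) & (col("eno") >= 102))
df_orders1.show()
```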
For other functions, refer to the cheatsheet.
Other Related Topics:
- Remove leading zero of column in pyspark
- Left and Right pad of column in pyspark –lpad() & rpad()
- Add Leading and Trailing space of column in pyspark – add space
- Remove Leading, Trailing and all space of column in pyspark – strip & trim space
- String split of the columns in pyspark
- Repeat the column in Pyspark
- Get Substring of the column in Pyspark
- Get String length of column in Pyspark
- Typecast string to date and date to string in Pyspark
- Typecast Integer to string and String to integer in Pyspark
- Extract First N and Last N character in pyspark
- Drop rows in pyspark – drop rows with condition
- Distinct value of a column in pyspark
- Distinct value of dataframe in pyspark – drop duplicates
- Count of Missing (NaN,Na) and null values in Pyspark
- Drop column in pyspark – drop single & multiple columns
- Convert to upper case, lower case and title case in pyspark
- Add leading zeros to the column in pyspark
- Concatenate two columns in pyspark