Drop rows in pyspark – drop rows with condition

In order to drop rows in pyspark we use different functions in different circumstances. Dropping rows with conditions in pyspark is accomplished by dropping NA rows, dropping duplicate rows, and dropping rows that match specific conditions in a where clause. Let’s see an example of each way of dropping rows in pyspark.

  • Drop rows with NA or missing values in pyspark
  • Drop duplicate rows in pyspark
  • Drop rows with conditions using where clause
  • Drop duplicate rows by a specific column

We will be using the dataframe df_orders, sketched below.
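The real contents of df_orders are not reproduced here, so the block below builds a small stand-in: the cust_no column is referenced later in this article, while order_no and order_amt (and all of the sample values) are assumptions made purely for illustration.

### Create an illustrative df_orders dataframe (assumed columns and values)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop_rows_demo").getOrCreate()

# order_no, order_amt and the values below are invented for illustration only
df_orders = spark.createDataFrame(
    [
        (1001, 23512, 250.0),
        (1001, 23512, 250.0),   # exact duplicate row
        (1002, 23512, 120.0),   # same customer, different order
        (1003, 41287, None),    # row with a missing value
        (1004, 51907, 310.0),
    ],
    ["order_no", "cust_no", "order_amt"],
)
df_orders.show()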


Drop rows with NA or missing values in pyspark

Dropping rows with NA or missing values in pyspark is accomplished by using the dropna() function.

### Drop rows with NA or missing values in pyspark

df_orders1 = df_orders.dropna()
df_orders1.show()

The resulting dataframe df_orders1 no longer contains any rows with NA or missing values.
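dropna() also accepts how, thresh and subset arguments to control which rows are removed. A few variants are sketched below; order_amt is the assumed column from the illustrative dataframe above.

#### dropna() variants – how, thresh and subset

df_orders.dropna(how="all").show()              # drop only rows where every column is null
df_orders.dropna(thresh=2).show()               # keep rows with at least 2 non-null values
df_orders.dropna(subset=["order_amt"]).show()   # check only the order_amt column (assumed column)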


Drop duplicate rows in pyspark

Duplicate rows of the dataframe in pyspark are dropped using the dropDuplicates() function.

#### Drop rows in pyspark – drop duplicate rows

df_orders1 = df_orders.dropDuplicates()
df_orders1.show()

dataframe.dropDuplicates() removes duplicate rows of the dataframe, keeping only one copy of each distinct row.
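Calling dropDuplicates() without arguments gives the same result as distinct(), so either form works here.

#### dropDuplicates() without arguments is equivalent to distinct()

df_orders.distinct().show()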


Drop duplicate rows by a specific column

Duplicate rows are dropped by a specific column of the dataframe in pyspark using the dropDuplicates() function. dropDuplicates() with a column name passed as an argument removes rows that are duplicates of each other in that specific column.

#### Drop duplicate rows in pyspark by a specific column

df_orders.dropDuplicates(['cust_no']).show()

dataframe.dropDuplicates(['colname']) removes duplicate rows of the dataframe by a specific column, keeping the first row encountered for each distinct value of that column.
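The same call accepts more than one column name, so rows can be de-duplicated on a combination of columns. In the sketch below, order_no is the assumed column from the illustrative dataframe above.

#### Drop duplicate rows by a combination of columns

df_orders.dropDuplicates(['cust_no', 'order_no']).show()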


Drop rows with conditions using where clause

Dropping rows with conditions in pyspark is accomplished by using the where() function. The where clause keeps only the rows that satisfy the given condition, so the rows to be dropped are the ones excluded by it.

#### Drop rows with conditions – where clause

df_orders1 = df_orders.where("cust_no != 23512")
df_orders1.show()

The resulting dataframe df_orders1 contains every row except those where cust_no is 23512.
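where() also accepts a Column expression instead of a SQL string, and filter() is an alias for where(). The sketch below combines two conditions; order_amt is the assumed column from the illustrative dataframe above.

#### Drop rows with multiple conditions – Column expression

from pyspark.sql.functions import col

# keep rows for other customers whose (assumed) order amount is above 100
df_orders1 = df_orders.where((col("cust_no") != 23512) & (col("order_amt") > 100))
df_orders1.show()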
