Extract First N rows & Last N rows in pyspark (Top N & Bottom N)

To extract the first N rows in pyspark we will use the show(), head() and take() functions. The number of rows is passed as an argument to each: head(n) and take(n) return the top N rows, while show(n) prints them. The first() function returns the first row of the dataframe. To extract the last N rows we will use roundabout methods, such as creating an index, sorting on it in reverse order and thereby extracting the bottom N rows. Let’s see how to

  • Extract First row of dataframe in pyspark – using first() function.
  • Get First N rows in pyspark – Top N rows in pyspark using show() function – (First 5 rows)
  • Get First N rows in pyspark – Top N rows in pyspark using head() and take() function
  • Fetch Last Row of the dataframe in pyspark
  • Extract Last N rows of the dataframe in pyspark – (Last 5 rows)

With an example for each

We will be using the dataframe named df_cars


Get First N rows in pyspark


Extract First row of dataframe in pyspark – using first() function:

dataframe.first() function returns the first row of the dataframe as a Row object

########## Extract first row of the dataframe in pyspark

df_cars.first()

So the first row of the “df_cars” dataframe is extracted


Extract First N rows in pyspark – Top N rows in pyspark using show() function:

dataframe.show(n) function takes the argument “n” and prints the first n rows of the dataframe as a formatted table. It displays the rows rather than returning them

########## Extract first N rows of the dataframe in pyspark – show()

df_cars.show(5)

So the first 5 rows of the “df_cars” dataframe are displayed


Extract First N rows in pyspark – Top N rows in pyspark using head() function:

dataframe.head(n) function takes the argument “n” and returns the first n rows of the dataframe as a list of Row objects

########## Extract first N rows of the dataframe in pyspark – head()

df_cars.head(3)

So the first 3 rows of the “df_cars” dataframe are extracted


Extract First N rows in pyspark – Top N rows in pyspark using take() function:

dataframe.take(n) function takes the argument “n” and returns the first n rows of the dataframe as a list of Row objects, the same result as head(n)

########## Extract first N rows of the dataframe in pyspark – take()

df_cars.take(2)

So the first 2 rows of the “df_cars” dataframe are extracted



Extract Last N rows in pyspark:


Extract Last row of dataframe in pyspark – using last() function:

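The last() function from pyspark.sql.functions is an aggregate that returns the last value of a column. One last() expression is built per column and stored in a list named “expr”, which is then unpacked as arguments to the agg() function as shown below. The result depends on the current row order, which may not be deterministic after a shuffle.

########## Extract last row of the dataframe in pyspark

from pyspark.sql import functions as F
# Build one last() aggregate per column, keeping the original column name
expr = [F.last(col).alias(col) for col in df_cars.columns]
df_cars.agg(*expr).show()

So the last row of the “df_cars” dataframe is extracted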


Get Last N rows in pyspark:

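Extracting the last N rows of the dataframe is accomplished in a roundabout way. The first step is to create an index using the monotonically_increasing_id() function, which generates IDs that are unique and increasing but not necessarily consecutive. The second step is to sort the dataframe in descending order of the index, so that the first N rows of the sorted result are the last N rows of the dataframe, as shown below.

########## Extract last N rows of the dataframe in pyspark

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import desc

# Add an increasing index that follows the current row order
df_cars_indexed = df_cars.withColumn("index", monotonically_increasing_id())
# Sort on the index in descending order, drop it, and show the top 5 rows
df_cars_indexed.orderBy(desc("index")).drop("index").show(5)

So the last 5 rows of the “df_cars” dataframe are displayed, starting from the last row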




Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS and Pyspark.