Extract First N rows & Last N rows in pyspark (Top N & Bottom N)

To extract the first N rows in pyspark we will be using the show(), head() and take() functions. The number of rows is passed as an argument to each of them: show(n) prints the first n rows, while head(n) and take(n) return them as a list of Row objects. The first() function returns the first row of the dataframe. To extract the last N rows we will use a roundabout method: create an index column, sort the dataframe in descending order of that index and keep the first n rows of the result. Let’s see how to:

Extract the first row of the dataframe in pyspark – using first() function
Get first N rows in pyspark – top N rows using show() function
Get first N rows in pyspark – top N rows using head() and take() functions
Fetch the last row of the dataframe in pyspark
Extract the last N rows of the dataframe in pyspark

With an example for each

We will be using the dataframe named df_cars


Get First N rows in pyspark


Extract First row of dataframe in pyspark – using first() function:

dataframe.first() returns the first row of the dataframe as a Row object.

########## Extract first row of the dataframe in pyspark

df_cars.first()

So the first row of the “df_cars” dataframe is extracted.

Extract First N rows in pyspark – Top N rows in pyspark using show() function:

dataframe.show(n) takes the argument “n” and prints the first n rows of the dataframe to the console. It does not return the rows.

########## Extract first N row of the dataframe in pyspark – show()

df_cars.show(5)

So the first 5 rows of the “df_cars” dataframe are displayed.


Extract First N rows in pyspark – Top N rows in pyspark using head() function:

dataframe.head(n) takes the argument “n” and returns the first n rows of the dataframe as a list of Row objects.

########## Extract first N row of the dataframe in pyspark – head()

df_cars.head(3)

So the first 3 rows of the “df_cars” dataframe are extracted.


Extract First N rows in pyspark – Top N rows in pyspark using take() function:

dataframe.take(n) takes the argument “n” and returns the first n rows of the dataframe as a list of Row objects.

########## Extract first N row of the dataframe in pyspark – take()

df_cars.take(2)

So the first 2 rows of the “df_cars” dataframe are extracted.


Extract Last N rows in pyspark:


Extract Last row of dataframe in pyspark – using last() function:

The last() function from pyspark.sql.functions is an aggregate function that returns the last value of a column. We build one last() expression per column, store the list in a variable named “expr” and pass it to the agg() function as shown below.

########## Extract last row of the dataframe in pyspark

from pyspark.sql import functions as F
expr = [F.last(col).alias(col) for col in df_cars.columns]
df_cars.agg(*expr).show()

So the last row of the “df_cars” dataframe is extracted.


Get Last N rows in pyspark:

Extracting the last N rows of the dataframe is accomplished in a roundabout way. The first step is to create an index using the monotonically_increasing_id() function. The ids it generates increase with row order but are not guaranteed to be consecutive. The second step is to sort the dataframe in descending order of the index, so that the first N rows of the sorted dataframe are the last N rows of the original, as shown below.

########## Extract last N rows of the dataframe in pyspark

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import desc

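# Add a temporary index column without overwriting df_cars
df_indexed = df_cars.withColumn("index", monotonically_increasing_id())
# Sort on the index in descending order, drop it and show the first 5 rows
df_indexed.orderBy(desc("index")).drop("index").show(5)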

So the last 5 rows of the “df_cars” dataframe are displayed, starting from the very last row.


Author

Sridhar Venkatachalam

Has close to 10 years of experience in data science and machine learning, and has worked extensively with programming languages such as R, Python (Pandas), SAS and Pyspark.