PySpark

In this section we will explain PySpark concepts one by one. This set of topics is designed to make learning PySpark quick and easy. Let's get started.

PySpark tutorial

1) Simple random sampling and stratified sampling in pyspark – sample(), sampleBy()

 

Simple random sampling without replacement in pyspark

Syntax:

 sample(False, fraction, seed=None)

Returns a sampled subset of Dataframe without replacement.

Note: the fraction argument is not guaranteed to return exactly the specified fraction of rows of the dataframe

### Simple random sampling in pyspark

df_cars_sample = df_cars.sample(False, 0.5, 42)
df_cars_sample.show()

The resultant sample without replacement contains roughly 50% of the rows, with each row appearing at most once.

 

 

Simple random sampling with replacement

Syntax:

 sample(True, fraction, seed=None)

Returns a sampled subset of Dataframe with replacement.

### Simple random sampling in pyspark with replacement

df_cars_sample = df_cars.sample(True, 0.5, 42)
df_cars_sample.show()

The resultant sample with replacement contains roughly 50% of the rows, and the same row may appear more than once.

 

 

2) Join in pyspark (Merge) inner , outer, right , left join in pyspark

 

Inner join in pyspark with example

Inner join in pyspark is the simplest and most common type of join. It is also known as a simple join or natural join. An inner join returns the rows for which the matching condition is met.


### Inner join in pyspark

df_inner = df1.join(df2, on=['Roll_No'], how='inner')
df_inner.show()

The result of the inner join contains only the rows whose Roll_No appears in both dataframes.

 

 

Outer join in pyspark with example

An outer join in pyspark combines the results of both left and right outer joins. The joined table contains all records from both tables.


### Outer join in pyspark

df_outer = df1.join(df2, on=['Roll_No'], how='outer')
df_outer.show()

The result of the outer join contains all Roll_No values from both dataframes, with nulls where a match is missing.

 

 

Left join in pyspark with example

The LEFT JOIN in pyspark returns all records from the left dataframe (A), and the matched records from the right dataframe (B)


### Left join in pyspark

df_left = df1.join(df2, on=['Roll_No'], how='left')
df_left.show()

The result of the left join contains every row of df1, with the df2 columns filled in where Roll_No matches and null otherwise.

 

 

Right join in pyspark with example

The RIGHT JOIN in pyspark returns all records from the right dataframe (B), and the matched records from the left dataframe (A)


### Right join in pyspark

df_right = df1.join(df2, on=['Roll_No'], how='right')
df_right.show()

The result of the right join contains every row of df2, with the df1 columns filled in where Roll_No matches and null otherwise.

 

3) Get duplicate rows in pyspark

### Get Duplicate rows in pyspark

df1=df_basket1.groupBy("Item_group","Item_name","price").count().filter("count >1")
df1.drop('count').show()
  • First we do a groupBy().count() on all the columns, i.e. "Item_group", "Item_name" and "price".
  • Secondly we filter the rows with count greater than 1 and drop the helper count column.

4) Quantile rank, decile rank & n tile rank in pyspark – Rank by Group

In order to calculate the quantile rank, decile rank and n tile rank in pyspark we use the ntile() function. Passing 4 to ntile() gives the quantile rank of the column, and passing 10 gives the decile rank.
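
A minimal sketch of the idea, assuming a dataframe df with a mathematics_score column (both names are illustrative); ntile() is applied over a window ordered by that column:

### Quantile rank and decile rank of a column (sketch)

from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

w = Window.orderBy("mathematics_score")   # window ordered by the column to rank
df1 = df.withColumn("quantile_rank", ntile(4).over(w)) \
        .withColumn("decile_rank", ntile(10).over(w))
df1.show()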

5) Populate row number in pyspark – Row number by Group

In order to populate row number in pyspark we use the row_number() function. row_number() along with partitionBy() on another column populates the row number by group. Let's see an example on how to populate row number in pyspark.

  • Populate row number in pyspark
  • Populate row number in pyspark by group
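
A minimal sketch, assuming a dataframe df with name and grade columns (illustrative names); row_number() is applied over a window, with partitionBy() added for the by-group variant:

### Populate row number, overall and by group (sketch)

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df1 = df.withColumn("row_num", row_number().over(Window.orderBy("name")))
df2 = df.withColumn("row_num_by_group",
                    row_number().over(Window.partitionBy("grade").orderBy("name")))
df2.show()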

6) Percentile Rank of the column in pyspark

In order to calculate the percentile rank of a column in pyspark we use the percent_rank() function. percent_rank() along with partitionBy() on another column calculates the percentile rank of the column by group. Let's see an example on how to calculate the percentile rank of a column in pyspark.

  • Percentile Rank of the column in pyspark
  • Percentile Rank of the column by group in pyspark
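
A minimal sketch along the same lines, again assuming illustrative column names mathematics_score and grade:

### Percentile rank of a column, overall and by group (sketch)

from pyspark.sql.functions import percent_rank
from pyspark.sql.window import Window

df1 = df.withColumn("percent_rank",
                    percent_rank().over(Window.orderBy("mathematics_score")))
df2 = df.withColumn("percent_rank_by_group",
                    percent_rank().over(Window.partitionBy("grade").orderBy("mathematics_score")))
df2.show()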

 

7) Mean of two or more columns in pyspark

We use the simple + operator and divide the result by the number of columns to calculate the mean of two or more columns in pyspark, appending the result to the dataframe.

### Mean of two or more columns in pyspark

from pyspark.sql.functions import col

df1=df_student_detail.withColumn("mean_of_col", (col("mathematics_score")+col("science_score"))/2)
df1.show()


8) Sum of two or more columns in pyspark

We use the simple + operator to calculate the sum of two or more columns in pyspark, appending the result to the dataframe as a column named sum.

### Sum of two or more columns in pyspark

from pyspark.sql.functions import col

df1=df_student_detail.withColumn("sum", col("mathematics_score")+col("science_score"))
df1.show()


 

9) Row wise mean, sum, minimum and maximum in pyspark

In order to calculate the row wise mean, sum, minimum and maximum in pyspark, we use different functions. Row wise mean in pyspark is calculated in a roundabout way (summing the columns and dividing by their number). Row wise sum in pyspark is calculated using the sum() function, row wise minimum (min) using the least() function, and row wise maximum (max) using the greatest() function.
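
A minimal sketch using the df_student_detail dataframe from the previous sections; the row wise sum and mean are built with the + operator as in sections 7 and 8, and least()/greatest() give the row wise minimum and maximum:

### Row wise mean, sum, minimum and maximum (sketch)

from pyspark.sql.functions import col, least, greatest

df1 = df_student_detail \
    .withColumn("row_sum",  col("mathematics_score") + col("science_score")) \
    .withColumn("row_mean", (col("mathematics_score") + col("science_score")) / 2) \
    .withColumn("row_min",  least(col("mathematics_score"), col("science_score"))) \
    .withColumn("row_max",  greatest(col("mathematics_score"), col("science_score")))
df1.show()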

 

10) Rename column name in pyspark – Rename single and multiple column

Rename single column in pyspark

Syntax:

df.withColumnRenamed('old_name', 'new_name')

old_name – old column name
new_name – new column name to be replaced.
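
A minimal sketch, with the column names purely illustrative; several withColumnRenamed() calls can be chained to rename multiple columns:

### Rename single and multiple columns (sketch)

df1 = df.withColumnRenamed("mathematics_score", "math_score")

df2 = df.withColumnRenamed("mathematics_score", "math_score") \
        .withColumnRenamed("science_score", "sci_score")
df2.show()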

 

 

11) Typecast Integer to Decimal and Integer to float in Pyspark

In order to type cast an integer to decimal in pyspark we use the cast() function with DecimalType() as argument. To type cast an integer to float in pyspark we use the cast() function with FloatType() as argument. Let's see an example of type conversion, or casting, of an integer column to a decimal column and of an integer column to a float column in pyspark.

  • Type cast an integer column to decimal column in pyspark
  • Type cast an integer column to float column in pyspark
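
A minimal sketch, assuming an integer column mathematics_score (illustrative name); cast() takes a DecimalType() or FloatType() instance:

### Typecast integer column to decimal and to float (sketch)

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, FloatType

df1 = df.withColumn("score_decimal", col("mathematics_score").cast(DecimalType(10, 2))) \
        .withColumn("score_float",   col("mathematics_score").cast(FloatType()))
df1.printSchema()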

 

 

12) Get number of rows and number of columns of dataframe in pyspark

Count the number of rows in pyspark – Get number of rows

Syntax:

 df.count()

df – dataframe

dataframe.count() function counts the number of rows of dataframe.

 

Syntax:

 df.distinct().count()

df – dataframe

dataframe.distinct().count() counts the number of distinct rows of the dataframe.
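
A minimal sketch covering rows, distinct rows and number of columns (the column count uses len(df.columns), which is not shown in the syntax above):

### Number of rows, distinct rows and columns (sketch)

print(df.count())             # number of rows
print(df.distinct().count())  # number of distinct rows
print(len(df.columns))        # number of columns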

 

 

14) Get Absolute value of column in Pyspark

Get Absolute value in Pyspark:

abs() function in pyspark gets the absolute value

abs(-113.5)
abs(113.5)

So the Absolute value will be

113.5
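
For a dataframe column, the same idea is expressed with pyspark.sql.functions.abs; a minimal sketch, where the column name score_diff is hypothetical:

### Absolute value of a column (sketch)

from pyspark.sql.functions import abs, col   # note: this import shadows Python's built-in abs

df1 = df.withColumn("abs_score_diff", abs(col("score_diff")))   # score_diff is a hypothetical column
df1.show()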

 

 

 

 

15) Set Difference in Pyspark – Difference of two dataframe

Set difference in pyspark returns the rows that are in one dataframe but not in the other, i.e. the difference of two dataframes.


Syntax:

 df1.subtract(df2)

df1 – dataframe1
df2 – dataframe2

dataframe1.subtract(dataframe2) returns the rows of dataframe1 that are not present in dataframe2.

 

 

16) Union and union all of two dataframe in pyspark (row bind)

Union of two dataframes in pyspark concatenates the rows of the two dataframes (row bind).

pyspark union all: union all also concatenates the rows and does not remove duplicates. In the DataFrame API, unionAll() is a deprecated alias of union(), so duplicates have to be dropped explicitly with distinct() when a set-style union is wanted.

Let's discuss with an example.
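
A minimal sketch, assuming two dataframes df1 and df2 with identical schemas:

### Union and union all of two dataframes (sketch)

# union() appends the rows of df2 below df1; duplicates are kept
df_union_all = df1.union(df2)
df_union_all.show()

# drop duplicates explicitly if a set-style union is wanted
df_union = df1.union(df2).distinct()
df_union.show()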

 

 

 

17) Intersect of two dataframe in pyspark (two or more)

Intersect of two dataframes in pyspark can be accomplished using the intersect() function. Intersection in pyspark returns the common rows of two or more dataframes, removing duplicates after combining.

Intersect of two dataframes in pyspark performs a DISTINCT on the result set and returns the common rows of the two tables.

  • Intersect of two dataframe in pyspark
  • Intersect of two or more dataframe in pyspark – (more than two dataframe)

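
A minimal sketch, assuming dataframes df1, df2 and df3 with identical schemas; intersect() calls can be chained for more than two dataframes:

### Intersect of two or more dataframes (sketch)

df_intersect = df1.intersect(df2)                     # common rows of df1 and df2
df_intersect.show()

df_intersect_3 = df1.intersect(df2).intersect(df3)    # common rows of all three
df_intersect_3.show()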

 

 

18) Round up, Round down and Round off in pyspark – (Ceil & floor pyspark)

 

Round up or Ceil in pyspark using ceil() function

Syntax:

 ceil('colname1')

colname1 – Column name

ceil() Function in pyspark takes up the column name as argument and rounds up the column

 

 

Round down or Floor in pyspark using floor() function

Syntax:

 floor('colname1')

colname1 – Column name

floor() Function in pyspark takes up the column name as argument and rounds down the column
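
A minimal sketch, assuming a numeric column price (as in section 3); ceil() and floor() each return a new column that can be attached with withColumn():

### Round up and round down a column (sketch)

from pyspark.sql.functions import ceil, floor, col

df1 = df.withColumn("price_ceil",  ceil(col("price"))) \
        .withColumn("price_floor", floor(col("price")))
df1.show()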

 

19) Sort the dataframe in pyspark – Sort on single column & Multiple column

In order to sort the dataframe in pyspark we use the orderBy() function. orderBy() sorts the dataframe by a single column or by multiple columns, in either ascending or descending order. Let's see an example of each.

  • Sort the dataframe in pyspark by single column – ascending order
  • Sort the dataframe in pyspark by single column – descending order
  • Sort the dataframe in pyspark by multiple columns – ascending order
  • Sort the dataframe in pyspark by multiple columns – descending order

Syntax:

df.orderBy('colname1', 'colname2', ascending=False)

df – dataframe
colname1 – Column name
ascending = False  – sort by descending order
ascending= True – sort by ascending order

 

 

20) Drop rows in pyspark – drop rows with condition

In order to drop rows in pyspark we use different functions in different circumstances. Dropping rows with conditions in pyspark is accomplished by dropping NA rows, dropping duplicate rows and dropping rows that match specific conditions in a where clause. Let's see an example of each way of dropping rows in pyspark.

  • Drop rows with NA or missing values in pyspark
  • Drop duplicate rows in pyspark
  • Drop rows with conditions using where clause
  • Drop duplicate rows by a specific column
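
A minimal sketch of each case, assuming illustrative column names mathematics_score and name:

### Drop rows with missing values, duplicates and conditions (sketch)

from pyspark.sql.functions import col

df_no_na    = df.dropna()                                # drop rows with NA or missing values
df_dedup    = df.dropDuplicates()                        # drop duplicate rows
df_filtered = df.where(col("mathematics_score") > 50)    # keep only rows matching the condition
df_by_col   = df.dropDuplicates(["name"])                # drop duplicates by a specific column
df_by_col.show()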

 

 

 

 

 

 

21) Distinct value of a column in pyspark

In order to get the distinct values of a column in pyspark we use the select() and distinct() functions. Another way to get the distinct values of a column in pyspark is the dropDuplicates() function. Let's see an example of both.

  • Distinct value of a column in pyspark using distinct() function
### Get distinct value of column

df.select("name").distinct().show()
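
The dropDuplicates() alternative mentioned above looks much the same; a minimal sketch:

### Get distinct value of column using dropDuplicates()

df.select("name").dropDuplicates().show()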

 

 

 

22) Distinct value of dataframe in pyspark – drop duplicates

In order to get the distinct value of dataframe in pyspark we will be using distinct() function.

 

Get distinct value of dataframe in pyspark – distinct rows – Method 1

Syntax:

 df.distinct()

df – dataframe

dataframe.distinct() gets the distinct value of the dataframe in pyspark

 

 

23) Count of Missing (NaN,Na) and null values in Pyspark

Count of missing (NaN, Na) and null values in pyspark can be accomplished using the isnan() function and the isNull() function respectively, combined with count(). isnan() flags the missing (NaN) values of a column, and isNull() flags the null values; counting the flagged rows gives the number of missing and null values per column.
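
A minimal sketch, assuming a numeric column mathematics_score (illustrative name); nulls are counted per column, and isnan() is applied to the numeric column:

### Count of null and NaN values (sketch)

from pyspark.sql.functions import col, count, isnan, when

# null count for every column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# NaN count for a numeric column
df.select(count(when(isnan(col("mathematics_score")), True)).alias("nan_count")).show()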

 

24) Mean, Variance and standard deviation of column in Pyspark

Mean, variance and standard deviation of a column in pyspark can be accomplished using the agg() function with the column name followed by mean, variance or stddev, according to our need. Mean, variance and standard deviation of each group in pyspark can be calculated by using groupBy() along with agg(). We will see an example of each:

  • Mean of the column in pyspark with example
  • Variance of the column in pyspark with example
  • Standard deviation of column in pyspark with example
  • Mean of each group of dataframe in pyspark with example
  • Variance of each group of dataframe in pyspark with example
  • Standard deviation of each group of dataframe in pyspark with example
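
A minimal sketch, with mathematics_score and grade as illustrative column names:

### Mean, variance and standard deviation, overall and by group (sketch)

from pyspark.sql.functions import mean, variance, stddev

df.agg(mean("mathematics_score"), variance("mathematics_score"), stddev("mathematics_score")).show()

df.groupBy("grade").agg(mean("mathematics_score"),
                        variance("mathematics_score"),
                        stddev("mathematics_score")).show()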

 

25) Maximum or Minimum value of column in Pyspark

Maximum and minimum value of a column in pyspark can be accomplished using the agg() function with the column name followed by max or min, according to our need. Maximum or minimum value of each group in pyspark can be calculated by using groupBy() along with agg(). We will see an example of each:

  • Maximum value of the column in pyspark with example
  • Minimum value of the column in pyspark with example
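
A minimal sketch, again with illustrative column names; note that importing max and min from pyspark.sql.functions shadows the Python built-ins:

### Maximum and minimum of a column, overall and by group (sketch)

from pyspark.sql.functions import max, min   # shadows Python's built-in max/min

df.agg(max("mathematics_score"), min("mathematics_score")).show()
df.groupBy("grade").agg(max("mathematics_score"), min("mathematics_score")).show()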

 

 

 

 

26) Raised to power of column in pyspark – square, cube , square root and cube root in pyspark

 

Raising a column to a power in pyspark can be accomplished using the pow() function, with the column name as the first argument and the exponent as the second. We will see raised to power of a column in pyspark with an example:

  • Raised to power n of the column in pyspark with example
  • Square of the column in pyspark with example
  • Cube of the column in pyspark with example
  • Square root of the column in pyspark with example
  • Cube root of the column in pyspark with example

Syntax:

 pow(col1,n)

col1 – Column name
n – Raised power
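
A minimal sketch, with mathematics_score as an illustrative column; square root and cube root are obtained with fractional exponents:

### Square, cube, square root and cube root of a column (sketch)

from pyspark.sql.functions import pow, col   # shadows Python's built-in pow

df1 = df.withColumn("square",      pow(col("mathematics_score"), 2)) \
        .withColumn("cube",        pow(col("mathematics_score"), 3)) \
        .withColumn("square_root", pow(col("mathematics_score"), 0.5)) \
        .withColumn("cube_root",   pow(col("mathematics_score"), 1.0 / 3))
df1.show()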

 

 

27) Drop column in pyspark – drop single & multiple columns

Deleting or dropping a column in pyspark can be accomplished using the drop() function. drop() with a column name as argument drops that column from the dataframe. We will see how to:

  • Drop single column in pyspark with example
  • Drop multiple column in pyspark with example
  • Drop column like function in pyspark – drop similar column
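
A minimal sketch of each case, with illustrative column names; the "drop similar columns" case is sketched with a list comprehension over df.columns:

### Drop single, multiple and similarly named columns (sketch)

df1 = df.drop("science_score")                           # drop a single column
df2 = df.drop("science_score", "grade")                  # drop multiple columns
df3 = df.drop(*[c for c in df.columns if "score" in c])  # drop columns whose name contains "score"
df3.show()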

 

28) Subset or Filter data with multiple conditions in pyspark

In order to filter data with conditions in pyspark we will be using filter() function. filter() function  subsets or filters the data with single or multiple conditions in pyspark. Let’s get clarity with an example.

  • Subset or filter data with single condition
  • Subset or filter data with multiple conditions (multiple or condition in pyspark)
  • Subset or filter data with multiple conditions (multiple and condition in pyspark)
  • Subset or filter data with conditions using sql functions
  • Filter using Regular expression in pyspark
  • Filter starts with and ends with keyword in pyspark
  • Filter with null and non null values in pyspark
  • Filter with LIKE% and in operator in pyspark

 

Subset or filter data with single condition in pyspark

Subsetting or filtering data with a single condition in pyspark can be done using the filter() function with the condition inside it.

### subset with single condition

df.filter(df.mathematics_score > 50).show()
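
Sketches of the remaining cases listed above, assuming illustrative columns mathematics_score, science_score and name:

### Subset with multiple and other conditions (sketch)

df.filter((df.mathematics_score > 50) & (df.science_score > 50)).show()   # multiple AND conditions
df.filter((df.mathematics_score > 50) | (df.science_score > 50)).show()   # multiple OR conditions
df.filter(df.name.like("A%")).show()                                      # LIKE-style pattern
df.filter(df.name.rlike("^[A-C]")).show()                                 # regular expression
df.filter(df.science_score.isNotNull()).show()                            # non-null values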

 

 

 

29) Frequency table or cross table in pyspark – 2 way cross table

In order to calculate a frequency table or cross table in pyspark we use the crosstab() function. A frequency table in pyspark can be calculated in a roundabout way using groupBy().count(). A cross table in pyspark is calculated using crosstab(). Let's get clarity with an example.

  • Calculate Frequency table in pyspark with example
  • Compute Cross table in pyspark with example
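
A minimal sketch, with grade and gender as hypothetical categorical columns:

### Frequency table and 2 way cross table (sketch)

df.groupBy("grade").count().show()      # frequency table via groupBy().count()
df.crosstab("grade", "gender").show()   # 2 way cross table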

 

30) Groupby functions in pyspark (Aggregate functions) – Groupby count, Groupby sum, Groupby mean, Groupby min and Groupby max

Groupby functions in pyspark, also known as aggregate functions, are calculated using groupBy(). Groupby on a single column and on multiple columns is shown with an example of each. We will be using aggregate functions to get the groupby count, groupby mean, groupby sum, groupby min and groupby max of a dataframe in pyspark. Let's get clarity with an example.

  • Groupby count of dataframe in pyspark – Groupby single and multiple column
  • Groupby sum of dataframe in pyspark – Groupby single and multiple column
  • Groupby mean of dataframe in pyspark – Groupby single and multiple column
  • Groupby min of dataframe in pyspark – Groupby single and multiple column
  • Groupby max of dataframe in pyspark – Groupby single and multiple column
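
A minimal sketch that computes all five aggregates in one agg() call, with grade and mathematics_score as illustrative columns:

### Groupby count, sum, mean, min and max (sketch)

from pyspark.sql.functions import count, sum, mean, min, max   # these shadow Python built-ins

df.groupBy("grade").agg(
    count("mathematics_score").alias("count"),
    sum("mathematics_score").alias("sum"),
    mean("mathematics_score").alias("mean"),
    min("mathematics_score").alias("min"),
    max("mathematics_score").alias("max")
).show()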

 

 

 

31) Descriptive statistics or Summary Statistics of dataframe in pyspark

In order to calculate descriptive statistics or summary statistics of a dataframe in pyspark we use the describe() function. Descriptive statistics or summary statistics of a single column can also be calculated with describe(). Let's get clarity with an example.

  • Descriptive statistics or summary statistics of dataframe in pyspark
  • Descriptive statistics or summary statistics of a numeric column in pyspark
  • Descriptive statistics or summary statistics of a character column in pyspark

Descriptive statistics in pyspark generally give the following:

    • Count – Count of values of each column
    • Mean – Mean value of each column
    • Stddev – standard deviation of each column
    • Min – Minimum value of each column
    • Max – Maximum value of each column

Syntax:

df.describe()

    df – dataframe

 

 

32) Re arrange or re order column in pyspark

In order to re arrange or re order the columns in pyspark we use the select() function. To reorder the columns in ascending order of their names we use Python's sorted() function; to reorder them in descending order we pass reverse=True to sorted(). Let's get clarity with an example.

  • Re arrange the column in pyspark
  • Re order the column in pyspark in ascending order
  • Re order the column in pyspark in descending order
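
A minimal sketch; the explicit select() order is illustrative, and sorted() works on the plain Python list df.columns:

### Re arrange or re order columns (sketch)

df_explicit = df.select("name", "grade", "mathematics_score")   # explicit column order
df_asc      = df.select(*sorted(df.columns))                    # columns in ascending name order
df_desc     = df.select(*sorted(df.columns, reverse=True))      # columns in descending name order
df_desc.show()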

 

 

 

 

33) cumulative sum of column and group in pyspark

In order to calculate the cumulative sum of a column in pyspark we use the sum() function over a window defined with orderBy() (and partitionBy() for groups). To calculate the cumulative sum within a group in pyspark we mention the grouping column in partitionBy(). Let's get clarity with an example.

  • Calculate cumulative sum of column in pyspark
  • Calculate cumulative sum of group in pyspark
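
A minimal sketch using a window that runs from the first row up to the current row; name, grade and mathematics_score are illustrative columns:

### Cumulative sum of a column and within a group (sketch)

from pyspark.sql.functions import sum, col
from pyspark.sql.window import Window

w = Window.orderBy("name").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1 = df.withColumn("cumsum", sum(col("mathematics_score")).over(w))

w_grp = Window.partitionBy("grade").orderBy("name") \
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2 = df.withColumn("cumsum_by_group", sum(col("mathematics_score")).over(w_grp))
df2.show()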

 

 

 

34) Calculate Percentage and cumulative percentage of column in pyspark

In order to calculate the percentage and cumulative percentage of a column in pyspark we use the sum() window function together with partitionBy() and orderBy(). We will explain how to get the percentage and cumulative percentage of a column in pyspark with an example.

  • Calculate Percentage of column in pyspark
  • Calculate cumulative percentage of column in pyspark
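
A minimal sketch: the percentage divides each value by the window total, and the cumulative percentage is a running sum of those percentages (column names illustrative):

### Percentage and cumulative percentage of a column (sketch)

from pyspark.sql.functions import sum, col
from pyspark.sql.window import Window

w_all = Window.partitionBy()   # a single window over the whole dataframe
df1 = df.withColumn("percent",
                    col("mathematics_score") * 100 / sum(col("mathematics_score")).over(w_all))

w_cum = Window.orderBy("name").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2 = df1.withColumn("cum_percent", sum(col("percent")).over(w_cum))
df2.show()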

 

 

35) Select column in Pyspark (Select single & Multiple columns)

In order to select columns in pyspark we use the select() function. select() is used to select a single column or multiple columns in pyspark, and columns can also be selected by name pattern (like) or by regular expression. We will explain how to select columns in pyspark with an example.

  • Select single column in pyspark
  • Select multiple column in pyspark
  • Select column name like in pyspark
  • Select column name using regular expression in pyspark

Syntax:

df.select('colname1', 'colname2')

df – dataframe
colname1..n – column name

 

 

36) Get data type of column in Pyspark (single & Multiple columns)

In order to get the data type of a column in pyspark we use the dtypes attribute and the printSchema() function. We will explain how to get the data type of single and multiple columns in pyspark with an example.

  • Get data type of single column in pyspark
  • Get data type of multiple column in pyspark
  • Get data type of all the column in pyspark
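
A minimal sketch; dict(df.dtypes) gives a convenient name-to-type lookup for a single column (mathematics_score is illustrative):

### Get data type of all columns and of a single column (sketch)

df.printSchema()                              # data types of all columns
print(dict(df.dtypes)["mathematics_score"])   # data type of a single column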

 

37) Get List of columns and its data type in Pyspark

In order to get the list of columns and their data types in pyspark we use the dtypes attribute and the printSchema() function. We will explain how to get the list of column names and their data types in pyspark with an example.

  • Get List of columns in pyspark dataframe.
  • Get List of columns and its datatype
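
A minimal sketch:

### Get list of columns and their data types (sketch)

print(df.columns)   # list of column names
print(df.dtypes)    # list of (column name, data type) tuples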

 

 

 

38) Read CSV file in Pyspark and Convert to dataframe

In order to read a CSV file in pyspark and convert it to a dataframe, we import SQLContext (or, in recent Spark versions, use the SparkSession entry point). We will explain step by step how to read a CSV file and convert it to a dataframe in pyspark with an example.

We have used two methods to convert CSV to dataframe in Pyspark
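
A minimal sketch of one such method, using the SparkSession entry point (the current replacement for SQLContext); the file name cars.csv is hypothetical:

### Read a CSV file into a dataframe (sketch)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_csv_example").getOrCreate()

df_cars = spark.read.csv("cars.csv", header=True, inferSchema=True)
df_cars.show()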

 

 

 

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages and tools such as R, Python (Pandas), SAS and PySpark.