Pyspark String

In this section we explain pyspark string concepts one by one. This set of learning exercises is designed to make pyspark string handling quick and easy to learn. Let's get started.


Remove leading zero of column in pyspark

To remove leading zeros from a column in pyspark, we use the regexp_replace() function with a pattern that matches the consecutive zeros at the start of the string. Let's see an example of how to remove leading zeros from a column in pyspark.

  • Remove Leading Zero of column in pyspark

We will be using a dataframe df that has a string column grad_Score.

 

Remove leading zero of column in pyspark

We pass the column name, a regular expression and a replacement string to regexp_replace(). The pattern matches the run of zeros at the start of the value, and that run is replaced with an empty string.

### Remove leading zero of column in pyspark
import pyspark.sql.functions as F

# '^0+' matches one or more zeros anchored at the start of the string
df = df.withColumn('grad_Score_new', F.regexp_replace('grad_Score', r'^0+', ''))

The resulting dataframe has a new column grad_Score_new that holds the values with the leading zeros removed.

 

Left and Right pad of column in pyspark – lpad() & rpad()

To pad the left side of a column in pyspark we use the lpad() function, and to pad the right side of a column we use the rpad() function. Let's see how to

  • Left pad of the column in pyspark – lpad()
  • Right pad of the column in pyspark – rpad()

We will be using dataframe df_states, which has a string column state_name.

 

 

Add left pad of the column in pyspark

Left padding is done with the lpad() function, which takes the column, the target length and the padding string as arguments. In our case we use the state_name column and "#" as the padding string, so each value is padded on the left until it is 14 characters long. Values longer than 14 characters are truncated to 14.

### Add Left pad of the column in pyspark
from pyspark.sql.functions import lpad

# pad state_name on the left with '#' up to a total width of 14 characters
df_states = df_states.withColumn('states_Name_new', lpad(df_states.state_name, 14, '#'))
df_states.show(truncate=False)

The resulting dataframe has a new column states_Name_new that holds the left-padded values.

 

Add Right pad of the column in pyspark

Right padding is done with the rpad() function, which takes the column, the target length and the padding string as arguments. In our case we use the state_name column and "#" as the padding string, so each value is padded on the right until it is 14 characters long. Values longer than 14 characters are truncated to 14.

### Add Right pad of the column in pyspark
from pyspark.sql.functions import rpad

# pad state_name on the right with '#' up to a total width of 14 characters
df_states = df_states.withColumn('states_Name_new', rpad(df_states.state_name, 14, '#'))
df_states.show(truncate=False)

The resulting dataframe has a new column states_Name_new that holds the right-padded values.

 

Add Leading and Trailing space of column in pyspark – add space

To add leading space to a column in pyspark we use left padding with a space. To add trailing space we use right padding with a space. To add both leading and trailing space we apply both padding functions. Let's see how to

  • Add leading space of the column in pyspark
  • Add trailing space of the column in pyspark
  • Add both leading and trailing space of the column in pyspark

 

Remove Leading, Trailing and all space of column in pyspark – strip & trim space

To remove leading and trailing space from a column in pyspark, we use the ltrim(), rtrim() and trim() functions. ltrim() strips leading space, rtrim() strips trailing space, and trim() strips both. To remove every space in the column, including spaces inside the string, a regular expression replacement with regexp_replace() can be used. Let's see how to

  • Remove Leading space of column in pyspark with ltrim() function – strip or trim leading space
  • Remove Trailing space of column in pyspark with rtrim() function – strip or trim trailing space
  • Remove both leading and trailing space of column in pyspark with trim() function – strip or trim both leading and trailing space
  • Remove all the space of column in pyspark

String split of the columns in pyspark

To split the strings of a column in pyspark we use the split() function. split() takes the column and a delimiter pattern (a regular expression) as arguments and returns an array column. Let's see an example of how to split the string of a column in pyspark.

  • String split of the column in pyspark with an example.

Repeat the column in Pyspark

To repeat the string of a column in pyspark we use the repeat() function. We look at an example of how to repeat the string of a column in pyspark.

  • Repeat the string of the column in pyspark.

Syntax:

 repeat(colname,n)

colname – Column name.
n – number of times to repeat the string

 

Get Substring of the column in Pyspark

In order to get substring of the column in pyspark we will be using substr() Function. We look at an example on how to get substring of the column in pyspark.

  • Get substring of the column in pyspark using substring function.
  • Get Substring from end of the column in pyspark.

Syntax:

 df.colname.substr(start,length)

df – dataframe
colname – column name
start – starting position (counted from 1)
length – number of characters to take from the starting position

 

Get String length of column in Pyspark

In order to get string length of column in pyspark we will be using length() Function. We look at an example on how to get string length of the column in pyspark.

  • Get string length of the column in pyspark using length() function.

Syntax:

 length("colname")

colname –  column name

 

Typecast string to date and date to string in Pyspark

To typecast a string to a date in pyspark we use the to_date() function with the column name and the date format as arguments. To typecast a date to a string we use the cast() function with StringType() as argument. Let's see an example of converting a string column to a date column and a date column to a string column in pyspark.

  • Type cast string column to date column in pyspark
  • Type cast date column to string column in pyspark

 

Typecast Integer to string and String to integer in Pyspark

To typecast an integer to a string in pyspark we use the cast() function with StringType() as argument. To typecast a string to an integer we use the cast() function with IntegerType() as argument. Let's see an example of converting an integer column to a string column and a string column to an integer column in pyspark.

  • Type cast an integer column to string column in pyspark
  • Type cast a string column to integer column in pyspark

Extract First N and Last N character in pyspark

In order to Extract First N and Last N character in pyspark we will be using substr() function. In this Section we will see an example on how to extract First N character from left in pyspark and how to extract last N character from right in pyspark. Let’s see how to

  • Extract First N character in pyspark – First N character from left
  • Extract Last N character in pyspark – Last N character from right

Convert to upper case, lower case and title case in pyspark

A column is converted to upper case in pyspark with the upper() function, to lower case with the lower() function, and to title case with the initcap() function. Let's see an example of each.

  • Convert column to upper case in pyspark – upper() function
  • Convert column to lower case in pyspark – lower() function
  • Convert column to title case in pyspark – initcap() function

Syntax:

 upper('colname1')

colname1 – Column name

 

Syntax:

 lower('colname1')

colname1 – Column name

 

Syntax:

 initcap('colname1')

colname1 – Column name

The initcap() function takes the column name as argument and converts each word in the column to title case (proper case): the first letter becomes upper case and the remaining letters lower case.

 

Add leading zeros to the column in pyspark

To add leading zeros to a column in pyspark we can use the concat() function. Other ways to add leading zeros are the format_string() function and the lpad() function. Let's see an example of each method.

  • Add leading zeros to the column in pyspark using concat() function
  • Add leading zeros to the column in pyspark using format_string() function
  • Add leading zeros to the column in pyspark using lpad() function

 

 

Concatenate two columns in pyspark

To concatenate two columns in pyspark we use the concat() function. We look at examples of how to join two or more string columns, and also a string column and a numeric column, with a space or any other separator.

  • Concatenate two columns in pyspark without space.
  • Concatenate columns in pyspark with single space.
  • Concatenate columns with hyphen in pyspark (“-”)
  • Concatenate by removing leading and trailing space
  • Concatenate numeric and character column in pyspark

 

Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with R, Python (Pandas), SAS and Pyspark.