Extract First N and Last N characters in pyspark

To extract the first N and last N characters of a string column in pyspark, we will use the substr() function. This section shows how to extract the first N characters from the left and the last N characters from the right. Let’s see how to

  • Extract First N characters in pyspark – First N characters from left
  • Extract Last N characters in pyspark – Last N characters from right
  • Extract characters from a string column of the dataframe in pyspark using substr() function.

An example is given for each.

We will be using the dataframe named df_states, whose state_name column holds the strings we extract from.

[Image: the df_states dataframe]

Extract First N character in pyspark – First N character from left

The first N characters of a column are obtained using the substr() function, which takes a 1-based start position and a length.

########## Extract first N characters from left in pyspark

# substr(1, 6): start at position 1 (positions are 1-based) and take 6 characters
df = df_states.withColumn("first_n_char", df_states.state_name.substr(1, 6))
df.show()

The first 6 characters from the left are extracted using the substr() function, so the resultant dataframe will be

[Image: df_states with the first_n_char column added]
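To make the index arithmetic concrete, here is a minimal plain-Python sketch of what substr(1, 6) selects from each value. It is a model of the 1-based positions described above, not PySpark itself. The first_n helper and the sample state names are assumptions made for illustration and are not taken from df_states.

```python
def first_n(value: str, n: int) -> str:
    """Plain-Python model of substr(1, n): start at position 1, take n characters."""
    start_pos = 1              # substr() positions are 1-based
    begin = start_pos - 1      # convert to Python's 0-based index
    return value[begin:begin + n]


# Illustrative values only (assumption); the article's df_states data is not reproduced here.
sample_states = ["Alabama", "California", "Ohio"]

for name in sample_states:
    # A value shorter than n comes back whole in this model.
    print(name, "->", first_n(name, 6))
```

In this model, a value shorter than 6 characters is simply returned unchanged rather than padded.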

Extract Last N characters in pyspark – Last N character from right

The last N characters of a column are obtained using the substr() function by passing a negative value as the first argument, which counts the start position back from the end of the string, as shown below

########## Extract last N characters from right in pyspark

# substr(-2, 2): start 2 characters before the end and take 2 characters
df = df_states.withColumn("last_n_char", df_states.state_name.substr(-2, 2))
df.show()

The last 2 characters from the right are extracted using the substr() function, so the resultant dataframe will be

[Image: df_states with the last_n_char column added]
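The negative start position can be pictured with a short plain-Python sketch. It models substr(-n, n) as "begin n characters before the end, then take n characters" and is not PySpark itself. The last_n helper and the sample state names are assumptions made for illustration.

```python
def last_n(value: str, n: int) -> str:
    """Plain-Python model of substr(-n, n): the last n characters of value."""
    # A start of -n means "n characters before the end of the string".
    begin = max(len(value) - n, 0)   # clamp so short values are returned whole
    return value[begin:begin + n]


# Illustrative values only (assumption); not the article's df_states data.
sample_states = ["Alabama", "California", "Ohio"]

for name in sample_states:
    print(name, "->", last_n(name, 2))
```

The clamp to 0 is a design choice of this sketch so that asking for more characters than the value holds returns the whole value instead of wrapping around.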

Extract characters from string column in pyspark – substr()

Characters are extracted from a string column in pyspark using the substr() function by passing two values: the first is the starting position of the substring and the second is its length. In our example we extract two substrings and concatenate them using the concat() function, as shown below

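########## Extract N characters from string column in pyspark

from pyspark.sql.functions import concat, lit

# Join characters 1-3 and characters 6-7 of state_name with an underscore
df_states_new = df_states.withColumn('new_string', concat(df_states.state_name.substr(1, 3),
                                     lit('_'),
                                     df_states.state_name.substr(6, 2)))
df_states_new.show()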

So the resultant dataframe will be

[Image: df_states_new with the new_string column added]
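As a rough check of what the concatenation produces, the same position arithmetic can be sketched in plain Python. This models substr(1, 3), the '_' literal and substr(6, 2) on ordinary strings and is not PySpark itself. The substr_model and build_new_string helpers and the sample state names are assumptions made for illustration.

```python
def substr_model(value: str, start_pos: int, length: int) -> str:
    """Plain-Python model of substr(start_pos, length) for positive, 1-based start_pos."""
    begin = start_pos - 1                 # 1-based position to 0-based index
    return value[begin:begin + length]


def build_new_string(name: str) -> str:
    """Characters 1-3, an underscore, then characters 6-7 of name."""
    return substr_model(name, 1, 3) + "_" + substr_model(name, 6, 2)


# Illustrative values only (assumption); not the article's df_states data.
for name in ["Alabama", "California", "Ohio"]:
    print(name, "->", build_new_string(name))

# "California" -> "Cal_or": positions 1-3 are "Cal", positions 6-7 are "or"
```

A value with fewer than 6 characters contributes an empty second part in this model, so "Ohio" becomes "Ohi_".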

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, has worked extensively with R, Python (Pandas), SAS and Pyspark.