Get Substring of the column in Pyspark – substr()

To get a substring of a column in PySpark, use the substr() function on the column. This tutorial walks through examples of taking a substring from the start of a column, from the end of a column, and from the middle of a string.

  • Get a substring of a column in PySpark using the substr() function.
  • Get a substring from the end of a column in PySpark using substr().
  • Extract characters from a string column in PySpark.


Syntax:

 df.colname.substr(start,length)

df – dataframe
colname – column name
start – starting position (1-based, so 1 is the first character)
length – number of characters to take from the starting position
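The 1-based start is the main thing to watch when coming from Python slicing. As a rough plain-Python model (this is not Spark code; the helper name `substr_positive` and the sample value are made up for illustration), a positive start and a length map onto a slice roughly like this:

```python
def substr_positive(value: str, start: int, length: int) -> str:
    """Model substr(start, length) for start >= 1 using Python slicing."""
    if start < 1:
        raise ValueError("this sketch only covers start >= 1")
    begin = start - 1  # substr positions are 1-based, Python slices are 0-based
    return value[begin:begin + length]


sample = "California"  # hypothetical value, not taken from df_states

first_six = substr_positive(sample, 1, 6)   # "Califo"
middle_two = substr_positive(sample, 6, 2)  # "or" (positions 6 and 7)
print(first_six, middle_two)
```

The sketch only mirrors the position arithmetic; on a real dataframe the same arguments are passed to df.colname.substr(start, length).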

We will be using the dataframe named df_states, which has a string column named state_name.

Substring from the start of the column in pyspark – substr()

df.colname.substr() returns a substring of the column. The following extracts the first 6 characters of the state_name column and stores them in a new column.


### Get Substring of the column in pyspark

df = df_states.withColumn("substring_statename", df_states.state_name.substr(1,6))
df.show()

substr(1,6) returns the first 6 characters of the column “state_name”: it starts at position 1 and takes 6 characters.

Get Substring from end of the column in pyspark

To take a substring from the end of the column, pass a negative value as the first parameter of df.colname.substr(). The starting position is then counted back from the end of the string.

### Get Substring from end of the column in pyspark

df = df_states.withColumn("substring_from_end", df_states.state_name.substr(-2,2))
df.show()

In this example we extract the last two characters of the column. The first parameter, -2, starts two characters before the end of the string, and the second parameter, 2, is the length. The resulting dataframe has a new column, substring_from_end, holding those two characters.
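To see why a negative start works regardless of how long each value is, here is a small plain-Python model of the arithmetic. It is a sketch, not Spark code: the helper name `substr_from_end` and the sample names are hypothetical, and it only covers a negative start that stays within the string.

```python
def substr_from_end(value: str, start: int, length: int) -> str:
    """Model substr(start, length) for a negative start inside the string."""
    if not (-len(value) <= start < 0):
        raise ValueError("this sketch only covers -len(value) <= start < 0")
    begin = len(value) + start  # e.g. start=-2 on a 7-character string -> index 5
    return value[begin:begin + length]


names = ["Alabama", "California", "Texas"]  # hypothetical sample values

# Same arguments as substr(-2, 2): the last two characters of each name.
last_two = [substr_from_end(name, -2, 2) for name in names]
print(last_two)  # ['ma', 'ia', 'as']

# The start picks the position and the length picks how many characters:
# start four from the end, take two.
print(substr_from_end("California", -4, 2))  # rn
```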

 

 

Extract characters from string column in pyspark – substr()

Characters are extracted from a string column with substr() by passing two values: the first is the starting position and the second is the length of the substring. In this example we extract two substrings and join them with the concat() function, using lit() to supply the separator, as shown below.

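########## Extract N characters from string column in pyspark

from pyspark.sql.functions import concat, lit

df_states_new = df_states.withColumn(
    'new_string',
    concat(df_states.state_name.substr(1, 3),   # characters 1 to 3
           lit('_'),                            # literal separator
           df_states.state_name.substr(6, 2)))  # characters 6 and 7
df_states_new.show()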

The resulting dataframe has a new column, new_string, made of characters 1 to 3 and characters 6 and 7 of state_name joined by an underscore.
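As a quick sanity check of what new_string should contain, the same two substrings can be worked out by hand on a single value. The following is a plain-Python sketch rather than Spark code; the function name `build_new_string` and the sample names are assumptions for illustration, and the names were chosen to be at least seven characters long.

```python
def build_new_string(state_name: str) -> str:
    """Mirror concat(substr(1, 3), lit('_'), substr(6, 2)) on one string."""
    head = state_name[0:3]  # substr(1, 3): characters 1 to 3
    tail = state_name[5:7]  # substr(6, 2): characters 6 and 7
    return head + "_" + tail


examples = {name: build_new_string(name) for name in ["California", "Alabama"]}
print(examples)  # {'California': 'Cal_or', 'Alabama': 'Ala_ma'}
```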

 



Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS, and PySpark.