Concatenate two columns in pyspark

In order to concatenate two columns in pyspark we will be using concat() Function. We look at an example on how to join or concatenate two string columns in pyspark (two or more columns) and also string and numeric column with space or any separator.

  • Concatenate two columns in pyspark without space
  • Concatenate columns in pyspark with single space
  • Concatenate columns with hyphen in pyspark (“-”)
  • Concatenate by removing leading and trailing space
  • Concatenate numeric and character column in pyspark

we will be using “df_states” dataframe

Concatenate two columns in pyspark 1

 

 

 

Concatenate two columns in pyspark with single space :Method 1

Concatenating two columns in pyspark is accomplished using concat() Function.

###### concatenate using single space

from pyspark.sql.functions import concat, lit, col

df1=df_states.select("*", concat(col("state_name"),lit(" "),col("state_code")).alias("state_name_code"))
df1.show()

so the dataframe with concatenated column of single space will be

Concatenate two columns in pyspark 2a

Concatenate columns in pyspark with single space :Method 2

Concatenating columns in pyspark is accomplished using concat() Function.

###### concatenate using single space

from pyspark.sql import functions as sf

df=df_states.withColumn('joined_column', sf.concat(sf.col('state_name'),sf.lit(' '),sf.col('state_code')))
df.show()

So the dataframe with concatenated column of single space will be

Concatenate two columns in pyspark 2b

 

 

 

 

Concatenate two columns without space :Method 1

Concatenating columns in pyspark is accomplished using concat() Function.

###### concatenate two columns without space

from pyspark.sql.functions import concat, lit, col

df1=df_states.select("*", concat(col("state_name"),col("state_code")).alias("state_name_code"))
df1.show()

So the dataframe with concatenated column without space will be

Concatenate two columns in pyspark 3a

Concatenate two columns without space :Method 2

Concatenating two columns is accomplished using concat() Function.

###### concatenate two columns without space

from pyspark.sql import functions as sf

df=df_states.withColumn('joined_column', sf.concat(sf.col('state_name'),sf.col('state_code')))
df.show()

So the dataframe with concatenated column without space will be
Concatenate two columns in pyspark 3b

 

 

 

 

Concatenate two columns with hyphen :Method 1

Concatenating multiple columns is accomplished using concat() Function.

###### concatenate using hyphen

from pyspark.sql.functions import concat, lit, col

df1=df_states.select("*", concat(col("state_name"),lit("-"),col("state_code")).alias("state_name_code"))
df1.show()

so the dataframe with concatenated column with hyphen (“-”) will be

Concatenate two columns in pyspark 4a

Concatenate two columns with hyphen :Method 2

Concatenating columns in pyspark is accomplished using concat() Function.

###### concatenate using hyphen

from pyspark.sql import functions as sf

df=df_states.withColumn('joined_column', sf.concat(sf.col('state_name'),sf.lit('-'),sf.col('state_code')))
df.show()

so the dataframe with concatenated column with hyphen (“-”) will be
Concatenate two columns in pyspark 4b

 

 

Concatenate numeric and character column in pyspark:

Concatenating numeric and character column in pyspark is accomplished by converting the numeric column to character by using cast() function. After that both the columns are concatenated using concat() function.

###### concatenate using hyphen

from pyspark.sql import functions as sf

df=df_states.withColumn('joined_column', sf.concat(sf.col('state_name'),sf.lit('-'),sf.col('state_code')))
df.show()

So the dataframe with numeric and character column concatenated will be
Concatenate two columns in pyspark 5

 

 

 

Concatenate two columns in pyspark by removing leading and trailing space :

First Concatenation of two columns in pyspark is accomplished using concat() Function. After that  trim() function is used to remove leading and trailing space


##### concatenate two columns by removing leading and trailing space

from pyspark.sql.functions import concat, lit, col
from pyspark.sql.functions import *


df1=df_states.select("*", concat(col("state_name"),lit("-"),col("state_code")).alias("state_name_code"))
df1 = df1.withColumn('state_name_code', trim(df1.state_name_code))

df1.show()

So the dataframe with concatenated column with leading and trailing space removed will be

Concatenate two columns in pyspark 6

 


Other related topics:

 

Concatenate two columns in pyspark                                                                                                       Concatenate two columns in pyspark

Author

  • Sridhar Venkatachalam

    With close to 10 years on Experience in data science and machine learning Have extensively worked on programming languages like R, Python (Pandas), SAS, Pyspark.