Select column in Pyspark (Select single & Multiple columns)

To select columns in PySpark we use the select() function, which can select a single column or multiple columns from a dataframe. This tutorial also explains how to select columns by position and how to select columns whose names are like a given pattern using a regular expression, each with an example.

  • Select a single column in pyspark using the select() function
  • Select multiple columns in pyspark
  • Select columns with a name like a given pattern using the select() function
  • Select columns in pyspark by column position
  • Select columns by name using a regular expression with the colRegex() function

 

Syntax:

df.select('colname1','colname2')

df – dataframe
colname1..n – column name

We will use the dataframe named df_basket1.


 

Select single column in pyspark

Passing a column name as an argument to the select() function selects that single column.


df_basket1.select('Price').show()

select() returns a new dataframe containing only the requested column, and show() displays it. In our case we select the ‘Price’ column as shown above.


 

 

Select multiple columns in pyspark


 

Passing several column names as arguments to the select() function selects that set of columns.


df_basket1.select('Price','Item_name').show()

Again we chain show() after select() to display the result. In our case we select the ‘Price’ and ‘Item_name’ columns as shown above, and they appear in the order in which they were passed.

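When the column names to select are held in a Python list, it can help to check them against the dataframe's columns before calling select(), because asking for a name that does not exist raises an error. The following is a minimal sketch in plain Python. The helper check_columns and the Item_group column name are hypothetical and introduced only for illustration, and a plain list stands in for df_basket1.columns.

```python
def check_columns(requested, available):
    """Return the requested names as a list if all of them exist.

    Hypothetical helper: raises ValueError naming any unknown columns.
    """
    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Unknown columns: {missing}")
    return list(requested)


# Stand-in for df_basket1.columns (the column names here are assumed).
available = ["Item_group", "Item_name", "Price"]

wanted = check_columns(["Price", "Item_name"], available)
print(wanted)

# With a real dataframe the checked list can be unpacked into select():
# df_basket1.select(*wanted).show()
```

Unpacking the list with * passes each name as a separate argument, which matches the syntax shown at the top of this tutorial.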

 

 

Select column by column position in pyspark:


We can also use the select() function to select columns by position. df_basket1.columns is a list of the column names, so indexing it returns the name at that position. In the example below we select the first column (position 0) and the last column (position 2) by passing those names as arguments.

## select column by position

df_basket1.select(df_basket1.columns[0],df_basket1.columns[2]).show()

The resulting dataframe contains only the columns at positions 0 and 2.
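Because the columns attribute is an ordinary Python list, negative indexes and slices work on it too, which avoids hard-coding the position of the last column. A small sketch, with a plain list standing in for df_basket1.columns (the three column names are assumed for illustration):

```python
# Stand-in for df_basket1.columns (column names assumed for illustration).
columns = ["Item_group", "Item_name", "Price"]

first_and_last = [columns[0], columns[-1]]  # first and last column names
first_two = columns[:2]                     # slice: the first two column names

print(first_and_last)
print(first_two)

# With a real dataframe, unpack the names into select():
# df_basket1.select(*first_and_last).show()
```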

 

 

Select columns with a name like a pattern using regex in pyspark (select column name like):

The colRegex() function takes a regular expression, wrapped in backticks, and selects the columns whose names match it. In our example the regular expression captures the columns whose names start with “Item”.

## select using Regex with column name like

df_basket1.select(df_basket1.colRegex("`(Item)+?.+`")).show()

The code above selects the columns whose names are like Item% (names that begin with “Item”), so the resulting dataframe contains only those columns.
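To see which names a pattern like this would pick up, the regular expression can be tried against a list of column names in plain Python. This sketch assumes that colRegex() requires the pattern to match the whole column name, and it uses Python's re module as a stand-in for Spark's regex engine (the two should agree for a pattern this simple). The column names are hypothetical.

```python
import re

# Pattern from the colRegex() example, without the surrounding backticks.
pattern = re.compile(r"(Item)+?.+")

# Hypothetical column names for illustration.
names = ["Item_group", "Item_name", "Price", "Item", "Old_Item_code"]

# fullmatch() requires the whole name to match, not just a part of it.
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)
```

Under that assumption a column named exactly Item is not selected, because .+ needs at least one more character after it, and a name that only contains Item in the middle is not selected either.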

 



Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages like R, Python (Pandas), SAS and Pyspark.