Read CSV file in Pyspark and Convert to dataframe

In order to read a csv file in Pyspark and convert it to a dataframe, we import SQLContext. We will explain step by step, with an example, how to read a csv file and convert it to a dataframe in pyspark.

We will use two methods to read a CSV file and convert it to a dataframe in Pyspark.

Let's first import the necessary packages:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sc already exists in the pyspark shell; getOrCreate() also covers standalone scripts
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

So we have imported SQLContext as shown above.
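
Note that on Spark 2.0 and later, SparkSession is the usual entry point and SQLContext is kept mainly for backward compatibility. A minimal sketch of the SparkSession route (the app name is just a placeholder):

from pyspark.sql import SparkSession

# single entry point on Spark 2.0+; spark.read exposes the same reader as sqlContext.read
spark = SparkSession.builder.appName('read_csv_example').getOrCreate()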


Method 1: Read csv and convert to dataframe in pyspark

df_basket = sqlContext.read.format('com.databricks.spark.csv') \
                      .options(header='true') \
                      .load('C:/Users/Desktop/data/Basket.csv')
df_basket.show()
  • We use sqlContext.read with format 'com.databricks.spark.csv' and the option header='true', so the first row of the file is taken as the column names.
  • Then we call load('your_path/file_name.csv') with the path to the CSV file.
  • The resultant dataframe is stored as df_basket.
  • df_basket.show() displays the top 20 rows of the resultant dataframe. An equivalent read with the built-in csv source is sketched after this list.
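
On Spark 2.0 and later, csv is a built-in data source, so the same read works without the external Databricks package. A minimal sketch, assuming the same file path:

# built-in csv data source (Spark 2.0+); same header option as above
df_basket = sqlContext.read.format('csv') \
                      .options(header='true') \
                      .load('C:/Users/Desktop/data/Basket.csv')
df_basket.show()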

[Output of df_basket.show(): the first 20 rows of df_basket]


Method 2: Read csv and convert to dataframe in pyspark

df_basket1 = sqlContext.read.load('C:/Users/Desktop/data/Basket.csv',
                                  format='com.databricks.spark.csv',
                                  header='true',
                                  inferSchema='true')
df_basket1.show()

Here we pass the file path to load() directly, along with keyword arguments: format names the data source, header='true' uses the first row as column names, and inferSchema='true' tells Spark to infer each column's data type from the data. So the resultant dataframe will be

[Output of df_basket1.show(): the first 20 rows of df_basket1]
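
To check what inferSchema='true' actually inferred, printSchema() prints the column names and types; the output depends on your file's contents:

df_basket1.printSchema()   # prints the schema Spark inferred from the data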


Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages like R, Python (Pandas), SAS and Pyspark.