
Modifying CSV Headers in PySpark

I am trying to modify CSV headers in PySpark to remove blank spaces and extra characters from the column names.
For example, the attribute "Loan Account" needs to be renamed to "LoanAccount", and "Late Payment Fee(ACC)" to "LatePaymentFeeACC".

I need help renaming the CSV headers and loading the result into a DataFrame. I have tried renaming the attributes with a StructType schema. Please advise
whether this can be done another way in Python, without using a Pandas DataFrame.

Below is my code snippet, which returns null values for all attributes of the attached CSV file.

--DataType from account_Data.csv
df.printSchema()
ID: string (nullable = true)
Created: string (nullable = true)
Modified: string (nullable = true)
CaseId: string (nullable = true)
Loan Account: string (nullable = true)
Follow-up Date: string (nullable = true)
Late Payment Fee(ACC): string (nullable = true)

***************************************************************************************************
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Source file location in HDFS
sourceFile = "hdfs://nameservice1/data/test/dev01/pk/data/account_Data.csv"

class CSVData_Load:
    # Target schema with the cleaned-up header names
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Created", StringType(), True),
        StructField("Modified", StringType(), True),
        StructField("CaseId", StringType(), True),
        StructField("LoanAccount", StringType(), True),        # Header "Loan Account" renamed to LoanAccount
        StructField("FollowupDate", StringType(), True),       # Header "Follow-up Date" renamed to FollowupDate
        StructField("LatePaymentFeeACC", StringType(), True),  # Header "Late Payment Fee(ACC)" renamed to LatePaymentFeeACC
    ])

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("delimiter", ",")
          .option("quote", "\"")
          .option("escape", "\"")
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(sourceFile, schema=CSVData_Load.schema))
    df.show(2, False)  # Result: every column comes back null
***************************************************************************************************

I would appreciate your help getting the headers renamed and readable in a PySpark DataFrame without using Pandas. account_data.csv
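
For readers with the same problem, here is a minimal sketch of one common way to do this kind of renaming without Pandas and without a hand-written StructType. It is not necessarily the approach taken in the accepted solution. It reads the file with its original header and then renames every column in one pass with `DataFrame.toDF`. The helper names `clean_header` and `read_csv_with_clean_headers` are hypothetical, and the rule "keep only letters, digits and underscores" is an assumption based on the examples in the question.

```python
import re


def clean_header(name):
    """Drop every character that is not a letter, digit or underscore.

    Assumed rule, inferred from the examples in the question:
    "Loan Account" becomes "LoanAccount".
    """
    return re.sub(r"[^0-9A-Za-z_]", "", name)


def read_csv_with_clean_headers(spark, path):
    """Read a CSV with its original header, then rename all columns.

    `spark` is an existing SparkSession and `path` is the CSV location;
    both are supplied by the caller.
    """
    df = (spark.read
          .option("header", "true")
          .option("quote", "\"")
          .option("escape", "\"")
          .csv(path))
    # toDF returns a new DataFrame with the columns renamed positionally.
    return df.toDF(*[clean_header(c) for c in df.columns])
```

With a running session this could be called as `df = read_csv_with_clean_headers(spark, sourceFile)`. If two original headers happen to clean to the same name, the resulting DataFrame would have duplicate column names, so that case may need separate handling.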

ASKER CERTIFIED SOLUTION

Louis LIETAER


pravelk

ASKER

Thanks a lot for your help, @Louis LIETAER.