pravelk

asked on

Modifying CSV Headers in Pyspark

I am trying to modify CSV headers in PySpark to remove blank spaces and extra characters from the column names.
For example, the attribute "Loan Account" needs to be renamed to "LoanAccount" and "Late Payment Fee(ACC)" to "LatePaymentFeeACC".

I need help renaming the CSV headers and loading the file into a DataFrame. I have tried renaming the attributes with a StructType. Please advise
whether this can be done another way (in Python) without using a Pandas DataFrame.

Below is my code snippet, which returns null values for all attributes of the attached CSV file.

--DataType from account_Data.csv
df.printSchema()
ID: string (nullable = true)
Created: string (nullable = true)
Modified: string (nullable = true)
CaseId: string (nullable = true)
Loan Account: string (nullable = true)
Follow-up Date: string (nullable = true)
Late Payment Fee(ACC): string (nullable = true)

***************************************************************************************************
import os, datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, FloatType, StringType

sourceFile = "hdfs://nameservice1/data/test/dev01/pk/data/account_Data.csv"  # In HDFS

class CSVData_Load:
    # Explicit schema carrying the renamed headers
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Created", StringType(), True),
        StructField("Modified", StringType(), True),
        StructField("CaseId", StringType(), True),
        StructField("LoanAccount", StringType(), True),        # Header Loan Account renamed to LoanAccount
        StructField("FollowupDate", StringType(), True),       # Header Follow-up Date renamed to FollowupDate
        StructField("LatePaymentFeeACC", StringType(), True),  # Header Late Payment Fee(ACC) renamed to LatePaymentFeeACC
    ])

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("delimiter", ",")
          .option("quote", "\"")
          .option("escape", "\"")
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(sourceFile, schema=CSVData_Load.schema))
    df.show(2, False)  # ----> Resulting in all null values
***************************************************************************************************

I would appreciate your help getting the headers modified and readable in a PySpark DataFrame without using the Pandas utility. Attachment: account_data.csv
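One possible way to approach the renaming described above, offered only as a minimal sketch: clean each header with a regular expression and apply the cleaned names with `DataFrame.toDF`. The helper names `clean_header` and `rename_columns` are hypothetical, and the sketch assumes the file is read with `header=true` and without an explicit schema, so that `df.columns` still holds the original header text. It is not necessarily the approach given in the accepted solution.

```python
import re


def clean_header(name):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "", name)


def rename_columns(df):
    """Return a new DataFrame whose column names have been cleaned.

    Assumption: df is a PySpark DataFrame read with header=true and no
    explicit schema, so df.columns contains the raw CSV headers.
    """
    return df.toDF(*[clean_header(c) for c in df.columns])


# Pure-Python check using the headers from the printSchema() output above
headers = ["ID", "Created", "Modified", "CaseId",
           "Loan Account", "Follow-up Date", "Late Payment Fee(ACC)"]
cleaned = [clean_header(h) for h in headers]
print(cleaned)
# ['ID', 'Created', 'Modified', 'CaseId', 'LoanAccount', 'FollowupDate', 'LatePaymentFeeACC']
```

With a live session, usage would look like `df = rename_columns(spark.read.option("header", "true").csv(sourceFile))`. Because the names are changed after the read, no Pandas conversion or hand-written StructType is needed for the rename itself.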
ASKER CERTIFIED SOLUTION
Louis LIETAER

pravelk

ASKER

Thanks a lot for your help, @Louis LIETAER.