Modifying CSV Headers in Pyspark

Last Modified: 2020-04-13
I am trying to modify CSV headers in PySpark to strip blank spaces and extra characters from the column names.
For example, the attribute "Loan Account" needs to be renamed to "LoanAccount", and "Late Payment Fee(ACC)" to "LatePaymentFeeACC".
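
The renaming rule described above can be sketched in plain Python. This is only an illustration: the function name `clean_header` is hypothetical, and it assumes that every character other than a letter, digit, or underscore should be dropped.

```python
import re


def clean_header(name):
    """Remove spaces and any character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "", name)


# Headers taken from the printSchema() output of account_Data.csv
for header in ["Loan Account", "Follow-up Date", "Late Payment Fee(ACC)"]:
    print(header, "->", clean_header(header))
```

Because the function works on plain strings, it does not depend on Spark or Pandas and can be checked on its own before it is applied to a dataframe.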

I need assistance renaming the CSV headers and loading the result into a dataframe. I have tried to rename the attributes by using StructType. Kindly advise
whether this can be resolved in a different way (in Python) without using a Pandas dataframe.

Below is my code snippet, which returns null values for all attributes from the CSV file (attached).

--DataType from account_Data.csv
df.printSchema()
ID: string (nullable = true)
Created: string (nullable = true)
Modified: string (nullable = true)
CaseId: string (nullable = true)
Loan Account: string (nullable = true)
Follow-up Date: string (nullable = true)
Late Payment Fee(ACC): string (nullable = true)

***************************************************************************************************
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# CSV file location in HDFS
sourceFile = "hdfs://nameservice1/data/test/dev01/pk/data/account_Data.csv"


class CSVData_Load:
    # Schema that carries the renamed headers
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Created", StringType(), True),
        StructField("Modified", StringType(), True),
        StructField("CaseId", StringType(), True),
        StructField("LoanAccount", StringType(), True),        # Header "Loan Account" renamed to LoanAccount
        StructField("FollowupDate", StringType(), True),       # Header "Follow-up Date" renamed to FollowupDate
        StructField("LatePaymentFeeACC", StringType(), True),  # Header "Late Payment Fee(ACC)" renamed to LatePaymentFeeACC
    ])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("CSVData_Load").getOrCreate()
    df = (spark.read
          .option("delimiter", ",")
          .option("quote", "\"")
          .option("escape", "\"")
          .option("header", "true")
          .option("inferSchema", "true")  # has no effect when an explicit schema is passed
          .csv(sourceFile, schema=CSVData_Load.schema))
    df.show(2, False)  # Result: every column is null
***************************************************************************************************

I would appreciate your help getting the headers modified and readable in a PySpark dataframe without using Pandas. (Attached: account_data.csv)
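
One Pandas-free alternative to supplying a StructType is to let Spark read the header row as-is and rename the columns afterwards with `DataFrame.withColumnRenamed`. The sketch below is an assumption about what would fit this file, not a confirmed fix for the null values: the `RENAMES` mapping and the helper names `rename_headers` and `load_with_clean_headers` are hypothetical, and the mapping covers only the three headers listed in the question.

```python
# Old header -> new header, taken from the examples in the question
RENAMES = {
    "Loan Account": "LoanAccount",
    "Follow-up Date": "FollowupDate",
    "Late Payment Fee(ACC)": "LatePaymentFeeACC",
}


def rename_headers(df, renames=RENAMES):
    """Return a dataframe whose columns are renamed according to `renames`."""
    for old_name, new_name in renames.items():
        # withColumnRenamed leaves the dataframe unchanged if old_name is absent
        df = df.withColumnRenamed(old_name, new_name)
    return df


def load_with_clean_headers(spark, path):
    """Read a CSV using its own header row, then apply the renames."""
    df = spark.read.option("header", "true").csv(path)
    return rename_headers(df)
```

With an existing SparkSession this would be called as `load_with_clean_headers(spark, sourceFile)`. Since no schema is passed, the columns keep the string type that Spark assigns by default.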