pravelk
asked on
Modifying CSV Headers in Pyspark
I am trying to modify CSV headers in PySpark to remove blank spaces and extra characters from the column names.
For example, the attribute "Loan Account" needs to be renamed to "LoanAccount" and "Late Payment Fee(ACC)" to "LatePaymentFeeACC".
I need help renaming the CSV headers and loading the file into a DataFrame. I have tried to rename the attributes by using StructType. Kindly advise
whether this can be done another way (in Python) without using a Pandas DataFrame.
Below is my code snippet, which returns null values for all attributes from the CSV file (attached).
--DataType from account_Data.csv
df.printSchema()
ID: string (nullable = true)
Created: string (nullable = true)
Modified: string (nullable = true)
CaseId: string (nullable = true)
Loan Account: string (nullable = true)
Follow-up Date: string (nullable = true)
Late Payment Fee(ACC): string (nullable = true)
**********************************************************************
from datetime import *
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, StringType
import os, datetime

sourceFile = 'hdfs://nameservice1/data/test/dev01/pk/data/account_Data.csv'  # in HDFS

class CSVData_Load:
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Created", StringType(), True),
        StructField("Modified", StringType(), True),
        StructField("CaseId", StringType(), True),
        StructField("LoanAccount", StringType(), True),        # Header "Loan Account" renamed to LoanAccount
        StructField("FollowupDate", StringType(), True),       # Header "Follow-up Date" renamed to FollowupDate
        StructField("LatePaymentFeeACC", StringType(), True)]) # Header "Late Payment Fee(ACC)" renamed to LatePaymentFeeACC

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("delimiter", ",").option("quote", "\"").option("escape", "\"").option("header", "true").option("inferSchema", "true").csv(sourceFile, schema=CSVData_Load.schema)
    df.show(2, False)  # ----> all values come back null
**********************************************************************
I would appreciate your help getting the headers modified and readable in a PySpark DataFrame without using Pandas.
Attachment: account_data.csv
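For illustration, the renaming described above ("Loan Account" to "LoanAccount", "Late Payment Fee(ACC)" to "LatePaymentFeeACC") could be sketched without Pandas roughly as follows. This is only a sketch under assumptions: the helper names clean_header and with_clean_headers are hypothetical, the file is assumed to be read with its original header first and renamed afterwards, and "extra characters" is assumed to mean anything other than letters, digits, and underscores. It does not use a StructType and makes no claim about why the snippet above returns nulls.

```python
import re


def clean_header(name):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "", name)


def with_clean_headers(df):
    """Return df with each column renamed via clean_header.

    Relies only on DataFrame.columns and DataFrame.toDF(*names).
    """
    return df.toDF(*[clean_header(c) for c in df.columns])


# Hypothetical usage, assuming an active SparkSession named `spark`
# and the `sourceFile` path from the question:
#   df = spark.read.option("header", "true").csv(sourceFile)
#   df = with_clean_headers(df)
#   df.printSchema()
```

One thing to watch with this approach: two different headers can collapse to the same cleaned name (for example "A B" and "AB"), so it may be worth checking the cleaned names for duplicates before renaming.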