Link to home
Start Free TrialLog in
Avatar of Leo Torres
Leo TorresFlag for United States of America

asked on

How can I Convert Powershell data clean up code to Python Code

I would like to convert the Powershell script below to Python code.
- Here is the objective of this code.
 + The code takes in a comma delimited filename and file extension.
 + The code below exports file as a pipe delimited file
 + Then it removes commas that exists within the data
 + Finally it also removes the double quotes used to qualify the data.

This results in the final file being pipe delimited with no double quotes or commas in the data. In doing this work I used this order because if you try to just replace double quotes and commas before establishing pipes the columns and data would break.  

    Param([string]$RootString,  [string]$Ext)	
    
    $OrgFile = $RootString
    $NewFile = $RootString.replace($Ext,"out")
    
    Import-Csv $OrgFile -Encoding UTF8 | Export-Csv tempfile.csv -Delimiter "|" -NoTypeInformation
    (Get-Content tempfile.csv).Replace(",","").Replace('"',"") | Out-File $NewFile -Encoding UTF8
    
    Copy-Item -Force $NewFile $OrgFile
    Remove-Item –path $NewFile -Force

Open in new window


Issues with this code is the method of find and replace. This was creating extra column in the beginning due to double quote when I did find and replace. Then sometimes extra column at the end since sometimes there was a double quote at the end. This caused data from different rows to merge together. More relevant seemed to put working code to create a better idea of.  Here is the code i have so far.
This was as close as I could get to my goal. But I think there has to be a better way of doing this
    for index in range(len(dfcsv)):
        filename = dfcsv['csvpath'].iloc[index]
        print(filename)
        print(i)
        with open(filename, 'r+') as f:
            text = f.read()
            print(datetime.now())
            text = re.sub('","', '|', text)
            print(datetime.now())
            f.seek(0)
            f.write(text)
            f.truncate()
            i = i + 1

Open in new window




Sample file
"Contract ID","Plan ID","Organization Type","Plan Type","Offers Part D","SNP Plan","EGHP","Organization Name","Organization Marketing Name","Plan Name","Parent Organization","Contract Effective Date"
"E7316","801","Employer/Union Only Direct Contract PDP","Employer/Union Only Direct Contract PDP","Yes","No","Yes","UNION PACIFIC RAILROAD EMPLOYES HEALTH SYSTEMS","Union Pacific Railroad  Employes Health Systems","The UPREHS Prime Medicare Plan (Employer PDP)","Union Pacific Railroad Employes Health Systems",01/01/2007 0:00:00
"H0022","001","Demo","Medicare-Medicaid Plan HMO/HMOPOS","Yes","No","No","BUCKEYE COMMUNITY HEALTH PLAN, INC.","Buckeye Health Plan - MyCare Ohio","Buckeye Health Plan - MyCare Ohio (Medicare-Medicaid Plan)","Centene Corporation",05/01/2014 0:00:00
"H0028","004","Local CCP","HMO/HMOPOS","Yes","No","No","CHA HMO, INC.","CHA HMO, Inc.","Humana Gold Plus H0028-004 (HMO)","Humana Inc.",01/01/2013 0:00:00
"H0028","007","Local CCP","HMO/HMOPOS","Yes","Yes","No","CHA HMO, INC.","CHA HMO, Inc.","Humana Gold Plus SNP-DE H0028-007 (HMO SNP)","Humana Inc.",01/01/2013 0:00:00

Open in new window


Final File:
Contract ID|Plan ID|Organization Type|Plan Type|Offers Part D|SNP Plan|EGHP|Organization Name|Organization Marketing Name|Plan Name|Parent Organization|Contract Effective Date
E7316|801|Employer/Union Only Direct Contract PDP|Employer/Union Only Direct Contract PDP|Yes|No|Yes|UNION PACIFIC RAILROAD EMPLOYES HEALTH SYSTEMS|Union Pacific Railroad  Employes Health Systems|The UPREHS Prime Medicare Plan (Employer PDP)|Union Pacific Railroad Employes Health Systems|01/01/2007 0:00:00
H0022|001|Demo|Medicare-Medicaid Plan HMO/HMOPOS|Yes|No|No|BUCKEYE COMMUNITY HEALTH PLAN, INC.|Buckeye Health Plan - MyCare Ohio|Buckeye Health Plan - MyCare Ohio (Medicare-Medicaid Plan)|Centene Corporation|05/01/2014 0:00:00
H0028|004|Local CCP|HMO/HMOPOS|Yes|No|No|CHA HMO, INC.|CHA HMO, Inc.|Humana Gold Plus H0028-004 (HMO)|Humana Inc.|01/01/2013 0:00:00
H0028|007|Local CCP|HMO/HMOPOS|Yes|Yes|No|CHA HMO, INC.|CHA HMO, Inc.|Humana Gold Plus SNP-DE H0028-007 (HMO SNP)|Humana Inc.|01/01/2013 0:00:00

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Norie
Norie

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
What's the reason for converting to Python?  Just because of the issues you mentioned?

Do you have sample input text that when run through your code produces the issues you mentioned?
Avatar of Leo Torres

ASKER

Hi footech,
The powershell script works but on large files it times out. I waited it out just to see how long it takes. I started it on Friday an 11:09AM and Monday Morning it was still running. I have ran the python script above and it would take hours to complete but it would at least complete. However the files where a little flawed.  

Norie, I am just seeing your post will test thank you!
How large a file?  I'm seeing some possibilities for optimization, but would need to do some testing.
I used your code in a loop and what it did was replace the file with the file paths I am trying to process. 


for index in range(len(df2)):
    filename = df2[0].iloc[index]
    df2.to_csv(filename,sep='|', index=False)
    print(filename)
    print(index)

Open in new window


Here is the output that was inserted into the file. Each file has this exact optput
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_01.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_02.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_03.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_04.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_05.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_06.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_07.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_08.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_09.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_10.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_11.csv
//Folder/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test\CPSC_Enrollment_Info_2017_12.csv

Open in new window



Looks like the file content was replaced
Avatar of Norie
Norie

Leo

Where in that code are you reading in the CSV files?
Here is all the code

import os
import pandas as pd
import sys
from glob import glob

rootext = '//Path/DataFeeds/CMS_ResearchStatistics/ZipArchive/2017/MonthlyEnrollmentByCPSC/Test/'


df2 = pd.DataFrame()
EXT = "*.csv"
all_csv_files = [file
        for path, subdir, files in os.walk(rootext)
        for file in glob(os.path.join(path, EXT))]
all_csv_files = pd.DataFrame(all_csv_files)
df2 = df2.append(all_csv_files)


for index in range(len(df2)):     
   filename = df2[0].iloc[index]     
   df2.to_csv(filename,sep='|', index=False)     
   print(filename)     
   print(index) 

Open in new window

Sorry, still not seeing anywhere in that code that you are actually reading in a CSV file.
Are you sure you have enough physical RAM for this?  You are loading all of your CMS data into memory before writing it out.
Norie, your right I missed that line. I added line testing. I think this works. 
Thank you Norie. Your code worked I as you said i didn't read the data in..