Link to home
Start Free TrialLog in
Avatar of Wm Allen Smith
Wm Allen Smith

asked on

Using Python to Loop through a folder and print specified sheet(s) in excel workbook

Hi,

I am attempting  to  use python to loop  through a specified folder (sales_folder monthlysales*.xlsx) and  pull column data (salesTot) from a named tab (currentmonth)  in within the spreadsheet. I can get this
to work with one file, but not with looping through the folder. Th ecode for one file:

import openpyxl
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns',25)
pd.set_option('display.width', 200)
from pandas import ExcelWriter
from pandas import ExcelFile



d1 = pd.read_excel("C:\\sales_folder\ monthlysales", sheet_name= "currentmonth", index=range(5), columns=list('abc')
)
print(d1)
d1.columns
print(d1.columns)

df1=pd.DataFrame(data=d1)
print(d1)
d1_transposed=d1.T
print(d1_transposed)

d1_transposed.to_csv('C:\\ Folder\\Sales.csv')


I have tried to use glob but I cannot get the loop to run through the folder, find the tab and print the output to a csv. Please advise.
Avatar of Norie
Norie

This code should loop through all the Excel files in the folder with names starting with 'monthlysales',

For each file it finds it'll then try to read data from a sheet named 'currentmonth' and print it out
for filename in glob.glob('C:/sales_folder/monthlysales*.xls*'):
    try:   
        d1 = pd.read_excel(filename, sheet_name= 'currentmonth', index=range(5), columns=list('abc'))
        print(d1)
    except:
        print(f'Problem reading from file {filename}')

Open in new window


If you want to collect the data from all the files in the folder and output them in a separate file you'll need to give more details.
Avatar of Wm Allen Smith

ASKER

Norie,

Great! This first step worked.  Here is the tricky part--The actual data is laid out with each field as a row  and the value of the file in in the neighboring column:

 Col1 Col 2
Fielid1 Value1
Field2 Value2....

I need to get his data into a dataframe, transpose it  so that  a standard table format and save it to  a csv file.

Please advise.
Where is the data actually located on the sheet?

How do you want the it all combined in the output file?
Filenames are in Column A, Values in Column B.

Thanks
ASKER CERTIFIED SOLUTION
Avatar of Norie
Norie

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Norrie, this worked quite well! Thank you for your input.

Another thing I would like to do is in the error handling, if there is a problem file, move it to another folder. I was using the shutil  module, but that moved all of the files to another  destination folder, not just the ones that triggered an error:
try:  
        # read data from sales file
        df = pd.read_excel(filename, sheet_name= 'Data', header=None)  
     
        # transpose data
        df = df.T

        # use first row as headers
        df.columns = df.iloc[0]

        # remove first row
        df = df[1:]

        # append to main dataframe
        dfAllSales = dfAllSales.append(df)
       
    except:
        print(f'Problem reading from file {filename}')
        source = 'C:\\\Documents\\UpdatedataTestFolder\\'
        dest1 ='C:\\Documents\\UpdatedataTestFolder\\badfolder'
        files = os.listdir(source)

        for f in files:
            print(f'Problem reading from file {filename}')
            shutil.move(source+f, dest1)
       
print (filename)



Please advise.
You can use shutil to copy/move a single file.
import glob
import os
import shutil
import pandas as pd

dfAllSales = pd.DataFrame()

pathBadFiles = 'bad_files'
pathGoodFiles = 'good_files'
pathSales = 'sales_folder'

for filepath in glob.glob(f'C:/{pathSales}/monthlysales*.xls*'):

    filename = os.path.split(filepath)[1]

    try:           

        # read data from sales file
        df = pd.read_excel(filepath, sheet_name= 'currentmonth', header=None)   
        # transpose 
        df = df.T
        # use first row as headers
        df.columns = df.iloc[0]
        # remove first row
        df = df[1:]
        # append to main dataframe
        dfAllSales = dfAllSales.append(df)

        # copy file to good_files folder
        shutil.copy(filepath, os.path.join(f'c:/{pathSales}/{pathGoodFiles}', filename))

    except:

        print(f'Problem reading from file {filename}')

        # move file to bad_files folder
        shutil.move(filepath, os.path.join(f'c:/{pathSales}/{pathBadFiles}', filename))

print(dfAllSales)

# save combined data to CSV
dfAllSales.to_csv('C:/sales_folder/Sales.csv', index=None)

Open in new window

I was not able to get this version with shutil to work, only returned empty data sets, even when I hard coded the  file paths.
Do you mean it didn't work at all?

The only thing that changed was the copying/moving of the files and that shouldn't affect the original code that was getting the data.
Question: You are using a variable for the Path Sales in the "for filepath in glob.glob(f'C:/{pathSales}/monthlysales*.xls*'):"
statement. I am a supplying a path value for the  pathSales i.e. pathSales = 'N:/Documents_Main/ExcelFiles'. When referencing the path in the "for" statement, you have "for filepath in glob.glob(f'C:/{pathSales}/monthlysales*.xls*'):", I am able to get this to work when I replace the C:/{pathSales}/monthlysales*.xls*' with the actual path--which I though te pathSales variable was doing. Is this an incorrect assumption?
In the code I posted the variable pathSales refers to the string 'sales_folder', i.e. only the folder name.

In this code {pathSales} is replaced with the value it refers to.

glob.glob(f'C:/{pathSales}/monthlysales*.xls*')

Open in new window


So, without the variable the above would look like this.
glob.glob(f'C:/sales_folder/monthlysales*.xls*')

Open in new window


If you have pathSales referring to the full path, including the drive letter you could try this.
glob.glob(f'{pathSales}/monthlysales*.xls*')

Open in new window

I got it to work without using the variable. thank you so much for your assistance!! :)