Link to home
Start Free TrialLog in
Avatar of Arana (G.P.)
Arana (G.P.)

asked on

script to generate table from weird directory structure

I have a the following dir structure

..\area\year\month\PDFFILE

not all areas have the same years inside
not all years have the same months (some months have no data so no month dir is created for those)
if there is a month directory it means it contains a pdf inside it


sample dir:

ford\2018
ford\2019\Jan\chart.pdf
ford\2019\Feb\chart.pdf
chevy\2017
chevy\2018
chevy\2019
chevy\2019\Jan\chart.pdf
toyota\
.
.
I need a script batch,vb or powershell or python whatever
that will help me create the following HTML TABLE


                   2019
     ford   chevy   toyota
Jan   x       x
Feb   x
Mar

The X should be an html link to the PDF, but I can do this with search / replace for the X so the main objective is to generate the table code.

I cannot change the directory structure
If you have a solution using csv file with excel and pivot tables it is also acceptable (so i can then use some text manipulating functions to get my desired html)

I have tried with some dir to html utils but they are very
limited and I end up with many tables that then I have to manually stitch together

Please Help
Avatar of aikimark
aikimark
Flag of United States of America image

What if there are multiple years?
Avatar of Arana (G.P.)
Arana (G.P.)

ASKER

there are, the ideal would be a different table for each year, but now I need to focus only in 2019
So 1 table per year.
Would you like to show a month for a year if no pdf exists for it.
Do you want to show a line for an area of no pdf exists for that area?
I guess you want the months ordered?

I would split the code into two parts.

One extracting the data and the other one writing out the result.
if no pdf for the month no need to show that row
all areas should show even if empty
A quite quick and dirty solution (still not creating the tables but showing you kind of the expected results)

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    jan=1, feb=2, mar=3, apr=4, may=5, jun=6,
    jul=7, aug=8, sep=9, oct=10, nov=11, dec=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/chart.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("mon", end=": ")
        for area in sorted(areas):
            print("%10s" % area, end=" ")
        print()
        print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print(month, end=": ")
            for area in sorted(areas):
                print("%10s" % ("X" if area in urls else "-"), end=" ")
            print("")


info = gen_info()

show_info(info)

Open in new window


In order to create the html tables it would be required to know whether each table should be in a separate file or be inserted into an existing html document, etc.

However as data extraction and output is separated you wouldn't have to rewrite the function gen_info()
as long as I get the table part of the code, I really dont care if I get it in different file or same file,
 
This is to be run from the starting Path right?
Yes the script is intended to be run from the starting path. if you wanted, then the starting path could be a parameter passed to the script.

Could you try the script and see if the output looks right to you.
I don't have a windows machine at hand so can't test.

If the output looks OK, then I can change the code to create an html table.
So far great
wasn't working since I had 2.7 (my omission), installed 3.7 changed some stuff and works fine
also had to change some month names in the dict

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
    julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/*.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        #print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("<tr>", end="")
        for area in sorted(areas):
            #print("<td>","%10s" % area, end="</td>")
        print("</tr>")
        #print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print("<tr><td>",month, end=" </td>")
            for area in sorted(areas):
                # print("%s" % "<td>", ("X" if area in urls else "-"), end=" </td>")
                print("%s" % "<td>", ("<a href=\"images/pidmxdocuments/MejoraContinua/Programas/5s/Metrico/"+area+"/"+str(year)+"/"+month+"/"+"Radar%20Chart.pdf\">X</a>" if area in urls else "-"), end="</td>")
            print ("</tr>")
			
info = gen_info()
show_info(info)

Open in new window


I am getting an unwanted space at the begining of AREA I changed the line to
print("<td>"+area, end="</td>") to remove it and it works
but I dont like that :P, and area.strip threw me some errrors

Also I found out the pdf filenames are inconsistent, so I would like to take them from the IGLOB instead of using it as a constant, tried using fname but not working.
ASKER CERTIFIED SOLUTION
Avatar of gelonida
gelonida
Flag of France image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
If the names of the pdf files are inconsistent, but you have only one pdf file per directory, then just use:
 for fname in iglob("*/*/*/*.pdf")

Open in new window


I don't understand the problem with 'area' can you run the new code with the new iglob and copy paste parts of the generated html to explain?


I just understood now the issue with the spaces and area. The reason is, that I used "%10s" as the formatting string. (My first example was supposed to look 'good' on the console, so I said to always use 10 character per area for better alginment)

Just replace it with "%s" and it should be fine

Concerning python3:

The code might work with python2 if you insert following line at the beginning of the file (after the #!/usr/bin/env python line)
from __future__ import print_function

Open in new window


If filenames or directory names or the generated HTML contain non ASCII characters, then other changes might be required
if you can use pip for installing the package future, then it is rather easy create code, that works with python2 and python3. As python2 will be obsoleted January 2020 it might be a good idea to write any new code such, that it will also run with python3.
PERFECT, just made minor adjustments (begin of absolute path, month names etc)
and your code works perfectly.
For the fun of it, a PowerShell version; assumes there's at most one pdf file in the month directory.
$root = 'C:\Charts'
$outFile = 'C:\Temp\chart.html'
$years = '2019', '2018', '2017'
$htmlTitle = 'Report'
$htmlHead = @'
<style>
	TABLE	{border-width:1px; border-style:solid; border-color:black;}
	TH		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
	TD		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
</style>
'@

Add-Type -AssemblyName System.Web
$areas = Get-ChildItem -Path $root -Directory | Select-Object -ExpandProperty Name
$months = 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'
$html = $years | ForEach-Object {
	$year = $_
	$months | ForEach-Object {
		$month = $_
		$out = '' | Select-Object -Property (@(@{n='Month'; e={"$($month) $($year)"}}) + $areas)
		ForEach ($area In $areas) {
			If ($pdf = Get-ChildItem -Path "$($root)\$($area)\$($year)\$($month)" -File -Filter *.pdf -ErrorAction SilentlyContinue) {
				$out.$area = "<a href='file://$($pdf.FullName)'>$($pdf.Name)</a>"
			}
		}
		$out
	}
} | ConvertTo-Html -Head $htmlHead -Title $htmlTitle 
[System.Web.HttpUtility]::HtmlDecode($html) | Set-Content -Path $outFile
Write-Host "Report written to $($outFile)"

Open in new window

Thank you all,
sorry oBdA, just had awarded the points when your solution appeared will check it out non the less.
An alternative If you want to create more complex html would be to use a templating engine like jinja (would require installation of an external python package)

But that's overkill for very basic html.
thanks for the suggestion, this is just a tedious monthly task that goes into a CMS , no need to kill flies with a canyon :D