script to generate table from weird directory structure

I have a the following dir structure

..\area\year\month\PDFFILE

not all areas have the same years inside
not all years have the same months (some months have no data so no month dir is created for those)
if there is a month directory it means it contains a pdf inside it


sample dir:

ford\2018
ford\2019\Jan\chart.pdf
ford\2019\Feb\chart.pdf
chevy\2017
chevy\2018
chevy\2019
chevy\2019\Jan\chart.pdf
toyota\
.
.
I need a script batch,vb or powershell or python whatever
that will help me create the following HTML TABLE


                   2019
     ford   chevy   toyota
Jan   x       x
Feb   x
Mar

The X should be an html link to the PDF, but I can do this with search / replace for the X so the main objective is to generate the table code.

I cannot change the directory structure
If you have a solution using csv file with excel and pivot tables it is also acceptable (so i can then use some text manipulating functions to get my desired html)

I have tried with some dir to html utils but they are very
limited and I end up with many tables that then I have to manually stitch together

Please Help
LVL 14
aranaAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

aikimarkCommented:
What if there are multiple years?
aranaAuthor Commented:
there are, the ideal would be a different table for each year, but now I need to focus only in 2019
gelonidaCommented:
So 1 table per year.
Would you like to show a month for a year if no pdf exists for it.
Do you want to show a line for an area of no pdf exists for that area?
I guess you want the months ordered?

I would split the code into two parts.

One extracting the data and the other one writing out the result.
Become a Microsoft Certified Solutions Expert

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

aranaAuthor Commented:
if no pdf for the month no need to show that row
all areas should show even if empty
gelonidaCommented:
A quite quick and dirty solution (still not creating the tables but showing you kind of the expected results)

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    jan=1, feb=2, mar=3, apr=4, may=5, jun=6,
    jul=7, aug=8, sep=9, oct=10, nov=11, dec=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/chart.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("mon", end=": ")
        for area in sorted(areas):
            print("%10s" % area, end=" ")
        print()
        print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print(month, end=": ")
            for area in sorted(areas):
                print("%10s" % ("X" if area in urls else "-"), end=" ")
            print("")


info = gen_info()

show_info(info)

Open in new window


In order to create the html tables it would be required to know whether each table should be in a separate file or be inserted into an existing html document, etc.

However as data extraction and output is separated you wouldn't have to rewrite the function gen_info()
aranaAuthor Commented:
as long as I get the table part of the code, I really dont care if I get it in different file or same file,
 
This is to be run from the starting Path right?
gelonidaCommented:
Yes the script is intended to be run from the starting path. if you wanted, then the starting path could be a parameter passed to the script.

Could you try the script and see if the output looks right to you.
I don't have a windows machine at hand so can't test.

If the output looks OK, then I can change the code to create an html table.
aranaAuthor Commented:
So far great
wasn't working since I had 2.7 (my omission), installed 3.7 changed some stuff and works fine
also had to change some month names in the dict

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
    julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/*.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        #print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("<tr>", end="")
        for area in sorted(areas):
            #print("<td>","%10s" % area, end="</td>")
        print("</tr>")
        #print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print("<tr><td>",month, end=" </td>")
            for area in sorted(areas):
                # print("%s" % "<td>", ("X" if area in urls else "-"), end=" </td>")
                print("%s" % "<td>", ("<a href=\"images/pidmxdocuments/MejoraContinua/Programas/5s/Metrico/"+area+"/"+str(year)+"/"+month+"/"+"Radar%20Chart.pdf\">X</a>" if area in urls else "-"), end="</td>")
            print ("</tr>")
			
info = gen_info()
show_info(info)

Open in new window


I am getting an unwanted space at the begining of AREA I changed the line to
print("<td>"+area, end="</td>") to remove it and it works
but I dont like that :P, and area.strip threw me some errrors

Also I found out the pdf filenames are inconsistent, so I would like to take them from the IGLOB instead of using it as a constant, tried using fname but not working.
gelonidaCommented:
EDIT: Just saw that our messages crossed. just read my response.
I will read yours.


Here a version, that has a function, that returns the html code as string (it creates a list of strings which are concatenated and returned).

This allows you to print the html or integrate it (concatenate) into anohter html file.
Or you just write to a file.
#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
    julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/chart.pdf"):
        area, year_str, month, fname = fname.split(sep)
        year = int(year_str)
        print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # change next line if absolute or other urls are needed
        urls[year][month][area] = "/".join([area, year_str, month, fname])

    return areas, urls


def gen_html_table(year, info):
    areas, urls = info
    result = []
    year_urls = urls[year]
    result.append("<h1>Year %d</h1>\n" % year)
    result.append("<table>\n")
    result.append("<tr>")
    result.append("<th>month</th>")
    for area in sorted(areas):
        result.append("<th>%s</th>" % area)
    result.append("</tr>\n")
    for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
        result.append("<tr>")
        result.append("<th>%s</th>" % month)
        for area in sorted(areas):
            if area in urls:
                result.append('<th><a href="%s">X</a></th>' % urls[area])
            else:
                result.append("<th></th>")
        result.append("</tr>\n")
    result.append("</table>\n")
    return "".join(result)



info = gen_info()
years, url = info

for year in sorted(url.keys()):
    html = gen_html_table(year, info)
    # save html to file
    with open("table_%d.html" % year, "w") as fout:
        fout.write(html)

    # print to stdout
    print(html)

Open in new window


month_info is a table to help for correct ordering by month
gen_info() extracts the information from the current directory.
gen_html_table() returns a string containing the html code for one year's table.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
gelonidaCommented:
If the names of the pdf files are inconsistent, but you have only one pdf file per directory, then just use:
 for fname in iglob("*/*/*/*.pdf")

Open in new window


I don't understand the problem with 'area' can you run the new code with the new iglob and copy paste parts of the generated html to explain?


I just understood now the issue with the spaces and area. The reason is, that I used "%10s" as the formatting string. (My first example was supposed to look 'good' on the console, so I said to always use 10 character per area for better alginment)

Just replace it with "%s" and it should be fine

Concerning python3:

The code might work with python2 if you insert following line at the beginning of the file (after the #!/usr/bin/env python line)
from __future__ import print_function

Open in new window


If filenames or directory names or the generated HTML contain non ASCII characters, then other changes might be required
if you can use pip for installing the package future, then it is rather easy create code, that works with python2 and python3. As python2 will be obsoleted January 2020 it might be a good idea to write any new code such, that it will also run with python3.
aranaAuthor Commented:
PERFECT, just made minor adjustments (begin of absolute path, month names etc)
and your code works perfectly.
oBdACommented:
For the fun of it, a PowerShell version; assumes there's at most one pdf file in the month directory.
$root = 'C:\Charts'
$outFile = 'C:\Temp\chart.html'
$years = '2019', '2018', '2017'
$htmlTitle = 'Report'
$htmlHead = @'
<style>
	TABLE	{border-width:1px; border-style:solid; border-color:black;}
	TH		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
	TD		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
</style>
'@

Add-Type -AssemblyName System.Web
$areas = Get-ChildItem -Path $root -Directory | Select-Object -ExpandProperty Name
$months = 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'
$html = $years | ForEach-Object {
	$year = $_
	$months | ForEach-Object {
		$month = $_
		$out = '' | Select-Object -Property (@(@{n='Month'; e={"$($month) $($year)"}}) + $areas)
		ForEach ($area In $areas) {
			If ($pdf = Get-ChildItem -Path "$($root)\$($area)\$($year)\$($month)" -File -Filter *.pdf -ErrorAction SilentlyContinue) {
				$out.$area = "<a href='file://$($pdf.FullName)'>$($pdf.Name)</a>"
			}
		}
		$out
	}
} | ConvertTo-Html -Head $htmlHead -Title $htmlTitle 
[System.Web.HttpUtility]::HtmlDecode($html) | Set-Content -Path $outFile
Write-Host "Report written to $($outFile)"

Open in new window

aranaAuthor Commented:
Thank you all,
sorry oBdA, just had awarded the points when your solution appeared will check it out non the less.
gelonidaCommented:
An alternative If you want to create more complex html would be to use a templating engine like jinja (would require installation of an external python package)

But that's overkill for very basic html.
aranaAuthor Commented:
thanks for the suggestion, this is just a tedious monthly task that goes into a CMS , no need to kill flies with a canyon :D
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Powershell

From novice to tech pro — start learning today.