script to generate table from weird directory structure

arana
arana used Ask the Experts™
on
I have a the following dir structure

..\area\year\month\PDFFILE

not all areas have the same years inside
not all years have the same months (some months have no data so no month dir is created for those)
if there is a month directory it means it contains a pdf inside it


sample dir:

ford\2018
ford\2019\Jan\chart.pdf
ford\2019\Feb\chart.pdf
chevy\2017
chevy\2018
chevy\2019
chevy\2019\Jan\chart.pdf
toyota\
.
.
I need a script batch,vb or powershell or python whatever
that will help me create the following HTML TABLE


                   2019
     ford   chevy   toyota
Jan   x       x
Feb   x
Mar

The X should be an html link to the PDF, but I can do this with search / replace for the X so the main objective is to generate the table code.

I cannot change the directory structure
If you have a solution using csv file with excel and pivot tables it is also acceptable (so i can then use some text manipulating functions to get my desired html)

I have tried with some dir to html utils but they are very
limited and I end up with many tables that then I have to manually stitch together

Please Help
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Top Expert 2014

Commented:
What if there are multiple years?

Author

Commented:
there are, the ideal would be a different table for each year, but now I need to focus only in 2019
So 1 table per year.
Would you like to show a month for a year if no pdf exists for it.
Do you want to show a line for an area of no pdf exists for that area?
I guess you want the months ordered?

I would split the code into two parts.

One extracting the data and the other one writing out the result.
Build an E-Commerce Site with Angular 5

Learn how to build an E-Commerce site with Angular 5, a JavaScript framework used by developers to build web, desktop, and mobile applications.

Author

Commented:
if no pdf for the month no need to show that row
all areas should show even if empty
A quite quick and dirty solution (still not creating the tables but showing you kind of the expected results)

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    jan=1, feb=2, mar=3, apr=4, may=5, jun=6,
    jul=7, aug=8, sep=9, oct=10, nov=11, dec=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/chart.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("mon", end=": ")
        for area in sorted(areas):
            print("%10s" % area, end=" ")
        print()
        print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print(month, end=": ")
            for area in sorted(areas):
                print("%10s" % ("X" if area in urls else "-"), end=" ")
            print("")


info = gen_info()

show_info(info)

Open in new window


In order to create the html tables it would be required to know whether each table should be in a separate file or be inserted into an existing html document, etc.

However as data extraction and output is separated you wouldn't have to rewrite the function gen_info()

Author

Commented:
as long as I get the table part of the code, I really dont care if I get it in different file or same file,
 
This is to be run from the starting Path right?
Yes the script is intended to be run from the starting path. if you wanted, then the starting path could be a parameter passed to the script.

Could you try the script and see if the output looks right to you.
I don't have a windows machine at hand so can't test.

If the output looks OK, then I can change the code to create an html table.

Author

Commented:
So far great
wasn't working since I had 2.7 (my omission), installed 3.7 changed some stuff and works fine
also had to change some month names in the dict

#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
    julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/*.pdf"):
        area, year, month, fname = fname.split(sep)
        year = int(year)
        #print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # add next line of absolute or other urls are needed
        urls[year][month][area] = fname

    return areas, urls


def show_info(info):
    areas, urls = info
    for year, year_urls in sorted(urls.items()):
        print()
        print("<h1>Year %d</h1>" % year)
        print("<tr>", end="")
        for area in sorted(areas):
            #print("<td>","%10s" % area, end="</td>")
        print("</tr>")
        #print("-" * 76)
        for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
            print("<tr><td>",month, end=" </td>")
            for area in sorted(areas):
                # print("%s" % "<td>", ("X" if area in urls else "-"), end=" </td>")
                print("%s" % "<td>", ("<a href=\"images/pidmxdocuments/MejoraContinua/Programas/5s/Metrico/"+area+"/"+str(year)+"/"+month+"/"+"Radar%20Chart.pdf\">X</a>" if area in urls else "-"), end="</td>")
            print ("</tr>")
			
info = gen_info()
show_info(info)

Open in new window


I am getting an unwanted space at the begining of AREA I changed the line to
print("<td>"+area, end="</td>") to remove it and it works
but I dont like that :P, and area.strip threw me some errrors

Also I found out the pdf filenames are inconsistent, so I would like to take them from the IGLOB instead of using it as a constant, tried using fname but not working.
EDIT: Just saw that our messages crossed. just read my response.
I will read yours.


Here a version, that has a function, that returns the html code as string (it creates a list of strings which are concatenated and returned).

This allows you to print the html or integrate it (concatenate) into anohter html file.
Or you just write to a file.
#!/usr/bin/env python

from collections import defaultdict
from glob import iglob
from os import sep  # path separator '/' for linux '\' for windows

# table for ordering by month
month_info = dict(
    enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
    julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)


def gen_info():
    urls = defaultdict(lambda: defaultdict(dict))
    areas = set()
    for fname in iglob("*/*/*/chart.pdf"):
        area, year_str, month, fname = fname.split(sep)
        year = int(year_str)
        print("%r %r %r %r" % (area, year, month, fname))
        areas.add(area)
        # change next line if absolute or other urls are needed
        urls[year][month][area] = "/".join([area, year_str, month, fname])

    return areas, urls


def gen_html_table(year, info):
    areas, urls = info
    result = []
    year_urls = urls[year]
    result.append("<h1>Year %d</h1>\n" % year)
    result.append("<table>\n")
    result.append("<tr>")
    result.append("<th>month</th>")
    for area in sorted(areas):
        result.append("<th>%s</th>" % area)
    result.append("</tr>\n")
    for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
        result.append("<tr>")
        result.append("<th>%s</th>" % month)
        for area in sorted(areas):
            if area in urls:
                result.append('<th><a href="%s">X</a></th>' % urls[area])
            else:
                result.append("<th></th>")
        result.append("</tr>\n")
    result.append("</table>\n")
    return "".join(result)



info = gen_info()
years, url = info

for year in sorted(url.keys()):
    html = gen_html_table(year, info)
    # save html to file
    with open("table_%d.html" % year, "w") as fout:
        fout.write(html)

    # print to stdout
    print(html)

Open in new window


month_info is a table to help for correct ordering by month
gen_info() extracts the information from the current directory.
gen_html_table() returns a string containing the html code for one year's table.
If the names of the pdf files are inconsistent, but you have only one pdf file per directory, then just use:
 for fname in iglob("*/*/*/*.pdf")

Open in new window


I don't understand the problem with 'area' can you run the new code with the new iglob and copy paste parts of the generated html to explain?


I just understood now the issue with the spaces and area. The reason is, that I used "%10s" as the formatting string. (My first example was supposed to look 'good' on the console, so I said to always use 10 character per area for better alginment)

Just replace it with "%s" and it should be fine

Concerning python3:

The code might work with python2 if you insert following line at the beginning of the file (after the #!/usr/bin/env python line)
from __future__ import print_function

Open in new window


If filenames or directory names or the generated HTML contain non ASCII characters, then other changes might be required
if you can use pip for installing the package future, then it is rather easy create code, that works with python2 and python3. As python2 will be obsoleted January 2020 it might be a good idea to write any new code such, that it will also run with python3.

Author

Commented:
PERFECT, just made minor adjustments (begin of absolute path, month names etc)
and your code works perfectly.
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
For the fun of it, a PowerShell version; assumes there's at most one pdf file in the month directory.
$root = 'C:\Charts'
$outFile = 'C:\Temp\chart.html'
$years = '2019', '2018', '2017'
$htmlTitle = 'Report'
$htmlHead = @'
<style>
	TABLE	{border-width:1px; border-style:solid; border-color:black;}
	TH		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
	TD		{border-width:1px; padding:1px; border-style:solid; border-color:black;}
</style>
'@

Add-Type -AssemblyName System.Web
$areas = Get-ChildItem -Path $root -Directory | Select-Object -ExpandProperty Name
$months = 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'
$html = $years | ForEach-Object {
	$year = $_
	$months | ForEach-Object {
		$month = $_
		$out = '' | Select-Object -Property (@(@{n='Month'; e={"$($month) $($year)"}}) + $areas)
		ForEach ($area In $areas) {
			If ($pdf = Get-ChildItem -Path "$($root)\$($area)\$($year)\$($month)" -File -Filter *.pdf -ErrorAction SilentlyContinue) {
				$out.$area = "<a href='file://$($pdf.FullName)'>$($pdf.Name)</a>"
			}
		}
		$out
	}
} | ConvertTo-Html -Head $htmlHead -Title $htmlTitle 
[System.Web.HttpUtility]::HtmlDecode($html) | Set-Content -Path $outFile
Write-Host "Report written to $($outFile)"

Open in new window

Author

Commented:
Thank you all,
sorry oBdA, just had awarded the points when your solution appeared will check it out non the less.
An alternative If you want to create more complex html would be to use a templating engine like jinja (would require installation of an external python package)

But that's overkill for very basic html.

Author

Commented:
thanks for the suggestion, this is just a tedious monthly task that goes into a CMS , no need to kill flies with a canyon :D

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial