Arana (G.P.)
asked on
script to generate table from weird directory structure
I have a the following dir structure
..\area\year\month\PDFFILE
not all areas have the same years inside
not all years have the same months (some months have no data so no month dir is created for those)
if there is a month directory it means it contains a pdf inside it
sample dir:
ford\2018
ford\2019\Jan\chart.pdf
ford\2019\Feb\chart.pdf
chevy\2017
chevy\2018
chevy\2019
chevy\2019\Jan\chart.pdf
toyota\
.
.
I need a script batch,vb or powershell or python whatever
that will help me create the following HTML TABLE
2019
ford chevy toyota
Jan x x
Feb x
Mar
The X should be an html link to the PDF, but I can do this with search / replace for the X so the main objective is to generate the table code.
I cannot change the directory structure
If you have a solution using csv file with excel and pivot tables it is also acceptable (so i can then use some text manipulating functions to get my desired html)
I have tried with some dir to html utils but they are very
limited and I end up with many tables that then I have to manually stitch together
Please Help
..\area\year\month\PDFFILE
not all areas have the same years inside
not all years have the same months (some months have no data so no month dir is created for those)
if there is a month directory it means it contains a pdf inside it
sample dir:
ford\2018
ford\2019\Jan\chart.pdf
ford\2019\Feb\chart.pdf
chevy\2017
chevy\2018
chevy\2019
chevy\2019\Jan\chart.pdf
toyota\
.
.
I need a script batch,vb or powershell or python whatever
that will help me create the following HTML TABLE
2019
ford chevy toyota
Jan x x
Feb x
Mar
The X should be an html link to the PDF, but I can do this with search / replace for the X so the main objective is to generate the table code.
I cannot change the directory structure
If you have a solution using csv file with excel and pivot tables it is also acceptable (so i can then use some text manipulating functions to get my desired html)
I have tried with some dir to html utils but they are very
limited and I end up with many tables that then I have to manually stitch together
Please Help
What if there are multiple years?
ASKER
there are, the ideal would be a different table for each year, but now I need to focus only in 2019
So 1 table per year.
Would you like to show a month for a year if no pdf exists for it.
Do you want to show a line for an area of no pdf exists for that area?
I guess you want the months ordered?
I would split the code into two parts.
One extracting the data and the other one writing out the result.
Would you like to show a month for a year if no pdf exists for it.
Do you want to show a line for an area of no pdf exists for that area?
I guess you want the months ordered?
I would split the code into two parts.
One extracting the data and the other one writing out the result.
ASKER
if no pdf for the month no need to show that row
all areas should show even if empty
all areas should show even if empty
A quite quick and dirty solution (still not creating the tables but showing you kind of the expected results)
In order to create the html tables it would be required to know whether each table should be in a separate file or be inserted into an existing html document, etc.
However as data extraction and output is separated you wouldn't have to rewrite the function gen_info()
#!/usr/bin/env python
from collections import defaultdict
from glob import iglob
from os import sep # path separator '/' for linux '\' for windows
# table for ordering by month
month_info = dict(
jan=1, feb=2, mar=3, apr=4, may=5, jun=6,
jul=7, aug=8, sep=9, oct=10, nov=11, dec=12)
def gen_info():
urls = defaultdict(lambda: defaultdict(dict))
areas = set()
for fname in iglob("*/*/*/chart.pdf"):
area, year, month, fname = fname.split(sep)
year = int(year)
print("%r %r %r %r" % (area, year, month, fname))
areas.add(area)
# add next line of absolute or other urls are needed
urls[year][month][area] = fname
return areas, urls
def show_info(info):
areas, urls = info
for year, year_urls in sorted(urls.items()):
print()
print("<h1>Year %d</h1>" % year)
print("mon", end=": ")
for area in sorted(areas):
print("%10s" % area, end=" ")
print()
print("-" * 76)
for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
print(month, end=": ")
for area in sorted(areas):
print("%10s" % ("X" if area in urls else "-"), end=" ")
print("")
info = gen_info()
show_info(info)
In order to create the html tables it would be required to know whether each table should be in a separate file or be inserted into an existing html document, etc.
However as data extraction and output is separated you wouldn't have to rewrite the function gen_info()
ASKER
as long as I get the table part of the code, I really dont care if I get it in different file or same file,
This is to be run from the starting Path right?
This is to be run from the starting Path right?
Yes the script is intended to be run from the starting path. if you wanted, then the starting path could be a parameter passed to the script.
Could you try the script and see if the output looks right to you.
I don't have a windows machine at hand so can't test.
If the output looks OK, then I can change the code to create an html table.
Could you try the script and see if the output looks right to you.
I don't have a windows machine at hand so can't test.
If the output looks OK, then I can change the code to create an html table.
ASKER
So far great
wasn't working since I had 2.7 (my omission), installed 3.7 changed some stuff and works fine
also had to change some month names in the dict
I am getting an unwanted space at the begining of AREA I changed the line to
print("<td>"+area, end="</td>") to remove it and it works
but I dont like that :P, and area.strip threw me some errrors
Also I found out the pdf filenames are inconsistent, so I would like to take them from the IGLOB instead of using it as a constant, tried using fname but not working.
wasn't working since I had 2.7 (my omission), installed 3.7 changed some stuff and works fine
also had to change some month names in the dict
#!/usr/bin/env python
from collections import defaultdict
from glob import iglob
from os import sep # path separator '/' for linux '\' for windows
# table for ordering by month
month_info = dict(
enero=1, febrero=2, marzo=3, abril=4, mayo=5, junio=6,
julio=7, agosto=8, septiembre=9, octubre=10, noviembre=11, diciembre=12)
def gen_info():
urls = defaultdict(lambda: defaultdict(dict))
areas = set()
for fname in iglob("*/*/*/*.pdf"):
area, year, month, fname = fname.split(sep)
year = int(year)
#print("%r %r %r %r" % (area, year, month, fname))
areas.add(area)
# add next line of absolute or other urls are needed
urls[year][month][area] = fname
return areas, urls
def show_info(info):
areas, urls = info
for year, year_urls in sorted(urls.items()):
print()
print("<h1>Year %d</h1>" % year)
print("<tr>", end="")
for area in sorted(areas):
#print("<td>","%10s" % area, end="</td>")
print("</tr>")
#print("-" * 76)
for month, urls in sorted(year_urls.items(), key=lambda mon_urls: month_info[mon_urls[0].lower()]):
print("<tr><td>",month, end=" </td>")
for area in sorted(areas):
# print("%s" % "<td>", ("X" if area in urls else "-"), end=" </td>")
print("%s" % "<td>", ("<a href=\"images/pidmxdocuments/MejoraContinua/Programas/5s/Metrico/"+area+"/"+str(year)+"/"+month+"/"+"Radar%20Chart.pdf\">X</a>" if area in urls else "-"), end="</td>")
print ("</tr>")
info = gen_info()
show_info(info)
I am getting an unwanted space at the begining of AREA I changed the line to
print("<td>"+area, end="</td>") to remove it and it works
but I dont like that :P, and area.strip threw me some errrors
Also I found out the pdf filenames are inconsistent, so I would like to take them from the IGLOB instead of using it as a constant, tried using fname but not working.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
If the names of the pdf files are inconsistent, but you have only one pdf file per directory, then just use:
I don't understand the problem with 'area' can you run the new code with the new iglob and copy paste parts of the generated html to explain?
I just understood now the issue with the spaces and area. The reason is, that I used "%10s" as the formatting string. (My first example was supposed to look 'good' on the console, so I said to always use 10 character per area for better alginment)
Just replace it with "%s" and it should be fine
Concerning python3:
The code might work with python2 if you insert following line at the beginning of the file (after the #!/usr/bin/env python line)
If filenames or directory names or the generated HTML contain non ASCII characters, then other changes might be required
if you can use pip for installing the package future, then it is rather easy create code, that works with python2 and python3. As python2 will be obsoleted January 2020 it might be a good idea to write any new code such, that it will also run with python3.
for fname in iglob("*/*/*/*.pdf")
I don't understand the problem with 'area' can you run the new code with the new iglob and copy paste parts of the generated html to explain?
I just understood now the issue with the spaces and area. The reason is, that I used "%10s" as the formatting string. (My first example was supposed to look 'good' on the console, so I said to always use 10 character per area for better alginment)
Just replace it with "%s" and it should be fine
Concerning python3:
The code might work with python2 if you insert following line at the beginning of the file (after the #!/usr/bin/env python line)
from __future__ import print_function
If filenames or directory names or the generated HTML contain non ASCII characters, then other changes might be required
if you can use pip for installing the package future, then it is rather easy create code, that works with python2 and python3. As python2 will be obsoleted January 2020 it might be a good idea to write any new code such, that it will also run with python3.
ASKER
PERFECT, just made minor adjustments (begin of absolute path, month names etc)
and your code works perfectly.
and your code works perfectly.
For the fun of it, a PowerShell version; assumes there's at most one pdf file in the month directory.
$root = 'C:\Charts'
$outFile = 'C:\Temp\chart.html'
$years = '2019', '2018', '2017'
$htmlTitle = 'Report'
$htmlHead = @'
<style>
TABLE {border-width:1px; border-style:solid; border-color:black;}
TH {border-width:1px; padding:1px; border-style:solid; border-color:black;}
TD {border-width:1px; padding:1px; border-style:solid; border-color:black;}
</style>
'@
Add-Type -AssemblyName System.Web
$areas = Get-ChildItem -Path $root -Directory | Select-Object -ExpandProperty Name
$months = 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'
$html = $years | ForEach-Object {
$year = $_
$months | ForEach-Object {
$month = $_
$out = '' | Select-Object -Property (@(@{n='Month'; e={"$($month) $($year)"}}) + $areas)
ForEach ($area In $areas) {
If ($pdf = Get-ChildItem -Path "$($root)\$($area)\$($year)\$($month)" -File -Filter *.pdf -ErrorAction SilentlyContinue) {
$out.$area = "<a href='file://$($pdf.FullName)'>$($pdf.Name)</a>"
}
}
$out
}
} | ConvertTo-Html -Head $htmlHead -Title $htmlTitle
[System.Web.HttpUtility]::HtmlDecode($html) | Set-Content -Path $outFile
Write-Host "Report written to $($outFile)"
ASKER
Thank you all,
sorry oBdA, just had awarded the points when your solution appeared will check it out non the less.
sorry oBdA, just had awarded the points when your solution appeared will check it out non the less.
An alternative If you want to create more complex html would be to use a templating engine like jinja (would require installation of an external python package)
But that's overkill for very basic html.
But that's overkill for very basic html.
ASKER
thanks for the suggestion, this is just a tedious monthly task that goes into a CMS , no need to kill flies with a canyon :D