ntr2def
asked on
Scraping specific data within an XML document
I am working on scraping specific governing data from an XML file. I currently have it somewhat working, exporting certain data tags to a CSV file. However, it errors out, stating:
The input line is too long.
The syntax of the command is incorrect.
Here is the script:
@echo off
setlocal enabledelayedexpansion
set inFile="C:\temp\CFR-2016-title40-vol17.xml"
set outFile="C:\temp\output.csv"
set req_tags=SECTNO SUBJECT P E
set outLine=
echo SECTNO,SUBJECT,P,E > %outFile%
for %%a in (%inFile%) do (
    for %%c in (%req_tags%) do (
        set search_tag=%%c
        for /f "tokens=2 delims=>< " %%b in ('type "%%a" ^|findstr /i !search_tag!' ) do (
            if [%%b] NEQ [] (
                rem we don't want to match /BSN
                if [%%b] NEQ [/BSN] (
                    set outline=!outline!%%b,
                )
            )
        )
    )
)
rem output the values
rem remove trailing ,
set outline=%outline:~0,-1%
echo %outline%>>"%outFile%"
endlocal
Attached is the XML file I am trying to grab data from.
CFR-2016-title40-vol17.xml
Please double-check your PowerShell script. I don't see the same results when I run it against the XML file you posted.
ASKER
How are you exporting the information? I use:
ASKER
$XmlDocument.cfrdoc.title.chapter.subchap.part.section | export-csv .\testoutput.csv -notypeinformation
When I use the $XmlDocument.cfrdoc.title.chapter.subchap.part.section variable, it doesn't display the thing that you showed in your earlier comment.
ASKER
What do you get? And are you exporting to CSV?
I'm not there yet. What do you need the CSV to look like? There are lots of paragraphs in a section. Do you want one CSV line per paragraph? Do you want to concatenate the paragraphs into a single column/field in the CSV?
CSV cannot easily represent hierarchical or encapsulated data.
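To illustrate the point about hierarchical data, here is a minimal sketch using a made-up object rather than the real CFR file (the values and property names are assumptions for illustration). When a property holds a collection, the CSV cmdlets write the collection's type name instead of its contents, which is one reason piping a whole section straight to Export-Csv tends to disappoint.

```powershell
# Hypothetical stand-in for a section that holds several paragraphs.
$section = [pscustomobject]@{
    sectno = '63.1'
    P      = @('(a) First paragraph.', '(b) Second paragraph.')
}

# ConvertTo-Csv uses the same formatting as Export-Csv but returns strings,
# which makes the result easy to inspect.
$csv = $section | ConvertTo-Csv -NoTypeInformation
$csv
# The data row reads: "63.1","System.Object[]"
```

The paragraphs have to be flattened first, either into one row each or into a single delimited string.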
ASKER
It would be nice to concatenate the paragraph info into one cell, but it sounds like I'll be doing some macro work on top of the PowerShell script?
If you concatenate the paragraphs, how should they be delimited? Since CSV lines end with a CRLF, you will be challenged to concatenate the paragraphs into a single block of text without sacrificing readability or breaking the CSV.
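A rough sketch of the two concatenation choices, using made-up paragraph text (the " | " separator is an arbitrary assumption, not something from the CFR file). Joining with CRLF still yields valid CSV because the field is quoted, but a single record then spans more than one physical line, which trips up line-oriented tools. Joining with a visible separator keeps one record per line.

```powershell
$paragraphs = @('(a) First paragraph.', '(b) Second paragraph.')

# Option 1: join with CRLF. The field is quoted, so the CSV is still valid,
# but the one data record now occupies two physical lines.
$multiLine     = [pscustomobject]@{ sectno = '63.1'; P = $paragraphs -join "`r`n" }
$multiLineCsv  = ($multiLine | ConvertTo-Csv -NoTypeInformation) -join "`r`n"
$physicalLines = ($multiLineCsv -split "`r`n").Count   # header + 2 lines = 3

# Option 2: join with a visible separator so each record stays on one line.
$singleLine    = [pscustomobject]@{ sectno = '63.1'; P = $paragraphs -join ' | ' }
$singleLineCsv = $singleLine | ConvertTo-Csv -NoTypeInformation
$singleLineCsv
```

Import-Csv reads either form back as one record, so the choice mostly affects how the file looks in a text editor or behaves with tools like findstr.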
ASKER
Well, now that you put it that way, I think one paragraph per cell would be best, and the data could be manipulated after the fact. It's exporting that data as cleanly as possible that I am having trouble with.
How are your PS skills?
ASKER
I would like to think I'm rather decent, I have just not worked a ton with XML files.
ASKER CERTIFIED SOLUTION
ASKER
I have to say the output is different than what I envisioned, but it was done very well. I rather like the output much better; it's easy to read and follow.
I skipped the E elements. You will need to tweak the PS script if you need those.
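One possible tweak, sketched against a made-up fragment rather than the real CFR file (the section number, subject and paragraph text are assumptions). PowerShell's XML adapter hands back a text-only P element as a String, but a P that contains child markup such as E comes back as an XmlElement, so a String-only type check drops it. Reading InnerText from the element keeps the paragraph, with the E text folded into it.

```powershell
# Hypothetical fragment for illustration; the real file is much larger.
[xml]$doc = @'
<SECTION>
  <SECTNO>63.1</SECTNO>
  <SUBJECT>Applicability.</SUBJECT>
  <P>(a) Plain paragraph.</P>
  <P>(b) Paragraph with <E T="03">emphasis</E> inside.</P>
</SECTION>
'@

$rows = foreach ($p in $doc.SECTION.P) {
    # Text-only P elements arrive as String, mixed-content ones as XmlElement.
    $text = if ($p -is [string]) { $p } else { $p.InnerText }
    [pscustomobject]@{
        sectno  = $doc.SECTION.SECTNO
        subject = $doc.SECTION.SUBJECT
        P       = $text
    }
}
$rows | ConvertTo-Csv -NoTypeInformation
```

This flattens the emphasis rather than exporting E as its own column. If the E text is needed separately, it would have to be collected from the element's child nodes instead.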
ASKER
No worries. You gave me a path that I can follow and tweak to my liking. Thanks again for your help on this.
I just spotted and corrected a line. I was updating $o and have changed it to $outputArray in the posted code.
I encountered some Null values during my testing and added another check for null to make it more resilient.
function get-sectiondata($parmSection){
    $o=@()
    $parmsection | %{
        $sectno = $_.sectno
        $subject = $_.subject
        $_.P | %{
            try{
                if ($_.gettype().name -eq "String"){
                    #($sectno, $subject, $_) | ft -autosize
                    $newobj= new-object pscustomobject -property @{'sectno'=$sectno;'subject'=$subject;'P'=$_}
                    $o += $newobj
                }
            }
            catch{}
        }
    }
    return $o
}
[xml]$xmlDocument = get-content c:\test\CFR-2016-title40-vol17.xml
$outputArray=@()
$sectno=""
$subject=""
$XmlDocument.cfrdoc.title.chapter.subchap.part |
%{
    if (($_ | gm -MemberType property -name subpart) -ne $null){
        #break
        $_.subpart | %{
            $r=get-sectiondata($_.section)
            if ($r -ne $null){
                $outputArray+=$r #get-sectiondata($_.section)
            }
        }
    }
    if (($_ | gm -MemberType property -name section) -ne $null){
        $_ | %{
            $r=get-sectiondata($_.section)
            if ($r -ne $null){
                $outputArray+=$r #get-sectiondata($_.section)
            }
        }
    }
}
$outputArray | Export-Csv -NoTypeInformation -Path .\CFR-2016-title40-vol17.csv
ASKER
The output became:
However, when I try to export that data to CSV, I now get an error: