Solved

Scraping specific data within an XML document

Posted on 2016-10-27
19
68 Views
Last Modified: 2016-10-28
I am working on scraping specific governing data from an XML document. I currently have it partly working, exporting certain tags to a CSV file. However, it errors out with:

The input line is too long.
The syntax of the command is incorrect.



Here is the script:
@echo off
setlocal enabledelayedexpansion
set inFile="C:\temp\CFR-2016-title40-vol17.xml"
set outFile="C:\temp\output.csv"
set req_tags=SECTNO SUBJECT P E
set outLine=
echo SECTNO,SUBJECT,P,E > %outFile% 
for %%a in (%inFile%) do (
  for %%c in (%req_tags%) do (
    set search_tag=%%c
    for /f "tokens=2 delims=><  " %%b in ('type "%%a" ^|findstr /i !search_tag!' ) do (
      if [%%b] NEQ [] (
        rem we don't want to match /BSN
        if [%%b] NEQ [/BSN] (
          set outline=!outline!%%b,
        )
      )
    )
  )
)
rem output the values
rem remove trailing ,
set outline=%outline:~0,-1%
echo %outline%>>"%outFile%"
endlocal



Attached is the XML file I am trying to grab data from.
CFR-2016-title40-vol17.xml
0
Question by:ntr2def
19 Comments
 
LVL 1

Author Comment

by:ntr2def
ID: 41863033
So I attempted to do this in PowerShell:

PS C:\temp> [xml]$xmlDocument = get-content c:\temp\CFR-2016-title40-vol17.xml
PS C:\temp> $XmlDocument.cfrdoc.title.chapter.subchap.part.section



The output became:
SECTNO     SUBJECT                                              P
------     -------                                              -
§ 64.1  Definitions.                                         {The following definitions apply to this part. Excep...
§ 64.2  Applicability.                                       {P, (1) The unit is subject to an emission limitatio...
§ 64.3  Monitoring design criteria.                          {P, (1) The owner or operator shall design the monit...
§ 64.4  Submittal requirements.                              {(a) The owner or operator shall submit to the permi...
§ 64.5  Deadlines for submittals.                            {P, (1) On or after April 20, 1998, the owner or ope...
§ 64.6  Approval of monitoring.                              {(a) Based on an application that includes the infor...
§ 64.7  Operation of approved monitoring.                    {P, P, P, P...}
§ 64.8  Quality improvement plan (QIP) requirements.         {(a) Based on the results of a determination made un...
§ 64.9  Reporting and recordkeeping requirements.            {P, (2) A report for monitoring under this part shal...
§ 64.10 Savings provisions.                                  {(a) Nothing in this part shall:, P, (2) Restrict or...
§ 70.1  Program overview.                                    {P, (b) All sources subject to these regulations sha...
§ 70.2  Definitions.                                         {The following definitions apply to part 70. Except ...
§ 70.3  Applicability.                                       {P, (1) Any major source;, (2) Any source, including...
§ 70.4  State program submittals and transition.             {P, P, (1) A complete program description describing...
§ 70.5  Permit applications.                                 {P, P, P, (iii) For purposes of permit renewal, a ti...
§ 70.6  Permit content.                                      {P, (1) Emissions limitations and standards, includi...
§ 70.7  Permit issuance, renewal, reopenings, and revisions. {P, (i) The permitting authority has received a comp...
§ 70.8  Permit review by EPA and affected States.            {P, (2) The Administrator may waive the requirements...
§ 70.9  Fee determination and certification.                 {P, P, (i) Preparing generally applicable regulation...
§ 70.10 Federal oversight and sanctions.                     {P, (i) At any time the Administrator may apply any ...
§ 70.11 Requirements for enforcement authority.              {All programs to be approved under this part must co...



However, when I try to export that data to CSV, I now get an error:

Export-Csv : Cannot bind argument to parameter 'InputObject' because it is null.
At line:1 char:58
+ ... subchap.part.section | export-csv .\testoutput.csv -notypeinformation
+                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Export-Csv], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Microsoft.PowerShell.Commands.ExportCsvCo
   mmand
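
The message says that at least one item reaching Export-Csv is $null. A minimal sketch of filtering such items out before export follows; the sample objects are hypothetical stand-ins, not the real CFR data, and whether this alone produces a useful CSV for the real file is not confirmed in this thread.

```powershell
# Hypothetical sample data: two real objects with a $null between them.
$items = @(
    [pscustomobject]@{ SECTNO = '64.1'; SUBJECT = 'Definitions.' },
    $null,
    [pscustomobject]@{ SECTNO = '64.2'; SUBJECT = 'Applicability.' }
)

# Drop null entries before they reach the CSV cmdlet.
$clean = @($items | Where-Object { $null -ne $_ })

# ConvertTo-Csv is used here only to show the result on screen;
# Export-Csv accepts the same pipeline input.
$csv = $clean | ConvertTo-Csv -NoTypeInformation
$csv
```

The same Where-Object filter could be placed in front of Export-Csv in the one-liner above.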


0
 
LVL 45

Expert Comment

by:aikimark
ID: 41863670
Please double check your Powershell script.  I don't see the same results when I run it against the xml file you posted.
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41863952
How are you exporting the information? I use:


0

 
LVL 1

Author Comment

by:ntr2def
ID: 41863955
$XmlDocument.cfrdoc.title.chapter.subchap.part.section | export-csv .\testoutput.csv -notypeinformation


0
 
LVL 45

Expert Comment

by:aikimark
ID: 41863973
When I use the $XmlDocument.cfrdoc.title.chapter.subchap.part.section variable, it doesn't display the thing that you showed in your earlier comment.
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41863979
What do you get? And are you exporting to CSV?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864057
I'm not there yet.  What do you need the CSV to look like?  There are lots of paragraphs in a section. Do you want one CSV line per paragraph, or do you want to concatenate the paragraphs into a single column/field in the CSV?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864059
CSV cannot easily represent hierarchical or encapsulated data.
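
One common way around this is to flatten the hierarchy: emit one row per paragraph and repeat the parent section's fields on every row. The following is only a sketch under assumptions. The inline XML is a made-up miniature that borrows the SECTION, SECTNO, SUBJECT and P element names from this thread; it is not the real file.

```powershell
# Hypothetical miniature of the document structure (content invented).
[xml]$doc = @'
<PART>
  <SECTION>
    <SECTNO>1.1</SECTNO>
    <SUBJECT>Definitions.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>1.2</SECTNO>
    <SUBJECT>Applicability.</SUBJECT>
    <P>Only paragraph.</P>
  </SECTION>
</PART>
'@

# One output row per P element; the parent's SECTNO and SUBJECT repeat.
$rows = foreach ($section in $doc.SelectNodes('//SECTION')) {
    foreach ($p in $section.SelectNodes('P')) {
        [pscustomobject]@{
            SECTNO  = $section.SECTNO
            SUBJECT = $section.SUBJECT
            P       = $p.InnerText
        }
    }
}

$rows | ConvertTo-Csv -NoTypeInformation
```

The cost of this layout is repetition: the section number and subject appear once for every paragraph in that section.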
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41864066
It would be nice to concatenate the paragraph info into one cell, but it sounds like I'll be doing some macro work on top of the PowerShell script?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864082
If you concatenate the paragraphs, how should they be delimited?  Since CSV lines end with a CRLF, you will be challenged to concatenate the paragraphs into a single block of text without sacrificing readability or breaking the CSV
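
For illustration only, here is a sketch of the concatenation option using a visible delimiter instead of a line break. The delimiter ' || ' and the sample paragraphs are arbitrary assumptions; any string that never occurs in the paragraph text would do.

```powershell
# Hypothetical paragraphs belonging to a single section.
$paragraphs = @(
    '(a) First paragraph.',
    '(b) Second paragraph.',
    '(c) Third paragraph.'
)

# Join all paragraphs into one field so the section stays on one CSV line.
$row = [pscustomobject]@{
    SECTNO  = '1.1'
    SUBJECT = 'Definitions.'
    P       = $paragraphs -join ' || '
}

$row | ConvertTo-Csv -NoTypeInformation
```

The paragraphs can later be separated again by splitting the P column on the same delimiter.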
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41864089
Well, now that you put it that way, I think one paragraph per cell would be best, and the manipulation of the data could be done after the fact. It's being able to export that data as cleanly as possible that I am having trouble with.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864265
How are your PS skills?
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41864292
I would like to think I'm rather decent, I have just not worked a ton with XML files.
0
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 41864392
Please test this:
function get-sectiondata($parmSection){
    $o=@()
    $parmsection | %{
        $sectno = $_.sectno
        $subject = $_.subject
        $_.P | %{
            try{
                if ($_.gettype().name -eq "String"){
                    #($sectno, $subject, $_) | ft -autosize
                    $newobj= new-object pscustomobject -property @{'sectno'=$sectno;'subject'=$subject;'P'=$_}
                    $o += $newobj
                    }
                }
            catch{}
            }
        }

return $o
}

[xml]$xmlDocument = get-content c:\temp\CFR-2016-title40-vol17.xml
$outputArray=@()
$sectno=""
$subject=""
$XmlDocument.cfrdoc.title.chapter.subchap.part |
    %{
      if (($_ | gm -MemberType property -name subpart) -ne $null){
          $_.subpart | %{
                $outputArray+=get-sectiondata($_.section)
                }
          }
        
      if (($_ | gm -MemberType property -name section) -ne $null){
          $_ | %{
                $outputArray+=get-sectiondata($_.section)
                }
          }
      }
    
$outputArray | Export-Csv -NoTypeInformation -Path .\testoutput.csv


1
 
LVL 1

Author Closing Comment

by:ntr2def
ID: 41864444
I have to say the output is different than what I envisioned, but it was done very well. I actually like this output much better; it's easy to read and follow.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864453
I skipped the E elements.  You will need to tweak the PS script if you need those.
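
The posted function keeps only P values that come back as plain strings, so any paragraph that contains child elements is skipped. A rough sketch of one possible tweak follows. It assumes the E elements are nested inside P, which this thread does not confirm, and the inline XML and the T attribute value are invented for illustration.

```powershell
# Hypothetical section: one plain paragraph, one with a nested E element.
[xml]$doc = @'
<SECTION>
  <SECTNO>1.1</SECTNO>
  <SUBJECT>Definitions.</SUBJECT>
  <P>Plain paragraph.</P>
  <P>(a) <E T="03">Emphasized term</E> followed by more text.</P>
</SECTION>
'@

$section = $doc.SECTION

$rows = foreach ($p in $section.SelectNodes('P')) {
    [pscustomobject]@{
        sectno  = $section.SECTNO
        subject = $section.SUBJECT
        # InnerText returns the paragraph text including any nested E text.
        P       = $p.InnerText
        # Text of the E children on their own; empty when there are none.
        E       = @($p.SelectNodes('E') | ForEach-Object { $_.InnerText }) -join '; '
    }
}

$rows | ConvertTo-Csv -NoTypeInformation
```

Reading InnerText from the node, rather than testing for a String type, means mixed-content paragraphs are no longer dropped.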
0
 
LVL 1

Author Comment

by:ntr2def
ID: 41864465
No worries. You gave me a path that I can follow and tweak to my liking. Thanks again for your help on this.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864480
I just spotted and corrected a line.  I was updating $o and have changed it to $outputArray in the posted code
0
 
LVL 45

Expert Comment

by:aikimark
ID: 41864518
I encountered some Null values during my testing and added another check for null to make it more resilient.
function get-sectiondata($parmSection){

    $o=@()
    $parmsection | %{
        $sectno = $_.sectno
        $subject = $_.subject
        $_.P | %{
            try{
                if ($_.gettype().name -eq "String"){
                    #($sectno, $subject, $_) | ft -autosize
                    $newobj= new-object pscustomobject -property @{'sectno'=$sectno;'subject'=$subject;'P'=$_}
                    $o += $newobj
                    }
                }
            catch{}
            }
        }

return $o
}

[xml]$xmlDocument = get-content c:\test\CFR-2016-title40-vol17.xml
$outputArray=@()
$sectno=""
$subject=""
$XmlDocument.cfrdoc.title.chapter.subchap.part |
    %{
      if (($_ | gm -MemberType property -name subpart) -ne $null){
        #break
          $_.subpart | %{
                $r=get-sectiondata($_.section)
                if ($r -ne $null){
                    $outputArray+=$r   #get-sectiondata($_.section)
                }
                }
          }
        
      if (($_ | gm -MemberType property -name section) -ne $null){
          $_ | %{
                $r=get-sectiondata($_.section)
                if ($r -ne $null){
                    $outputArray+=$r   #get-sectiondata($_.section)
                }                
                }
          }
      }
    
$outputArray | Export-Csv -NoTypeInformation -Path .\CFR-2016-title40-vol17.csv


1
