Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

Need a Powershell Parse of some HTML code

Posted on 2014-03-04
25
Medium Priority
?
2,745 Views
Last Modified: 2014-03-10
I have an HTML similar type file. The file gets generated from an app so its not exactly HTML, but I need all the values exported into an csv file.

The file has a Header section and detail section. I don't mind having all the data into 1 table if  makes it easier.

Below I have included sample file to parse and small Powershell script the only thing I can do with Powershell i an still too green. Tried using the HTMLAgilty pack and was unsuccessful.

Here are the column details:
Header Info
codebase, instance_id,guid, id, name, version, registrar, risk_index, force_start, locale_id, template_version, status, score, date, source_object_type,source_object_id, calling_hra_id, external_page_id, in_external_page, from_other_hra, doctorsession_id, obxid, case_id,
loggedin_userid, patient_id, patientname, allow_void,

--Detail Rows
SKIPPED_HISTORY,PAGE id, flush_concepts, update_date, valid,QUESTION id, Score, name, value
Sample2.html
GetHTMLData.txt
0
Comment
Question by:Leo Torres
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 9
  • 7
  • 5
  • +1
25 Comments
 
LVL 4

Expert Comment

by:Jason Ryberg
ID: 39904170
PowerShell has a builtin XML object type that will parse the XML content for you:

[xml]$xml = Get-Content C:\users\Jason\Desktop\test.xml

Open in new window


Once the XML object is loaded, you can traverse the object similarly to other PS objects!
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39904409
once its xml Content any good tutorials on how to extract data from this variable.

And I also see you are passing an xml file.  Could I just rename the file below as xml? Does not seem to have proper tags?
0
 
LVL 4

Expert Comment

by:Jason Ryberg
ID: 39904459
About halfway down this post is a good overview of how navigating XML objects works:

https://blogs.technet.com/b/heyscriptingguy/archive/2013/04/01/working-with-xml.aspx
0
Creating Active Directory Users from a Text File

If your organization has a need to mass-create AD user accounts, watch this video to see how its done without the need for scripting or other unnecessary complexities.

 
LVL 46

Expert Comment

by:aikimark
ID: 39905435
@ltorres321

What would you like your CSV file to look like, given this input?  I concur with the assertion that you are working with XML data.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 39905451
Another way to get at (parse) the question data would be to apply a regular expression pattern to the contents of the XML file.  The following pattern worked for the question data in the file you posted.
<QUESTION id="(\d+)" Score="(\d+)" name="(.*?)" (?:value="){0,1}(.*?)(?:" ){0,1}/>

Open in new window

0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39906980
aikimark your line of code where do I incorporate it into my code?
0
 
LVL 46

Expert Comment

by:aikimark
ID: 39907173
where do I incorporate it into my code?
Please answer my prior question about what your output needs to look like?
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39907719
The out needs to in a table Format so I can import into a SQL table
0
 
LVL 4

Expert Comment

by:Jason Ryberg
ID: 39907729
Using my method, it would be fairly simple:

[xml]$xml = Get-Content C:\tmp\test.xml
$xml.RESULTSCACHE | export-csv test.csv

Open in new window

0
 
LVL 46

Expert Comment

by:aikimark
ID: 39908186
@Jason

I don't think that CSV file is what ltorres321 had in mind.

You posted an elegant PS script, however.  It is a bit too simple for the detail I expect is needed.
0
 
LVL 46

Expert Comment

by:aikimark
ID: 39908190
The out needs to in a table Format so I can import into a SQL table

Please post a couple of lines of CSV data from that XML file.  Do this editing by hand.
0
 
LVL 4

Expert Comment

by:Jason Ryberg
ID: 39908229
I'll be interested to see what kind of table ltorres321 would prefer to use to import into SQL if it's not CSV... seems pretty obvious, but I could be wrong!
0
 
LVL 46

Expert Comment

by:aikimark
ID: 39908258
Yes.  If you have the data in memory, why persist it to CSV?  Just pump it directly into SQL Server.

However, if the import process is SSIS and the data gets validated, merged or split, I can see a reason why the existing SSIS code might be retained.
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39909479
There is Header data and detail data in this file.

Jason Ryberg : as per your code that actually returns the Header data
which is between  <RESULTSCACHE  and ends at the <SKIPPED_HISTORY value=""/>

There is more data after that that is the Line Detail for this header.

As for CSV it was just to make it easier to get into the database. I was going to use SSIS to move move that data into tables but if Powershell can do it then so be it as well.
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39909481
Jason your code is nice and very clean but how about the other fields?
0
 
LVL 4

Assisted Solution

by:Jason Ryberg
Jason Ryberg earned 100 total points
ID: 39909632
I don't think we're very clear on how you're going to need this data formatted.  As aikimark has suggested, do you think you could give a brief example of what the data will look like after it's been formatted?

I see that there are several "Question" data nodes, which can be tabulated by:

$xml.RESULTSCACHE.ChildNodes.ChildNodes | Export-CSV childnodes.csv

Open in new window


But, you're still losing some of your data that way.
0
 
LVL 71

Expert Comment

by:Qlemo
ID: 39909643
You'll need to separate RESULTSCACHE data and QUESTION data, so I would use this approach:
[xml] $xml = Get-Content C:\tmp\test.xml
$xml.ResultsCache | export-csv -NoType C:\tmp\Results.csv
$xml.ResultsCache.Page | % { $_.Question } | export-csv -NoType C:\Tmp\Questions.csv

Open in new window

0
 
LVL 71

Expert Comment

by:Qlemo
ID: 39909656
Note that Jason's approach only works starting with PS 3. The equivalent code by naming the properties (as in my code snippet) is
$xml.ResultsCache.Page.Question | export-csv -NoType c:\temp\Questions.csv

Open in new window

0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39909688
The data just needs to go into a table table. That's just rows and Columns.

Qlemo: that's impressive. Now how do I include the page number as well in the data.

And is there an ID I can create on the fly so when I put this into a table I can tie the header to the details?  Hope that makes sense.

Thanks for the help. I thought this was going to be couple hundred lines of code you did most if it in 3!
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39909694
FYI PS 3 is fine I am using that.

As I also Compare data to what you are producinf Qlemo;

I see this strings of data is missing

Page ID
Flush_Concepts
Update_Date
valid


<PAGE id="1" flush_concepts="0" update_date="7/25/2012 8:43:58 AM" valid="1">
0
 
LVL 71

Accepted Solution

by:
Qlemo earned 1600 total points
ID: 39909871
Correct. We are just dumping the QUESTION part, dismissing any info of PAGE, so you won't find that. I didn't consider the page info important by any means, and probably they aren't. But if you insist in keeping them:
[xml]$xml = Get-Content C:\tmp\Sample2.html
$xml.ResultsCache | export-csv -NoType C:\tmp\Results.csv
$xml.ResultsCache.Page | % { $page = $_; $_.Question | select `
  @{n='PageID'        ; e = {$page.ID}},
  @{n='flush_concepts'; e = {$page.flush_concepts}},
  @{n='valid'         ; e = {$page.valid}},
  @{n='update_date'   ; e = {$page.update_date}},
  ID, Score, Value
} | export-csv -NoType C:\tmp\Questions.csv

Open in new window

The ResultsCache ID stuff depends on how it should be built. If you we can use instance_id, all we have to do is to include that in each question row, similar to the page data above.
0
 
LVL 46

Assisted Solution

by:aikimark
aikimark earned 300 total points
ID: 39910101
Assuming that ltorres321 wants the skipped_history data, a slight tweak of Qlemo's script should produce the desired results in any version of PS.
[xml] $xml = Get-Content '\\pcna-02\mydocs\ltorres\My Documents\_CCA\Sample2.html'
$SH = $XML.resultscache.skipped_history
$xml.ResultsCache.Page | % { $page = $_; $_.Question | select `
  @{n='SkippedHistory'        ; e = {$SH.value}},
  @{n='PageID'        ; e = {$page.ID}},
  @{n='flush_concepts'; e = {$page.flush_concepts}},
  @{n='valid'         ; e = {$page.valid}},
  @{n='update_date'   ; e = {$page.update_date}},
  ID, Score, Value
} | export-csv -NoType '\\pcna-02\mydocs\ltorres\My Documents\_CCA\Sample2.csv'

Open in new window

I had worked up a nested foreach solution, but the Qlemo code is simpler and probably performs better.
0
 
LVL 8

Author Closing Comment

by:Leo Torres
ID: 39910500
Wow, Excellent I will close this here for now. I will try to get this data into SQL tables. If I cant I will post another question to be fair. Thanks to all for the exceptional HELP! I need every bit of it.
0
 
LVL 8

Author Comment

by:Leo Torres
ID: 39917242
0

Featured Post

Ask an Anonymous Question!

Don't feel intimidated by what you don't know. Ask your question anonymously. It's easy! Learn more and upgrade.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A recent project that involved parsing Tableau Desktop and Server log files to extract reusable user queries for use in other systems. I chose to use PowerShell to gather the data, and SharePoint to present it...
In the absence of a fully-fledged GPO Management product like AGPM, the script in this article will provide you with a simple way to watch the domain (or a select OU) for GPOs changes and automatically take backups when policies are added, removed o…
Exchange organizations may use the Journaling Agent of the Transport Service to archive messages going through Exchange. However, if the Transport Service is integrated with some email content management application (such as an antispam), the admini…
In a question here at Experts Exchange (https://www.experts-exchange.com/questions/29062564/Adobe-acrobat-reader-DC.html), a member asked how to create a signature in Adobe Acrobat Reader DC (the free Reader product, not the paid, full Acrobat produ…

618 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question