Need help with PowerShell: scraping a log file and adding named regex captures to an array

Input
07/12/2001 Jane made 3 pies and stopped after 5 min
08/14/2002 Joe ate 9 apples and started in 20 seconds
   08/14/2002 Joe ate 9 apples and started in 20 seconds


Output
Date        Name  Action   Time  Units

07/12/2001  Jane  Stopped  5     min
08/14/2002  Joe   Started  20    seconds

I want to create an array that holds one row per entry above, [Date, Name, Action, Time, Units],
adding a row for each matching line.

I hope that is clear.
Do I use Select-String with my regex, which matches the items?
Or is this a job for Get-Content?
I am new to PowerShell and have just nailed the regex to get the named captures.
Now I just don't know how to put them into an array as they are matched.
Andre P Asked:

footech Commented:
As with most things PowerShell, there's more than one approach that will work.  Unless I'm trying to find out the line number of a match though, I generally won't use Select-String.  Below is what I'll usually use.
Get-Content file.txt | Where-Object { $_ -match $pattern } | ForEach-Object {
    New-Object PsObject -Property @{
                            Date = $Matches["date"]
                            Name = $Matches["name"]
                            Action = $Matches["action"]
                            Time = $Matches["time"]
                            units = $Matches["units"]
                            }
}

You said you have the regex pattern, so feel free to post it (set it as $pattern).  My example above assumes named captures - you can adjust accordingly.  It doesn't store things in an array, but creates an object with the appropriate properties.  You can store the entire output of objects as an array if you wish (just add $array = at the very beginning), or continue processing with the pipeline.
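To make the "$array =" suggestion concrete, here is a minimal sketch. The sample lines and the pattern are assumptions based on the input shown in the question, not the asker's actual regex, and $lines stands in for Get-Content file.txt.

```powershell
# Hypothetical sample lines standing in for: Get-Content file.txt
$lines = @(
    '07/12/2001 Jane made 3 pies and stopped after 5 min'
    '08/14/2002 Joe ate 9 apples and started in 20 seconds'
    'this line does not match'
)

# Assumed pattern with named captures; the real one may differ
$pattern = '^(?<date>\d\d/\d\d/\d{4}) (?<name>\w+) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>\w+)$'

# @( ... ) guarantees an array even when zero or one line matches
$rows = @(
    $lines | Where-Object { $_ -match $pattern } | ForEach-Object {
        New-Object PSObject -Property @{
            Date   = $Matches['date']
            Name   = $Matches['name']
            Action = $Matches['action']
            Time   = [int]$Matches['time']   # cast so later arithmetic is numeric
            Units  = $Matches['units']
        }
    }
)

$rows.Count      # number of matching lines
$rows[0].Name    # property access on the first row
```

Each element of $rows is one object per matching line, so indexing and property access work as they would for rows of a table.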
Andre P (Author) Commented:
Thanks for the help.

Just so I understand:
if I have, say, 100 text files with date, time, action, etc.,
each of which has 2 lines that match the regex,
I will have 200 objects with the properties of date, time, etc.?
Will they all be in one file?

What if, for example, I want the average of the times of all the objects for a particular action?
Could I sort by object property, then perform the calculation on the time property?
I was thinking of an array, but maybe I was wrong.
What I essentially want is to create a file much like your average Get-Process output,
so I can then get metrics by doing calculations on the various properties.
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
This is another way, somewhat more automated:
Get-Content file*.txt|
  ? { $_ -match '^(?<date>\d\d/\d\d/\d\d\d\d) (?<name>\b\D+?\b) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>.*)' } |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 }

It generates each object from the named groups found, and also goes through all files matching the file pattern.
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
The resulting objects can be piped to, e.g., Measure-Object (to get sum, average, etc.), in conjunction with Group-Object to apply some grouping. However, mixed units make a proper calculation difficult; you should normalize the units before doing calculations. For example, you would append this after the closing brace of the previous snippet:
|
   group-object Name, action, units |
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum time | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }

That is, of course, a very simplified report.
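As a rough illustration of normalizing units before the statistics, the sketch below converts everything to seconds first. The sample records and the $secondsPer conversion table are assumptions made up for the example.

```powershell
# Hypothetical parsed records with mixed units
$records = @(
    New-Object PSObject -Property @{ Name = 'Jane'; Action = 'stopped'; Time = 5;  Units = 'min' }
    New-Object PSObject -Property @{ Name = 'Joe';  Action = 'started'; Time = 20; Units = 'seconds' }
    New-Object PSObject -Property @{ Name = 'Joe';  Action = 'started'; Time = 1;  Units = 'min' }
)

# Assumed conversion table: seconds per unit name
$secondsPer = @{ sec = 1; secs = 1; seconds = 1; min = 60; mins = 60; hours = 3600 }

# Add a calculated Seconds property so every record uses the same unit
$normalized = $records | Select-Object Name, Action,
    @{ n = 'Seconds'; e = { [int]$_.Time * $secondsPer[$_.Units] } }

# Group first, then measure each group on the normalized value
$report = $normalized | Group-Object Name, Action | ForEach-Object {
    $stats = $_.Group | Measure-Object -Property Seconds -Average -Sum
    New-Object PSObject -Property @{
        Group   = $_.Name
        Average = $stats.Average
        Sum     = $stats.Sum
        Count   = $stats.Count
    }
}

$report
```

With the units folded into one Seconds property, the units no longer need to be part of the grouping key.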
Andre P (Author) Commented:
Here is an actual example of what I am trying to accomplish
Text string


 $x =
"08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs"


Here is my regex
$regex = "(?<date>^(0[1-9]|1[012])[\/](0[1-9]|[12][0-9]|3[01])[ \/](19|20)\d\d)d{2}:\d{2}\sAM|PM\s-\sDRIVE\s(?<drive>\w)+:(?<name>\w+\s\w+)\sIndex\s\sFinished\s\w+\s(?<duration>\d+)\s(?<unit>secs|mins|hours)"


Question 1 :
Why does this not match ?

$x | Where { $_ -match $regex } | ForEach {
    New-Object PsObject -Property @{
        Date     = $Matches["date"]
        Drive    = $Matches["drive"]
        Process  = $Matches["name"]
        Duration = $Matches["duration"]
        units    = $Matches["unit"]
    }
}
I should have the named captures date, drive, name, duration and unit, no?

I then want to end up with a collection in which each matching line is an object with the captures as properties.

I am trying to learn, but I am stuck at this point.
footech Commented:
I will have 200 objects with the properties of date, time, etc.?
Yes.

Will they all be in one file?
That depends on additional code.  What I posted above only submits the objects to the pipeline, and by default will be sent to your screen.  If you wanted it sent to a file, then you add a command that will write to a file.  Typically you'd use Export-CSV, so that later you could read the file back in and do additional processing.  And whether you create one file or two hundred is also up to you - it all depends on what you need.

Say you've read in a .CSV using Import-CSV.  You know that you've just created an array with the Import-CSV command, right (assuming you saved it to a variable and there was more than one item in the file)?  If you have no need to keep the data around long-term (i.e. longer than you have your PS session open), then you can skip writing out to a file, and the subsequent import, and just save the output to a variable.

An array is an object in memory.  It seems like you may be confusing that with a file that is written to disk.
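The difference between the in-memory array and the file on disk can be seen in a short round trip. This is only a sketch: the two sample objects and the temp file name are made up for the example.

```powershell
# Hypothetical objects, as the parsing pipeline might produce them
$rows = @(
    New-Object PSObject -Property @{ Name = 'Jane'; Time = 5 }
    New-Object PSObject -Property @{ Name = 'Joe';  Time = 20 }
)

# Write the in-memory array to disk (file name is an assumption)
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'parsed-log-demo.csv'
$rows | Export-Csv -Path $path -NoTypeInformation

# Read it back: this is a new array of new objects
$fromDisk = @(Import-Csv -Path $path)

$fromDisk.Count
$fromDisk[0].Time -is [string]    # CSV values come back as strings
$average = ($fromDisk | Measure-Object -Property Time -Average).Average

Remove-Item $path
```

Note that everything read back from a CSV is text, so cast values such as Time to a number before sorting or comparing them.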

Generally you'll do a sort utilizing the pipeline ( stuff | Sort ), and yes, what I posted above easily works with that - just put a pipe after the ForEach block.
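For the earlier question about averaging the times for one action, a sketch along these lines might help. The sample objects are hypothetical, and Time is deliberately stored as text, as it would be straight out of a regex capture.

```powershell
# Hypothetical parsed objects; Time is text, as captured by the regex
$rows = @(
    New-Object PSObject -Property @{ Action = 'stopped'; Time = '15' }
    New-Object PSObject -Property @{ Action = 'started'; Time = '20' }
    New-Object PSObject -Property @{ Action = 'stopped'; Time = '5' }
)

# Sort by Action, then numerically by Time (the cast avoids '15' sorting before '5')
$sorted = $rows | Sort-Object Action, { [int]$_.Time }

# Average of Time for one particular action; no sort is needed for this
$avgStopped = ($rows |
    Where-Object { $_.Action -eq 'stopped' } |
    Measure-Object -Property Time -Average).Average

$sorted
$avgStopped
```

Sorting is only needed for presentation; Where-Object plus Measure-Object is enough for the calculation itself.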
footech Commented:
Named captures, yes, but the regex doesn't match.
I'll see what I can come up with unless Qlemo (or someone else) does first.
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
Andre, honestly! Your new example is too far from the original. You should not do that. Our answers can only be as good as your input, and so you should choose to provide info as close as possible to what you really have (if you need to disguise info), and do that in the first place.

"Why does this not match ?" - because the RegEx is wrong, obviously. Your RegEx is very specific, and anything that differs even slightly from the pattern will stop the match completely.
footech Commented:
I was just looking at the last sample text, and I have doubts as to whether that's a true representation of any output.  It looks like you inserted line breaks, but not consistently, such that the date is sometimes at the beginning, end, or middle of a line.  And as Qlemo stated, it doesn't match your original question.
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
Also something you need to consider when using a string for test instead of a file: you need to split the multiline string into multiple strings, because that is what Get-Content does. Each line is a new string object. To make a correct test case, you need to use something like
$x = @"
...
"@ -split "`n"
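Why the split matters can be shown with a tiny test. The text and pattern below are made up for illustration; the split also tolerates Windows line endings, which is an addition to the snippet above.

```powershell
# Hypothetical three-line text held in ONE string
$text = "first line`n08/14/2013 data line`nlast line"
$pattern = '^\d\d/\d\d/\d{4}'

# Against the whole string, ^ only anchors at the very start, so this fails
$wholeStringMatches = $text -match $pattern

# Split into one string per line, as Get-Content would deliver them
$lines = $text -split "`r?`n"
$hits = @($lines | Where-Object { $_ -match $pattern })

$wholeStringMatches
$hits.Count
```

After the split, each line is tested on its own, so the ^ anchor applies to every line rather than only to the first one.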

You forgot the match against the space between date and time portion.

As I see it, you want to capture only those lines looking like:
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

I'm not clear whether you want to have "Name" being "Create", "Profile" or both. As-is, you capture "Create Profile" (or, more precisely, the two words between the drive and "Index").

After "Index", you left out the dash. No match here.

The RegEx working for me is
$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"

Using spaces instead of \s makes it more restrictive and fragile, but much better to read and match manually.
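The trade-off between literal spaces and \s can be sketched with two simplified patterns. Both patterns and both sample lines are assumptions for illustration, much shorter than the real regex.

```powershell
# Simplified, hypothetical patterns: literal spaces vs. \s+
$strict  = '^(?<drive>\w): Finished in (?<duration>\d+) secs$'
$lenient = '^(?<drive>\w):\s+Finished\s+in\s+(?<duration>\d+)\s+secs$'

$clean  = 'W: Finished in 7 secs'
$sloppy = "W:  Finished`tin 7 secs"   # double space and a tab

$strictClean   = $clean  -match $strict    # matches
$strictSloppy  = $sloppy -match $strict    # fails: extra space, tab
$lenientClean  = $clean  -match $lenient   # matches
$lenientSloppy = $sloppy -match $lenient   # matches: \s+ accepts any whitespace run
```

If the log format is known to be stable, the literal-space version is easier to read; \s+ is the safer choice when the spacing may vary.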
Andre P (Author) Commented:
Sorry about that.

As I said, I was trying to learn and didn't want someone doing it for me.
I am interested in why something worked or didn't work.
I tried my regex on a regex tester and it worked then, although the tester was for JavaScript.
I couldn't find a tester that would allow me to use (?<name>). Maybe adding that broke my match.
If there is one that will test PowerShell regex, please let me know.
I didn't know you could just use a literal space instead of \s.
I did want to capture both words, "Create Profile".

The text I posted is the contents of a file.
I would be picking it up through Get-Content.
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
Providing your own RegEx for us to look at it is fine. Comparing yours and mine should help in learning.
And yes, changing something, even slightly, can break the expression ;-).
http://regexhero.net/tester/   is able to use named groups (just tested it this moment).
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
To make sure we know how to put things together, I'll provide the "complete" test code, with some added info in the stats part:
# Only for test:
$x =  @"
08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs
"@ -split "`n"

$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"

$x |                # instead of   Get-Content file*.txt |
  ? { $_ -match $regex }  |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 } |
   group-object Drive, Name, unit |   # the capture group in $regex is named "unit"
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum duration | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }

Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
I just noticed that the tester I linked is not free if you use it for more than 5 minutes >-/
Andre P (Author) Commented:
Thanks for all your help ! ( And patience !)


What is the significance of the following items :

$_.  in

$_.Name

the  %  in
% {
    $matches.Remove(0)    <---- Why is this done ??

And
    New-Object PsObject -Property $matches  <----- Where is this stored? If I wanted to inspect its result at this stage, what would I do?
 }
Qlemo ("Batchelor", Developer and EE Topic Advisor) Commented:
$_ is "the current object" in a Where-Object, ForEach-Object and so on. % is short for ForEach-Object, ? for Where-Object.
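As a quick illustration of the aliases, the two pipelines below should behave identically. The numbers are arbitrary sample input.

```powershell
# Short form: ? is Where-Object, % is ForEach-Object, $_ is the current item
$short = 1..6 | ? { $_ -gt 3 } | % { $_ * 10 }

# The same pipeline written out in full
$long = 1..6 | Where-Object { $_ -gt 3 } | ForEach-Object { $_ * 10 }

$short
$long
```

The short form is handy at the prompt; the long form is usually easier to read in saved scripts.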

In $_.Name, $_ is each result of Group-Object in turn. Each such object consists of Count, Name, and Group. Name contains the values of the grouping properties as one string (e.g. "W, Create Profile, secs"), and that is what we want to keep to identify the record.

$matches.Remove(0) is used to get rid of the always-present "0" entry, which contains the complete matched string, i.e. everything the regex pattern matched as a whole, including all groups. Removing it allows $matches to be used directly as the property hash table for New-Object, which in turn makes the code more flexible: just add a named group to the regex to get an additional property.
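What $matches holds before and after the removal can be inspected with a small test. The line and pattern here are hypothetical and use only named groups.

```powershell
# Hypothetical line and pattern with two named groups
$line = '08/14/2013 Joe'

if ($line -match '^(?<date>\d\d/\d\d/\d{4}) (?<name>\w+)$') {
    $keysBefore = @($Matches.Keys)   # contains 0, date and name
    $whole = $Matches[0]             # entry 0 is the complete matched text

    $Matches.Remove(0)               # drop it so only the named captures remain
    $obj = New-Object PSObject -Property $Matches
}

$keysBefore
$obj
```

After the removal the object has exactly one property per named group, which is why adding a group to the regex adds a property without further code changes.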

Lastly, since we do not store the result of New-Object anywhere, it is passed down the pipeline to be processed by the next command. If no command follows, it is written to the console. So if you want to inspect the results, remove the part of the pipeline following the command you want to inspect. Or you can add
... | tee -var results | ...

to get the results collected into $results in addition to being processed as coded.
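A minimal sketch of the tee approach, written with the full cmdlet name Tee-Object and its -Variable parameter. The durations are made-up sample values.

```powershell
# Hypothetical durations flowing through a pipeline
$durations = 7, 17, 30

$stats = $durations |
    Where-Object { $_ -gt 10 } |
    Tee-Object -Variable kept |      # copy of this stage is saved in $kept
    Measure-Object -Sum -Average

$kept         # what passed the filter, available for inspection
$stats.Sum    # the pipeline still ran to the end
```

The variable name is given without the $ sign, and the pipeline result is unaffected by the tee.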

Andre P (Author) Commented:
Wow ! Thank you for the wisdom !! You went above and beyond  - much gratitude !