Andre P

Need help with PowerShell: scraping a log file and adding named regex captures to an array

Input
07/12/2001 Jane made 3 pies and stopped after 5 min
08/14/2002 Joe ate 9 apples and started in 20 seconds
   08/14/2002 Joe ate 9 apples and started in 20 seconds


Output
Date        Name  Action   Time  Units

07/12/2001  Jane  Stopped  5     min
08/14/2002  Joe   Started  20    seconds

I want to create an array that holds one row per entry above: [Date, Name, Action, Time, Units],
adding a row for each line.

I hope I am being clear.
Do I use Select-String with my regex, which matches the items?
Or is this a job for Get-Content?
I am new to PowerShell and have just nailed the regex to get the named captures.
Now I just don't know how to put them into an array as they are matched.
Tags: Regular Expressions, PowerShell

Last comment: Andre P, 8/22/2022 - Mon
footech

As with most things PowerShell, there's more than one approach that will work.  Unless I'm trying to find out the line number of a match though, I generally won't use Select-String.  Below is what I'll usually use.
Get-Content file.txt | Where-Object { $_ -match $pattern } | ForEach-Object {
    New-Object PsObject -Property @{
        Date   = $Matches["date"]
        Name   = $Matches["name"]
        Action = $Matches["action"]
        Time   = $Matches["time"]
        units  = $Matches["units"]
    }
}



You said you have the regex pattern, so feel free to post it (set it as $pattern).  My example above assumes named captures - you can adjust accordingly.  It doesn't store things in an array, but creates an object with the appropriate properties.  You can store the entire output of objects as an array if you wish (just add $array = at the very beginning), or continue processing with the pipeline.
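A minimal sketch of the "$array =" suggestion, assuming a pattern that fits the two sample lines from the question. The pattern and the in-memory $lines (standing in for Get-Content file.txt) are assumptions for illustration, not the asker's actual regex.

```powershell
# Assumed sample data standing in for: Get-Content file.txt
$lines = @(
    '07/12/2001 Jane made 3 pies and stopped after 5 min'
    '08/14/2002 Joe ate 9 apples and started in 20 seconds'
)

# Hypothetical pattern with named captures, written for the sample lines only
$pattern = '^(?<date>\d{2}/\d{2}/\d{4}) (?<name>\w+) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>\w+)$'

# @( ) guarantees an array even when only one line matches
$array = @(
    $lines | Where-Object { $_ -match $pattern } | ForEach-Object {
        New-Object PsObject -Property @{
            Date   = $Matches['date']
            Name   = $Matches['name']
            Action = $Matches['action']
            Time   = [int]$Matches['time']   # cast so later math works on numbers
            Units  = $Matches['units']
        }
    }
)

$array.Count       # number of matching lines
$array[0].Name     # property access on the first row
$array | Format-Table Date, Name, Action, Time, Units
```

Each element of $array is one "row", and the properties play the role of columns, so indexing and property access replace the two-dimensional array the question had in mind.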
Andre P

ASKER
Thanks for the help.

Just so I understand:
If I have, say, 100 text files with date, time, action, etc.,
each of which has 2 lines that match the regex,
will I have 200 objects with the properties of date, time, etc.?
Will they all be in one file?

What if I want, for example, the average of the times of all the objects for a particular action?
Could I sort by object property, then perform the calculation on the Time property?
I was thinking of an array, but maybe I was wrong.
What I essentially want is to create a file much like your average Get-Process output,
so I can then get metrics by doing calculations on the various properties.
Qlemo

This is another way, somewhat more automated:
Get-Content file*.txt|
  ? { $_ -match '^(?<date>\d\d/\d\d/\d\d\d\d) (?<name>\b\D+?\b) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>.*)' } |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 }


It generates the object based on the match groups found (and also goes thru all files matching the file pattern).
Qlemo

The resulting objects can be passed to e.g. Measure-Object (to get sum, average, etc.), in conjunction with Group-Object to apply some grouping. However, mixed units make a proper calculation difficult; you should normalize the units before doing calculations. For example, you would append this to line 6 of the code above:
|
   group-object Name, action, units |
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum time | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }


That is, of course, a very simplified report.
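One possible way to normalize the units before measuring, as a rough sketch. The $rows objects and the $secondsPer conversion table are assumptions made up for this illustration; a real log may use other unit spellings.

```powershell
# Assumed rows, shaped like the objects produced by the pipeline above
$rows = @(
    New-Object PsObject -Property @{ Name = 'Jane'; Action = 'stopped'; Time = '5';  Units = 'min' }
    New-Object PsObject -Property @{ Name = 'Joe';  Action = 'started'; Time = '20'; Units = 'seconds' }
    New-Object PsObject -Property @{ Name = 'Joe';  Action = 'started'; Time = '40'; Units = 'seconds' }
)

# Hypothetical conversion table: seconds per unit
$secondsPer = @{ sec = 1; secs = 1; seconds = 1; min = 60; mins = 60; hours = 3600 }

# Replace Time/Units with a single numeric Seconds property
$normalized = $rows | Select-Object Name, Action,
    @{ n = 'Seconds'; e = { [int]$_.Time * $secondsPer[$_.Units] } }

# Group, then measure the normalized values
$report = $normalized | Group-Object Name, Action | ForEach-Object {
    $stats = $_.Group | Measure-Object -Property Seconds -Average -Sum
    New-Object PsObject -Property @{
        Group   = $_.Name
        Average = $stats.Average
        Sum     = $stats.Sum
        Count   = $stats.Count
    }
}

$report | Format-Table Group, Count, Sum, Average
```

With everything in seconds, units no longer needs to be part of the grouping key, so rows recorded in different units land in the same group.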
Andre P

ASKER
Here is an actual example of what I am trying to accomplish
Text string


 $x =
"08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs"


Here is my regex
$regex = "(?<date>^(0[1-9]|1[012])[\/](0[1-9]|[12][0-9]|3[01])[ \/](19|20)\d\d)d{2}:\d{2}\sAM|PM\s-\sDRIVE\s(?<drive>\w)+:(?<name>\w+\s\w+)\sIndex\s\sFinished\s\w+\s(?<duration>\d+)\s(?<unit>secs|mins|hours)"


Question 1 :
Why does this not match ?

$x | where { $_ -match $regex } | ForEach `
{ New-Object PsObject -Property @{
      Date = $Matches["date"]
      Drive = $Matches["drive"]
      Process = $Matches["name"]
      Duration = $Matches["duration"]
      units = $Matches["unit"]
}
}
I should have:

 date,
drive,
name,
duration,
unit

 named captures no ?

Then I want to end up with a collection with each matching line being an object with the captures  as properties.

I am trying to learn but I am stuck at this point ,
footech

I will have 200 objects with the properties of date , time etc ?
Yes.

will they all be in one file ?
That depends on additional code.  What I posted above only submits the objects to the pipeline, and by default will be sent to your screen.  If you wanted it sent to a file, then you add a command that will write to a file.  Typically you'd use Export-CSV, so that later you could read the file back in and do additional processing.  And whether you create one file or two hundred is also up to you - it all depends on what you need.

Say you've read in a .CSV using Import-CSV.  You know that you've just created an array with the Import-CSV command, right (assuming you saved it to a variable and there was more than one item in the file)?  If you have no need to keep the data around long-term (i.e. longer than you have your PS session open), then you can skip writing out to a file, and the subsequent import, and just save the output to a variable.

An array is an object in memory.  It seems like you may be confusing that with a file that is written to disk.

Generally you'll do a sort utilizing the pipeline ( stuff | Sort ), and yes, what I posted above easily works with that - just put a pipe after the ForEach block.
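A short sketch of the file round trip and the sort described above. The $objects sample and the temp-file name are assumptions for illustration only.

```powershell
# Assumed objects standing in for the output of the ForEach block
$objects = @(
    New-Object PsObject -Property @{ Date = '08/14/2002'; Name = 'Joe';  Time = '20' }
    New-Object PsObject -Property @{ Date = '07/12/2001'; Name = 'Jane'; Time = '5' }
)

# Hypothetical output path in the temp directory
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'log-rows.csv'

# Write the objects to disk; one CSV column per property
$objects | Export-Csv -Path $csvPath -NoTypeInformation

# Read them back; Import-Csv returns objects whose properties are all strings
$fromDisk = @(Import-Csv -Path $csvPath)

# Cast inside the sort expression so the values sort numerically, not as text
$sorted = $fromDisk | Sort-Object { [int]$_.Time }
$sorted | Format-Table Name, Date, Time
```

If the data is only needed for the current session, $objects can be sorted directly and the Export-Csv/Import-Csv pair dropped.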
footech

Named captures, yes, but the regex doesn't match.
I'll see what I can come up with unless Qlemo (or someone else) does first.
Qlemo

Andre, honestly! Your new example is too far from the original. You should not do that. Our answers can only be as good as your input, and so you should choose to provide info as close as possible to what you really have (if you need to disguise info), and do that in the first place.

"Why does this not match ?" - because the RegEx is wrong, obviously. Your RegEx is very specific, and anything that differs even slightly from the pattern will prevent the match completely.
footech

I was just looking at the last sample text, and I have doubts as to whether that's a true representation of any output.  It looks like you inserted line breaks, but not consistently, such that the date is sometimes at the beginning, end, or middle of a line.  And as Qlemo stated, it doesn't match your original question.
Qlemo

Also something you need to consider when using a string for testing instead of a file: you need to split the multiline string into multiple strings, because that is what Get-Content does. Each line is a new string object. To make a correct test case, you need to use something like
$x = @"
...
"@ -split "`n"
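To see the difference between one multiline string and a list of lines, here is a small sketch with an assumed two-line sample. It splits on an optional carriage return plus newline, which is a slight variation on the split above so that both CRLF and LF line endings are tolerated.

```powershell
# Assumed two-line sample; a here-string is ONE string until it is split
$text = @"
08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs
"@

# Optional CR followed by LF: works for Windows and Unix line endings
$lines = $text -split "`r?`n"

$text.GetType().Name     # type of the unsplit value
$lines.Count             # number of separate line strings after the split

# Each element is now matched on its own, just like Get-Content output
$durations = $lines |
    Where-Object { $_ -match 'Finished in (?<duration>\d+)' } |
    ForEach-Object { $Matches['duration'] }
$durations
```

Without the split, -match runs once against the whole block, and a pattern anchored with ^ only has a chance at the very first line.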


You forgot the match against the space between date and time portion.

As I see it, you want to capture only those lines looking like:
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs


I'm not clear whether you want to have "Name" being "Create", "Profile" or both. As-is, you capture "Create Profile" (or, more precisely, anything between the drive and "Index").

After "Index", you left out the dash. No match here.

The RegEx working for me is
$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"


Using spaces instead of \s makes it more restrictive and fragile, but much better to read and match manually.
Andre P

ASKER
Sorry about that.

As I said, I was trying to learn and didn't want someone doing it for me.
I am interested in why something worked or didn't work.
I tried my regex in a regex tester and it worked there, although that tester was for JavaScript.
I couldn't find a tester that would let me use (?<name>), so maybe adding that broke my match.
If there is one that will test PowerShell regex, please let me know.
I didn't know you could just use a literal space instead of \s.
I did want to capture both words, "Create Profile".

The text I posted is the contents of a file.
I would be picking it up through Get-Content.
Qlemo

Providing your own RegEx for us to look at it is fine. Comparing yours and mine should help in learning.
And yes, changing something, even slightly, can break the expression ;-).
http://regexhero.net/tester/   is able to use named groups (just tested it this moment).
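Another option is to test directly at the PowerShell prompt, since -match and $Matches use the same .NET regex engine the script will use. A small sketch, using the sample line and the working regex from this thread; the shorter patterns at the end are made-up intermediate steps.

```powershell
$line  = '08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs'
$regex = '^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)'

# Returns True or False for a single string
$line -match $regex

# After a successful match, $Matches lists every capture, named and numbered
$Matches

# To find where a pattern breaks, grow it piece by piece and re-test
$line -match '^(?<date>\d\d/\d\d/\d\d\d\d)'
$line -match '^(?<date>\d\d/\d\d/\d\d\d\d) \d\d:\d\d (AM|PM) - DRIVE (?<drive>\w):'
```

The first shortened pattern that returns False points at the piece that needs fixing.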
Qlemo

To make sure we know how to put things together, I'll provide the "complete" test code, with some added info in the stats part:
# Only for test:
$x =  @"
08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs
"@ -split "`n"

$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"

$x |                # instead of   Get-Content file*.txt |
  ? { $_ -match $regex }  |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 } |
   group-object Drive, Name, unit |
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum duration | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }


Qlemo

I just noticed that the link I provided is not free if used for more than 5 minutes >-/
Andre P

ASKER
Thanks for all your help ! ( And patience !)


What is the significance of the following items :

$_.  in

$_.Name

the  %  in
% {
    $matches.Remove(0)    <---- Why is this done ??

And
    New-Object PsObject -Property $matches  <-----Where is this stored ?if i wanted to inspect its result at this stage what would i do ?
 }
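For readers following along, here is a hedged sketch that touches each of these points. It is an illustration written for this purpose, not the text of the accepted answer, and the one-line sample and simplified regex are assumptions. In short: ? and % are aliases for Where-Object and ForEach-Object, $_ is the object currently passing through the pipeline, and key 0 of $Matches holds the entire matched text.

```powershell
# Assumed one-line sample and a simplified pattern with named captures only
$lines = @('08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs')
$regex = 'DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished in (?<duration>\d+) (?<unit>\w+)'

$result = $lines |
    Where-Object { $_ -match $regex } |   # same as ? { ... }; $_ is the current line
    ForEach-Object {                      # same as % { ... }
        # Key 0 is the whole match; removing it leaves only the named captures,
        # so the object does not get an extra property called "0"
        $Matches.Remove(0)
        New-Object PsObject -Property $Matches
    }

# The new object is not stored anywhere by itself; it flows down the pipeline.
# Assigning the pipeline to a variable, as with $result, keeps it for inspection.
$result | Format-List
$result | Get-Member -MemberType NoteProperty
```

Inside the Group-Object stage of the earlier code, $_ is a group rather than a line, which is why $_.Name there is the group's key and not a captured name.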
ASKER CERTIFIED SOLUTION
Qlemo

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Andre P

ASKER
Wow ! Thank you for the wisdom !! You went above and beyond  - much gratitude !