Andre P

Need help with PowerShell: scraping a log file and adding named regex captures to an array

Input
07/12/2001 Jane made 3 pies and stopped after 5 min
08/14/2002 Joe ate 9 apples and started in 20 seconds
   08/14/2002 Joe ate 9 apples and started in 20 seconds


output
Date        Name  Action   Time  Units

07/12/2001  Jane  Stopped  5     min
08/14/2002  Joe   Started  20    seconds

I want to create an array which holds a row for each entry above: [Date, Name, Action, Time, Units],
adding a row for each line.

I hope I am clear.
Do I use Select-String with my regex, which matches the items?
Or is this a job for Get-Content?
I am new to PowerShell and just nailed the regex to get the named captures.
Now I just don't know how to stick them into an array as they are matched.
footech

As with most things PowerShell, there's more than one approach that will work.  Unless I'm trying to find out the line number of a match though, I generally won't use Select-String.  Below is what I'll usually use.
Get-Content file.txt | Where-Object { $_ -match $pattern } | ForEach-Object {
    # $Matches is filled by the -match above; index it by capture name
    New-Object PsObject -Property @{
        Date   = $Matches["date"]
        Name   = $Matches["name"]
        Action = $Matches["action"]
        Time   = $Matches["time"]
        units  = $Matches["units"]
    }
}



You said you have the regex pattern, so feel free to post it (set it as $pattern).  My example above assumes named captures - you can adjust accordingly.  It doesn't store things in an array, but creates an object with the appropriate properties.  You can store the entire output of objects as an array if you wish (just add $array = at the very beginning), or continue processing with the pipeline.
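
A minimal sketch of the `$array =` suggestion, using the two sample lines from the question. The `$pattern` here is a hypothetical stand-in (the asker's real pattern had not been posted at this point), and the `[int]` cast on Time is an addition so that later arithmetic treats the value as a number:

```powershell
# Sample lines from the question; in practice these would come from Get-Content file.txt
$lines = @(
    '07/12/2001 Jane made 3 pies and stopped after 5 min'
    '08/14/2002 Joe ate 9 apples and started in 20 seconds'
)

# Hypothetical pattern with named captures (an assumption, not the asker's regex)
$pattern = '^(?<date>\d\d/\d\d/\d{4}) (?<name>\w+) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>\w+)$'

# @( ) guarantees an array even when only one line matches
$array = @(
    $lines | Where-Object { $_ -match $pattern } | ForEach-Object {
        New-Object PSObject -Property @{
            Date   = $Matches['date']
            Name   = $Matches['name']
            Action = $Matches['action']
            Time   = [int]$Matches['time']   # cast so it behaves as a number later
            Units  = $Matches['units']
        }
    }
)

$array.Count      # number of matching lines
$array[0].Name    # property access on the first object
```

Each element of `$array` is one object per matching line, so `$array[1].Action` or `$array | Sort-Object Time` work as they would on any other collection of objects.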

ASKER

Thanks for the help.

Just so I understand:
If I have, say, 100 text files with date, time, action etc.,
each of which has 2 lines that match the regex,
I will have 200 objects with the properties of date, time etc.?
Will they all be in one file?

What if I, for example, want to get the average of the times of all the objects for a particular action?
Could I sort by object property, then perform the calculation on the time property?
I was thinking array, but maybe I was wrong.
What I essentially want is to create a file much like your average Get-Process output,
so I can then get metrics by doing calculations on the various properties.
This is another way, somewhat more automated:
Get-Content file*.txt|
  ? { $_ -match '^(?<date>\d\d/\d\d/\d\d\d\d) (?<name>\b\D+?\b) .* (?<action>started|stopped) (?:after|in) (?<time>\d+) (?<units>.*)' } |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 }


It generates the object based on the match groups found (and also goes through all files matching the file pattern).
The resulting objects can be passed to Measure-Object (to get sum, average etc.) in conjunction with Group-Object to apply some grouping. However, mixed units make a proper calculation difficult: you should normalize the units before doing calculations. For example, you would append this to line 6:
|
   group-object Name, action, units |
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum time | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }


That is, of course, a very simplified report.
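
One possible way to normalize units before measuring, as a rough sketch. The conversion table `$secondsPer`, the unit spellings in it, and the added `Seconds` property are all assumptions for illustration; the two records are hand-built stand-ins for the parsed objects:

```powershell
# Hypothetical conversion table: seconds per unit (unit spellings are assumptions)
$secondsPer = @{ sec = 1; secs = 1; seconds = 1; min = 60; mins = 60; hours = 3600 }

# Two hand-built records standing in for the parsed objects
$records = @(
    New-Object PSObject -Property @{ Name = 'Jane'; Action = 'stopped'; Time = '5';  Units = 'min' }
    New-Object PSObject -Property @{ Name = 'Joe';  Action = 'started'; Time = '20'; Units = 'seconds' }
)

# Add a Seconds property so every duration is expressed in the same unit
$normalized = $records | ForEach-Object {
    $seconds = [int]$_.Time * $secondsPer[$_.Units]
    $_ | Add-Member -MemberType NoteProperty -Name Seconds -Value $seconds -PassThru
}

# Statistics are now comparable across records
$stats = $normalized | Measure-Object -Property Seconds -Average -Sum
$stats
```

With a common unit in place, grouping by Name and action alone is enough; units no longer need to be part of the grouping key.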

ASKER

Here is an actual example of what I am trying to accomplish
Text string


 $x =
"08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs"


Here is my regex
$regex = "(?<date>^(0[1-9]|1[012])[\/](0[1-9]|[12][0-9]|3[01])[ \/](19|20)\d\d)d{2}:\d{2}\sAM|PM\s-\sDRIVE\s(?<drive>\w)+:(?<name>\w+\s\w+)\sIndex\s\sFinished\s\w+\s(?<duration>\d+)\s(?<unit>secs|mins|hours)"


Question 1 :
Why does this not match ?

$x | where { $_ -match $regex } | ForEach {
    New-Object PsObject -Property @{
        Date     = $Matches["date"]
        Drive    = $Matches["drive"]
        Process  = $Matches["name"]
        Duration = $Matches["duration"]
        units    = $Matches["unit"]
    }
}

I should have date, drive, name, duration and unit named captures, no?

Then I want to end up with a collection with each matching line being an object with the captures  as properties.

I am trying to learn, but I am stuck at this point.
I will have 200 objects with the properties of date , time etc ?
Yes.

will they all be in one file ?
That depends on additional code.  What I posted above only submits the objects to the pipeline, and by default will be sent to your screen.  If you wanted it sent to a file, then you add a command that will write to a file.  Typically you'd use Export-CSV, so that later you could read the file back in and do additional processing.  And whether you create one file or two hundred is also up to you - it all depends on what you need.

Say you've read in a .CSV using Import-CSV.  You know that you've just created an array with the Import-CSV command, right (assuming you saved it to a variable and there was more than one item in the file)?  If you have no need to keep the data around long-term (i.e. longer than you have your PS session open), then you can skip writing out to a file, and the subsequent import, and just save the output to a variable.
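
A small sketch of that round trip, with made-up stand-in objects and a hypothetical file name in the temp folder. It also shows a detail worth knowing: values read back by Import-Csv are strings, even if they were numbers when exported.

```powershell
# Stand-in objects; in the thread these come from the parsing pipeline
$objects = @(
    New-Object PSObject -Property @{ Name = 'Jane'; Action = 'stopped'; Time = 5 }
    New-Object PSObject -Property @{ Name = 'Joe';  Action = 'started'; Time = 20 }
)

# Hypothetical file name in the temp folder
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'parsed-log.csv'

# Write one row per object, without the "#TYPE" header line
$objects | Export-Csv -Path $csvPath -NoTypeInformation

# Import-Csv returns one object per row; every property comes back as a string
$imported = @(Import-Csv -Path $csvPath)
$imported.Count
$imported[0].Time.GetType().Name
```

If the data is only needed for the current session, `$objects` already is the array, and the export and import steps can be dropped.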

An array is an object in memory.  It seems like you may be confusing that with a file that is written to disk.

Generally you'll do a sort utilizing the pipeline ( stuff | Sort ), and yes, what I posted above easily works with that - just put a pipe after the ForEach block.
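
As an illustration of sorting and of the "average time for a particular action" question, here is a sketch on made-up sample objects (the third record, "Ann", is invented so that one action has two entries):

```powershell
# Made-up sample objects standing in for the parsed log lines
$objects = @(
    New-Object PSObject -Property @{ Name = 'Jane'; Action = 'stopped'; Time = 5 }
    New-Object PSObject -Property @{ Name = 'Joe';  Action = 'started'; Time = 20 }
    New-Object PSObject -Property @{ Name = 'Ann';  Action = 'stopped'; Time = 15 }
)

# Sort by a property, largest Time first
$sorted = $objects | Sort-Object -Property Time -Descending

# Average Time per Action: group first, then measure each group
$averages = $objects | Group-Object -Property Action | ForEach-Object {
    $stats = $_.Group | Measure-Object -Property Time -Average
    New-Object PSObject -Property @{
        Action  = $_.Name          # the group key
        Average = $stats.Average
        Count   = $stats.Count
    }
}
$averages
```

Sorting is not required before averaging; Group-Object collects the matching objects regardless of their order.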
Named captures, yes, but the regex doesn't match.
I'll see what I can come up with unless Qlemo (or someone else) does first.
Andre, honestly! Your new example is too far from the original. You should not do that. Our answers can only be as good as your input, and so you should choose to provide info as close as possible to what you really have (if you need to disguise info), and do that in the first place.

"Why does this not match ?" - because the RegEx is wrong, obviously. Your RegEx is very specific, and anything that differs even slightly from the pattern will prevent the match completely.
I was just looking at the last sample text, and I have doubts as to whether that's a true representation of any output.  It looks like you inserted line breaks, but not consistently, such that the date is sometimes at the beginning, end, or middle of a line.  And as Qlemo stated, it doesn't match your original question.
Also something you need to consider when using a string for testing instead of a file: you need to split the multiline string into multiple strings, because that is what Get-Content does. Each line is a new string object. To make a correct test case, you need to use something like
$x = @"
...
"@ -split "`n"
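
To see why the split matters, the following sketch compares matching against one multiline string with matching against its individual lines. The two sample lines are taken from the posted log; the shortened pattern and the CRLF-tolerant split expression are assumptions made for the demonstration.

```powershell
$text = @"
08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs
"@

# Shortened pattern for the demonstration (an assumption, not the final regex)
$pattern = '^\S+ \S+ [AP]M - DRIVE \w: .* Finished'

# One string: ^ anchors only at the very start, so the second line is never tested on its own
$single = @($text | Where-Object { $_ -match $pattern })

# Split into lines (tolerating CRLF or LF), as Get-Content would deliver them
$lines   = $text -split "`r?`n"
$perLine = @($lines | Where-Object { $_ -match $pattern })

$single.Count    # 0
$perLine.Count   # 1
```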


You forgot the match against the space between date and time portion.

As I see it, you want to capture only those lines looking like:
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs


I'm not clear whether you want "Name" to be "Create", "Profile" or both. As-is, you capture "Create Profile" (or, more precisely, anything between the drive and "Index").

After "Index", you left out the dash. No match here.

The RegEx working for me is
$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"


Using spaces instead of \s makes it more restrictive and fragile, but much better to read and match manually.
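
The trade-off can be seen in a small sketch. Both patterns below are shortened, hypothetical variants of the one above, and the line with two spaces after the date is a made-up variation of the log format:

```powershell
# Literal spaces: easy to read, but only exactly one space is accepted
$strict   = '^(?<date>\d\d/\d\d/\d{4}) \d\d:\d\d (?:AM|PM) - DRIVE (?<drive>\w):'

# \s+ accepts one or more whitespace characters between the parts
$tolerant = '^(?<date>\d\d/\d\d/\d{4})\s+\d\d:\d\d\s+(?:AM|PM)\s+-\s+DRIVE\s+(?<drive>\w):'

$normal  = '08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs'
$doubled = '08/14/2013  08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs'   # two spaces after the date

$normal  -match $strict     # True
$doubled -match $strict     # False
$doubled -match $tolerant   # True
$Matches['drive']           # W
```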

ASKER

Sorry about that.

As I said, I was trying to learn and didn't want someone doing it for me.
I am interested in "why" something worked or didn't work.
I tried my regex on a regex tester and it worked then, although it was for JavaScript.
I couldn't find a tester that would allow me to use the (?<name>) syntax. Maybe when I added that, it broke my match.
If there is one that will test PowerShell regex, please let me know.
I didn't know you could just leave a space instead of \s.
I did want to capture both words, "Create Profile".

The text I posted is the contents of a file .
I would be picking it up through Get-Content.
Providing your own RegEx for us to look at it is fine. Comparing yours and mine should help in learning.
And yes, changing something, even slightly, can break the expression ;-).
http://regexhero.net/tester/   is able to use named groups (just tested it this moment).
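
A pattern can also be tried directly in a PowerShell console, which uses the same .NET regex engine as the script will. A short sketch with a simplified, hypothetical pattern and one line from the posted log; note that `[regex]::Match` is case-sensitive by default, while `-match` is not:

```powershell
$line  = '08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs'

# Simplified pattern for experimenting (an assumption, not the final regex)
$regex = '^(?<date>\d\d/\d\d/\d{4}) .* DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished in (?<duration>\d+) (?<unit>\w+)'

if ($line -match $regex) {
    $Matches    # hashtable: key 0 is the whole match, the rest are the named groups
}

# The .NET class gives the same result and can list the group names in a pattern
$m = [regex]::Match($line, $regex)
$m.Success
$m.Groups['name'].Value
$groupNames = ([regex]$regex).GetGroupNames()
$groupNames
```

When a long pattern fails, shortening it from the right until `-match` returns True again is a quick way to find the part that breaks.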
To make sure we know how to put things together, I'll provide the "complete" test code, with some added info in the stats part:
# Only for test:
$x =  @"
08/14/2013 08:17 AM - DRIVE W: Create Profile Index - Started on
08/14/2013 08:18 AM - DRIVE W: Folders: 159 08/14/2013 08:18 AM - DRIVE W: Profiles Indexed: 574
08/14/2013 08:18 AM - DRIVE W: Profiles Removed: 0
08/14/2013 08:18 AM - DRIVE W: Create Profile Index - Finished in 7 secs

08/14/2013 08:18 AM - DRIVE W: Create Text Index - Started on

08:18 AM - DRIVE W: =       499 Files in database 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per rules 08/14/2013
08:18 AM - DRIVE W: -         0 Ignored per format 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped bad records 08/14/2013
08:18 AM - DRIVE W: =         0 Files qualified 08/14/2013
08:18 AM - DRIVE W: =       499 Needed indexing 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped old BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped new BADFILEs 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped open errors 08/14/2013
08:18 AM - DRIVE W: -         0 Skipped other fails 08/14/2013
08:18 AM - DRIVE W: +       376 Indexed actual files (42,818,863 bytes)
08/14/2013 08:18 AM - DRIVE W: +       123 Indexed cached copies (143,293,346 bytes)
08/14/2013 08:18 AM - DRIVE W: =       499 Indexed successfully (186,112,209 bytes)
08/14/2013 08:18 AM - DRIVE W: Create Text Index - Finished in 17 secs
"@ -split "`n"

$regex = "^(?<date>(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d\d) (?:\d\d:\d\d (AM|PM)) - DRIVE (?<drive>\w): (?<name>\w+ \w+) Index - Finished \w+ (?<duration>\d+) (?<unit>secs|mins|hours)"

$x |                # instead of   Get-Content file*.txt |
  ? { $_ -match $regex }  |
  % {
    $matches.Remove(0)
    New-Object PsObject -Property $matches
 } |
   group-object Drive, Name, unit |
   % {
     $Name = $_.Name
     $_.Group | measure-object -average -sum duration | select @{n='Group'; e={$Name}}, Average, Sum, Count
  }


I just noticed that the link I provided is not free if used for more than 5 minutes >-/

ASKER

Thanks for all your help ! ( And patience !)


What is the significance of the following items :

$_.  in

$_.Name

the  %  in
% {
    $matches.Remove(0)    <---- Why is this done ??

And
    New-Object PsObject -Property $matches  <-----Where is this stored ?if i wanted to inspect its result at this stage what would i do ?
 }
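
The constructs asked about here can be explored directly in a console. The following is a small sketch, independent of the replies in this thread, using one of the sample lines from the original question and a simplified, hypothetical pattern:

```powershell
$line = '07/12/2001 Jane made 3 pies and stopped after 5 min'

# -match fills the automatic $Matches hashtable; key 0 holds the entire matched text
$null = $line -match '^(?<date>\S+) (?<name>\w+) .* (?<action>started|stopped)'
$keysBefore = @($Matches.Keys)      # 0, date, name, action (order not guaranteed)

# Drop the whole-match entry so it does not turn into a property named "0"
$Matches.Remove(0)
$keysAfter = @($Matches.Keys)

# % is an alias for ForEach-Object; inside its script block $_ is the current pipeline item
$result = 1, 2, 3 | % { $_ * 10 }   # 10, 20, 30

# New-Object writes the object to the pipeline; assign it to a variable to inspect it
$obj = New-Object PSObject -Property $Matches
$obj | Format-List
$obj | Get-Member -MemberType NoteProperty
```

Without an assignment, the object simply flows to the next command in the pipeline, or to the screen if there is none.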
ASKER CERTIFIED SOLUTION
Qlemo


ASKER

Wow ! Thank you for the wisdom !! You went above and beyond  - much gratitude !