Link to home
Start Free TrialLog in
Avatar of D B
D BFlag for United States of America

asked on

Powershell String Parsing with Regular Expression

I had this question after getting PowerShell Regular Expression answered.

This is the third spec change and I've been told the last (who hasn't heard that before?)

The file can have one of two layouts (note 'XYZ' and either 'TERM' or 'DENIED' are static values):
XYZ|ASSOC_ID|USER_ID|TERM
XYZ|ASSOC_ID|USER_ID|DENIED|SERVER_CLIENT_ROLE

A sample file might be:
XYZ|0034192|ab123456|TERM
XYZ|09681|ab192837|TERM
XYZ|0192847|ab98765|DENIED|TESTSERVER_GOOGLE_ANYTHING
XYZ|6273|ab11839|DENIED|PRODSERVER_MICROSOFT_WHATEVER

Each row will have at least four elements that are pipe delimited. Optionally, there can be an additional element, that is made up of server, client and role, which are each delimited by an underscore character.

The parsing needs two results: 1) an array that contains the third element (USER_ID) and SERVER and CLIENT from the fifth element if the fourth element has the value 'DENIED'.  However, if the fourth element has the value 'TERM', using the format from the above referenced question, the output would be a single string, consisting of ONLY USER_ID values, be comma-delimited and each USER_ID enclosed in single quotes .

Given the example above, the array with 'DENIED' would contain:
USER_ID      SERVER            CLIENT
ab98765      TESTSERVER      GOOGLE
ab11839      PRODSERVER      MICROSOFT

And the string of values for rows with 'TERM' would contain
'ab123456','ab192837'

I realize this will probably take two passes (one to build the array, one to build the string) and that is fine (and probably simpler to follow and maintain for those of you who want to do the whole thing in one line of code :-) )
Avatar of D B
D B
Flag of United States of America image

ASKER

Trying to figure out part of it by myself (to create the delimited string). I initially revised the original answer from the above link somewhat to remove the header row and have the following code (this is what I am using before 'revision #3'):
$termUserList = (Get-Content $TermPathAndFilename | Select-Object -Skip 1 | % {$_ -match '\A(?<Col1>.*?)\|(?<Col2>.*?)\|(?<Col3>.*?)(?:\Z|\|)' | Out-Null; "'$($Matches['Col3'])'"}) -join ','

Open in new window

From what PowerShell I am learning, I do think I need another <Col#> designator '\A(?<Col1>.*?)\|(?<Col2>.*?)\|(?<Col3>.*?)\|(?<Col4>.*?)(?:\Z|\|)' and need to somehow get a Where in there somewhere to pull only rows where Col4 = 'TERM'. Not getting it though, and I am really clueless on the array portion.
Avatar of aikimark
Does your file have a header row?  If so, what is it?
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Note that aikimark's suggestion above assumes there is no header in the file, but it doesn't matter as long as the third column header is neither TERM nor DENIED.
However, "we batchelors" try not to use something like
$A=Import-Csv -delimiter "|" -path $TermPathAndFilename -header "F1","F2","USER_ID","TD","F5"

$A | %{

Open in new window

unless there is a need for having direct access to the content beyond for the pipe ;-). It does no harm to compress it to:
Import-Csv -delimiter "|" -path $TermPathAndFilename -header "F1","F2","USER_ID","TD","F5" |
  %{

Open in new window

SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of D B

ASKER

Qlemo: I am a bit confused by your comment. Your second code block is identical to the first; there does not appear to be anything 'compressed.'

Also, your comment, "unless there is a need for having direct access to the content beyond for the pipe." If you are talking about the data I've labeled SERVER, CLIENT, ROLE, it is certainly required (at least SERVER and CLIENT are, as those are elements of the $denied array). Were you thinking of something else?
Both code snippets are not the same, the second is not using $A. This can make a difference for larger files. The pipe does not store more than one line at any time; data is processed "on the fly", which is important if you conditionally dismiss a lot of info in the loop.

And since we do not need to have access to the original data outside of the loop, only to the result of the loop, storing the file content in a variable just consumes memory unnecessarily.

You will not see any difference usually, but if you go into larger files (hundreds of MB), there is a significant effect.

My own code example (using switch) is efficient to that regard too, as it uses optimized internal processing for larger files (if compared to get-content and switch, or import-csv and switch in a loop).

As I see it, your current choice of script should be what you think you can understand and expand more easily. RegEx can get complex and confusing, and is difficult to read. The regex used here is still ok ;-).
Avatar of D B

ASKER

Ok, I understand now (and see the difference). I was only looking at the first line and for some reason my mind didn't see the difference in the 2nd line. Makes sense. I will probably be dealing with a file that has no more than 10-20 rows of data. I am getting somewhat better at regex, but learning it well is learning a language within a language.

I grew up a COBOL/PASCAL/BASIC then VB then SQL Server developer so all these |, %, '.*\|.*\|(?<...' {} and stuff as well as piping and is taking a lot of getting use to. I really wish I had a lot better comprehension than I do. I tend to pick up PowerShell and use it when I have a task that I feel is best accommodated using it, but don't have the luxury of the time required to master it (plus not sure, at my age--referencing COBOL and PASCAL probably gives you an idea--that I really want to) ;-)

I do have one additional question that came up once I started integrating this into my script, and I will be more than happy to create a new question. I have the $DENIED array, that contains USER, SERVER and CLIENT. There is another array that contains CLIENT and DATABASE. Is there an easy way to match CLIENT between the two arrays and add DATABASE to the $DENIED array, so that it contains USER, SERVER, CLIENT, DATABASE?
No, there is no easy way. You need to compare or search, and then extend, each single member.
However, if you use the advanced technique of hash tables instead of arrays, at least for the database one, then you can look up the database directly.
That is, you build your database "array" like this:
$db = @{}
# looping code
   $db[$client] = $database
# ...

Open in new window

you can get the DB just by using $db[$client]. If we integrate that into my code (with $db already filled, of course):
$Term = ''
$Denied = @()
switch -regex -file c:\temp\ee\Q_28988112.txt
{
  '.*\|.*\|(?<ID>.*)\|TERM'                             { $Term   += "'$($matches['ID'])'," }
  '.*\|.*\|(?<ID>.*)\|DENIED\|(?<srv>.+)_(?<clt>.+)_.*' { $Denied += [PSCustomObject] @{
                                                                       user_ID = $Matches['ID']
                                                                       Server  = $Matches['Srv']
                                                                       Client  = $Matches['Clt']
                                                                       Database = $db[$Matches['Clt']
                                                                     }
                                                        }
}
$Term = $Term.Remove($Term.Length-1)

Write-Host "Term: $Term"
Write-Host "Denied:"
$Denied

Open in new window

Note that you cannot use the Client property, it does not exist at that point in time, so we have to repeat the $Matches['Clt'].
Avatar of D B

ASKER

I did find Add-Member that allows me to add Database as a new column, but still trying to figure out the matching logic to copy the value of DATABASE from my $ServerClientDatabase array to the $Denied array when I match on CLIENT.
BTW, COBOL/PASCAL/BASIC/VB applies to all three of us, at least age-wise. I never used COBOL, but learned it for fun in my early teen years. Like another dozen programming languages ...

Using PowerShell effectively is a science for itself. The good news is that you can use it very similar to other programming and scripting languages, and then move to more advanced techniques step by step. It is important to keep in mind there are more effective or flexible ways, but that does not mean you have to use them from the beginning or at all ;-).
Back on topic: If you do not want to change the way you generate your (client, database) array, you'll have to compare "manually":
# Generating test data
$db = @"
client,database
PC1,Db1
PC2,Db2
"@ -split "`n" | convertFrom-CSV

$denied = @"
User_ID,Server,Client
Usr1,Srv1,PC1
Usr2,Srv2,PC2
Usr3,Srv3,PC1
"@ -split "`n" | convertFrom-CSV

foreach ($obj in $denied) {
  $curDB = $db | ? { $_.Client -eq $obj.Client }
  $obj | Add-Member Noteproperty Database $curDB.Database
}
$Denied

Open in new window

Avatar of D B

ASKER

For fairness, I've asked a new question. I tried (which I prefer) your initial method of adding it inside the loop that builds the array, but didn't get any results.
Avatar of D B

ASKER

I'm wondering if the issue might be Database = $db[$Matches['Clt']]
since the $db table contains ctl (actually client_id, but I changed the code accordingly) and the element I am wanting from thethe table is db_name when I match client_id, but I am not seeing that referenced.

I am going to close this question and award points. Feel free to post the solution in the new question.
Avatar of D B

ASKER

I am splitting points between aikimark and Qlemo. I originally went with aikimark's solution as, for me, and I believe the rest of the team, it will be easier to understand. However, QLemo's provides some benefits, especially in light of the subsequent question I asked and have linked to, so it may end up being what I implement, with a lot of comments for us non PowerShelll guru types who will have to subsequently maintain it. Unlike most developers, I love commenting my code :-)
I still don't know if your file has headers or not.  The Import-CSV statement will be different if it does have headers as will any columnar/field references in the script.
aikimark, as I have said it does not matter here as long as the switch will not have a hit. You provide the headers with import-csv, and so (a) know the header names and (b) get headers, should they be included in the CSV file, as first line of data. Usually that makes a big difference, but we are lucky here.
@Q

If the file did have headers, it is possible that the first row might cause a logic error in the script.
... which conditions have been described by me.