Powershell/Python/Windows batch script to do a "group" by (ie Count) of a csv & sorting

I'll attach later as my Firefox is not showing the "Attach" option while in EE

Attachment 1 is my original raw csv file.

attachment 2 is the output after being sorted by a combined sort key of
"Name" + "Action" + "SrcIP" + "DstIP

attachment 3 is the output after being grouped (ie doing a count of repeat
records or records that are in common by "Name"+"Action+"SrcIP" + "DstIP"
& then sorted in descending order by the Count

Anyone can volunteer to knock out a PowerShell or Python or batch scripts
to do this task?
sunhuxAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

David Johnson, CD, MVPOwnerCommented:
you didn't submit ANY attachments
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
With the amount of info we have, is this what you would like to see?
@"
Name,Action,SrcIP,DstIP,Detail
Joe,Create,1.1.1.1,1.1.1.2,One
Joe,Create,1.1.1.1,1.1.1.2,Two
Joe,Create,1.1.1.1,1.1.1.3,Three
Jim,Send,1.1.1.1,1.1.1.2,Four
Jim,Receive,1.1.1.1,1.1.1.2,Five
Jim,Receive,1.1.1.1,1.1.1.2,Six
"@ | convertfrom-csv | Group-Object Name,Action,SrcIP,DstIP -NoElement | sort Count -desc | ft -a

Open in new window

Result:
Count Name
----- ----
    2 Jim, Receive, 1.1.1.1, 1.1.1.2
    2 Joe, Create, 1.1.1.1, 1.1.1.2
    1 Joe, Create, 1.1.1.1, 1.1.1.3
    1 Jim, Send, 1.1.1.1, 1.1.1.2

Open in new window

sunhuxAuthor Commented:
apologies, was not able to attach earlier on the other laptop's at my relative's place.

Attached.
SampleSanitIpslog1.csv
SampleSanitIpslog1-sortNameSrcDstIP.csv
SampleSanitIpslog2-sortNameCnt.csv
SampleSanitIpslog2.jpg
Simplify Active Directory Administration

Administration of Active Directory does not have to be hard.  Too often what should be a simple task is made more difficult than it needs to be.The solution?  Hyena from SystemTools Software.  With ease-of-use as well as powerful importing and bulk updating capabilities.

sunhuxAuthor Commented:
the input csv can have up to 3 million lines so the codes should not run out of memory
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
Since both sorting and grouping require to have at least the key values in memory as a whole, that's difficult. Do you need the original data? And you know you are not including ports in the group?
sunhuxAuthor Commented:
I don't need the port, ad_industry, message and date/time columns in the final outputs so this will
help with reducing the amount of data
sunhuxAuthor Commented:
DstSvcName can be removed as well but I'll need Sev column for reference
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
This is all I can offer. Should have the smallest memory footprint possible in PowerShell.
import-csv D:\Downloads\SampleSanitIpslog1.csv |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sec'; e={$_.Name.Split(',')[4].Trim()}}

Open in new window

sunhuxAuthor Commented:
Bear with me : so I save those codes into a file, say
sortop.ps1
& then go to a PowerShell command prompt & just issue the script file name to run it?
   ie just issue   "sortop.ps1" at powerShell prompt?
sunhuxAuthor Commented:
Btw, do I need to delete away those unused columns in SampleSanitIpslog1.csv first ie
port, ad_industry, DstSvcname, message and date/time    
Or the script will still run fine if I leave them there?

Or will deleting those columns help save memory usage?
David Johnson, CD, MVPOwnerCommented:
to answer your question yes that is all that is required with the caveat that powershell originally ships with scripts disabled.. What I do is launch powershell from the start menu/screen then pin it to the task bar.. now right click on the icon and run as administrator.  now from the administrators powershell type the following (or copy paste
Set-ExecutionPolicy RemoteSigned 

Open in new window

and press enter or Y to accept
or from your powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

Open in new window

now you can run scripts
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
You always have to provide a relative or absolute path with a PS1 file to run it - for security reasons. So if you have changed into the folder containing the PS1 file:
  .\sortop.ps1
otherwise
  C:\Path\To\Script\sortop.ps1

And no, you do not need to prepare the CSV file, though it might speed up processing if you do. The Group-Object options I used preserve the relevant columns only already, and that should be sufficient, but I can't tell for sure.
sunhuxAuthor Commented:
Just tested the script on a 800,000 odd records csv file : it completed with the output below after
PowerShell chews almost 1.5GB of RAM (which is not the desired result) : could it be due to some
columns are "blank" eg for items with column name "IPS Bypass" ?

Possible that the output is like what's shown by the jpg file that I've attached earlier?

Attached is a trimmed (ie I take the first few lines) of the input csv file:

Btw, the row "sec" comes from "Sev" ?

PS D:\ipsmay\mon\ps> .\sortop.ps1

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 22Web Server Enforcement Violation
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : IPS Bypass
Action :
SrcIP  :
DstIP  :
Sec    :

Name   :
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 33WordPress Enforcement Protection
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 44Database Protection Violation
Action :
SrcIP  :
DstIP  :
Sec    :
trim.csv
sunhuxAuthor Commented:
> 1.5GB of RAM (which is not the desired result)
Sorry, I meant the output is not the desired result, not the 1.5GB of RAM
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
Yes, "sec" is a typo and should be "Sev".
But your CSV needs to have the column names as you stated. The trimmed one does not, and so the results are "empty" - the only column with "correct" name is "Name".
We can enum the columns in Import-Csv, to make sure the names are as expected and processed:
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}}

Open in new window

The other way is to use the exact column headers of the CSV in Group-Object.
sunhuxAuthor Commented:
So the input csv file needs to have the columns labelled as:
column 1 : name
column 2 : action
column 3 : srcip
column 4 : dstip
column 5 : sev

& the labels must be of exact spelling  ie "deviceaction" will not do?
Are the labels case-sensitive?
sunhuxAuthor Commented:
Ok, I've renamed the columns' labels to follow exactly what's in the script (with exact case) ie
Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference
 
Output is better but it did not indicate the count (ie number of occurrences or how many times
a record was repeated for the combined Name, Action, SrcIP, DstIP:
(presumably the outputs at the top are the ones with more repetition/counts? )

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.27.5
DstIP  : 172.24.108.71
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.16.132
DstIP  : 172.24.66.68
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.6.4
DstIP  : 172.24.24.68
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.22.134
DstIP  : 172.24.90.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.26.134
DstIP  : 172.24.106.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.38.69
DstIP  : 172.24.153.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.13.7
DstIP  : 172.24.52.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.29.197
DstIP  : 172.24.119.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.20.7
DstIP  : 172.24.80.71
Sev    : Low
sunhuxAuthor Commented:
also, ideally the output is in columnar csv file
sunhuxAuthor Commented:
On a 910,000 records file, memory ran out:

PS D:\ipsmay\mon\ps> .\sortop.ps1 > ..\maymon4topair.txt
Group-Object : Exception of type 'System.OutOfMemoryException' was thrown.
At D:\ipsmay\mon\ps\sortop.ps1:2 char:15
+   Group-Object <<<<  Name,Action,SrcIP,DstIP,Sev -NoElement |
    + CategoryInfo          : NotSpecified: (:) [Group-Object], OutOfMemoryException
    + FullyQualifiedErrorId : System.OutOfMemoryException,Microsoft.PowerShell.Commands.GroupObjectCommand


Can amend script to delete all other columns except the columns below:
Time, Name, Sev, DstIP, SrcIP, action
sunhuxAuthor Commented:
Remove column "Time" as well
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
The script already should keep only the columns we group for. We can try if removing the columns prior to grouping helps, but I do not think so.
The count is missing, but that is no issue ;-)
Exporting to CSV is easy. too.
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
  Select Name,Action,SrcIP,DstIP,Sev |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv

Open in new window

But probably PowerShell and similar languages are not suited well for managing that amount of data, at least without specialized and sophisticated code.
sunhuxAuthor Commented:
Reducing the columns not needed helped : won't run out of memory anymore.

When I rerun the latest script, it gave the following :
The Count is missing & ideally output is in columnar csv

Name   : Name
Sev    : Sev
DstIP  : DstIP
SrcIP  : SrcIP
action : action

Name   : Streaming Engine: TCP Segment Limit Enforcement
Sev    : Low
DstIP  : 172.24.119.69
SrcIP  : 10.253.29.197
action : drop

Name   : Streaming Engine: TCP Segment Limit Enforcement
Sev    : Low
DstIP  : 172.24.119.69
SrcIP  : 10.253.29.197
action : drop

. . .
sunhuxAuthor Commented:
I thought of amending it as follows:

import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, SrcIP, action |
(above is is the minimum columns needed)
  Select Name,Action,SrcIP,DstIP |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv
(does result.csv contain the output in columnar format with Counts included?)
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
If you have set up your input CSV to contain the correct header, we should remove providing the header names on input. Something I forgot to consider is that the header lines in the CSV will be taken over as data lines if we do.
import-csv D:\Downloads\SampleSanitIpslog1.csv |
  Select Name,Action,SrcIP,DstIP,Sev |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv

Open in new window

And this *should* export the count.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
sunhuxAuthor Commented:
c:\temp\results.csv is empty too
sunhuxAuthor Commented:
Sorry, result.csv now contains the output in columnar format.

I amended it as follows as "Sev" is not part of the sorting key:

import-csv D:\ipsmay\Mon\ps\Maymon4relab2.csv |
  Select Name,Sev,DstIP,SrcIP,action |
  Group-Object Name,Action,SrcIP,DstIP -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv
sunhuxAuthor Commented:
wIth above latest amended script that I submitted above, the Sev column for the output all have
no values (Low, Med, High, Crit) : did I miss out something?
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
a) The change to select-object is no real change, you are just swapping column sequence, but that has no meaning.

b) You have to group by the severity.
ba) Because of -NoElements, only grouping columns are maintained.
bb) If we do not group by severity, and omit -NoElements, it is only part of the collection containing the original object, and can have different values (theoretically).

It is much easier if you leave SEV as grouping column, even if it is the same for each action.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Powershell

From novice to tech pro — start learning today.