Link to home
Start Free TrialLog in
Avatar of sunhux
sunhux

asked on

Powershell/Python/Windows batch script to do a "group" by (ie Count) of a csv & sorting

I'll attach later as my Firefox is not showing the "Attach" option while in EE

Attachment 1 is my original raw csv file.

attachment 2 is the output after being sorted by a combined sort key of
"Name" + "Action" + "SrcIP" + "DstIP

attachment 3 is the output after being grouped (ie doing a count of repeat
records or records that are in common by "Name"+"Action+"SrcIP" + "DstIP"
& then sorted in descending order by the Count

Anyone can volunteer to knock out a PowerShell or Python or batch scripts
to do this task?
Avatar of David Johnson, CD
David Johnson, CD
Flag of Canada image

you didn't submit ANY attachments
With the amount of info we have, is this what you would like to see?
@"
Name,Action,SrcIP,DstIP,Detail
Joe,Create,1.1.1.1,1.1.1.2,One
Joe,Create,1.1.1.1,1.1.1.2,Two
Joe,Create,1.1.1.1,1.1.1.3,Three
Jim,Send,1.1.1.1,1.1.1.2,Four
Jim,Receive,1.1.1.1,1.1.1.2,Five
Jim,Receive,1.1.1.1,1.1.1.2,Six
"@ | convertfrom-csv | Group-Object Name,Action,SrcIP,DstIP -NoElement | sort Count -desc | ft -a

Open in new window

Result:
Count Name
----- ----
    2 Jim, Receive, 1.1.1.1, 1.1.1.2
    2 Joe, Create, 1.1.1.1, 1.1.1.2
    1 Joe, Create, 1.1.1.1, 1.1.1.3
    1 Jim, Send, 1.1.1.1, 1.1.1.2

Open in new window

Avatar of sunhux
sunhux

ASKER

apologies, was not able to attach earlier on the other laptop's at my relative's place.

Attached.
SampleSanitIpslog1.csv
SampleSanitIpslog1-sortNameSrcDstIP.csv
SampleSanitIpslog2-sortNameCnt.csv
SampleSanitIpslog2.jpg
Avatar of sunhux

ASKER

the input csv can have up to 3 million lines so the codes should not run out of memory
Since both sorting and grouping require to have at least the key values in memory as a whole, that's difficult. Do you need the original data? And you know you are not including ports in the group?
Avatar of sunhux

ASKER

I don't need the port, ad_industry, message and date/time columns in the final outputs so this will
help with reducing the amount of data
Avatar of sunhux

ASKER

DstSvcName can be removed as well but I'll need Sev column for reference
This is all I can offer. Should have the smallest memory footprint possible in PowerShell.
import-csv D:\Downloads\SampleSanitIpslog1.csv |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sec'; e={$_.Name.Split(',')[4].Trim()}}

Open in new window

Avatar of sunhux

ASKER

Bear with me : so I save those codes into a file, say
sortop.ps1
& then go to a PowerShell command prompt & just issue the script file name to run it?
   ie just issue   "sortop.ps1" at powerShell prompt?
Avatar of sunhux

ASKER

Btw, do I need to delete away those unused columns in SampleSanitIpslog1.csv first ie
port, ad_industry, DstSvcname, message and date/time    
Or the script will still run fine if I leave them there?

Or will deleting those columns help save memory usage?
to answer your question yes that is all that is required with the caveat that powershell originally ships with scripts disabled.. What I do is launch powershell from the start menu/screen then pin it to the task bar.. now right click on the icon and run as administrator.  now from the administrators powershell type the following (or copy paste
Set-ExecutionPolicy RemoteSigned 

Open in new window

and press enter or Y to accept
or from your powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

Open in new window

now you can run scripts
You always have to provide a relative or absolute path with a PS1 file to run it - for security reasons. So if you have changed into the folder containing the PS1 file:
  .\sortop.ps1
otherwise
  C:\Path\To\Script\sortop.ps1

And no, you do not need to prepare the CSV file, though it might speed up processing if you do. The Group-Object options I used preserve the relevant columns only already, and that should be sufficient, but I can't tell for sure.
Avatar of sunhux

ASKER

Just tested the script on a 800,000 odd records csv file : it completed with the output below after
PowerShell chews almost 1.5GB of RAM (which is not the desired result) : could it be due to some
columns are "blank" eg for items with column name "IPS Bypass" ?

Possible that the output is like what's shown by the jpg file that I've attached earlier?

Attached is a trimmed (ie I take the first few lines) of the input csv file:

Btw, the row "sec" comes from "Sev" ?

PS D:\ipsmay\mon\ps> .\sortop.ps1

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 22Web Server Enforcement Violation
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : IPS Bypass
Action :
SrcIP  :
DstIP  :
Sec    :

Name   :
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 33WordPress Enforcement Protection
Action :
SrcIP  :
DstIP  :
Sec    :

Name   : 44Database Protection Violation
Action :
SrcIP  :
DstIP  :
Sec    :
trim.csv
Avatar of sunhux

ASKER

> 1.5GB of RAM (which is not the desired result)
Sorry, I meant the output is not the desired result, not the 1.5GB of RAM
Yes, "sec" is a typo and should be "Sev".
But your CSV needs to have the column names as you stated. The trimmed one does not, and so the results are "empty" - the only column with "correct" name is "Name".
We can enum the columns in Import-Csv, to make sure the names are as expected and processed:
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}}

Open in new window

The other way is to use the exact column headers of the CSV in Group-Object.
Avatar of sunhux

ASKER

So the input csv file needs to have the columns labelled as:
column 1 : name
column 2 : action
column 3 : srcip
column 4 : dstip
column 5 : sev

& the labels must be of exact spelling  ie "deviceaction" will not do?
Are the labels case-sensitive?
Avatar of sunhux

ASKER

Ok, I've renamed the columns' labels to follow exactly what's in the script (with exact case) ie
Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference
 
Output is better but it did not indicate the count (ie number of occurrences or how many times
a record was repeated for the combined Name, Action, SrcIP, DstIP:
(presumably the outputs at the top are the ones with more repetition/counts? )

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.27.5
DstIP  : 172.24.108.71
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.16.132
DstIP  : 172.24.66.68
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.6.4
DstIP  : 172.24.24.68
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.22.134
DstIP  : 172.24.90.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.26.134
DstIP  : 172.24.106.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.38.69
DstIP  : 172.24.153.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.13.7
DstIP  : 172.24.52.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.29.197
DstIP  : 172.24.119.69
Sev    : Low

Name   : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP  : 10.253.20.7
DstIP  : 172.24.80.71
Sev    : Low
Avatar of sunhux

ASKER

also, ideally the output is in columnar csv file
Avatar of sunhux

ASKER

On a 910,000 records file, memory ran out:

PS D:\ipsmay\mon\ps> .\sortop.ps1 > ..\maymon4topair.txt
Group-Object : Exception of type 'System.OutOfMemoryException' was thrown.
At D:\ipsmay\mon\ps\sortop.ps1:2 char:15
+   Group-Object <<<<  Name,Action,SrcIP,DstIP,Sev -NoElement |
    + CategoryInfo          : NotSpecified: (:) [Group-Object], OutOfMemoryException
    + FullyQualifiedErrorId : System.OutOfMemoryException,Microsoft.PowerShell.Commands.GroupObjectCommand


Can amend script to delete all other columns except the columns below:
Time, Name, Sev, DstIP, SrcIP, action
Avatar of sunhux

ASKER

Remove column "Time" as well
The script already should keep only the columns we group for. We can try if removing the columns prior to grouping helps, but I do not think so.
The count is missing, but that is no issue ;-)
Exporting to CSV is easy. too.
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
  Select Name,Action,SrcIP,DstIP,Sev |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv

Open in new window

But probably PowerShell and similar languages are not suited well for managing that amount of data, at least without specialized and sophisticated code.
Avatar of sunhux

ASKER

Reducing the columns not needed helped : won't run out of memory anymore.

When I rerun the latest script, it gave the following :
The Count is missing & ideally output is in columnar csv

Name   : Name
Sev    : Sev
DstIP  : DstIP
SrcIP  : SrcIP
action : action

Name   : Streaming Engine: TCP Segment Limit Enforcement
Sev    : Low
DstIP  : 172.24.119.69
SrcIP  : 10.253.29.197
action : drop

Name   : Streaming Engine: TCP Segment Limit Enforcement
Sev    : Low
DstIP  : 172.24.119.69
SrcIP  : 10.253.29.197
action : drop

. . .
Avatar of sunhux

ASKER

I thought of amending it as follows:

import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, SrcIP, action |
(above is is the minimum columns needed)
  Select Name,Action,SrcIP,DstIP |
  Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv
(does result.csv contain the output in columnar format with Counts included?)
ASKER CERTIFIED SOLUTION
Avatar of Qlemo
Qlemo
Flag of Germany image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of sunhux

ASKER

c:\temp\results.csv is empty too
Avatar of sunhux

ASKER

Sorry, result.csv now contains the output in columnar format.

I amended it as follows as "Sev" is not part of the sorting key:

import-csv D:\ipsmay\Mon\ps\Maymon4relab2.csv |
  Select Name,Sev,DstIP,SrcIP,action |
  Group-Object Name,Action,SrcIP,DstIP -NoElement |
  sort Count -desc |
  Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
         @{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
         @{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
         @{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
         @{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
         Count |
  export-csv -NoType C:\temp\Result.csv
Avatar of sunhux

ASKER

wIth above latest amended script that I submitted above, the Sev column for the output all have
no values (Low, Med, High, Crit) : did I miss out something?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial