sunhux
asked on
Powershell/Python/Windows batch script to do a "group" by (ie Count) of a csv & sorting
I'll attach later as my Firefox is not showing the "Attach" option while in EE
Attachment 1 is my original raw csv file.
attachment 2 is the output after being sorted by a combined sort key of
"Name" + "Action" + "SrcIP" + "DstIP
attachment 3 is the output after being grouped (ie doing a count of repeat
records or records that are in common by "Name"+"Action+"SrcIP" + "DstIP"
& then sorted in descending order by the Count
Anyone can volunteer to knock out a PowerShell or Python or batch scripts
to do this task?
Attachment 1 is my original raw csv file.
attachment 2 is the output after being sorted by a combined sort key of
"Name" + "Action" + "SrcIP" + "DstIP
attachment 3 is the output after being grouped (ie doing a count of repeat
records or records that are in common by "Name"+"Action+"SrcIP" + "DstIP"
& then sorted in descending order by the Count
Anyone can volunteer to knock out a PowerShell or Python or batch scripts
to do this task?
you didn't submit ANY attachments
With the amount of info we have, is this what you would like to see?
@"
Name,Action,SrcIP,DstIP,Detail
Joe,Create,1.1.1.1,1.1.1.2,One
Joe,Create,1.1.1.1,1.1.1.2,Two
Joe,Create,1.1.1.1,1.1.1.3,Three
Jim,Send,1.1.1.1,1.1.1.2,Four
Jim,Receive,1.1.1.1,1.1.1.2,Five
Jim,Receive,1.1.1.1,1.1.1.2,Six
"@ | convertfrom-csv | Group-Object Name,Action,SrcIP,DstIP -NoElement | sort Count -desc | ft -a
Result:Count Name
----- ----
2 Jim, Receive, 1.1.1.1, 1.1.1.2
2 Joe, Create, 1.1.1.1, 1.1.1.2
1 Joe, Create, 1.1.1.1, 1.1.1.3
1 Jim, Send, 1.1.1.1, 1.1.1.2
ASKER
apologies, was not able to attach earlier on the other laptop's at my relative's place.
Attached.
SampleSanitIpslog1.csv
SampleSanitIpslog1-sortNameSrcDstIP.csv
SampleSanitIpslog2-sortNameCnt.csv
SampleSanitIpslog2.jpg
Attached.
SampleSanitIpslog1.csv
SampleSanitIpslog1-sortNameSrcDstIP.csv
SampleSanitIpslog2-sortNameCnt.csv
SampleSanitIpslog2.jpg
ASKER
the input csv can have up to 3 million lines so the codes should not run out of memory
Since both sorting and grouping require to have at least the key values in memory as a whole, that's difficult. Do you need the original data? And you know you are not including ports in the group?
ASKER
I don't need the port, ad_industry, message and date/time columns in the final outputs so this will
help with reducing the amount of data
help with reducing the amount of data
ASKER
DstSvcName can be removed as well but I'll need Sev column for reference
This is all I can offer. Should have the smallest memory footprint possible in PowerShell.
import-csv D:\Downloads\SampleSanitIpslog1.csv |
Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
@{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
@{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
@{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
@{n='Sec'; e={$_.Name.Split(',')[4].Trim()}}
ASKER
Bear with me : so I save those codes into a file, say
sortop.ps1
& then go to a PowerShell command prompt & just issue the script file name to run it?
ie just issue "sortop.ps1" at powerShell prompt?
sortop.ps1
& then go to a PowerShell command prompt & just issue the script file name to run it?
ie just issue "sortop.ps1" at powerShell prompt?
ASKER
Btw, do I need to delete away those unused columns in SampleSanitIpslog1.csv first ie
port, ad_industry, DstSvcname, message and date/time
Or the script will still run fine if I leave them there?
Or will deleting those columns help save memory usage?
port, ad_industry, DstSvcname, message and date/time
Or the script will still run fine if I leave them there?
Or will deleting those columns help save memory usage?
to answer your question yes that is all that is required with the caveat that powershell originally ships with scripts disabled.. What I do is launch powershell from the start menu/screen then pin it to the task bar.. now right click on the icon and run as administrator. now from the administrators powershell type the following (or copy paste
or from your powershell
Set-ExecutionPolicy RemoteSigned
and press enter or Y to acceptor from your powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
now you can run scripts
You always have to provide a relative or absolute path with a PS1 file to run it - for security reasons. So if you have changed into the folder containing the PS1 file:
.\sortop.ps1
otherwise
C:\Path\To\Script\sortop.p s1
And no, you do not need to prepare the CSV file, though it might speed up processing if you do. The Group-Object options I used preserve the relevant columns only already, and that should be sufficient, but I can't tell for sure.
.\sortop.ps1
otherwise
C:\Path\To\Script\sortop.p
And no, you do not need to prepare the CSV file, though it might speed up processing if you do. The Group-Object options I used preserve the relevant columns only already, and that should be sufficient, but I can't tell for sure.
ASKER
Just tested the script on a 800,000 odd records csv file : it completed with the output below after
PowerShell chews almost 1.5GB of RAM (which is not the desired result) : could it be due to some
columns are "blank" eg for items with column name "IPS Bypass" ?
Possible that the output is like what's shown by the jpg file that I've attached earlier?
Attached is a trimmed (ie I take the first few lines) of the input csv file:
Btw, the row "sec" comes from "Sev" ?
PS D:\ipsmay\mon\ps> .\sortop.ps1
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action :
SrcIP :
DstIP :
Sec :
Name : 22Web Server Enforcement Violation
Action :
SrcIP :
DstIP :
Sec :
Name : IPS Bypass
Action :
SrcIP :
DstIP :
Sec :
Name :
Action :
SrcIP :
DstIP :
Sec :
Name : 33WordPress Enforcement Protection
Action :
SrcIP :
DstIP :
Sec :
Name : 44Database Protection Violation
Action :
SrcIP :
DstIP :
Sec :
trim.csv
PowerShell chews almost 1.5GB of RAM (which is not the desired result) : could it be due to some
columns are "blank" eg for items with column name "IPS Bypass" ?
Possible that the output is like what's shown by the jpg file that I've attached earlier?
Attached is a trimmed (ie I take the first few lines) of the input csv file:
Btw, the row "sec" comes from "Sev" ?
PS D:\ipsmay\mon\ps> .\sortop.ps1
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action :
SrcIP :
DstIP :
Sec :
Name : 22Web Server Enforcement Violation
Action :
SrcIP :
DstIP :
Sec :
Name : IPS Bypass
Action :
SrcIP :
DstIP :
Sec :
Name :
Action :
SrcIP :
DstIP :
Sec :
Name : 33WordPress Enforcement Protection
Action :
SrcIP :
DstIP :
Sec :
Name : 44Database Protection Violation
Action :
SrcIP :
DstIP :
Sec :
trim.csv
ASKER
> 1.5GB of RAM (which is not the desired result)
Sorry, I meant the output is not the desired result, not the 1.5GB of RAM
Sorry, I meant the output is not the desired result, not the 1.5GB of RAM
Yes, "sec" is a typo and should be "Sev".
But your CSV needs to have the column names as you stated. The trimmed one does not, and so the results are "empty" - the only column with "correct" name is "Name".
We can enum the columns in Import-Csv, to make sure the names are as expected and processed:
But your CSV needs to have the column names as you stated. The trimmed one does not, and so the results are "empty" - the only column with "correct" name is "Name".
We can enum the columns in Import-Csv, to make sure the names are as expected and processed:
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
@{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
@{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
@{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
@{n='Sev'; e={$_.Name.Split(',')[4].Trim()}}
The other way is to use the exact column headers of the CSV in Group-Object.
ASKER
So the input csv file needs to have the columns labelled as:
column 1 : name
column 2 : action
column 3 : srcip
column 4 : dstip
column 5 : sev
& the labels must be of exact spelling ie "deviceaction" will not do?
Are the labels case-sensitive?
column 1 : name
column 2 : action
column 3 : srcip
column 4 : dstip
column 5 : sev
& the labels must be of exact spelling ie "deviceaction" will not do?
Are the labels case-sensitive?
ASKER
Ok, I've renamed the columns' labels to follow exactly what's in the script (with exact case) ie
Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference
Output is better but it did not indicate the count (ie number of occurrences or how many times
a record was repeated for the combined Name, Action, SrcIP, DstIP:
(presumably the outputs at the top are the ones with more repetition/counts? )
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.27.5
DstIP : 172.24.108.71
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.16.132
DstIP : 172.24.66.68
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.6.4
DstIP : 172.24.24.68
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.22.134
DstIP : 172.24.90.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.26.134
DstIP : 172.24.106.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.38.69
DstIP : 172.24.153.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.13.7
DstIP : 172.24.52.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.29.197
DstIP : 172.24.119.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.20.7
DstIP : 172.24.80.71
Sev : Low
Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference
Output is better but it did not indicate the count (ie number of occurrences or how many times
a record was repeated for the combined Name, Action, SrcIP, DstIP:
(presumably the outputs at the top are the ones with more repetition/counts? )
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.27.5
DstIP : 172.24.108.71
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.16.132
DstIP : 172.24.66.68
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.6.4
DstIP : 172.24.24.68
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.22.134
DstIP : 172.24.90.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.26.134
DstIP : 172.24.106.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.38.69
DstIP : 172.24.153.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.13.7
DstIP : 172.24.52.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.29.197
DstIP : 172.24.119.69
Sev : Low
Name : 11Streaming Engine: TCP Segment Limit Enforcement
Action : drop
SrcIP : 10.253.20.7
DstIP : 172.24.80.71
Sev : Low
ASKER
also, ideally the output is in columnar csv file
ASKER
On a 910,000 records file, memory ran out:
PS D:\ipsmay\mon\ps> .\sortop.ps1 > ..\maymon4topair.txt
Group-Object : Exception of type 'System.OutOfMemoryExcepti on' was thrown.
At D:\ipsmay\mon\ps\sortop.ps 1:2 char:15
+ Group-Object <<<< Name,Action,SrcIP,DstIP,Se v -NoElement |
+ CategoryInfo : NotSpecified: (:) [Group-Object], OutOfMemoryException
+ FullyQualifiedErrorId : System.OutOfMemoryExceptio n,Microsof t.PowerShe ll.Command s.GroupObj ectCommand
Can amend script to delete all other columns except the columns below:
Time, Name, Sev, DstIP, SrcIP, action
PS D:\ipsmay\mon\ps> .\sortop.ps1 > ..\maymon4topair.txt
Group-Object : Exception of type 'System.OutOfMemoryExcepti
At D:\ipsmay\mon\ps\sortop.ps
+ Group-Object <<<< Name,Action,SrcIP,DstIP,Se
+ CategoryInfo : NotSpecified: (:) [Group-Object], OutOfMemoryException
+ FullyQualifiedErrorId : System.OutOfMemoryExceptio
Can amend script to delete all other columns except the columns below:
Time, Name, Sev, DstIP, SrcIP, action
ASKER
Remove column "Time" as well
The script already should keep only the columns we group for. We can try if removing the columns prior to grouping helps, but I do not think so.
The count is missing, but that is no issue ;-)
Exporting to CSV is easy. too.
The count is missing, but that is no issue ;-)
Exporting to CSV is easy. too.
import-csv D:\Downloads\SampleSanitIpslog1.csv -Header Time, Name, Sev, DstIP, Dstsvnm, message, DstPort, SrcIP, action, Reference |
Select Name,Action,SrcIP,DstIP,Sev |
Group-Object Name,Action,SrcIP,DstIP,Sev -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].Trim()}},
@{n='Action'; e={$_.Name.Split(',')[1].Trim()}},
@{n='SrcIP'; e={$_.Name.Split(',')[2].Trim()}},
@{n='DstIP'; e={$_.Name.Split(',')[3].Trim()}},
@{n='Sev'; e={$_.Name.Split(',')[4].Trim()}},
Count |
export-csv -NoType C:\temp\Result.csv
But probably PowerShell and similar languages are not suited well for managing that amount of data, at least without specialized and sophisticated code.
ASKER
Reducing the columns not needed helped : won't run out of memory anymore.
When I rerun the latest script, it gave the following :
The Count is missing & ideally output is in columnar csv
Name : Name
Sev : Sev
DstIP : DstIP
SrcIP : SrcIP
action : action
Name : Streaming Engine: TCP Segment Limit Enforcement
Sev : Low
DstIP : 172.24.119.69
SrcIP : 10.253.29.197
action : drop
Name : Streaming Engine: TCP Segment Limit Enforcement
Sev : Low
DstIP : 172.24.119.69
SrcIP : 10.253.29.197
action : drop
. . .
When I rerun the latest script, it gave the following :
The Count is missing & ideally output is in columnar csv
Name : Name
Sev : Sev
DstIP : DstIP
SrcIP : SrcIP
action : action
Name : Streaming Engine: TCP Segment Limit Enforcement
Sev : Low
DstIP : 172.24.119.69
SrcIP : 10.253.29.197
action : drop
Name : Streaming Engine: TCP Segment Limit Enforcement
Sev : Low
DstIP : 172.24.119.69
SrcIP : 10.253.29.197
action : drop
. . .
ASKER
I thought of amending it as follows:
import-csv D:\Downloads\SampleSanitIp slog1.csv -Header Time, Name, Sev, DstIP, SrcIP, action |
(above is is the minimum columns needed)
Select Name,Action,SrcIP,DstIP |
Group-Object Name,Action,SrcIP,DstIP,Se v -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].T rim()}},
@{n='Action'; e={$_.Name.Split(',')[1].T rim()}},
@{n='SrcIP'; e={$_.Name.Split(',')[2].T rim()}},
@{n='DstIP'; e={$_.Name.Split(',')[3].T rim()}},
@{n='Sev'; e={$_.Name.Split(',')[4].T rim()}},
Count |
export-csv -NoType C:\temp\Result.csv
(does result.csv contain the output in columnar format with Counts included?)
import-csv D:\Downloads\SampleSanitIp
(above is is the minimum columns needed)
Select Name,Action,SrcIP,DstIP |
Group-Object Name,Action,SrcIP,DstIP,Se
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].T
@{n='Action'; e={$_.Name.Split(',')[1].T
@{n='SrcIP'; e={$_.Name.Split(',')[2].T
@{n='DstIP'; e={$_.Name.Split(',')[3].T
@{n='Sev'; e={$_.Name.Split(',')[4].T
Count |
export-csv -NoType C:\temp\Result.csv
(does result.csv contain the output in columnar format with Counts included?)
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
c:\temp\results.csv is empty too
ASKER
Sorry, result.csv now contains the output in columnar format.
I amended it as follows as "Sev" is not part of the sorting key:
import-csv D:\ipsmay\Mon\ps\Maymon4re lab2.csv |
Select Name,Sev,DstIP,SrcIP,actio n |
Group-Object Name,Action,SrcIP,DstIP -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].T rim()}},
@{n='Action'; e={$_.Name.Split(',')[1].T rim()}},
@{n='SrcIP'; e={$_.Name.Split(',')[2].T rim()}},
@{n='DstIP'; e={$_.Name.Split(',')[3].T rim()}},
@{n='Sev'; e={$_.Name.Split(',')[4].T rim()}},
Count |
export-csv -NoType C:\temp\Result.csv
I amended it as follows as "Sev" is not part of the sorting key:
import-csv D:\ipsmay\Mon\ps\Maymon4re
Select Name,Sev,DstIP,SrcIP,actio
Group-Object Name,Action,SrcIP,DstIP -NoElement |
sort Count -desc |
Select @{n='Name'; e={$_.Name.Split(',')[0].T
@{n='Action'; e={$_.Name.Split(',')[1].T
@{n='SrcIP'; e={$_.Name.Split(',')[2].T
@{n='DstIP'; e={$_.Name.Split(',')[3].T
@{n='Sev'; e={$_.Name.Split(',')[4].T
Count |
export-csv -NoType C:\temp\Result.csv
ASKER
wIth above latest amended script that I submitted above, the Sev column for the output all have
no values (Low, Med, High, Crit) : did I miss out something?
no values (Low, Med, High, Crit) : did I miss out something?
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.