PowerShell Script to List the Top 50 Largest Files

compdigit44
Asked:
I have a very large file server environment with 17TB of data, and our current storage monitoring solution is not able to scan the volumes fast enough to produce a daily report.

I am not a PowerShell expert, but I am looking to see if a script would be able to list the top 50 largest files per volume, from largest to smallest, with the full path, date of last access, owner, and of course size. If it could be exported to a CSV, that would be great.
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
I doubt the PS approach will help with that, as it still has to traverse all the file system info to determine the top n files. But of course you can try.
Get-ChildItem C:\ -recurse | select FullName, Length | sort Length -Desc | Select -First 50


Author

Commented:
Thanks... Does this list the size of the file, though?
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
Yes, it shows the full path plus the size.

Yes, and add this:
| Export-Csv C:\file_size\log.csv
Then you have the output as a file.

Author

Commented:
Thanks, I am trying it now.

Author

Commented:
The script is running, but I am getting messages that the file paths it is hitting are too long.
By default, the Length property shows the size in bytes, not MB.
Here is an addition to the script posted by Qlemo:
it will write the output to a .csv, and the size will be in MB.
Get-ChildItem C:\ -Recurse | select Name, Directory, @{N="Size(MB)";E={[Math]::Round($_.length/1MB,2)}} | sort "Size(MB)" -Descending | select -First 50 | Export-Csv -Path d:\Top50Files.csv -NoTypeInformation


Author

Commented:
Thanks... I will try this tomorrow and report back. Will this be able to handle files with very long path names?
As the error says, it's because the path of the file is more than 256 characters.
You will face the same error by running the script I posted, because there is no change in the cmdlet used (Get-ChildItem).
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
This uses robocopy to produce the file list (but it will not actually copy anything), as robocopy doesn't care about long paths. It should have a pretty low memory footprint, too, since it doesn't collect all items first and then sorts them, but only collects the top n biggest ones.
Can't promise anything concerning speed, though.
It returns an array of PSCustom objects with two properties, FullName and Length. You can process that output any way you feel like, for example like this:
$Top = .\Whatever.ps1 -Path E:\Wherever
$Top | fl
$Top | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation


[CmdletBinding()]
Param(
	[string]$Path = $(Get-Location -PSProvider Filesystem),
	[uint32]$Top = 50
)
$List = 1..$Top | ForEach-Object {New-Object -TypeName PSObject -Property @{"Length" = 0}}
& robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /r:0 | ForEach-Object {
	If ($_) {
		$Split = $_.Split("`t")
		$Length = [int]$Split[3].Trim()
		If ($Length -gt $List[0].Length) {
			$List[0] = $Split[4].Trim() | Select-Object -Property @{Name="FullName"; Expression={$_};}, @{Name="Length"; Expression={$Length}}
			$List = $List | Sort-Object -Property Length
		}
	}
}
$List | Where-Object {$_.Length -gt 0}


We'll worry about how to get the rest of the information you want (considering the long file paths) if that runs in a timely fashion.
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
A very clever way to get the top n. I had to think about how it works for some time...

Author

Commented:
Thanks for the replies. I will look at this more tomorrow, but the thought of using robocopy, even just to list files, scares me a bit.
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
In the past, I've used successive automatic SUBST commands to cut down overly long paths in PowerShell, but that approach has no advantage over using RoboCopy. RoboCopy is built in, so why not use it?
Top Expert 2014

Commented:
A couple of suggestions:
Exclude small files -- in this parameter example, a 1MB minimum greatly reduced the number of output lines produced by Robocopy in my test. If you know your file system well, you might be able to increase that minimum:
/min:1000000



Sort once -- instead of sorting your list every time you add an item, sort all the items and take the top N.
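As a rough illustration of the sort-once idea (the sample data and variable names below are hypothetical, not taken from the scripts in this thread): accumulate every qualifying entry first, then sort the whole collection a single time and keep the top N.

```powershell
# Sketch of the "sort once" approach: collect ALL candidate entries first,
# then perform a single sort at the end instead of re-sorting on every insert.
# $rawEntries stands in for parsed file data (hypothetical sample values).
$rawEntries = @(
    [PSCustomObject]@{ FullName = 'E:\vm\big.vhd';   Length = 9100000000 }
    [PSCustomObject]@{ FullName = 'E:\iso\win.iso';  Length = 4500000000 }
    [PSCustomObject]@{ FullName = 'E:\logs\app.log'; Length = 1048576 }
)
$topN = 2
# One sort over the whole collection, then keep the N largest.
$result = $rawEntries |
    Sort-Object -Property Length -Descending |
    Select-Object -First $topN
$result | ForEach-Object { '{0}  {1:N0} bytes' -f $_.FullName, $_.Length }
```

The trade-off, as discussed below, is that this keeps every entry in memory until the end, whereas the sort-on-insert approach only ever holds N objects.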
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
compdigit44,
with the /L argument, robocopy will only list the files - really. All it will do is generate a file list.

aikimark,
the /min is a nice touch.
The "sort once [all the items]" is exactly what I wanted to avoid. Not knowing anything about the server resources and its file structure, I find it safer to invest a bit of CPU into sorting (not that sorting 50 elements is really that taxing) than to first collect potentially gigabytes of data in RAM.
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
Here's a version that includes the minimum size suggested by aikimark, and more importantly, fixes a sizing issue (sorry); [int] might not be big enough. In addition, it gives you LastWriteTime and size in MB.
[CmdletBinding()]
Param(
	[string]$Path = $(Get-Location -PSProvider Filesystem),
	[uint32]$Top = 50,
	[string]$MinimumSize = "1MB"
)
$List = 1..$Top | ForEach-Object {New-Object -TypeName PSObject -Property @{"Length" = 0}}
If (($Min = ([int64]1 * $MinimumSize.Replace(" ", ""))) -isnot [int64]) {
	"Unable to parse '$($MinimumSize)' to an integer." | Write-Error
	Exit 1
}
## Not collecting LastWriteTime yet; it'll be in the same field as the Length, which would require an additional Split().
& robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /min:$Min /r:0 | ForEach-Object {
	If ($_) {
		$Split = $_.Split("`t")
		$Length = [int64]$Split[3].Trim()
		If ($Length -gt $List[0].Length) {
			$List[0] = $Split[4].Trim() | Select-Object -Property @{Name="FullName"; Expression={$_};}, @{Name="Length"; Expression={$Length}}
			$List = $List | Sort-Object -Property Length
		}
	}
}
$List | Where-Object {$_.Length -gt 0} | ForEach-Object {
	$Folder = Split-Path -Path $_.FullName -Parent
	$File = Split-Path -Path $_.FullName -Leaf
	$Line = & robocopy.exe $Folder C:\Dummy_Must_Not_Exist $File /L /s /nc /njh /njs /ndl /ns /ts
	$_ | Select-Object -Property `
		*,
		@{Name="SizeMB"; Expression={"{0:N3}" -f ($_.Length / 1MB)}},
		@{Name="LastWriteTime"; Expression={[DateTime]($Line.Split("`t")[4].Trim())}}
}


Top Expert 2014

Commented:
Every time you find a file that is larger than the minimum in your list, you are performing the sort.

You could go through a second round of filtering after the Robocopy /min:### process.
1. Read the results and look for the largest and smallest values. Alternatively, you could do a frequency analysis of the file sizes (a more accurate approach).
2. With the information from step 1, you can then filter the Robocopy output to make sorting a simple, non-system-stressing operation.

===================
For frequency analysis, use a 10x14 array, indexed by leading mantissa digit and exponent, to store the file counts.

A simpler frequency analysis can be made of the length of the string numbers, ignoring the mantissa data.

A simpler approach to the initial filtering would be to start with a large /min:### value and decrease it until you get more than 50 (i.e., N) files. Save that value for each volume for the next run.

Example:
In my test, I ran Robocopy against a directory tree with 31,000 files. The resulting directory tree output was 4MB. Applying /min:1000000 reduced the file to 30KB with 390 lines.
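That decrease-until-enough idea could be sketched as follows; Find-MinimumThreshold and its $CountFiles scriptblock are hypothetical stand-ins for "run Robocopy with /min:&lt;threshold&gt; and count the file lines it prints".

```powershell
# Sketch of the adaptive minimum-size filter: start high, halve the threshold
# until more than $Top files qualify, and save the result per volume for next run.
# $CountFiles is a hypothetical scriptblock standing in for
# "run robocopy with /min:<threshold> and count the file lines it prints".
function Find-MinimumThreshold {
    param(
        [scriptblock]$CountFiles,   # given a threshold, returns the number of files >= it
        [int64]$Start = 1GB,
        [int]$Top = 50
    )
    $threshold = [int64]$Start
    while ((& $CountFiles $threshold) -le $Top -and $threshold -gt 1) {
        $threshold = [int64]($threshold / 2)   # loosen the filter and retry
    }
    $threshold
}

# Example with fake file sizes of 1 MB .. 200 MB:
$sizes = 1..200 | ForEach-Object { [int64]$_ * 1MB }
Find-MinimumThreshold -CountFiles { param($t) @($sizes | Where-Object { $_ -ge $t }).Count } -Start 256MB -Top 50
# -> 134217728 (128 MB): the first threshold that more than 50 fake files exceed
```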
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
Mark, there are only two practical choices:
Either you ignore the amount of data, e.g. because it is of no significance anymore (filtered to something reasonable),
or you sort each time.

Everything else is sorting theory only, without any practical value for this case. Applying statistical methods makes it complex, not simple.
The "sort on change" approach will scale well, as the probability that the top n values are already in the list increases with each file found, and then no sort is done.

A sort of 50 objects (with an average of maybe 250 bytes per entry) should be a very fast operation, without much CPU cost. The more objects to keep, the higher the cost, that is correct - with O(n^2), if I'm correct.
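To put a rough number on that cost, one can time repeated sorts of a 50-object list; this is a quick sketch, not a rigorous benchmark, and the object names are illustrative only.

```powershell
# Rough cost check: time 1000 sorts of a 50-object list.
# Get-Random just fabricates file sizes for the demonstration.
$list = 1..50 | ForEach-Object { [PSCustomObject]@{ Length = Get-Random } }
$elapsed = Measure-Command {
    for ($i = 0; $i -lt 1000; $i++) {
        $null = $list | Sort-Object -Property Length
    }
}
$sorted = $list | Sort-Object -Property Length
'1000 sorts of 50 objects took {0} ms' -f [int]$elapsed.TotalMilliseconds
```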
Top Expert 2014

Commented:
@Q

O(N * log2(N))

I did a back-of-the-napkin estimate, based on my earlier test. If I scale up my 8.5GB tree to 17TB, I might expect a concomitant 2000-fold increase in the number of files. So, my 30,900-file problem would scale up to a 61,800,000-file problem. Sure, sorting 50 items isn't too terribly expensive, but we don't know the distribution of the files, so we can't eliminate the possibility that we would be doing over 60M sorting operations on our little list.

Author

Commented:
I ran the script as posted earlier and got a message that states "Cannot call method on null value expression", yet it does show some files.

Also, I noticed the FullName path lists a whole bunch of dots at the end. If the script exported to a CSV, could we see the full file path? What does the Length column mean? Stupid question: how can I be sure that robocopy is not moving any files? Sorry, just paranoid...



Top Expert 2014

Commented:
how can I be sure that robocopy is not moving any files?
because of the /L command line switch

Author

Commented:
Thanks... Just wanted to make sure, since the Microsoft site uses a lowercase l and not uppercase. Do the other switches just control logging?

What about the error message I am getting, or the full file name not being listed? Would it display the file name if exported to CSV? What would need to be changed to export it to a CSV?
Top Expert 2014

Commented:
That shouldn't be an issue, since the long path names are coming from Robocopy and not from within PS.
Most Valuable Expert 2018
Distinguished Expert 2018
Commented:
robocopy arguments are case-insensitive (the /L is uppercase so it's obvious it's an "L" and not an uppercase "i"); the other arguments just suppress the additional output robocopy generates otherwise.
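For reference, here is the same robocopy call with each switch annotated (descriptions per robocopy's documented options; this fragment mirrors the script's call and adds no new behavior):

```powershell
# Annotated version of the robocopy call used in the script.
# List mode only: nothing is copied, moved, or deleted.
robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* `
    /L        <# list only - do NOT copy, move, or delete anything #> `
    /s        <# recurse into subdirectories (skipping empty ones) #> `
    /nc       <# no file class column in the output #> `
    /njh      <# no job header #> `
    /njs      <# no job summary #> `
    /ndl      <# no directory lines - file lines only #> `
    /fp       <# log full path names #> `
    /bytes    <# print sizes in bytes instead of rounded units #> `
    /min:$Min <# skip files smaller than $Min bytes #> `
    /r:0      <# no retries on failed accesses #>
```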
The column Length is the length of the file in bytes.
For testing, use it as I suggested above; start with
$Top = .\Whatever.ps1 -Path E:\Wherever


$Top now contains an array of the objects found, which you can inspect any which way you want, for example (Format-Table won't help you much because of the long file paths.):
$Top | Format-List


Or export it to csv:
$Top | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation


To run the script and export to csv in one go, just pipe the output to Export-Csv instead of collecting it in a variable:
.\Whatever.ps1 -Path E:\Wherever | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation


The error was probably an error message from robocopy. The following has better error handling:
[CmdletBinding()]
Param(
	[string]$Path = $(Get-Location -PSProvider Filesystem),
	[uint32]$Top = 50,
	[string]$MinimumSize = "1MB"
)
$List = 1..$Top | ForEach-Object {New-Object -TypeName PSObject -Property @{"Length" = 0}}
If (($Min = ([int64]1 * $MinimumSize.Replace(" ", ""))) -isnot [int64]) {
	"Unable to parse '$($MinimumSize)' to an integer." | Write-Error
	Exit 1
}
$FileCount = $ErrorCount = 0
## Not collecting LastWriteTime yet; it'll be in the same field as the Length, which would require an additional Split().
& robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /min:$Min /r:0 | ForEach-Object {
	If ($_) {
		$Line = $_
		Try {
			$Split = $Line.Split("`t")
			$Length = [int64]$Split[3].Trim()
			If ($Length -gt $List[0].Length) {
				$List[0] = $Split[4].Trim() | Select-Object -Property @{Name="FullName"; Expression={$_}}, @{Name="Length"; Expression={$Length}}
				$List = $List | Sort-Object -Property Length
			}
			$FileCount++
		} Catch {
			"Unable to parse the line: '$($Line)'" | Write-Warning
			$ErrorCount++
		}
	}
}
$List | Where-Object {$_.Length -gt 0} | ForEach-Object {
	$Folder = Split-Path -Path $_.FullName -Parent
	$File = Split-Path -Path $_.FullName -Leaf
	$Line = & robocopy.exe $Folder C:\Dummy_Must_Not_Exist $File /L /nc /njh /njs /ndl /ns /ts
	$_ | Select-Object -Property `
		*,
		@{Name="SizeMB"; Expression={[math]::Round(($_.Length / 1MB), 3)}},
		@{Name="LastWriteTime"; Expression={[DateTime]($Line.Split("`t")[4].Trim())}}
}
"Analyzed $($FileCount) files bigger than $($MinimumSize)." | Write-Host
If ($ErrorCount -gt 0) {
	"Encountered $($ErrorCount) errors." | Write-Warning
}


Author

Commented:
Stupid question...

I see you are saving the script to a .ps1 file for PowerShell, and then also using the -Path parameter. Isn't this redundant?

.\Whatever.ps1 -Path E:\Wherever | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation

Also, where do I specify which volume to scan?
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
Sorry, I can't quite follow you.
The script accepts three arguments, all optional:
-Path: the path to the folder in which to start the report; default is the current folder (which is not necessarily the script's).
-Top: the number of files to collect in the report; default is 50.
-MinimumSize: the minimum file size in bytes to consider worth checking; as you can see in the default value, this accepts values like "1MB" or "100KB" or "10TB" as well.

Author

Commented:
Thanks, but how does the script know which drive to scan?
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
With the -Path argument, the same way Get-ChildItem expects the path to list as argument.
Save the script somewhere, for example as C:\Temp\Get-TopFiles.ps1.
Then open a PS console (preferably as Administrator), and if you want to scan for example the folder "E:\BigData", enter
$Top = C:\Temp\Get-TopFiles.ps1 -Path E:\BigData


Do not forget the "$Top ="; it collects the objects returned so that you have more to work with than some console output that will be truncated.
For further processing of that variable and how to export it to csv, check my examples above http:#a41455680.
Top Expert 2014

Commented:
In a test I ran, the Robocopy output might contain error messages like this:
2016/02/09 10:18:50 ERROR 5 (0x00000005) Scanning Source Directory c:\users\aikimark\Templates\
Access is denied.



So, you will need to check that the file length column you parse is numeric before you add the item to your list.

Alternatively, you might check that the split line result has more than one item.
Most Valuable Expert 2018
Distinguished Expert 2018

Commented:
That's already been taken care of in the latest version. The error lines don't contain tabs, so access to the split array will fail and end up in the Catch.

Author

Commented:
First off, I wanted to thank everyone for their help. I have been testing the script and it is working great; you guys are grand masters at this.

Right now my goal is to run a scheduled task to scan each of my volumes daily, overwriting the same CSV name each time. Then, after all files are done, run a final script to send all the CSVs as email attachments. Is this hard to do?

Also, when I run the script, what does the Length column in the CSV represent?
Qlemo "Batchelor", Developer and EE Topic Advisor
Top Expert 2015

Commented:
Length is the file size in bytes (it is the way PowerShell fills the file object). We would not have to keep it, as the script adds SizeMB, but it is no bad idea either to see the "raw" data.

The "send all CSV" part isn't difficult:
Send-MailMessage -Subject "Daily Report" -From me@domain.com -To you@domain.com -SmtpServer mail.domain.com -Attachments (Get-ChildItem C:\temp\ -Filter *.csv | Select-Object -ExpandProperty FullName)


or
Get-ChildItem C:\temp\ -Filter *.csv | Select-Object -ExpandProperty FullName | Send-MailMessage -Subject "Daily Report" -From me@domain.com -To you@domain.com -SmtpServer mail.domain.com

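A daily wrapper along those lines might look like the sketch below. All paths, volume letters, and mail settings are placeholders, and it assumes the top-files script from this thread has been saved as C:\Scripts\Get-TopFiles.ps1; adjust to your environment before scheduling it.

```powershell
# Sketch of a daily wrapper (all paths, volumes, and mail settings are placeholders).
# Assumes the top-files script from this thread is saved as C:\Scripts\Get-TopFiles.ps1.
$volumes = 'D:\', 'E:\', 'F:\'      # the volumes to scan
$csvDir  = 'C:\Reports'
foreach ($vol in $volumes) {
    # Same CSV name per volume, overwritten on each daily run.
    $csv = Join-Path $csvDir ("Top50_{0}.csv" -f $vol.Substring(0, 1))
    & C:\Scripts\Get-TopFiles.ps1 -Path $vol |
        Export-Csv -Path $csv -NoTypeInformation
}
# Once every volume is done, mail all CSVs as attachments.
Send-MailMessage -Subject "Daily Top-50 Report" -From me@domain.com -To you@domain.com `
    -SmtpServer mail.domain.com `
    -Attachments (Get-ChildItem $csvDir -Filter *.csv | Select-Object -ExpandProperty FullName)
```

It could then be scheduled with, for example (names again placeholders): schtasks /Create /TN "DailyTop50" /TR "powershell.exe -File C:\Scripts\Daily-Report.ps1" /SC DAILY /ST 02:00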
