compdigit44

asked on

PowerShell Script to List Top 50 Largest Files

I have a very large file server environment with 17TB of data, and our current storage monitoring solution is not able to scan the volumes fast enough to produce a report daily.

I am not a PowerShell expert, but I am looking to see if a script would be able to list the top 50 largest files per volume from largest to smallest, with their full path, date of last access, owner, and of course size. If it could be exported to a CSV, this would be great...
Qlemo

I doubt the PS approach will help with that, as it still has to traverse all file system info to determine the top n files. But of course you can try.
Get-ChildItem C:\ -recurse | select FullName, Length | sort Length -Desc | Select -First 50

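The question also asks for the date of last access and the owner; one possible extension of that one-liner (an untested sketch: -File needs PowerShell 3.0 or later, and Get-Acl is only run for the final 50 files):
Get-ChildItem C:\ -Recurse -File -ErrorAction SilentlyContinue |
	Sort-Object Length -Descending |
	Select-Object -First 50 -Property FullName, Length, LastAccessTime,
		@{Name="Owner"; Expression={(Get-Acl -Path $_.FullName).Owner}}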

compdigit44 (ASKER)

Thanks...Does this list the size of the file though?
Yes, it shows the full path plus the size.
Yes, and add this:
| Export-Csv C:\file_size\log.csv
Then you have the output as a file.
Thanks, I am trying it now.
The script is running, but I am getting messages that the file paths it is hitting are too long.
By default the "length" property does not show size in MB.
Some addition to script posted by Qlemo.
will give output to .csv and size will be in MB
Get-ChildItem C:\ -Recurse | select Name, Directory, @{N="Size(MB)";E={[Math]::Round($_.length/1MB,2)}} | sort "Size(MB)" -Descending | select -First 50 | Export-Csv -Path d:\Top50Files.csv -NoTypeInformation


Thanks... I will try this tomorrow and report back. Will this be able to handle file names with very long paths?
As the error says, it's because the path of the file is more than 256 characters.
You will face the same error running the script I posted, because there is no change in the cmdlet used (Get-ChildItem).
This uses robocopy to produce the file list (but it will not actually copy anything), as robocopy doesn't care about long paths. It should have a pretty low memory footprint, too, since it doesn't collect all items first and then sort them, but only keeps the top n biggest ones.
Can't promise anything concerning speed, though.
It returns an array of PSCustom objects with two properties, FullName and Length. You can process that output any way you feel like, for example like this:
$Top = .\Whatever.ps1 -Path E:\Wherever
$Top | fl
$Top | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation


[CmdletBinding()]
Param(
	[string]$Path = $(Get-Location -PSProvider Filesystem),
	[uint32]$Top = 50
)
## Pre-fill the result list with $Top placeholder entries of Length 0.
$List = 1..$Top | ForEach-Object {New-Object -TypeName PSObject -Property @{"Length" = 0}}
## /L = list only, nothing is copied; the target folder must not exist.
& robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /r:0 | ForEach-Object {
	If ($_) {
		$Split = $_.Split("`t")
		$Length = [int]$Split[3].Trim()
		## Replace the smallest entry kept so far if this file is bigger, then re-sort the (small) list.
		If ($Length -gt $List[0].Length) {
			$List[0] = $Split[4].Trim() | Select-Object -Property @{Name="FullName"; Expression={$_};}, @{Name="Length"; Expression={$Length}}
			$List = $List | Sort-Object -Property Length
		}
	}
}
$List | Where-Object {$_.Length -gt 0}


We'll worry about how (considering the long file paths) to get the rest of the information you want if that runs in a timely fashion.
A very clever way to get the top n. Had to think about how it works for some time...
Thanks for the replies; I will look at this more tomorrow, but the thought of using robocopy even just to list files scares me a bit.
In the past, I've used successive automatic SUBST commands to cut down overly long paths in PowerShell, but that is in no way better than (has no advantage over) using RoboCopy. RoboCopy is built in, so why not use it?
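For reference, a rough sketch of that SUBST workaround (the drive letter and folder are placeholders only):
## Map a deep folder to a short drive letter, scan it, then remove the mapping again.
& subst.exe X: "E:\Some\Very\Deep\Folder"
Get-ChildItem X:\ -Recurse | Sort-Object Length -Descending | Select-Object -First 50 FullName, Length
& subst.exe X: /D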
A couple of suggestions
Exclude small files -- in this parameter example, a 1MB minimum greatly reduced the number of output lines produced by Robocopy in my test. If you know your file system well, you might be able to increase that minimum:
/min:1000000



Sort once -- instead of sorting your list every time you add an item, sort all the items and take the top N.
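A minimal sketch of that "sort once" variant, reusing the same robocopy output and tab parsing as the script above (the source path and /min value are placeholders):
$Top = 50
& robocopy.exe E:\Data C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /min:1000000 /r:0 | ForEach-Object {
	$Split = $_.Split("`t")
	## Keep only lines that actually contain a numeric size column.
	If ($Split.Count -gt 4 -and $Split[3].Trim() -match '^\d+$') {
		New-Object -TypeName PSObject -Property @{"FullName" = $Split[4].Trim(); "Length" = [int64]$Split[3].Trim()}
	}
} | Sort-Object -Property Length -Descending | Select-Object -First $Top
With /min in place the number of objects to sort stays manageable; without it, every file entry is held in memory at once.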
compdigit44,
with the /L argument, robocopy will only list the files - really. All it will do is generate a file list.

aikimark,
the /min is a nice touch.
The "sort once [all the items]" is exactly what I wanted to avoid. Not knowing anything about the server resources and its file structure, I find it safer to invest a bit of CPU into sorting (not that sorting 50 elements is really that taxing) than to first collect potentially gigabytes of data in RAM.
Here's a version that includes the minimum size suggested by aikimark, and more importantly, fixes a sizing issue (sorry); [int] might not be big enough. In addition, it gives you LastWriteTime and size in MB.
[CmdletBinding()]
Param(
	[string]$Path = $(Get-Location -PSProvider Filesystem),
	[uint32]$Top = 50,
	[string]$MinimumSize = "1MB"
)
$List = 1..$Top | ForEach-Object {New-Object -TypeName PSObject -Property @{"Length" = 0}}
If (($Min = ([int64]1 * $MinimumSize.Replace(" ", ""))) -isnot [int64]) {
	"Unable to parse '$($MinimumSize)' to an integer." | Write-Error
	Exit 1
}
## Not collecting LastWriteTime yet; it'll be in the same field as the Length, which would require an additional Split().
& robocopy.exe $Path C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /min:$Min /r:0 | ForEach-Object {
	If ($_) {
		$Split = $_.Split("`t")
		$Length = [int64]$Split[3].Trim()
		If ($Length -gt $List[0].Length) {
			$List[0] = $Split[4].Trim() | Select-Object -Property @{Name="FullName"; Expression={$_};}, @{Name="Length"; Expression={$Length}}
			$List = $List | Sort-Object -Property Length
		}
	}
}
$List | Where-Object {$_.Length -gt 0} | ForEach-Object {
	$Folder = Split-Path -Path $_.FullName -Parent
	$File = Split-Path -Path $_.FullName -Leaf
	$Line = & robocopy.exe $Folder C:\Dummy_Must_Not_Exist $File /L /s /nc /njh /njs /ndl /ns /ts
	$_ | Select-Object -Property `
		*,
		@{Name="SizeMB"; Expression={"{0:N3}" -f ($_.Length / 1MB)}},
		@{Name="LastWriteTime"; Expression={[DateTime]($Line.Split("`t")[4].Trim())}}
}

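Usage stays the same as in the earlier example, just with the new parameter available, e.g. (script and folder names are placeholders as before):
$Top = .\Whatever.ps1 -Path E:\Wherever -Top 50 -MinimumSize "100MB"
$Top | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation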

Every time you find a file that is larger than the minimum in your list, you are performing the sort.

You could go through a second round of filtering after the Robocopy /min:### process.  
1. Read the results and look for the largest and smallest values.  Alternatively, you could do a frequency analysis of the file sizes (a more accurate approach).    
2. With the information/data from step 1, you can then filter the Robocopy output to make sorting a simple, non-system-stressing operation.

===================
For the frequency analysis, use a 10x14 array to store file counts indexed by the leading (mantissa) digit and the exponent (order of magnitude).

A simpler frequency analysis can be made of the length of the string numbers, ignoring the mantissa data.
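As an illustration of that idea, a hypothetical sketch that buckets the sizes reported by robocopy by leading digit and digit count (the source path is a placeholder; the tab layout is the one the scripts above already assume):
$Freq = New-Object 'int[,]' 10, 15	## rows: leading digit 0-9, columns: number of digits - 1
& robocopy.exe E:\Data C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /r:0 | ForEach-Object {
	$Size = $_.Split("`t")[3]
	If ($Size -and $Size.Trim() -match '^\d+$') {
		$Size = $Size.Trim()
		$Digit = [int][string]$Size[0]
		$Magnitude = $Size.Length - 1
		$Freq[$Digit, $Magnitude] = $Freq[$Digit, $Magnitude] + 1
	}
}
## Example: how many files have 7-digit sizes (roughly 1 MB to 10 MB)?
(0..9 | ForEach-Object { $Freq[$_, 6] } | Measure-Object -Sum).Sum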

A simpler approach to the initial filtering would be to start with a large /min:### value and decrease it until you get more than 50 (or N) files. Save that value for each volume for the next run (a rough sketch follows the example below).

Example:
In my test, I ran Robocopy against a directory tree with 31000 files. The resulting list output was 4MB. Applying /min:1000000 reduced the file to 30KB with 390 lines.
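A hedged sketch of that "start large and decrease" approach (the path, starting value, and halving step are arbitrary placeholders):
$Top = 50
$Min = [int64]10GB
Do {
	## List only files at or above the current minimum; lines without tabs (e.g. error messages) are dropped.
	$Files = & robocopy.exe E:\Data C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /min:$Min /r:0 |
		Where-Object { $_ -match "`t" }
	$Count = ($Files | Measure-Object).Count
	If ($Count -lt $Top) { $Min = [int64]($Min / 2) }
} Until ($Count -ge $Top -or $Min -lt 1MB)
"Keeping /min:$Min for this volume ($Count candidate files)"
The final $Min could then be stored per volume and reused as the starting point for the next run.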
Mark, there are only two practical choices:
Either you ignore the amount of data, e.g. because it is of no significance anymore (filtered to something reasonable),
or you sort each time.

Everything else is sorting theory only, without any practical value for this case. Applying statistical methods makes it complex, not simpler.
The "sort on change" approach will scale well, as the probability of already having the top n values in the list increases with each file found, and then no sort is done.

A sort of 50 objects (with an average of maybe 250 Bytes per entry) should be a very fast operation, without much CPU cost. The more objects to keep, the more cost, that is correct - with O(n^2), if I'm correct.
@Q

O(N * log2(N))

I did a back-of-the-napkin estimate, based on my earlier test. If I scale up my 8.5GB tree to 17TB, I might expect a concomitant 2000-fold increase in the number of files. So, my 30,900-file problem would scale up to a 61,800,000-file problem. Sure, sorting 50 items isn't too terribly expensive, but we don't know the distribution of the files, so we can't eliminate the possibility that we would be doing over 60M sorting operations on our little list.
I ran the script posted earlier and get a message that states "Cannot call method on null value expression", yet it does show some files.

Also, I noticed the FullName path shows a whole bunch of dots at the end. If the script exported to a CSV, could we see the full file path? What does the Length column mean? Stupid question: how can I be sure that robocopy is not moving any files? Sorry, just paranoid...



how can I be sure that robocopy is not moving any files?
because of the /L command line switch
Thanks... Just wanted to make sure, since the Microsoft site uses a lowercase l and not an uppercase one. Do the other switches just control logging?

What about the error message I am getting, or the full file name not being listed? Would it display the full file name if exported to CSV? What would need to be changed to export it to a CSV?
shouldn't be an issue, since the long path names are coming from Robocopy and not from within PS
ASKER CERTIFIED SOLUTION
oBdA

Stupid question...

I see you are saving the script to a .ps1 file for PowerShell, then using the -Path parameter. Isn't this redundant?

\Whatever.ps1 -Path E:\Wherever | Export-Csv -Path C:\Wherever\Top.csv -NoTypeInformation

Also, where do I specify which volume to scan?
Sorry, I can't quite follow you.
The script accepts three arguments, all optional:
-Path: The path to the folder in which to start the report; default is the current folder (which is not necessarily the script's).
-Top: the number of files to collect in the report; default is 50.
-MinimumSize: the minimum file size in bytes to consider worth checking; this accepts (as you can see in the default value) values like "1MB" or "100KB" or "10TB" as well.
Thanks, but how does the script know which drive to scan?
With the -Path argument, the same way Get-ChildItem expects the path to list as argument.
Save the script somewhere, for example as C:\Temp\Get-TopFiles.ps1.
Then open a PS console (preferably as Administrator), and if you want to scan for example the folder "E:\BigData", enter
$Top = C:\Temp\Get-TopFiles.ps1 -Path E:\BigData


Do not forget the "$Top ="; it collects the objects returned so that you have more to work with than some console output that will be truncated.
For further processing of that variable and how to export it to csv, check my examples above http:#a41455680.
In a test I ran, the Robocopy output might contain error messages like this:
2016/02/09 10:18:50 ERROR 5 (0x00000005) Scanning Source Directory c:\users\aikimark\Templates\
Access is denied.



So, you will need to check that the file length column you parse is numeric before you add the item to your list.

Alternatively, you might check that the split line result has more than one item.
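A minimal sketch of such a guard, filtering the robocopy output before it is parsed (the source path is a placeholder; the tab layout is the same one the scripts above already expect):
& robocopy.exe E:\Data C:\Dummy_Must_Not_Exist *.* /L /s /nc /njh /njs /ndl /fp /bytes /r:0 | Where-Object {
	$Split = $_.Split("`t")
	## Keep only lines with a numeric size column; "ERROR 5 ..." lines are filtered out.
	$Split.Count -gt 4 -and $Split[3].Trim() -match '^\d+$'
}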
That's already been taken care of in the latest version. The error lines don't contain tabs, so access to the split array will fail and end up in the Catch.
First off, I wanted to thank everyone for their help. I have been testing the script and it is working great; you guys are grand masters at this...

Right now my goal is to run a scheduled task to scan each of my volumes daily, overwriting the same CSV names. Then, after all files are done, run a final script to send all the CSVs as an email attachment. Is this hard to do?

Also, when I run the script, what does the Length column in the CSV represent?
Length is the file size in bytes (that is the way PowerShell fills the file object). We would not have to keep it, as the script adds SizeMB, but it is not a bad idea to see the "raw" data either.

The "send all CSV" part isn't difficult:
Send-MailMessage -Subject "Daily Report" -From me@domain.com -To you@domain.com -SmtpServer mail.domain.com -Attachments (get-childitem c:\temp\* -include *.csv | select -Expand FullName)


or
get-childitem c:\temp\* -include *.csv | select -Expand Fullname | Send-MailMessage -Subject "Daily Report" -From me@domain.com -To you@domain.com -SmtpServer mail.domain.com

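For the scheduled-task part, here is a rough sketch of a daily wrapper tying the pieces above together (the volumes, output folder, script location, and mail settings are placeholders to adapt):
## Report each volume, overwrite its CSV, then mail all CSVs in one message.
$Volumes = "D:\", "E:\", "F:\"
$OutDir  = "C:\Reports"
ForEach ($Vol in $Volumes) {
	$Csv = Join-Path $OutDir ("Top50_" + $Vol.Substring(0, 1) + ".csv")
	& C:\Temp\Get-TopFiles.ps1 -Path $Vol -Top 50 | Export-Csv -Path $Csv -NoTypeInformation
}
Send-MailMessage -Subject "Daily Top 50 Report" -From me@domain.com -To you@domain.com -SmtpServer mail.domain.com -Attachments (Get-ChildItem "$OutDir\*.csv" | Select-Object -ExpandProperty FullName)
A wrapper like that can then be run as the action of a daily scheduled task (powershell.exe -File <path to the wrapper>).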