How can I improve the speed and memory usage of calculating the sizes of the N largest files?

I'm computing the total number of bytes of the 32 largest files in a folder:

$big32 = Get-ChildItem c:\temp -Recurse |
    Sort-Object Length -Descending |
    Select-Object -First 32 |
    Measure-Object -Property Length -Sum

$big32.sum /1gb

However, it runs very slowly. We have about 10 TB of data in 1.4 million files.

A uj5u.com user answered:

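The following implements improvements using only PowerShell cmdlets. Using System.IO.Directory.EnumerateFiles() as a basis, as suggested by this answer, might give another performance improvement, but you should do your own measurements to compare.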

(Get-ChildItem c:\temp -Recurse -File).ForEach('Length') | 
    Sort-Object -Descending -Top 32 | 
    Measure-Object -Sum

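This should reduce memory consumption considerably, as it only sorts an array of numbers instead of an array of FileInfo objects. It may also be somewhat faster thanks to better caching (an array of numbers is stored in a contiguous, cache-friendly block of memory, whereas an array of objects stores only the references contiguously, while the objects themselves can be scattered all over memory).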

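Note the use of .ForEach('Length') instead of just .Length, because of the ambiguity of member enumeration: an array has a Length property of its own, so .Length on the result would return the number of elements rather than each file's length.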
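To make the ambiguity concrete, here is a minimal sketch. It uses three strings as stand-ins for the FileInfo objects (an assumption made purely so the snippet runs without touching the file system); strings, like FileInfo, expose a Length property.

```powershell
# Three strings stand in for FileInfo objects; both types expose a Length property
$items = 'a', 'bbb', 'cc'

# The array has its own Length property, so this yields the element count
$arrayLength = $items.Length

# ForEach('Length') reads the Length property of each element instead
$elementLengths = $items.ForEach('Length')

"Array length:    $arrayLength"
"Element lengths: $($elementLengths -join ', ')"
```

With a real Get-ChildItem call the same thing happens: when several files are returned, .Length reports how many there are, not how big each one is.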

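By using the -Top parameter of Sort-Object (available in PowerShell 6 and later), we can get rid of the Select-Object cmdlet, further reducing pipeline overhead.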

A uj5u.com user answered:

I can think of some improvements, especially to memory usage, but the following should be faster than Get-ChildItem:

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [PSCustomObject]@{
            filename = $_
            length = [System.IO.FileInfo]::New($_).Length
        }
    } | 
    Sort-Object length -Descending | 
    Select-Object -First 32

Edit

I would look into implementing an implicit heap to reduce memory usage without hurting performance (possibly even improving it... to be tested).
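One way the implicit-heap idea could look is sketched below. This is not the answerer's code: Select-Largest is a hypothetical function name, and the sketch assumes the input is a stream of plain Int64 lengths. It keeps at most $Count values in an array-backed min-heap, so the smallest kept value always sits at index 0 and can be replaced cheaply when a larger value arrives.

```powershell
# Hypothetical helper (not from the answer): keeps the $Count largest values seen on the pipeline
function Select-Largest {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [long] $Value,

        [ValidateRange(1, [int]::MaxValue)]
        [int] $Count = 32
    )

    begin {
        # Implicit heap: the children of index i live at 2i+1 and 2i+2,
        # and $heap[0] is always the smallest value currently kept
        $heap = [long[]]::new($Count)
        $size = 0
    }

    process {
        if ($size -lt $Count) {
            # Heap not full yet: append the value, then sift it up
            $i = $size
            $heap[$i] = $Value
            $size++
            while ($i -gt 0) {
                $parent = ($i - 1) -shr 1
                if ($heap[$parent] -le $heap[$i]) { break }
                $temp = $heap[$parent]; $heap[$parent] = $heap[$i]; $heap[$i] = $temp
                $i = $parent
            }
        }
        elseif ($Value -gt $heap[0]) {
            # Larger than the smallest kept value: replace the root, then sift it down
            $heap[0] = $Value
            $i = 0
            while ($true) {
                $left = 2 * $i + 1
                $right = $left + 1
                $smallest = $i
                if ($left -lt $size -and $heap[$left] -lt $heap[$smallest]) { $smallest = $left }
                if ($right -lt $size -and $heap[$right] -lt $heap[$smallest]) { $smallest = $right }
                if ($smallest -eq $i) { break }
                $temp = $heap[$smallest]; $heap[$smallest] = $heap[$i]; $heap[$i] = $temp
                $i = $smallest
            }
        }
    }

    end {
        # Emit the kept values, largest first
        if ($size -gt 0) {
            $heap[0..($size - 1)] | Sort-Object -Descending
        }
    }
}

# Small in-memory check with made-up numbers
$top = 5, 1, 9, 3, 7, 9, 2 | Select-Largest -Count 3
$top -join ', '
```

For the original problem, the file lengths produced by the EnumerateFiles pipeline could be piped into such a function and then into Measure-Object -Sum. Whether that is actually faster than Sort-Object is untested here, since the heap logic runs as interpreted PowerShell.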

Edit 2

If the file names aren't needed, the easiest memory gain is to not include them in the result.

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [System.IO.FileInfo]::New($_).Length
    } | 
    Sort-Object -Descending | 
    Select-Object -First 32

A uj5u.com user answered:

First, if you're going to use Get-ChildItem then you should pass the -File switch parameter so that [System.IO.DirectoryInfo] instances never enter the pipeline.

Second, you're not passing the -Force switch parameter to Get-ChildItem, so any hidden files in that directory structure won't be retrieved.

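Third, note that your code retrieves the 32 largest files, not the files with the 32 largest lengths. That is, if files 31, 32, and 33 all have the same length, then file 33 will be arbitrarily excluded from the final count. If that distinction matters to you, you could rewrite your code like this...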

$filesByLength = Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\' |
    Group-Object -AsHashTable -Property Length
$big32 = $filesByLength.Keys |
    Sort-Object -Descending |
    Select-Object -First 32 |
    ForEach-Object -Process { $filesByLength[$_] } |
    Measure-Object -Property Length -Sum

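$filesByLength is a [Hashtable] that maps from a length to the file(s) with that length. The Keys property contains all of the unique lengths of all of the retrieved files, so we take the 32 largest keys/lengths and use each one to send all the files of that length down the pipeline.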
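The difference between "the N largest files" and "the files with the N largest lengths" can be seen on a tiny in-memory data set. The sketch below uses made-up objects with Name and Length properties instead of real files (an assumption, so it runs anywhere) and N = 2.

```powershell
# Made-up stand-ins for files; 'b' and 'c' tie on length
$files = @(
    [pscustomobject]@{ Name = 'a'; Length = 300 }
    [pscustomobject]@{ Name = 'b'; Length = 200 }
    [pscustomobject]@{ Name = 'c'; Length = 200 }
    [pscustomobject]@{ Name = 'd'; Length = 100 }
)

# The 2 largest files: one of the two tied files is dropped arbitrarily
$largestFiles = $files |
    Sort-Object -Property Length -Descending |
    Select-Object -First 2

# The files with the 2 largest lengths: both tied files are kept
$byLength = $files | Group-Object -AsHashTable -Property Length
$filesWithLargestLengths = $byLength.Keys |
    Sort-Object -Descending |
    Select-Object -First 2 |
    ForEach-Object -Process { $byLength[$_] }

$sumOfLargestFiles = ($largestFiles | Measure-Object -Property Length -Sum).Sum
$sumOfLargestLengths = ($filesWithLargestLengths | Measure-Object -Property Length -Sum).Sum

"2 largest files sum to $sumOfLargestFiles; files with the 2 largest lengths sum to $sumOfLargestLengths"
```

The first sum counts two files and the second counts three, which is exactly the tie behaviour described in the third point.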

Most importantly, sorting the retrieved files to find the largest ones is problematic for a couple of reasons:

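- Sorting cannot begin until all of the input data is available, which means that at that point in time all 1.4 million [System.IO.FileInfo] instances will exist in memory.
  - I'm not sure how Sort-Object buffers the incoming pipeline data, but I imagine it would be some kind of list that doubles in size every time it needs more capacity, leaving further garbage in memory to be cleaned up.
- Each of the 1.4 million [System.IO.FileInfo] instances will be accessed a second time to get its Length property, all while whatever sorting manipulations (depending on the algorithm Sort-Object uses) are occurring too.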

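Since we only care about the 32 largest files/lengths out of 1.4 million files, what if we only kept track of those 32 instead of all 1.4 million? Consider the case where we only want to find the single largest file...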

$largestFileLength = 0
$largestFile = $null

foreach ($file in Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\')
{
    # Track the largest length in a separate variable to avoid two comparisons...
    #     if ($largestFile -eq $null -or $file.Length -gt $largestFile.Length)
    if ($file.Length -gt $largestFileLength)
    {
        $largestFileLength = $file.Length
        $largestFile = $file
    }
}

Write-Host -Object "The largest file is named ""$($largestFile.Name)"" and has length $largestFileLength."

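As opposed to Get-ChildItem ... | Sort-Object -Property Length -Descending | Select-Object -First 1, this has the advantage of only one [FileInfo] object being "in flight" at a time, and the complete set of [System.IO.FileInfo]s is enumerated only once. Now all we need to do is take the same approach but expand it from 1 file/length "slot" to 32...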

$basePath = 'C:\Temp\'
$lengthsToKeep = 32
$includeZeroLengthFiles = $false

$listType = 'System.Collections.Generic.List[System.IO.FileInfo]'
# A SortedDictionary[,] could be used instead to avoid having to fully enumerate the Keys
# property to find the new minimum length, but add/remove/retrieve performance is worse
$dictionaryType = "System.Collections.Generic.Dictionary[System.Int64, $listType]"

# Create a dictionary pre-sized to the maximum number of lengths to keep
$filesByLength = New-Object -TypeName $dictionaryType -ArgumentList $lengthsToKeep

# Cache the minimum length currently being kept
$minimumKeptLength = -1L

Get-ChildItem -File -Force -Recurse -Path $basePath |
    ForEach-Object -Process {
        if ($_.Length -gt 0 -or $includeZeroLengthFiles)
        {
            $list = $null
            if ($filesByLength.TryGetValue($_.Length, [ref] $list))
            {
                # The current file's length is already being kept
                # Add the current file to the existing list for this length
                $list.Add($_)
            }
            else
            {
                # The current file's length is not being kept

                if ($filesByLength.Count -lt $lengthsToKeep)
                {
                    # There are still available slots to keep more lengths

                    $list = New-Object -TypeName $listType

                    # The current file's length will occupy an empty slot of kept lengths
                }
                elseif ($_.Length -gt $minimumKeptLength)
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is large enough to keep

                    # Get the list for the minimum length
                    $list = $filesByLength[$minimumKeptLength]

                    # Remove the minimum length to make room for the current length
                    $filesByLength.Remove($minimumKeptLength) |
                        Out-Null

                    # Reuse the list for the now-removed minimum length instead of allocating a new one
                    $list.Clear()

                    # The current file's length will occupy the newly-vacated slot of kept lengths
                }
                else
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is too small to keep
                    return
                }
                $list.Add($_)

                $filesByLength.Add($_.Length, $list)
                $minimumKeptLength = ($filesByLength.Keys | Measure-Object -Minimum).Minimum
            }
        }
    }

# Unwrap the files in each by-length list
foreach ($list in $filesByLength.Values)
{
    foreach ($file in $list)
    {
        $file
    }
}

I went with the approach, described above, of retrieving the files with the 32 largest lengths. A [Dictionary[Int64, List[FileInfo]]] is used to track those 32 largest lengths and the corresponding files with each length. For each input file, we first check if its length is among the largest so far and, if so, add the file to the existing [List[FileInfo]] for that length. Otherwise, if there's still room in the dictionary we can unconditionally add the input file and its length, or if the input file is bigger than the smallest tracked length we can remove that smallest length and add in its place the input file and its length. Once there are no more input files we output all of the [FileInfo] objects from all of the [List[FileInfo]]s in the [Dictionary[Int64, List[FileInfo]]].
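For comparison, the SortedDictionary alternative mentioned in the code comments could look roughly like the sketch below. It is not the code that was benchmarked: Select-ByLargestLength is a hypothetical function name, it accepts any object with a numeric Length property, and the sample data is made up. Because the keys stay in ascending order, the minimum kept length is simply the first key, at the cost of slower dictionary operations.

```powershell
# Hypothetical helper (not from the answer): keeps objects having the N largest distinct lengths
function Select-ByLargestLength {
    param(
        # Any object exposing a numeric Length property, e.g. [System.IO.FileInfo]
        [Parameter(Mandatory, ValueFromPipeline)]
        [object] $InputObject,

        [ValidateRange(1, [int]::MaxValue)]
        [int] $LengthsToKeep = 32
    )

    begin {
        # Keys are kept in ascending order, so the first key is the minimum kept length
        $kept = [System.Collections.Generic.SortedDictionary[long, System.Collections.Generic.List[object]]]::new()
    }

    process {
        $length = [long] $InputObject.Length
        $list = $null

        if ($kept.TryGetValue($length, [ref] $list)) {
            # This length is already being kept: add the object to its list
            $list.Add($InputObject)
            return
        }

        if ($kept.Count -ge $LengthsToKeep) {
            # No free slot: look up the smallest kept length (the first key)
            $minimum = $null
            foreach ($key in $kept.Keys) { $minimum = $key; break }

            # Too small to displace anything
            if ($length -le $minimum) { return }

            $null = $kept.Remove($minimum)
        }

        $list = [System.Collections.Generic.List[object]]::new()
        $list.Add($InputObject)
        $kept.Add($length, $list)
    }

    end {
        # Unwrap the objects in each by-length list, smallest length first
        foreach ($list in $kept.Values) { $list }
    }
}

# Made-up stand-ins for files; 'b' and 'c' tie on length
$sample = @(
    [pscustomobject]@{ Name = 'b'; Length = 200 }
    [pscustomobject]@{ Name = 'd'; Length = 100 }
    [pscustomobject]@{ Name = 'a'; Length = 300 }
    [pscustomobject]@{ Name = 'c'; Length = 200 }
    [pscustomobject]@{ Name = 'e'; Length = 400 }
)
$result = $sample | Select-ByLargestLength -LengthsToKeep 3
($result | Measure-Object -Property Length -Sum).Sum
```

Piping the output into Measure-Object -Property Length -Sum, as in the last line, gives the total the question asked for.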

I ran this simple benchmarking template...

1..5 |
    ForEach-Object -Process {
        [GC]::Collect()

        return Measure-Command -Expression {
            # Code to test
        }
    } | Measure-Object -Property 'TotalSeconds' -Minimum -Maximum -Average

...on PowerShell 7.2 against my $Env:WinDir directory (325,000 files), with these results:

`# Code to test`	Minimum (s)	Maximum (s)	Average (s)	Memory usage*
`Get-ChildItem -File -Force -Recurse -Path $Env:WinDir`	69.7240896	79.727841	72.81731518	260 MB
Get files of `$Env:WinDir` with the 32 largest lengths using `-AsHashtable`, `Sort-Object`	82.7488729	83.5245153	83.04068032	1 GB
Get files of `$Env:WinDir` with the 32 largest lengths using a dictionary of by-length lists	81.6003697	82.7035483	82.15654538	235 MB

* As seen in Task Manager → Details tab → Memory (active private working set) column

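I'm a little disappointed that my solution is only about 1% faster than the code using the Keys of a [Hashtable], but perhaps grouping the files with a compiled cmdlet versus not grouping or sorting them but running more (interpreted) PowerShell code is a wash. Still, the difference in memory usage is significant, though I can't explain why the Get-ChildItem call that simply enumerates all files ended up using a bit more memory than my solution.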
