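I'm trying to parse a very large log file consisting of space-delimited text across about 16 fields. Unfortunately, the application logs a blank line between each legitimate line (needlessly doubling the number of lines I have to process). It also causes fields to shift, because it uses a space both as the delimiter and to represent empty fields. I couldn't get around this in LogParser. Fortunately, PowerShell lets me reference fields from the end of the array as well, which makes it easier to get at the later fields affected by the shift.
After some testing with a smaller sample file, I determined that processing line by line while the file streams through Get-Content is slower than reading the file completely with Get-Content -ReadCount 0 and then processing it from memory. The reading part is relatively fast (<1min).
The problem comes when processing each line, even though it is already in memory. A 75MB file with 561178 legitimate lines of data (not counting the blank lines) takes hours.
I'm not doing much in the code itself. I'm doing the following: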
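- Splitting each line using whitespace as the delimiter
- One of the fields is an IP address that I reverse-resolve via DNS, which is obviously going to be slow. So I wrapped this in more code that builds an in-memory ArrayList cache of previously resolved IPs and pulls from it when possible. The IPs are largely the same, so after a few hundred lines, resolution should no longer be an issue.
- Saving the array elements I need into a PSCustomObject
- Adding the PSCustomObject to an ArrayList for later use.
- During the loop I track how many lines I've processed and output that in a progress bar (I know this adds extra time, but I'm not sure how much). I really want to know the progress.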
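All in all, it processes about 30-40 lines per second, which is obviously not very fast.
Can someone suggest alternative methods/object types to accomplish my goal and speed this up significantly?
Below are some samples of the log showing the field shift (note this is a Windows DNS debug log), followed by the code.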
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)
NOTE: the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081 DR NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's " Q".
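The shift described in the note above can be checked directly on the two sample lines. The following is a minimal sketch, not part of the original script: the variable names `$samples`, `$fields`, and `$parsed` are made up for illustration, and it assumes whitespace runs are collapsed by splitting on `\s+`.

```powershell
# Sample lines copied from the log excerpt above.
$samples = @(
    '10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)'
    '10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)'
)

$parsed = foreach ($line in $samples) {
    # Split on one or more whitespace characters.
    $fields = $line -split '\s+'

    [PSCustomObject]@{
        FieldCount = $fields.Count   # differs between the two lines because of the flags block
        ClientIP   = $fields[8]      # fixed position counted from the start
        QueryType  = $fields[-2]     # counted from the end, so unaffected by the shift
        QueryName  = ($fields[-1] -replace '\(\d+\)', '.') -replace '^\.', ''
    }
}

$parsed | Format-Table
```

Both lines give the same ClientIP at index 8 and the right QueryType at index -2, even though their field counts differ.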
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()

# Initialize log iteration counter
$i = 1

# Get Log data. Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length

# Process each log entry
$LogData | ForEach-Object {
    $PercentComplete = [math]::Round(($i/$LogDataTotalLines * 100))
    Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete

    # Split line using space, including sequential spaces, as delimiter.
    # NOTE: Due to how app logs events, some fields may be blank, leading the split to yield a different number of columns. Fortunately the fields we desire
    # are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
    $temp = $_ -Split '\s+'

    # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
    If ($DNSCache.IP -notcontains $temp[8]) {
        $DNSEntry = [PSCustomObject]@{
            IP      = $temp[8]
            DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
        }
        # Add DNSEntry to DNSCache collection
        $DNSCache.Add($DNSEntry) | Out-Null
        # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE: value could be blank.
        $ResolvedDNSName = $DNSEntry.DNSName
    } Else {
        # DNSCache contains resolved IP already. Find and use it.
        $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
    }

    $LogEntry = [PSCustomObject]@{
        Datetime      = $temp[0] + " " + $temp[1] + " " + $temp[2] # Combines first 3 fields Date, Time, AM/PM
        ClientIP      = $temp[8]
        ClientDNSName = $ResolvedDNSName
        QueryType     = $temp[-2] # Second to last entry of array
        QueryName     = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.","" # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
    }

    # Add LogEntry to LogEntries collection
    $LogEntries.Add($LogEntry) | Out-Null
    $i++
}
Answer:
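Here is a more optimized version you can try.
What changed?
- Removed Write-Progress, especially because it isn't known whether Windows PowerShell is being used. In PowerShell versions below 6, Write-Progress has a big performance impact.
- Changed $DNSCache to a generic Dictionary for fast lookups.
- Changed $LogEntries to a generic List.
- Switched from Get-Content to switch -Regex -File.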
$Logfile = 'C:\Temp\log.txt'
$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)

# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d+\/' {
        # Split line using space, including sequential spaces, as delimiter.
        # NOTE: Due to how app logs events, some fields may be blank, leading the split to yield a different number of columns. Fortunately the fields we desire
        # are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        $ip = [string] $temp[8]

        # Look up the IP in the cache; resolve and store it only on a miss
        $resolvedDNSRecord = $DNSCache[$ip]
        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP      = $ip
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }

        $LogEntry = [PSCustomObject]@{
            Datetime      = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2] # Combines first 3 fields Date, Time, AM/PM
            ClientIP      = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType     = $temp[-2] # Second to last entry of array
            QueryName     = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', '' # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
        }

        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}
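If a progress display is still wanted, one common way to keep its cost down is to update it on a timer rather than on every line. The sketch below is only an illustration under assumptions: the sample data, the 500 ms interval, and the names `$lines`, `$processed`, and `$progressUpdates` are all hypothetical, and the total line count is known here only because the data is already in memory.

```powershell
# Hypothetical sample data standing in for the filtered log lines.
$lines = 1..50000 | ForEach-Object { "line $_" }
$total = $lines.Count

$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$processed = 0
$progressUpdates = 0   # counts how often the progress bar was actually redrawn

foreach ($line in $lines) {
    # ... per-line parsing would go here ...
    $processed++

    # Redraw the progress bar only if at least 500 ms have passed since the last update.
    if ($stopwatch.ElapsedMilliseconds -ge 500) {
        $percent = [int](100 * $processed / $total)
        Write-Progress -Activity 'Processing log file' -Status "Processed $processed of $total entries ($percent%)" -PercentComplete $percent
        $progressUpdates++
        $stopwatch.Restart()
    }
}

# Remove the progress bar once the loop is done.
Write-Progress -Activity 'Processing log file' -Completed
```

With `switch -File` the total number of lines is not known up front, so in that setting the status text would probably have to show only the running count, or the lines would need to be counted in a separate pass first.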
If it's still slow, there is still the option of a multithreaded approach using Start-ThreadJob with chunked lines (for example, 10000 lines per job).
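A rough sketch of what the chunked approach could look like follows. It is not a tested drop-in replacement: it assumes Start-ThreadJob is available (it ships with PowerShell 7 and can be installed as the ThreadJob module on Windows PowerShell 5.1), it uses generated sample lines and a tiny chunk size so it runs quickly, and it leaves out the DNS lookup entirely.

```powershell
# Assumption: Start-ThreadJob is available in this session.
# Hypothetical sample data standing in for the filtered log lines.
$lines = 1..25 | ForEach-Object {
    "10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8081 DR NOERROR] A (4)host$_(3)ntp(3)org(0)"
}

$chunkSize = 10   # kept small for the demo; the answer suggests something like 10000

# Start one thread job per chunk of lines.
$jobs = for ($start = 0; $start -lt $lines.Count; $start += $chunkSize) {
    $end = [Math]::Min($start + $chunkSize, $lines.Count) - 1
    $chunk = $lines[$start..$end]

    # The leading comma wraps the chunk so it is passed as a single array argument.
    Start-ThreadJob -ArgumentList (, $chunk) -ScriptBlock {
        param([string[]] $chunk)
        foreach ($line in $chunk) {
            $fields = $line -split '\s+'
            [PSCustomObject]@{
                ClientIP  = $fields[8]
                QueryType = $fields[-2]
                QueryName = ($fields[-1] -replace '\(\d+\)', '.') -replace '^\.', ''
            }
        }
    }
}

# Wait for all jobs, collect their output, and clean the jobs up.
$results = $jobs | Receive-Job -Wait -AutoRemoveJob
```

Each job runs in its own runspace, so a plain Dictionary cache would not be shared between chunks. One option might be to resolve the distinct IPs once after the jobs finish, or to pass in a thread-safe collection such as a ConcurrentDictionary.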
