How do tshark and PowerShell redirection create a "bytecode" text file?

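OK, this is actually a problem I was able to solve, but I still don't understand why it exists in the first place.

I have been running tshark on network traffic with the goal of creating a txt or csv file containing the key information to use for machine learning. At first glance the file looks perfectly fine and is exactly what I imagined. In Python, however, I noticed some strange initial characters, and after applying the split operator I was suddenly dealing with bytecode.

My PowerShell script originally looked like this: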

$src = "G:\...\train_data\"
$dst = $src + "tsharked\"
Write-Output $dst

Get-ChildItem $src -Filter *.pcap | 
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta > $dst$filename.txt
}

I then tried to read this file in my Jupyter notebook:

directory = "G://.../train_data/tsharked/"
file = open(directory + "example.txt", "r")
for line in file.readlines():
    print(line)
    words = line.split(",")
    print(words)
    break

The result looks like this:

ÿþ134.169.109.51,134.169.109.25,543,0.000000000,0.000000000

['ÿþ1\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x005\x001\x00', '\x001\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x002\x005\x00', '\x005\x004\x003\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00\n']

When I open the text file in an editor, the special characters ÿþ do not appear. This is the first time I have ever seen them. What do they even mean here? In any case, I was only able to solve the problem by removing the output redirection from my PowerShell script:

$src = "G:\...\train_data\"
$dst = $src + "tsharked\"
Write-Output $dst

Get-ChildItem $src -Filter *.pcap | 
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    $out = tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta
    Set-Content -Path $dst$filename.txt -Value $out
}

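So my question is: how does output redirection in PowerShell manage to write some kind of byte output? As I understand it, it is just a redirection of the console output, hence the name. How can that not be a string?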

Answer:

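As of PowerShell 7.2, output from external programs is always decoded as text before further processing, which means that raw (byte) output can neither be passed on via | nor captured with >. See this answer for details.
PowerShell's > redirection operator is effectively an alias of Out-File, so Out-File's default character encoding applies.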

In Windows PowerShell, Out-File defaults to "Unicode" encoding, i.e. UTF-16LE:

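This encoding uses a BOM (byte-order mark) whose bytes, if interpreted individually as ANSI (Windows-1252) bytes, render as ÿþ, and it represents most characters as two-byte sequences [1]. For most characters in the Windows-1252 character set (itself a superset of ASCII), the second byte of each sequence is a NUL (0x0 byte), which is what you saw.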
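The effect can be reproduced without PowerShell. The following is a minimal Python sketch, assuming the file was written as UTF-16LE with a BOM and then read back as Windows-1252 (the usual default for open() on a Western-locale Windows system; other locales will differ). The sample line is taken from the question.

```python
# One line of tshark output, as shown in the question.
line = "134.169.109.51,134.169.109.25,543,0.000000000,0.000000000\n"

# Roughly what `>` in Windows PowerShell writes: the UTF-16LE BOM (FF FE)
# followed by two bytes per character.
raw = b"\xff\xfe" + line.encode("utf-16-le")

# Reading the same bytes as Windows-1252 turns the BOM into two visible
# characters and leaves a NUL after every ASCII character.
misread = raw.decode("cp1252")
words = misread.split(",")

print(repr(words[0]))
print(repr(words[2]))
```

The first field starts with ÿþ and every original character is followed by \x00, matching the list printed in the question.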

Fortunately, in PowerShell (Core) 7, all file-processing cmdlets now consistently default to BOM-less UTF-8.

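To use a different encoding, either call Out-File explicitly and use its -Encoding parameter or, as you have done, use Set-Content, which is generally preferable for performance reasons when the data is already text. In Windows PowerShell, Set-Content defaults to the system's ANSI code page rather than UTF-16LE, which is why the second script's output contains no BOM or NUL bytes.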
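If regenerating the files is not an option, the reading side can declare the encoding instead. This is a sketch of that alternative, not part of the original answer: it assumes the existing files really are UTF-16LE with a BOM, and the temporary file below is only a stand-in for one of them.

```python
import os
import tempfile

line = "134.169.109.51,134.169.109.25,543,0.000000000,0.000000000\n"

# Stand-in for a file produced by `>` in Windows PowerShell
# (UTF-16LE with a BOM); the path is purely illustrative.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(b"\xff\xfe" + line.encode("utf-16-le"))

# encoding="utf-16" reads the BOM, uses it to pick the byte order,
# and does not return it as part of the text.
with open(path, "r", encoding="utf-16") as f:
    fields = f.readline().rstrip("\n").split(",")

print(fields)
```

Passing an explicit encoding to open() also avoids depending on the platform's default, which is what caused the mismatch in the first place.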

[1] Every character needs at least two bytes; characters outside the so-called BMP (Basic Multilingual Plane) need a pair of two-byte sequences.

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qianduan/405513.html
