Using a string as bytes

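My problem is as follows:

I am reading a .csv generated by some software, and I am using Pandas to read it. Pandas reads the .csv correctly, but one of the columns stores byte sequences that represent vectors, and Pandas stores them as strings.

So I have data (a string) on which I want to use np.frombuffer() to get the proper vector. The problem is that the data is a string, so it is already encoded; when I use .encode() to convert it to bytes, the sequence is not the original one.

Example: the .csv contains \x00\x00, representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string, and when I try to process it, this happens: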

data = df.data[x] # With x any row.
type(data)

<class 'str'>

print(data)

\x00\x00

e_data = data.encode("latin1")
print(e_data)

b'\\x00\\x00'

v = np.frombuffer(e_data, np.uint8)
print(v)

array([ 92, 120,  48,  48,  92, 120,  48,  48], dtype=uint8)

I just want to get b'\x00\x00' from the data instead of b'\\x00\\x00'. I know this is an encoding mix-up that I have not been able to fix.

Is there any way to do this?

Thanks!

Answer:

The problem: you (apparently) have a string that contains literal backslash escape sequences, such as:

>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00

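From this, you want to create a corresponding bytes object, in which the backslash escape sequences are translated into the corresponding bytes.

Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may know, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to which Unicode code points.

However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation, and that the str side will have the corresponding Unicode characters: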

>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'

Applying .encode to the string reverses that process; so if you start with a backslash escape sequence, it re-escapes the backslash:

>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]

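As you can see, that is clearly not what we want.

We want to convert from bytes to str in order to do the backslash unescaping. But we have a str to start with, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the backslash unescaping. In both cases, we need each Unicode code point from 0 to 255 inclusive to correspond to a single byte with the same value.

The encoding we need for that task is called latin-1, also known as iso-8859-1.

For example: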

>>> r'\x00'.encode('latin-1')
b'\\x00'
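
As a quick check of that property, the following sketch round-trips all 256 byte values through latin-1. The variable names are illustrative only and are not part of the original answer.

```python
# Every byte value 0-255, in order.
all_bytes = bytes(range(256))

# Decoding as latin-1 maps each byte to the code point with the same value.
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding as latin-1 maps each code point back to the same single byte.
round_tripped = text.encode("latin-1")

print(code_points == list(range(256)))
print(round_tripped == all_bytes)
```

Both comparisons should print True, which is what makes latin-1 usable as a neutral bridge between str and bytes here.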

Thus, we can reason out the overall conversion:

>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'

As desired: our str with a literal backslash, a lowercase x and two zeros is converted to a bytes object containing a single zero byte.
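
The three-step conversion can be wrapped in a small helper and applied to the value from the question. This is only a sketch: the function name unescape_to_bytes is made up for illustration, and it assumes every character in the input fits in latin-1.

```python
def unescape_to_bytes(text: str) -> bytes:
    """Turn literal backslash escapes in `text` into the bytes they describe."""
    # str -> bytes, one byte per character (requires code points 0-255).
    as_bytes = text.encode("latin-1")
    # Interpret the backslash escape sequences, producing a str.
    unescaped = as_bytes.decode("unicode-escape")
    # str -> bytes again, one byte per code point.
    return unescaped.encode("latin-1")


raw = unescape_to_bytes(r"\x00\x00")
print(raw)
print(list(raw))
```

The resulting bytes object can then be passed to np.frombuffer(raw, dtype=np.uint8), as in the question. Note that the final step raises UnicodeEncodeError if an escape such as \u0100 decodes to a code point above 255, and the first step does the same if the input contains characters outside latin-1.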

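Alternatively: we can request that backslash escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this is not documented and is not really meant to be used that way; it is internal machinery used to implement the unicode-escape codec and possibly some other things.

If you want to expose yourself to the risk of that breaking in the future, it looks like: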

>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)

We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; there is no actual parameter to specify the encoding, for a decode method. Like I said - not meant for general use.

Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:

>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
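
If the software that produces the file can be changed, the writing side is just as simple. A minimal sketch, assuming the vector is available as a bytes object before it is written out (the sample values below are made up):

```python
# Hypothetical vector data as raw bytes.
original = bytes([0, 0, 255])

# bytes -> plain hex string, safe to store in a text column.
hex_text = original.hex()

# hex string -> bytes, recovering the exact original sequence.
restored = bytes.fromhex(hex_text)

print(hex_text)
print(restored == original)
```

Unlike the backslash-escaped form, this representation is unambiguous in both directions, so no codec tricks are needed when reading the column back.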
