Why is the result of pandas melt Fortran-contiguous rather than C-contiguous?

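I have run into some pandas behavior with melt that breaks my mental model of the function, and I am wondering whether someone can explain why this is sane, logical, or desirable behavior.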

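The snippet below melts a DataFrame and then converts the result to a numpy array. Since I am melting all columns, I expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column holding the corresponding column names (variable names). To my surprise, however, melt actually copies the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?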

import numpy as np
import pandas as pd

expected_flat = np.arange(100*3)
expected_full = expected_flat.reshape(100, 3)

# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]

test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)

# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat

flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()

# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert not np.allclose(flatten_melt_numpy, expected_flat)

# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)

Answer:

Making a frame from a pair of "series":

In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]: 
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]: 
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  ...
In [327]: arr.strides
Out[327]: (8, 24)

The resulting array is F_CONTIGUOUS.

If I make a frame from a 2d array, the values keep the same layout as the input, in this case order 'C':

In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]: 
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)

Creating it with order 'F' gives the same result as the first case:

In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=[
     ...: "a", "b"])
In [333]: df1
Out[333]: 
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
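The stride pairs in the sessions above can be read straight off the memory layout. The following is a minimal NumPy-only sketch, not part of the original session; the explicit dtype=np.int64 and the names c_arr and f_arr are assumptions made here so that the byte counts match the 8-byte integers shown above.

```python
import numpy as np

# Assumption: 8-byte integers, matching the int64 values in the session above.
c_arr = np.arange(1, 7, dtype=np.int64).reshape(3, 2)             # row-major, order "C"
f_arr = np.arange(1, 7, dtype=np.int64).reshape(3, 2, order="F")  # column-major, order "F"

# strides = (bytes to step to the next row, bytes to step to the next column)
print(c_arr.strides)  # (16, 8): a row holds 2 items, so the next row is 16 bytes away
print(f_arr.strides)  # (8, 24): a column holds 3 items, so the next column is 24 bytes away

print(c_arr.flags["C_CONTIGUOUS"], c_arr.flags["F_CONTIGUOUS"])  # True False
print(f_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])  # False True

# Transposing swaps the strides without copying, so the transpose of a
# C-contiguous array is F-contiguous.
print(c_arr.T.strides)                # (8, 16)
print(c_arr.T.flags["F_CONTIGUOUS"])  # True
```

A stride of 8 in the first position means that consecutive rows of one column sit next to each other in memory, which is the column-major layout seen in the first and third cases.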

melt

Going back to the frame created from order 'C':

In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]: 
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6

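Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the examples in the method's documentation show. I have not used pivot enough to know whether this is the natural interpretation of it.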
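To make the vertical concatenation concrete, here is a small sketch comparing the melted values with the two ways NumPy can flatten the same 2d array. The variable names are illustrative only, and the frame is the same order 'C' frame as in the session above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
melted_values = df.melt()["value"].to_numpy()

# Stacking the columns one after another by hand.
stacked = np.concatenate([df["a"].to_numpy(), df["b"].to_numpy()])

# Column-major flattening walks down column "a", then down column "b".
column_major = df.to_numpy().ravel(order="F")

# Row-major flattening (the default for ravel) walks across each row instead.
row_major = df.to_numpy().ravel()

print(melted_values)  # [1 3 5 2 4 6]
print(stacked)        # [1 3 5 2 4 6]
print(column_major)   # [1 3 5 2 4 6]
print(row_major)      # [1 2 3 4 5 6]
```

So the melted values match ravel(order="F") rather than the default ravel(), which is why reshaping them back needs order="F".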

Converting the melted frame to an array gives order 'F':

In [338]: df2.to_numpy()
Out[338]: 
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)

In df1 both columns are int dtype, and can be stored as a 2d array:

In [340]: df1.dtypes
Out[340]: 
a    int64
b    int64
dtype: object

The df2 columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object dtype array from them, and that array is order 'F':

In [341]: df2.dtypes
Out[341]: 
variable    object
value        int64
dtype: object

We get a hint of this storage from:

In [352]: df1._mgr
Out[352]: 
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]: 
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
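The effect of the mixed dtypes can also be checked through the public API, without reaching into the private _mgr attribute. This is only a sketch: the explicit dtype=np.int64 is an assumption added here so the output does not depend on the platform's default integer size, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(1, 7, dtype=np.int64).reshape(3, 2), columns=["a", "b"]
)
melted = df.melt()

# Converting the whole melted frame has to find one dtype that can hold both
# the string labels and the integers, so the result is an object array.
combined = melted.to_numpy()
print(combined.shape)  # (6, 2)
print(combined.dtype)  # object

# Converting only the numeric column keeps its integer dtype.
values = melted["value"].to_numpy()
print(values.dtype)  # int64
```

If only the numbers are needed, selecting the value column before calling to_numpy avoids the object array entirely.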

How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
