我遇到了一些melt破壞我對該功能的心理模型的熊貓行為,我想知道是否有人可以解釋為什么這是理智/合乎邏輯/理想的行為。
下面的代碼片段融合了一個資料幀,然后將結果轉換為一個 numpy 陣列。由于我正在融化所有列,因此我預計結果與會np.ndarray.ravel()做的類似。即,在資料中創建一維視圖并添加具有相應列名(變數名)的列。然而,令我驚訝的是,melt實際上復制了資料并將其重新排序為f-contigous。為什么 f 鄰接在這里是個好主意?
expected_flat = np.arange(100*3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
expected_flat.reshape(100, 3),
columns=["a", "b", "c"],
)
# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert np.allclose(flatten_melt_numpy, expected_flat) == False
# the confusing part is that the array is now F-contigous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
uj5u.com熱心網友回復:
從一對“系列”構造一個框架:
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
a b
0 1 4
1 2 5
2 3 6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
[2, 5],
[3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
結果陣列是 F_CONTIGUOUS。
如果我從二維陣列制作一個框架,則該值與輸入相同,在這種情況下順序為“C”:
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
a b
0 1 2
1 3 4
2 5 6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
使用命令 F 創建它,結果與第一種情況相同:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=[
...: "a", "b"])
In [333]: df1
Out[333]:
a b
0 1 4
1 2 5
2 3 6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
熔化
回到從訂單 C 創建的框架:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
variable value
0 a 1
1 a 3
2 a 5
3 b 2
4 b 4
5 b 6
請注意該value列是如何垂直連接“a”和“b”列的。這就是方法示例所顯示的。我沒有pivot足夠的資訊來知道這是否是對它的自然解釋。
使用順序“F”框架:
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
['a', 3],
['a', 5],
['b', 2],
['b', 4],
['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a int64
b int64
dtype: object
df2 columns are different, object (string) and int, so are stored as separate arrays. to_numpy constructs an object dtype array from them, but it is order 'F':
In [341]: df2.dtypes
Out[341]:
variable object
value int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/443069.html
