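Given a DataFrame df with a simple Index (not a MultiIndex), i.e. a two-dimensional real matrix with row and column names, and a boolean expression e over the elements of df, I want to get:
- the names and integer-based indices of the rows
- the names and integer-based indices of the columns
of all elements that satisfy e. The expression e is nothing special: I am interested in the rows/columns of the elements greater than a threshold.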
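After reading the documentation and a good number of questions and answers here, I wrote the code given below. It contains two solutions:
- One based on numpy. Basically, I extract the numbers from the DataFrame and treat them as a numpy array. This solution seems reasonable: given the basic nature of the task, the code is simple enough.
- One based on the methods provided by pandas. Even allowing that pandas is designed for more complex scenarios than plain numeric matrices, this solution seems too complicated for what I want to accomplish.
Setting up the data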
import numpy as np
import pandas as pd

n_rows, n_cols, v = 4, 5, 3
rows = ["r" + str(i) for i in range(n_rows)]
columns = ["c" + str(i) for i in range(n_cols)]
values = np.zeros((n_rows, n_cols), dtype=int)
# pick two random (row, column) positions and set them above the threshold v
ii = np.random.randint(n_rows, size=(2,))
jj = np.random.randint(n_cols, size=(2,))
poss = zip(ii, jj)
for pos in poss:
    print(f"target set at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
    values[pos] = v + 1
print(" === values ===")
print(values)
df = pd.DataFrame(values, index=rows, columns=columns)
print(" === df === ")
print(df)
With output:
target set at (2, 4) -> (r2, c4)
target set at (1, 0) -> (r1, c0)
 === values ===
[[0 0 0 0 0]
 [4 0 0 0 0]
 [0 0 0 0 4]
 [0 0 0 0 0]]
 === df ===
    c0  c1  c2  c3  c4
r0   0   0   0   0   0
r1   4   0   0   0   0
r2   0   0   0   0   4
r3   0   0   0   0   0
Solution with numpy
print("\n === USING NUMPY ===")
data = df.to_numpy()
# each row of `indexes` is a (row, column) pair of integer positions
indexes = np.argwhere(data > v)
for ind in indexes:
    print(f"(numpy) target found at {ind} -> ({rows[ind[0]]}, {columns[ind[1]]})")
With output:
=== USING NUMPY ===
(numpy) target found at [1 0] -> (r1, c0)
(numpy) target found at [2 4] -> (r2, c4)
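The loop above looks up one name per position. As a minimal sketch of a vectorized variant (not part of the original post), np.nonzero returns separate arrays of row and column positions, and those arrays can index df.index and df.columns directly. The small frame below is a hand-built stand-in for the question's df, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the question's df (targets at r1/c0 and r2/c4).
v = 3
values = np.zeros((4, 5), dtype=int)
values[1, 0] = v + 1
values[2, 4] = v + 1
df = pd.DataFrame(values,
                  index=[f"r{i}" for i in range(4)],
                  columns=[f"c{j}" for j in range(5)])

# Integer positions of every element satisfying the condition.
row_pos, col_pos = np.nonzero(df.to_numpy() > v)

# Index the axis labels with those positions to get the names.
row_names = df.index[row_pos].tolist()
col_names = df.columns[col_pos].tolist()

for i, j, r, c in zip(row_pos.tolist(), col_pos.tolist(), row_names, col_names):
    print(f"(numpy, vectorized) target found at ({i}, {j}) -> ({r}, {c})")
```

This keeps the numpy approach but reads the names from the DataFrame itself rather than from the separate rows and columns lists.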
Solution with pandas
print("\n === WITH PANDAS ===")
# select the rows with at least one column satisfying the condition
cond = (df > v).any(axis=1)
df2 = df[cond]
print(df2, "\n")
# stack: one entry per (row, column) pair
stacked = df2.stack()
print(stacked, "\n")
# filter (again!)
stacked2 = stacked.loc[stacked > v]
print("indexes in stacked:", stacked2.index.to_list(), "\n")
# get index (it is a MultiIndex at this point)
target_rows = [a for (a, _) in stacked2.index.to_list()]
target_cols = [b for (_, b) in stacked2.index.to_list()]
target_rows_idx = [df.index.get_loc(row_name) for row_name in target_rows]
target_cols_idx = [columns.index(col_name) for col_name in target_cols]
for pos in zip(target_rows_idx, target_cols_idx):
    print(f"(pandas) target found at {pos} -> ({rows[pos[0]]}, {columns[pos[1]]})")
With output:
=== WITH PANDAS ===
    c0  c1  c2  c3  c4
r1   4   0   0   0   0
r2   0   0   0   0   4

r1  c0    4
    c1    0
    c2    0
    c3    0
    c4    0
r2  c0    0
    c1    0
    c2    0
    c3    0
    c4    4
dtype: int64

indexes in stacked: [('r1', 'c0'), ('r2', 'c4')]
(pandas) target found at (1, 0) -> (r1, c0)
(pandas) target found at (2, 4) -> (r2, c4)
Is there a simpler way to write this code with pandas?
A uj5u.com user replied:
Since stack drops NaN values by default, we can mask out the values first and then stack (this avoids having to filter twice). Then just grab the index and use get_loc on both index and columns to convert the labels to integer positions:
stacked = df[df > v].stack()
label_idx = stacked.index.tolist()
integer_idx = [(df.index.get_loc(r), df.columns.get_loc(c))
               for r, c in label_idx]
for i, j in zip(integer_idx, label_idx):
    print(f'(pandas 2) target found at {i} -> {j}')
Output:
(pandas 2) target found at (0, 0) -> ('r0', 'c0')
(pandas 2) target found at (1, 4) -> ('r1', 'c4')
stacked:
r0 c0 4.0
r1 c4 4.0
dtype: float64
label_idx:
[('r0', 'c0'), ('r1', 'c4')]
integer_index:
[(0, 0), (1, 4)]
Reproducible with:
np.random.seed(22)
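One caveat: whether stack drops NaN depends on the pandas version. As far as I can tell, the newer stack implementation (opted into with future_stack=True from pandas 2.1) keeps NaN entries. The sketch below is a hedged variant, not the answerer's code: it calls dropna() explicitly so the result should not depend on that default, and it uses Index.get_indexer to convert all labels in one call, assuming the row and column labels are unique. The frame is a hand-built stand-in matching the output shown above.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in matching the answer's output (targets at r0/c0 and r1/c4).
v = 3
values = np.zeros((4, 5), dtype=int)
values[0, 0] = v + 1
values[1, 4] = v + 1
df = pd.DataFrame(values,
                  index=[f"r{i}" for i in range(4)],
                  columns=[f"c{j}" for j in range(5)])

# Mask, stack, then drop NaN explicitly instead of relying on stack's default.
stacked = df[df > v].stack().dropna()
label_idx = stacked.index.tolist()

# Convert all labels to integer positions at once (requires unique labels).
row_pos = df.index.get_indexer(stacked.index.get_level_values(0))
col_pos = df.columns.get_indexer(stacked.index.get_level_values(1))
integer_idx = list(zip(row_pos.tolist(), col_pos.tolist()))

for i, j in zip(integer_idx, label_idx):
    print(f"(pandas, explicit dropna) target found at {i} -> {j}")
```

On older pandas the extra dropna() is a no-op, so the snippet should behave the same either way.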
A uj5u.com user replied:
I would use pd.Series.items() (named iteritems() in older pandas; that alias was removed in pandas 2.0):
>>> [x for x, y in df.gt(3).stack().items() if y]
[('r1', 'c3'), ('r2', 'c3')]
For the integer indices:
>>> [(df.index.get_loc(a), df.columns.get_loc(b)) for (a, b), y in df.gt(3).stack().items() if y]
[(1, 3), (2, 3)]
>>>
df in this case:
>>> df
    c0  c1  c2  c3  c4
r0   0   0   0   0   0
r1   0   0   0   4   0
r2   0   0   0   4   0
r3   0   0   0   0   0
>>>
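The comprehensions above visit every element in a Python loop. A minimal sketch of a variant that leaves the filtering to pandas (an alternative, not the answerer's code): indexing a boolean Series by itself keeps only the True entries, so the remaining index holds exactly the matching (row, column) labels. The frame is a hand-built copy of the df shown above.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the df shown above (targets at r1/c3 and r2/c3).
values = np.zeros((4, 5), dtype=int)
values[1, 3] = 4
values[2, 3] = 4
df = pd.DataFrame(values,
                  index=[f"r{i}" for i in range(4)],
                  columns=[f"c{j}" for j in range(5)])

mask = df.gt(3).stack()   # boolean Series with a (row, column) MultiIndex
hits = mask[mask].index   # boolean indexing keeps only the True entries

labels = hits.tolist()
positions = [(df.index.get_loc(r), df.columns.get_loc(c)) for r, c in labels]

print(labels)
print(positions)
```

The Python-level loop now runs only over the matches rather than over all n_rows * n_cols elements.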
