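I am currently trying to parse an Excel file that contains some structured information.
When I read the Excel file into a DataFrame with df_original = pd.read_excel(filename, sheet_name=sheet), it looks like this: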
import numpy as np
import pandas as pd

df_original = pd.DataFrame({'Unnamed: 0':['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
'Unnamed: 1':['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
'Unnamed: 2':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Unnamed: 3':['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
'Unnamed: 4':['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
'Unnamed: 5':[np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
'Unnamed: 6':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
---- -------------- -------------- -------------- ----------------- -------------- -------------- --------------
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|---- -------------- -------------- -------------- ----------------- -------------- -------------- --------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
---- -------------- -------------- -------------- ----------------- -------------- -------------- --------------
This code finds the rows of interest and solves goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
---- -------------- -------------- ----------------- -------------- -------------- --------------
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|---- -------------- -------------- ----------------- -------------- -------------- --------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
---- -------------- -------------- ----------------- -------------- -------------- --------------
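What is the easiest way to solve goal 2? Basically, I want to find the columns that contain at least one value matching the regex pattern. The desired output would be [Unnamed: 3]. There may be some simple way to solve goals 1 and 2 at the same time. For example: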
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
---- ----------------- -------------- -------------- --------------
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|---- ----------------- -------------- -------------- --------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
---- ----------------- -------------- -------------- --------------
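Answer:
One option is to use xlsx_cells from pyjanitor. It reads each cell as one row, which gives you more freedom to manipulate the data. It may be convenient for your use case, and it is an alternative approach: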
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in the data:
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for the rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Find the values that are in the same rows as vals and whose column number is greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
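Note that between selects the whole row range from the first to the last key, not only the rows that actually hold a key. The following is a minimal sketch of a stricter variant based on isin. It runs on a small hand-built frame that only mimics the value/row/column layout of xlsx_cells; the data, including the "stray note" cell, is made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the long format returned by xlsx_cells
# (only the three columns used above; the values are illustrative).
cells = pd.DataFrame({
    "value": ["Jack", "Temperature (C)", 25, "Lena", "stray note", "Comment", "Looks OK"],
    "row": [7, 7, 7, 9, 9, 11, 11],
    "column": [1, 4, 6, 1, 6, 4, 6],
})

# Keys are the cells whose text starts with one of the known labels.
is_key = cells["value"].str.startswith(("Temperature", "Strength", "Comment"), na=False)
keys = cells.loc[is_key, ["value", "row", "column"]]

# .item() raises if the keys are spread over more than one column.
key_col = keys["column"].unique().item()
right_of_key = cells["column"].gt(key_col)

# Row range (as in the answer) versus exact key rows.
in_row_range = cells["row"].between(keys["row"].min(), keys["row"].max())
in_key_rows = cells["row"].isin(keys["row"])

loose = cells.loc[right_of_key & in_row_range, "value"].tolist()
strict = cells.loc[right_of_key & in_key_rows, "value"].tolist()
print(loose)
print(strict)
```

With the row range, the made-up cell in row 9 is picked up as well; with isin, only cells in rows that contain a key remain. For the sheet in the question both variants give the same result, because the rows in between have no values to the right of the key column.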
Merge vals and result to get the final output:
(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water
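If installing pyjanitor is not an option, a similar "one row per cell" table can be built from the DataFrame that pd.read_excel already returned. The sketch below is only an approximation of the idea, not of the xlsx_cells output: it uses melt, works on a hypothetical subset of the question's data, and the names cells, col_pos and key_pos are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the question's data.
df_original = pd.DataFrame({
    "Unnamed: 0": ["Jack", "Tom", np.nan, np.nan],
    "Unnamed: 3": ["Temperature (C)", "Strength", "Temperature (F)", "Comment"],
    "Unnamed: 4": [np.nan, "1500", np.nan, np.nan],
    "Unnamed: 5": [25, np.nan, 77, "Looks OK"],
    "Unnamed: 6": [np.nan, np.nan, np.nan, "Add water"],
})

# One row per non-empty cell: row label, column name, value.
cells = df_original.reset_index().melt(id_vars="index", var_name="column")
cells = cells.dropna(subset=["value"]).rename(columns={"index": "row"})
# Integer position of each column, so "to the right of" can be compared.
cells["col_pos"] = cells["column"].map(df_original.columns.get_loc)

pattern = r"[Tt]emperature|[Ss]tren|[Cc]omment"
is_key = cells["value"].str.contains(pattern, regex=True, na=False)
keys = (cells.loc[is_key, ["row", "col_pos", "value"]]
        .rename(columns={"value": "key", "col_pos": "key_pos"}))

# Pair every key with the cells to its right in the same row.
pairs = keys.merge(cells[["row", "col_pos", "value"]], on="row")
pairs = (pairs.loc[pairs["col_pos"] > pairs["key_pos"]]
         .sort_values(["row", "col_pos"])[["key", "row", "value"]]
         .reset_index(drop=True))
print(pairs)
```

The row numbers here are DataFrame index labels rather than Excel row numbers, so they differ from the xlsx_cells output by the header offset.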
Answer:
Try one of the following 2 options:
Option 1 (assuming there is no non-NaN data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
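Explanation
- First, we use df.stack to add the column names as a level to the index and to get all the data in a single column.
- Now we can apply Series.str.contains with r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value", i.e. this will be the first True, so we get back (5, 'Unnamed: 3'), stored in idx and col respectively.
- Now we know where to start our selection from the df, namely at index 5 and column Unnamed: 3. If we simply want all the data from here on (to the right and down), we can use df_original.loc[idx:, col:]. Finally, drop all remaining rows that contain only NaN values.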
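The steps above can be traced on a tiny frame. This is a minimal sketch with made-up data (the frame demo is not from the question), and it adds one check that the answer's code does not have: idxmax on an all-False Series still returns the first label, so it is worth confirming that there was a match at all.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the question's data).
demo = pd.DataFrame({
    "a": ["x", "y", np.nan],
    "b": [np.nan, "Temperature (C)", "temperature (F)"],
    "c": [np.nan, 25, 77],
})

# Series indexed by (row label, column name).
stacked = demo.stack()
hits = stacked.str.contains(r"[Tt]emperature", regex=True, na=False)

# idxmax returns the label of the first True, but it cannot signal "no match".
found_any = bool(hits.any())
idx, col = hits.idxmax()

# Everything from the first hit to the right and down, minus empty rows.
selection = demo.loc[idx:, col:].dropna(how="all")
print(found_any, idx, col)
print(selection)
```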
Option 2 (there is potentially data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[0]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
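Explanation
- Basically, the procedure here is the same as in option 1, except that we want to retrieve all the index values, not just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, tmp looks like this: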
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
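- In the next step, we use these values as coordinates for df.loc. For the index selection we want all the values, so we use index.get_level_values; for the column, we only need the first value (they should of course all be the same: Unnamed: 3).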
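As an alternative to stacking, the cell-wise boolean mask from the question can answer both goals at once: any(axis=1) gives the rows and any(axis=0) gives the columns. The sketch below is one possible way to write this, on a hypothetical subset of the question's data; converting each column with astype(str) is a choice made here so that numeric and all-NaN columns do not need to be dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the question's data.
df_original = pd.DataFrame({
    "Unnamed: 0": ["Mark", "Jack", "Tom", np.nan],
    "Unnamed: 1": ["5000", "4600", "2500", np.nan],
    "Unnamed: 3": [np.nan, "Temperature (C)", "Strength", "Comment"],
    "Unnamed: 5": [np.nan, 25, np.nan, "Looks OK"],
})

pattern = r"[Tt]emperature|[Ss]tren|[Cc]omment"

# Cell-wise boolean mask; astype(str) lets non-string cells pass through.
mask = df_original.apply(
    lambda s: s.astype(str).str.contains(pattern, regex=True, na=False)
)

# Goal 1: rows with at least one match. Goal 2: columns with at least one match.
row_range = df_original.index[mask.any(axis=1).to_numpy()].tolist()
cols_of_interest = df_original.columns[mask.any(axis=0).to_numpy()].tolist()

# Matching rows, from the first matching column to the right.
target = df_original.loc[row_range, cols_of_interest[0]:]
print(row_range, cols_of_interest)
print(target)
```

If no cell matches, cols_of_interest is empty and the last step raises an IndexError, so real code would need to check for that case first.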
