I have this DataFrame, built with the following code:
import pandas as pd

d = [{'AX':['chr=1','pos=2'], 'AVF1':[], 'HI':['chr=343', 'pos=4'], 'version_1':[]},
     {'AX':[], 'AVF1':['chr=4', 'pos=454'], 'HI':[], 'version_2':[]},
     {'AX':['chr=3', 'pos=32'], 'AVF1':['chr=6', 'pos=12'], 'HI':[], 'version_3':[]}]
frame = pd.DataFrame(d)
frame
cols = ['AX','AVF1','HI']
f = frame.T
# collect the last three index labels (version_1..version_3) to use as the top column level
lst = []
f['temp'] = f.index
for i in f.iloc[-3:, -1]:
    lst.append(i)
f = f.drop(columns=['temp'])
f.columns = [lst, f.columns]
f
chrs = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['chr']]))
pos = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['pos']]))
f = pd.concat([f,chrs], axis=1).sort_index(level=0, axis=1)
f = pd.concat([f,pos], axis=1).sort_index(level=0, axis=1)
f = f.drop(f.index[[-1,-2,-3]])
f
      version_1                   version_2                   version_3
      0                 chr  pos  1                 chr  pos  2                chr  pos
AX    [chr=1, pos=2]    NaN  NaN  []                NaN  NaN  [chr=3, pos=32]  NaN  NaN
AVF1  []                NaN  NaN  [chr=4, pos=454]  NaN  NaN  [chr=6, pos=12]  NaN  NaN
HI    [chr=343, pos=4]  NaN  NaN  []                NaN  NaN  []               NaN  NaN
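I am trying to go through every column whose second-level label is an integer (0, 1, 2) and match the entries that start with "chr" or "pos", up to the first comma, e.g. "chr=1" or "pos=454". I then want to put the matched values into the corresponding chr and pos columns.
Desired output: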
      version_1                           version_2                           version_3
      0                 chr      pos      1                 chr      pos      2                chr      pos
AX    [chr=1, pos=2]    chr=1    pos=2    []                NaN      NaN      [chr=3, pos=32]  chr=3    pos=32
AVF1  []                NaN      NaN      [chr=4, pos=454]  chr=4    pos=454  [chr=6, pos=12]  chr=6    pos=12
HI    [chr=343, pos=4]  chr=343  pos=4    []                NaN      NaN      []               NaN      NaN
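The real DataFrame I am doing this on has many more columns, so listing every column by hand is probably not a viable option. I tried the code below, but I am not good at pattern matching.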
f['0'].str.extract(pat='chr')
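Answer from a uj5u.com user:
As far as I understand, you should start from a simpler structure. Instead of creating the output structure and then extracting the data, extract the data first and then create the new columns.
Starting from your complex structure, the first step is to select the 0, 1, 2 columns. Then we stack them (no regular expression is actually needed, since you already have lists):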
cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False, True, False, False, True, False, False])
# get a single column with the chr/pos lists
s = f.loc[:, cols].droplevel(1, axis=1).stack()
# create a 2D structure from the extracted data and assign it back into the original DataFrame
f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
                     .unstack(1).swaplevel(axis=1))
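The advice above about starting from a simpler structure could be sketched roughly as follows. This is only an illustration under some assumptions: the empty `version_N` keys are left out of `d` and the version labels are generated from the row position instead, the helper `pick` is hypothetical, and the second-level label `'raw'` stands in for the integer labels 0, 1, 2 used in the question.

```python
import pandas as pd

# Same data as in the question, without the empty version_N keys (assumption:
# they only serve as labels, so they are generated below instead).
d = [{'AX': ['chr=1', 'pos=2'], 'AVF1': [], 'HI': ['chr=343', 'pos=4']},
     {'AX': [], 'AVF1': ['chr=4', 'pos=454'], 'HI': []},
     {'AX': ['chr=3', 'pos=32'], 'AVF1': ['chr=6', 'pos=12'], 'HI': []}]


def pick(items, prefix):
    """Hypothetical helper: first list entry starting with `prefix`, else None."""
    return next((x for x in items if x.startswith(prefix)), None)


raw = pd.DataFrame(d).T  # rows: AX, AVF1, HI; columns: 0, 1, 2
raw.columns = [f'version_{i + 1}' for i in raw.columns]

# Extract the data first, then assemble the two-level columns in one go.
parts = {}
for version in raw.columns:
    parts[(version, 'raw')] = raw[version]
    parts[(version, 'chr')] = raw[version].map(lambda v: pick(v, 'chr='))
    parts[(version, 'pos')] = raw[version].map(lambda v: pick(v, 'pos='))

# Tuple keys become a two-level column MultiIndex.
out = pd.DataFrame(parts)
```

Because the loop runs over `raw.columns`, no column has to be listed by hand, which matters when the real DataFrame has many more versions. Cells whose list is empty end up as missing values in the `chr` and `pos` columns.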
