Pandas: match a pattern and append the matches to columns multiple times in the same DataFrame

I have this DataFrame, built with the following code:

import pandas as pd

d = [{'AX':['chr=1','pos=2'], 'AVF1':[], 'HI':['chr=343', 'pos=4'], 'version_1':[]},
      {'AX':[], 'AVF1':['chr=4', 'pos=454'], 'HI':[], 'version_2':[]},
      {'AX':['chr=3', 'pos=32'], 'AVF1':['chr=6', 'pos=12'], 'HI':[], 'version_3':[]}]

frame = pd.DataFrame(d)

frame

cols = ['AX','AVF1','HI']

f = frame.T

# Collect the last three row labels (version_1..version_3) for the top column level
lst = []
f['temp'] = f.index
for i in f.iloc[-3:, -1]:
  lst.append(i)
f = f.drop(columns=['temp'])

f.columns = [lst, f.columns]
f

chrs = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['chr']]))
pos = pd.DataFrame(index=f.index, columns=pd.MultiIndex.from_product([f.columns.levels[0], ['pos']]))



f = pd.concat([f,chrs], axis=1).sort_index(level=0, axis=1)
f = pd.concat([f,pos], axis=1).sort_index(level=0, axis=1)

f = f.drop(f.index[[-1,-2,-3]])
f

        version_1                        version_2                                version_3
        0                 chr      pos   1                   chr        pos       2                 chr      pos
AX       [chr=1, pos=2]   NaN      NaN    []                 NaN        NaN        [chr=3, pos=32]  NaN      NaN
AVF1     []               NaN      NaN    [chr=4, pos=454]   NaN        NaN        [chr=6, pos=12]  NaN      NaN
HI       [chr=343, pos=4] NaN      NaN    []                 NaN        NaN        []               NaN      NaN

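I am trying to look at every column whose second-level label is an int (0, 1, 2) and match the entries that start with "chr" or "pos", up to the first comma, e.g. "chr=1" or "pos=454". I then want to write each matched value into the corresponding chr or pos column.

Desired output: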

        version_1                          version_2                         version_3
        0                 chr      pos     1                 chr    pos      2                chr    pos
AX      [chr=1, pos=2]    chr=1    pos=2   []                NaN    NaN      [chr=3, pos=32]  chr=3  pos=32
AVF1    []                NaN      NaN     [chr=4, pos=454]  chr=4  pos=454  [chr=6, pos=12]  chr=6  pos=12
HI      [chr=343, pos=4]  chr=343  pos=4   []                NaN    NaN      []               NaN    NaN

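The real DataFrame I am doing this on has many more columns, so listing every column by hand is probably not a viable option. I tried the code below, but I am not good at pattern matching.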

f['0'].str.extract(pat='chr')
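
For context, this attempt cannot work as written: the labels 0, 1, 2 are ints on the second column level rather than a top-level string '0', the cells hold Python lists rather than strings, and str.extract requires a pattern with at least one capture group. The sketch below only illustrates the "chr= up to the first comma" idea. It assumes a hypothetical Series named cells standing in for one list column, and it joins each list into a comma-separated string first so that a regular expression applies.

```python
import pandas as pd

# Hypothetical stand-in for one list column (version_1 / 0 in the question).
cells = pd.Series(
    [['chr=1', 'pos=2'], [], ['chr=343', 'pos=4']],
    index=['AX', 'AVF1', 'HI'],
)

# Join each list into one string so the .str accessor sees plain text.
joined = cells.map(','.join)

# One capture group per pattern; [^,]* stops at the first comma.
chr_part = joined.str.extract(r'(chr=[^,]*)', expand=False)
pos_part = joined.str.extract(r'(pos=[^,]*)', expand=False)

print(pd.DataFrame({'chr': chr_part, 'pos': pos_part}))
```

Rows whose list is empty yield a missing value in both result columns.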

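Reply from a uj5u.com user:

As I understand it, you should start from a simpler structure. Rather than creating the output structure first and then extracting the data, extract the data first and then create the new columns.

Starting from your complex structure, the first step is to select the 0, 1, 2 columns. Then we stack (no regular expression is actually needed, since you already have lists):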
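
A minimal sketch of what "extract first, then create the columns" could look like. It rests on assumptions that are not in the original post: the data is available as a plain dict keyed by version (records), the helper pick and the column label 'raw' are hypothetical names, and 'raw' takes the place of the 0, 1, 2 labels.

```python
import pandas as pd

# Same values as in the question, keyed by version (assumed layout).
records = {
    'version_1': {'AX': ['chr=1', 'pos=2'], 'AVF1': [], 'HI': ['chr=343', 'pos=4']},
    'version_2': {'AX': [], 'AVF1': ['chr=4', 'pos=454'], 'HI': []},
    'version_3': {'AX': ['chr=3', 'pos=32'], 'AVF1': ['chr=6', 'pos=12'], 'HI': []},
}

def pick(items, prefix):
    """Return the first list item starting with '<prefix>=', or None."""
    return next((item for item in items if item.startswith(prefix + '=')), None)

parts = {}
for version, cells in records.items():
    raw = pd.Series(cells)  # one list per row label
    parts[version] = pd.DataFrame({
        'raw': raw,
        'chr': raw.map(lambda items: pick(items, 'chr')),
        'pos': raw.map(lambda items: pick(items, 'pos')),
    })

# The dict keys become the top column level: (version, raw/chr/pos).
wide = pd.concat(parts, axis=1)
print(wide)
```

Looking items up by prefix rather than by position means the order of the entries inside each list does not matter, and no column has to be listed by hand.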

cols = pd.to_numeric(f.columns.get_level_values(1), errors='coerce').notna()
# array([ True, False, False,  True, False, False,  True, False, False])

# get a single column with the chr/pos lists
s = f.loc[:, cols].droplevel(1, axis=1).stack()

# create a 2D structure from the extracted data and replace in original DataFrame
f.loc[:, ~cols] = (pd.DataFrame(s.to_list(), columns=['chr', 'pos'], index=s.index)
                     .unstack(1).swaplevel(axis=1))
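
The two key steps above can be inspected in isolation on small hand-built inputs. This is only a sketch: the names columns, is_raw, cells and split are illustrative and do not come from the answer.

```python
import pandas as pd

# Step 1: flag the columns whose second-level label is numeric.
columns = pd.MultiIndex.from_tuples([
    ('version_1', 0), ('version_1', 'chr'), ('version_1', 'pos'),
    ('version_2', 1), ('version_2', 'chr'), ('version_2', 'pos'),
])
second_level = columns.get_level_values(1)
# Non-numeric labels ('chr', 'pos') are coerced to NaN, so notna() keeps 0 and 1.
is_raw = pd.to_numeric(second_level, errors='coerce').notna()
print(list(is_raw))

# Step 2: turn a Series of lists into one column per list position.
cells = pd.Series(
    [['chr=1', 'pos=2'], [], ['chr=343', 'pos=4']],
    index=['AX', 'AVF1', 'HI'],
)
# Shorter (here: empty) lists are padded with missing values.
split = pd.DataFrame(cells.to_list(), columns=['chr', 'pos'], index=cells.index)
print(split)
```

Because the split is positional, it assumes every non-empty list holds the chr item first and the pos item second, as in the question's data. Lists with a different order or length would need a lookup by prefix instead.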


Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/457116.html
