Assigning word values from a dictionary to DataFrame contents

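Below is an example of a DataFrame, a dictionary, and code that works but is extremely inefficient for huge dictionaries. One column of the DataFrame contains sentences. The code extracts each word from a sentence, checks whether it is in the dictionary, and assigns it its value. The mean of these values (per sentence, i.e. per row) is computed and stored in an additional column.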

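import re

import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

# Word -> value lookup table
d = {'und': 20,
     'der': 10,
     'in':  40,
     'zu':  10,
     'Kompliziertereswort': 2,
     'Buch': 5,
     'Lehrerin': 5}

# One alternation of all escaped dictionary keys, anchored at word boundaries
pat = fr"\b({'|'.join(map(re.escape, d))})\b"
# Extract every match, map it to its value, then average per original row
# (groupby(level=0).mean() replaces .mean(level=0), which was removed in pandas 2.0)
test_df['score'] = test_df['column'].str.extractall(pat)[0].map(d).groupby(level=0).mean()

print(test_df)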

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0

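Since looking a word up in a dictionary is more efficient than using a regex, I am sure there must be a way to use a function that splits each sentence into words, checks only whether they are in the dictionary, and computes the mean. I also tried converting the dictionary into a DataFrame and using explode(), but that was completely inefficient as well.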
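As a minimal sketch of the function-based idea described above: tokenise each sentence, look each token up in the dictionary, and average the hits. The helper name `mean_score` is hypothetical, the sentences and dictionary repeat the sample data from the question, and returning NaN for a sentence with no known words is an assumption about the desired behaviour.

```python
import re

# Sample data copied from the question
sentences = ['und der in zu',
             'Kompliziertereswort something',
             'Lehrerin in zu [Buch]',
             'Buch (Lehrerin) kompliziertereswort']

d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}


def mean_score(sentence, scores):
    """Hypothetical helper: mean dictionary value of the known words in a sentence."""
    # \w+ keeps runs of word characters, so brackets and parentheses are dropped
    hits = [scores[w] for w in re.findall(r'\w+', sentence) if w in scores]
    # Assumption: a sentence without any known word yields NaN
    return sum(hits) / len(hits) if hits else float('nan')


result = [mean_score(s, d) for s in sentences]
print(result)  # [20.0, 2.0, 15.0, 5.0]
```

On the DataFrame from the question, the same helper could be applied with `test_df['column'].map(lambda s: mean_score(s, d))`. Each word costs one hash lookup, so the work grows with the length of the sentences and not with the size of the dictionary.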

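Reply from a uj5u.com user:

Use:

import string

# Split on whitespace, stack the words into one long Series,
# strip surrounding punctuation, map to values, and average per row
test_df['score'] = (test_df['column'].str.split(expand=True)
                                     .stack()
                                     .str.strip(string.punctuation)
                                     .map(d)
                                     .groupby(level=0)
                                     .mean())
print(test_df)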
  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0

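Or:

import numpy as np

# Mean of the dictionary values of the words in x, ignoring unknown words (NaN)
f = lambda x: np.nanmean([d.get(y, np.nan) for y in x.split()])
# Remove everything that is neither a word character nor whitespace, then score each row
test_df['score'] = test_df['column'].str.replace(r'[^\w\s]', '', regex=True).apply(f)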
  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0

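Reply from a uj5u.com user:

You can try:

# Split on non-word characters, put one word per row, map to values, average per original row
test_df['score'] = (test_df['column'].str.split(r'\W').explode()
                                     .map(d).groupby(level=0).mean())

Output: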

>>> test_df

  _id                               column  score
0  1a                        und der in zu   20.0
1  2b        Kompliziertereswort something    2.0
2  3c                Lehrerin in zu [Buch]   15.0
3  4d  Buch (Lehrerin) kompliziertereswort    5.0
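To illustrate why splitting on non-word characters (the regex `\W`) still gives the expected average for a row such as `Lehrerin in zu [Buch]`, here is a small sketch that uses Python's `re.split` as a stand-in for `Series.str.split`. That the two tokenise identically for this pattern is an assumption. The split leaves empty strings around the brackets; these are not dictionary keys, so they do not contribute to the mean (in the pandas version they map to NaN, which `mean()` skips).

```python
import re

# Dictionary copied from the question
d = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
     'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

sentence = 'Lehrerin in zu [Buch]'

# Every single non-word character is a separator, so '[' and ']' produce empty tokens
tokens = re.split(r'\W', sentence)
print(tokens)  # ['Lehrerin', 'in', 'zu', '', 'Buch', '']

# Keep only tokens that are dictionary keys; empty strings drop out here
values = [d[t] for t in tokens if t in d]
mean_value = sum(values) / len(values)
print(mean_value)  # 15.0
```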

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/323840.html
