I have two datasets, each containing unique values. The first one, shown below, holds the full names of several hospitals.
-----------------------
| HOSPITAL_NAME_FULL |
-----------------------
| St. Christine |
| Californian Hospital |
| Holy Mercy Hospital |
| Germanic NW Hospital |
| Trauma Center Hospital|
| Holy Spirit Hospital |
| Mater Hospital |
-----------------------
The other one holds the short names of the same hospitals.
---------------------
| HOSPITAL_NAME_SHORT |
---------------------
| Christine |
| Californian |
| Mercy |
| Germanic |
| Trauma |
| Holy |
| Mater |
---------------------
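The problem is that I need to join them so that I have the full name and the short name side by side. Can I join the dataframes using some kind of regular expression to get this result?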
----------------------- ---------------------
| HOSPITAL_NAME_FULL | HOSPITAL_NAME_SHORT |
----------------------- ---------------------
| St. Christine | Christine |
| Californian Hospital | Californian |
| Holy Mercy Hospital | Mercy |
| Germanic NW Hospital | Germanic |
| Trauma Center Hospital| Trauma |
| Holy Spirit Hospital | Holy |
| Mater Hospital | Mater |
----------------------- ---------------------
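Thanks!
Reply from a uj5u.com user:
TL;DR: if each name in HOSPITAL_NAME_SHORT appears verbatim inside the corresponding HOSPITAL_NAME_FULL, I recommend the first approach; if you want to take string similarity into account, I recommend the second.
First approach (the short name appears verbatim):
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    # first short name (in df2 order) that is a substring of the full name;
    # None when nothing matches, instead of raising StopIteration
    lambda x: next((st for st in df2['HOSPITAL_NAME_SHORT'] if st in x), None))
print(df1)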
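Second approach (taking similarity into account):
You can use difflib.SequenceMatcher to compute the similarity between every pair of names from the two dataframes, and return, for each row, the short name with the highest similarity.
What difflib.SequenceMatcher does:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, 'Vitamin_A', 'Vitamin_C').ratio()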
0.8888888888888888
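For reference, the difflib documentation describes ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. A small sketch that recomputes the figure above by hand (the variable names are illustrative only):

```python
from difflib import SequenceMatcher

a, b = "Vitamin_A", "Vitamin_C"
matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks (the final dummy block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is defined as 2 * M / T
manual_ratio = 2 * matched / (len(a) + len(b))
print(matched, manual_ratio, matcher.ratio())
```

Here the two strings share the 8-character prefix "Vitamin_" and have 18 characters in total, which gives 16/18.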
Solving the problem with SequenceMatcher:
from difflib import SequenceMatcher
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    # pick the short name with the highest similarity ratio to the full name
    lambda full: max(df2['HOSPITAL_NAME_SHORT'],
                     key=lambda st: SequenceMatcher(None, full, st).ratio()))
print(df1)
Output:
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Mercy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater
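One caveat with taking the maximum is that it always returns some short name, even for a full name that resembles none of them. A minimal sketch of a variant with a cutoff, using plain lists so it runs without pandas; the helper name best_match and the 0.3 threshold are assumptions for illustration, not part of the original answer:

```python
from difflib import SequenceMatcher

short_names = ["Christine", "Californian", "Mercy", "Germanic",
               "Trauma", "Holy", "Mater"]

def best_match(full_name, candidates, min_ratio=0.3):
    """Return the candidate most similar to full_name, or None if none reaches min_ratio."""
    scored = [(SequenceMatcher(None, full_name, c).ratio(), c) for c in candidates]
    ratio, name = max(scored, key=lambda pair: pair[0])
    # min_ratio is an arbitrary illustrative cutoff; tune it on real data
    return name if ratio >= min_ratio else None

print(best_match("St. Christine", short_names))
print(best_match("Zzz", short_names))
```

With a dataframe, the same helper could be passed to apply, for example df1['HOSPITAL_NAME_FULL'].apply(lambda x: best_match(x, df2['HOSPITAL_NAME_SHORT'])).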
The two input dataframes:
print(df1)
# HOSPITAL_NAME_FULL
# 0 St. Christine
# 1 Californian Hospital
# 2 Holy Mercy Hospital
# 3 Germanic NW Hospital
# 4 Trauma Center Hospital
# 5 Holy Spirit Hospital
# 6 Mater Hospital
print(df2)
# HOSPITAL_NAME_SHORT
# 0 Christine
# 1 Californian
# 2 Mercy
# 3 Germanic
# 4 Trauma
# 5 Holy
# 6 Mater
Reply from a uj5u.com user:
You can try .str.extract:
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].str.extract('(' + '|'.join(df2['HOSPITAL_NAME_SHORT']) + ')', expand=False)
print(df1)
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Holy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater
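You could also consider str.extractall to show all possible matches (for example, Holy Mercy Hospital contains both Holy and Mercy):
df1 = (df1
       .join(
           df1['HOSPITAL_NAME_FULL'].str.extractall('(' + '|'.join(df2['HOSPITAL_NAME_SHORT']) + ')')
           .unstack().droplevel(0, axis=1)
       )
)
print(df1)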
HOSPITAL_NAME_FULL 0 1
0 St. Christine Christine NaN
1 Californian Hospital Californian NaN
2 Holy Mercy Hospital Holy Mercy
3 Germanic NW Hospital Germanic NaN
4 Trauma Center Hospital Trauma NaN
5 Holy Spirit Hospital Holy NaN
6 Mater Hospital Mater NaN
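If the short names might contain regex metacharacters (a period, for instance) or might occur as part of a longer word, the pattern can be built a little more defensively. The following is only a sketch under those assumptions, using the standard re module on a plain list; the escaping, word boundaries and longest-first ordering are additions that the original answer does not use:

```python
import re

short_names = ["Christine", "Californian", "Mercy", "Germanic",
               "Trauma", "Holy", "Mater"]

# Longest names first, so a longer name wins over a shorter one
# that starts at the same position
alternatives = sorted(short_names, key=len, reverse=True)

# re.escape neutralises metacharacters; \b restricts matches to whole words
pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in alternatives) + r")\b"
)

print(pattern.findall("Holy Mercy Hospital"))
print(pattern.search("St. Christine").group(1))
```

The resulting pattern.pattern string contains a single capture group, so it should also be usable as the argument to str.extract or str.extractall. Note that the regex engine still reports the leftmost match first, which is why Holy comes before Mercy for Holy Mercy Hospital.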
Reply from a uj5u.com user:
Here is one way:
df1['HOSPITAL_NAME_SHORT'] = df1.HOSPITAL_NAME_FULL.apply(lambda x: [y for y in df2.HOSPITAL_NAME_SHORT if y in x][0])
Output:
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Mercy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater
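Since the question asks for a join, another option is to build every full/short pair with a cross join and keep only the pairs where the short name occurs in the full name. This is a sketch rather than something taken from the answers above; it assumes pandas 1.2 or later (for how="cross"), and the names pairs, mask and matches are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"HOSPITAL_NAME_FULL": [
    "St. Christine", "Californian Hospital", "Holy Mercy Hospital",
    "Germanic NW Hospital", "Trauma Center Hospital",
    "Holy Spirit Hospital", "Mater Hospital"]})
df2 = pd.DataFrame({"HOSPITAL_NAME_SHORT": [
    "Christine", "Californian", "Mercy", "Germanic",
    "Trauma", "Holy", "Mater"]})

# Every combination of full name and short name (7 x 7 rows)
pairs = df1.merge(df2, how="cross")

# Keep the rows where the short name is a substring of the full name
mask = [short in full for full, short in
        zip(pairs["HOSPITAL_NAME_FULL"], pairs["HOSPITAL_NAME_SHORT"])]
matches = pairs[mask].reset_index(drop=True)
print(matches)
```

Unlike the one-liner above, this keeps ambiguous rows visible: Holy Mercy Hospital appears twice, once with Mercy and once with Holy, so the ambiguity can be resolved explicitly. The cross join grows with the product of the two frame sizes, so it is only practical for small inputs.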
