#!/usr/bin/env python3
import pandas
import numpy
example_dataset = {
    'Date': ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type': ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount': ['156.00', '55', '21.50', '176.10']
}
df = pandas.DataFrame(example_dataset)
df['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')
df['Category'] = 'tempvalue'
df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', df['Category'])
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])
print(df)
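Given the code above, I have two related questions:
- Is there a better way to create the Category column than filling it with a temporary value and then replacing that with specific values, as shown? I ask because it feels messy.
- How can I keep the search terms and the categories to assign in a separate file? Is that possible? I ask because I want to make it easier to add new terms and define categories in the future.
Thanks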
Reply from a uj5u.com user:
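1. First question
You don't need to create the new column beforehand; you can do the following instead:
#df['Category'] = 'tempvalue'
# None is used as the fallback: mixing a string with numpy.nan can make numpy.where return the text 'nan'
df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', None)
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])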
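Another way to avoid the placeholder value is numpy.select, which evaluates all conditions in a single call and takes an explicit default. The following is only a minimal sketch: the extra 'Bookshop' row and the 'Other' default label are assumptions added for illustration and are not part of the original question.

```python
import numpy
import pandas

df = pandas.DataFrame(
    {'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys', 'Bookshop']}
)

# One boolean condition per category; the first condition that is True wins.
conditions = [
    df['Description'].str.contains('Tesco|Sainsbury'),
    df['Description'].str.contains('Dentist|Cinema'),
]
choices = ['Groceries', 'Stuff']

# Rows that match no condition receive the default ('Other' is an assumed label).
df['Category'] = numpy.select(conditions, choices, default='Other')
print(df)
```

Because the default is supplied explicitly, no temporary column value is needed and unmatched rows are easy to spot.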
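2. Second question
Let's create a simple key-value file, categories.json, in the same directory as the script:
{
    "Tesco|Sainsbury": "Groceries",
    "Dentist|Cinema": "Stuff"
}
You can then do something like this to automate the category assignment: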
import pandas
import numpy
import json
example_dataset = {
    'Date': ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type': ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount': ['156.00', '55', '21.50', '176.10']
}
df = pandas.DataFrame(example_dataset)
df['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')
# Load the pattern -> category mapping from the JSON file
with open('categories.json') as file:
    categories_dict = json.load(file)
# Start with None (object dtype) so unmatched rows stay empty rather than becoming the text 'nan'
df['Category'] = None
# Each key is a regular expression; rows whose description matches it get the value
for key, value in categories_dict.items():
    df['Category'] = numpy.where(df['Description'].str.contains(key), value, df['Category'])
In this case, for simplicity, I recommend keeping the column initialization.
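The loop above can also be wrapped in a small reusable function. This is a rough sketch rather than a definitive implementation: the function name assign_categories, the 'Uncategorised' default, the case-insensitive matching, and the 'Bookshop' row are all assumptions introduced here. The JSON text is inlined so the example runs on its own; in practice it would come from json.load on categories.json.

```python
import json
import pandas

# Stand-in for the contents of categories.json (inlined so the sketch is self-contained).
rules_text = '{"Tesco|Sainsbury": "Groceries", "Dentist|Cinema": "Stuff"}'
rules = json.loads(rules_text)


def assign_categories(descriptions, rules, default='Uncategorised'):
    """Return a Series of category labels, one per description.

    Hypothetical helper: each key in `rules` is treated as a regular expression.
    If several patterns match the same row, the last one in the mapping wins.
    """
    result = pandas.Series(default, index=descriptions.index, dtype=object)
    for pattern, category in rules.items():
        # case=False is an added assumption so 'tesco' and 'Tesco' both match.
        mask = descriptions.str.contains(pattern, case=False, regex=True)
        result[mask] = category
    return result


df = pandas.DataFrame(
    {'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys', 'Bookshop']}
)
df['Category'] = assign_categories(df['Description'], rules)
print(df)
```

Adding a new term then only requires editing the JSON file; the script itself stays unchanged.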
Reply from a uj5u.com user:
You can write the search terms in a CSV file, for example "search_terms.csv", like this:
SearchTerm,Value
Tesco|Sainsbury,Groceries
Dentist|Cinema,Stuff
and read it into a DataFrame like this:
df_search = pandas.read_csv('search_terms.csv')
and build a dictionary from it, like:
search_dict = df_search.set_index('SearchTerm')['Value'].to_dict()
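Now initialize the Category column as:
df['Category'] = None  # object dtype, so string labels can be assigned without a dtype clash
and update Category in place efficiently with loc, for example:
for k in search_dict:
    # str.match only matches at the start of each description
    df.loc[df['Description'].str.match(k), 'Category'] = search_dict[k]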
Output df:
          Date     Transaction Type  Description  Amount   Category
0  01 Mar 2022  Contactless payment  Tesco Store  156.00  Groceries
1  02 Apr 2022           Payment to      Dentist      55      Stuff
2  10 Apr 2022  Contactless payment       Cinema   21.50      Stuff
3  15 Apr 2022  Contactless payment   Sainsburys  176.10  Groceries
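One detail worth checking with this approach is that str.match anchors the pattern at the start of the string, whereas str.contains searches anywhere in it. The sketch below illustrates the difference; the 'Local Cinema' row is a made-up description, and the CSV text is inlined through io.StringIO only so the example runs without a file on disk.

```python
import io
import pandas

# Inlined stand-in for search_terms.csv.
csv_text = "SearchTerm,Value\nTesco|Sainsbury,Groceries\nDentist|Cinema,Stuff\n"
df_search = pandas.read_csv(io.StringIO(csv_text))
search_dict = df_search.set_index('SearchTerm')['Value'].to_dict()

# 'Local Cinema' is an assumed example where the term is not at the start.
df = pandas.DataFrame({'Description': ['Tesco Store', 'Local Cinema', 'Sainsburys']})
df['Category'] = None

for pattern, category in search_dict.items():
    # str.contains finds the term anywhere in the description.
    df.loc[df['Description'].str.contains(pattern), 'Category'] = category

# str.match would miss 'Local Cinema' because the term is not at the beginning.
starts_with_term = df['Description'].str.match('Dentist|Cinema')
print(df)
print(starts_with_term)
```

If the bank descriptions always begin with the merchant name, str.match is sufficient; otherwise str.contains is probably the safer choice.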
Reply from a uj5u.com user:
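I found this to be faster than Gam's answer and, in my opinion, the code is cleaner:
category_dict = {
    'Groceries': ['Tesco', 'Sainsbury'],
    'Stuff': ['Dentist', 'Cinema']
}
def get_category(description):
    # Return the first category that has a substring in the description
    for category, substrings in category_dict.items():
        for substring in substrings:
            if substring in description:
                return category
    # Falls through and returns None when nothing matches
df['Category'] = df['Description'].apply(get_category)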
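If you would rather have the substrings as keys, there is this version:
category_dict = {
    'Tesco': 'Groceries',
    'Sainsbury': 'Groceries',
    'Dentist': 'Stuff',
    'Cinema': 'Stuff'
}
def get_category(description):
    # Return the category of the first key found in the description
    for substring in category_dict:
        if substring in description:
            return category_dict[substring]
    # Falls through and returns None when nothing matches
df['Category'] = df['Description'].apply(get_category)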
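For larger tables, the same substring-keyed dictionary can be applied without a Python-level function call per row. The sketch below is one possible variant, not part of the answer above: it joins the keys into a single regular expression, pulls out the leftmost matching term with str.extract, and looks it up with map. The 'Bookshop' row is an assumed extra example of an unmatched description.

```python
import re
import pandas

category_dict = {
    'Tesco': 'Groceries',
    'Sainsbury': 'Groceries',
    'Dentist': 'Stuff',
    'Cinema': 'Stuff',
}

# Build one alternation pattern from the keys; re.escape keeps each key literal.
pattern = '(' + '|'.join(re.escape(key) for key in category_dict) + ')'

df = pandas.DataFrame(
    {'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys', 'Bookshop']}
)

# str.extract returns the leftmost matching term (NaN if none);
# map then translates that term into its category.
matched_term = df['Description'].str.extract(pattern, expand=False)
df['Category'] = matched_term.map(category_dict)
print(df)
```

Unmatched rows end up as NaN here, which can be filled afterwards with fillna if a default label is preferred.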
