Pandas: two questions - creating a column from search results and replacing values

#!/usr/bin/env python3

import pandas
import numpy

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')
df['Category'] = 'tempvalue'

df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', df['Category'])
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])

print(df)

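Given the code above, I have two related questions:

1. Is there a better way to create the Category column than filling it with a temporary value and then replacing that with specific values, as shown? I ask because it feels messy.
2. How can I keep the search terms and the categories to assign in a separate file? Is that possible? I ask because I want to make it easier to add new terms and define categories in the future.

Thanks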

A uj5u.com user replied:

1. First question

You don't need to create the new column in advance; you can do the following:


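#df['Category'] = 'tempvalue'

# Use None as the fallback: mixing a string with numpy.nan in numpy.where
# can turn the fallback into the string 'nan' instead of a missing value.
df['Category'] = numpy.where(df['Description'].str.contains('Tesco|Sainsbury'), 'Groceries', None)
df['Category'] = numpy.where(df['Description'].str.contains('Dentist|Cinema'), 'Stuff', df['Category'])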
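As an alternative to chaining numpy.where calls, numpy.select can evaluate several conditions in one step. The following is only a sketch: the 'Bookshop' row and the 'Uncategorised' fallback label are assumptions added for illustration and do not come from the question.

```python
import numpy
import pandas

df = pandas.DataFrame({
    # 'Bookshop' is a hypothetical row that matches no search term.
    'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys', 'Bookshop'],
})

# One boolean mask per category. Masks are checked in order; the first match wins.
conditions = [
    df['Description'].str.contains('Tesco|Sainsbury'),
    df['Description'].str.contains('Dentist|Cinema'),
]
choices = ['Groceries', 'Stuff']

# 'Uncategorised' is an assumed fallback label for rows that match nothing.
df['Category'] = numpy.select(conditions, choices, default='Uncategorised')
print(df)
```

Because every row gets either a category or the fallback label, no placeholder column is needed beforehand.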

2. Second question

Let's create a simple key-value file, categories.json, in the same directory as the script:

{
    "Tesco|Sainsbury":"Groceries",
    "Dentist|Cinema":"Stuff"
}

You can then do something like this to automate the category assignment:

import pandas
import numpy
import json

example_dataset = {
    'Date' : ['01 Mar 2022', '02 Apr 2022', '10 Apr 2022', '15 Apr 2022'],
    'Transaction Type' : ['Contactless payment', 'Payment to', 'Contactless payment', 'Contactless payment'],
    'Description' : ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys'],
    'Amount' : ['156.00', '55', '21.50', '176.10']
}

df = pandas.DataFrame(example_dataset)

df['Date'] = pandas.to_datetime(df['Date'], format='%d %b %Y')



with open('categories.json') as file:
    categories_dict = json.load(file)

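# Start from None rather than numpy.nan: numpy.where can turn a NaN fallback
# into the string 'nan' when the other branch is a string.
df['Category'] = None
for key, value in categories_dict.items():
    df['Category'] = numpy.where(df['Description'].str.contains(key), value, df['Category'])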

In this case, for simplicity, I suggest keeping the column initialization.
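Real bank exports often contain missing descriptions or inconsistent capitalisation, which the loop above does not handle. A minimal sketch of one way to deal with that is shown here. The helper name assign_categories, the 'Uncategorised' default, and the sample descriptions are all hypothetical, and the JSON is inlined as a string only so that the example runs without a file on disk.

```python
import json
import pandas

# Stand-in for the contents of categories.json, inlined so the sketch is self-contained.
categories_json = '{"Tesco|Sainsbury": "Groceries", "Dentist|Cinema": "Stuff"}'
categories_dict = json.loads(categories_json)


def assign_categories(descriptions, patterns, default='Uncategorised'):
    """Return a Series of categories; a later pattern overwrites an earlier match."""
    result = pandas.Series(default, index=descriptions.index, dtype=object)
    for pattern, category in patterns.items():
        # case=False ignores capitalisation; na=False treats missing text as "no match".
        mask = descriptions.str.contains(pattern, case=False, na=False)
        result[mask] = category
    return result


# Hypothetical sample data with an upper-case entry and a missing value.
descriptions = pandas.Series(['Tesco Store', 'DENTIST', None, 'Bookshop'])
categories = assign_categories(descriptions, categories_dict)
print(categories.tolist())
```

Note that if a description matches more than one pattern, the pattern that appears later in the file wins, so the order of entries in the JSON file matters.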

A uj5u.com user replied:

You can write the search terms in a CSV file, e.g. "search_terms.csv", like this:

SearchTerm,Value
Tesco|Sainsbury,Groceries
Dentist|Cinema,Stuff

Then read it into a DataFrame as follows:

df_search = pandas.read_csv('search_terms.csv')

and build a dictionary like this:

search_dict = df_search.set_index('SearchTerm')['Value'].to_dict()
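To check what the resulting dictionary looks like without creating a file, the same two steps can be run against an in-memory CSV. This is only a sketch: io.StringIO stands in for 'search_terms.csv', and the variable csv_text is introduced here for illustration.

```python
import io
import pandas

# In-memory stand-in for search_terms.csv (same header and rows as shown above).
csv_text = "SearchTerm,Value\nTesco|Sainsbury,Groceries\nDentist|Cinema,Stuff\n"

df_search = pandas.read_csv(io.StringIO(csv_text))

# Each SearchTerm becomes a key and its Value becomes the category to assign.
search_dict = df_search.set_index('SearchTerm')['Value'].to_dict()
print(search_dict)
```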

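Now initialize the Category column:

# None gives an object column, so string categories can be assigned to it directly.
df['Category'] = None

and update Category in place efficiently with loc, for example:

# str.contains finds the term anywhere in Description; str.match would only match at the start.
for k in search_dict:
    df.loc[df['Description'].str.contains(k), 'Category'] = search_dict[k]

The output df: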

    Date        Transaction Type    Description Amount  Category
0   01 Mar 2022 Contactless payment Tesco Store 156.00  Groceries
1   02 Apr 2022 Payment to          Dentist     55      Stuff
2   10 Apr 2022 Contactless payment Cinema      21.50   Stuff
3   15 Apr 2022 Contactless payment Sainsburys  176.10  Groceries
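Series.str.match and Series.str.contains behave differently, which matters when deciding how to test each search term against Description. The sketch below uses a hypothetical 'Local Tesco' description (not part of the original data) to show the difference.

```python
import pandas

# 'Local Tesco' is a hypothetical description where the term is not at the start.
descriptions = pandas.Series(['Tesco Store', 'Local Tesco', 'Sainsburys'])
pattern = 'Tesco|Sainsbury'

# str.match only succeeds when the pattern matches at the start of the string.
starts_with = descriptions.str.match(pattern)

# str.contains succeeds when the pattern matches anywhere in the string.
anywhere = descriptions.str.contains(pattern)

print(pandas.DataFrame({
    'Description': descriptions,
    'match': starts_with,
    'contains': anywhere,
}))
```

With the sample data from the question both methods give the same categories, because every search term happens to appear at the start of its description.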

A uj5u.com user replied:

I found this to be faster than Gam's answer, and in my opinion the code is cleaner:

category_dict = {'Groceries':
                     ['Tesco', 'Sainsbury'],
                 'Stuff':
                     ['Dentist', 'Cinema']
                     }
def get_category(description):
    for category, substrings in category_dict.items():
        for substring in substrings:
            if substring in description:
                return category
df['Category'] = df['Description'].apply(get_category)

If you want the substrings as keys, there is this:

category_dict ={'Tesco':'Groceries',
                'Sainsbury':'Groceries',
                'Dentist':'Stuff',
                'Cinema':'Stuff'
                     }

def get_category(description):
    for substring in category_dict:
        if substring in description:
            return category_dict[substring]
df['Category'] = df['Description'].apply(get_category)
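The substring-keyed dictionary also lends itself to a vectorised variant that avoids calling a Python function per row. The following is a sketch of that idea, not something proposed in the answers: it joins the keys into one regular expression, extracts the matched term, and maps it to a category. The 'Bookshop' row is a hypothetical unmatched example.

```python
import re
import pandas

category_dict = {
    'Tesco': 'Groceries',
    'Sainsbury': 'Groceries',
    'Dentist': 'Stuff',
    'Cinema': 'Stuff',
}

df = pandas.DataFrame({
    # 'Bookshop' is a hypothetical row that matches no key.
    'Description': ['Tesco Store', 'Dentist', 'Cinema', 'Sainsburys', 'Bookshop'],
})

# Build a single alternation with one capture group: (Tesco|Sainsbury|Dentist|Cinema).
# re.escape keeps any special characters in the keys from being read as regex syntax.
pattern = '(' + '|'.join(re.escape(term) for term in category_dict) + ')'

# expand=False returns a Series with the first matched term, or NaN when nothing matches.
matched_term = df['Description'].str.extract(pattern, expand=False)

# Translate each matched term into its category; unmatched rows stay missing.
df['Category'] = matched_term.map(category_dict)
print(df)
```

One behavioural difference to be aware of: when a description contains several keys, this version picks the key that occurs earliest in the text, whereas the get_category loop picks the first key in dictionary order.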

