How do I update a trained IsolationForest model with a new dataset/dataframe in Python?

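Suppose I fit the IsolationForest() algorithm from scikit-learn on a time-series based Dataset1, or dataframe df1, and save the model using the methods mentioned here and here. Now I want to update my model for a new Dataset2, or df2.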

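My findings:

This workaround on incremental learning from sklearn:

...learning incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning, as it guarantees that at any given time only a small number of instances are held in main memory. Choosing a good mini-batch size that balances relevancy and memory footprint could involve tuning.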

But sadly, the IF algorithm doesn't support estimator.partial_fit(newdf).

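auto-sklearn offers refit(), but based on this post it is not suitable for my case either.

How can I update an IF model that was trained on Dataset1 and saved, using the new Dataset2?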

Reply from a uj5u.com user:

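You can simply call the estimator's .fit() again on the new data.

This is the preferred approach, especially for time series, because the signal changes over time and you do not want older, non-representative data to be treated as potentially normal (or anomalous).

If the old data is important, you can simply join the older training data and the newer input signal data together, and then call .fit() again.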
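
For data held in pandas DataFrames, as in the question, the joining step could look roughly like the sketch below. The random data, the column names f0..f9 and the random_state values are placeholders chosen for illustration, not anything from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(10)]  # hypothetical feature names

# Stand-ins for the old and new data; both must share the same columns.
df1 = pd.DataFrame(rng.integers(1, 100, size=(100, 10)), columns=cols)
df2 = pd.DataFrame(rng.integers(1, 500, size=(1000, 10)), columns=cols)

# Stack the rows of both frames and retrain from scratch on the result.
combined = pd.concat([df1, df2], ignore_index=True)
model = IsolationForest(random_state=0).fit(combined)

# predict() returns 1 for inliers and -1 for outliers.
labels = model.predict(df2)
```

Note that .fit() always rebuilds the whole forest, so the cost grows with the size of the combined data.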

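Also, as a side note: according to the sklearn documentation, joblib is a better choice than pickle for saving the model.

An MRE, with resources, is below: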

# Model
from sklearn.ensemble import IsolationForest

# Saving file
import joblib

# Data
import numpy as np

# Create a new model
model = IsolationForest()

# Generate some old data
df1 = np.random.randint(1,100,(100,10))
# Train the model
model.fit(df1)

# Save it off
joblib.dump(model, 'isf_model.joblib')

# Load the model
model = joblib.load('isf_model.joblib')

# Generate new data
df2 = np.random.randint(1,500,(1000,10))

# If the original data is now not important, I can just call .fit() again.
# If you are using time-series based data, this is preferred, as older data may not be representative of the current state
model.fit(df2)

# If the original data is important, I can simply join the old data to new data. There are multiple options for this:
# Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
# Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

combined_data = np.concatenate((df1, df2))
model.fit(combined_data)
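
A possible middle ground, which is not part of the answer above, is the warm_start parameter that IsolationForest has offered since scikit-learn 0.21. With warm_start=True, raising n_estimators and calling .fit() again keeps the existing trees and trains only the additional ones on the data passed in. The sketch below assumes that behaviour; the tree counts (100, then 150) and the random data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df1 = rng.integers(1, 100, size=(100, 10))    # stand-in for old data
df2 = rng.integers(1, 500, size=(1000, 10))   # stand-in for new data

# warm_start=True tells fit() to keep the trees it already has.
model = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
model.fit(df1)

# Raise n_estimators, then fit on the new data: only the 50 extra trees
# are trained, and they see df2 alone. The first 100 trees are unchanged.
model.set_params(n_estimators=150)
model.fit(df2)

print(len(model.estimators_))
```

The resulting forest mixes trees built on old and new data, so older patterns still influence the anomaly scores. If the old data is no longer representative, a plain refit on the new data remains the simpler choice.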

Tags: python, machine-learning, scikit-learn, isolation-forest, online-machine-learning