使用包含標頭的新資料（如pd.Dataframe）使用Azure機器學習進行預測-有解無憂

我的問題在某種程度上與https://docs.microsoft.com/en-us/answers/questions/217305/data-input-format-call-the-service-for-azure-ml-ti.html有關- 但是，提供的解決方案似乎不起作用。

我正在使用心臟病資料集構建一個簡單的模型，但我將它包裝到 Pipeline 中，因為我使用了一些特征化步驟（縮放、編碼等）。下面的完整腳本：

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import pickle

# data input
df = pd.read_csv('heart.csv')

# numerical variables
num_cols = ['age',
            'trestbps',
            'chol',
            'thalach',
            'oldpeak'
]

# categorical variables
cat_cols = ['sex',
            'cp',
            'fbs',
            'restecg',
            'exang',
            'slope',
            'ca',
            'thal']

# changing format of the categorical variables
df[cat_cols] = df[cat_cols].apply(lambda x: x.astype('object'))

# target variable
y = df['target']

# features
X = df.drop(['target'], axis=1)

# data split:

# random seed
np.random.seed(42)

# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y)

# double check
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# pipeline for numerical data
num_preprocessing = Pipeline([('num_imputer', SimpleImputer(strategy='mean')), # imputing with mean
                                                   ('minmaxscaler', MinMaxScaler())]) # scaling

# pipeline for categorical data
cat_preprocessing = Pipeline([('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')), # filling missing values
                                                ('onehot', OneHotEncoder(drop='first', handle_unknown='error'))]) # One Hot Encoding

# preprocessor - combining pipelines
preprocessor = ColumnTransformer([
                                  ('categorical', cat_preprocessing, cat_cols),
                                  ('numerical', num_preprocessing, num_cols)
                                                           ])

# initial model parameters
log_ini_params = {'penalty': 'l2', 
                  'tol': 0.0073559740277086005, 
                  'C': 1.1592424247511928, 
                  'fit_intercept': True, 
                  'solver': 'liblinear'}

# model - Pipeline
log_clf = Pipeline([('preprocessor', preprocessor),
                  ('clf', LogisticRegression(**log_ini_params))])

log_clf.fit(X_train, y_train)

# dumping the model
f = 'model/log.pkl'
with open(f, 'wb') as file:
    pickle.dump(log_clf, file)

# loading it
loaded_model = joblib.load(f)

# double check on a single datapoint
new_data = pd.DataFrame({'age': 71,
                         'sex': 0,
                         'cp': 0,
                         'trestbps': 112,
                         'chol': 203,
                         'fbs': 0,
                         'restecg': 1,
                         'thalach': 185,
                         'exang': 0,
                         'oldpeak': 0.1,
                         'slope': 2,
                         'ca': 0,
                          'thal': 2}, index=[0])

loaded_model.predict(new_data)

...它作業得很好。然后我使用以下步驟將模型部署到 Azure Web 服務：

我創建 score.py 檔案

import joblib
from azureml.core.model import Model
import json

def init():
    global model
    model_path = Model.get_model_path('log') # logistic
    print('Model Path is  ', model_path)
    model = joblib.load(model_path)


def run(data):
    try:
        data = json.loads(data)
        result = model.predict(data['data'])
        # any data type, as long as it is JSON serializable.
        return {'data' : result.tolist() , 'message' : 'Successfully classified heart diseases'}
    except Exception as e:
        error = str(e)
        return {'data' : error , 'message' : 'Failed to classify heart diseases'}

我部署模型：

from azureml.core import Workspace
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()

model = Model.register(workspace = ws,
              model_path ='model/log.pkl',
              model_name = 'log',
              tags = {'version': '1'},
              description = 'Heart disease classification',
              )

# to install required packages
env = Environment('env')
cd = CondaDependencies.create(pip_packages=['pandas==1.1.5', 'azureml-defaults','joblib==0.17.0'], conda_packages = ['scikit-learn==0.23.2'])
env.python.conda_dependencies = cd

# Register environment to re-use later
env.register(workspace = ws)
print('Registered Environment')

myenv = Environment.get(workspace=ws, name='env')

myenv.save_to_directory('./environ', overwrite=True)

aciconfig = AciWebservice.deploy_configuration(
            cpu_cores=1,
            memory_gb=1,
            tags={'data':'heart disease classifier'},
            description='Classification of heart diseases',
            )

inference_config = InferenceConfig(entry_script='score.py', environment=myenv)

service = Model.deploy(workspace=ws,
                name='hd-model-log',
                models=[model],
                inference_config=inference_config,
                deployment_config=aciconfig, 
                overwrite = True)

service.wait_for_deployment(show_output=True)
url = service.scoring_uri
print(url)

部署很好：

成功 ACI 服務創建操作完成，操作“成功”

但我無法用新資料做出任何預測。我嘗試使用：

import pandas as pd

new_data = pd.DataFrame([[71, 0, 0, 112, 203, 0, 1, 185, 0, 0.1, 2, 0, 2],
                         [80, 0, 0, 115, 203, 0, 1, 185, 0, 0.1, 2, 0, 0]],
                         columns=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'])

按照這個主題的答案（https://docs.microsoft.com/en-us/answers/questions/217305/data-input-format-call-the-service-for-azure-ml-ti.html）我轉換資料：

test_sample = json.dumps({'data': new_data.to_dict(orient='records')})

并嘗試做出一些預測：

import json
import requests
data = test_sample
headers = {'Content-Type':'application/json'}
r = requests.post(url, data=data, headers = headers)
print(r.status_code)
print(r.json())

但是，我遇到一個錯誤：

200 {'data': "預期的二維陣列，得到一維陣列：\narray=[{'age': 71, 'sex': 0, 'cp': 0, 'trestbps': 112, 'chol': 203 , 'fbs': 0, 'retecg': 1, 'thalach': 185, 'exang': 0, 'oldpeak': 0.1, 'slope': 2, 'ca': 0, 'thal': > 2} \n {'age': 80, 'sex': 0, 'cp': 0, 'trestbps': 115, 'chol': 203, 'fbs': 0, 'retecg': 1, 'thalach': 185 , 'exang': 0, 'oldpeak': 0.1, 'slope': 2, 'ca': 0, 'thal': 0}].\n如果使用 array.reshape(-1, > 1) 重新整形你的資料如果資料包含單個樣本，則您的資料具有單個特征或 array.reshape(1, -1)。", 'message': '未能對心臟病進行分類'}

如何將輸入資料調整為這種形式的預測并添加其他輸出，如 predict_proba 以便我可以將它們存盤在單獨的輸出資料集中？

我知道這個錯誤與 score.py 檔案的“運行”部分或呼叫 web 服務的最后一個代碼單元有關，但我找不到它。

非常感謝一些幫助。

uj5u.com熱心網友回復：

我相信我設法解決了這個問題——即使我遇到了一些嚴重的問題。:)

如此處所述- 我編輯了score.py腳本：

import joblib
from azureml.core.model import Model
import numpy as np
import json
import pandas as pd
import numpy as np

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType
    
data_sample = PandasParameterType(pd.DataFrame({'age': pd.Series([0], dtype='int64'),
                                                'sex': pd.Series(['example_value'], dtype='object'),
                                                'cp': pd.Series(['example_value'], dtype='object'),
                                                'trestbps': pd.Series([0], dtype='int64'),
                                                'chol': pd.Series([0], dtype='int64'),
                                                'fbs': pd.Series(['example_value'], dtype='object'),
                                                'restecg': pd.Series(['example_value'], dtype='object'),
                                                'thalach': pd.Series([0], dtype='int64'),
                                                'exang': pd.Series(['example_value'], dtype='object'),
                                                'oldpeak': pd.Series([0.0], dtype='float64'),
                                                'slope': pd.Series(['example_value'], dtype='object'),
                                                'ca': pd.Series(['example_value'], dtype='object'),
                                                'thal': pd.Series(['example_value'], dtype='object')}))

input_sample = StandardPythonParameterType({'data': data_sample})
result_sample = NumpyParameterType(np.array([0]))
output_sample = StandardPythonParameterType({'Results':result_sample})

def init():
    global model
    # Example when the model is a file
    model_path = Model.get_model_path('log') # logistic
    print('Model Path is  ', model_path)
    model = joblib.load(model_path)

@input_schema('Inputs', input_sample)
@output_schema(output_sample)
def run(Inputs):
    try:
        data = Inputs['data']
        result = model.predict_proba(data)
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error

在部署步驟中，我調整了CondaDependencies：

# to install required packages
env = Environment('env')
cd = CondaDependencies.create(pip_packages=['pandas==1.1.5', 'azureml-defaults','joblib==0.17.0', 'inference-schema==1.3.0'], conda_packages = ['scikit-learn==0.22.2.post1'])
env.python.conda_dependencies = cd
# Register environment to re-use later
env.register(workspace = ws)
print('Registered Environment')

作為

a) 必須包含inference-schema在檔案中 b)由于這個問題Dependencies，我降級scikit-learn到版本scikit-learn==0.22.2.post1

現在，當我為模型提供新資料時：

new_data = {
  "Inputs": {
    "data": [
      {
        "age": 71,
        "sex": "0",
        "cp": "0",
        "trestbps": 112,
        "chol": 203,
        "fbs": "0",
        "restecg": "1",
        "thalach": 185,
        "exang": "0",
        "oldpeak": 0.1,
        "slope": "2",
        "ca": "0",
        "thal": "2"
      }
    ]
  }
}

并將其用于預測：

import json
import requests
data = new_data
headers = {'Content-Type':'application/json'}
r = requests.post(url, str.encode(json.dumps(data)), headers = headers)
print(r.status_code)
print(r.json())

我得到：

200 [[0.02325369841858338, 0.9767463015814166]]

嗚！也許有人會從我痛苦的學習之路中受益！:)

uj5u.com熱心網友回復：

主要問題是分類變數的轉換。處理分類變數的傳統方法是使用OneHotEncoder

# changing format of the categorical variables
df[cat_cols] = df[cat_cols].apply(lambda x: x.astype('object'))

轉換資料需要如下所述應用：

from sklearn.preprocessing import MinMaxScaler
cat_col =['sex',
            'cp',
            'fbs',
            'restecg',
            'exang',
            'slope',
            'ca',
            'thal']

df_2 = pd.get_dummies(data[cat_col], drop_first = True)

[0,1] 將在應用啞元后形成，然后

new_data = pd.DataFrame([[71, 0, 0, 112, 203, 0, 1, 185, 0, 0.1, 2, 0, 2],
                         [80, 0, 0, 115, 203, 0, 1, 185, 0, 0.1, 2, 0, 0]],
                         columns=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'])

這可以通過對語法的較少更改來應用。

編輯：

new_data = {
  "Inputs": {
    "data": [
      {
        "age": 71,
        "sex": "0",
        "cp": "0",
        "trestbps": 112,
        "chol": 203,
        "fbs": "0",
        "restecg": "1",
        "thalach": 185,
        "exang": "0",
        "oldpeak": 0.1,
        "slope": "2",
        "ca": "0",
        "thal": "2"
      }
    ]
  }
}

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/482353.html

標籤：json python-3.x 机器学习天蓝色机器学习工作室天蓝色机器学习服务

上一篇：機器學習演算法評估的指標

下一篇：在活動AndroidStudioJava上顯示文本和json