SageMaker hyperparameter tuning for LDA: clarifying feature_dim

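I am trying to run a HyperparameterTuner on an Estimator for an LDA model in a SageMaker notebook using mxnet, but my code fails with an error related to the feature_dim hyperparameter. I believe this is caused by the training and test datasets having different dimensions, but I am not 100% sure that is the case, or how to fix it.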

Estimator code

[Note that I set feature_dim to the dimension of the training dataset]

vocabulary_size = doc_term_matrix_train.shape[1]

lda = sagemaker.estimator.Estimator(
        container,
        role,
        output_path="s3://{}/{}/output".format(bucket, prefix),
        train_instance_count=1,
        train_instance_type="ml.c4.2xlarge",
        sagemaker_session=session
        )

lda.set_hyperparameters(
    mini_batch_size=40,
    feature_dim=vocabulary_size,
    )

Hyperparameter tuning job

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# s3_input_train and s3_input_test hold the doc-term matrices of the train/test corpus

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_test ='s3://{}/{}/test/'.format(bucket, prefix)
data_channels = {'train': s3_input_train, 'test': s3_input_test}

hyperparameter_ranges = {
    "alpha0": ContinuousParameter(0.1, 1.5, scaling_type="Logarithmic"),
    "num_topics":IntegerParameter(3, 10)}

# Configure HyperparameterTuner
my_tuner = HyperparameterTuner(estimator=lda,  
                               objective_metric_name='test:pwll',
                               hyperparameter_ranges=hyperparameter_ranges,
                               max_jobs=5,
                               max_parallel_jobs=2)

# Start hyperparameter tuning job
my_tuner.fit(data_channels, job_name='run-3', include_cls_metadata=False)

CloudWatch logs

When I run the code above, the tuning fails. Looking at the logs in CloudWatch, the error is usually:

[01/19/2022 19:42:22 ERROR 140234465695552] Algorithm Error: index 11873 is out of bounds for axis 1 with size 11873 (caused by IndexError)

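I copied the message above because 11873 is the number of features in my test dataset, so I think there is a connection, but I am not sure exactly what is happening. When I tried 11873 as the value of feature_dim, the error complained that the data has 32465 features (which corresponds to the training set). Using the sum of the two values also produces an error: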

[01/20/2022 13:44:01 ERROR 140125082621760] Customer Error: The supplied feature_dim parameter is different from the dimension of the data. (feature_dim) 44338 != 32465 (data).

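Finally, the last log entry in CloudWatch reports the following, which suggests that "all data" is being loaded into a matrix with the dimensions of the test data: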

[01/20/2022 14:49:52 INFO 140411440904000] Loaded all data into matrix with shape: (11, 11873)

Given separate test and training datasets, how should feature_dim be defined?

Reply from a uj5u.com user:

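I have solved this. The problem was that I split the data into test and training sets before converting it into doc-term matrices. That produced test and training datasets with different dimensions, which caused SageMaker's algorithm to fail. Once I converted all of the input data into a single doc-term matrix and then split it into test and training sets, the hyperparameter tuning job completed.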
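The order of operations described above can be illustrated with a minimal sketch. It is not the original poster's code: the toy corpus, the helper functions build_vocabulary and to_doc_term_matrix, and the split point are all hypothetical, and it uses plain Python lists rather than whatever vectorizer was used in the notebook. The likely reason the original approach failed is that vectorizing each split separately gives each split its own vocabulary, and therefore its own column count.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map every distinct token in the corpus to a column index (hypothetical helper)."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_doc_term_matrix(tokenized_docs, vocab):
    """Return one row of term counts per document, with len(vocab) columns."""
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(token for token in doc if token in vocab)
        row = [0] * len(vocab)
        for token, count in counts.items():
            row[vocab[token]] = count
        matrix.append(row)
    return matrix


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["topic", "model", "lda"],
    ["sagemaker", "tuning", "job"],
    ["lda", "feature", "dimension"],
    ["train", "test", "split"],
]

# 1. Build the vocabulary and the matrix from the WHOLE corpus.
vocab = build_vocabulary(corpus)
doc_term_matrix = to_doc_term_matrix(corpus, vocab)

# 2. Only then split the rows into train and test sets.
split_at = 3  # assumed split point for this toy example
doc_term_matrix_train = doc_term_matrix[:split_at]
doc_term_matrix_test = doc_term_matrix[split_at:]

# 3. Both splits now share one column count, which is the value for feature_dim.
vocabulary_size = len(vocab)
train_dim = len(doc_term_matrix_train[0])
test_dim = len(doc_term_matrix_test[0])
assert train_dim == test_dim == vocabulary_size
```

With both matrices built this way, the value passed as feature_dim matches the data in the train channel and in the test channel. An alternative that should give the same guarantee is to build the vocabulary on the training documents only and reuse that same vocabulary to vectorize the test documents.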

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/420791.html
