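I am trying to debug my TensorFlow code, which suddenly produces a NaN loss after about 30 epochs. You can find my specific problem and the things I have tried in this SO question.
I monitored the weights of all layers for every mini-batch during training and found that the weights suddenly jump to NaN, even though all weight values were below 1 in the previous iteration (I set max_norm to 1 in kernel_constraint). This makes it hard to work out which operation is the culprit.
PyTorch has a handy debugging method, torch.autograd.detect_anomaly, which raises an error and shows a traceback for any backward computation that produces NaN values. This makes the code easy to debug.
Is there something similar in TensorFlow? If not, can you suggest a way to debug this?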
Answer:
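There is indeed a similar debugging tool in TensorFlow. See tf.debugging.check_numerics.
It can be used to track down tensors that produce inf or nan values during training. As soon as such a value is found, TensorFlow raises an InvalidArgumentError.
tf.debugging.check_numerics(LayerN, "LayerN is producing nans!")
If the tensor LayerN contains NaNs, you will get an error like this: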
Traceback (most recent call last):
  File "trainer.py", line 506, in <module>
    worker.train_model()
  File "trainer.py", line 211, in train_model
    l, tmae = train_step(*batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LayerN is producing nans! : Tensor had NaN values
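For reference, here is a minimal sketch of how the check could be wired into a computation, assuming TensorFlow 2.x in eager mode. The function checked_log and its inputs are made up for illustration and are not taken from the asker's model. tf.debugging.check_numerics returns its input unchanged when every value is finite, so it is probably safest to keep using the returned tensor rather than discarding it.

```python
import tensorflow as tf


def checked_log(x):
    """Hypothetical op that can produce NaN, wrapped with a numerics check."""
    # tf.math.log gives NaN for negative inputs and -inf for zero.
    y = tf.math.log(x)
    # Returns y unchanged if all values are finite; otherwise raises
    # tf.errors.InvalidArgumentError carrying the message below.
    return tf.debugging.check_numerics(y, "checked_log is producing nans!")


# Finite inputs pass through the check untouched.
ok = checked_log(tf.constant([1.0, 2.0]))

# A negative input makes the log NaN, so the check raises.
try:
    checked_log(tf.constant([1.0, -1.0]))
    caught = None
except tf.errors.InvalidArgumentError as e:
    caught = e.message
    print(caught)
```

Placing one such call after each suspicious layer output narrows down where the first NaN appears. TensorFlow 2.x also offers tf.debugging.enable_check_numerics(), which applies this kind of check to every op without manual calls and may be closer to what detect_anomaly does; I have not tried it on this particular model.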
