Unschedulable GPU workload on GKE from a node pool

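I am running a GPU-intensive workload on demand on GKE Standard, where I have created an appropriate node pool with a minimum of 0 and a maximum of 5 nodes. However, when a Job is scheduled on the node pool, GKE reports the following error: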

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   59s (x2 over 60s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  58s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 1 in backoff after failed scale-up

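I have configured nodeSelector according to the documentation and enabled autoscaling. I can confirm that it does find the node pool, despite the "didn't match Pod's node affinity/selector" error, and that it tries to scale up the cluster. But shortly afterwards it fails, saying 0/1 nodes are available? That is completely wrong, since 0 of the 5 nodes in the node pool are in use. What am I doing wrong here?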

A uj5u.com user replied:

1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate...

Try adding tolerations to your Job's pod spec:

...
spec:
  containers:
  - name: ...
    ...
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal   # "Exists" is only valid when value is omitted
    value: present
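
Put together, a Job manifest for a tainted GPU node pool might look like the sketch below. The Job name `gpu-job`, the pool name `gpu-pool`, the container image and the file name are placeholders that do not come from the question; `cloud.google.com/gke-nodepool` is the label GKE sets on nodes with their pool name, and `nvidia.com/gpu` is the resource name used for NVIDIA GPUs. Treat it as a starting point under those assumptions, not as the asker's actual configuration.

```shell
#!/bin/sh
# Write a hypothetical Job manifest; every name below is a placeholder.
cat > gpu-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                 # hypothetical Job name
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: gpu-pool   # assumed node pool name
      tolerations:
      - key: nvidia.com/gpu     # matches the taint from the event message
        operator: Equal
        value: present          # effect omitted, so any effect is tolerated
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1   # request one GPU
EOF

# Validate client-side only when kubectl is installed; nothing is created.
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply --dry-run=client -f gpu-job.yaml \
    || echo "validation failed or no cluster access" >&2
fi
```

The nodeSelector here targets the pool by its built-in label instead of a custom one, which avoids having to label nodes by hand.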

A uj5u.com user replied:

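Regarding the node(s) didn't match Pod's node affinity/selector message: you did not share the details of your manifest file, but suppose it contains the following lines: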

nodeSelector:
  nodePool: cluster

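One option is to delete these lines from the YAML file. Another is to add nodePool: cluster as a label on all nodes, so that the Pod's selector matches and it can be scheduled. The following command may be useful: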

kubectl label nodes <your node name> nodePool=cluster
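
Before relabeling nodes or editing the manifest, it may help to see which labels the nodes actually carry. The helper functions below are a small sketch wrapping standard `kubectl get nodes` options; the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; only the kubectl flags themselves are standard.

# Print every node together with all of its labels.
show_node_labels() {
  kubectl get nodes --show-labels
}

# Print the names of nodes matching a label selector such as nodePool=cluster.
nodes_matching() {
  kubectl get nodes -l "$1" -o name
}

# Example: nodes_matching nodePool=cluster
```

A pool that is scaled down to zero has no nodes to list or label, so an empty result is expected there.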

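Regarding the 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate message, you can follow @gohm'c's suggestion, or you can remove the taint from the node so that your pod can be scheduled on it. Note that the command below, as written, removes the node-role.kubernetes.io/master taint from a master node; to remove the taint named in the message, the key would be nvidia.com/gpu instead: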

kubectl taint nodes  <your node name> node-role.kubernetes.io/master-
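
To confirm which taints are actually present before removing or tolerating anything, the node taints can be listed first. This is a minimal sketch: the function names are hypothetical, and the node name is whatever `kubectl get nodes` reports.

```shell
#!/bin/sh
# Hypothetical helpers around standard kubectl commands.

# List each node with the keys of its taints.
list_node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Remove every taint with the given key from one node (note the trailing dash).
remove_taint() {
  node="$1"
  key="$2"
  kubectl taint nodes "$node" "${key}-"
}

# Example: remove_taint my-gpu-node nvidia.com/gpu
```

On an autoscaled pool, newly created nodes would likely come up with the pool's taints again, so a toleration in the pod spec is usually the more durable fix.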

You can use the following threads as references; they contain information from real cases: "Error : FailedScheduling : nodes did not match node selector" and "Node had taints that the pod didn't tolerate error".

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/383014.html

Tags: Kubernetes, google-cloud-platform, gpu, google-kubernetes-engine
