[Paper Archaeology] The Founding Work of Federated Learning: Communication-Efficient Learning of Deep Networks from Decentralized Data

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Apr. 2017, pp. 1273–1282.

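Federated Learning

Characteristics

unbalanced and non-IID data: data heterogeneity is the defining characteristic of FL.
massively distributed: the number of clients is larger than the average number of examples per client.
limited communication (client availability): clients may be offline or on slow or expensive connections.

Advantage: communication-efficient

Here communication-efficient does not mean that sending only parameter updates costs less than sending the full dataset or the whole network structure. It means that, compared with synchronous SGD (which merges parameters after a single training step over all local data, and was the SOTA for data-center training at the time), the target accuracy is reached in far fewer communication rounds (10 to 100 times fewer).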

our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model

As for not transmitting local data, the authors stress privacy protection, not savings in communication cost.

Core algorithm: FedAvg
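
A minimal sketch of the FedAvg loop, for orientation only. It assumes a least-squares model in NumPy instead of the paper's neural networks, and the names `client_update` and `fedavg` are hypothetical. The hyperparameters follow the paper's notation: `C` is the fraction of clients selected per round, `epochs` is E, `batch_size` is B. Normalizing the weights over the selected clients (rather than over all clients) is an assumption of this sketch.

```python
import numpy as np

def client_update(w, X, y, epochs, batch_size, lr, rng):
    """Run E local epochs of minibatch SGD on one client's least-squares loss."""
    w = w.copy()
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the local data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg(clients, w0, rounds, C, epochs, batch_size, lr, seed=0):
    """clients is a list of (X_k, y_k) pairs; returns the final global weights."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    K = len(clients)
    m = max(int(C * K), 1)  # number of clients sampled each round
    for _ in range(rounds):
        selected = rng.choice(K, size=m, replace=False)
        n_sel = sum(len(clients[k][0]) for k in selected)
        w_next = np.zeros_like(w)
        for k in selected:
            X, y = clients[k]
            w_k = client_update(w, X, y, epochs, batch_size, lr, rng)
            # Weight each client model by its share of the selected data
            # (assumption of this sketch: normalize over selected clients).
            w_next += (len(X) / n_sel) * w_k
        w = w_next
    return w
```

Setting `epochs=1` and `batch_size` to the full local dataset size makes each client take a single full-batch gradient step per round, which is the FedSGD baseline the notes compare against.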

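Notable insights

Each update targets only the current model, so exploiting the correlation between consecutive updates is not recommended.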

Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.
More participating clients per round is not always better; there is a trade-off between performance and communication.

We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point.
FedAvg is highly robust; the authors conjecture that averaging acts as a regularizer similar to dropout.

averaging provides any advantage (vs. actually diverging) when we naively average the parameters of models trained on entirely different pairs of digits. Thus, we view this as strong evidence for the robustness of this approach

We conjecture that in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout [36]
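As long as the batch size is large enough to keep the client hardware fully utilized, lowering it does not noticeably increase computation time.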

As long as B is large enough to take full advantage of available parallelism on the client hardware, there is essentially no cost in computation time for lowering it, and so in practice this should be the first parameter tuned.

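Performance gains

The method applies across several model architectures and scales.
- 2NN (an MLP with two hidden layers), about 199K parameters (199,210 in the paper); 3-layer CNN, about 1.66 million parameters
- MNIST: 100 clients; in the non-IID setting each client holds at most 2 digit classes. The CNN reaches 99% accuracy in 97 communication rounds, 10× faster than FedSGD; the 2NN reaches 97% accuracy in 380 rounds. In the IID setting, the CNN with 1200 local parameter updates per round reaches 99% accuracy in 18 rounds, a 35× reduction in communication rounds
- CIFAR-10: 100 clients, 80% accuracy in 280 rounds, 64× faster
- Shakespeare: 1146 clients, 54% accuracy, 95× faster in the non-IID setting
- Large-scale LSTM: 500,000 clients, 10 million posts, 200 clients updated per round, 10.5% accuracy, 23× faster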
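
The non-IID MNIST split mentioned above comes from the paper's "pathological" partition: sort the examples by digit label, cut them into 200 shards of 300, and give each of 100 clients 2 shards. A hedged sketch follows; the function name `pathological_non_iid` is hypothetical, and the two-digit cap is exact only when class counts are equal. With the real MNIST class counts a shard may straddle a label boundary.

```python
import numpy as np

def pathological_non_iid(labels, num_clients=100, shards_per_client=2, seed=0):
    """Sort by label, cut into equal shards, deal shards to clients at random."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")   # example indices grouped by label
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)  # contiguous runs of sorted indices
    shard_ids = rng.permutation(num_shards)     # random shard-to-client assignment
    parts = []
    for c in range(num_clients):
        mine = shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        parts.append(np.concatenate([shards[s] for s in mine]))
    return parts

# Toy stand-in for MNIST labels: 10 balanced classes of 6000 examples each.
labels = np.repeat(np.arange(10), 6000)
parts = pathological_non_iid(labels)
print(len(parts), len(parts[0]), np.unique(labels[parts[0]]))
```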
Local training uses a batch size of 10 or 50, 5 or 20 epochs, and a client fraction of 0.1.
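
The 1200 local updates quoted for the IID MNIST CNN run follow from the paper's expression for the expected number of local updates per client per round, u = E·n/(K·B), with n = 60,000 training examples and K = 100 clients. A small check of that arithmetic (the helper name `local_updates_per_round` is hypothetical):

```python
def local_updates_per_round(n, K, E, B):
    """Expected local SGD steps per client per round: u = E * (n / K) / B."""
    return E * n / (K * B)

# MNIST with 100 clients: 600 examples per client on average.
for E, B in [(1, 50), (5, 10), (20, 10)]:
    u = local_updates_per_round(60000, 100, E, B)
    print(f"E={E}, B={B}: u={u:g}")
```

Raising E or lowering B both increase u, which is the "additional computation" FedAvg trades for fewer communication rounds.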

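Open problems left for follow-up work

The paper targets training on mobile devices, so combining it with communications research is a natural step.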

the identification of the problem of training on decentralized data from mobile devices as an important research direction
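- Scheduling under unreliable communication
- A game-theoretic view that accounts for communication fees
- The effect of bit error rate and transmission rate during communication
Heterogeneous data
- How do different initial data distributions matter (every client has a different loss function)?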
  
  \(F_k\) could be an arbitrarily bad approximation to \(f\)
  
  \[f(w)=\sum_{k=1}^{K} \frac{n_{k}}{n} F_{k}(w) \quad \text { where } \quad F_{k}(w)=\frac{1}{n_{k}} \sum_{i \in \mathcal{P}_{k}} f_{i}(w) \]
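- What is the effect of data being added or removed during training?
- What is the effect of clients' data being online at different times?
- Under an unbalanced data distribution, small local datasets overfit heavily; does that really have no effect?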
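
The objective quoted in the list above weights each client's mean loss \(F_k\) by its data share \(n_k/n\), which makes \(f\) equal to the mean loss over the pooled data. A small numeric check with made-up per-example losses (the helper name `global_objective` is hypothetical):

```python
import numpy as np

def global_objective(client_losses):
    """f(w) = sum_k (n_k / n) * F_k(w), with F_k the mean loss on client k."""
    n = sum(len(losses) for losses in client_losses)
    return sum(len(losses) / n * np.mean(losses) for losses in client_losses)

# Made-up per-example losses f_i(w) for three clients of unequal size.
client_losses = [np.array([1.0, 3.0]), np.array([5.0]), np.array([0.0, 0.0, 3.0])]
pooled = np.mean(np.concatenate(client_losses))
unweighted = np.mean([np.mean(losses) for losses in client_losses])
print(global_objective(client_losses), pooled, unweighted)
```

The weighted sum matches the pooled mean, while the unweighted average of the \(F_k\) does not. The individual \(F_k\) values also differ widely from \(f\), which is the non-IID concern raised in the list.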
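Transmitting network parameters
- Transmitting only part of the network
- one-shot averaging (most likely with regularization): merge directly once training is finished
- How exactly are the degree of local overfitting and divergence related?
  - The Shakespeare LSTM diverges badly after overfitting, but the MNIST CNN does not (although more local training still makes divergence more likely)
  - For the large-scale LSTM, training with 1 epoch is faster than with 5 epochs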
  This result suggests that for some models, especially in the later stages of convergence, it may be useful to decay the amount of local computation per round (moving to smaller E or larger B) in the same way decaying learning rates can be useful.

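Assessment

Value of the paper

Novelty 100 × effectiveness 1000 × research problem 100

Why FL came about

When two models start from the same initial parameters, simply averaging their parameters after each has been trained to the point of overfitting improves model performance! Compared with distributed SGD, which uploads after every single local training step, this greatly reduces the number of communication rounds.

This finding was made in the IID setting. Simulations show a clear gain in the non-IID setting as well, though not as large as under IID, which may be an open problem worth digging into.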

Recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized NNs are surprisingly well-behaved and in particular less prone to bad local minima than previously thought [11, 17, 9].

we find that naive parameter averaging works surprisingly well

the average of these two models, \(\frac{1}{2}w+ \frac{1}{2}w^\prime\), achieves significantly lower loss on the full MNIST training set than the best model achieved by training on either of the small datasets independently.
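
To make the mixing experiment concrete, here is a toy sketch that trains two models from a shared initialization on disjoint halves of a dataset and evaluates \(\theta w + (1-\theta) w'\) on the full data. It assumes a least-squares model, so the loss is convex and the midpoint can never be worse than the worse endpoint. It does not reproduce the paper's surprising part, which is that averaging also helps for non-convex neural networks. All names and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

def gd(w, X, y, steps=200, lr=0.05):
    """Plain full-batch gradient descent on the least-squares loss."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(X)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=40)

w0 = np.zeros(5)              # shared initialization
w_a = gd(w0, X[:20], y[:20])  # trained on the first half only
w_b = gd(w0, X[20:], y[20:])  # trained on the second half only

# Sweep the mixing weight theta and report the loss on the full dataset.
for theta in np.linspace(-0.2, 1.2, 8):
    w_mix = theta * w_a + (1.0 - theta) * w_b
    print(f"theta={theta:+.1f}  full-data loss={mse(w_mix, X, y):.4f}")
```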

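Why FL became so popular

The trend of the times: huge numbers of users, growing on-device compute, increasing attention to privacy, and real practical value.
A framework simple enough that it is easy to follow, plus the Matthew effect.

Takeaways

Using proxy data on the server side is common practice (although FL does not need it), but proxy data still differs from users' real datasets.
First tune hyperparameters with many (2000) individual training runs plus proxy data.
next-word prediction is the ideal application for FL: it matches three FL traits, namely real-world data, privacy protection, and no need for extra labels.
A work does not lack novelty just because it is a direct extension of another work. Direct extensions usually either fail to apply or run against the intuition of the time. If changing one simple setting brings a significant performance gain, that is without doubt a major innovation.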
