[YARN Architecture and Implementation Made Simple] 2-1 Overview of the YARN Base Libraries

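Understanding YARN's base libraries is a prerequisite for reading the YARN source code later on. This section gives an overview of those libraries
and briefly introduces the two third-party libraries among them, Protocol Buffers and Avro: what they are and how to use them.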

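I. Main Libraries Used

Protocol Buffers: Google's open-source serialization library. It is platform-neutral, fast, and handles schema evolution well. YARN uses it in RPC communication: by default, all YARN RPC parameters are serialized and deserialized with Protocol Buffers.
Apache Avro: a serialization and RPC framework from the Hadoop ecosystem. It is platform-neutral and supports dynamic schemas (no code generation step required). Avro was originally designed to address the poor compatibility and extensibility of Hadoop's RPC.
RPC library: YARN still uses the RPC library from MRv1, but the default serialization method has been replaced with Protocol Buffers.
Service library and event library: YARN turns all of its objects into services so that they can be managed uniformly (for example, creation and destruction). Services communicate with each other through an event mechanism rather than the direct function calls used in MRv1.
State machine library: YARN uses finite state machines to describe the states of certain objects and the transitions between those states. With the state machine model, YARN's code structure is clearer and easier to follow than MRv1's.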
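
To make the last two items more concrete, here is a minimal sketch of an event-driven finite state machine in plain Java. It is not YARN's state machine library and does not use any YARN class: the `State` and `Event` names and the transition table are hypothetical, chosen only to show how "an event arrives, the object moves to the next state" can be written as a lookup table instead of nested conditionals.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical job lifecycle; the state and event names are illustrative, not YARN's. */
final class StateMachineSketch {
    enum State { NEW, RUNNING, SUCCEEDED, KILLED }
    enum Event { START, FINISH, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);

    static {
        addTransition(State.NEW, Event.START, State.RUNNING);
        addTransition(State.NEW, Event.KILL, State.KILLED);
        addTransition(State.RUNNING, Event.FINISH, State.SUCCEEDED);
        addTransition(State.RUNNING, Event.KILL, State.KILLED);
    }

    private static void addTransition(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State getState() {
        return current;
    }

    /** Applies one event and returns the new state; rejects events that are illegal in the current state. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        StateMachineSketch job = new StateMachineSketch();
        System.out.println(job.handle(Event.START));  // prints RUNNING
        System.out.println(job.handle(Event.FINISH)); // prints SUCCEEDED
    }
}
```

Because every legal transition is listed in one place, an unexpected event (for example `START` on a job that has already succeeded) fails loudly instead of silently corrupting state.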

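II. Third-Party Open-Source Libraries

(1) Protocol Buffers

1. Overview

Protocol Buffers is Google's open-source, language-neutral and platform-neutral mechanism for serializing structured data. Its compact encoding, efficiency, and compatibility-friendly design have made it widely used.
[It serves the same purpose as Java's built-in Serializable mechanism, but it is not tied to Java.]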

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

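Core features:

Language- and platform-neutral
Concise
High performance
Good compatibility

2. Installing the Environment

Using macOS as an example (for other platforms, please look up the steps yourself)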

# 1) Install with brew
brew install protobuf

# Check where it was installed
$ which protoc
/opt/homebrew/bin/protoc


# 2) Configure the environment variable
vim ~/.zshrc

# protoc (for hadoop)
export PROTOC="/opt/homebrew/bin/protoc"

source ~/.zshrc


# 3) Check the protobuf version
$ protoc --version
libprotoc 3.19.1

3. Writing a Demo

1) Create a Maven project and add the dependency

<dependencies>
  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.19.1</version>  <!-- must match the installed protoc version -->
  </dependency>
</dependencies>

2) In the project root, create the protobuf message definition file student.proto

For the proto data types and syntax, see: ProtoBuf getting-started tutorial

syntax = "proto3"; // declare this as a protobuf 3 definition file
package tutorial;

option java_package = "com.shuofxz.learning.student";  // package of the generated file
option java_outer_classname = "StudentProtos";         // name of the generated outer class

message Student {                  // the structured data being described
    string name = 1;
    int32 id = 2;
    optional string email = 3;     // optional: the field may be left unset, and presence is tracked (hasEmail())

    message PhoneNumber {          // nested message
        string number = 1;
        optional int32 type = 2;
    }

    repeated PhoneNumber phone = 4;  // repeated field
}

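3) Use the protoc tool to generate the Java class for the message (run this in the directory containing the proto file)

protoc -I=. --java_out=src/main/java student.proto

The generated StudentProtos.java appears under src/main/java/com/shuofxz/learning/student/. It contains the serialization and deserialization methods. The following class uses it to write a Student to a file and read it back: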

import com.shuofxz.learning.student.StudentProtos;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class StudentExample {
    public static void main(String[] args) {
        // Build an immutable Student message through the generated builder.
        StudentProtos.Student student1 = StudentProtos.Student.newBuilder()
                .setName("San Zhang")
                .setEmail("[email protected]")
                .setId(11111)
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("13911231231")
                        .setType(0))
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("01082345678")
                        .setType(1))
                .build();

        // Write to a file (the content is protobuf binary, despite the .txt name).
        // try-with-resources closes the stream even if writeTo fails.
        try (FileOutputStream output = new FileOutputStream("example.txt")) {
            student1.writeTo(output);
        } catch (IOException e) {
            System.out.println("Write error: " + e.getMessage());
        }

        // Read the message back from the file.
        try (FileInputStream input = new FileInputStream("example.txt")) {
            StudentProtos.Student student2 = StudentProtos.Student.parseFrom(input);
            System.out.println("Student2:" + student2);
        } catch (IOException e) {
            System.out.println("Read error: " + e.getMessage());
        }
    }
}

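That is the complete workflow for using Protocol Buffers. There is nothing difficult about it: we call a third-party serialization library to serialize an object to a file and then deserialize it back.
The only extra requirement is that the data structure must first be defined in a proto file, from which the corresponding helper class is generated.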
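
To see why the field numbers in student.proto matter, it helps to look at the bytes that end up in the file. The sketch below hand-encodes only the `name` and `id` fields of the `Student` message using the documented protobuf wire format (a tag of `(field_number << 3) | wire_type`, followed by a base-128 varint or a length-prefixed payload). It uses only the JDK and is an illustration, not the generated code: the class and method names are made up for this example, and it ignores details such as proto3 omitting default values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative hand-rolled encoder for two Student fields; not the protobuf library. */
final class ProtoWireSketch {
    static final int WIRETYPE_VARINT = 0;            // int32, int64, bool, enum, ...
    static final int WIRETYPE_LENGTH_DELIMITED = 2;  // string, bytes, nested messages

    /** Writes a base-128 varint: 7 bits per byte, low-order group first, high bit = "more bytes follow". */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** A field's tag carries its number from the .proto file plus the wire type. */
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    /** Encodes "string name = 1;" and "int32 id = 2;" from student.proto. */
    static byte[] encodeStudent(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        writeTag(out, 1, WIRETYPE_LENGTH_DELIMITED);
        writeVarint(out, nameBytes.length);
        out.write(nameBytes, 0, nameBytes.length);
        writeTag(out, 2, WIRETYPE_VARINT);
        writeVarint(out, id);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 0A 09 53 61 6E 20 5A 68 61 6E 67 10 E7 56
        System.out.println(toHex(encodeStudent("San Zhang", 11111)));
    }
}
```

Field names never appear in the output; only the numbers do. That is why a field can be renamed freely, while changing its number breaks compatibility with data already written.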

4. Use in YARN

In YARN, the parameters of all RPC functions are defined with Protocol Buffers, while the RPC layer itself is still the one from MRv1.

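(2) Apache Avro

1. Overview

Apache Avro originated as a Hadoop sub-project. It is a serialization framework that also implements RPC.
However, because Avro was still immature in the early days of the YARN project, YARN uses it only as a log serialization library: all events are serialized with Avro.

Features:

Rich data structure types;
A fast, compact binary data format;
A container file for storing persistent data;
Remote procedure call (RPC);
Simple integration with dynamic languages.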

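Compared with Apache Thrift and Google's Protocol Buffers, Apache Avro has the following characteristics:

Dynamic schemas. Avro does not require code generation, which makes it easier to build generic data processing systems and avoids intrusive generated code.
Untagged data. The schema is available before data is read, so Avro needs to keep less type information in the encoded data, which reduces the serialized size.
No manually assigned field IDs. Thrift and Protocol Buffers identify each field by a user-assigned integer, whereas Avro uses the field name directly, which is more intuitive and easier to extend.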
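
The "untagged data" point can be illustrated by encoding the same two values by hand. The sketch below follows the binary encoding described in the Avro specification (ints and longs as zig-zag varints, strings as a length followed by UTF-8 bytes, record fields concatenated in schema order) for a hypothetical record schema with a `name` string field followed by an `id` int field. It uses only the JDK, not the Avro library, and the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative hand-rolled Avro binary encoder for a hypothetical {name: string, id: int} record. */
final class AvroBinarySketch {
    /** Avro int/long: zig-zag maps signed values to unsigned, then base-128 varint, low-order group first. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63);
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80));
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    /** Avro string: byte length (encoded as a long) followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        writeLong(out, bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    /** A record is just its fields in schema order: no tags, no field numbers, no names. */
    static byte[] encodeStudent(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, id);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 12 53 61 6E 20 5A 68 61 6E 67 CE AD 01
        System.out.println(toHex(encodeStudent("San Zhang", 11111)));
    }
}
```

Nothing in these bytes says which field is which, so a reader can only decode them with the writer's schema at hand. That is the trade-off behind Avro's smaller encoding: the schema must always travel with, or be resolvable for, the data.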

2. Installing the Environment & Demo

See: Getting Started with Avro

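3. Use in YARN

Apache Avro was originally an RPC framework tailor-made for Hadoop. For stability reasons, YARN for now uses Protocol Buffers as its serialization library and keeps the RPC layer from MRv1, while Avro serves as the log serialization library. In YARN MapReduce, all events are serialized and deserialized with Avro, and the related definitions live in the Events.avpr file.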

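III. Summary

This section briefly introduced five important base libraries in YARN. Knowing them helps in understanding YARN's code logic and how data is passed around.
It also introduced the two third-party open-source libraries among them: Protocol Buffers serializes and deserializes RPC function parameters, and Avro serves as the serialization library for logs and events.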

Please credit the source when reposting. Article link: https://www.uj5u.com/qita/529904.html
