Hadoop 3.x vs. Hadoop 2.x: Version Differences and New Features
Minimum required Java version increased from Java 7 to Java 8
All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.
All Hadoop JARs are now compiled with Java 8, so a cluster that is upgrading to Hadoop 3.x must first upgrade its Java runtime to Java 8.
Support for erasure coding in HDFS
Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.
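Erasure coding stores data durably while using far less space than replication. A standard encoding such as Reed-Solomon (10,4) has a space overhead of 1.4x, against 3x for standard HDFS replication. With a Reed-Solomon (6,3) layout, for example, the 3x footprint of triple replication shrinks to 1.5x while fault tolerance stays at least as good as that of three replicas.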
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
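Because erasure coding adds overhead during reconstruction and performs mostly remote reads, it is normally used for colder, less frequently accessed data, and the network and CPU costs should be weighed before deploying it. Traditional erasure coding has a noticeable impact on performance, especially on IOPS and latency, so for now it is mostly limited to cold-data scenarios such as archiving and cloud storage.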
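The overhead figures quoted above follow from simple arithmetic: a Reed-Solomon layout with k data units and m parity units stores (k + m) / k raw bytes per byte of user data. The sketch below only illustrates that calculation; the function names are made up for this example and are not part of any Hadoop API.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for an RS(data, parity) layout."""
    return (data_units + parity_units) / data_units


def replication_overhead(replicas: int) -> float:
    """With plain replication, every byte is stored `replicas` times."""
    return float(replicas)


# Compare the two Reed-Solomon layouts mentioned in the text with 3x replication.
print(storage_overhead(10, 4))   # 1.4
print(storage_overhead(6, 3))    # 1.5
print(replication_overhead(3))   # 3.0
```

An RS(k, m) stripe can be rebuilt after losing any m of its units, which is why RS(6,3) is no less fault tolerant than three replicas.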
YARN Timeline Service v.2
We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.
In short, v.2 makes the Timeline Service more scalable and reliable, and improves usability by introducing flows and aggregation.
YARN Resource Types
The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.
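By extending YARN's resource types, resources other than CPU and memory can be scheduled, such as the increasingly popular GPUs and FPGAs, software licenses, and local storage.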
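To make the idea of countable resource types concrete, here is a minimal sketch of the check a scheduler has to make: a request fits on a node only if every resource type it names is available in sufficient quantity. The function and the resource names in the dictionaries are illustrative assumptions, not YARN's actual scheduler code or configuration keys.

```python
from typing import Dict


def fits(available: Dict[str, int], request: Dict[str, int]) -> bool:
    """Return True if every requested resource type is available in sufficient quantity.

    A resource type the node does not advertise counts as zero available.
    """
    return all(available.get(name, 0) >= amount for name, amount in request.items())


# Hypothetical node advertising one custom resource type ("gpu") besides memory and CPU.
node = {"memory-mb": 65536, "vcores": 16, "gpu": 2}

print(fits(node, {"memory-mb": 8192, "vcores": 4, "gpu": 1}))    # True
print(fits(node, {"memory-mb": 8192, "vcores": 4, "fpga": 1}))   # False: no FPGA on this node
```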
Shell script rewrite
The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.
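Some of the scripts were rewritten and a number of bugs were fixed, although the release notes do not spell out the individual changes.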
Shaded client jars
The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.
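In other words, the hadoop-client Maven artifact in 2.x drags Hadoop's transitive dependencies onto the application's classpath, which causes trouble when their versions conflict with the ones the application uses. Hadoop 3 addresses this with the shaded hadoop-client-api and hadoop-client-runtime artifacts, which bundle Hadoop's dependencies into a single jar.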
Support for Opportunistic Containers and Distributed Scheduling.
A notion of ExecutionType has been introduced, whereby Applications can now request for containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.
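To summarize: with the new ExecutionType, an application can request Opportunistic containers. Such a container can be dispatched to an NM even when no resources are free at scheduling time; it then waits in the NM's queue until resources become available. Opportunistic containers have lower priority than the default Guaranteed containers and are preempted when necessary to make room for them, which should raise cluster utilization.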
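The queueing and preemption behaviour described above can be illustrated with a toy model. This is a sketch under simplifying assumptions (a single abstract resource, FIFO queue, preempted containers are simply recorded as killed); the class and method names are hypothetical and do not correspond to YARN's NodeManager implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Container:
    name: str
    size: int                    # abstract resource units
    opportunistic: bool = False  # False means Guaranteed


@dataclass
class NodeManager:
    capacity: int
    running: List[Container] = field(default_factory=list)
    queued: Deque[Container] = field(default_factory=deque)
    preempted: List[str] = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(c.size for c in self.running)

    def submit(self, c: Container) -> str:
        if c.size <= self.free():
            self.running.append(c)
            return "running"
        if c.opportunistic:
            # No room right now: wait in the node-local queue.
            self.queued.append(c)
            return "queued"
        # Guaranteed container: check that preempting opportunistic work would make room.
        victims = [r for r in self.running if r.opportunistic]
        if c.size > self.free() + sum(v.size for v in victims):
            return "rejected"
        while c.size > self.free():
            victim = victims.pop()
            self.running.remove(victim)
            self.preempted.append(victim.name)
        self.running.append(c)
        return "running"

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c.name != name]
        # Start queued containers in FIFO order while they fit.
        while self.queued and self.queued[0].size <= self.free():
            self.running.append(self.queued.popleft())


nm = NodeManager(capacity=4)
print(nm.submit(Container("g1", 3)))                      # running
print(nm.submit(Container("o1", 1, opportunistic=True)))  # running
print(nm.submit(Container("o2", 2, opportunistic=True)))  # queued
print(nm.submit(Container("g2", 1)))                      # running, after preempting o1
nm.finish("g1")
print([c.name for c in nm.running])                       # ['g2', 'o2']
```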
MapReduce task-level native optimization
MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.
A native implementation of the map-phase output collector has been added; for shuffle-intensive jobs, performance can improve by 30% or more.
Support for more than 2 NameNodes.
The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.
However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
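In Hadoop 2.x, NameNode HA consists of one active NameNode and one standby NameNode, which removes the NameNode as a single point of failure. Hadoop 3 allows several standby NameNodes for a higher level of fault tolerance. For example, with three NameNodes and five JournalNodes, the cluster can tolerate the failure of two nodes rather than just one.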
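The numbers in this example follow from majority-quorum arithmetic: n JournalNodes stay writable as long as a majority is alive, so they survive floor((n - 1) / 2) failures, while at least one NameNode must remain to act as active. A minimal sketch of that reasoning, with function names invented for illustration:

```python
def journal_failures_tolerated(journal_nodes: int) -> int:
    """A majority quorum of n nodes survives floor((n - 1) / 2) failures."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes: int) -> int:
    """At least one NameNode must survive to serve as the active one."""
    return namenodes - 1


def cluster_failures_tolerated(namenodes: int, journal_nodes: int) -> int:
    """Node failures that can be absorbed regardless of which role they hit."""
    return min(namenode_failures_tolerated(namenodes),
               journal_failures_tolerated(journal_nodes))


print(cluster_failures_tolerated(2, 3))  # 1: the Hadoop 2.x layout
print(cluster_failures_tolerated(3, 5))  # 2: the Hadoop 3 example from the text
```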
Default ports of multiple services have been changed.
Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.
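Port changes: the default ports of some components have changed repeatedly across Hadoop 1.x, 2.x, and 3.x. Previously, the default ports of several Hadoop services fell inside the Linux ephemeral port range (32768-61000), so a service could occasionally fail to bind at startup because another application had already taken the port. The conflicting ports have been moved out of the ephemeral range; the change affects the NameNode, Secondary NameNode, DataNode, and KMS.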
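A quick way to see why the old defaults were risky is to test them against the ephemeral range quoted above. The port pairs below are the 2.x and 3.x defaults as I recall them from the Hadoop 3 release documentation; treat them as assumptions and verify against the version you actually run.

```python
# Linux ephemeral port range quoted in the text (inclusive on both ends).
EPHEMERAL_RANGE = range(32768, 61001)


def in_ephemeral_range(port: int) -> bool:
    """True if the OS may hand this port out to any outgoing connection."""
    return port in EPHEMERAL_RANGE


# (old 2.x default, new 3.x default); assumed values, check your release notes.
PORT_CHANGES = {
    "NameNode web UI": (50070, 9870),
    "Secondary NameNode web UI": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode web UI": (50075, 9864),
}

for service, (old, new) in PORT_CHANGES.items():
    print(f"{service}: {old} ephemeral={in_ephemeral_range(old)}, "
          f"{new} ephemeral={in_ephemeral_range(new)}")
```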
Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.
Hadoop can now integrate with Microsoft Azure Data Lake and Aliyun Object Storage System (OSS) as alternative Hadoop-compatible filesystems.
Intra-datanode balancer
A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.
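Data skew: a single DataNode manages multiple disks, and during normal writes the disks fill up evenly. Adding or replacing disks, however, can cause severe skew inside a DataNode. The existing HDFS balancer does not handle this case, because it deals with skew between DataNodes rather than within one.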
This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.
This case is handled by the new intra-DataNode balancing feature, which is invoked through the hdfs diskbalancer CLI.
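As a rough illustration of what balancing within a DataNode means, the sketch below computes how much data each disk would have to gain or lose for all disks to reach the node-wide average utilization. It is a simplified model with a made-up function name, not the algorithm the hdfs diskbalancer tool actually uses.

```python
from typing import List, Tuple


def balance_plan(disks: List[Tuple[int, int]]) -> List[int]:
    """Given (capacity, used) per disk, return the change in used space per disk.

    A positive value means the disk should receive data, a negative value means
    it should give data away, so that every disk ends up at the node-wide
    average utilization (rounded down to whole units).
    """
    total_capacity = sum(capacity for capacity, _ in disks)
    total_used = sum(used for _, used in disks)
    return [capacity * total_used // total_capacity - used
            for capacity, used in disks]


# An old, nearly full disk next to a freshly added, larger and empty one.
print(balance_plan([(100, 80), (300, 0)]))   # [-60, 60]
```

Both disks end up 20% full: utilization is equalized, not the absolute number of bytes per disk.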
Reworked daemon and task heap management
A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.
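Heap management for the daemons and for MapReduce tasks has been reworked. Daemon heap sizes can now be tuned automatically based on the host's memory, and the HADOOP_HEAPSIZE variable is deprecated. Configuring the heap of MapReduce tasks is also simpler: it no longer has to be specified in the task as a Java option.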
S3Guard: Consistency and Metadata Caching for the S3A filesystem client
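An optional feature has been added to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast, consistent store for file and directory metadata.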
HDFS Router-Based Federation
HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.
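Put differently, Router-Based Federation adds an RPC routing layer that presents multiple HDFS namespaces as one federated view. It resembles the existing ViewFs and HDFS Federation features, except that the mount table is managed on the server side by the routing layer instead of on the client, which makes it easier for existing HDFS clients to access a federated cluster.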
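The role of the mount table can be sketched as a longest-prefix lookup that maps a client-visible path to a namespace and a path inside it. The table contents, namespace names, and function below are hypothetical and only illustrate the idea; they are not the Router's real data structures.

```python
from typing import Dict, Tuple

# Hypothetical mount table: federated path -> (namespace, path in that namespace).
# Assumes mount points have no trailing slash and the root "/" is not mounted.
MOUNT_TABLE: Dict[str, Tuple[str, str]] = {
    "/data": ("ns1", "/data"),
    "/data/logs": ("ns2", "/logs"),
    "/user": ("ns1", "/user"),
}


def resolve(path: str, table: Dict[str, Tuple[str, str]] = MOUNT_TABLE) -> Tuple[str, str]:
    """Resolve a federated path using the longest matching mount point."""
    best = None
    for mount in table:
        # Match the mount point itself or anything below it, never "/database" for "/data".
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    namespace, target = table[best]
    return namespace, target + path[len(best):]


print(resolve("/data/logs/app.log"))  # ('ns2', '/logs/app.log')
print(resolve("/data/raw/part-0"))    # ('ns1', '/data/raw/part-0')
```

Because this lookup happens in the routing layer, clients keep using ordinary paths and need no mount table of their own.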
API-based configuration of Capacity Scheduler queue configuration
The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.
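The OrgQueue extension to the Capacity Scheduler offers a programmatic way to change configuration: it exposes a REST API that users can call to modify queue settings. This lets administrators listed in a queue's administer_queue ACL automate queue configuration management.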
