
前序:Hudi系列文章:
FlinkCDC-Hudi:Mysql資料實時入湖全攻略一:初試風云
FlinkCDC-Hudi:Mysql資料實時入湖全攻略二:Hudi與Spark整合時所遇例外與解決方案
一、背景
在生產環境中,mysql一般會配備主從庫,以實作資料備份、服務容災、讀寫分離等需要,使用FlinkCdc進行mysql資料入湖時,就不可避免地要和主從庫打交道,FlinkCDC對mysql主從庫的切換支撐到什么程度、資料庫需要怎么配置、同步程式要怎么配合操作和開發,是FlinkCDC投入生產應用前必驗專案,
本文記錄了使用FlinkCDC進行Mysql主從資料同步的主要驗證程序,以為后鑒,
二、驗證前環境準備
2.1 FlinkCDC+Hudi環境準備
FlinkCDC+Hudi環境準備具體程序參見 FlinkCDC-Hudi:Mysql資料實時入湖全攻略一:初試風云,本文不再贅述,
2.2 Mysql主從庫環境準備
準備Mysql主從庫各一臺,
主庫:192.168.2.100
從庫1:192.168.2.101
從庫2:192.168.2.102
*mysql簡易安裝命令如下:
#ubuntu安裝命令
sudo apt install mysql-server -y
#linux安裝命令
yum install mysql-server
更多的安裝應用可以參考:
Ubuntu18.04 安裝MySQL
Linux安裝Mysql*
2.2 主從庫配置
mysql安裝完成后,進行主從庫配置,mysql的默認組態檔在/etc/mysql/my.cnf,主從庫相應配置如下,
注意:這里僅提供了實作主從庫的最簡配置,僅供測驗使用,生產環境切勿直接應用些配置,
2.2.1 主庫配置
!includedir /etc/mysql/conf.d/
!includedir /etc/mysql/mysql.conf.d/
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
collation-server = utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server = utf8mb4
bind-address = 0.0.0.0
server_id = 1
log-bin = /var/lib/mysql/mysql-bin
#binlog-do-db = *
log-slave-updates
sync_binlog = 1
auto_increment_offset = 1
auto_increment_increment = 1
log_bin_trust_function_creators = 1
2.2.2 從庫安裝與配置
!includedir /etc/mysql/conf.d/
!includedir /etc/mysql/mysql.conf.d/
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
collation-server = utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server = utf8mb4
bind-address = 0.0.0.0
server_id = 2
log-bin = /var/lib/mysql/mysql-bin
log-slave-updates
sync_binlog = 0
##指定slave要復制哪個庫
replicate-do-db = flink_cdc
##MySQL主從復制的時候,當Master和Slave之間的網路中斷,但是Master和Slave無法察覺的情況下(比如防火墻或者路由問題),Slave會等待slave_net_timeout設定的秒數后,才能認為網路出現故障,然后才會重連并且追趕這段時間主庫的資料
slave-net-timeout = 60
log_bin_trust_function_creators = 1
read_only = 1
2.2.3 從庫資料初始化
新安裝的主從庫由于還沒有資料,不需要進行從庫資料初始化,
如果是已存在主庫,然后新增從庫,則需要進行資料初始化,
1、在主庫中給要同步的資料庫加鎖
mysql> use flink_cdc;
mysql> flush tables with read lock;
2、在主庫dump所有資料
mysqldump -uroot -proot_password flink_cdc> flink_cdc.sql
3、在第2步完成資料dump之后,釋放鎖
mysql> unlock tables;
4、在從庫匯入所有資料
mysql> source /path/flink_cdc.sql;
2.2.4 主庫授權
#創建slave賬號user_test,密碼user_test_password
mysql> grant select,replication slave,replication client on *.* to 'user_test'@'%' identified by 'user_test_password';
#更新資料庫權限
mysql> flush privileges;
#查看master binlog點位
### 如要進行從庫初始化,需要在unlock tables之前執行,
mysql> show master status\G;
*************************** 1. row ***************************
File: mysql-bin.000037 //當前binlog檔案
Position: 700 //當前binlog檔案位移
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
2.2.4 開啟從庫同步
#執行同步命令,設定主服務器ip,同步賬號密碼,同步位置
mysql> change master to master_host='192.168.2.100',master_user='user_test',master_password='user_test_password',master_log_file='mysql-bin.000037',master_log_pos=700;
#開啟同步功能
mysql> start slave;
#查看slave同步狀態
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 192.168.2.100
Master_User: user_test
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000037
Read_Master_Log_Pos: 700
Relay_Log_File: node-142-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql-bin.000037
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: flink_cdc
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 700
Relay_Log_Space: 154
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID:
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61167
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
show slave status中,Slave_IO_Running及Slave_SQL_Running行程必須是Yes狀態,才是正常開啟了同步,例外情況 Last_IO_Errno,Last_IO_Error, Last_SQL_Errno,Last_SQL_Error會有相應的提示,
設定完成后可以在主庫增刪資料,在從庫驗證同步狀態,這里便不再展開,
三、FlinkCDC從庫同步入湖驗證
3.1 FlinkCDC-Hudi運行環境
這里假定讀者已經搭建好FlinkCDC+Hudi環境,本文操作只記錄驗證FlinkCDC同步mysql主從資料的必要步驟,
FlinkCDC+Hudi環境準備具體程序參見 FlinkCDC-Hudi:Mysql資料實時入湖全攻略一:初試風云
3.1.1 mysql測驗表準備
在主庫中創建測驗表1
mysql> use flink_cdc;
mysql> CREATE TABLE `test_1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data` varchar(10) DEFAULT NULL,
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
## 重復執行,插入多條資料
mysql> insert into test_1(data) values('data');
3.1.2 使用FlinkSql啟動同步作業
3.1.2.1 啟動FlinkSql
FLINK_HOME/bin/yarn-session.sh -s 4 -jm 1024 -tm 2048 -nm flink-hudi-0.10 -d
FLINK_HOME/bin/sql-client.sh embedded -s yarn-session -j ./lib/hudi-flink-bundle_2.11-0.10.0.jar shell
3.1.2.2 在FlinkSql中啟動作業
Flink SQL> set execution.checkpointing.interval=30sec;
##創建flinksql mysqlcdc表
Flink SQL> create table mysql_test_1(
id bigint primary key not enforced,
data String,
create_time Timestamp(3)
) with (
'connector'='mysql-cdc',
'hostname'='192.168.2.101', --從從庫1開始同步,切換為主庫或從庫2時修改為對應的ip
'port'='3306',
'server-id'='5600-5604',
'username'='user_test',
'password'='user_test_password',
'server-time-zone'='Asia/Shanghai',
'debezium.snapshot.mode'='initial',
'database-name'='flink_cdc',
'table-name'='test_1'
)
##創建flinksql hudi表
Flink SQL> create table hudi_test_1(
id bigint,
data String,
create_time Timestamp(3),
PRIMARY KEY (`id`) NOT ENFORCED
)
with(
'connector'='hudi',
'path'='hdfs:///tmp/flink/cdcata/hudi_test_1',
'hoodie.datasource.write.recordkey.field'='id',
'hoodie.parquet.max.file.size'='268435456',
'write.precombine.field'='create_time',
'write.tasks'='1',
'write.bucket_assign.tasks'='1',
'write.task.max.size'='1024',
'write.rate.limit'='30000',
'table.type'='MERGE_ON_READ',
'compaction.tasks'='1',
'compaction.async.enabled'='true',
'compaction.delta_commits'='1',
'compaction.max_memory'='500',
'changelog.enabled'='true',
'read.streaming.enabled'='true',
'read.streaming.check.interval'='3',
'hive_sync.enable'='true',
'hive_sync.mode'='hms',
'hive_sync.metastore.uris'='thrift://hiveserver2:9083',
'hive_sync.db'='test',
'hive_sync.table'='hudi_test_1',
'hive_sync.username'='flinkcdc',
'hive_sync.support_timestamp'='true'
);
##啟動flink作業
Flink SQL> set pipeline.name = flinkcdc_test_1;
Flink SQL> set execution.checkpointing.externalized-checkpoint-retention= RETAIN_ON_CANCELLATION;
Flink SQL> insert into hudi_test_1 select * from mysql_test_1;
3.1.2.3 成功啟動作業

3.1.2.4 在FlinkSQL中驗證hudi落地的資料,
Flink SQL> select * from hudi_test_1;

3.2 模擬從庫資料同步例外
目前我們搭建的環境是一主二從,這里驗證從一例外,切換到從二,當然也可以切到主,切換驗證的流程是一樣的,但一般生產環境不會直接從主庫同步資料,
3.2.1 模擬從庫例外
這里我們直接停掉從庫1,模擬從庫宕機,在從庫1中關掉mysql,
service mysql stop
這時flink作業會掛掉,

查看flink作業的例外記錄,可以觀察到兩種例外,
一種是服務連接超時,沒有收到mysql服務端的回應,
-- The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
另一種是Flink作業不斷重啟產生的One or more fetchers have encountered exception,由于flinkcdc通過模擬slave的行為來同步binlog,每個slave都會有一個serverid(在flinksql mysql表的定義中配置,‘server-id’=‘5600-5604’),作業重啟時會使用相同的serverid,當重啟flink作業的間隔小于flink與mysql連接的超時時長會,會報這個例外,
這兩個例外的詳細例外堆疊資訊如下:
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:456)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:246)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:197)
at io.debezium.jdbc.JdbcConnection.lambda$patternBasedFactory$1(JdbcConnection.java:231)
at io.debezium.jdbc.JdbcConnection.connection(JdbcConnection.java:872)
at io.debezium.connector.mysql.MySqlConnection.connection(MySqlConnection.java:79)
at io.debezium.jdbc.JdbcConnection.connection(JdbcConnection.java:867)
at io.debezium.jdbc.JdbcConnection.query(JdbcConnection.java:550)
at io.debezium.jdbc.JdbcConnection.query(JdbcConnection.java:498)
at io.debezium.connector.mysql.MySqlConnection.querySystemVariables(MySqlConnection.java:125)
... 15 more
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.GeneratedConstructorAccessor47.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
at com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:91)
at com.mysql.cj.NativeSession.connect(NativeSession.java:144)
at com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:956)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:826)
... 25 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at com.mysql.cj.protocol.StandardSocketFactory.connect(StandardSocketFactory.java:155)
at com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:65)
... 28 more
2022-02-15 19:00:54
java.lang.RuntimeException: One or more fetchers have encountered exception
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:223)
at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:154)
at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:116)
at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294)
at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:69)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:148)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:103)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: io.debezium.DebeziumException: Error reading MySQL variables: Communications link failure
3.2.2 人為制造從庫狀態同步不一致
這時持續對主庫寫入多條資料,這樣兩個從庫的狀態就會出現不一致,
a. 通過show slave status\G查看從庫已從主庫同步的資料點位,
Master_Log_File: mysql-bin.000039
Read_Master_Log_Pos: 2759
Relay_Log_File: node-142-relay-bin.000006
Relay_Log_Pos: 2885
Relay_Master_Log_File: mysql-bin.000039
b. 這里只能查詢從庫二的狀態,因為從庫一掛了,但我們通過兩個從庫生成的binlog檔案來比較同步的狀態,可以簡單地查看binlog檔案的修改時間比較同步差異,
查詢mysql資料保存的目錄:
mysql> show variables like '%dir%';
+-----------------------------------------+----------------------------+
| Variable_name | Value |
+-----------------------------------------+----------------------------+
| datadir | /var/lib/mysql/ |
ll 資料目錄看看binlog檔案生成的狀態,
ll /var/lib/mysql/
查看從庫1的binlog狀態:
-rw-r----- 1 mysql mysql 241 Feb 15 00:10 mysql-bin.000035
-rw-r----- 1 mysql mysql 723 Feb 15 17:08 mysql-bin.000036
-rw-r----- 1 mysql mysql 2786 Feb 15 18:57 mysql-bin.000037
-rw-r----- 1 mysql mysql 488 Feb 16 10:22 mysql-bin.000038
-rw-r----- 1 mysql mysql 759 Feb 16 10:37 mysql-bin.000039
-rw-r----- 1 mysql mysql 247 Feb 16 10:29 mysql-bin.index
查看從庫2的binlog狀態:
-rw-r----- 1 mysql mysql 177 Feb 15 15:12 mysql-bin.000001
-rw-r----- 1 mysql mysql 2770 Feb 16 00:10 mysql-bin.000002
-rw-r----- 1 mysql mysql 2593 Feb 16 10:50 mysql-bin.000003
-rw-r----- 1 mysql mysql 96 Feb 16 00:10 mysql-bin.index
從庫1的mysql-bin.000039在10:37已經停止更新,而從庫二的mysql-bin.000003持續在更新,
通過mysqlbinlog來讀取檔案比較差異,筆者這里從庫一遠早于從庫二的搭建,所以從庫一的binlog檔案序號更大,
從庫一:
mysqlbinlog --start-datetime='2022-02-16 10:00:00' /var/lib/mysql/mysql-bin.000039
# at 705
#220216 10:37:15 server id 1 end_log_pos 736 CRC32 0x8523929e Xid = 183
COMMIT/*!*/;
# at 736
#220216 10:37:48 server id 2 end_log_pos 759 CRC32 0x238de33d Stop
從庫二:
mysqlbinlog --start-datetime='2022-02-16 10:00:00' --stop-datetime='2020-02-16 11:00:00' --no-defaults /var/lib/mysql/mysql-bin.000003
#220216 10:50:42 server id 1 end_log_pos 2562 CRC32 0x5e8d4964 Write_rows: table id 156 flags: STMT_END_F
BINLOG '
gmYMYhMBAAAAOwAAAM0JAAAAAJwAAAAAAAEACWZsaW5rX2NkYwAGdGVzdF8xAAMIDxEDKAAAAhD4
QQc=
gmYMYh4BAAAANQAAAAIKAAAAAJwAAAAAAAEAAgAD//gQAAAAAAAAAARkYXRhYgxmgmRJjV4=
'/*!*/;
# at 2562
#220216 10:50:42 server id 1 end_log_pos 2593 CRC32 0xbc78edd3 Xid = 486
COMMIT/*!*/;
# at 2593
#220216 11:39:28 server id 3 end_log_pos 2616 CRC32 0x3fe31a47 Stop
3.3 FlinkCDC mysql高可用第一次驗證
3.3.1 從作業checkpoint資訊中獲取最后一次的checkpoint資訊

3.3.2 切換到重庫2
把mysql_test_1的hostname切為從庫2,
Flink SQL> drop table mysql_test_1;
Flink SQL> create table mysql_test_1(
> id bigint primary key not enforced,
> data String,
> create_time Timestamp(3)
> ) with (
> 'connector'='mysql-cdc',
> 'hostname'='192.168.2.102',
> 'port'='3306',
> 'server-id'='5600-5604',
> 'username'='user_flink',
> 'password'='flink@testdb',
> 'server-time-zone'='Asia/Shanghai',
> 'debezium.snapshot.mode'='initial',
> 'database-name'='flink_cdc',
> 'table-name'='test_1'
> );
3.1.3 通過checkpoint重啟
Flink SQL> set 'execution.savepoint.path'='hdfs:///tmp/flink/checkpoints/88541fd8a08e1cee71aac55d2f39951f/chk-3'
Flink SQL> insert into hudi_test_1 select * from mysql_test_1;
這時候在flink web ui上查看新啟動的作業,發現作業重啟是失敗的,查看例外堆疊,提示binlog的檔案已經不存在,無法正常啟動作業,
The connector is trying to read binlog starting at Struct{version=1.5.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1644982889087,db=,server_id=0,file=mysql-bin.000039,pos=530,row=0}, but this is no longer available on the server. Reconfigure the connector to use a snapshot when needed.
例外資訊里提示的“file=mysql-bin.000039,pos=530”實際上是從庫一生成的binlog檔案與flinkcdc已經同步到的點位,這是checkpoint時保存下來的同步狀態,
而我們在前面的Binlog檔案分析中知道,從庫二的最新Binlog是mysql-bin.000003,這樣我們就發現從庫一和從庫二的binlog檔案不一致,是無法直接從checkpoint中進行主從切換的,
例外堆疊詳細資訊如下:
2022-02-16 11:41:29
java.lang.RuntimeException: One or more fetchers have encountered exception
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:223)
at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:154)
at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:116)
at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:294)
at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:69)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:148)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:103)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.IllegalStateException: The connector is trying to read binlog starting at Struct{version=1.5.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1644982889087,db=,server_id=0,file=mysql-bin.000039,pos=530,row=0}, but this is no longer available on the server. Reconfigure the connector to use a snapshot when needed.
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.loadStartingOffsetState(StatefulTaskContext.java:179)
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.configure(StatefulTaskContext.java:113)
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:93)
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:65)
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.checkSplitOrStartNext(MySqlSplitReader.java:147)
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.fetch(MySqlSplitReader.java:69)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:140)
... 6 more
3.4 FlinkCDC mysql高可用第二次驗證
根據最新的FlinkCDC mysql connector官方檔案,mysqlcdc是支持高可用的,為什么會驗證失敗呢?我們深挖官方檔案發現,要實作高可用,mysql要開啟gtid,
3.4.1 GTID簡介
3.4.1.1 GTID是什么
Gtid是mysql在5.7引入的新特征,用于解決使用binlog - potition機制進行主從同步時缺憾:就像我們前面驗證的一樣,無法應對高可用!
GTID (Global Transaction ID)是全域事務ID,由主庫上生成的與事務系結的唯一標識,這個標識不僅在主庫上是唯一的,在MySQL集群內也是唯一的,
GTID由 server_uuid:transaction_id組成,server_uuid是mysql實體的uuid,transaction_id代表該實體上執行的事務數量,隨事務執行的數量自增,
3.4.1.2 GTID示例
下面是一個gtid示例:
d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169
其中d667b1cd-778a-11ec-aa60-6c92bf64e18c是服務uuid,1-61169代碼該實體執行了執行為id從1到61169的61169個事務,
3.4.1.3 GTID的尋址方式
開啟gtid后,會在binlog檔案頭部加上前一個binlog檔案的gtid范圍,尋址時通過比較gtid的范圍,確認所要同步的gtid所在binlog的位置,示意如下:
mysql-binlog.00001 previous-gtids:empty
mysql-binlog.00002 previous-gtids:1-100
mysql-binlog.00003 previous-gtids:101-200
尋找gtid=50,會從最新的binlog檔案mysql-binlog.00003讀取previous-gtids:101-200,比較50與101-200,確認50比101小,則往前兩個檔案查找,即從mysql-binlog.00001查找,
補充:
更多的gtid知識可以參見:
MySQL5.7殺手級新特性:GTID原理與實戰
下面我們踏上gtid高可用驗證之路,
3.4.2 mysql開啟gtid
對應mysql的組態檔(/etc/msyql/my.cnf)里添加以下配置:
主庫配置:
gtid_mode = on
enforce_gtid_consistency = on
從庫配置:
gtid_mode = on
enforce_gtid_consistency = on
log-slave-updates = 1
修改完配置后,重啟mysql服務:
service mysql restart
通過mysql-cli連接從庫,show slave status\G查看從庫slave的同步狀態,
從庫一模擬例外時與主庫有同步延遲,開啟gtid和沒開啟gtid的binlog決議是不一樣的,這時從庫一會顯示Slave_IO_Running=No,Last_IO_Error提示決議失敗,
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate anonymous transaction when @@GLOBAL.GTID_MODE = ON, at file /var/lib/mysql/mysql-bin.000039, position 1049.; the first event 'mysql-bin.000039' at 1049, the last event read from '/var/lib/mysql/mysql-bin.000039' at 1114, the last byte read from '/var/lib/mysql/mysql-bin.000039' at 1114.'
這時可以回滾主從庫的配置,重啟從庫一,讓從庫一在關閉gtid_mode完成同步后,再更新配置,
如果在更新配置程序中,主從庫gtid_mode不是同樣的狀態,會報以下錯誤,
將狀態修改為同一樣狀態即可,
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Error: The replication receiver thread cannot start because the master has GTID_MODE = ON and this server has GTID_MODE = OFF.
順利開啟gtid mode之后,每次更新資料陳述句執行時都會生成一個gtid,我們可以通過show master/slave status查看gtid的同步狀態,
master狀態:
mysql> show master status\G;
*************************** 1. row ***************************
File: mysql-bin.000042
Position: 764
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169
從庫一狀態:
mysql> show slave status\G;
##只截取gtid相關的資訊
*************************** 1. row ***************************
Master_UUID: d667b1cd-778a-11ec-aa60-6c92bf64e18c
Retrieved_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169
Executed_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169
1 row in set (0.00 sec)
從庫二狀態:
mysql> show slave status\G;
*************************** 1. row ***************************
Master_UUID: d667b1cd-778a-11ec-aa60-6c92bf64e18c
Retrieved_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:61168-61169
Executed_Gtid_Set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169
可見主從庫資料同步都到了最新的gtid: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61169,
3.4.3 重啟作業生成含gtid的checkpoint
將mysql_test_1的Host設定為從庫一,讓作業從從庫一生成的checkpoint中恢復,作業成功運行后kill掉從庫一,這時作業會報錯,現象與前述程序一致,不再贅述,
注意:
線上環境中,如果故障沒有恢復時更多的是從新的從庫中同步,
3.4.4 從含gtid的checkpoint中恢復
我們從web ui中拿到含有gtid的checkpoint進行作業重啟,發現依然是報binlog檔案例外,
Caused by: java.lang.IllegalStateException: The connector is trying to read binlog starting at Struct{version=1.5.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1645074443471,db=,server_id=0,file=mysql-bin.000044,pos=194,row=0}, but this is no longer available on the server. Reconfigure the connector to use a snapshot when needed.
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.loadStartingOffsetState(StatefulTaskContext.java:179)
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.configure(StatefulTaskContext.java:113)
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:93)
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:65)
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.checkSplitOrStartNext(MySqlSplitReader.java:147)
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.fetch(MySqlSplitReader.java:69)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:140)
... 6 more
3.4.5 例外原始碼分析
是gtid沒有生效?查看taskmanager的日志,checkpoint中已經包含gtid資訊,也傳遞給了SourceReaderBase,
2022-02-17 13:00:54,205 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Adding split(s) to reader: [MySqlBinlogSplit{splitId='binlog-split', offset={ts_sec=0, file=mysql-bin.000044, pos=194, gtids=d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61239, row=0, event=0}, endOffset={ts_sec=0, file=, pos=-9223372036854775808, row=0, event=0}}]
再查例外堆疊里的代碼, StatefulTaskContext里loadStartingOffsetState判斷isBinlogAvailable,binlog不可用就拋例外,
private MySqlOffsetContext loadStartingOffsetState(
OffsetContext.Loader loader, MySqlSplit mySqlSplit) {
BinlogOffset offset =
mySqlSplit.isSnapshotSplit()
? BinlogOffset.INITIAL_OFFSET
: mySqlSplit.asBinlogSplit().getStartingOffset();
MySqlOffsetContext mySqlOffsetContext =
(MySqlOffsetContext) loader.load(offset.getOffset());
if (!isBinlogAvailable(mySqlOffsetContext)) {
throw new IllegalStateException(
"The connector is trying to read binlog starting at "
+ mySqlOffsetContext.getSourceInfo()
+ ", but this is no longer "
+ "available on the server. Reconfigure the connector to use a snapshot when needed.");
}
return mySqlOffsetContext;
}
而isBinlogAvailable只判斷了是否找到檔案名,而沒有判斷gtidset,
private boolean isBinlogAvailable(MySqlOffsetContext offset) {
String binlogFilename = offset.getSourceInfo().getString(BINLOG_FILENAME_OFFSET_KEY);
if (binlogFilename == null) {
return true; // start at current position
}
if (binlogFilename.equals("")) {
return true; // start at beginning
}
// Accumulate the available binlog filenames ...
List<String> logNames = connection.availableBinlogFiles();
// And compare with the one we're supposed to use ...
boolean found = logNames.stream().anyMatch(binlogFilename::equals);
if (!found) {
LOG.info(
"Connector requires binlog file '{}', but MySQL only has {}",
binlogFilename,
String.join(", ", logNames));
} else {
LOG.info("MySQL has the binlog file '{}' required by the connector", binlogFilename);
}
return found;
}
3.4.6 gtid bugfix
前面在3.4.1.3講述了Gtid的尋址方式,mysql高可以模式下,切換主庫時,slave會根據已同步到的gtid,對比master binlog檔案的previous gtidset,最終確認要從新master的哪個binlog檔案開始同步,而原始碼根本沒有這樣一個尋址步驟,
所以很確認的是,flinkcdc bug了,上github提bug去,gtid沒有生效
根據githup上的反饋,有位兄弟21年年底已經發現了這個bug,并在一月份提交了bugfix[mysql] update check gtid set #761,但目前的修改代碼還沒有合入master分支,
查看對方修改的代碼,在isBinlogAvailable增加了checkGtidSet
private boolean isBinlogAvailable(MySqlOffsetContext offset) {
String gtidStr = offset.gtidSet();
if (gtidStr != null) {
return checkGtidSet(offset);
}
return checkBinlogFilename(offset);
}
checkGtidSet里做了以下幾件事:
1、查詢了master的gtidset–availableGtidStr ,
2、和checkpoint的gtidset對比,計算出需要同步的gtid范圍gtidSetToReplicate,
3、查詢是否清除過gtid–purgedGtidSet,和gtidSetToReplicate對比計算出沒有清除的范圍,確認沒有清除過,就從checkpoint中續點同步,否則重新開始全量同步,
private boolean checkGtidSet(MySqlOffsetContext offset) {
String gtidStr = offset.gtidSet();
if (gtidStr.trim().isEmpty()) {
return true; // start at beginning ...
}
String availableGtidStr = connection.knownGtidSet();
if (availableGtidStr == null || availableGtidStr.trim().isEmpty()) {
// Last offsets had GTIDs but the server does not use them ...
LOG.warn(
"Connector used GTIDs previously, but MySQL does not know of any GTIDs or they are not enabled");
return false;
}
// GTIDs are enabled
GtidSet gtidSet = new GtidSet(gtidStr);
// Get the GTID set that is available in the server ...
GtidSet availableGtidSet = new GtidSet(availableGtidStr);
if (gtidSet.isContainedWithin(availableGtidSet)) {
LOG.info(
"MySQL current GTID set {} does contain the GTID set {} required by the connector.",
availableGtidSet,
gtidSet);
// The replication is concept of mysql master-slave replication protocol ...
final GtidSet gtidSetToReplicate =
connection.subtractGtidSet(availableGtidSet, gtidSet);
final GtidSet purgedGtidSet = connection.purgedGtidSet();
LOG.info("Server has already purged {} GTIDs", purgedGtidSet);
final GtidSet nonPurgedGtidSetToReplicate =
connection.subtractGtidSet(gtidSetToReplicate, purgedGtidSet);
LOG.info(
"GTID set {} known by the server but not processed yet, for replication are available only GTID set {}",
gtidSetToReplicate,
nonPurgedGtidSetToReplicate);
if (!gtidSetToReplicate.equals(nonPurgedGtidSetToReplicate)) {
LOG.warn("Some of the GTIDs needed to replicate have been already purged");
return false;
}
return true;
}
LOG.info("Connector last known GTIDs are {}, but MySQL has {}", gtidSet, availableGtidSet);
return false;
}
3.4.7 合并bugfix代碼,驗證高可用
從gitbub中下載代碼(https://github.com/ververica/flink-cdc-connectors.git),將bugfix–gtid沒有生效的代碼修改入master,編譯打包,更新到flink環境中,從走上述驗證程序, 觀察到:
1、切換主從時,沒有再報binlog檔案找不到例外,
2、作業從checkpoint中開始同步,而不是全量,
現象表明,mysqlcdc高可用切換成功,
我們查看運行日志,進一步驗證:
1、獲取初始gtids=d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282
2、確認MySQL 當前GTIDset d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61290 包含初始 GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282,
3、計算可同步的GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:61283-61290,
4、設定同步起點開始同步,
詳細日志如下:
2022-02-17 20:05:55,961 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Adding split(s) to reader: [MySqlBinlogSplit{splitId='binlog-split', offset={ts_sec=0, file=mysql-bin.000005, pos=31359, gtids=d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282, row=0, event=0}, endOffset={ts_sec=0, file=, pos=-9223372036854775808, row=0, event=0}, isSuspended=false}]
2022-02-17 20:05:55,966 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Starting split fetcher 0
2022-02-17 20:05:55,970 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: TableSourceScan(table=[[default_catalog, default_database, mysql_test_1]], fields=[id, data, create_time]) -> NotNullEnforcer(fields=[id]) -> Map (1/1)#0 (7fdfa75135e212f596f80bd2716e0837) switched from INITIALIZING to RUNNING.
com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader [] - BinlogSplitReader is created.
2022-02-17 20:05:56,237 INFO com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - MySQL current GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61290 does contain the GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282 required by the connector.
2022-02-17 20:05:56,247 INFO com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - Server has already purged d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61103 GTIDs
2022-02-17 20:05:56,248 INFO com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:61283-61290 known by the server but not processed yet, for replication are available only GTID set d667b1cd-778a-11ec-aa60-6c92bf64e18c:61283-61290
2022-02-17 20:05:56,248 INFO io.debezium.relational.history.DatabaseHistoryMetrics [] - Started database history recovery
2022-02-17 20:05:56,249 INFO io.debezium.relational.history.DatabaseHistoryMetrics [] - Finished database history recovery of 0 change(s) in 1 ms
2022-02-17 20:05:56,277 INFO io.debezium.util.Threads [] - Requested thread factory for connector MySqlConnector, id = mysql_binlog_source named = binlog-client
2022-02-17 20:05:56,287 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - GTID set purged on server: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61103
2022-02-17 20:05:56,287 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Attempting to generate a filtered GTID set
2022-02-17 20:05:56,287 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - GTID set from previous recorded offset: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282
2022-02-17 20:05:56,288 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - GTID set available on server: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61290
2022-02-17 20:05:56,288 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Using first available positions for new GTID channels
2022-02-17 20:05:56,288 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Relevant GTID set available on server: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61290
2022-02-17 20:05:56,289 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Final merged GTID set to use when connecting to MySQL: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282
2022-02-17 20:05:56,289 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Registering binlog reader with GTID set: d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282
2022-02-17 20:05:56,289 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Skip 0 events on streaming start
2022-02-17 20:05:56,289 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Skip 0 rows on streaming start
2022-02-17 20:05:56,290 INFO io.debezium.util.Threads [] - Creating thread debezium-mysqlconnector-mysql_binlog_source-binlog-client
2022-02-17 20:05:56,293 INFO io.debezium.util.Threads [] - Creating thread debezium-mysqlconnector-mysql_binlog_source-binlog-client
2022-02-17 20:05:56,302 INFO io.debezium.connector.mysql.MySqlStreamingChangeEventSource [] - Connected to MySQL binlog at 10.130.49.141:3306, starting at MySqlOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.mysql.Source:STRUCT}, sourceInfo=SourceInfo [currentGtid=null, currentBinlogFilename=mysql-bin.000005, currentBinlogPosition=31359, currentRowNumber=0, serverId=0, sourceTime=null, threadId=-1, currentQuery=null, tableIds=[], databaseName=null], partition={server=mysql_binlog_source}, snapshotCompleted=false, transactionContext=TransactionContext [currentTransactionId=null, perTableEventCount={}, totalEventCount=0], restartGtidSet=d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282, currentGtidSet=d667b1cd-778a-11ec-aa60-6c92bf64e18c:1-61282, restartBinlogFilename=mysql-bin.000005, restartBinlogPosition=31359, restartRowsToSkip=0, restartEventsToSkip=0, currentEventLengthInBytes=0, inTransaction=false, transactionId=null]
四、總結
至此,能過三輪驗證,確認flinkcdc可以支持mysql高可用,需要條件是:
1、mysql服務開啟gtid_mode
2、目前官方發布的版本(2.2-SNAPSHOT)存在bug,不可以直接支持高可用,需要人工合并bugfix的代碼,自行編譯后才可以正常使用,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/426515.html
標籤:其他
