當 http client 回傳值為不為空,只讀取 response header,但不讀 body 內容就執行 response.Body.Close(),那么連接會被主動關閉,得不到復用,
測驗代碼如下:
// xiaorui.cc
func HttpGet() {
for {
fmt.Println("new")
resp, err := http.Get("http://www.baidu.com")
if err != nil {
fmt.Println(err)
continue
}
if resp.StatusCode == http.StatusOK {
continue
}
resp.Body.Close()
fmt.Println("go num", runtime.NumGoroutine())
}
}
正如大家所想,除了 HEAD Method 外,很少會有只讀取 header 的需求吧,
話說,golang httpclient 需要注意的地方著實不少,
如沒有 response.Body.Close(),有些小場景造成 persistConn 的 writeLoop 泄露,
如 header 和 body 都不管,那么會造成泄露的連接干滿連接池,后面的請求只能是
短連接,
背景關系
由于某幾個業務系統會瘋狂呼叫各區域不同的 k8s 集群,為減少跨機房帶來的時延、兼容新老 k8s 集群 api、減少k8s api-server 的負載,故而開發了 k8scache 服務,在部署運行后開始對該服務進行監控,發現 metrics 呈現的 QPS 跟連接數不成正比,qps 為 1500,連接數為 10 個,開始以為觸發 idle timeout 被回收,但通過歷史監控圖分析到連接依然很少,????
按照對 k8scache 呼叫方的理解,他們經常粗暴的開啟不少協程來對 k8scache 進行訪問,已知默認的 golang httpclient transport 對連接數是有默認限制的,連接池總大小為 100,每個 host 連接數為 2,當并發對某 url 進行請求時,無法歸還連接池,也就是超過連接池大小的連接會被主動clsoe(),所以,我司的 golang 腳手架中會對默認的 httpclient 創建高配的 transport,不太可能出現連接池爆滿被 close 的問題,
如果真的是連接池爆了? 誰主動挑起關閉,誰就有 tcp time-wait 狀態,但通過 netstat 命令只發現少量跟 k8scache 相關的 time-wait,
排查問題
已知問題, 為隱藏敏感資訊,索性使用簡單的場景設立問題的 case
tcpdump抓包分析問題?
包資訊如下,通過最后一行可以確認是由客戶端主動觸發 RST連接重置 ,觸發RST的場景有很多,但常見的有 tw_bucket 滿了、tcp 連接佇列爆滿且開啟 tcp_abort_on_overflow、配置 so_linger、讀緩沖區還有資料就給 close,
通過 linux 監控和內核日志可以確認不是內核配置的問題,配置 so_linger 更不可能,???? 大概率就一個可能,關閉未清空讀緩沖區的連接,
22:11:01.790573 IP (tos 0x0, ttl 64, id 29826, offset 0, flags [DF], proto TCP (6), length 60)
host-46.54550 > 110.242.68.3.http: Flags [S], cksum 0x5f62 (incorrect -> 0xb894), seq 1633933317, win 29200, options [mss 1460,sackOK,TS val 47230087 ecr 0,nop,wscale 7], length 0
22:11:01.801715 IP (tos 0x0, ttl 43, id 0, offset 0, flags [DF], proto TCP (6), length 52)
110.242.68.3.http > host-46.54550: Flags [S.], cksum 0x00a0 (correct), seq 1871454056, ack 1633933318, win 29040, options [mss 1452,nop,nop,sackOK,nop,wscale 7], length 0
22:11:01.801757 IP (tos 0x0, ttl 64, id 29827, offset 0, flags [DF], proto TCP (6), length 40)
host-46.54550 > 110.242.68.3.http: Flags [.], cksum 0x5f4e (incorrect -> 0xb1f5), seq 1, ack 1, win 229, length 0
22:11:01.801937 IP (tos 0x0, ttl 64, id 29828, offset 0, flags [DF], proto TCP (6), length 134)
host-46.54550 > 110.242.68.3.http: Flags [P.], cksum 0x5fac (incorrect -> 0xb4d6), seq 1:95, ack 1, win 229, length 94: HTTP, length: 94
GET / HTTP/1.1
Host: www.baidu.com
User-Agent: Go-http-client/1.1
22:11:01.814122 IP (tos 0x0, ttl 43, id 657, offset 0, flags [DF], proto TCP (6), length 40)
110.242.68.3.http > host-46.54550: Flags [.], cksum 0xb199 (correct), seq 1, ack 95, win 227, length 0
22:11:01.815179 IP (tos 0x0, ttl 43, id 658, offset 0, flags [DF], proto TCP (6), length 4136)
110.242.68.3.http > host-46.54550: Flags [P.], cksum 0x6f4e (incorrect -> 0x0e70), seq 1:4097, ack 95, win 227, length 4096: HTTP, length: 4096
HTTP/1.1 200 OK
Bdpagetype: 1
Bdqid: 0x8b3b62c400142f77
Cache-Control: private
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html;charset=utf-8
Date: Wed, 09 Dec 2020 14:11:01 GMT
...
22:11:01.815214 IP (tos 0x0, ttl 64, id 29829, offset 0, flags [DF], proto TCP (6), length 40)
host-46.54550 > 110.242.68.3.http: Flags [.], cksum 0x5f4e (incorrect -> 0xa157), seq 95, ack 4097, win 293, length 0
22:11:01.815222 IP (tos 0x0, ttl 43, id 661, offset 0, flags [DF], proto TCP (6), length 4136)
110.242.68.3.http > host-46.54550: Flags [P.], cksum 0x6f4e (incorrect -> 0x07fa), seq 4097:8193, ack 95, win 227, length 4096: HTTP
22:11:01.815236 IP (tos 0x0, ttl 64, id 29830, offset 0, flags [DF], proto TCP (6), length 40)
host-46.54550 > 110.242.68.3.http: Flags [.], cksum 0x5f4e (incorrect -> 0x9117), seq 95, ack 8193, win 357, length 0
22:11:01.815243 IP (tos 0x0, ttl 43, id 664, offset 0, flags [DF], proto TCP (6), length 5848)
...
host-46.54550 > 110.242.68.3.http: Flags [.], cksum 0x5f4e (incorrect -> 0x51ba), seq 95, ack 24165, win 606, length 0
22:11:01.815369 IP (tos 0x0, ttl 64, id 29834, offset 0, flags [DF], proto TCP (6), length 40)
host-46.54550 > 110.242.68.3.http: Flags [R.], cksum 0x5f4e (incorrect -> 0x51b6), seq 95, ack 24165, win 606, length 0
通過 lsof 找到行程關聯的 TCP 連接,然后使用 ss 或 netstat 查看讀寫緩沖區,資訊如下,recv-q 讀緩沖區確實是存在資料,這個緩沖區位元組一直未讀,直到連接關閉引發了 rst,
$ lsof -p 54330
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
aaa 54330 root 1u CHR 136,0 0t0 3 /dev/pts/0
aaa 54330 root 2u CHR 136,0 0t0 3 /dev/pts/0
aaa 54330 root 3u a_inode 0,10 0 8838 [eventpoll]
aaa 54330 root 4r FIFO 0,9 0t0 223586913 pipe
aaa 54330 root 5w FIFO 0,9 0t0 223586913 pipe
aaa 54330 root 6u IPv4 223596521 0t0 TCP host-46:60626->110.242.68.3:http (ESTABLISHED)
$ ss -an|egrep "68.3:80"
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 72480 0 172.16.0.46:60626 110.242.68.3:80
strace 跟蹤系統呼叫
通過系統呼叫可分析出,貌似只是讀取了 header 部分了,還未讀到 body 的資料,
[pid 8311] connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("110.242.68.3")}, 16 <unfinished ...>
[pid 195519] epoll_pwait(3, <unfinished ...>
[pid 8311] <... connect resumed>) = -1 EINPROGRESS (操作現在正在進行)
[pid 8311] epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2350546712, u64=140370471714584}} <unfinished ...>
[pid 195519] getsockopt(6, SOL_SOCKET, SO_ERROR, <unfinished ...>
[pid 192592] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 195519] getpeername(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("110.242.68.3")}, [112->16]) = 0
[pid 195519] getsockname(6, <unfinished ...>
[pid 195519] <... getsockname resumed>{sa_family=AF_INET, sin_port=htons(47746), sin_addr=inet_addr("172.16.0.46")}, [112->16]) = 0
[pid 195519] setsockopt(6, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
[pid 195519] setsockopt(6, SOL_TCP, TCP_KEEPIDLE, [15], 4 <unfinished ...>
[pid 8311] write(6, "GET / HTTP/1.1\r\nHost: www.baidu.com\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 94 <unfinished ...>
[pid 192595] read(6, <unfinished ...>
[pid 192595] <... read resumed>"HTTP/1.1 200 OK\r\nBdpagetype: 1\r\nBdqid: 0xc43c9f460008101b\r\nCache-Control: private\r\nConnection: keep-alive\r\nContent-Encoding: gzip\r\nContent-Type: text/html;charset=utf-8\r\nDate: Wed, 09 Dec 2020 13:46:30 GMT\r\nExpires: Wed, 09 Dec 2020 13:45:33 GMT\r\nP3p: CP=\" OTI DSP COR IVA OUR IND COM \"\r\nP3p: CP=\" OTI DSP COR IVA OUR IND COM \"\r\nServer: BWS/1.1\r\nSet-Cookie: BAIDUID=996EE645C83622DF7343923BF96EA1A1:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: BIDUPSID=99"..., 4096) = 4096
[pid 192595] close(6 <unfinished ...>
邏輯代碼
那么到這里,可以大概猜測問題所在,找到業務方涉及到 httpclient 的邏輯代碼,偽代碼如下,跟上面的結論一樣,只是讀取了header,但并未讀取完response body資料,
還以為是特殊的場景,結果是使用不當,把請求投遞過去后只判斷 http code?真正的業務 code 是在 body 里的,????
urls := []string{...}
for _, url := range urls {
resp, err := http.Post(url, ...)
if err != nil {
// ...
}
if resp.StatusCode == http.StatusOK {
continue
}
// handle redis cache
// handle mongodb
// handle rocketmq
// ...
resp.Body.Close()
}
如何解決
不細說了,把 header length 長度的資料讀完就可以了,
分析問題
先不管別人使用不當,再分析下為何出現短連接,連接不能復用的問題,
為什么不讀取 body 就出問題?其實 http.Response 欄位描述中已經有說明了,當 Body 未讀完時,連接可能不能復用,
// The http Client and Transport guarantee that Body is always
// non-nil, even on responses without a body or responses with
// a zero-length body. It is the caller's responsibility to
// close Body. The default HTTP client's Transport may not
// reuse HTTP/1.x "keep-alive" TCP connections if the Body is
// not read to completion and closed.
//
// The Body is automatically dechunked if the server replied
// with a "chunked" Transfer-Encoding.
//
// As of Go 1.12, the Body will also implement io.Writer
// on a successful "101 Switching Protocols" response,
// as used by WebSockets and HTTP/2's "h2c" mode.
Body io.ReadCloser
眾所周知,golang httpclient 要注意 response Body 關閉問題,但上面的 case 確實有關了 body,只是非常規地沒去讀取 reponse body 資料,這樣會造成連接例外關閉,繼而引起連接池不能復用,
一般 http 協議解釋器是要先決議 header,再決議 body,結合當前的問題開始是這么推測的,連接的 readLoop 收到一個新請求,然后嘗試決議 header 后,回傳給呼叫方等待讀取 body,但呼叫方沒去讀取,而選擇了直接關閉 body,那么后面當一個新請求被 transport roundTrip 再調度請求時,readLoop 的 header 讀取和決議會失敗,因為他的讀緩沖區里有前面未讀的資料,必然無法決議 header,按照常見的網路編程原則,協議決議失敗,直接關閉連接,
想是這么想的,但還是看了 golang net/http 的代碼,結果不是這樣的,????
分析原始碼
httpclient 每個連接會創建讀寫協程兩個協程,分別使用 reqch 和 writech 來跟 roundTrip 通信,上層使用的response.Body 其實是經過多次封裝的,一次封裝的 body 是直接跟 net.conn 進行互動讀取,二次封裝的 body 則是加強了 close 和 eof 處理的 bodyEOFSignal,
當未讀取 body 就進行 close 時,會觸發 earlyCloseFn() 回呼,看 earlyCloseFn 的函式定義,在 close 未見 io.EOF 時才呼叫,自定義的 earlyCloseFn 方法會給 readLoop 監聽的 waitForBodyRead 傳入 false, 這樣引發 alive 為 false 不能繼續回圈的接收新請求,只能是退出呼叫注冊過的 defer 方法,關閉連接和清理連接池,
// xiaorui.cc
func (pc *persistConn) readLoop() {
closeErr := errReadLoopExiting // default value, if not changed below
defer func() {
pc.close(closeErr) // 關閉連接
pc.t.removeIdleConn(pc) // 從連接池中洗掉
}()
...
alive := true
for alive {
...
rc := <-pc.reqch // 從管道中拿到請求,roundTrip 對該管道進行輸入
trace := httptrace.ContextClientTrace(rc.req.Context())
var resp *Response
if err == nil {
resp, err = pc.readResponse(rc, trace) // 更多的是決議 header
} else {
err = transportReadFromServerError{err}
closeErr = err
}
...
waitForBodyRead := make(chan bool, 2)
body := &bodyEOFSignal{
body: resp.Body,
// 提前關閉 !!! 輸出false
earlyCloseFn: func() error {
waitForBodyRead <- false
...
},
// 正常收尾 !!!
fn: func(err error) error {
isEOF := err == io.EOF
waitForBodyRead <- isEOF
...
},
}
resp.Body = body
select {
case rc.ch <- responseAndError{res: resp}:
case <-rc.callerGone:
return
}
select {
case bodyEOF := <-waitForBodyRead:
replaced := pc.t.replaceReqCanceler(rc.cancelKey, nil) // before pc might return to idle pool
// alive 為 false, 不能繼續 continue
alive = alive &&
bodyEOF &&
!pc.sawEOF &&
pc.wroteRequest() &&
replaced && tryPutIdleConn(trace)
...
case <-rc.req.Cancel:
alive = false
pc.t.CancelRequest(rc.req)
case <-rc.req.Context().Done():
alive = false
pc.t.cancelRequest(rc.cancelKey, rc.req.Context().Err())
case <-pc.closech:
alive = false
}
}
}
bodyEOFSignal 的 Close():
// xiaorui.cc
func (es *bodyEOFSignal) Close() error {
es.mu.Lock()
defer es.mu.Unlock()
if es.closed {
return nil
}
es.closed = true
if es.earlyCloseFn != nil && es.rerr != io.EOF {
return es.earlyCloseFn()
}
err := es.body.Close()
return es.condfn(err)
}
最侄訓呼叫 persistConn 的 close(), 連接關閉并關閉closech:
// xiaorui.cc
func (pc *persistConn) close(err error) {
pc.mu.Lock()
defer pc.mu.Unlock()
pc.closeLocked(err)
}
func (pc *persistConn) closeLocked(err error) {
if err == nil {
panic("nil error")
}
pc.broken = true
if pc.closed == nil {
pc.closed = err
pc.t.decConnsPerHost(pc.cacheKey)
if pc.alt == nil {
if err != errCallerOwnsConn {
pc.conn.Close() // 關閉連接
}
close(pc.closech) // 通知讀寫協程
}
}
}
總之
同事的 httpclient 使用方法有些奇怪,除了 head method 之外,還真想不到有不讀取 body 的請求,所以,大家知道 httpclient 有這么一回事就行了,
另外,一直覺得 net/http 的代碼太繞,沒看過一些介紹直接看代碼很容易陷進去,有時間專門講講 http client 的實作,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/240943.html
標籤:AI
上一篇:圖解鴻蒙Page Ability的啟動型別(LaunchType)
下一篇:回眸 2020,展望 2021
