Socket send/receive speed on Windows

On a Windows i5 laptop running Python 3.7, receiving 100 MB of data through a socket takes 200 ms, which is obviously very slow compared to RAM speed.

How can this socket speed be improved on Windows?

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0)  # ~ 0.200s

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000  # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0)  # ~ 0.020s

Note: this answers the question "Socket send/receive speed on Windows".

One can see that the server is not immediately awakened when the client fills the TCP buffer, which is a missed optimization of the Windows scheduler. In fact, the scheduler could wake up the client before the server starves, so as to reduce latency issues. Note that a non-negligible part of the time is spent in a kernel process whose time slices match the client activity.

Overall, 55% of the time is spent in the recv function of ws2_32.dll, 10% in the send function of the same DLL, 25% in synchronization functions, and 10% in other functions, including those of the CPython interpreter. Thus, the modified benchmark is not slowed down by CPython. Additionally, synchronizations are not the main source of slowdown.

When processes are scheduled, the memory throughput goes from 16 GiB/s up to 34 GiB/s, with an average of ~20 GiB/s, which is quite high (especially considering the time taken by synchronizations). This means Windows performs a lot of big temporary buffer copies, especially during the recv calls.

Note that the Xeon-based platform is certainly slower because its processor only reaches 14 GiB/s in sequential access, while the i5-9600KF processor reaches 24 GiB/s. The Xeon processor also operates at a lower frequency. Such things are common for server processors, which mainly focus on scalability.

A deeper analysis of ws2_32.dll shows that nearly all the time of recv is spent in the obscure instruction call qword ptr [rip+0x3440f], which I guess is a kernel call to copy data from a kernel buffer to the user one. The same applies to send. This means that the copies are not done in user-land but in the Windows kernel itself...

If you want to share data between two processes on Windows, I strongly advise you to use shared memory instead of sockets. Some message passing libraries provide an abstraction on top of this (like ZeroMQ for example).
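As an illustration of the shared-memory suggestion, here is a minimal sketch using Python's standard multiprocessing.shared_memory module. Note the assumptions: this module requires Python 3.8 or later (the question uses 3.7), the 1 MB payload is a stand-in for the 100 MB buffer, and both "sides" run in the same process here only to keep the example self-contained. In real use the consumer would attach from a second process that received the block name through some side channel. This is not code from the answer, just one possible way to apply its advice.

```python
from multiprocessing import shared_memory

PAYLOAD = b"a" * 1_000_000  # 1 MB stand-in for the 100 MB buffer in the question

# "Producer" side: create a named block and copy the payload into it once.
producer = shared_memory.SharedMemory(create=True, size=len(PAYLOAD))
try:
    # The block may be rounded up to a page multiple, so slice explicitly.
    producer.buf[:len(PAYLOAD)] = PAYLOAD

    # "Consumer" side: attach to the same block by name.
    # Assumption: in real use this runs in another process.
    consumer = shared_memory.SharedMemory(name=producer.name)
    try:
        received = bytes(consumer.buf[:len(PAYLOAD)])
    finally:
        consumer.close()
finally:
    producer.close()
    producer.unlink()  # release the block once nobody needs it anymore

print(len(received))
```

The consumer still copies the data out with bytes(...) here; working directly on consumer.buf would avoid even that copy.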

Notes

Here are some notes, as pointed out in the comments:

If increasing the buffer size does not significantly impact performance, then the code is most likely already memory bound on the target machine. For example, with a single DDR4 memory channel @ 2400 MHz, common on 3-year-old PCs, the maximum practical throughput will be about 14 GiB/s and I expect the socket throughput to be clearly less than 1 GiB/s. On a much older PC with a basic single-channel DDR3, the throughput should even be close to 500 MiB/s. The speed should be bounded by something like maxMemThroughput / K, where K = (N + 1) * P and where:

N is the number of copies the operating system performs;
P is equal to 2 on processors with a write-through cache policy or operating systems using non-temporal SIMD instructions, and 3 otherwise.

Low-level profilers show that K ~= 8 on Windows. They also show that send performs an efficient copy that benefits from non-temporal stores and nearly saturates the RAM throughput, while recv does not seem to use non-temporal stores, clearly does not saturate the RAM throughput, and performs many more reads than writes (for some unknown reason).

On NUMA systems like recent AMD processors (Zen) or multi-socket systems, this should be even worse, since the interconnect and the saturation of NUMA nodes can slow down transfers. Windows is known to behave badly in this case.

AFAIK, ZeroMQ has multiple backends (aka. "Multi-Transport"): one of them operates over TCP (the default) while another operates with shared memory.
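To make the bound concrete, here is a small worked calculation. It takes the 14 GiB/s practical memory throughput and K ~= 8 from the notes above, and the 100 MB in 0.200 s measurement from the question. The values N = 3 and P = 2 are only an assumed combination that happens to give K = 8; the profiler result does not say which N and P apply. The function names are hypothetical.

```python
GIB = 1024 ** 3

def copy_factor(n_copies, p):
    """K = (N + 1) * P, as defined in the notes."""
    return (n_copies + 1) * p

def socket_throughput_bound(max_mem_throughput, k):
    """Upper bound on socket throughput: maxMemThroughput / K."""
    return max_mem_throughput / k

# Assumed values: N = 3 and P = 2 are one combination that yields K = 8.
k = copy_factor(3, 2)

# 14 GiB/s is the practical single-channel DDR4 figure quoted in the notes.
bound_gib_s = socket_throughput_bound(14.0, k)

# Throughput measured in the question: 100 MB received in 0.200 s.
measured_gib_s = 100_000_000 / 0.200 / GIB

print(k, bound_gib_s, round(measured_gib_s, 3))
```

This is an upper bound rather than a prediction: the measured throughput of roughly 0.47 GiB/s sits well below the 1.75 GiB/s ceiling, which is consistent with the time lost to synchronization and scheduling described earlier.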

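uj5u.com user replied:

Don't make two calls to send if you have a bunch of data to send at once. When the implementation sees the first send, it has no reason to think there will be a second, so it sends the data immediately. But when it sees the second send, it has no reason to think there won't be a third, so it delays sending the data to try to aggregate a full packet.

This would actually be fine if they were two different application-level messages and the other side acknowledged the first message. But that is not the case here.

If you are designing an application-level protocol to run over TCP, you have to work with TCP, not against it, if you care about performance.

If you don't have application-level messages, make sure you aggregate as much data as you can in each call to send, at least 4 KB.

If you do have application-level messages that the other side acknowledges, try to include a complete message in each send call.

But what you are doing in your code violates all of these principles and makes it impossible for the implementation to perform well.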
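The aggregation advice above can be sketched as follows. This is a minimal illustration, not code from the answer: the helper names send_aggregated and recv_exactly and the 4-byte length header are assumptions made for the example, and socket.socketpair() stands in for a real client/server connection.

```python
import socket

def send_aggregated(sock, parts):
    """Join all pieces in user space, then hand them to the kernel in one call."""
    payload = b"".join(parts)
    sock.sendall(payload)  # one call for the whole message instead of one per piece
    return len(payload)

def recv_exactly(sock, n):
    """Read exactly n bytes, looping because recv may return less than asked."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before all data arrived")
        buf += chunk
    return bytes(buf)

# socketpair() gives two connected sockets in one process, for demonstration only.
a, b = socket.socketpair()
try:
    body = [b"hello ", b"world"]
    # Hypothetical framing: 4-byte big-endian length header, then the body.
    header = sum(len(p) for p in body).to_bytes(4, "big")
    sent = send_aggregated(a, [header] + body)
    size = int.from_bytes(recv_exactly(b, 4), "big")
    message = recv_exactly(b, size)
finally:
    a.close()
    b.close()

print(sent, message)
```

Sending the header and body in a single call means the implementation never sees a small first write followed by a second one, which is the pattern the answer warns against.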

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qianduan/443994.html

Tags: Python, Windows, performance, sockets, networking
