Big Sliding Windows Increase Impact of Retransmission


I was analyzing a situation at a local company and found that large buffers on low-latency connections are counter-productive.

Consider the following:

  • A big server has 12 HBAs to a Fibre Channel SAN. (This could be anything: teamed NICs, for example.)
  • Big writes (for example, 12 MByte) are load-balanced across the HBAs (each HBA takes a 1 MByte buffer to send).
  • All 12 buffers need to arrive together for a two-stage commit.

The problem is that in Fibre Channel, each buffer is broken up into 2112-byte frames (think of an MTU of 1480 or 1500 in Ethernet). The smallest atomic chunk is 2112 bytes, so each megabyte is actually 497 frames. If any one of those frames is discarded or corrupted, the whole session is retransmitted.
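As a rough sanity check, here is a small Python sketch of that frame arithmetic, assuming the 2112-byte maximum Fibre Channel payload and the 1 MByte per-HBA buffers described above (the constant names are mine, purely illustrative):

```python
import math

FC_FRAME_PAYLOAD = 2112        # largest Fibre Channel frame payload, in bytes
BUFFER_SIZE = 1 * 1024 ** 2    # the 1 MByte buffer each HBA is asked to send

frames_per_buffer = math.ceil(BUFFER_SIZE / FC_FRAME_PAYLOAD)
print(frames_per_buffer)       # 497 frames to carry a single 1 MByte buffer
```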

The important impact of the third item above is that it is really 12 MByte in a single transfer, merely "shotgunned" by load-balancing, and either all of it arrives or none of it does. This means that all (12 × 497) = 5964 frames must arrive, or all of them must be retransmitted; that is a result of the host-side multiplexing, since the SAN faithfully delivers the other buffers perfectly well.

So with 5964 frames, a per-frame failure rate of only 1/10,000 is enough to make roughly every second transfer fail. At 1/100,000, about 1 in 20 fails.
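Treating frame loss as independent with a fixed per-frame rate (a simplifying assumption on my part), those figures are easy to reproduce:

```python
def transfer_failure_probability(frames: int, frame_loss_rate: float) -> float:
    """Chance that at least one of `frames` independent frames is lost,
    which under the all-or-nothing commit fails the entire transfer."""
    return 1.0 - (1.0 - frame_loss_rate) ** frames

FRAMES_IN_FLIGHT = 12 * 497    # 5964 frames per 12 MByte transfer

print(transfer_failure_probability(FRAMES_IN_FLIGHT, 1e-4))  # ~0.45: about every second transfer fails
print(transfer_failure_probability(FRAMES_IN_FLIGHT, 1e-5))  # ~0.06: in the ballpark of 1 in 20
```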

In its recovery phase, the multiplexing application sometimes needs to wait 30 seconds to notice a failure: even though Fibre Channel aborts immediately in 496 of 497 failure cases, the multiplexer is not alerted until its own timeout has expired. That timeout seems to have been designed for slower connections, such as across IPIP, FCoE, DWDM, or similar slower-than-they-seem links with larger latencies.

That means that a system processing 51 Mbit/sec, or about 6.4 MByte/sec, can queue up a backlog of roughly 193 MByte during that 30-second wait before a retransmission can even begin. If a failure happens 1 time in 20, then you only get (Poisson distribution, I know) about 228 MByte between failures, which works out to roughly 44% efficiency.
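A minimal sketch of that backlog arithmetic, using the figures above (the final efficiency number depends on how the stall and retransmission are charged against the useful data, so I only reproduce the two intermediate quantities):

```python
LINK_RATE_MBIT  = 51        # observed throughput of the system
TIMEOUT_SECONDS = 30        # multiplexer timeout while the failure goes undetected
TRANSFER_MBYTE  = 12        # size of each all-or-nothing transfer
FAILURE_RATE    = 1 / 20    # roughly one transfer in twenty fails

throughput_mbyte = LINK_RATE_MBIT / 8                   # ~6.4 MByte/sec
backlog_mbyte = throughput_mbyte * TIMEOUT_SECONDS      # data queued while waiting for the timeout
good_mbyte_between_failures = (1 / FAILURE_RATE - 1) * TRANSFER_MBYTE  # 19 good transfers

print(round(backlog_mbyte))            # ~191 MByte backlog (the post rounds to ~193)
print(good_mbyte_between_failures)     # 228.0 MByte delivered between failures
```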

Part and parcel with that, a failure shows up as a huge delay (30 sec) in response time while it is being detected, during which no transaction can hit the storage; when things finally free up, the backlog of nearly 30 seconds needs to be retransmitted. Failures also tend to occur more frequently as traffic ramps up (for example, link-level congestion exhausting buffers). That means that during the busiest times the failures occur more often, and they can cause neighbouring systems sharing the same resources to be impacted in turn, in a flip-flop action like passing around the "Old Maid" in a children's card game.

So what happens when you reduce the session size? What about 8 kByte pages, which give you 4 frames per session? Just as splitting and re-joining FTP uploads reduces the cost of a retransmission, more of those 8 kByte pages arrive intact, and although a 30-sec timeout is still a 30-sec timeout, the in-flight retransmission is only (4 × 12) = 48 frames: less than 1% of the big-buffer cost, in line with the difference in size of each buffer. Framing efficiency drops, since a 3.1% overhead (8448/8192) replaces a 0.104% overhead (1,049,664/1,048,576), but the overall throughput in a sub-optimal situation should be much higher because far less has to be retransmitted.
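To put the two session sizes side by side, here is a sketch of the frames-in-flight and padding-overhead numbers quoted above (helper names are mine, not part of anything described in the post):

```python
import math

FC_FRAME_PAYLOAD = 2112
HBA_COUNT = 12

def frames(session_bytes: int) -> int:
    """Frames needed to carry one session payload."""
    return math.ceil(session_bytes / FC_FRAME_PAYLOAD)

def padding_overhead(session_bytes: int) -> float:
    """Extra frame capacity consumed relative to the payload actually carried."""
    return frames(session_bytes) * FC_FRAME_PAYLOAD / session_bytes - 1

BIG, SMALL = 1 * 1024 ** 2, 8 * 1024   # 1 MByte buffer vs. 8 kByte page

print(frames(BIG) * HBA_COUNT, frames(SMALL) * HBA_COUNT)               # 5964 vs. 48 frames in flight
print(f"{padding_overhead(BIG):.3%} vs {padding_overhead(SMALL):.3%}")  # ~0.104% vs. ~3.125% padding
```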

Reducing the timeout in the multiplexer application should also reduce the retransmission cost, so long as the timeout is not so short that successful transactions are declared failed. That seems safe, considering Fibre Channel's fast response time (typically 3 ms, rarely exceeding 12 ms in spike situations, so long as no single server has so much queue depth that it robs the SAN of its buffer resources).
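As a purely illustrative sketch (the safety factor is my assumption, not a measured value), a timeout derived from the fabric's observed latency would detect the same failure orders of magnitude sooner than the 30-second default:

```python
TYPICAL_RESPONSE_MS = 3     # typical Fibre Channel response time quoted above
SPIKE_RESPONSE_MS   = 12    # rarely-exceeded spike latency
SAFETY_FACTOR       = 10    # hypothetical margin so genuine successes are never failed

timeout_ms = SPIKE_RESPONSE_MS * SAFETY_FACTOR
print(timeout_ms, "ms instead of 30000 ms")   # 120 ms: failure detected ~250x sooner
```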
