
TCP Performance: why can't I reach the expected speed?

After reading this article you will find answers, or at least suggestions, for the following questions:

  • My PC and my server both have 1 Gbps NICs. They are on the same network and all links between them are 1 Gbps, but the throughput is low.
  • The TCP header should be 20 bytes, but I see headers of 32 or 40 bytes. What do those values mean?
  • I have heard about different congestion control mechanisms. What is the difference between them?

A TCP connection is established between two end hosts; it does not matter how many routers, switches and networks sit in between. TCP asks the remote host's application to send or receive data, which means TCP interacts directly with the applications on the local and remote hosts. For example, we open a browser on our PC and type www.google.com in the address field: we are asking for a web page located on Google's server and served by some hosting application, like apache2 or something else.

Let's take a look at the TCP header:

Source: https://ciscoskills.net/2011/02/25/understanding-tcp/tcpheader/

In the TCP header we can find the answer to why we sometimes see a header longer than 20 bytes. The Options field allows the header to be extended up to 60 bytes. Usually the bigger headers appear during the TCP three-way handshake; after that the communication uses the standard header.
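To make the relation between the Data Offset field and the header length concrete, here is a minimal Python sketch. The header bytes in it are fabricated purely for illustration (they are not taken from a real capture); only the layout of the TCP header is assumed.

```python
import struct

def tcp_header_length(tcp_segment: bytes) -> int:
    """Return the TCP header length in bytes (20..60)."""
    # Byte 12 of the TCP header: upper 4 bits = Data Offset, in 32-bit words
    data_offset_words = tcp_segment[12] >> 4
    return data_offset_words * 4

# Fabricated 32-byte header: Data Offset = 8 words -> 8 * 4 = 32 bytes,
# i.e. 20 bytes of fixed header plus 12 bytes of options (as seen in a SYN).
fake_header = struct.pack("!HHIIBBHHH",
                          54321, 80,      # source / destination port
                          1000, 0,        # sequence / acknowledgement number
                          8 << 4, 0x02,   # data offset = 8 words, SYN flag
                          65535, 0, 0     # window, checksum, urgent pointer
                          ) + b"\x00" * 12
print(tcp_header_length(fake_header))    # -> 32
```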

TCP is called a reliable, connection-oriented protocol. Let's review what a reliable protocol means. The TCP header has fields called Sequence number (seq) and Acknowledgement number (ack). The sequence number is generated randomly at the beginning of the TCP communication. The acknowledgement number is used to confirm that we received the data sent by the sender; if the confirmation does not arrive, the data is retransmitted. Let's see how the combination of seq and ack works in the following example:

The first 3 packets in the communication belong to the three-way handshake. Note that the starting sequence numbers on the PC and the server are different (they could even be the same); we set them differently just to avoid confusion. This example shows the download of the file example.zip from the server. The server has to divide the file into segments to send them over the network. How many segments will there be? That depends on the MSS (Maximum Segment Size) and the size of the file.

The acknowledgement number confirms how much data we have received. For example, the server sends us two segments with a total size of 1500 bytes. The server's initial sequence number was 401 and the SYN consumes one sequence number, so the first data byte carries seq 402 and the server expects an acknowledgement with ack 402 + 1500 = 1902, the number of the next byte the PC is waiting for. The server keeps sent but unacknowledged segments in its TX buffer; when it gets the ack for that data it releases those segments from the buffer (keep in mind the TX buffer is limited in size). In the same way we will get an ack number for the last sent segment.
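Here is a tiny Python sketch of that arithmetic. It only repeats the numbers from the example above (ISN 401, two segments totalling 1500 bytes); it is plain arithmetic, not a TCP implementation.

```python
def next_ack(first_data_seq, segment_sizes):
    """ACK number = sequence number of the next byte the receiver expects."""
    return first_data_seq + sum(segment_sizes)

isn = 401                 # server's initial sequence number (the SYN consumes it)
first_data_seq = isn + 1  # so the first data byte carries seq 402
print(next_ack(first_data_seq, [1000, 500]))   # two segments, 1500 bytes -> 1902
```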

Why does the PC acknowledge the first two segments with one ack and the third one with a separate ack? The server's task is to send as many segments as it can; the PC's task is to acknowledge the received segments as fast as it can. ACKs are cumulative, so you may see one ACK covering 50 segments and the next ACK covering only 5.

TCP Performance

TCP can adapt the transfer speed to the link conditions.

Let's go back to the TCP header. There we have a very important field called Window. This field has a crucial impact on performance and transfer speed. Let's see why. We have a sample transfer between a server and a client. Suppose the client announces that its window is 14600 bytes (Window = free space in the receive buffer). This means the server can send segments with a total size of up to 14600 bytes before it has to stop and wait for an ack. Usually the client sends the ack fast enough and the window never fills up. If the window does become full, the server stops the transmission until the client says it is ready to receive new data. In other words, a bigger window requires fewer ack packets and increases the throughput. The Window field is limited to 16 bits, which means at most 2^16 − 1 = 65535 bytes.

Why does the Window field vary from packet to packet?

When we establish a TCP connection, the window starts with a low value, typically the default MSS of the OS multiplied by 2, 3 or 10. In a short time the window grows very fast, up to saturation: it is increased after every acknowledged segment. This procedure is called TCP slow start.
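A toy Python sketch of that growth, from the sender's point of view (the window here is the sender's congestion window; the initial 3 segments, the threshold and the 10-RTT horizon are arbitrary choices for illustration):

```python
MSS = 1460            # bytes, a typical Ethernet MSS
cwnd = 3 * MSS        # small initial window (2, 3 or 10 segments depending on OS)
ssthresh = 64 * 1024  # exponential growth stops at this threshold

for rtt in range(1, 11):
    print(f"RTT {rtt:2d}: window = {cwnd} bytes")
    if cwnd < ssthresh:
        cwnd *= 2       # exponential growth during slow start
    else:
        cwnd += MSS     # linear growth afterwards (congestion avoidance)
```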

Here is a formula that shows the maximum achievable speed:

MaxSpeed = Window * 8 / Delay (round-trip time, in seconds)

65535 * 8 / 0.001 s ≈ 524 Mbps

The most interesting thing in this formula is the dependency on delay. When your server is far away, or a slow link sits between you and it, the speed decreases dramatically.
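A quick sanity check of the formula for several delay values; a minimal Python sketch, just arithmetic (the 65535-byte window is the maximum without window scaling):

```python
def max_speed_mbps(window_bytes, rtt_seconds):
    """MaxSpeed = Window * 8 / Delay, expressed in Mbps."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

for rtt in (0.001, 0.010, 0.100, 0.600):
    print(f"RTT {rtt * 1000:6.1f} ms -> {max_speed_mbps(65535, rtt):8.2f} Mbps")
# RTT    1.0 ms ->   524.28 Mbps
# RTT   10.0 ms ->    52.43 Mbps
# RTT  100.0 ms ->     5.24 Mbps
# RTT  600.0 ms ->     0.87 Mbps
```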

The window will shrink when packets are lost, and this happens very fast. If too many packets are lost, the TCP connection is simply closed with an error.

TCP Congestion control mechanism

The Window, sequence number and acknowledgement number fields are responsible for speed control. The server cannot push too much data to the client: that would overflow the RX buffer and all new packets would be dropped (an unacceptable situation for TCP communication). These rules were collected in the TCP congestion control mechanism called TCP Reno. Nowadays we have links, NICs and operating systems with much more capacity, which is why changes were required in the TCP stack inside operating systems; see link [1] at the end of this article for a list of the newer proposed mechanisms. For example, starting with Linux kernel 2.6.19 the default congestion control mechanism in Linux is CUBIC.
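On Linux you can check which congestion control algorithm the kernel currently uses; a minimal Python sketch that reads the standard /proc sysctl files (so it works on Linux only):

```python
def read_sysctl(name):
    """Read one net.ipv4 sysctl value from /proc."""
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        return f.read().strip()

print("current  :", read_sysctl("tcp_congestion_control"))
print("available:", read_sysctl("tcp_available_congestion_control"))
# Typical output on a modern kernel:
# current  : cubic
# available: reno cubic
```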

Do the new speed-control mechanisms change the TCP header? No: a few new options are negotiated during connection setup, and the TCP stack inside the operating system is changed accordingly.

We said that TCP is connection oriented, which means that before starting to exchange data the two hosts have to negotiate and answer the following questions:

  • Are both hosts ready to exchange data between their applications?
  • Which additional options do both hosts support?
  • How will the connection be closed at the end of the communication? Operating systems limit the maximum number of connections to avoid overloading the system.

Which options are interesting for us?

  • Selective Acknowledgement
  • Window Scaling
  • Timestamp option

Let's look inside a packet captured by Wireshark (God bless the people who work on this application).

Please note that the information placed between brackets in Wireshark is not present in the real packet; it is added by Wireshark after analysis to help us understand the communication. Different Wireshark versions display this information differently.

Selective Acknowledgement

We said that the server keeps sent segments in its TX buffer until it gets an ack for that data. The next question is how data is retransmitted when a packet is lost. Suppose the PC announced a window of 14000 bytes and the server sends 14 segments of 1000 bytes each. Segments #5 and #8 are lost. Classical TCP has to retransmit everything starting from the first missing segment (in the worst case the whole window of 14 segments). Selective Acknowledgement (SACK) lets the receiver report exactly which blocks it already holds, so the sender retransmits only #5 and #8 and can release the rest from its TX buffer. A small sketch of the difference follows, and I will give a fuller example of how it works a little bit later.
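A rough Python sketch of that difference, using the 14-segment example above (the "retransmit everything after the last in-order segment" behaviour without SACK is a simplification; nothing here is measured from a real stack):

```python
sent = list(range(1, 15))          # segment numbers 1..14, 1000 bytes each
lost = {5, 8}
received = [s for s in sent if s not in lost]

# Without SACK the receiver can only acknowledge the last in-order segment,
# so the sender falls back to resending everything after it.
last_in_order = max(s for s in received if set(range(1, s + 1)) <= set(received))
retransmit_classic = [s for s in sent if s > last_in_order]

# With SACK the receiver reports the blocks it already holds,
# so the sender resends only the missing segments.
retransmit_sack = sorted(lost)

print("classic :", retransmit_classic)   # [5, 6, 7, ..., 14]
print("sack    :", retransmit_sack)      # [5, 8]
```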

Window Scaling

Let's go back to the speed formula MaxSpeed = Window * 8 / Delay. In Long Fat Networks (networks with a large delay, like satellite communication) we can easily reach a 600 ms delay. This means the maximum speed through such a network would be 65535 * 8 / 0.6 ≈ 0.87 Mbps :). It would be strange to pay a huge amount of money and get less than 1 Mbps just because of a TCP limitation. The Window Scaling option allows us to multiply the announced window by a scale factor and obtain a much larger effective window. I will not explain how the scale factor is calculated, see RFC 1323 for the details. I just want to add that window scaling allows an RX window of up to about 1 GB without an ack :).
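A small Python sketch of the effect: the advertised 16-bit window is shifted left by the negotiated scale factor (the maximum shift defined by the RFC is 14; the other scale values and the 600 ms satellite RTT are just the examples used above).

```python
def effective_window(advertised, scale):
    """Window Scale option: effective window = advertised * 2**scale."""
    return advertised << scale

def max_speed_mbps(window_bytes, rtt_seconds):
    return window_bytes * 8 / rtt_seconds / 1_000_000

rtt = 0.6                                   # long fat network, 600 ms
for scale in (0, 4, 8, 14):
    win = effective_window(65535, scale)
    print(f"scale {scale:2d}: window {win:>13,d} bytes -> "
          f"{max_speed_mbps(win, rtt):10.2f} Mbps")
# scale  0: window        65,535 bytes ->       0.87 Mbps
# scale 14: window 1,073,725,440 bytes ->  14,316.34 Mbps
```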

Timestamp

The sequence and acknowledgement numbers in the TCP header are limited to 32 bits. On a slow link that space is more than enough, but on highly loaded servers and fast links it is not: the sequence space wraps around quickly, and once we reach 1 Gbps and above, a delayed old segment could be mistaken for a segment of the current window. The timestamp option adds one more parameter to every segment, which lets the receiver reject such old duplicates (this is the PAWS mechanism from RFC 1323) and also gives better RTT measurements.
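A quick calculation of how fast the 32-bit sequence space wraps at different speeds, which is why this protection matters on fast links (the speeds are chosen arbitrarily):

```python
def seq_wrap_seconds(speed_bps):
    """Time to send 2^32 bytes (one full sequence-number space) at speed_bps."""
    return 2**32 / (speed_bps / 8)

for name, bps in (("10 Mbps", 10e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)):
    print(f"{name:>8}: sequence space wraps in {seq_wrap_seconds(bps):8.1f} s")
#  10 Mbps: ~3436 s
#   1 Gbps: ~34 s
#  10 Gbps: ~3.4 s
```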

TCP Fast Retransmission

TCP is a reliable protocol: it has to retransmit lost segments. How long should a host wait before resending a lost segment? The regular retransmission timer can be around 2 seconds, which is far too much for a high-performance application. How do we deal with that? We have ack segments that tell the sender how much data the receiver has got. Suppose the server sent 14 segments and segment #7 was lost. The receiver keeps informing the server that it has received data only up to segment #6, even though more data is already sitting in its buffer. It sends this same ack again and again while the new segments keep arriving and being buffered. After the third duplicate ACK the server thinks: "Hmm, maybe he really didn't get the 7th segment. Let me send it one more time." (with SACK the receiver can additionally tell the server exactly which later blocks it already holds). With a 1 ms delay between the server and the PC the retransmission can take about 1.5 ms instead of a 2-second timeout. Really impressive, isn't it? When the delay between the server and the PC is larger, we can see hundreds of DUP ACK messages in the capture (and it should be like this!): for example, an 80 ms delay produced around 80 DUP ACK messages in different tests.
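A toy Python model of the duplicate-ACK counting (segment numbers follow the example above; real TCP counts bytes, not segment numbers, and would also process the retransmitted segment, which this sketch does not model):

```python
def duplicate_acks(sent_segments, lost):
    highest_in_order = 0
    dup_count = 0
    for seg in sent_segments:
        if seg in lost:
            continue                  # this segment never arrives
        if seg == highest_in_order + 1:
            highest_in_order = seg
            dup_count = 0             # fresh cumulative ACK, counter resets
        else:
            dup_count += 1            # out-of-order data -> duplicate ACK
            print(f"segment {seg} buffered, "
                  f"duplicate ACK #{dup_count} for segment {highest_in_order}")
            if dup_count == 3:
                print(f"fast retransmit of segment {highest_in_order + 1}")

duplicate_acks(range(1, 15), lost={7})   # 14 segments, #7 is lost
```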

Why not simply adjust the regular TCP retransmission timer to a relevant value instead of using fast retransmission?

Because the RTO (Retransmission Timeout) cannot be a single fixed value. We have different network types (Wi-Fi, satellite, fiber...); they have different delays, and one RTO cannot fit all of them. So TCP measures the RTT of each segment and adjusts the RTO based on those measurements. How is it done? Take a look at RFC 6298, which defines the calculation; RFC 6582 (NewReno) then refines what happens during fast recovery after the retransmission :)
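A Python sketch of the RFC 6298 smoothing formulas (alpha = 1/8, beta = 1/4, RTO = SRTT + 4 * RTTVAR). The RTT samples are made up; the RFC recommends a minimum RTO of 1 second, while Linux in practice uses a lower floor of about 200 ms, which is what the call below assumes.

```python
def rto_from_samples(rtt_samples, min_rto=1.0):
    """Track SRTT/RTTVAR as in RFC 6298 and print the resulting RTO."""
    srtt = rttvar = None
    for rtt in rtt_samples:
        if srtt is None:                       # first measurement
            srtt, rttvar = rtt, rtt / 2
        else:
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt)
            srtt = 0.875 * srtt + 0.125 * rtt
        rto = max(srtt + 4 * rttvar, min_rto)
        print(f"RTT sample {rtt * 1000:6.1f} ms -> RTO {rto:5.2f} s")

rto_from_samples([0.080, 0.085, 0.120, 0.300, 0.090], min_rto=0.2)
```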

Now we are ready to see the full picture of the TCP congestion control mechanism:

Note: in this diagram I made one mistake. After the lost packet is reported, the window size has to be decreased.

How can the information described in this article be useful to us?

We still have not answered one of the questions from the beginning of this article:

  • My PC and my server both have 1 Gbps NICs. They are on the same network and all links between them are 1 Gbps, but the throughput is low.

We have checked everything on the network, but the issue is still there. Let's take a look at the traffic capture:

In this graph we can see that our PC announces a very low window size; together with the latency this results in an average speed of about 21 Mbps. I also observed that the speed depends on the application used in the communication: for example, the built-in FTP client in Windows 7 reached 35 Mbps, while under the same conditions the FileZilla FTP client reached 140 Mbps. Try another application, or adjust the TCP parameters in the operating system (do it carefully, you can break other things). The following image shows the available TCP parameters in CentOS 6.5.
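A Linux-only Python 3 sketch that reads some of those tunables instead of changing them blindly (these are standard /proc sysctl paths on CentOS 6.5 and other Linux distributions):

```python
TUNABLES = [
    "tcp_rmem",            # min / default / max receive buffer (bytes)
    "tcp_wmem",            # min / default / max send buffer (bytes)
    "tcp_window_scaling",  # 1 = window scale option enabled
    "tcp_sack",            # 1 = selective acknowledgement enabled
    "tcp_timestamps",      # 1 = timestamp option enabled
]

for name in TUNABLES:
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(f"{name:20s} = {f.read().strip()}")
```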

More often, TCP performance is affected by a CPU utilization spike on the sender or receiver, or by a buffer overflow. A traffic capture can help us find out when it happened.

The Wireshark graph shows that the communication stops for several seconds. Let's explain what happened in the capture at that moment:

For some reason our PC decreases its window from 5840 bytes to 2920 bytes. The server sends the last two segments (1460 bytes each) and has to stop until it gets an ack from the PC. The PC then sends a segment where the window value is 0, which means: stop sending traffic, I cannot receive any more. The server waits and from time to time probes the PC to ask whether it can continue sending. Only after 6 seconds does the client open the window again and receive the remaining data. Why did the PC close the window? You have to analyze the logs on the PC to find the reason.

Links:

  1. TCP congestion control – https://en.wikipedia.org/wiki/TCP_congestion_control
  2. Wireshark application – www.wireshark.org/
  3. RFC 1323 TCP Extensions for High Performance – www.ietf.org/rfc/rfc1323.txt
  4. RFC 6582 The NewReno Modification to TCP's Fast Recovery Algorithm – https://tools.ietf.org/html/rfc6582