Abstract/Introduction
This paper reports on a detailed analysis of tcpdump traces with
persistent and non-persistent connections for identical URL sets.
On March 17th, 1997, tcpdump traces were taken on a benchmark
configuration. A client on the University of Waikato network made
sequential HTTP requests to its local cache for 93 unique URLs. The list
of requests was repeated 5 times for each of two configurations: Harvest
with persistent connections and Harvest without persistent
connections.
The tests were performed sequentially, so they represent slightly different
network loads. Regardless, a straightforward comparison of the total
time is not an adequate measure of the performance difference; loss for an
individual request can severely perturb the overall average.
In fact, the total elapsed time for the persistent connection set was longer than for the non-persistent set.
The results
show the benefits of persistent connections and the
impact of cache hierarchy decisions on performance, and they illustrate a few of the TCP and ICP problems over long links.
Performance benefits from persistent connections
The primary goal is to understand improvements in
the latency distribution for servicing page requests with
persistent connections.
Persistent and Non-persistent Request Response Time distributions
| Time(sec) | .5-.75 | .75-1 | 1-1.25 | 1.25-1.5 | 1.5-1.75 | 1.75-2 | 2-2.25 | 2.25-2.5 | 2.5-5 | GT 5sec |
| Persistent | 29% | 30% | 10% | 7% | 4.5% | 3% | 3% | 1.5% | 2% | 10% |
| Non-Persist | 2.5% | 15% | 29% | 17% | 9% | 9% | 2.5% | 1% | 2% | 12% |
- Median Bin:
-
Persistent: The >.75 and <1.0 second bin.
This bin encompasses roughly the 30th to 60th percentiles.
Non-persistent: Median at the 1.25 second bin boundary.
- Truncated average:
-
Average of all requests requiring less than 5 seconds:
Persistent: 1.127 seconds includes 92% of all references
Non-persistent: 1.446 seconds includes 89% of all references
- Tail Average:
-
Average of all requests requiring more than 5 seconds to complete:
Persistent: 25.129 seconds
Non-persistent: 14.169 seconds
The truncated average round-trip time (including everything less than 5 seconds),
as measured from the UDP RT times, was .2778 seconds.
- Note:
-
Because long-tailed distributions dramatically affect the average or
cumulative response time, it is important to look at the median time, a
truncated average, and/or a distribution of the response times.
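The three summaries used above (median, truncated average, tail average) can be sketched in a few lines. This is a minimal illustration, not the script used for the paper: the function name `summarize`, the 5-second cutoff parameter, and the sample values are assumptions made for the example, and the sample is invented rather than taken from the trace.

```python
from statistics import median

def summarize(times, cutoff=5.0):
    """Summarize response times (seconds) without letting the tail dominate."""
    if not times:
        raise ValueError("times must be non-empty")
    body = [t for t in times if t < cutoff]    # requests faster than the cutoff
    tail = [t for t in times if t >= cutoff]   # the long tail
    return {
        "median": median(times),
        "truncated_mean": sum(body) / len(body) if body else None,
        "tail_mean": sum(tail) / len(tail) if tail else None,
        "share_below_cutoff": len(body) / len(times),
    }

# Hypothetical response times, NOT measurements from the trace.
sample = [0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.4, 3.0, 12.0, 30.0]
stats = summarize(sample)
mean_all = sum(sample) / len(sample)  # plain average, pulled up by two slow requests
print(stats, mean_all)
```

In this made-up sample two slow requests are enough to push the plain average several times above the median, which is the effect the note describes.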
It looks like the improvement in truncated average response time (1.446 - 1.127 = .319 seconds) is
approximately one RTT (.278 seconds). This is in fact what we would expect.
Here is the configuration:
--------- ------------ -------------
| Client | | NZ cache | | Palo Alto |
| | | | | cache |
--------- ------------ -------------
no persistent Persistent Persistent
capabilities capabilities capabilities
We only save the SYN/SYN exchange between the NZ cache and the PA cache.
The steady state case for the Non-persistent configuration:
Client - NZ cache : tcp connection setup
: GET HTTP request
: tcp data transfer back to Client
NZ cache - PA cache: UDP ICP
NZ cache - PA cache: tcp connection setup
: GET HTTP request
: tcp data transfer back to NZ cache
The steady state case for the Persistent configuration:
Client - NZ cache : tcp connection setup
: GET HTTP request
: tcp data transfer back to Client
NZ cache - PA cache: UDP ICP
NZ cache - PA cache:
: GET HTTP request
: tcp data transfer back to NZ cache
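The two steady-state cases can be compared with a rough round-trip count on the NZ cache to PA cache leg. The sketch below assumes that each listed step costs exactly one RTT over the long link and ignores transmission and processing time; the list names and the helper `long_link_latency` are illustrative only. The RTT and the two truncated averages are the figures quoted in the text.

```python
RTT = 0.278  # seconds, UDP round-trip time quoted in the text

# Steps on the NZ cache <-> PA cache leg, each assumed to cost one round trip.
NON_PERSISTENT = [
    "UDP ICP query/reply",
    "tcp connection setup (SYN/SYN)",
    "GET request / first data",
]
PERSISTENT = [
    "UDP ICP query/reply",
    "GET request / first data",
]

def long_link_latency(steps, rtt=RTT):
    """Latency before the first data arrives, under the one-RTT-per-step assumption."""
    return len(steps) * rtt

saved = long_link_latency(NON_PERSISTENT) - long_link_latency(PERSISTENT)
measured = 1.446 - 1.127  # difference of the truncated averages from the text

print(f"modelled saving: {saved:.3f} s, measured difference: {measured:.3f} s")
```

Under these assumptions the model saves one round trip, which is of the same order as the measured difference between the two truncated averages.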
An additional hope is that persistent connections will eliminate the slow
start effects for subsequent connections. Since the measured savings was
a single RTT it is unlikely that there was a significant performance
improvement for this test-case from slow start.
The trace showed that persistent connections did eliminate slow start for many requests, but any additional delay in the UDP or request setup reactivated
the slow start mechanism. The real reason for little or no performance
improvement from slow start is that the tcp window for the NZ cache was
much too small to fill the pipe, and most of the time the two caches were
waiting for ack/data from each other. See section 3.
One of the main advantages of persistent connections is that they reduce the
amount of state required at the server. Persistent connections halved the
number of connections made in the NZ cache, and the PA cache. If the client
had supported persistent connections, the number of tcp connections on the
NZ cache for the test would have been 2 instead of 466 (465 with the client plus 1 with the PA cache).
A similar situation holds if servers support persistent connections: many
connections to the same server would collapse into a single connection.
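The connection counts follow from the test size (93 URLs repeated 5 times). The sketch below is a back-of-the-envelope model, assuming every request is forwarded to the parent and that a persistent connection, once opened, stays up for the whole run; the function name `nz_cache_connections` is hypothetical.

```python
URLS = 93
REPEATS = 5
total_requests = URLS * REPEATS  # 465 requests per configuration

def nz_cache_connections(client_persistent, parent_persistent, n=total_requests):
    """tcp connections seen by the NZ cache: client side plus parent side."""
    client_side = 1 if client_persistent else n
    parent_side = 1 if parent_persistent else n
    return client_side + parent_side

print(nz_cache_connections(False, False))  # neither leg persistent
print(nz_cache_connections(False, True))   # tested persistent configuration
print(nz_cache_connections(True, True))    # client persistent as well
```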
Hierarchical cache
There are several reasons to use hierarchical caches; if they are not used carefully, however, they result in much poorer response time without providing any other benefits.
The inter-cache communication adds additional latency to each
cache miss. Care needs to be taken to understand the impact and the benefits
of ICP.
In the test configuration the NZ cache had a single parent, the PA Cache.
The parent cache was polled to see if it had the data before the data
was retrieved. It is current implementation and operational practice to use ICP as a failover mechanism as well as a data query: a cache will not send a request to a parent that fails to respond to ICP.
If the parent doesn't respond, the cache automatically makes the request directly. This illustrates the importance of understanding how the cache protocol
affects performance. Implementing failover as a side effect of ICP is a
poor design choice. A separate failover mechanism should be implemented so that otherwise unnecessary ICP overhead can be eliminated.
First asking the parent cache if it has the data, and then requesting the
data from that parent regardless of the answer, introduces an additional
RT delay without providing any bandwidth saving. From the tracedump
the penalty is evaluated by examining the UDP time.
Inter-cache Protocol overhead - UDP message time
For this trans-Pacific link the average UDP
response time for the 90th percentile was .278 seconds. By running the cache in
single-parent mode, the response time to the client would have
improved by another .278 seconds, to .849 sec (1.127 - .278) for the persistent
configuration and 1.168 sec (1.446 - .278) for the non-persistent configuration.
The tracedump looks like:
Client FIN for previous request and ack to NZ cache:
12:31:53.958256 memphis.cc.waikato.ac.nz.1736 > osiris.3128: F 91:91(0) ack 7519 win 33580
12:31:53.958381 osiris.3128 > memphis.cc.waikato.ac.nz.1736: . ack 92 win 8760
Client tcp SYN with NZ cache for next request:
12:31:53.997898 memphis.cc.waikato.ac.nz.1737 > osiris.3128: S 473856000:473856000(0) win 32768
12:31:53.998199 osiris.3128 > memphis.cc.waikato.ac.nz.1737: S 1811449806:1811449806(0) ack
12:31:53.999886 memphis.cc.waikato.ac.nz.1737 > osiris.3128: . ack 1 win 33580
Client sends GET request in data pkt:
12:31:54.000257 memphis.cc.waikato.ac.nz.1737 > osiris.3128: P 1:89(88) ack 1 win 33580
NZ Cache sends UDP request to parent PA cache and waits for response:
12:31:54.006140 osiris.3130 > cache.nlanr.pa-x.dec.com.3130: udp 65
12:31:54.048889 osiris.3128 > memphis.cc.waikato.ac.nz.1737: . ack 89 win 8760
12:31:54.279761 cache.nlanr.pa-x.dec.com.3130 > osiris.3130: udp 61
Got response - now send GET request via tcp (for non-persistent
connections this would first involve establishing a tcp connection)
12:31:54.282252 osiris.46918 > cache.nlanr.pa-x.dec.com.3128: P 257:375(118) ack 56491 win 8760
**12:31:54.468909 osiris.46918 > cache.nlanr.pa-x.dec.com.3128: P 257:375(118) ack 56491 win 8760
12:31:54.647664 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: . ack 375 win 33580
PA cache sends back first data packets:
12:31:54.679161 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: P 56491:56703(212) ack 375 win 33580
12:31:54.718401 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: . ack 375 win 33580
** Note: The NZ cache tcp implementation doesn't properly calculate the
RTT for use as a response timeout, even though it has sufficient
information to do so.
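The intervals in the trace excerpt can be checked directly from the tcpdump timestamps. The sketch below only subtracts timestamps copied from the lines above; the helper name `seconds_between` is an assumption for the example, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def seconds_between(t0, t1):
    """Elapsed seconds between two tcpdump-style HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()

# ICP query sent by the NZ cache -> ICP reply from the PA cache
icp_rtt = seconds_between("12:31:54.006140", "12:31:54.279761")

# GET sent on the persistent connection -> the same segment retransmitted (**)
retransmit_gap = seconds_between("12:31:54.282252", "12:31:54.468909")

# GET sent -> first ack from the PA cache
first_ack = seconds_between("12:31:54.282252", "12:31:54.647664")

print(f"ICP round trip:   {icp_rtt:.6f} s")
print(f"retransmit after: {retransmit_gap:.6f} s")
print(f"ack arrived after:{first_ack:.6f} s")
```

The retransmission goes out sooner than one ICP round trip on the same path, which is consistent with the note that the timeout was not derived from the measured RTT.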
If there is a single parent cache, the parent should not be polled prior
to the actual data request. Under most circumstances multiple parenting
should be avoided for similar reasons. It is unlikely that the additional
parents will contribute substantially to the hit ratio, and waiting for
the response is costly. Having multiple parents that resolve to a single
parent for each URL is fine, or should be.
Multiple parents are often used for redundancy in the case where a parent
dies or becomes unreachable. Other mechanisms should be used for dealing
with this.
Parental miss distributions
The relative importance of the superfluous ICP time is obvious when
compared with the expected service time of the parent.
The miss distribution times for requests through the Digital Equipment
Palo Alto proxy show median access times between .25
and .5 sec. The PA cache used in these experiments shares the same
Internet connection with the Digital proxy, and should experience the same
miss distribution. This means that, even if the parent cache is likely to
miss, the time spent asking it whether it has the data is comparable to
the time it takes to fetch the data when it doesn't have it.
TCP over long/slow links
The TCP implementation or parameters on the NZ cache seemed very poor.
Several problems were evident from the tcpdump trace:
- As noted above, most new requests to the PA cache put the tcp
connection into a restart state (regardless of the persistence) with the
RT timeout set to a constant that was considerably smaller than the
RTT to the destination. From the TCP connection SYN-SYN exchange it should have
been able to calculate a reasonable timeout.
- The window size was much too small for the link.
The maximum window size should be equivalent to the amount of data
required to fill the pipe, such that the receiving end doesn't need to
wait for the next data packet and the sender doesn't need to wait for the
next ack. Depending on the actual condition of the line, the virtual
window size might be much lower than the allowed maximum.
There was little loss seen in the tcpdump trace, so for the test, one
would expect to have the link mostly full; it was mostly empty.
Each 1460-byte packet took about .01 seconds to transmit on a link with an
RTT of .278 sec. This should translate to a max. window size of about 25
packets, or 36500 bytes. The PA cache had a max window size of 33580, which is
appropriate for the link latency. But in this configuration the PA cache
was not in the position to receive packet streams from the NZ network.
About 1/3 of the pipe is utilized, which means that for every packet
sent the receiving host waits two packet wait times for the next packet.
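The window arithmetic can be reproduced as a rough bandwidth-delay calculation. The sketch below uses only figures from the text and the trace; the variable names are illustrative, and the NZ cache window of 8760 bytes is read from the `win 8760` fields that osiris advertises in the trace excerpt. It is a simplified ratio, not a model of the actual transfer, so it should only be expected to agree with the utilization estimate above in order of magnitude.

```python
MSS = 1460          # bytes per packet, from the text
PACKET_TIME = 0.01  # seconds to transmit one packet, from the text
RTT = 0.278         # seconds, from the text

# Packets that fit in the pipe during one round trip (the text rounds to about 25).
packets_in_flight = RTT / PACKET_TIME

# Window size quoted in the text: 25 packets of 1460 bytes.
window_bytes_text = 25 * MSS

# Receive window advertised by the NZ cache (osiris) in the trace.
nz_window = 8760
nz_window_packets = nz_window // MSS

# Fraction of the pipe the advertised window can keep full.
utilization = nz_window / window_bytes_text

print(f"packets per RTT:   {packets_in_flight:.1f}")
print(f"window from text:  {window_bytes_text} bytes")
print(f"NZ cache window:   {nz_window} bytes ({nz_window_packets} packets)")
print(f"window / pipe:     {utilization:.2f}")
```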
When there was other work to be done, the NZ cache seemed to delay
acks to the PA cache. It was typical for the NZ cache to accept a
stream of packets from the PA cache, and feed them to the Client cache, and
then finally respond with an ack to the PA cache. This stretched out the
amount of time the NZ cache was waiting for data from the PA cache.
Future work - extrapolating.
- Extrapolating to Satellite links
- Extrapolating improvements to hierarchies.
- Load tests to evaluate the reduced system load from
fewer open tcp connections.
- Re-evaluate with proper window size and new ICP implementation.
Thanks
- Special thanks to Virgil Champlin, who put up with my numerous TCP
questions and provided great insights into the world of tracedump.