Observations from a Trans-Pacific cache pair

Kathy J Richardson, Digital Network Systems Laboratory
in collaboration with

Donald Neal, University of Waikato, New Zealand
Duane Wessels and Kim Claffy at NLANR

Abstract/Introduction


This paper reports on a detailed analysis of tcpdump traces with persistent and non-persistent connections for identical URL sets. On March 17th, 1997, tcpdump traces were taken on a benchmark configuration. A client on the University of Waikato network made sequential HTTP requests to its local cache for 93 unique URLs. The list of requests was repeated 5 times for each of two configurations: Harvest with persistent connections and Harvest without persistent connections.

The tests were performed sequentially, so they represent slightly different network loads. Regardless, a straightforward comparison of the total time is not an adequate measure of the performance difference; loss on an individual request can severely perturb the overall average. In fact, the total elapsed time for the persistent configuration was longer than for the non-persistent configuration.

The results show the benefits of persistent connections and the impact of cache hierarchy decisions on performance, and they illustrate a few of the TCP and ICP problems over long links.

Performance benefits from persistent connections


The primary goal is to understand how persistent connections improve the latency distribution for servicing page requests.

Persistent and Non-persistent Request Response Time distributions
Time (sec)    .5-.75  .75-1  1-1.25  1.25-1.5  1.5-1.75  1.75-2  2-2.25  2.25-2.5  2.5-5  GT 5
Persistent    29%     30%    10%     7%        4.5%      3%      3%      1.5%      2%     10%
Non-persist   2.5%    15%    29%     17%       9%        9%      2.5%    1%        2%     12%

Median Bin:
Persistent: The bin between .75 and 1.0 seconds. This bin encompasses the 30th through 60th percentiles.
Non-persistent: Median at 1.25 second bin boundary.
Truncated average:
Average of all requests requiring less than 5 seconds:
Persistent: 1.127 seconds includes 92% of all references
Non-persistent: 1.446 seconds includes 89% of all references
Tail Average:
Average of all requests requiring more than 5 seconds to complete:
Persistent: 25.129 seconds
Non-persistent: 14.169 seconds

The truncated average round-trip time (including everything less than 5 seconds), as measured from the UDP round-trip times, was .2778 seconds.
Note:
Because long-tailed distributions dramatically affect the average or cumulative response time, it is important to look at the median time, a truncated average, and/or a distribution of the response times.
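The summary statistics used above can be sketched in a few lines. This is a minimal illustration, not the script used for the paper: the function name `summarize`, the treatment of a request taking exactly 5 seconds as part of the tail, and the sample data are all assumptions made for the example.

```python
from statistics import median

def summarize(times, cutoff=5.0):
    """Median, truncated mean (below cutoff), tail mean (cutoff and above), plain mean.

    Assumption: a request taking exactly `cutoff` seconds is counted in the tail.
    """
    body = [t for t in times if t < cutoff]
    tail = [t for t in times if t >= cutoff]
    return {
        "median": median(times),
        "truncated_mean": sum(body) / len(body) if body else None,
        "truncated_share": len(body) / len(times),
        "tail_mean": sum(tail) / len(tail) if tail else None,
        "mean": sum(times) / len(times),
    }

# Made-up sample (not trace data): nine ordinary requests and one 25-second outlier.
sample = [0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.1, 1.2, 25.0]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample a single slow request pulls the plain mean far above both the median and the truncated mean, which is the effect the note describes.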

The improvement in the truncated average response time (1.446 - 1.127 = .319 seconds) is close to one RTT (.278 seconds). This is what we would expect. Here is the configuration:
    ----------          ------------          -------------
    | Client |          | NZ cache |          | Palo Alto |
    |        |          |          |          | cache     |
    ----------          ------------          -------------
    no persistent       Persistent            Persistent
    capabilities        capabilities          capabilities

We only save the SYN/SYN-ACK exchange, one round trip, between the NZ cache and the PA cache.

The steady state case for the Non-persistent configuration:	
	Client - NZ cache  : tcp connection setup
			   : GET HTTP request
			   : tcp data transfer back to Client
	NZ cache - PA cache: UDP ICP
	NZ cache - PA cache: tcp connection setup
			   : GET HTTP request
			   : tcp data transfer back to NZ cache

The steady state case for the Persistent configuration:	
	Client - NZ cache  : tcp connection setup
			   : GET HTTP request
			   : tcp data transfer back to Client
	NZ cache - PA cache: UDP ICP
	NZ cache - PA cache: (existing tcp connection reused, no setup)
			   : GET HTTP request
			   : tcp data transfer back to NZ cache
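The two step lists can be turned into a rough count of trans-Pacific round trips per request. The sketch below assumes that each listed step on the NZ cache to PA cache leg costs exactly one RTT before the first data arrives, and it ignores the local client leg and the remaining data transfer; the names are illustrative only.

```python
RTT = 0.2778  # seconds, truncated average UDP round trip reported above

# Steps on the NZ cache - PA cache leg, taken from the two lists above.
# Assumption: each step costs one trans-Pacific round trip.
NON_PERSISTENT = ["UDP ICP", "tcp connection setup", "GET HTTP request"]
PERSISTENT = ["UDP ICP", "GET HTTP request"]

def long_haul_latency(steps, rtt=RTT):
    """Seconds spent waiting on trans-Pacific round trips before first data."""
    return len(steps) * rtt

saving = long_haul_latency(NON_PERSISTENT) - long_haul_latency(PERSISTENT)
measured = 1.446 - 1.127  # difference of the truncated averages reported above

print(f"modelled saving: {saving:.3f} s, measured difference: {measured:.3f} s")
```

Under these assumptions the model predicts a saving of one RTT, which is in line with the measured difference between the truncated averages.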

An additional hope is that persistent connections will eliminate the slow start effects for subsequent requests. Since the measured saving was a single RTT, it is unlikely that avoiding slow start gave a significant performance improvement in this test case.

The trace showed that persistent connections did eliminate slow start for many requests, but that any additional delay in the UDP exchange or request setup reactivated the slow start mechanism. The real reason for little or no performance improvement from slow start is that the TCP window for the NZ cache was much too small to fill the pipe, and most of the time the two caches were waiting for ack/data from each other. See section 3.
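The window limit can be estimated with the usual bound of one window of data per round trip. The sketch below is an approximation rather than a measurement: it takes the 8760-byte window that the NZ cache (osiris) advertises in the trace excerpt later in this paper, and the .2778 second truncated average RTT, and it ignores loss and slow start.

```python
def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound on throughput when at most one window is in flight per round trip."""
    return window_bytes / rtt_seconds

WINDOW = 8760  # bytes, "win 8760" advertised by osiris in the trace
RTT = 0.2778   # seconds, truncated average UDP round trip

bytes_per_sec = window_limited_throughput(WINDOW, RTT)
print(f"{bytes_per_sec:.0f} bytes/s, about {bytes_per_sec * 8 / 1000:.0f} kbit/s")
```

This gives a ceiling of roughly 31.5 kilobytes per second toward the NZ cache regardless of link capacity, so a sender that fills the window must then sit idle until acknowledgements return.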

One of the main advantages of persistent connections is that they reduce the amount of state required at the server. Persistent connections halved the number of connections made in the NZ cache and in the PA cache. If the client had supported persistent connections, the number of TCP connections for the test on the NZ cache would have been 2 (one with the client and one with the PA cache) instead of 466 (465 with the client, one per request for 93 URLs repeated 5 times, plus 1 with the PA cache). A similar situation would hold if servers supported persistent connections: many connections to the same server would collapse into a single connection.

Hierarchical cache


There are several reasons to use hierarchical caches; if they are not used carefully, however, they result in much poorer response time without providing any other benefits. The inter-cache communication adds latency to each cache miss, so care needs to be taken to understand the impact and the benefits of ICP.

In the test configuration the NZ cache had a single parent, the PA cache. The parent cache was polled to see if it had the data before the data was retrieved. It is current implementation and operational practice to use ICP as a failover mechanism as well as a data query: a cache will not send a request to a parent that fails to respond to ICP, and if the parent doesn't respond the cache automatically makes the request directly. This illustrates the importance of understanding how the cache protocol affects performance.

Implementing failover as a side effect of ICP is a poor design choice. A separate failover mechanism should be implemented so that otherwise unnecessary ICP overhead can be eliminated. First asking the parent cache if it has the data, and then requesting the data from that parent regardless of the answer, introduces an additional round-trip delay without providing any bandwidth saving. From the tracedump, the penalty is evaluated by examining the UDP time.

Inter-cache Protocol overhead - UDP message time

For this trans-Pacific link the average UDP response time for requests within the 90th percentile was .278 seconds. By running the cache in single parent mode, the response time to the client would have improved by another .278 seconds, to .849 sec for the persistent configuration and 1.168 sec for the non-persistent configuration (from the truncated averages of 1.127 and 1.446 sec). The tracedump looks like:
Client FIN for previous request and ack to NZ cache:
12:31:53.958256 memphis.cc.waikato.ac.nz.1736 > osiris.3128: F 91:91(0) ack 7519 win 33580 
12:31:53.958381 osiris.3128 > memphis.cc.waikato.ac.nz.1736: . ack 92 win 8760 
Client tcp SYN with NZ cache for next request:
12:31:53.997898 memphis.cc.waikato.ac.nz.1737 > osiris.3128: S 473856000:473856000(0) win 32768 
12:31:53.998199 osiris.3128 > memphis.cc.waikato.ac.nz.1737: S 1811449806:1811449806(0) ack 
12:31:53.999886 memphis.cc.waikato.ac.nz.1737 > osiris.3128: . ack 1 win 33580 
Client sends GET request in data pkt:
12:31:54.000257 memphis.cc.waikato.ac.nz.1737 > osiris.3128: P 1:89(88) ack 1 win 33580 
NZ Cache sends UDP request to parent PA cache and waits for response:
12:31:54.006140 osiris.3130 > cache.nlanr.pa-x.dec.com.3130: udp 65 
12:31:54.048889 osiris.3128 > memphis.cc.waikato.ac.nz.1737: . ack 89 win 8760 
12:31:54.279761 cache.nlanr.pa-x.dec.com.3130 > osiris.3130: udp 61
Got response - now send GET request via tcp (for non-persistent
connections this would first involve establishing a tcp connection)
12:31:54.282252 osiris.46918 > cache.nlanr.pa-x.dec.com.3128: P 257:375(118) ack 56491 win 8760
**12:31:54.468909 osiris.46918 > cache.nlanr.pa-x.dec.com.3128: P 257:375(118) ack 56491 win 8760 
12:31:54.647664 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: . ack 375 win 33580
PA cache sends back first data packets:
12:31:54.679161 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: P 56491:56703(212) ack 375 win 33580 
12:31:54.718401 cache.nlanr.pa-x.dec.com.3128 > osiris.46918: . ack 375 win 33580
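** Note: The marked packet is a retransmission of the GET request, sent about 187 ms after the original, which is well short of the roughly 274 ms round trip seen in the ICP exchange above. The NZ cache TCP implementation doesn't properly calculate the RTT for use as a response timeout, even though it has sufficient information to do so.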
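The intervals in the trace can be read off by subtracting tcpdump timestamps. The sketch below does this for three pairs of lines from the excerpt above; the helper name `seconds_between` is an assumption, and it only handles timestamps that fall on the same day.

```python
from datetime import datetime

def seconds_between(start, end):
    """Elapsed seconds between two tcpdump HH:MM:SS.microsecond timestamps (same day)."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

# UDP ICP query leaving osiris until the reply from the PA cache arrives.
icp_rtt = seconds_between("12:31:54.006140", "12:31:54.279761")
# First GET to the PA cache until the retransmitted copy marked with **.
retransmit_gap = seconds_between("12:31:54.282252", "12:31:54.468909")
# First GET to the PA cache until the PA cache acknowledges it.
get_to_ack = seconds_between("12:31:54.282252", "12:31:54.647664")

print(f"ICP round trip:  {icp_rtt:.6f} s")
print(f"retransmit gap:  {retransmit_gap:.6f} s")
print(f"GET to first ack: {get_to_ack:.6f} s")
```

The retransmit gap comes out shorter than the ICP round trip, so the GET is resent before an acknowledgement could plausibly have returned.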

If there is a single parent cache, the parent should not be polled prior to the actual data request. Under most circumstances multiple parenting should be avoided for similar reasons. It is unlikely that the additional parents will contribute substantially to the hit ratio, and waiting for the response is costly. Having multiple parents that resolve to a single parent for each URL is fine, or should be. Multiple parents are often used for redundancy in the case where a parent dies or becomes unreachable. Other mechanisms should be used for dealing with this.
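One way to have multiple parents resolve to a single parent for each URL, without an ICP query per request, is to pick the parent from a hash of the URL and to track parent health separately. The sketch below is hypothetical and is not how Harvest works: `pick_parent`, the `is_alive` callback (a stand-in for some separate failover mechanism such as a periodic health check), and the host names are all assumptions made for illustration.

```python
import hashlib

def pick_parent(url, parents, is_alive):
    """Map a URL to one parent deterministically, skipping parents marked down.

    `is_alive` is a hypothetical callback backed by a separate failover
    mechanism; no per-request ICP query is sent.
    """
    candidates = [p for p in parents if is_alive(p)]
    if not candidates:
        return None  # no usable parent: the caller fetches directly
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(candidates)
    return candidates[index]

# Hypothetical parent names, for illustration only.
parents = ["parent-a.example", "parent-b.example"]
all_up = lambda parent: True

choice = pick_parent("http://www.example.com/index.html", parents, all_up)
print(choice)
```

Because the choice depends only on the URL and the set of live parents, repeated requests for the same URL go to the same parent, and a dead parent is avoided without paying a round trip on every request. A simple modulo mapping like this reshuffles URLs when the set of live parents changes, which a real deployment might want to avoid.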

Parental miss distributions

The relative importance of the superfluous ICP time is obvious when compared with the expected service time of the parent. Requests through the Digital Equipment Palo Alto proxy that miss have median access times between .25 and .5 sec. The PA cache used in these experiments shares the same Internet connection with the Digital proxy, and should experience the same miss distribution. This means that even if the parent cache is likely to miss, the time spent asking whether it has the data is comparable to the time it takes the parent to fetch the data it doesn't have.

TCP over long/slow links


The TCP implementation or parameters on the NZ cache seemed very poor. Several problems were evident from the tcpdump trace:

Future work - extrapolating.

Extrapolating to Satellite links
Extrapolating improvements to hierarchies.
Load tests to evaluate the reduced system load from fewer open tcp connections.
Re-evaluate with proper window size and new ICP implementation.

Thanks

Special thanks to Virgil Champlin, who put up with my numerous TCP questions, and for providing great insights into the world of tracedump.