6. Statistics

Checklist

We recommend that you

  • keep logs private unless they have been anonymized
  • rotate and process log files at least once a day, because they grow large
  • keep the large size of the log files in mind if you write programs to analyze them
  • keep the privacy of your users in mind when you publish results of log file analysis
  • decide what kind of statistics you want from your server, or you'll drown in numbers

Aggregated statistics are useful both to your users and to the other Web cache managers in your mesh, so make them available.

6.1 Log files

Laws on privacy differ from one country to another, so make sure that the caching system does not violate the privacy of your users. Be very careful with your logs: avoid giving out information about individual users, and remember that usage patterns (that is, what material users access) may be sensitive information.

Log files on a busy server can grow very large. A server with about 300 000 requests a day will produce an access log file of about 30-40 MB. To save disk space, the log files should be rotated, compressed and stored once a day.

And you never know when somebody, by intent or accident, is going to make heavy use of your server. We've seen examples where a script has downloaded the same few files over and over again for a whole day. In this case the server was able to take the load, but the disk where the log file was stored was filled, and we lost a lot of log data. So if analyzing the log file is important, you should have some extra space on your log file disk.
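The daily rotate-and-compress step recommended above could be sketched roughly as follows. This is only an illustration: the function name `rotate_and_compress`, the date-stamp naming and the archive directory are our own assumptions, not part of any cache server. A running server keeps its log file open, so in practice it must also be told to reopen the log after the rename (Squid has `squid -k rotate` for this purpose).

```python
import gzip
import shutil
import time
from pathlib import Path


def rotate_and_compress(log_path, archive_dir, stamp=None):
    """Move log_path aside, gzip it into archive_dir, return the archive path."""
    log_path = Path(log_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d")

    # Rename first; a running server must then be told to reopen its log.
    rotated = log_path.with_name(f"{log_path.name}.{stamp}")
    log_path.rename(rotated)

    # Copy in chunks, so the log is never held in memory as a whole.
    archive = archive_dir / f"{rotated.name}.gz"
    with open(rotated, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Remove the uncompressed copy only after the archive has been written.
    rotated.unlink()
    return archive
```

Run from a daily job, this keeps only the compressed copies on disk. Storing the archives on a different disk than the live log gives some protection against the full-disk situation described above.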

6.2 Log file analyzing

A log file analyzing program must be written with the large size of the log file in mind. It should not try to read the whole log file into memory at once. If it does, it may well exhaust both the available memory and the swap space.
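One way to keep memory use constant is to process the log one line at a time. The sketch below assumes the native Squid access log layout (time, elapsed milliseconds, client, result/status, bytes, method, URL, ident, hierarchy/host, content type); check this against the log format of your own server version. The names `Entry` and `parse_lines` are ours, chosen for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    timestamp: float  # seconds since the epoch
    elapsed_ms: int   # time spent on the request, in milliseconds
    client: str       # client address
    result: str       # e.g. "TCP_HIT" (the part before "/")
    status: str       # e.g. "200" (the part after "/")
    size: int         # bytes sent to the client
    method: str
    url: str
    hierarchy: str    # e.g. "PARENT_HIT"
    peer: str         # host the request was resolved through, or "-"


def parse_lines(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed line; only one line is held at a time."""
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip truncated or malformed lines
        try:
            result, _, status = fields[3].partition("/")
            hierarchy, _, peer = fields[8].partition("/")
            yield Entry(
                float(fields[0]), int(fields[1]), fields[2],
                result, status, int(fields[4]),
                fields[5], fields[6], hierarchy, peer,
            )
        except ValueError:
            continue  # skip lines whose numeric fields do not parse
```

Because a Python file object yields its lines lazily, `with open("access.log") as f: for entry in parse_lines(f): ...` never loads more than one line, however large the file is.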

Analyzing large log files may take both considerable time and CPU power. If the server is very busy, you should consider analyzing the log files on a separate machine to avoid disturbing the cache server.

6.3 What to analyze for

There are a lot of parameters you may want to analyze for. Here we will mention the ones we have found significant when analyzing log files from a Squid server. Most of them apply to Web cache servers in general.

Request type

To measure how efficient the Web cache server is you will want to analyze for the following parameters:

Hit
When the requested object is served directly from the cache.
If-Modified-Since (IMS)
When an IMS request to the origin server confirms the freshness of the object. This will be done on stale objects, and when a client forces the server to check a fresh object.
Refresh
When the client forces the server (Pragma: no-cache) to fetch the object from the origin server regardless of the state of the cached copy.
Miss
When the object has to be fetched from parents, siblings or the origin server because it doesn't exist in the cache, or is stale.
Error
When the server is unable to serve the request for some reason.
Denied
When the server refuses to serve the client.

These parameters should be counted both in accesses and in bytes served to the clients. They should also be counted separately for the HTTP and the ICP protocol. Please note that refresh is not applicable to ICP requests. You may also want to split ICP hits into two types: if the object is small enough to fit into a UDP packet, the ICP protocol allows the object to be included in the reply on a hit. If it is too large, it will have to be fetched by a separate HTTP request.
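The counting described above could be sketched like this. The mapping from Squid result codes to the six categories is our own assumption and will need adjusting to the codes your Squid version actually logs. The names `CATEGORIES`, `classify`, `protocol` and `tally` are hypothetical. Codes not in the table are counted as "other" so that nothing is silently dropped.

```python
from collections import defaultdict

# Assumed mapping from result codes to the categories above; adjust to
# the codes that your own server version writes to its log.
CATEGORIES = {
    "TCP_HIT": "hit",
    "TCP_MEM_HIT": "hit",
    "TCP_IMS_HIT": "hit",
    "UDP_HIT": "hit",
    "UDP_HIT_OBJ": "hit",
    "TCP_REFRESH_HIT": "ims",
    "TCP_CLIENT_REFRESH": "refresh",
    "TCP_CLIENT_REFRESH_MISS": "refresh",
    "TCP_MISS": "miss",
    "TCP_REFRESH_MISS": "miss",
    "UDP_MISS": "miss",
    "TCP_DENIED": "denied",
    "UDP_DENIED": "denied",
}


def classify(result):
    """Return the category for one result code."""
    if result in CATEGORIES:
        return CATEGORIES[result]
    if result.startswith("ERR_"):
        return "error"
    return "other"


def protocol(result):
    """ICP requests are logged with a UDP_ prefix; the rest count as HTTP."""
    return "ICP" if result.startswith("UDP_") else "HTTP"


def tally(records):
    """records: iterable of (result_code, bytes_served).

    Returns {(protocol, category): [accesses, bytes]}.
    """
    totals = defaultdict(lambda: [0, 0])
    for result, size in records:
        key = (protocol(result), classify(result))
        totals[key][0] += 1
        totals[key][1] += size
    return dict(totals)
```

Dividing the hit count by the total number of accesses gives the request hit rate, and doing the same with the byte counts gives the byte hit rate.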

Traffic and usage

These parameters will give you an idea of how many users you serve, how much time your server uses to process requests, and how busy your server is.

Hosts using the server
You should count how many different hosts (IP addresses) use your server. You can also count how much traffic (in accesses and bytes) they cause, but in this case you should consider the privacy of your clients. If you don't need these numbers per host, you should count per domain only.
Elapsed time for requests
NOTE! This only applies to HTTP requests. It is a good idea to count how many requests are served within a given time, for example within 2, 4, 8, 16, ... seconds. This will give you an idea of how fast your users are served. Use a logarithmic scale for the time limits, preferably base 2.
Connections per second
This will give you an idea of how busy your server is.
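The elapsed-time count on a base 2 scale could be sketched as follows. We assume that elapsed times are available in milliseconds (as in the native Squid log) and that the smallest limit is one second. Both are our choices, as are the names `bucket_limit`, `elapsed_histogram` and `served_within`.

```python
from collections import Counter


def bucket_limit(elapsed_ms):
    """Smallest power of two, in seconds, that the elapsed time fits within."""
    limit = 1
    while limit * 1000 < elapsed_ms:
        limit *= 2
    return limit


def elapsed_histogram(elapsed_times_ms):
    """Count requests per time limit (1, 2, 4, 8, ... seconds)."""
    return Counter(bucket_limit(ms) for ms in elapsed_times_ms)


def served_within(histogram):
    """Turn per-limit counts into running totals: requests served within N s."""
    total = 0
    cumulative = {}
    for limit in sorted(histogram):
        total += histogram[limit]
        cumulative[limit] = total
    return cumulative
```

For example, a request that took 3500 ms is counted under the 4 second limit, and `served_within` then reports it among the requests served within 4 seconds and within every larger limit that occurs in the data.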

Parents and siblings

These parameters will tell you how effective your parents and siblings are.

Hierarchy
This will tell you how the request was resolved, that is, whether it was sent to the origin server or resolved through a parent or sibling. It will also give you hints about why the request was routed this way (source fastest, parent/sibling hit, ...).
Family
If the request was resolved through a parent or a sibling, you should gather statistics on how well that specific parent or sibling performs. You can do that by ignoring all the hierarchy tags that indicate a connection directly to the origin server, and counting the remaining hierarchy tags by host name. This will give you an idea of how much traffic the different parents and siblings handle for you. But it will not give you things like the hit rate for a particular parent or sibling. By analyzing the log file alone you cannot find out how many requests were sent to a particular parent or sibling; you can only count how many requests it answered. If you want to know how many requests it received, you will have to incorporate the rules for contacting parents and siblings given in the configuration file for the Squid server. If you do this, it will not be easy to analyze old log files, since the configuration may have changed.
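Counting hierarchy tags by host name could look roughly like the sketch below. We assume the hierarchy field has the form `TAG/host` (for example `PARENT_HIT/cache.example.net`), and that tags containing `DIRECT`, plus `NONE` and `SOURCE_FASTEST`, mean that no parent or sibling supplied the object. This rule is our simplification, so check it against the tags your server logs. The names `is_direct` and `peer_traffic` are hypothetical.

```python
from collections import defaultdict


def is_direct(tag):
    """Assumed rule for tags that do not involve a parent or sibling."""
    return tag in ("NONE", "SOURCE_FASTEST") or "DIRECT" in tag


def peer_traffic(records):
    """records: iterable of (hierarchy_field, bytes_served).

    Returns {host: {tag: (accesses, bytes)}} for parents and siblings only.
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for field, size in records:
        tag, _, host = field.partition("/")
        if is_direct(tag) or host in ("", "-"):
            continue  # resolved without a peer, or no host was logged
        totals[host][tag][0] += 1
        totals[host][tag][1] += size
    return {
        host: {tag: tuple(counts) for tag, counts in tags.items()}
        for host, tags in totals.items()
    }
```

As noted above, this only shows what each parent or sibling answered, not how many requests it was asked.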

DePStat (Desire Project Statistics)

The Desire Project has made available some experimental scripts for analyzing native format Squid logs. They can be fetched from the DePStat home. These scripts implement the requirements stated above.

6.4 Presentation of statistics

Example: UNINETT hit rates

[UNINETT top level hit rate] UNINETT
[Tromsø first level hit rate] Tromsø
The figures show the hit rates of the first-level cache at Tromsø University and the UNINETT top-level cache for the period 21. Oct - 1. Dec 1996. The drop in hit rate on 25. Nov is due to a total cleanout of the UNINETT cache. More statistics are available for the UNINETT and Tromsø University Web caches.

Typical total savings for the two-level cache system are around 50% on the number of connections made to the origin server, and around 55% for bytes downloaded.


Lars Slettjord, Ingrid Melve

cache-desire@uninett.no 2003-12-18