6. Statistics
Contents
- 6.1 Log files
- 6.2 Log file analyzing
- 6.3 What to analyze for
- 6.4 Presentation of statistics
Checklist
We recommend that you
- keep logs unavailable unless anonymized
- rotate and process logfiles once a day (or more) due to their large size
- keep the large size of the log files in mind if you write programs to
analyze them
- keep the privacy of your users in mind when you publish results of
log file analysis
- find out what kind of statistics you want from your server, or you'll
drown in numbers
Aggregated statistics are useful both to your users and to other
Web cache managers in your mesh, so make them available to your users.
6.1 Log files
We recommend that you
- keep logs unavailable unless anonymized
- rotate and process logfiles once a day
Laws on privacy differ from one country to another, so make sure that
the caching system does not violate the privacy of its users. Be very
careful with your logs: avoid giving out information about
individual users, and remember that usage patterns (for example, what
material users access) may be sensitive information.
Logfiles on a busy server can grow very large. A
server with about 300 000 requests a day will produce an access-log
file of about 30-40 MB. To save disk-space the log files should be
rotated, compressed and stored once a day.
And you never know when somebody, by intent or accident, is going
to make heavy use of your server. We've seen examples where a script
has downloaded the same few files over and over again for a whole
day. In this case the server was able to take the load, but the disk
where the log file was stored was filled, and we lost a lot of
log data. So if analyzing the log file is important, you should have
some extra space on your log file disk.
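The daily rotate-and-compress step recommended above could be sketched
roughly as follows. This is a minimal illustration, not a tested
production script: the function name, the date-stamp naming scheme and
the archive directory are our own assumptions, and a running cache
server must also be told to reopen its log file after the rename,
which is not shown here.

```python
import gzip
import os
import shutil
import time


def rotate_and_compress(log_path, archive_dir, stamp=None):
    """Move log_path aside, gzip it into archive_dir, return the archive path."""
    stamp = stamp or time.strftime("%Y%m%d")
    os.makedirs(archive_dir, exist_ok=True)
    rotated = f"{log_path}.{stamp}"
    # Rename first so the live file name is free again.
    os.rename(log_path, rotated)
    archive = os.path.join(archive_dir, os.path.basename(rotated) + ".gz")
    # Copy through gzip in chunks; the log never has to fit in memory.
    with open(rotated, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(rotated)
    return archive
```

Running something like this once a day from a scheduler keeps only
compressed archives on disk; how long to keep them depends on your disk
space and your privacy obligations.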
6.2 Log file analyzing
A log file analysis program must be written with the large size of
the log file in mind. It should not try to read the whole log file
into memory at once; if it does, it may well exhaust both the
available memory and the swap space.
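To illustrate reading a log without loading it into memory, here is a
small sketch that processes one line at a time. We assume the Squid
native log format, where whitespace-separated field 6 is the request
method; the function names are hypothetical, and the same loop works on
gzip-compressed archives.

```python
import gzip
from collections import Counter


def open_log(path):
    """Open a plain or gzip-compressed log for line-by-line text reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")


def count_methods(path):
    """Count requests per method while holding only one line at a time."""
    counts = Counter()
    with open_log(path) as log:
        for line in log:  # the file object yields lines lazily
            fields = line.split()
            if len(fields) >= 6:
                # Assumed native format: field index 5 is the method.
                counts[fields[5]] += 1
    return counts
```

Memory use then depends on the number of distinct keys being counted,
not on the size of the log file.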
Analyzing large log files may take both considerable time and CPU
power. If the server is very busy, you should consider analyzing the
log files on a separate machine to avoid disturbing the cache server.
6.3 What to analyze for
There are a lot of parameters you may want to analyze for. Here we
will mention the ones we have found significant when analyzing log
files from a Squid server. Most of them apply in general for all
Web cache servers.
Request type
To measure how efficient the Web cache server is you will want to
analyze for the following parameters:
- Hit
- When the requested object is served directly from the cache.
- If-Modified-Since (IMS)
- When an IMS request to the origin server confirms the freshness of
the object. This will be done on stale objects, and when a client forces
the server to check a fresh object.
- Refresh
- When the client forces the server (Pragma: no-cache) to fetch the
object from the origin server regardless of the state of the cached
copy.
- Miss
- When the object has to be fetched from parents, siblings or the
origin server because it doesn't exist in the cache, or is stale.
- Error
- When the server is unable to serve the request for some reason.
- Denied
- When the server refuses to serve the client.
These parameters should be counted both in accesses and in bytes
served to the clients, and for both the HTTP and the ICP protocol.
Note that refresh does not apply to ICP requests. For ICP you may
also want to split hits into two types: if the object is small enough
to fit into a UDP packet, the ICP protocol allows the object to be
included in the reply on a hit; if it is too large, it has to be
fetched by a separate HTTP request.
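As a sketch of how these counts might be gathered, the code below
tallies accesses and bytes per protocol and request type from Squid
native-format lines. The mapping from Squid result tags to the
categories above is our own guess for illustration and must be checked
against the tags your Squid version actually writes; tags not in the
table are counted as "other".

```python
from collections import defaultdict

# Assumed mapping from Squid result tags to the categories in the text.
CATEGORY = {
    "TCP_HIT": "hit",
    "UDP_HIT": "hit",
    "UDP_HIT_OBJ": "hit_obj",  # ICP hit with the object in the reply
    "TCP_REFRESH_HIT": "ims",
    "TCP_IMS_HIT": "ims",
    "TCP_CLIENT_REFRESH": "refresh",
    "TCP_MISS": "miss",
    "UDP_MISS": "miss",
    "TCP_DENIED": "denied",
    "UDP_DENIED": "denied",
}


def tally(lines):
    """Return {(protocol, category): [accesses, bytes]}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip truncated lines
        tag = fields[3].split("/")[0]  # e.g. "TCP_HIT/200" -> "TCP_HIT"
        try:
            size = int(fields[4])
        except ValueError:
            continue
        protocol = "ICP" if tag.startswith("UDP_") else "HTTP"
        if tag.startswith("ERR_"):
            category = "error"
        else:
            category = CATEGORY.get(tag, "other")
        entry = totals[(protocol, category)]
        entry[0] += 1
        entry[1] += size
    return dict(totals)
```

Because `tally` accepts any iterable of lines, it can be fed an open
file object directly and never needs the whole log in memory.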
Traffic and usage
These parameters will give you an idea of how many users you serve,
how much time your server uses to process requests, and how busy your
server is.
- Hosts using the server
- You should count how many different hosts (IP numbers) use
your server. You can also count how much traffic (in accesses and
bytes) each of them causes, but in that case you should consider the
privacy of your clients. If you don't need these numbers per host,
count per domain only.
- Elapsed time for requests
- NOTE! This only applies to HTTP requests. It is a good
idea to count how many requests are served within a given
time, for example within 2, 4, 16, ... seconds. This will give you
an idea of how fast your users are served. Use a logarithmic scale
for the time limits, preferably a log(2) scale.
- Connections per second
- This will give you an idea of how busy your server is.
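The elapsed-time count on a log(2) scale could be sketched as below.
We assume the Squid native format, where the second field is the
elapsed time in milliseconds and ICP entries carry a tag beginning
with UDP_; starting the scale at 1 second is our own choice.

```python
from collections import Counter


def elapsed_histogram(lines):
    """Count HTTP requests by the smallest power-of-two bound (seconds)."""
    hist = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[3].startswith("UDP_"):
            continue  # elapsed time only applies to HTTP requests
        try:
            seconds = int(fields[1]) / 1000.0
        except ValueError:
            continue
        bound = 1
        while bound < seconds:
            bound *= 2  # 1, 2, 4, 8, 16, ... seconds
        hist[bound] += 1
    return hist
```

Each request is counted once, under the smallest bound it fits within;
a running sum over the sorted bounds gives the number of requests
served within N seconds.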
Parents and siblings
These parameters will tell you how effective your parents and
siblings are.
- Hierarchy
- This will tell you how the request was resolved, that is, whether
it was sent to the origin server or resolved through a parent or
sibling. It will also give you hints about why the request was routed
this way (source fastest, parent/sibling hit, ...).
- Family
- If the request was resolved through a parent or a sibling, you
should gather statistics on how well that specific parent or sibling
performs. You can do that by ignoring all the hierarchy tags that
indicate a connection directly to the origin server, and counting the
remaining hierarchy tags by host name. This will give you an idea of
how much traffic the different parents and siblings handle for
you, but it will not give you things like the hit rate
for a particular parent/sibling. By analyzing the log file alone you
cannot find out how many requests were sent to a
particular parent/sibling; you can only count how many requests it
answered. If you want to know how many requests it received, you'll
have to incorporate the rules for contacting parents/siblings given in
the configuration file for the Squid server. If you do this, it will
not be easy to analyze old log files, since the configuration may
have changed.
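Counting hierarchy tags by host name, as described above, might look
like the following sketch. We assume the Squid native format, where
the ninth field has the form TAG/host. Which tags mean "went directly
to the origin server" is an assumption here (any tag containing
DIRECT, plus SOURCE_FASTEST and NONE) and should be adjusted to the
tags your server logs.

```python
from collections import Counter


def peer_counts(lines):
    """Count (hierarchy tag, peer host) pairs, skipping direct fetches."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue
        tag, _, host = fields[8].partition("/")
        # Assumption: these tags mean no parent or sibling supplied
        # the object, so they are ignored.
        if "DIRECT" in tag or tag in ("SOURCE_FASTEST", "NONE"):
            continue
        counts[(tag, host)] += 1
    return counts
```

The result shows how many requests each parent or sibling answered,
which, as noted above, is not the same as how many it was asked.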
DePStat (Desire Project Statistics)
The Desire Project has made available some experimental scripts for
analyzing native format Squid logs. They can be fetched from the
DePStat home. These scripts implement the requirements stated above.
6.4 Presentation of statistics
Example: UNINETT hit rates
[Figures: hit rates for the UNINETT and Tromsø caches]
The figures show the hit rates of the first-level cache at the University of Tromsø and the UNINETT top-level cache for the period 21 Oct - 1 Dec 1996. The drop in hit rate on 25 Nov is due to a total cleanout of the UNINETT cache. More statistics are available for the UNINETT and University of Tromsø Web caches.
Typical total savings for the two-level cache system are around 50% on the number of connections made to the origin server, and around 55% for bytes downloaded.
Lars Slettjord, Ingrid Melve