DePStat (Desire Project Statistics)

This is a package for analyzing Squid 1.1.x native format log files. The package has been developed for UNINETT as a part of the Caching our Desire project, which is a part of the DESIRE program.

The software is written in Perl 5, and has been known to analyze about 500 lines/second on a 166 MHz Pentium processor running FreeBSD. Since it will be used to analyze huge log files, it has been written to minimize memory usage. It also saves its results during the run, which is very handy when analyzing log files that cover several days.

The script analyzes the log file on a daily basis. The result is the same whether you split the log for one or more days into several pieces, or run the script on the whole log file.
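The per-day behaviour described above can be sketched as follows. This is an illustration in Python, not DePStat's own Perl code. It assumes that each native-format line starts with a UNIX timestamp, and it assumes UTC day boundaries (DePStat may well use local time). The names `day_key`, `count_per_day` and `SAMPLE`, and the sample lines themselves, are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def day_key(line):
    """Return the UTC calendar day (YYYY-MM-DD) of one native-format log line."""
    timestamp = float(line.split(None, 1)[0])  # first field: UNIX time with milliseconds
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def count_per_day(lines, totals=None):
    """Add the number of requests per day to `totals` and return it.

    Passing the totals from an earlier piece of the log continues the count,
    so the result does not depend on how the log was split.
    """
    totals = Counter() if totals is None else totals
    for line in lines:
        if line.strip():  # ignore blank lines
            totals[day_key(line)] += 1
    return totals


# Made-up sample lines (assumed field layout, hypothetical hosts and URLs).
SAMPLE = [
    "854755200.000 120 10.0.0.1 TCP_MISS/200 1024 GET http://example.com/ - DIRECT/example.com text/html",
    "854798400.500 15 10.0.0.2 TCP_HIT/200 2048 GET http://example.com/a.gif - NONE/- image/gif",
    "854841600.250 300 10.0.0.1 TCP_MISS/200 512 GET http://example.org/ - DIRECT/example.org text/html",
]

if __name__ == "__main__":
    whole = count_per_day(SAMPLE)
    pieces = count_per_day(SAMPLE[:1])
    count_per_day(SAMPLE[1:], pieces)  # second piece continues the first
    print(dict(whole), whole == pieces)
```

Because every count is keyed by day, running the sketch on the whole sample and on two pieces of it gives identical totals.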

Downloading the package

The latest package will be available from http://www.uninett.no/prosjekt/desire/DePStat/current.tar.gz.

The package contains these files:

DePStat-x.x[bx].pl
The program which analyzes native format access log files from Squid.
dbformat.txt
Describes how the different data are stored, and which keys are used.
h1.pl
A small library to load, save and print perl Hashes.
h3.pl
A small library to load, save and print perl Hashes of Hashes of Hashes.
dispH3.pl
A very small program for displaying Hashes of Hashes of Hashes.

The h1.pl and h3.pl libraries may be used to easily read the analyzed data into other applications, and they demonstrate how to traverse the data structures used. The dispH3.pl program can be used to look at the extracted data.

Please check whether you agree with our way of counting hits, misses, ims, errors, deny and refresh. The code is the documentation (at the moment), and the lists near the start of DePStat-x.x[bx].pl describe which log tags we count as what.

We have not made any tools for visualizing these data yet. We have an unfinished program that makes LARGE HTML tables of these data, but a graphical presentation would be far superior. If anyone wants to write something that visualizes these numbers, please go ahead. :-)

Analyzed parameters

These are thoughts that were made before and during the process of writing this package. Not everything is implemented yet.

There are a lot of parameters one may want to analyze for. Here we will mention the ones we have found significant when analyzing log files from a Squid server. Most of them apply in general for all web-cache servers.

Request type

To measure how efficient the web-cache server is you will want to analyze for the following parameters:

Hit
When the requested object is served directly from the cache.
If Modified Since (IMS)
When an IMS request to the origin server confirms the freshness of the object. This is done for stale objects, and when a client forces the server to check a fresh object.
Refresh
When the client forces the server (Pragma: no-cache) to fetch the object from the origin server regardless of the state of the cached copy.
Miss
When the object has to be fetched from parents, siblings or the origin server because it doesn't exist in the cache, or it is stale.
Error
When the server is unable to serve the request for some reason.
Denied
When the server refuses to serve the client.

These parameters should be counted both in accesses and in bytes served to the clients, and for both the HTTP and the ICP protocol. Please note that refresh is not applicable to ICP requests, and that you may also want to split ICP hits by type. If the object is small enough to fit into a UDP packet, the ICP protocol allows the object to be included in the reply on a hit. If it is too large, it has to be fetched by a separate HTTP request.
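As a rough illustration of counting these categories in both accesses and bytes, per protocol, here is a small Python sketch. It is not DePStat's code. The tag-to-category table is a hypothetical mapping chosen for the example (the mapping DePStat actually uses is the one listed near the start of DePStat-x.x[bx].pl), and the sketch assumes that the fourth field of a line is "tag/status", that the fifth is the byte count, and that ICP requests are the ones whose tag starts with UDP_.

```python
from collections import defaultdict

# Hypothetical mapping for illustration only; adjust it to your own way of counting.
CATEGORY_BY_TAG = {
    "TCP_HIT": "hit",
    "TCP_IMS_HIT": "ims",
    "TCP_REFRESH_HIT": "ims",
    "TCP_CLIENT_REFRESH": "refresh",
    "TCP_MISS": "miss",
    "TCP_REFRESH_MISS": "miss",
    "TCP_DENIED": "denied",
    "UDP_HIT": "hit",
    "UDP_HIT_OBJ": "hit_obj",  # ICP hit with the object included in the reply
    "UDP_MISS": "miss",
    "UDP_DENIED": "denied",
}


def classify(tag):
    """Map a log tag to a category; unknown tags end up in 'other'."""
    if tag.startswith("ERR_"):
        return "error"
    return CATEGORY_BY_TAG.get(tag, "other")


def tally(lines):
    """Count accesses and bytes per (protocol, category)."""
    totals = defaultdict(lambda: {"accesses": 0, "bytes": 0})
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        tag = fields[3].split("/", 1)[0]
        protocol = "ICP" if tag.startswith("UDP_") else "HTTP"
        entry = totals[(protocol, classify(tag))]
        entry["accesses"] += 1
        entry["bytes"] += int(fields[4])
    return dict(totals)


# Made-up sample lines (hypothetical hosts and URLs).
SAMPLE = [
    "854755200.000 15 10.0.0.2 TCP_HIT/200 2048 GET http://example.com/a.gif - NONE/- image/gif",
    "854755201.000 120 10.0.0.1 TCP_MISS/200 1024 GET http://example.com/ - DIRECT/example.com text/html",
    "854755202.000 1 10.0.0.3 UDP_HIT_OBJ/000 700 ICP_QUERY http://example.com/b.gif - NONE/- -",
    "854755203.000 130 10.0.0.1 TCP_MISS/200 4096 GET http://example.org/ - DIRECT/example.org text/html",
]

if __name__ == "__main__":
    for key, value in sorted(tally(SAMPLE).items()):
        print(key, value)
```

Keeping the mapping in one table makes it easy to compare against, or replace with, another way of counting.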

Traffic and usage

These parameters will give you an idea of how many users you serve, how long your server takes to process requests, and how busy your server is.

Hosts using the server
You should count how many different hosts (IP addresses) use your server. You can also count how much traffic (in accesses and bytes) they cause, but in that case you should consider the privacy of your clients. If you don't need these numbers per host, you should count per domain only.
Elapsed time for requests
NOTE! This only applies to HTTP requests. It is a good idea to count how many requests are served within a given time, i.e. how many requests are served within 2, 4, 8, 16, ... seconds. This will give you an idea of how fast your users are served. It is a good idea to use a logarithmic scale, preferably a log(2) scale.
Connections per second
This will give you an idea of how busy your server is.
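The log(2) bucketing suggested for elapsed time can be sketched like this in Python. It is an illustration only, not DePStat's code. It assumes that the second field of a line is the elapsed time in milliseconds and that ICP requests are recognizable by a tag starting with UDP_. The function names and sample lines are made up for the example.

```python
from collections import Counter


def bucket_seconds(elapsed_ms):
    """Return the smallest power-of-two number of seconds (1, 2, 4, ...) covering elapsed_ms."""
    limit = 1
    while limit * 1000 < elapsed_ms:
        limit *= 2
    return limit


def elapsed_histogram(lines):
    """Count HTTP requests per log(2) bucket of elapsed time."""
    histogram = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[3].startswith("UDP_"):
            continue  # skip malformed lines and ICP requests
        histogram[bucket_seconds(int(fields[1]))] += 1
    return histogram


# Made-up sample lines (hypothetical hosts and URLs).
SAMPLE = [
    "854755200.000 500 10.0.0.1 TCP_HIT/200 2048 GET http://example.com/a - NONE/- text/html",
    "854755201.000 1001 10.0.0.1 TCP_MISS/200 1024 GET http://example.com/b - DIRECT/example.com text/html",
    "854755202.000 3500 10.0.0.2 TCP_MISS/200 1024 GET http://example.com/c - DIRECT/example.com text/html",
    "854755203.000 9000 10.0.0.2 TCP_MISS/200 1024 GET http://example.com/d - DIRECT/example.com text/html",
    "854755204.000 2 10.0.0.3 UDP_MISS/000 60 ICP_QUERY http://example.com/e - NONE/- -",
]

if __name__ == "__main__":
    for limit, count in sorted(elapsed_histogram(SAMPLE).items()):
        print("within", limit, "s:", count)
```

Each bucket here holds only the requests that fall between the previous limit and its own; a running sum over the sorted buckets gives the number of requests served within each limit.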

Parents and siblings

These parameters will tell you how effective your parents and siblings are.

Hierarchy
This will tell you how the request was resolved, i.e. whether it was sent to the origin server or resolved through a parent or sibling. It will also give you hints about why the request was routed this way (source fastest, parent/sibling hit, ...).
Family
If the request was resolved through a parent or a sibling, you should gather statistics on how well that specific parent or sibling performs. You can do that by ignoring all hierarchy tags that indicate a connection directly to the origin server, and counting the remaining hierarchy tags by host name. This will give you an idea of how much traffic the different parents and siblings handle for you, but it will not give you things like the hit rate for a particular parent/sibling. By just analyzing the log file you cannot find out how many requests were sent to a particular parent/sibling; you can only count how many requests it answered. If you want to know how many requests it received, you will have to incorporate the rules for contacting parents/siblings given in the configuration file for the Squid server. If you do this, it will not be easy to analyze old log files, since the configuration may have changed.
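Counting hierarchy tags by host name, as described for Family, might look roughly like the Python sketch below. It is not DePStat's code. It assumes that the ninth field of a line has the form "hierarchy-tag/host", and the rule in `is_direct` for deciding which tags mean "went straight to the origin server" is an assumption made for this example; check it against the hierarchy tags your own Squid version writes.

```python
from collections import Counter


def is_direct(tag):
    """Assumed rule: these hierarchy tags do not involve a parent or sibling."""
    return tag in ("NONE", "SOURCE_FASTEST") or "DIRECT" in tag


def peer_tallies(lines):
    """Count answered requests per (peer host, hierarchy tag)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip malformed lines
        tag, _, host = fields[8].partition("/")
        if is_direct(tag) or host in ("", "-"):
            continue  # not resolved through a parent or sibling
        counts[(host, tag)] += 1
    return counts


# Made-up sample lines (hypothetical hosts and URLs).
SAMPLE = [
    "854755200.000 120 10.0.0.1 TCP_MISS/200 1024 GET http://example.com/ - DIRECT/example.com text/html",
    "854755201.000 80 10.0.0.1 TCP_MISS/200 2048 GET http://example.com/a - PARENT_HIT/parent.example.net text/html",
    "854755202.000 90 10.0.0.2 TCP_MISS/200 512 GET http://example.com/b - PARENT_HIT/parent.example.net text/html",
    "854755203.000 60 10.0.0.2 TCP_MISS/200 256 GET http://example.com/c - SIBLING_HIT/sibling.example.net text/html",
    "854755204.000 15 10.0.0.3 TCP_HIT/200 2048 GET http://example.com/d - NONE/- text/html",
]

if __name__ == "__main__":
    for (host, tag), count in sorted(peer_tallies(SAMPLE).items()):
        print(host, tag, count)
```

As noted above, these are counts of requests each parent or sibling answered, not of requests sent to it, so they cannot be turned into a per-peer hit rate.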


February 1997 / webmaster@uninett.no