DePStat (Desire Project Statistics)
This is a package for analyzing Squid 1.1.x native format log files.
The package has been developed for UNINETT as a part of the
Caching our Desire project, which is a part of the DESIRE program.
The software is written in Perl 5, and has been known to analyze
about 500 lines/second on a 166 MHz Pentium processor running
FreeBSD. Since it will be used to analyze huge log files, it has been
written to minimize memory usage. It also saves its results during
the run, which is very handy when analyzing log files that span
several days.
The script analyzes the log file on a daily basis. The result is the
same whether you split the log file for one or more days into several
pieces and run the script on each piece, or run it on the whole log file.
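The per-day property described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not DePStat's Perl code. It assumes that the first field of each native-format line is a timestamp in seconds since the epoch and that days are cut at UTC midnight (DePStat may use local time). The function names and the sample log lines are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_of(line):
    """Return the UTC date (YYYY-MM-DD) of one native-format log line."""
    # Assumption: field 1 is the timestamp in seconds since the epoch.
    timestamp = float(line.split(None, 1)[0])
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def count_per_day(lines, totals=None):
    """Add the number of requests per day to `totals` and return it."""
    if totals is None:
        totals = defaultdict(int)
    for line in lines:
        if line.strip():
            totals[day_of(line)] += 1
    return totals


# Made-up sample lines: two requests on each of two consecutive days.
sample = [
    "854755200.100 120 10.0.0.1 TCP_HIT/200 1024 GET http://example.com/a - NONE/- text/html",
    "854798400.500 900 10.0.0.2 TCP_MISS/200 2048 GET http://example.com/b - DIRECT/example.com text/html",
    "854841600.250 300 10.0.0.1 TCP_HIT/200 512 GET http://example.com/a - NONE/- text/html",
    "854884800.750 150 10.0.0.3 TCP_MISS/200 4096 GET http://example.com/c - DIRECT/example.com text/html",
]

# Analyze the whole log in one run ...
whole = count_per_day(sample)

# ... or in two pieces that share the same saved totals.
pieces = count_per_day(sample[:1])
pieces = count_per_day(sample[1:], pieces)

print(dict(whole) == dict(pieces))
```

Because every counter is keyed by day and only ever added to, the order and size of the pieces do not matter as long as the saved totals are carried from one run to the next.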
Downloading the package
The latest package will be available from
http://www.uninett.no/prosjekt/desire/DePStat/current.tar.gz.
The package contains these files:
- DePStat-x.x[bx].pl
- The program which analyzes native format access log files from Squid.
- dbformat.txt
- Describes how the different data are stored, and which keys are used.
- h1.pl
- A small library to load, save and print Perl hashes.
- h3.pl
- A small library to load, save and print Perl hashes of hashes of hashes.
- dispH3.pl
- A very small program for displaying hashes of hashes of hashes.
The h1.pl and h3.pl libraries may be used to easily read the analyzed
data into other applications, and they demonstrate how to traverse the
data structures used. The dispH3.pl program can be used to look at the
extracted data.
Please check whether you agree with our way of counting hits, misses,
IMS, errors, denied and refresh. The code is the documentation (at the
moment), and the lists near the start of DePStat-x.x[bx].pl describe
which log tags we count as what.
We have not made any tools for visualizing these data yet. We have an
unfinished program that makes LARGE HTML tables of these data, but a
graphical presentation would be far superior. If anyone wants to write
something that visualizes these numbers, please go ahead. :-)
Analyzed parameters
These are thoughts that were formed before and during the process of
writing this package. Not everything is implemented yet.
There are a lot of parameters one may want to analyze. Here we
mention the ones we have found significant when analyzing log
files from a Squid server. Most of them apply to web-cache servers
in general.
Request type
To measure how efficient the web-cache server is, you will want to
analyze the following parameters:
- Hit
- When the requested object is served directly from the cache.
- If Modified Since (IMS)
- When an IMS request to the origin server confirms the freshness of
the object. This will be done on stale objects, and when a client forces
the server to check a fresh object.
- Refresh
- When the client forces the server (Pragma: no-cache) to fetch the
object from the origin server regardless of the state of the cached
copy.
- Miss
- When the object has to be fetched from parents, siblings or the
origin server because it doesn't exist in the cache, or it is stale.
- Error
- When the server is unable to serve the request for some reason.
- Denied
- When the server refuses to serve the client.
These parameters should be counted both in accesses and in bytes
served to the clients. They should also be counted for both the HTTP
and the ICP protocol. Please note that refresh is not applicable to
ICP requests, and you may also want to split hits into two types for
these. If the object is small enough (fits into a UDP packet), the ICP
protocol allows the object to be included in the reply on a hit. If it
is too large it will have to be fetched by a separate HTTP request.
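The counting scheme above can be sketched as follows. This is an illustration in Python, not DePStat's Perl code, and the tag-to-category table is an assumption made for the example. The authoritative lists are the ones near the start of DePStat-x.x[bx].pl. The sketch also assumes that field 4 of a native-format line is "tag/status", that field 5 is the byte count, that ICP entries carry tags starting with UDP_, and that error tags start with ERR_. The sample lines are made up.

```python
from collections import defaultdict

# Assumed mapping for illustration only; DePStat's own lists may differ.
CATEGORY_BY_TAG = {
    "TCP_HIT": "hit",
    "UDP_HIT": "hit",
    "UDP_HIT_OBJ": "hit",      # ICP hit with the object included in the reply
    "TCP_IMS_HIT": "ims",
    "TCP_REFRESH_HIT": "ims",
    "TCP_CLIENT_REFRESH": "refresh",
    "TCP_MISS": "miss",
    "UDP_MISS": "miss",
    "TCP_REFRESH_MISS": "miss",
    "TCP_DENIED": "denied",
    "UDP_DENIED": "denied",
}


def classify(tag):
    """Map a log tag to one of the request types, or 'other' if unknown."""
    if tag.startswith("ERR_"):
        return "error"
    return CATEGORY_BY_TAG.get(tag, "other")


def protocol(tag):
    """ICP entries are assumed to carry UDP_ tags; everything else is HTTP."""
    return "ICP" if tag.startswith("UDP_") else "HTTP"


def tally(lines):
    """Count accesses and bytes per (protocol, request type)."""
    counts = defaultdict(lambda: {"accesses": 0, "bytes": 0})
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        tag = fields[3].split("/")[0]
        key = (protocol(tag), classify(tag))
        counts[key]["accesses"] += 1
        counts[key]["bytes"] += int(fields[4])
    return counts


# Made-up sample lines.
sample = [
    "854755200.100 120 10.0.0.1 TCP_HIT/200 1024 GET http://example.com/a - NONE/- text/html",
    "854755200.900 3500 10.0.0.2 TCP_MISS/200 20480 GET http://example.com/b - DIRECT/example.com text/html",
    "854755201.200 2 10.0.0.3 UDP_HIT/000 60 ICP_QUERY http://example.com/a - NONE/- -",
    "854755202.000 5 10.0.0.4 TCP_DENIED/403 300 GET http://example.com/d - NONE/- -",
    "854755203.000 800 10.0.0.2 ERR_CONNECT_FAIL/400 500 GET http://example.org/ - NONE/- -",
]

counts = tally(sample)
for key in sorted(counts):
    print(key, counts[key])
```

Keeping UDP_HIT and UDP_HIT_OBJ as separate keys in the table (for example "hit" and "hit_obj") is one way to split the two kinds of ICP hit mentioned above.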
Traffic and usage
These parameters will give you an idea of how many users you serve,
how long your server takes to process requests, and how busy your
server is.
- Hosts using the server
- You should count how many different hosts (IP addresses) use
your server. You can also count how much traffic (in accesses and
bytes) they cause, but in that case you should consider the privacy of
your clients. If you don't need these numbers per host, you should
count per domain only.
- Elapsed time for requests
- NOTE! This only applies to HTTP requests. It is a good
idea to count how many requests are served within a given
time, i.e. how many requests are served within 2, 4, 8,
16, ... seconds. This will give you an idea of how fast your users are
served. It is a good idea to use a logarithmic scale, preferably a
log(2) scale.
- Connections per second
- This will give you an idea of how busy your server is.
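The elapsed-time and load counters above can be sketched as follows. This is an illustration in Python, not DePStat's Perl code. It assumes that field 2 of a native-format line is the elapsed time in milliseconds, that ICP entries carry tags starting with UDP_ and are left out, and that the number of log entries sharing the same whole-second timestamp is an acceptable approximation of connections per second. The choice of 1 second as the smallest bucket, the function names and the sample lines are made up.

```python
from collections import Counter


def bucket_seconds(elapsed_ms):
    """Return the smallest power of two (in seconds, at least 1) that the
    request was served within."""
    limit = 1
    while elapsed_ms > limit * 1000:
        limit *= 2
    return limit


def http_entries(lines):
    """Yield (timestamp, elapsed_ms) for HTTP entries, skipping ICP (UDP_) ones."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[3].startswith("UDP_"):
            continue
        yield float(fields[0]), int(fields[1])


def elapsed_histogram(lines):
    """Count HTTP requests per log(2) elapsed-time bucket."""
    return Counter(bucket_seconds(ms) for _, ms in http_entries(lines))


def requests_per_second(lines):
    """Count HTTP log entries per whole second of their timestamp."""
    return Counter(int(ts) for ts, _ in http_entries(lines))


# Made-up sample lines.
sample = [
    "854755200.100 120 10.0.0.1 TCP_HIT/200 1024 GET http://example.com/a - NONE/- text/html",
    "854755200.900 3500 10.0.0.2 TCP_MISS/200 20480 GET http://example.com/b - DIRECT/example.com text/html",
    "854755201.200 2 10.0.0.3 UDP_MISS/000 60 ICP_QUERY http://example.com/c - NONE/- -",
    "854755201.700 9000 10.0.0.4 TCP_MISS/200 8192 GET http://example.com/d - DIRECT/example.com text/html",
]

histogram = elapsed_histogram(sample)
per_second = requests_per_second(sample)
print(sorted(histogram.items()))
print(max(per_second.values()))
```

With power-of-two buckets the table stays short even when a few requests take minutes, which is the main reason to prefer a logarithmic scale here.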
Parents and siblings
These parameters will tell you how effective your parents and
siblings are.
- Hierarchy
- This will tell how the request was resolved, i.e. whether it was
sent to the origin server or resolved through a parent or sibling. It
will also give you hints about why the request was routed this way
(source fastest, parent/sibling hit, ...).
- Family
- If the request was resolved through a parent or a sibling you
should gather statistics on how well the specific parent or sibling
performs. You can do that by ignoring all the hierarchy tags that
indicate a connection directly to the origin server, and counting the
remaining hierarchy tags by host name. This will give you an idea of
how much traffic the different parents and siblings handle for
you. But it will not give you things like the hit rate
for a particular parent/sibling. By just analyzing the log file you
are not able to find out how many requests were sent to a
particular parent/sibling. You can only count how many requests it
answered. If you want to know how many requests it got, you'll have to
incorporate the rules for contacting parents/siblings given in the
configuration file for the Squid server. If you do this it will not be
easy to analyze old log files, since the configuration may change.
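The family count described above can be sketched as follows. This is an illustration in Python, not DePStat's Perl code. It assumes that field 9 of a native-format line has the form "hierarchy-tag/host". The set of tags treated as "direct or no peer involved" is an assumption made for the example and should be checked against the hierarchy tags your Squid version actually writes. The peer host names and sample lines are made up.

```python
from collections import Counter

# Assumed set of hierarchy tags that mean the request was not resolved
# through a parent or sibling. Adjust it to your Squid version.
DIRECT_TAGS = {
    "NONE",
    "DIRECT",
    "SOURCE_FASTEST",
    "LOCAL_IP_DIRECT",
    "FIREWALL_IP_DIRECT",
    "NO_PARENT_DIRECT",
    "NO_DIRECT_FAIL",
}


def peer_counts(lines):
    """Count answered requests per (peer host, hierarchy tag)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        tag, _, host = fields[8].partition("/")
        if tag in DIRECT_TAGS or host in ("", "-"):
            continue
        counts[(host, tag)] += 1
    return counts


# Made-up sample lines with hypothetical peer names.
sample = [
    "854755200.100 120 10.0.0.1 TCP_HIT/200 1024 GET http://example.com/a - NONE/- text/html",
    "854755200.900 3500 10.0.0.2 TCP_MISS/200 20480 GET http://example.com/b - DIRECT/example.com text/html",
    "854755201.700 400 10.0.0.4 TCP_MISS/200 8192 GET http://example.com/d - PARENT_HIT/parent.example.net text/html",
    "854755202.300 450 10.0.0.5 TCP_MISS/200 4096 GET http://example.com/e - PARENT_HIT/parent.example.net text/html",
    "854755203.100 300 10.0.0.6 TCP_MISS/200 2048 GET http://example.com/f - SIBLING_HIT/sibling.example.net text/html",
]

counts = peer_counts(sample)
for (host, tag), n in sorted(counts.items()):
    print(host, tag, n)
```

As noted above, these numbers only show how many requests each parent or sibling answered. They cannot be turned into a per-peer hit rate, because the log does not record how many requests each peer was asked.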
February 1997 / webmaster@uninett.no