Checklist
We recommend that you
Any Web caching software that cannot cooperate with several servers is questionable, because the single-server approach does not scale sufficiently under exponential growth. One monolithic national server has severe scaling problems as well as routing policy problems. Since Internet traffic is known to increase exponentially, a single server is not a good solution. The approach has been tested in the UK [Smith96], and even with a cache distributed over 6 computers located at 2 different places (in order to provide redundancy), the load is heavy.
Criterion number 1: The server software must be able to cooperate with several other servers.
The Web Consortium Problem Statement on Propagation, Replication and Caching states: "There is an urgent need for making the Web more mature in order to scale to a number of at least 100 times the current size, and efficient techniques for replication and caching is a corner stone in achieving this goal." Any Web caching solution must be able to scale to at least 100 times today's use.
The scaling issue is illustrated by the fact that on 16 September, 145 GB of traffic went in and out of UNINETT. A rough estimate indicates that half of this was Web traffic; 72 GB corresponds to about 15 million connections.
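The arithmetic behind these figures can be checked with a short sketch. It assumes decimal gigabytes (10^9 bytes), which the text does not specify, and it treats the "half" estimate as exactly 50%.

```python
# Back-of-the-envelope check of the traffic figures quoted above.
GB = 10**9  # assumption: decimal gigabytes; the text does not say which unit is meant

total_traffic_gb = 145       # traffic in and out of UNINETT on the day in question
web_share = 0.5              # "a rough estimate indicates that half of this was Web traffic"
connections = 15_000_000     # "about 15 million connections"

web_traffic_bytes = total_traffic_gb * web_share * GB
bytes_per_connection = web_traffic_bytes / connections

# Criterion 2 asks for a solution that scales by a factor of at least 100.
scaled_connections = connections * 100

print(f"Web traffic: {web_traffic_bytes / GB:.1f} GB")
print(f"Average per connection: {bytes_per_connection / 1000:.1f} kB")
print(f"Connections per day at 100 times the load: {scaled_connections:,}")
```

Under these assumptions the average connection carries a little under 5 kB, and a hundredfold increase means 1.5 billion connections per day.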
Criterion number 2: The server solution must be scalable by a factor of at least 100.
A Web caching system should be transparent to the user; the only effect the user should notice is faster response [Bekker96]. This implies that servers must fail gracefully.
Criterion number 3: The server system must fall back gracefully in case of failures.
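One way to picture graceful fallback is a cache that tries its parents in turn and goes to the origin server when none of them responds. The following is a minimal sketch of that idea only, not of how any particular server implements it; the function names and the stand-in fetchers are hypothetical.

```python
from typing import Callable, Sequence

# A "fetcher" is any callable that returns the object for a URL or raises OSError.
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(url: str, parents: Sequence[Fetcher], direct: Fetcher) -> bytes:
    """Try each parent cache in order; if all fail, fetch from the origin server."""
    for parent in parents:
        try:
            return parent(url)
        except OSError:
            # Parent unreachable or misbehaving: skip it instead of failing the request.
            continue
    return direct(url)


# Hypothetical stand-ins for real network fetches, used only for illustration.
def broken_parent(url: str) -> bytes:
    raise ConnectionError("parent cache is down")


def working_parent(url: str) -> bytes:
    return b"from parent: " + url.encode("ascii")


def origin(url: str) -> bytes:
    return b"from origin: " + url.encode("ascii")


print(fetch_with_fallback("http://example.org/", [broken_parent, working_parent], origin))
print(fetch_with_fallback("http://example.org/", [broken_parent], origin))
```

The user still gets the object in both cases; only the response time differs, which is what transparency requires.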
The Netscape proxy server belongs to the second generation of proxy servers. It is a fast, stable and reliable server. Due to its advanced process model (using pre-forked processes) and its dynamic and sophisticated Resource Manager, it should be well suited for creating a hierarchical caching service.
A big disadvantage, however, is the way it deals with unreachable or misconfigured proxy servers higher in the caching hierarchy. There are no mechanisms for detecting failures and routing around them.
Even with the Automatic Proxy Configuration facility of the Netscape browser, URLs served through the parent of a proxy server become unreachable when that parent fails. A chain of Netscape proxy servers is only as strong as its weakest link. The misbehaviour of the chain is at least the sum of the misbehaviour of its individual components. Due to this shortcoming, the authors discourage the use of the Netscape proxy server for creating a hierarchical caching service.
The Netscape proxy does not have support for the Internet Cache Protocol (ICP).
The Squid software is being developed as a free version of the Harvest software. A commercial version (Cached-3.*) is available. Given the academic community's need to save money, Squid has been chosen as the first test case. Cached-3.* needs to be investigated further.
Squid is available for most flavors of Unix, and has also been ported to OS/2 recently.
The Squid proxy server belongs to the second generation of proxy servers. It is a fast single-process server (it implements its own "threads" in a select loop). Squid is a public domain server based on the Harvest v1.4 code. It is developed "on the net", and the 1.1 version was in a beta development state with frequent changes up until the release of Squid-1.1.0 on December 6, 1996. Some of the 1.1 beta versions were very stable, others quite unusable, but the next improved version usually came within a day or two (this is to be expected in beta development). Squid-1.1.0 and above have proved to be stable and reliable.
Squid uses the Internet Cache Protocol (ICP) version 2 to cooperate with other proxy servers. There are several slightly different versions of ICP around (one in Harvest 1.4, another in Squid, and a third in Harvest Cached-3.*). Although Squid communicates best with its own ICP version, it can cooperate with the other versions. Because of ICP, its ability to ignore unreachable proxy servers, and its ability to use other (non-ICP) servers as parents, Squid is an excellent choice for creating a mesh of cooperating proxy servers.
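To give an idea of what an ICP exchange looks like on the wire, the sketch below builds a version 2 query message following the layout published in RFC 2186: a 20-byte header followed by a 4-byte requester address and the null-terminated URL. Since several slightly different ICP variants exist, this should be read as an illustration of the published format rather than of what any given Squid or Harvest release sends; the function name and the example URL are hypothetical.

```python
import struct

ICP_OP_QUERY = 1   # opcode for a query, per RFC 2186
ICP_VERSION = 2

# Header fields: opcode, version, message length, request number,
# options, option data, sender host address (20 bytes in total).
ICP_HEADER = struct.Struct("!BBHIII4s")


def build_icp_query(url: str, request_number: int) -> bytes:
    """Build an ICP v2 query packet asking a neighbour cache about one URL."""
    requester_address = b"\x00\x00\x00\x00"  # left as zero in this sketch
    payload = requester_address + url.encode("ascii") + b"\x00"
    length = ICP_HEADER.size + len(payload)
    header = ICP_HEADER.pack(
        ICP_OP_QUERY,
        ICP_VERSION,
        length,
        request_number,
        0,                    # options: none set
        0,                    # option data: unused
        b"\x00\x00\x00\x00",  # sender host address: left as zero in this sketch
    )
    return header + payload


packet = build_icp_query("http://example.org/", request_number=7)
print(len(packet), packet[:4].hex())
```

A cache would send such a packet over UDP to each neighbour and use the replies (hit or miss) to decide where to fetch the object; a neighbour that does not answer in time is simply ignored.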
Squid is in compliance with all the criteria stated above for a Web caching system.
However, cooperating Squid servers need to have approximately the same idea of how long an object should be cached. If cache A only stores an object for 2 days, but cache B stores it for 2 months, cache A will probably get a stale copy from cache B the next time it asks for the object. This problem will probably be solved by a change in the ICP protocol soon.
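The mismatch can be made concrete with a small sketch. It uses the two lifetimes from the example above (2 days and roughly 2 months, taken here as 60 days) and a hypothetical request arriving 3 days after cache B fetched the object; the freshness rule is deliberately simplified to a single maximum age.

```python
def is_fresh(age_days: float, max_age_days: float) -> bool:
    """Simplified freshness rule: an object is fresh until it exceeds the maximum age."""
    return age_days <= max_age_days


cache_a_max_age = 2    # cache A keeps objects for 2 days
cache_b_max_age = 60   # cache B keeps objects for about 2 months (assumed 60 days)

# Hypothetical scenario: A asks B for an object that B fetched 3 days ago.
age_at_request = 3

b_serves_copy = is_fresh(age_at_request, cache_b_max_age)
a_considers_fresh = is_fresh(age_at_request, cache_a_max_age)

print(f"B serves its copy: {b_serves_copy}")
print(f"A would consider that copy fresh: {a_considers_fresh}")
```

Cache B answers with a copy it regards as valid, while by cache A's own policy that copy is already too old. Aligning the lifetimes on both servers avoids the disagreement.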
Squid requires a lot of memory. It keeps an in-memory index of everything in its disk cache, so the more disk space you have, the more memory you will need. It is also possible to reserve some memory for use as a very fast cache for popular objects.
UNINETT have made an easy-to-install package of Squid called SamSquid. This package is available for the SAMSON machines (HP-UX 9.05 with our own maintenance system for software) and Linux (for 2.0 kernel ELF systems). More information about this package will be available shortly.
See SURFnet's service pages for more information on the SURFnet caching mesh.
The Web Consortium is implementing Jigsaw [9], a "proof of concept" Web cache server. It is implemented in Java and supports both HTTP/1.1 and ICP.
Cached-3.* has support for ICP (in a different flavor from Squid) and is a good alternative for building a Web cache mesh for those who prefer commercial software with support.
WCol [10] is a prefetching Web cache server with limited support for ICP.
There are a number of Web cache servers without support for ICP. These are not well suited for building Web cache meshes.
A Web cache server which expects heavy traffic should be configured with this in mind. The server software usually has features that are convenient, but not strictly necessary. Many of them will slow down the server, and should be avoided.
Most Web cache server software can log the domain names of connecting clients, and sometimes does so by default. This is not recommended on a busy server. Logging the domain names will slow down the service, and it is very easy to do a DNS lookup on the logged addresses when the logs are processed. Not logging the connecting clients' names will force you to configure access control on the cache server by IP numbers, but this does not make configuration more difficult.
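Resolving the names afterwards, when the logs are processed, might look like the sketch below. It assumes a hypothetical log format in which the client IP address is the first whitespace-separated field of each line; real log formats differ between servers. Each address is looked up only once, so a busy client does not cause repeated DNS queries.

```python
import socket
from typing import Callable, Dict, Iterable, Iterator


def lookup_name(ip: str) -> str:
    """Reverse-resolve an IP address, falling back to the address itself on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def add_hostnames(lines: Iterable[str],
                  resolve: Callable[[str], str] = lookup_name) -> Iterator[str]:
    """Replace the leading IP address of each log line with its hostname."""
    cache: Dict[str, str] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        ip = fields[0]
        if ip not in cache:
            cache[ip] = resolve(ip)  # one lookup per distinct client
        yield " ".join([cache[ip]] + fields[1:])


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = ["127.0.0.1 GET http://example.org/", "127.0.0.1 GET http://example.org/a"]
    for entry in add_hostnames(sample):
        print(entry)
```

Because this runs offline, a slow or failing DNS server delays only the log report, not the cache service itself.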
Please note that DNS is not eliminated as a single point of failure if you do not log hostnames. DNS is still needed to resolve hostnames in the URLs. Because of this, it is very important to use a stable and fast DNS service. If you have trouble with your DNS service, you should consider starting a caching-only DNS server which serves only the Web cache server. You can run the DNS server on the same machine as the Web cache server, but make sure you have enough memory to do this without much swapping.
A Web cache server gives you the possibility to log more than just the accesses. Logs of user agents, MIME headers, and special debugging information are nice to have, but they are not crucial. Unless you really need this information you should turn off these logs. Usually you will only need this information for special projects, and then you can turn them on for the duration of the project.
It is also a good idea to buffer the necessary logging. This will reduce the number of disk writes. Some server software is also able to do identd lookups, but you should not use this unless you really need it.
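The effect of buffering can be sketched as follows. This is a generic illustration, not the mechanism of any particular cache server; the class name and the threshold are hypothetical. Entries are collected in memory and written out in one operation once enough have accumulated.

```python
import io
from typing import List, TextIO


class BufferedLog:
    """Collect log entries in memory and write them out in batches."""

    def __init__(self, sink: TextIO, max_entries: int = 100) -> None:
        self.sink = sink
        self.max_entries = max_entries
        self.pending: List[str] = []
        self.writes = 0  # number of write operations issued so far

    def log(self, entry: str) -> None:
        self.pending.append(entry)
        if len(self.pending) >= self.max_entries:
            self.flush()

    def flush(self) -> None:
        """Write all pending entries with a single write call."""
        if self.pending:
            self.sink.write("".join(line + "\n" for line in self.pending))
            self.writes += 1
            self.pending.clear()


# Illustration with an in-memory sink; a real server would write to a file.
sink = io.StringIO()
access_log = BufferedLog(sink, max_entries=3)
for number in range(7):
    access_log.log(f"request {number}")
access_log.flush()  # write whatever is left at shutdown

print(f"{access_log.writes} writes for 7 entries")
```

The trade-off is that entries still in the buffer are lost if the server crashes before they are flushed.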
| The rest of this document is largely based upon our experiences with Squid, and Squid-specific issues are set apart this way. |
Henny Bekker, Lars Slettjord, Ingrid Melve
| cache-desire@uninett.no | 2002-10-29 |