Why web usage statistics are (worse than) meaningless
Note: This document was originally written in 1995 to
explain the stats situation at Cranfield University where I
worked at the Cranfield
Computer Centre. It was initially intended for local users
there, but quickly gained popularity (notoriety?) elsewhere. The
content has had few changes and updates since then.
- There is no discussion of cookie tracking (yet) in this document.
- There is no discussion of the very similar problem of guessing web browser popularity from webserver logs.
- I had over-estimated the extent to which caching (and hierarchical caching) would be used.
- Cranfield University has a proud history of leadership in the web. It was one of the very first UK sites to have a webserver at all (in 1993), was at the forefront of the UK caching effort, and of enabling individual users to publish on the web (early 1994). I am grateful that they permitted what was essentially my personal rant to be hosted so prominently for so long, and have provided a long-term redirect to this page.
- On re-reading my original, I see that this document is a bit
hyperbolic. So be it. It is after all an acknowledged rant.
Web usage statistics, such as those produced by programs like analog, cannot
be used to make strong inferences about the number of people who have
read a website or webpage. Although those who compile these
statistics usually try to make this clear, people still insist on
misusing them to make overly strong inferences. Attaching meaning to
meaningless numbers is worse than not having the numbers at all. When
you lack information, it is best to know that you lack the
information. Web statistics may give the user a false sense of
knowledge which can be worse than being knowingly ignorant.
A useful analogy is with putting up advertising posters. You will
never really know how many people have noticed them or read them.
It is not enough to say that the statistics should be taken with a
grain of salt; they should be taken with a salt lick. If you want
to understand why no inference about the number of people
reading your pages can be made from web statistics, read on.
Otherwise, you may simply wish to trust that statement, or skip
to the section on Quick Questions and Answers.
Web stats are useful for web administrators to get a sense of the
actual load on the server. This is useful for diagnostics and
planning, and for detecting unusual behaviour that may require
action. The goal of the administrator is to keep the server
running smoothly under expected loads, while improving the speed and
reliability of obtaining documents from the site. The best way to
achieve this is to have browsers retrieve documents from somewhere closer
to where they will be used (even from memory) rather than from the disk
on the server. It is only when the file is retrieved from the server itself
that the server can keep track of the access.
Let's take a fictitious example of what might happen when someone in
Nome, Alaska, say at Nome Community College (this would be a
polytechnic in the UK), wants to read Cranfield's
Prospectus. The user would somehow select the URL with his/her
browser, which would then try the following.
- Browser Cache
- The particular instance of the browser will look in its
own memory (or in what it may have saved on its local disk).
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Local site cache
- If the page was not in the browser cache, the browser may look
in its site cache. That is, if someone else at the user's site
recently retrieved the page, it may be available to the user there.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Local regional cache
- The site cache may be configured to look in a local regional
cache, say at the University of Alaska, Nome campus, which
might provide a caching service for smaller sites around
Nome.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Large regional cache
- The local regional cache may be configured to look
in a large regional cache, say in Fairbanks, Alaska, which
might provide caching for sites in Alaska that use it.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- The Cranfield accelerator
- An accelerator is an out-going cache for a site.
When a document is requested from the site, the accelerator
checks whether it has the document stored (it stores documents
in a form much faster to find and retrieve than the server's
files in the directory structure) and, if so, serves it up.
While it would be possible to have the accelerator keep a
record of which files it served up and to whom,
this would defeat the
purpose, because it would require a disk operation to make
that record.
In addition to over-estimating the degree of caching that
would be in place, this last step about accelerators is
also no longer relevant. The accelerator was needed when
Cranfield was running the original CERN server over an AFS
filesystem. Given the nature of modern web server set-ups,
accelerators are no longer needed.
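To make the chain concrete, here is a toy sketch in Python of the lookup order just described. It is purely illustrative: the cache names, the URL, and the behaviour are simplified assumptions, not a description of any real cache software. The point is only that a request answered anywhere along the chain never reaches our server, so only the final case can ever show up in our logs.

    # cache_chain.py -- a toy sketch (not any real cache software) of the
    # lookup chain described above.  Names and the URL are illustrative.

    def fetch(url, caches, origin_log):
        """Walk the cache chain; only a miss at every level reaches the origin server."""
        for name, cache in caches:
            if url in cache:
                return cache[url], name    # served from a cache: our site never sees it
            # otherwise fall through to the next cache in the chain
        origin_log.append(url)             # full miss: the only access that gets logged
        page = "<html>...the Prospectus...</html>"
        for _, cache in caches:
            cache[url] = page              # each cache keeps a copy on the way back
        return page, "origin server"

    if __name__ == "__main__":
        chain = [("browser cache", {}),
                 ("site cache", {}),
                 ("local regional cache", {}),
                 ("large regional cache", {})]
        log = []
        url = "http://www.cranfield.ac.uk/prospectus.html"
        for _ in range(20):                # the page is read twenty times ...
            fetch(url, chain, log)
        print(len(log), "access(es) appear in the origin server's log")   # ... one logged hit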
Now that you have an idea of what caching is, you are in a better
position to understand why it is impossible to make any inference
about numbers of people reading your pages from web statistics. But
there is more to come, described in the section on
multiple hits per user. What is necessary to understand about
caching is that some users may go through a long and efficient cache
chain (as described in the example) and other users may not. Much of
this depends on how their site is set up or how they set things up
themselves.
Imagine (in the extreme case) a user who is doing no caching
whatsoever. Now if that user comes across the Cranfield Home Page 20 times
while browsing around the Cranfield pages, that will count as 20 hits.
Remember: the statistics are about accesses, not about people.
When comparing hits for different directories, it is
important to note how documents are structured. If you have a
directory with a single document on one hand, and on the other a
directory with the same amount of real content broken into twenty
smaller documents, you will find far more hits in that second section.
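A made-up illustration of both points may help. The toy "log" below is invented, but the counting is exactly what statistics programs do: twenty-one hits on a page can come from two hosts (and an unknowable number of readers behind them), and splitting the same content across twenty pages would multiply the hit count again without a single extra reader.

    # hit_counting.py -- a toy sketch with an invented "log"; the point is
    # only that hits count requests, not people.
    from collections import Counter

    log = (
        # an uncached browser wandering past the home page twenty times
        [("pc1.nome-cc.example.edu", "/index.html")] * 20
        # one request from a site cache that may front hundreds of readers
        + [("cache.big-site.example.ac.jp", "/index.html")]
    )

    hits = Counter(page for _host, page in log)
    hosts = {page: set() for _host, page in log}
    for host, page in log:
        hosts[page].add(host)

    print(hits["/index.html"])        # 21 hits ...
    print(len(hosts["/index.html"]))  # ... from 2 distinct hosts, and an unknown number of readers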
Most of what is listed here is either mentioned above or can
be inferred from the explanations above.
If there is a question
that you would like to see added to this list, or if you have
other comments on this document, please use the form
at the end to submit queries. [Sorry, that form is now defunct.]
A quick list of the questions is provided here.
Not really. The number of individuals and sites using caches is
rising all the time, as is the amount of disk space and memory used
for caching. When the Cranfield Accelerator goes live (early
November, 1995), there should be an actual drop in our server stats
even while accesses increase, owing to the increased speed and reliability
of the server. Caching has been on the rise for more than a year now.
Even so, loads on systems (including ours) have gone up dramatically.
Unfortunately, not even this is possible. Suppose, for example, that Japan has
a very high level of regional and national caching while Singapore
does not (the example is fictitious). Under these circumstances, web
statistics might show more accesses from Singapore than from Japan even
if more people in Japan read our pages.
A clear example of this is the number of accesses from "numerical
domains" that have recently started to top various lists. These are
accesses from sites that don't have proper reverse DNS listings. Such
sites are probably misconfigured single-user machines, where either
the particular machine in use is misconfigured or the
organisation it belongs to has not straightened out its machine names
properly. It is reasonable to assume that those running such
misconfigured systems are far more likely not to have configured their
proxies correctly, so far less caching will be seen from those sites.
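For what it is worth, the "numerical domain" entries are simply addresses for which the reverse DNS lookup fails. A rough sketch of the lookup a log analyser performs is below; the addresses are standard loopback/example addresses, and whether a name comes back depends entirely on the remote site's DNS configuration.

    # reverse_dns.py -- the lookup a log analyser does to turn an address
    # into a domain; the addresses here are loopback/example addresses.
    import socket

    def domain_for(address):
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(address)
            return hostname
        except (socket.herror, socket.gaierror):
            return address        # no PTR record: counted under a "numerical domain"

    for address in ("127.0.0.1", "192.0.2.1"):
        print(address, "->", domain_for(address))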
Not really. The more popular pages will cache more, meaning that
real differences between page hits will be dramatically distorted.
It is probably safe to say that if one page shows more hits than
another, there really were more accesses to that page, but there
are circumstances under which even that weak inference won't be
true.
Not really. This is because any such multiplier would have to
differ from page to page and
differ from access region to access region.
Yes, you can. There are several ways to do so, and there are some
circumstances for which it is even legitimate, but to do
so merely to get better stats is seriously misguided. This is
for two reasons:
- You will make your page (much) harder for people to get to and
add to network traffic unnecessarily.
- If someone fails to reach your page at our site, they may give
up on the site altogether. Thus hard-to-get-at pages (unless
there is a clear reason for them being such) will be unfair
to other providers at the site.
Quite embarrassingly, many of the pages on this site don't
normally cache properly.
This is because I had some technical difficulties with
my configuration of server-side includes and the so-called "XBitHack".
I've fixed that, but now have to fix dozens of documents to
use things properly.
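For the curious, the mechanics at issue are ordinary HTTP response headers. The sketch below is a generic CGI-style script, not the Cranfield configuration; it only shows which headers matter. A page sent with a Last-Modified (and perhaps Expires) date can be kept and revalidated by every cache in the chain; the server-side-include trouble mentioned above was that such pages went out without a Last-Modified header, so caches had nothing to go on. The cache-hostile branch is shown only to make plain what "subverting the caching mechanism" means, not as a recommendation.

    #!/usr/bin/env python3
    # caching_headers.py -- a generic CGI-style sketch of the headers involved;
    # not the Cranfield setup, and the dates and body are placeholders.
    import time
    from email.utils import formatdate

    CACHE_FRIENDLY = True          # flip to False for the cache-hostile variant

    print("Content-Type: text/html")
    if CACHE_FRIENDLY:
        # A Last-Modified (and optionally Expires) date lets caches keep and
        # revalidate the page instead of asking our server every time.
        print("Last-Modified:", formatdate(time.time() - 86400, usegmt=True))
        print("Expires:", formatdate(time.time() + 86400, usegmt=True))
    else:
        # Telling every cache to discard the page, so that the origin server
        # sees (and logs) every request -- exactly what the text advises against.
        print("Pragma: no-cache")
        print("Expires: 0")
    print()                        # blank line ends the headers
    print("<html><body>A page from the origin server.</body></html>")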
You may have noticed some pages with web counters. There are
basically two ways to put them in your page: the wrong way and
the very wrong way. The wrong way merely doesn't work and
will not be more useful than normal statistics. The very wrong
way is counterproductive because it
subverts the caching mechanism, which
is not a good idea just to get statistics.
Please note that even if you think that statistics can be made useful,
counters on individual pages are displayed to the reader, who isn't in
a position to make the various adjustments needed to get some sense of
true readership.
Yes and no, but mostly no. There are two reasons for "mostly no".
One is simply that there are too many small caches out there which may
have cached our stuff (including the browser software's internal cache).
Clearly not all of these are going to send us records on a regular
basis, which we would then have to incorporate into all of the other
records to process statistics.
The other reason for "mostly no" is that even the large caches are willing
to send only a byte count. That is, one major UK cache is considering
sending out on a monthly basis how many bytes of data they served up
in our name.
We must remember that the caches are doing us a favour by making our
pages much easier to reach. We cannot ask them to take on a task that
would degrade the service or place an additional administrative, disk,
memory and CPU load on them. Without caching, the web would have
collapsed long ago.
Yes and no. If by minimum you mean "at least one" then
yes. If you have 400 hits from Japan then you can conclude that
during that period you had at least one reader from Japan. You
cannot infer that there were at least 400 readers,
because the same reader may hit a page many times in
a short period of time.
So, the only certain inference that can be made is that there was at
least one reader from a particular domain, or for a particular page.
One way is to set up Mail Reply Forms
in your pages like the one at the end of this
document. Of course many more people will read your pages than
will complete the form, but the form can be used to judge serious
interest. Most people will, however, not fill out a form unless they
think they will get some sort of useful response, even if they read
the document seriously. (Did you fill out the form for this
document?)
Setting up these forms is not as difficult to do as it first appears,
and courses are offered on it by the computing centre staff.
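As an illustration only, here is a minimal sketch of what such a form handler can look like as a CGI script. The field names, the recipient address, and the sendmail path are assumptions made up for the example; this is not the Cranfield mailform system.

    #!/usr/bin/env python3
    # mailform.py -- a minimal CGI sketch of a mail reply form handler.
    # Field names, recipient address, and sendmail path are illustrative.
    import os
    import subprocess
    import sys
    from urllib.parse import parse_qs

    # A POSTed form arrives on standard input as URL-encoded name=value pairs.
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    fields = parse_qs(sys.stdin.read(length))
    comment = fields.get("comment", ["(no comment)"])[0]
    sender = fields.get("email", ["anonymous"])[0]

    # Hand the message to the local mailer, if one is installed.
    try:
        subprocess.run(
            ["/usr/sbin/sendmail", "webmaster@example.ac.uk"],   # address is illustrative
            input=f"Subject: web form reply from {sender}\n\n{comment}\n",
            text=True,
            check=False,
        )
    except FileNotFoundError:
        pass    # no local mailer here; a real handler would report an error

    print("Content-Type: text/html")
    print()
    print("<html><body>Thank you; your comments have been mailed.</body></html>")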
They are useful for system administrators to judge the actual load
on the server. The section on what stats are good for
contains more information.
Popular demand. It is not the computer centre's job to deny users
some service just because we know the request to be misguided.
Attempts to eliminate these statistics from the system met with
complaints. However, no great effort will be put into maintaining
statistics or access to them either. It is hoped that this document
will make it easier for the computer centre to withdraw statistics
altogether, except for what is required for system maintenance.
No. But you may have noticed that many of the individual problems and
difficulties could be partially mitigated by collecting
more information (from some caches, for example, or from the times of
requests) and using that to make very rough estimates of various
correction factors. It would take serious statistical analysis of the
sort that professional market research firms may be able to undertake,
and even then the estimates (and relative hits on pages or from regions)
would remain iffy. Performing complicated analyses on dubious data
only compounds the problem, and the marginal utility would be negative
(i.e., the large amount of extra effort would not be justified by the
tiny gain in meaningfulness of the statistics).
When this page was hosted by Cranfield there was a form for mailing
comments. I have disabled that since moving this document to its
current location, because (a) I don't have as good a mailform system
as was available at Cranfield, (b) there are spam/privacy concerns
about collecting unconfirmed email addresses, which I hadn't
considered in 1995 for what was initially intended as an internal
document, (c) this was partially an attempt to promote the use of
Mailforms at Cranfield, and (d) history has shown that I am often not
very good at responding to the queries that I get.
Version: $Revision: 2.7 $
Last Modified: $Date: 2004/07/13 18:30:32 $ GMT
First established at original site: Summer 1995
First established at goldmark.org: April 25, 2001
Author: Jeffrey Goldberg