Why web usage statistics are (worse than) meaningless
Note: This document was originally written in 1995 to
explain the stats situation at Cranfield University where I
worked at the Cranfield
Computer Centre. It was initially intended for local users
there, but quickly gained popularity (notoriety?) elsewhere. The
content has had few changes and updates since then.
- There is no discussion of cookie tracking (yet) in this document.
- There is no discussion of the very similar problem of guessing web browser popularity from webserver logs.
- I had over-estimated the extent to which caching (and hierarchical caching) would be used.
- Cranfield University has a proud history of leadership in the web. It was one of the very first UK sites to have a webserver at all (in 1993), was at the forefront of the UK caching effort, and of enabling individual users to publish on the web (early 1994). I am grateful that they permitted what was essentially my personal rant to be hosted so prominently for so long, and have provided a long-term redirect to this page.
- On re-reading my original, I see that this document is a bit
hyperbolic. So be it. It is after all an acknowledged rant.
Web usage statistics, such as those produced by programs like analog, cannot
be used to make strong inferences about the number of people who have
read a website or webpage. Although those who compile these
statistics usually try to make this clear, people still insist on
misusing them to make overly strong inferences. Attaching meaning to
meaningless numbers is worse than not having the numbers at all. When
you lack information, it is best to know that you lack the
information. Web statistics may give the user a false sense of
knowledge which can be worse than being knowingly ignorant.
A useful analogy is with putting up advertising posters. You will
never really know how many people have noticed them or read them.
It is not enough to say that the statistics should be taken with a
grain of salt; they should be taken with a salt lick. If you want
to understand why no inference about the number of people
reading your pages can be made from web statistics, read on.
Otherwise, you may simply wish to trust that statement, or skip
to the section on Quick Questions and Answers.
Web stats are useful for web administrators to get a sense of the
actual load on the server. This is useful for diagnostics and
planning, and for detecting unusual behaviour that may require
action. The goal of the administrator is to keep the server
running smoothly under expected loads, while improving the speed and
reliability of obtaining documents from the site. The best way to
achieve this is to have browsers retrieve documents from somewhere closer
to where they will be used (even from memory) rather than from the disk
on the server. It is only when the file is retrieved from the server itself
that the server can keep track of the access.
Let's take a fictitious example of what might happen when someone in
Nome, Alaska, say at Nome Community College (this would be a
polytechnic in the UK), wants to read Cranfield's
Prospectus. The user would somehow select the URL with his/her
browser, which would then try the following.
- Browser Cache
- The particular instance of the browser will look in its
own memory (or in what it may have saved on its local disk).
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Local site cache
- If the page was not in the browser cache, the browser may look
in its site cache. That is, if someone else at the user's site
recently retrieved the page, it may be available to the user there.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Local regional cache
- The site cache may be configured to look in a local regional
cache, say at the University of Alaska, Nome campus, which
might provide a caching service for smaller sites around
Nome.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- Large regional cache
- The local regional cache may be configured to look
in a large regional cache, say in Fairbanks, Alaska, which
might provide caching for sites in Alaska that use it.
If it finds the page corresponding to the sought-for URL there,
it will not go any further, and our site will never know
that the request was made.
- The Cranfield accelerator
- An accelerator is an out-going cache for a site.
When a document is requested from the site, the accelerator
checks whether it has the document stored (it stores documents
in a form much faster to find and retrieve than the server's
files in the directory structure) and, if so, serves it up.
While it would be possible to have the accelerator keep a
record of which files it served up and to whom,
this would defeat the
purpose, because it would require a disk operation to make
that record.
In addition to over-estimating the degree of caching that
would be in place, this last step about accelerators is
also no longer relevant. The accelerator was needed when
Cranfield was running the original CERN server over an AFS
filesystem. Given the nature of modern web server set-ups,
accelerators are no longer needed.
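To make the chain concrete, here is a toy sketch in Python of the lookup order just described. It is purely illustrative: the cache names, the URL, and the behaviour are simplified assumptions, not a description of any real cache software. The point is only that a request answered anywhere along the chain never reaches our server, so only the final case can ever show up in our logs.

    # cache_chain.py -- a toy sketch (not any real cache software) of the
    # lookup chain described above.  Names and the URL are illustrative.

    def fetch(url, caches, origin_log):
        """Walk the cache chain; only a miss at every level reaches the origin server."""
        for name, cache in caches:
            if url in cache:
                return cache[url], name    # served from a cache: our site never sees it
            # otherwise fall through to the next cache in the chain
        origin_log.append(url)             # full miss: the only access that gets logged
        page = "<html>...the Prospectus...</html>"
        for _, cache in caches:
            cache[url] = page              # each cache keeps a copy on the way back
        return page, "origin server"

    if __name__ == "__main__":
        chain = [("browser cache", {}),
                 ("site cache", {}),
                 ("local regional cache", {}),
                 ("large regional cache", {})]
        log = []
        url = "http://www.cranfield.ac.uk/prospectus.html"
        for _ in range(20):                # the page is read twenty times ...
            fetch(url, chain, log)
        print(len(log), "access(es) appear in the origin server's log")   # ... one logged hit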
Now that you have an idea of what caching is, you are in a better
position to understand why it is impossible to make any inference
about numbers of people reading your pages from web statistics. But
there is more to come, described in the section on
multiple hits per user. What is necessary to understand about
caching is that some users may go through a long and efficient cache
chain (as described in the example) and other users may not. Much of
this depends on how their site is set up or how they set things up
themselves.
Imagine (in the extreme case) a user who is doing no caching
whatsoever. Now if that user comes across the Cranfield Home Page 20 times
while browsing around the Cranfield pages, that will count as 20 hits.
Remember: the statistics are about accesses, not about people.
When comparing hits for different directories, it is
important to note how documents are structured. If you have a
directory with a single document on one hand, and on the other a
directory with the same amount of real content broken into twenty
smaller documents, you will find far more hits in that second section.
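A made-up illustration of both points may help. The toy "log" below is invented, but the counting is exactly what statistics programs do: twenty-one hits on a page can come from two hosts (and an unknowable number of readers behind them), and splitting the same content across twenty pages would multiply the hit count again without a single extra reader.

    # hit_counting.py -- a toy sketch with an invented "log"; the point is
    # only that hits count requests, not people.
    from collections import Counter

    log = (
        # an uncached browser wandering past the home page twenty times
        [("pc1.nome-cc.example.edu", "/index.html")] * 20
        # one request from a site cache that may front hundreds of readers
        + [("cache.big-site.example.ac.jp", "/index.html")]
    )

    hits = Counter(page for _host, page in log)
    hosts = {page: set() for _host, page in log}
    for host, page in log:
        hosts[page].add(host)

    print(hits["/index.html"])        # 21 hits ...
    print(len(hosts["/index.html"]))  # ... from 2 distinct hosts, and an unknown number of readers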
Most of what is listed here is either mentioned above or can
be inferred from the explanations above.
If there is a question
that you would like to see added to this list, or if you have
other comments on this document, please use the form
at the end to submit queries. [Sorry, that form is now defunct.]
A quick list of the questions is provided here.
Not really. The number of individuals and sites using caches is
rising all the time, as is the amount of disk space and memory used
for caching. When the Cranfield Accelerator goes live (early
November, 1995), there should be an actual drop in our server stats
even while accesses increase, owing to the increased speed and reliability
of the server. Caching has been on the rise for more than a year now.
Even so, loads on systems (including ours) have gone up dramatically.
Unfortunately, not even this is possible. Suppose, for example, that Japan has
a very high level of regional and national caching while Singapore
does not (the example is fictitious). Under these circumstances, web
statistics might show more accesses from Singapore than from Japan even
if more people in Japan read our pages.
A clear example of this is the number of accesses from "numerical
domains" that have recently started to top various lists. These are
accesses from sites that don't have proper reverse DNS listings. Such
sites are probably misconfigured single-user machines, where either
the particular machine in use is misconfigured or the
organisation it belongs to has not straightened out its machine names
properly. It is reasonable to assume that those running such
misconfigured systems are far more likely not to have configured their
proxies correctly, so far less caching will be seen from those sites.
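For what it is worth, the "numerical domain" entries are simply addresses for which the reverse DNS lookup fails. A rough sketch of the lookup a log analyser performs is below; the addresses are standard loopback/example addresses, and whether a name comes back depends entirely on the remote site's DNS configuration.

    # reverse_dns.py -- the lookup a log analyser does to turn an address
    # into a domain; the addresses here are loopback/example addresses.
    import socket

    def domain_for(address):
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(address)
            return hostname
        except (socket.herror, socket.gaierror):
            return address        # no PTR record: counted under a "numerical domain"

    for address in ("127.0.0.1", "192.0.2.1"):
        print(address, "->", domain_for(address))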
Not really. The more popular pages will cache more, meaning that
real differences between page hits will be dramatically distorted.
It is probably safe to say that if one page shows more hits than
another, there really were more accesses to that page, but there
are circumstances under which even that weak inference won't be
true.
Not really. This is because any such multiplier would have to
differ from page to page and
differ from access region to access region.
Yes, you can. There are several ways to do so, and there are some
circumstances for which it is even legitimate, but to do
so merely to get better stats is seriously misguided. This is
for two reasons:
- You will make your page (much) harder for people to get to and
add to network traffic unnecessarily.
- If someone fails to reach your page at our site, they may give
up on the site altogether. Thus hard-to-get-at pages (unless
there is a clear reason for them being such) will be unfair
to other providers at the site.
Quite embarrassingly, many of the pages on this site don't
normally cache properly.
This is because I had some technical difficulties with
my configuration of server-side includes and the so-called "XBitHack".
I've fixed that, but now have to fix dozens of documents to
use things properly.
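For the curious, the mechanics at issue are ordinary HTTP response headers. The sketch below is a generic CGI-style script, not the Cranfield configuration; it only shows which headers matter. A page sent with a Last-Modified (and perhaps Expires) date can be kept and revalidated by every cache in the chain; the server-side-include trouble mentioned above was that such pages went out without a Last-Modified header, so caches had nothing to go on. The cache-hostile branch is shown only to make plain what "subverting the caching mechanism" means, not as a recommendation.

    #!/usr/bin/env python3
    # caching_headers.py -- a generic CGI-style sketch of the headers involved;
    # not the Cranfield setup, and the dates and body are placeholders.
    import time
    from email.utils import formatdate

    CACHE_FRIENDLY = True          # flip to False for the cache-hostile variant

    print("Content-Type: text/html")
    if CACHE_FRIENDLY:
        # A Last-Modified (and optionally Expires) date lets caches keep and
        # revalidate the page instead of asking our server every time.
        print("Last-Modified:", formatdate(time.time() - 86400, usegmt=True))
        print("Expires:", formatdate(time.time() + 86400, usegmt=True))
    else:
        # Telling every cache to discard the page, so that the origin server
        # sees (and logs) every request -- exactly what the text advises against.
        print("Pragma: no-cache")
        print("Expires: 0")
    print()                        # blank line ends the headers
    print("<html><body>A page from the origin server.</body></html>")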
You may have noticed some pages with web counters. There are
basically two ways to put them in your page: the wrong way and
the very wrong way. The wrong way merely doesn't work and
will not be more useful than normal statistics. The very wrong
way is counterproductive because it
subverts the caching mechanism, which
is not a good idea just to get statistics.
Please note that even if you think that statistics can be made useful,
counters on individual pages are displayed to the reader, who isn't in
a position to make the various adjustments needed to get some sense of
true readership.
Yes and no, but mostly no. There are two reasons for "mostly no".
One is simply that there are too many small caches out there which may
have cached our stuff (including the browser software's internal cache).
Clearly not all of these are going to send us records on a regular
basis, which we would then have to incorporate into all of the other
records to process statistics.
The other reason for "mostly no" is that even the large caches are willing
to send only a byte count. That is, one major UK cache is considering
sending out on a monthly basis how many bytes of data they served up
in our name.
We must remember that the caches are doing us a favour by making our
pages much easier to reach. We cannot ask them to take on a task that
would degrade the service or place an additional administrative, disk,
memory and CPU load on them. Without caching, the web would have
collapsed long ago.
Yes and no. If by minimum you mean "at least one" then
yes. If you have 400 hits from Japan then you can conclude that
during that period you had at least one reader from Japan. You
cannot infer that there were at least 400 readers,
because the same reader may hit a page many times in
a short period of time.
So, the only certain inference that can be made is that there was at
least one reader from a particular domain, or for a particular page.
One way is to set up Mail Reply Forms
in your pages like the one at the end of this
document. Of course many more people will read your pages than
will complete the form, but the form can be used to judge serious
interest. Most people will, however, not fill out a form unless they
think they will get some sort of useful response, even if they read
the document seriously. (Did you fill out the form for this
document?)
Setting up these forms is not as difficult to do as it first appears,
and courses are offered on it by the computing centre staff.
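As an illustration only, here is a minimal sketch of what such a form handler can look like as a CGI script. The field names, the recipient address, and the sendmail path are assumptions made up for the example; this is not the Cranfield mailform system.

    #!/usr/bin/env python3
    # mailform.py -- a minimal CGI sketch of a mail reply form handler.
    # Field names, recipient address, and sendmail path are illustrative.
    import os
    import subprocess
    import sys
    from urllib.parse import parse_qs

    # A POSTed form arrives on standard input as URL-encoded name=value pairs.
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    fields = parse_qs(sys.stdin.read(length))
    comment = fields.get("comment", ["(no comment)"])[0]
    sender = fields.get("email", ["anonymous"])[0]

    # Hand the message to the local mailer, if one is installed.
    try:
        subprocess.run(
            ["/usr/sbin/sendmail", "webmaster@example.ac.uk"],   # address is illustrative
            input=f"Subject: web form reply from {sender}\n\n{comment}\n",
            text=True,
            check=False,
        )
    except FileNotFoundError:
        pass    # no local mailer here; a real handler would report an error

    print("Content-Type: text/html")
    print()
    print("<html><body>Thank you; your comments have been mailed.</body></html>")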
They are useful for system administrators to judge the actual load
on the server. The section on what stats are good for
contains more information.
Popular demand. It is not the computer centre's job to deny users
some service just because we know the request to be misguided.
Attempts to eliminate these statistics from the system met with
complaints. However, no great effort will be put into maintaining
statistics or access to them either. It is hoped that this document
will make it easier for the computer centre to withdraw statistics
altogether, except for what is required for system maintenance.
No. But you may have noticed that many of the individual problems and
difficulties could be partially mitigated by collecting
more information (from some caches, for example, or from the times of
requests) and using that to make very rough estimates of various
correction factors. It would take serious statistical analysis of the
sort that professional market research firms may be able to undertake,
and even then the estimates (and relative hits on pages or from regions)
would remain iffy. Performing complicated analyses on dubious data
only compounds the problem, and the marginal utility would be negative
(i.e., the large amount of extra effort would not be justified by the
tiny gain in meaningfulness of the statistics).
When this page was hosted by Cranfield there was a form for mailing
comments. I have disabled that since moving this document to its
current location, because (a) I don't have as good a mailform system
as was available at Cranfield, (b) there are spam/privacy concerns
about collecting unconfirmed email addresses, which I hadn't
considered in 1995 for what was initially intended as an internal
document, (c) this was partially an attempt to promote the use of
Mailforms at Cranfield, and (d) history has shown that I am often not
very good at responding to the queries that I get.
Version: $Revision: 2.7 $
Last Modified: $Date: 2004/07/13 18:30:32 $ GMT
First established at original site: Summer 1995
First established at goldmark.org: April 25, 2001
Author: Jeffrey Goldberg