Imagine a situation where an *arbitrary* sequence of observations $x_1, x_2, \ldots$ is revealed one symbol at a time to an observer who, at each time point $t$, makes a decision $b_t$ regarding the next outcome $x_t$ based on the past $x_1, \ldots, x_{t-1}$. There is a loss function $l(b_t, x_t)$ associated with this decision, and the goal is to minimize the average loss $\frac{1}{n} \sum_{t=1}^{n} l(b_t, x_t)$.

The lossless coding and gambling problems can be formulated in this framework as a ``log-loss'' problem, where in the binary case $b_t \in [0,1]$ is the probability assigned to the event $x_t = 1$, and the loss is $l(b_t, x_t) = -\log b_t$ if $x_t = 1$ and $l(b_t, x_t) = -\log(1 - b_t)$ if $x_t = 0$. But perhaps the simplest example of such a problem is where the observed sequence is binary and the goal is to predict the next outcome, i.e., $b_t \in \{0,1\}$ and $l(b_t, x_t)$ is the Hamming distance $d_H(b_t, x_t)$, which is $0$ if $b_t = x_t$ and $1$ otherwise. This prediction problem is the subject of our work. Interestingly, despite its simplicity, it raised unexpected and unique difficulties. Furthermore, as it turned out, our solution of the universal prediction problem has a broader scope, because it required us to develop a framework for dealing with general sequential decision problems.
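To fix ideas, here is a minimal Python sketch of the sequential prediction game under Hamming loss (the helper names are ours, chosen for illustration): the predictor sees only $x_1, \ldots, x_{t-1}$ before committing to its guess $b_t$.

```python
def hamming_loss_of(predictor, sequence):
    """Play the sequential prediction game: at each time t, the
    predictor sees only the past x_1..x_{t-1}, commits to a guess
    b_t in {0, 1}, and only then is x_t revealed.  Returns the
    fraction of errors (average Hamming loss) over the sequence."""
    errors = 0
    for t, x_t in enumerate(sequence):
        b_t = predictor(sequence[:t])   # decision based on the past only
        errors += int(b_t != x_t)       # Hamming loss: 1 on a miss, 0 on a hit
    return errors / len(sequence)

# Example predictor: guess the majority symbol seen so far (ties -> 0).
def majority_predictor(past):
    return int(sum(past) > len(past) / 2)
```

On the all-ones sequence, for instance, this majority rule errs only at the very first step, before any past is available.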

It should be emphasized that in this work the observation sequence is arbitrary. If we were to assume that the observation sequence comes from, say, an ergodic source with unknown probability law (if the probability law is known, the optimal decision is given by classical Bayesian decision theory), then the universal sequential decision problem would still be non-trivial, but one that has already been solved by Algoet, Cover, Hajek and others (see, e.g., [6]).

In this setting of an arbitrary individual observation sequence, it might seem surprising, at first glance, that the past can be useful in predicting the future; after all, when a sequence is arbitrary, the future is not necessarily related to the past. Nonetheless, it turns out that sequential (randomized) prediction schemes exist that utilize the past, whenever helpful in predicting the future, and attain the performance of the best finite-state (FS) predictor, even if this FS predictor is tuned, in retrospect, to the specific sequence. A similar observation has been made in data compression [2] and gambling [3]. However, the solution to those ``log-loss'' problems was tailored to them, and could not be generalized to the prediction problem. Fortunately, this made the prediction problem more interesting!

It should be clear that when dealing with arbitrary observation sequences
the goal must be to compete with a limited class of predictors.
If the goal were to compete against any predictor,
the problem would be meaningless, since for
any specific binary sequence the scheme that happens to output $b_t = x_t$ at every $t$
would correctly predict each outcome and make zero errors.
We limit the class of predictors we compete with to FS machines,
not only to avoid trivialities, but
also because this limitation reflects the resources one has in real-life situations.
And philosophically, we are not attempting to achieve the ``best'' possible performance
in predicting the sequence, which is an ill-defined goal; instead we take an *agnostic* approach:
to do whatever we can in competing with a restricted class of strategies.

So what is the best that an FS machine can do in predicting a specific sequence?
An FS predictor is a machine in which the predicted value depends
only on the current state $s_t$, i.e., $b_t = f(s_t)$. After the value of $x_t$ is revealed,
the machine moves to the next state according to $s_{t+1} = g(s_t, x_t)$, and so on.
Now, for a given observation sequence $x^n = (x_1, \ldots, x_n)$, we define its $S$-state predictability, denoted
$\pi_S(x^n)$, as the minimal fraction of errors made by
any FS predictor, which may depend on $x^n$ but has only $S$ states, in predicting $x^n$:
$$\pi_S(x^n) = \min_{f,g} \frac{1}{n} \sum_{t=1}^{n} d_H(f(s_t), x_t),$$
where the minimum is over all next-state functions $g$ and prediction rules $f$ defined on at most $S$ states.
The predictability takes on values between
zero and one half, where zero corresponds to perfect predictability and
one half corresponds to total unpredictability.
But we want to define the predictability for *any* finite $S$. Clearly, for a fixed-length
sequence, as $S$ is increased the best $S$-state predictor will eventually make zero errors.
Thus, to consider a large $S$, the sequence length must be increased first.
So we define the $S$-state predictability of an infinite individual sequence $x = x_1, x_2, \ldots$ as
$$\pi_S(x) = \limsup_{n \to \infty} \pi_S(x^n),$$
where we take the limit supremum
since for an arbitrary
sequence there might not be a limit. Finally, the FS predictability of $x$,
denoted $\pi(x)$, is defined as
$$\pi(x) = \lim_{S \to \infty} \pi_S(x),$$
where the limit exists since $\pi_S(x)$ is non-increasing in $S$.
This definition of the FS predictability is in the spirit of the definition
of the FS compressibility in [2].
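To make the definition concrete: once the next-state function is fixed, the best predictor in hindsight simply outputs the majority symbol observed at each state, so the minimal fraction of errors is $\frac{1}{n}\sum_s \min(n_s(0), n_s(1))$, where $n_s(a)$ counts how often symbol $a$ occurred while the machine was in state $s$. A minimal Python sketch (the function name is ours, and it minimizes only over the fixed Markov/shift-register next-state rule with $2^k$ states, not over all machines of that size):

```python
from collections import defaultdict

def markov_predictability(sequence, k):
    """Fraction of errors of the best k-th order Markov predictor
    (k >= 1), chosen in hindsight for this binary sequence, with
    state = last k bits.  This is the predictability relative to
    one particular 2^k-state next-state rule, not all such rules."""
    counts = defaultdict(lambda: [0, 0])   # state -> [#zeros, #ones] seen there
    state = ()
    for x in sequence:
        counts[state][x] += 1
        state = (state + (x,))[-k:]        # shift the new symbol into the state
    # The best fixed prediction per state is that state's majority symbol,
    # so it errs min(n_s(0), n_s(1)) times in state s.
    errors = sum(min(c) for c in counts.values())
    return errors / len(sequence)
```

For example, the alternating sequence $0101\ldots$ is perfectly predictable already with one bit of memory, since each state is always followed by the same symbol.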

While the foregoing definition enables a different optimal
FS predictor for each sequence, the main result of our work is the construction
of *universal* predictors, independent of the particular sequence,
that always attain the FS predictability!
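Such universal predictors are necessarily randomized: against any deterministic scheme, an adversary can choose $x_t = 1 - b_t$ at every step and force a 100% error rate. As a loose illustration only (this linear-ramp smoothing is a simplification of ours, not the exact scheme of the paper), a sequential randomized per-state predictor might look like:

```python
import random
from collections import defaultdict

def make_universal_markov_predictor(k, eps=0.1, rng=random):
    """Sequential randomized predictor over the 2^k Markov states
    (k >= 1).  In each state it tracks counts of past outcomes and
    predicts 1 with a probability that ramps from 0 to 1 as the
    empirical frequency of 1s crosses 1/2 -- deterministic when one
    symbol clearly dominates, randomized near a tie.  (A simplified
    illustration, not the paper's exact construction.)"""
    counts = defaultdict(lambda: [0, 0])   # state -> [#zeros, #ones] seen there
    state = ()

    def predict_then_update(x_t):
        nonlocal state
        n0, n1 = counts[state]
        total = n0 + n1
        freq = n1 / total if total else 0.5
        # Linear ramp of width 2*eps around freq = 1/2.
        p = min(1.0, max(0.0, (freq - 0.5 + eps) / (2 * eps)))
        b_t = int(rng.random() < p)
        # Only after committing to b_t is x_t observed and recorded.
        counts[state][x_t] += 1
        state = (state + (x_t,))[-k:]
        return b_t

    return predict_then_update
```

The randomization near ties is the essential ingredient: it is what prevents an adversarial sequence from anticipating, and always contradicting, the predictor's next guess.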

Tue Aug 8 11:36:42 PDT 1995