For a while, we tried to solve the universal prediction problem by forcing an analogy to the universal coding problem. However, although we felt that optimal Bayes prediction with respect to the conditional probabilities induced by the coding algorithm should work, we could not prove it. So we decided to concentrate first on a simpler problem: find a non-anticipating predictor that performs, for any sequence, essentially as well as the best ``fixed'' (single-state) predictor, which predicts either constantly ``0'' if $N_0 > N_1$ or constantly ``1'' otherwise, where $N_0$ and $N_1$ are the number of zeros and ones along the sequence.
At first, it was not clear how to achieve even this modest goal for an arbitrary sequence. The best single-state predictor makes the optimal Bayesian prediction with respect to the empirical first-order letter probabilities of the data. Thus, an intuitive suggestion for a non-anticipating predictor is to count the numbers of zeros and ones, $N_0(t)$ and $N_1(t)$, along the sequence $x_1, \ldots, x_t$, calculate the empirical probabilities $\hat{p}_0(t) = N_0(t)/t$ and $\hat{p}_1(t) = N_1(t)/t$, and then predict ``0'' if $\hat{p}_0(t) > 1/2$, ``1'' if $\hat{p}_0(t) < 1/2$, and flip a fair coin otherwise. This obvious ``majority'' predictor seems to work well in many examples, and certainly when the sequence comes from a Bernoulli source, where with high probability it converges, after a while, to the optimal Bayesian choice. However, this predictor may be bad for some sequences. Consider, for example, the alternating sequence $x = 0101\cdots01$: while the fraction of errors made by the best single-state predictor (which can be either ``0'' or ``1'' in this example) is $1/2$, the sequential majority predictor makes errors 75% of the time on the average, since after each odd prefix the majority points to ``0'' while the next bit is ``1'', and after each even prefix the counts tie and the coin errs half the time.
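The 75% figure is easy to verify empirically. The following minimal simulation (a sketch, not taken from the paper; all function names are mine) runs the sequential majority predictor on a long alternating sequence:

```python
import random

def majority_predict(n0, n1):
    """Predict the next bit by majority count; flip a fair coin on a tie."""
    if n0 > n1:
        return 0
    if n1 > n0:
        return 1
    return random.randint(0, 1)

def error_rate(seq):
    """Fraction of errors made by the sequential majority predictor on seq."""
    n0 = n1 = errors = 0
    for bit in seq:
        if majority_predict(n0, n1) != bit:
            errors += 1
        n0 += bit == 0
        n1 += bit == 1
    return errors / len(seq)

random.seed(0)
alternating = [0, 1] * 5000  # the adversarial sequence 0101...01
print(error_rate(alternating))  # close to 0.75
```

After every odd prefix the predictor is deterministically wrong, and after every even prefix the coin flip is wrong half the time, which averages to 3/4.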
Studying this example further, we realized that the difficulty lies in the fact that for this sequence $\hat{p}_0(t)$ converges to $1/2$, which is a discontinuity point of the optimal Bayesian predictor, which predicts ``0'' when $\hat{p}_0(t) > 1/2$ and ``1'' when $\hat{p}_0(t) < 1/2$. Well, if this is the case, what happens if we force the decision of the universal predictor to be continuous with respect to the empirical probability estimate? How about the predictor:
\[
\hat{x}_{t+1} = \begin{cases} 1, & \text{with probability } \phi\bigl(\hat{p}_1(t)\bigr) \\ 0, & \text{with probability } 1 - \phi\bigl(\hat{p}_1(t)\bigr), \end{cases} \tag{1}
\]
where $\phi(\cdot)$ is the continuous, piecewise-linear function depicted in Figure 1? The randomized predictor (1) provides a decision rule that is continuous with respect to the estimated empirical probability. Note that the randomization occurs only when the empirical probability estimate is close to $1/2$, where there is a danger of making a ``sizeable'' error in comparison with the optimal Bayes decision. Forcing the decision rule to be continuous was indeed a good idea, as we were able to prove that the predictor (1) attains asymptotically the single-state predictability for any sequence. The corresponding result on the log-loss problem played a special role since it was easier to obtain, and so it served as a vehicle for proving similar results for other loss functions.
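A sketch of the randomized predictor (1) is below. Since Figure 1 is not reproduced here, the exact shape of $\phi$ is an assumption: I take it to be 0 below $1/2 - \epsilon$, 1 above $1/2 + \epsilon$, and linear in between, with a hypothetical width $\epsilon = 0.1$:

```python
import random

EPS = 0.1  # width of the randomized transition region (hypothetical value)

def phi(p, eps=EPS):
    """Continuous, piecewise-linear transfer function (assumed shape of
    Figure 1): 0 below 1/2 - eps, 1 above 1/2 + eps, linear in between."""
    if p <= 0.5 - eps:
        return 0.0
    if p >= 0.5 + eps:
        return 1.0
    return (p - (0.5 - eps)) / (2 * eps)

def randomized_predict(n0, n1):
    """Predict ``1'' with probability phi(p1_hat), as in predictor (1)."""
    t = n0 + n1
    p1_hat = n1 / t if t > 0 else 0.5
    return 1 if random.random() < phi(p1_hat) else 0

def error_rate(seq):
    """Fraction of errors made by the randomized predictor on seq."""
    n0 = n1 = errors = 0
    for bit in seq:
        if randomized_predict(n0, n1) != bit:
            errors += 1
        n0 += bit == 0
        n1 += bit == 1
    return errors / len(seq)

random.seed(0)
alternating = [0, 1] * 5000
print(error_rate(alternating))  # close to 0.5, the single-state predictability
```

On the alternating sequence, $\hat{p}_1(t)$ stays inside the transition region, so the predictor keeps randomizing nearly fairly and its error rate approaches $1/2$ rather than $3/4$.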