Universal prediction has many interesting applications. An immediate application in information theory is the problem of lossy and lossless predictive coding. Other examples that came to our attention are branch prediction in computer programs or page prefetching into cache memory. As a matter of fact, a predictor based on LZ parsing was suggested for page prefetching in . There are, of course, the obvious applications to weather forecasting, the stock market, etc. It would be interesting, then, to see the effect of our work in solving these practical problems.
In any practical implementation, the rate of convergence to the predictability is crucial. This is a drawback of the universal predictor based on the LZ algorithm, and actually a drawback of all the ``growing order'' Markovian predictors. The reason for the slow convergence rate lies in the fact that even when the sequence can be predicted using a small number of states, the algorithm tries to estimate the behavior in a growing number of states, and so the effective number of samples per state is small.
Unfortunately, we usually do not know what and how many states should be considered. So what can we do? How can we ``lock'' on the smallest set of states that explains the data? Well, at least in sequential universal coding we feel that we know the answer. In a recent work we did with Marcelo Weinberger it was shown how to attain the optimal convergence rate to the entropy both in the probabilistic setting and in the setting of individual sequences. The idea was either to estimate sequentially the state structure, but this works only in the probabilistic setting, or to use a ``mixture'' of all possible state models for assigning a probability distribution to the next symbol, which also works in the individual sequence setting.
But what about attaining the optimal convergence rate in prediction and in other sequential decision problems? We do not have an answer to this, but we do have some thoughts. The sequential universal coding procedure generates a conditional probability for the next symbol, given the past; thus it defines a ``universal probability'' (a notion suggested by Rissanen) whose logarithm is as close as possible to the log-probability assigned by the best model in a class to the given sequence. Maybe, if we perform a Bayesian sequential decision using the universal probability instead of the unknown best model, we shall also attain the optimal rate for this decision problem. Of course, as we saw in this work, if the Bayesian decision is a discontinuous function of the probability assigned to the symbol on which we decide, it is a good idea to force it to be continuous by a randomized procedure. Does it work?
We find this approach attractive. For us the notion of information extraction is associated with the coding problem and with its performance measure, the entropy. Coding considers directly the probabilities assigned to the sequence, and as a result, technically useful properties, like the chain rule, can be applied. And, after all, we believe that the ``description'' or coding performance is the ``right'' measure of complexity.