Class #8: Occam’s Razor, the Universal Prior, and PAC-Learning

Can we even hope for a non-tautological justification of Occam’s Razor?  Can one explain, non-circularly, why the “Anti-Inductivists” are wrong?

What are the important assumptions made by Valiant’s PAC model, and how reasonable are those assumptions?

What are the relative merits of the PAC-learning and Bayesian accounts of learning?

Do the sample complexity bounds in PAC-learning theory (either the basic m = (1/ε)log(|H|/δ) bound, or the bound based on VC-dimension) go any way toward “justifying Occam’s Razor”?  What about the Bayesian account, based on a universal prior where each self-delimiting computer program P occurs with probability ~2^{-|P|}?
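(As a concrete illustration of the first bound, with purely made-up numbers: using the natural-log form m ≥ (1/ε)·ln(|H|/δ), a class of |H| = 2^{30} hypotheses with ε = δ = 0.01 requires only about

$$ m \;\ge\; \frac{1}{0.01}\big(30\ln 2 + \ln 100\big) \approx 2540 $$

samples, i.e. the sample complexity grows only logarithmically with the size of the hypothesis class.)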

Does some people’s criticism of PAC-theory—namely, that the number of bits needed to pick out an individual hypothesis h∈H need not be connected to “simplicity” in the intuitive sense—have merit?  If so, what would need to be done to address that problem?


38 Responses to Class #8: Occam’s Razor, the Universal Prior, and PAC-Learning

  1. A weird idea: There’s a good anthropic explanation of why we’re in a universe that’s simple up to the present time*, but no good explanation at all (as far as I can tell) of why we’re in a universe whose total space-time is simple. Should we conclude that we’re in a temporal pocket of simplicity and induction will stop working soon?

    *Unless we’re Boltzmann brains, I guess.

  2. I’m going to bring in some of my philosophy of science background here and explain one particular solution to the problem of induction, which is the reliabilist solution. Recall that the problem of induction is the problem of *justifying* induction, that is to say, giving a reason why we come to have knowledge by performing induction. There is a hidden claim here, which is that we actually need to *justify* that induction is, in some sense, a knowledge-gaining activity. If rational beings don’t need to justify their methods of knowledge acquisition, then the problem dissolves. Of course, an anti-inductivist may claim you are simply dodging the question, and with not very much subtlety either.

    Why, then, do we think that we do not need to justify beliefs which we call knowledge? It certainly seems that something more than simply truth is necessary: if I randomly generate the belief “the fourteenth digit on the screen in an airport in Texas is 2” and it happens by sheer luck to be right, it doesn’t really seem to be knowledge at all. This makes the idea that we need to justify knowledge very appealing, since the difference between me having looked at the screen and me having guessed randomly is the difference between having a convincing explanation and an unconvincing one.

    So what does the “reliabilist” say? They say that you do not necessarily need to know the explanation; if the inquisitive child continually asks “Why?” at some point you are allowed to shrug your shoulders and fail to reply, and still count what you know about the topic at hand as knowledge. The key, then, is that the knowledge comes from a *reliable* source. This is a contingent fact, one that depends on the nature of reality, not on anything that humans (necessarily) have access to.

    In this sense, PAC learning attempts to describe some mathematical structure about reality, without demanding the algorithm “know” why it will learn something probably approximately.

  3. D says:

    In response to your questions above, I’ll contend that neither the complexity bounds of PAC learning nor the Bayesian prior “justifies” Occam’s razor–they both seem to ultimately be circular in nature.

    For PAC-learning, as you point out in your last paragraph, we have a fixed set of hypotheses under consideration, and the number of samples relates to the size of this set. But this says nothing about the complexity of these hypotheses, nor how we “should” define/select which hypotheses to consider in our class H! We talk about having H consist of, say, all binary strings of a certain length, and thus “longer” (i.e. more complicated) hypotheses require more samples to learn and distinguish. But there’s nothing to say we have to make this choice for H–it’s merely a convenience for analysis; it’s easier for us to conceptualize and work with. You could have a small finite class of possible hypotheses that are all very complex. The only reason the PAC-learning approach might appear to favor simpler hypotheses is that we often choose our hypothesis class in such a way that it is smaller if the hypotheses are simpler.

    Similarly, the Bayesian universal prior is chosen arbitrarily. There’s no particular reason we have to use the 2^{-|P|} distribution, other than convenience: it’s clean, easily-defined, and results in a valid probability distribution. But it’s an arbitrary choice of prior; by analogy to the PAC-learning argument, we could have a prior that assigns more weight to a small subclass of complicated hypotheses and only a tiny amount of weight to every other hypothesis (even simple ones). We choose the standard universal prior because it’s easier to work with, not because it reflects Fundamental Truth. So we can’t therefore claim that the resulting preference for shorter hypotheses itself constitutes an argument for taking Occam’s Razor as a Fundamental Truth.

    I certainly think there are good reasons for using Occam’s Razor, I just don’t think that there are necessarily philosophical justifications for those reasons (particularly not ones rooted in the above arguments).

    • bobthebayesian says:

      You say, “There’s no particular reason we have to use the 2^{-|P|} distribution, other than convenience: it’s clean, easily-defined, and results in a valid probability distribution.” But this is at odds with what motivated this particular choice of prior historically. The whole point was that by relating the prior probability to the length of the shortest machine that outputs bits that correspond to your data after receiving uniformly random input, you can *prove* that your predictions in the long run will be as good or better (modulo a constant depending on how complex the machine is) than any frequentist-based predictions, or even any other Bayesian predictions made with a different prior.

      Check out this page: ( http://www.scholarpedia.org/article/Algorithmic_probability ) and in particular the section on Solomonoff Induction, e.g. “This [error bound] is remarkable as it means that Solomonoff’s inductive inference system will learn to correctly predict any computable sequence with only the absolute minimum amount of data. It would thus, in some sense, be the perfect universal prediction algorithm, if only it were computable.”
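      For reference, the error bound being quoted (as I recall it from the Solomonoff/Hutter literature; the notation is my restatement) says that for any computable measure μ generating the data, the universal predictor M satisfies

      $$ \sum_{t=1}^{\infty} \mathbb{E}_\mu\Big[\big(M(x_t{=}1 \mid x_{<t}) - \mu(x_t{=}1 \mid x_{<t})\big)^2\Big] \;\le\; \frac{\ln 2}{2}\, K(\mu), $$

      so the total expected squared prediction error is bounded by a constant times the complexity of the true environment, and in particular the per-step error must go to zero.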

      I am sympathetic to your point of view that nothing we have described so far gives any kind of “hands down” justification for Occam’s razor. PAC-learning basically says that we can’t really do any better when we want to learn good explanations for observations. PAC still needs Bayes in some sense because you have to have a way to pick your hypothesis class H up front, and no matter what you decide on, it represents your prior beliefs about the problem. But Bayes can’t perfectly justify Occam’s razor either, specifically because this universal prior is not computable. However, in practice I am content with saying, “if not Occam’s razor, then by what criteria do you suggest we select the best model from a collection of equally-consistent models?” I’ve never heard an answer to this that isn’t just “Use a sufficiently uninformative prior and check for simplicity” in disguise, and I’m also content believing that the observed success of the razor places the burden of proof on the other guys.

      • bobthebayesian says:

        Also, in the page that I linked above, the picture shows Solomonoff’s beard. Clearly if anyone needed Occam’s razor, it’s him. So I’ll keep on trusting his prior. (I hope that horrible jokes are allowed.)

    • Andy D says:

      D, the universal prior U is not an “arbitrary” choice of prior; it has several advantages. First, it has a reasonable “generative model”: there is a randomized Turing machine whose output distribution is U. Second, U in a sense “takes into account” all possible generative models of this kind, lending noticeable weight to each of them.

      These advantages were discussed in class; for more info, see http://tinyurl.com/c49j974 or the Li-Vitanyi book on Kolm. complexity. These advantages don’t force you to “believe” in the universal prior (or prove Occam’s razor), but they do motivate its use.

      • D says:

        My point isn’t that there aren’t advantages to U, or that there aren’t reasons to use it in practice. My point is that those reasons aren’t philosophical in nature, and that the choice of U specifically is not unique.

        You mention two specific advantages. The first is that it has a “reasonable” generative model. To begin with, I’d push back on what “reasonable” means and whether your definition of “reasonable” is a matter of convenience or a matter of Fundamental Truth (and why). We might not be able to easily work with things that don’t have reasonable generative models to make predictions, but why–philosophically speaking–should having a “reasonable” generative model–or a generative model at all–influence whether or not it is a good starting point for Bayesian reasoning? It seems question-begging.

        Even disregarding that, there’s another objection which is the same as the objection to the second advantage: it’s not unique in having the properties you describe. U can be output by a randomized Turing machine, but it’s hardly the only distribution with that property. U assigns nonzero weight to every possible hypothesis, but it’s not the only distribution with that property either.

        Here’s an alternative to U: Consider a specific hypothesis (possibly a very complicated one) x. Consider the distribution (x+U)/2 –that is, each possible hypothesis is assigned half the weight it is in U, except x is assigned an additional 0.5. I claim this can be easily output by a randomized Turing machine, and that it also gives noticeable weight to every possible model. (x+U)/2 is only one of an infinite number of possible alternatives with the same advantages as U; why–philosophically (not practically)–should we prefer U to any of these? (If you claim that (x+U)/2 is philosophically suspect for being asymmetric–that is, specifically preferring x–why is U not philosophically suspect for preferring shorter hypotheses?)
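        To make the generative-model claim concrete, here is a minimal Python sketch (entirely illustrative; run_prefix_machine is a stand-in for a prefix-free universal machine, which cannot really be run to completion in general, and this is my own restatement rather than a precise construction):

        import random

        def sample_from_U(run_prefix_machine):
            # Generative model behind a universal-prior-style distribution:
            # feed fair coin flips to a prefix-free machine until it halts
            # with an output.  (Illustrative only: a real universal machine
            # need not halt, so this loop need not terminate.)
            bits = []
            while True:
                bits.append(random.randint(0, 1))
                halted, output = run_prefix_machine(bits)
                if halted:
                    return output

        def sample_from_mixture(x, run_prefix_machine):
            # The alternative prior (x + U)/2: with probability 1/2 output
            # the fixed hypothesis x, otherwise sample from U as above.
            if random.random() < 0.5:
                return x
            return sample_from_U(run_prefix_machine)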

      • bobthebayesian says:

        But the whole point is that you can *prove* that it will require you to collect more data to arrive at an optimal prediction with your alternative prior than if you used U (if U was computable). The properties of U that make it unique are *precisely* philosophical in nature — in the long run, to learn the objectively correct answer with the smallest amount of training data you should have U as your prior (if it were computable, and also modulo a constant similar to the “objectivity constant” defined in the discussion of K-complexity itself). Using your alternative to U is also uncomputable and provably worse than U itself.

      • D says:

        bobthebayesian: “modulo a constant” is the key phrase here. Is my prior not also optimal modulo a constant?

      • bobthebayesian says:

        The Solomonoff prior is the unique prior that would learn to correctly predict any computable sequence with only the absolute minimum amount of data, if it were computable. I think the “modulo a constant” phrase from before is not relevant to this at all… it only means that once you fix your universal computer, there is some translation cost for worrying about programs for a different universal computer. Your prior is provably less optimal than Solomonoff’s, which would be strong philosophical motivation to choose Solomonoff’s if it were computable and if your goal is to learn a concept with minimal data. Again, these things are all very related to Occam’s razor, but do not conclusively prove that this is the “right” way to view it. It’s just that there is not only motivation for Solomonoff’s idea, but very philosophically relevant motivation that cuts right at the “optimal choice” issue presented by the Occam’s razor problem.

  4. Valiant’s original paper introducing the PAC model considered a quite simple setting; of course, this was to make the point that learning may in certain instances happen “in the absence of explicit programming”. While I am not sure whether a PAC-type model could be used to justify Occam’s Razor, still, the notion of “probably approximately correct” seems quite appealing to me as a phenomenon. I do not know much about PAC learning but I wonder: are there generalizations of this model for which “surprising” (positive!) results can be shown? (Of the kind that look a lot less like toy examples for making a point, and more like there is something interesting going on.)

    • Scott says:

      Ale: One of my favorite nontrivial learning algorithms is the Linial-Mansour-Nisan algorithm, which PAC-learns arbitrary AC^0 functions, under the uniform distribution over inputs, in n^{polylog(n)} time and using n^{polylog(n)} samples. Actually, the algorithm itself is quite simple: just approximate all the low-weight Fourier coefficients of the function, and use those to do your prediction. But the proof that the algorithm works basically amounts to Håstad’s switching lemma. Another good example is the Klivans-Servedio algorithm, for PAC-learning DNF formulas under arbitrary input distributions in exp(n^{1/3}) time.
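      A minimal sketch of the “low-degree algorithm” flavor of this (my own illustrative code, not the actual Linial-Mansour-Nisan procedure; f is treated as an oracle over {-1,1}^n and d is the degree cutoff):

      import itertools
      import random
      from math import prod

      def low_degree_learn(f, n, d, m):
          # Draw m uniform samples, estimate every Fourier coefficient on a
          # set S of size <= d as the empirical mean of f(x) * chi_S(x),
          # then predict with the sign of the truncated Fourier expansion.
          samples = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]
          values = [f(x) for x in samples]
          coeffs = {}
          for k in range(d + 1):
              for S in itertools.combinations(range(n), k):
                  est = sum(v * prod(x[i] for i in S) for x, v in zip(samples, values)) / m
                  coeffs[S] = est

          def predict(x):
              total = sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
              return 1 if total >= 0 else -1

          return predict

      For AC^0 functions one would take d = polylog(n) and m = n^{polylog(n)}, which is where the quasipolynomial running time comes from.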

  5. bobthebayesian says:

    In [1] (see bottom for sources), the author describes two views of Occam’s razor: (1) beauty and simplicity are themselves worthwhile for whatever reason and we should therefore seek simple explanations as a means to obtain the worthwhile simplicity/elegance/beauty/explicability they contain, and (2) one should believe that (among a pool of explanations that are all equally consistent with the data) simpler explanations are a priori more likely to be correct. (I refer to these as (1) and (2), resp.) The author suggests that (1) is a perfectly fine pursuit but that (2) is demonstrably incorrect, and offers empirical evidence against (2).

    There’s a lot in the paper that is interesting to think about, but there was one comment in particular that I think we can analyze effectively. While offering empirical evidence against (2), the author says, “Perhaps the single most significant piece of evidence against the second razor is the broad success of multiple-model approaches,” and then proceeds to cite bagging, boosting, stacking, and error-correcting output codes as examples of this kind of model-combination or model-averaging procedure. I think it would be a very interesting task to try to sharpen this position by considering Philosophical Majoritarianism as the appropriate setting to actually try to rigorously justify model-averaging in the same sense that we would like to rigorously justify Occam’s razor. Despite this potential connection with majoritarianism, however, I think we can offer some kind of rebuttal using computational learning theory.

    Rather than trying to focus on all of the different examples given by [1], I prefer to restrict things to the controlled setting of mixture models. Basically, we can just treat a mixture model as a convex combination of some probability distributions. I take it as a premise that mixture models serve as an adequate proxy for the basic argument in [1]: observed cases where a mixture model is more successful than similarly consistent single models count as a defeat of (2), because the mixture model is to be considered more complex by virtue of being a mixture.

    (We could take non-trivial issue with this, but I am content to accept it for now. I don’t think my reduction to mixture models makes this into a straw-man, but it’s worth critical re-examination to make sure mixture models really do serve as a proxy for large chunks of [1]’s argument.)

    To keep this brief, I’ll just summarize some interesting results from [2]. It is proved that in cases where a class of probability distributions is fixed and known at the outset of learning, the metric entropy of the hypothesis class is essentially the same as the VC-dimension ([2] proves that the metric entropy is bounded by a constant that depends on the VC-dimension multiplied by a polynomial function of a parameter epsilon, and some previous results show certain bounds for the other direction, so metric entropy and VC-dimension sort of sandwich each other similarly to the way that Kolmogorov complexity w.r.t. two different programming languages has tight “objectivity” bounds). Basically, a concept class having finite metric entropy is roughly equivalent to it having finite VC-dimension as far as certain learning applications are concerned.

    For a bounded metric space U, the covering number C(U,epsilon) is defined as the cardinality of the smallest set S of points inside of U such that epsilon balls around the points in S form a covering of U (according to the metric). A common example is that if U is the unit square, then the covering number is O(1/eps^2) because each epsilon-ball covers only about eps^2 area. The metric entropy is defined as log(C(U,eps)). The intuitive meaning of metric entropy is the number of bits required to specify an element of U, with up to an eps amount of distortion in the metric.
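    As a sanity check on the unit-square example, here is a tiny illustrative Python sketch (mine) that counts a grid-based epsilon-cover, which upper-bounds the covering number, and prints the corresponding metric entropy:

    import math

    def grid_cover_size(eps):
        # Place centers on a grid with spacing eps * sqrt(2): the farthest
        # point in a grid cell is at distance spacing * sqrt(2) / 2 <= eps
        # from its center, so these balls cover the unit square.
        spacing = eps * math.sqrt(2)
        per_side = math.ceil(1.0 / spacing)
        return per_side ** 2          # O(1/eps^2), as in the text above

    for eps in (0.1, 0.01, 0.001):
        c = grid_cover_size(eps)
        print(eps, c, math.log2(c))   # metric entropy = log of the covering number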

    [2] shows how to extend the notion of metric entropy to a case where a class of probability distributions is specified. In particular, if one starts out with a large class of distributions, P*, and then defines P_l to be the class of convex combinations of (finitely-many) distributions from P*, then [2] proves that the following are equivalent: the metric entropy of some concept class C (w.r.t. P_l) is finite; the VC-dimension of C is finite; C is PAC-learnable in the same sense we discussed in class, specifically for the class of these mixture distributions.

    My apologies for how wordy this is. It makes more sense in light of Section 4 of [2] (hopefully). But the general idea is that the same kind of “Occam’s Razor Theorem” we discussed for general learning (with a fixed hypothesis class H) also holds if we remove the need for the fixed H and instead assume we are specifically learning a mixture model. I know my details aren’t ironed out fully, but this seems to me to give us some justification for believing that, among mixture models, simpler mixture models are special for surviving contact with the data in the same sense that simpler hypotheses from H were special for surviving contact with the data in the regular PAC setting.

    The main take-away is that even among mixture models, there has to be some trade-off between model performance and model complexity (such as the Bayesian Information Criterion, etc.). Once we feel convinced that models of a particular type (mixture models) outperform “simpler” models, the simpler models cease to be “consistent with the data” (literally, for one type of model to outperform another, we’d have to observe some data that makes one appear more consistent than the other). Even if mixture models reach this better performance when compared with what we might consider “simpler” models, we still have to apply simplicity constraints when choosing mixture models that are consistent with the data. And as it turns out, if we are willing to fix a class of mixture distributions and make use of metric entropy, then PAC-learning guarantees apply similarly to mixture distributions.
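    (For reference, the standard form of the complexity penalty mentioned above, with k free parameters, n data points, and maximized likelihood \hat{L}; this is the textbook definition, nothing specific to [2]:

    $$ \mathrm{BIC} \;=\; k \ln n \;-\; 2 \ln \hat{L}, $$

    so between models that fit the data about equally well, the one with fewer parameters gets the lower, i.e. better, score.)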

    My belief is that if we flesh this out a little more rigorously, it would be a strong rebuttal to the “empirical evidence” presented by [1] to argue against interpretation (2) of Occam’s razor. This doesn’t say that (2) is justified; only that learning good-performing mixture models is not at all a violation of (2) and so empirical evidence of its success is not a convincing approach to ruling out (2).

    [1] P. Domingos. “The Role of Occam’s Razor in Knowledge Discovery,” Data Mining and Knowledge Discovery, Vol. 3, pp. 409-425. 1999.

    [2] S. Kulkarni. “On Metric Entropy, Vapnik-Chervonenkis Dimension, and Learnability for a Class of Distributions,” Technical report. MIT, Laboratory for Information and Decision Systems. 1989.

    • Katrina LaCurts says:

      The comment about mixture models in [1] stood out to me as well, and I actually saw it as a case *for* Occam’s Razor. It seems that the author is discussing three different models, all of which perform exactly the same on the training data:

      1. O, the original model; somewhat accurate on testing data, very simple
      2. M, the mixture model; more accurate than O on testing data, more complex than O (For ease of explanation, I’m also going to assume that O is one of the models in the mix)
      3. S, a single model; as accurate as M on testing data, more complex than M (S was explicitly generated from M)

      Following the author’s second definition of Occam’s Razor, assuming O, M, and S had all had the same accuracy on the training data, we should’ve chosen O; it was the simplest. First, in this situation, I have trouble believing that M and S wouldn’t have done better on training data than O. But moreover, it seems a reasonable guess that since M uses O as part of its ensemble, M should perform at least as well as O on testing data. Given this extra piece of intuition, I don’t think one can automatically apply Occam’s Razor, because now all things aren’t equal. We have two models performing equally as well, but *something else* that tells us one should be better than the other. So let’s at least test both.

      Moreover, if we were to test all three models on the testing data, we’d see M and S perform the best. Between the two, we’d almost certainly use M. It’s much easier to describe M, explain how it works, and explain why it works (by virtue of explaining that the models making up the ensemble work well in different cases). This seems like the right place to apply Occam’s Razor to this problem.

      There’s also the idea that Occam’s Razor can sometimes lead us down the wrong path, for instance if we had said from the outset “single models are always simpler than ensemble models, thus we should never use ensemble models”. But that would’ve been an incorrect application of Occam’s Razor, because it relies on a false assumption (single models are not always simpler than ensemble models, as shown above).

      • bobthebayesian says:

        I agree. You make a good insight by saying, “Given this extra piece of intuition, I don’t think one can automatically apply Occam’s Razor, because now all things aren’t equal. We have two models performing equally as well, but *something else* that tells us one should be better than the other. So let’s at least test both.” In [1], a lot of attention is also paid to “domain specific knowledge” and I often hear that phrase touted around like it is somehow different than a prior and somehow trumps Occam’s razor. However, if you possess domain specific knowledge, then the simplest explanation, according to Occam’s razor, will still have to survive contact with whatever data it is that ultimately grounds your belief in the domain specific knowledge. One cannot decouple domain specific knowledge, which is a summary of some previously experienced data, from the training data or testing data at hand. Any type of data you introduce into the system must be accounted for when comparing whether two different models are consistent. I think a lot of people claim that a “more complex” solution is more accurate, thereby violating the razor, without acknowledging that they are holding the two different models to two different data standards.

      • amosw says:

        Hi bobthebayesian. You say “domain specific knowledge, which is a summary of some previously experienced data”.

        I’m sympathetic towards this point of view, but isn’t it called into doubt by the famous studies establishing that the concept of object permanence exists in human infants as early as 3 months old[1]? I.e., it seems that humans are born with at least *some* inductive biases, and thus have “domain knowledge” that is NOT a summary of their “previously experienced data”. Arguably it IS a summary of their ancestors’ previously experienced data, but I don’t think that’s what you are claiming here.

        [1] http://en.wikipedia.org/wiki/Object_permanence#Contradicting_Evidence

      • bobthebayesian says:

        My personal feeling is that ancestor-experienced data is not any different from “consciously observed data” except in some parochial ways — though I’m by no means certain of that, and knowledge that an entity is “just born with” could be an interesting thing to explore. Evolution specifies bits of information like other learning algorithms, just slower and not necessarily monotonically. (Check out this discussion http://wiki.lesswrong.com/wiki/Speed_limit_and_complexity_bound_for_evolution and the paper cited there by Worden.)

        I think what we refer to as domain specific knowledge is very valuable, but it just represents a summary of data that we came into contact with in the past. So whatever you’re born with is a BIG part of your prior — and indeed, this prior is not always optimal. Whenever it is not optimal, we refer to this as a cognitive bias, such as confirmation bias, the conjunction fallacy, availability bias, or an interesting “vulnerability-expert” bias discussed just today (http://www.overcomingbias.com/2011/11/fear-makes-trust-blindness.html). A lot can be discussed about justifiable reasoning in an entity unfortunate enough to be born with cognitive biases that it cannot fully correct (e.g. humans), but my main point was that this alone does not challenge Occam’s razor.

        This is one reason why I’m motivated to look at the falsifiability of the razor. If when we say “mutually consistent class of hypotheses” what we really mean is “hypotheses for which no empirical test could distinguish them in terms of prediction accuracy”, then the Razor is not falsifiable. But on the other hand, the very minute we ever perform an empirical test that does distinguish some of the hypotheses from others in terms of prediction accuracy, then in some sense all simplicity bets are off because the hypotheses cease to be mutually consistent with the data. You remove the bad ones from consideration, regardless of simplicity/complexity, rinse and repeat. If this is what we really mean — and I believe it’s non-trivial as to whether this truly is what we mean by the razor — then we have a falsifiability problem.

        K-complexity might provide us with an objective way around this. What would it mean for Occam’s razor to be falsifiable? I conjecture that one possible interpretation would be that nature generates true theories T according to some probability P(T), and that if Occam’s razor is incorrect, we should find that the Typical Set (constructed as per Cover and Thomas Information Theory text (2nd ed.) pp. 59) should contain most of the theories with low K-complexity (i.e. the probability that nature generates T as a true theory is more or less unrelated to K(T)).

        Since K is uncomputable, this does not directly lead to falsifiability (and of course suffers from all kinds of particular choices about how to detect such things, what kind of proxy complexity like gzip to use, etc.), but philosophically it seems satisfactory. As Miguel pointed out in a comment below, variational evidence does suggest nature achieves a large amount of compression in its theories. There’s still some amount of anthropocentric reasoning here, but we can at least make it very explicit and decide if we can live with the premises, which is a good start.

  6. Andy D says:

    If you’re in this class and serious about the life of the mind, I’m guessing you already read PARADE magazine. But just in case you missed it, Marilyn vos Savant has been discussing the probabilities of simple and complex dice-roll outcomes recently:
    http://www.parade.com/askmarilyn/2011/10/Sundays-Column-10-23-11.html
    Any thoughts on her argument?

    • bobthebayesian says:

      At first I felt extremely sure that I had a good reason to agree with Marilyn, but I want to think about it a bit more now. When she says, “It was far more likely to be (b), a jumble of numbers,” she is basically appealing to membership in the Typical Set, from standard information theory (pp. 58-64 of Cover and Thomas). For a relatively large value of epsilon, the set of outcomes that “resemble” 66234441536125563152 will occur with much higher total probability than the single outcome 11111111111111111111.

      I prefer to think about these things in terms of wagering. What odds would you assign to each of the two strings having been generated? The straightforward answer leaves them both equally likely. If you appeal to the volatility of your winnings in a game like this, then the Typical Set argument is a good reason to prefer (b). However, this seems too “practical” and doesn’t really address the philosophical issue about whether to regard one as more likely than the other.

      Of course, as discussed in class, we could view the Kolmogorov complexity of the two strings as a measure of their randomness, and then believe it is somehow “more miraculous” for the coin to have specified a super low entropy result like all 1’s. If we then accept the axiom that we should believe in the least miraculous consistent explanations (a form of Occam’s razor), this again would prefer (b).
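      (For reference, the typical set invoked above, in the standard Cover-Thomas form for an i.i.d. source X; this is just the textbook definition restated:

      $$ A_\epsilon^{(n)} \;=\; \Big\{ x^n : \Big| -\tfrac{1}{n}\log p(x^n) - H(X) \Big| \le \epsilon \Big\}, $$

      whose total probability exceeds 1 - ε for all sufficiently large n.)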

  7. kasittig says:

    While PAC and Bayesian models can work somewhat well on certain types of data, we’re still extremely far from being able to develop computational models of classification that are able to beat humans in classification accuracy, which I find somewhat strange. In fact, many large companies – PayPal, most famously – do the majority of their critical fraud-detection work with complex data visualizers and an army of well-trained but extremely average workers, rather than attempting to develop better algorithms or heuristics. What is it about pattern recognition that is so easy for humans but so difficult for computers?

    I feel as though there is some subtlety that we must be missing in how humans execute binary classifications. In the case of fraud, perhaps we need to quantify why something “looks funny” or “feels wrong”. I suppose that, despite implementing and using many different machine learning algorithms for many different classification problems, I’m fundamentally disappointed with the current state of the field, and I’m skeptical that any of the current machine learning algorithms will be able to be tweaked to be as accurate as humans on complex tasks. We may just be approaching the problem incorrectly, but as it currently stands, I don’t expect computers to surpass the accuracy of well-trained humans anytime soon.

  8. Cuellar says:

    We say that all possible outcomes of 100 coin tosses are equally likely. This follows from a successive application of Bayes’ rule and the fact that each side of the coin is equally likely. On the other hand, we will suspect that the coin is not fair if we see 100 heads in a row. This is because we assume some outcomes are more ‘random’ than others and thus more likely to appear. This issue mentioned in class raises several questions, some of which I present here.

    How is it possible that we have two contradicting intuitions about the likelihood of the outcomes? First of all, this proves that in the Bayesian framework the universal prior, if it exists, is not common knowledge. It also raises questions about the Bayesian framework itself. Does Bayes’ rule fail in this context? Or, more importantly, is Bayes’ rule ‘context dependent’? It might be interesting to think that there are two types of probability that respond to different rules. The first is the ‘expectation probability’, which we use to predict possible outcomes and which responds to Bayes’ rule. The expectation probability is used by gamblers or actuaries and it works in practice. The second is the ‘existence probability’, which describes how likely it is for a given scenario to be part of our universe. In the common use of the word, both probabilities are the same, but there is a subtle difference. One way of thinking about it is the following: when something is true a priori, its existence probability is 1; when something is necessarily true, its expectation probability is 1. I do disagree with some of the consequences of this idea, but it seems to explain the dualism of probability and seems fairly interesting.

    It is also interesting to think what makes us decide that the tossing of a coin behaves randomly. If the Kolmogorov complexity of a sequence of 100 heads in a row is smaller than another random sequence q, wouldn’t it be easier for the ‘universe’ to generate the first sequence instead of q? I think it is not clear why the throws of the coin are truly independent when we introduce the complexity of the word as a relation between the throws.

    The same idea as before can be applied to Occam’s razor. If the laws of the universe were chosen randomly, then Kolmogorov-random laws should be more likely to be governing us. If we want to sustain Occam’s razor, should we accept that our universe was created by intelligent design? Probably not, but there is some discrepancy between our intuition of probability and the definition of Occam’s razor as we presented it in class.

    • bobthebayesian says:

      I disagree that these are two contradicting intuitions. The reason that we automatically assume that 100 heads is evidence in favor of a biased coin is that we have prior beliefs about the availability of the biased coin. Imagine a world where possessing a biased coin was punishable by instantaneous death and this was perfectly enforced. Then someone would have to be insane for carrying around such a coin and perhaps seeing 100 heads in a row would then serve as amazing evidence for the “freak occurrence” of a super low entropy event. This is precisely why the Bayesian point of view works so well in practice.

      I am (obviously) biased in preference of the Bayesian approach, but I very much appreciate and respect the non-trivial critiques that can be leveled against it. However, in most mundane situations, if you think you have something that casts the Bayesian perspective in a contradictory light, it usually means there are some hidden details about the real prior beliefs that aren’t being considered. The really interesting critiques get down to the business of what a prior is in ill-defined situations, how we come to know our prior, and to what degree human brains are capable of Bayesian reasoning (i.e. what resource limitations do we have, or are we truly ‘universal explainers’ as Deutsch suggests).

    • Nemion says:

      Very insightful. I liked this section:

      “It is also interesting to think what makes us decide that the tossing of a coin behaves randomly. If the Kolmogorov complexity of a sequence of 100 heads in a row is smaller than another random sequence q, wouldn’t it be easier for the ‘universe’ to generate the first sequence instead of q? I think it is not clear why the throws of the coin are truly independent when we introduce the complexity of the word as a relation between the throws.”

      We might be able to solve these issues by saying that the universal prior is the way we posit theories of the world. The “it feels like this is how it works … ” talk, not necessarily the way the world actually works.

      I actually think this is the right version of Occam’s Razor. From our perspective, the simplest hypothesis is the most likely to be correct … “simple” of course depends on our language, definitions, etc. … and it being correct from our viewpoint does not imply that it actually is the case … And then of course, our prior need not be the universe’s prior, but it could be pretty close (or not, and maybe not even close to complying with Occam’s Razor, etc.)

      • bobthebayesian says:

        I think another misconception here is that an event having low entropy, or a string having low Kolmogorov complexity, might make it “easier” to generate. Every string is generated by Nature according to the same physical process (e.g. the physical act of flipping a coin). Nature is not using a short program to generate coin flips of all 1’s and some different, longer program to produce a mixture of 1’s and 0’s. Those programs are just means by which we can measure the after-the-fact complexity of whatever Nature did produce.

        This is another reason why seeing a low entropy or low K-complexity event means something. There simply are fewer of them. They are far outside of the Typical Set mentioned in my other comment to the Parade article linked by Andy. Nature can produce any potential 100-flip sequence with the same physical overhead, but among those sequences, almost all of them will have high entropy.
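        A quick illustrative computation of that last claim (my own numbers, reading “high entropy” loosely as “roughly balanced” for 100 fair coin flips):

        from math import comb

        n = 100
        total = 2 ** n
        # "Jumbled-looking" outcomes: between 40 and 60 heads (inclusive).
        balanced = sum(comb(n, k) for k in range(40, 61))
        print(balanced / total)   # roughly 0.96 of all sequences
        print(1 / total)          # probability of the single all-1's sequence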

  9. wjarjoui says:

    Goodman’s riddle: “why do grue, bleen, and F (x, y) go “against the grain of the world,” whereas green, blue, and multiplication go with the grain?”
    I would rather ask “why does ’emeralds are green’ go with the grain” when the other descriptions do not – isn’t it because we have no reason to believe that after January 2030 the emeralds will change color? In fact, do we not know something stronger than that, namely that emeralds will never change color? If we do know that, then it is clear why “emeralds are green” goes with the grain of the world – because it agrees with a phenomenon established based on the grain of the world, or is induced from the grain of the world. If we do not know whether the emeralds will change color after that date, then “emeralds are green” does not go with the grain of the world; it simply goes with the grain of our intuition.

    On Occam’s Razor:
    We can look at the application of the razor in real life. Natural phenomena usually take place in the most energy-efficient way. This is true even for human beings and animals. Take for example the technique of cats drinking water – it is a very simple procedure but also happens to be the most energy-efficient technique for the cat to drink water (look up paper by P. Reis MIT). So is there a correlation between simplicity and energy cost?

    It is also sometimes important to examine the opposite argument when it comes to the razor. We are inclined today to say a lot of physics is simple and that’s a good reason why it is true. For example, de Wolf describes how Einstein’s theories are true because they are simple – but how much of that simplicity is relative? Maybe we can find something even simpler which is closer to the truth?

  10. Nemion says:

    I liked wjarjoui’s comment about Occam’s Razor and how, in nature, things tend to take place in the most energy-efficient way. The idea that a relationship can be drawn between this principle of low energy cost and simplicity of explanation seems promising. Could it be that low energy cost and short, beautiful explanations are correlated in some way? This seems possible in some cases, but it also seems unlikely in the asymptotic case. If an algorithm’s description is short, that does not mean it will be efficient, and there is at least an intuitive relationship between the energy cost of running an algorithm and its running time. Maybe the right answer here should lie in bounded Kolmogorov complexity?

    As for the universal prior, I understand the motivation behind positing it; nevertheless, even though it complies with our formal requirements, it still seems an arbitrary choice of distribution. It is the kind of principle whose underlying idea is understandable and reasonable, but specifying it precisely diminishes its believability, particularly since there seems to be no way of formally verifying it. As for PAC learning, it gives me more confidence, especially because within the definitions of its framework we can prove a learning theorem consistent with Occam’s Razor. Nevertheless, I am also suspicious as to whether the assumptions the model makes are fair, or generalizable enough to justify Occam’s razor.

  11. amosw says:

    I want to attempt to make a tentative connection between the universal prior and a phenomenon observed many times in science (and in mathematics): namely, that correct theories (or theorems) often have many independent originators. That is, although “stealing of ideas” no doubt sometimes occurs, we have many examples of different people far removed from each other coming up with the same result through very different chains of reasoning.

    I’ll try to do this by recapping the basic goal of inductive inference as: given some observed data D and a set of hypotheses H, we would like to know the probability that it was produced by a given hypothesis h drawn from H. We denote this as p(h|D). The way to compute this is of course with Bayes’ rule: p(h|D) = p(D|h)p(h)/p(D). It is usually pretty easy to compute the likelihood p(D|h), and p(D) is constant with respect to different h’s, so the only real problem is the prior: p(h).

    I’d next like to make the simplifying assumption that at any moment in modern human history, we are only keeping around hypotheses whose likelihood on the data we have thus far gathered is about equal. This assumption allows us to just fold p(D|h) and p(D) together as some constant 1/Z, and now our problem is one of maximizing p(h|D) = p(h)/Z over all h. In other words, this means that we have restricted H to all those h that explain the data about equally well, and now we only really care about p(h).

    At first this seems like a ridiculously hard problem: how can we possibly say a priori that a given hypothesis h1 = 0111101010…011 representing, say, Heisenberg’s matrix mechanics is more probable than another hypothesis h2 = 100101001…001 representing, say, Schrödinger’s wave mechanics?

    Solomonoff gives an answer to this question: the probability of h is the sum of the probabilities of all programs that emit h. And what is the probability of any particular program p? Well, Solomonoff argues that it is simply 1/2^|p|.

    Now, we humans have some (pretty) successful theories currently in operation. What I have attempted to sketch in the above is an argument that either we got really, really lucky, or our theories are produced by short programs, OR they are actually pretty complicated AND are produced by fairly long programs BUT there are many such long programs that produce the same theory. Under the assumption that human cognition is computable, we might then expect that several humans would produce the same theory via different computations.
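    In symbols (my restatement, with U a fixed prefix-free universal machine), the prior just described is

    $$ M(h) \;=\; \sum_{p \,:\, U(p)=h} 2^{-|p|}, $$

    so a hypothesis is a priori probable either because it has one short program or because it has very many longer ones, which is exactly the dichotomy in the last paragraph.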

    • D.R.C. says:

      It is very possible that many long programs have the same simple output, depending on exactly how you define the way we enumerate all TMs of length n. There are two common ways of considering the problem. The first is the very simplistic notion of representing each state/transition diagram as a list of bits and writing them out in some order. The other is a prefix-free variation of the first, where no TM’s encoding is a prefix of another’s.

      The simplistic viewpoint actually makes more sense for Occam’s razor, since it essentially allows small Turing machines to be embedded in larger ones. What happens in this case is that the first k states, for k < n, might form a completely isolated state transition diagram, with all the other bits completely unused; changing any of those unused bits yields the same behavior, which gives many more descriptions of the same machine. So that sub-TM would be the theory chosen by Occam’s razor, since the smaller the TM, the more possible settings of the remaining bits there are.
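      (A quick count to make that concrete; the numbers are purely illustrative. If descriptions are n bits long and a sub-machine is fully determined by its first k bits, then the fraction of the 2^n equally likely descriptions that realize it is

      $$ \frac{2^{\,n-k}}{2^{\,n}} \;=\; 2^{-k}, $$

      which is the same weight a prefix-free 2^{-|p|} convention would assign to a k-bit program.)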

      • bobthebayesian says:

        This brings up an interesting question. Is it possible to define a useful, objective concept that measures the “relevance” of an algorithm? For example, suppose we wanted to write a program that takes an input integer N and then computes and outputs the sum of the integers from 1 to N, in Python, say:

        #—-Program (1)—-
        def sum1(N):
            total = 0
            for i in range(N):
                total = total + (i+1)
            return total

        #—-Program (2)—-
        def sum2(N):
            return N*(N+1)/2

        #—-Program (3)—-
        def sum3(N):
            time_waster = N*N*N
            for k in range(time_waster):
                print "Waiting to start"
            return sum2(N)  # Or just explicitly copy the code of sum2 here

        Clearly, the three of these have different complexities. But I would argue that we view Program (1) and Program (2) as spending all of their computation “on task” and the arbitrary for-loop in Program (3) is wasting time. Its slowdown is artificial.

        Now suppose we had a more difficult algorithm than just summing, and we did not really know if certain portions of the algorithm could safely be omitted or synopsized in a closed-form formula. Would there still be any objective way to look at the algorithm and decide if the steps it is taking are “directly related” to the output that is desired? Would some of its subroutines be “more directly related” to the desired output than other subroutines?

        I’m not sure that complexity is the answer here. For example, imagine re-writing sum2() in a way that explicitly assigns M = (N+1)/2 and then returns N*M. For large N, doing it with an intermediate storage shouldn’t matter, but to a human, doing that intermediate storage is slightly “off task” — why not just return it all at once like sum2() does? Compilers might make these differences practically meaningless, but in terms of which one linguistically expresses a simpler algorithm, this idea of “computational relevance” might make a difference. And it surely must make a difference when comparing large, unwieldy, and inscrutable machine learning algorithms.

        I’m worried my examples above are just superficial though. I think there’s something here, so any help on thinking of a less trivial example is welcome.

      • bobthebayesian says:

        (WordPress didn’t like my indentations, but I’ll assume it’s easy enough to see what the correct Python code is doing).

  12. Hagi says:

    I do agree with most people on this thread that we have not seen a conclusive, non-circular justification for Occam’s razor, yet. It certainly is not obvious to me that such a justification would exist.

    For this reason a practical perspective on Occam’s Razor is more attractive to me. I liked the interpretation of Occam’s Razor we learnt in class that uses the VC dimension in the PAC learning framework. If the VC dimension of a hypothesis class is very large or infinite, then the predictive power of this class will not be very good. This was also the best description of over-fitting I have come across.
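    (For reference, one standard form of the VC-dimension sample-complexity bound we saw; constants differ between sources:

    $$ m \;=\; O\!\left( \frac{1}{\epsilon}\Big( d \,\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \Big) \right), $$

    where d is the VC dimension, so an infinite d yields no finite guarantee at all, matching the over-fitting intuition above.)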

    However, I find not having a non-circular justification for Occam’s Razor worrying; since it is the only argument I have for Peli Grietzer’s question at the very top of this thread. Assuming Occam’s Razor, it makes sense to me that given our experience with the universe so far, it will most probably keep being regular enough for induction. However if we do not accept Occam’s Razor as an ontological principle, then I do not know how to justify to myself that induction will not stop working soon. This is certainly a reason to suspect that maybe there does exist a non-circular, fundamental justification for Occam’s Razor.

    • bobthebayesian says:

      What would it mean for induction to stop working? As in, how could we possibly detect that that was the case? I think this is a very interesting line of questioning, but I cannot come up with a good idea for what that could correspond to. I think it’s due to a lack of imagination on my part… but one thing I hope to examine in the final paper is the falsifiability of Occam’s razor. Is it falsifiable? This might further motivate use of Kolmogorov complexity.

  13. Andy Drucker says:

    Thanks everyone for the probing comments. But my head is spinning a bit. Can we try to apply these concepts to a simple, controlled situation and see what the different accounts of inductive reasoning give us?

    Let’s imagine there are two websites, SimpleBits.com and RandomBits.com. These websites are maintained by an unknown, untraceable person, who covers his/her/its tracks extremely well. Every minute for the past year, each website has been publishing a new bit to a growing list. So far, the list on SimpleBits looks like 11111111111111111111111111…. , and that on RandomBits looks like 50-50 random bits to all tests performed so far (by the computationally-bounded public).

    So, pick your favorite induction scheme, and we can ask:

    (a) what is your prediction for the next bit in each case? Why?

    (b) what if anything do you *know* about each source?

    (c) what, if anything, is the probability of each outputting a 0 next? or, a string of one hundred 0’s next?

    (d) Are the previous two questions clear? Can we reach consensus on the answers, or is everything dependent on controversial background assumptions?

    (e) If we can’t reach consensus on whether we can achieve *knowledge* of the future of these two sites, or even on their future probability distributions, perhaps we can still give accounts of how some kind of “practical success” or “mastery” is still possible on these two sites. This might connect up to the concept of a “reliable source” brought in by Edward.

    “Practical success” here might involve shifting our attention from the very next bit, to some kind of longer-term success over the entire sequence. And of course, we’ll probably have to assume *something* in our accounts. I have some thoughts about this, but will hold back and see if others want to chime in.

  14. Miguel says:

    A quick comment on the issue of an ontological justification of Occam’s Razor: I agree with wjarjoui’s and Nemion’s comments above in that a variational account seems to be an appealing argument for simplicity in nature. From statistical mechanics we know how conserved physical systems made of very large numbers of locally interacting components can exhibit simple macroscopic behaviors: as the number of components increases, any given empirical distribution based on observed data will converge exponentially fast to the ‘true’ Boltzmann distribution of the system. Furthermore, for low temperatures, again as the number of components increases, such a system will be overwhelmingly likely to persist in a small number of (ground) states, and thus will be very simple to describe. So given some data collected from a small number of components of the system, and given two hypotheses that fit it (one some complicated, overfitted account; the other one a simple histogram of the frequency of the n most likely states, for small n), the simpler hypothesis is provably more likely to be correct (particularly so if the system’s temperature happens to be low).
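    (For concreteness, the distribution being referred to; this is standard statistical mechanics, not anything specific to the argument here:

    $$ p(s) \;=\; \frac{e^{-E(s)/k_B T}}{Z}, \qquad Z = \sum_{s} e^{-E(s)/k_B T}, $$

    and as the temperature T decreases, the weight concentrates on the lowest-energy (ground) states, which is the sense in which the macroscopic description becomes very simple.)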

    Admittedly such classical thermodynamical systems are very restricted cases with a *lot* of structure (‘true’ distribution governing micro behavior, conserved quantities, local interaction, etc.), but they are very common in nature, and the compression achieved by going from a component-by-component description of, say, a gas to a three-parameter description is quite remarkable. We have known these facts for a very long time for ideal gases, spin lattices and the like, but perhaps they point at a non-anthropic account for the ubiquity of low K explanations in nature, especially since the weak convergence principles on which they rely are applicable to broader computational settings.

  15. It occurs to me that because any finite system can be described just by writing out all its components in full, arbitrarily complex theories are already ruled out a priori. I wonder if this has any relevance to the Occam’s Razor question.

  16. sbanerjee says:

    In attempting to find a good explanation for why Occam’s Razor is so appealing, I would like to refer to the over-fitting PAC learning argument that others have made. The argument is essentially that when tackling a learning task with a PAC model, simpler models are better just because they have lower generalization error. In the space of all possible models that fit a specific data set, there are a larger number of complex models, purely because the allowance of greater complexity lets us tweak more and more variables. This makes it more likely that a data set is fit by some complex model, and this leads to more generalization error, since the complexity of the model hints at a higher level of overfitting.
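    To make that concrete, here is a small illustrative sketch (mine, not from any of the papers discussed): fit a low-degree and a high-degree polynomial to the same noisy data and compare training error with held-out error.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 20)
    x_test = np.linspace(0, 1, 200)

    def true_f(x):
        return 3 * x - 1                 # simple underlying rule

    y_train = true_f(x_train) + rng.normal(0, 0.3, x_train.size)
    y_test = true_f(x_test) + rng.normal(0, 0.3, x_test.size)

    for degree in (1, 15):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_err, 3), round(test_err, 3))
    # Typically the degree-15 fit has lower training error but higher held-out error.

    The richer class contains a hypothesis that fits the training noise, which is exactly the generalization-error story in the PAC argument.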

    Now consider the counter-argument found in Section 3 of the paper ‘The Role of Occam’s Razor in Knowledge Discovery.’ It suggests that Occam’s Razor and simplicity might not be the key to lowering generalization error. In the paper, Domingos breaks Occam’s Razor into two versions – one that just calls for simplicity and one that argues from generalization error:

    “Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.”

    He goes on to talk about specific arguments against the correctness of this second version of the razor and one argument that specifically stood out to me was the one he presented in Section 3.3:

    “According to conventional wisdom, overfitting is caused by overly complex models, and Occam’s razor combats it by introducing a preference for simpler ones. However, drawing on the study of multiple comparison problems in statistics (Miller, 1981), Jensen and Cohen (1999) have shown that overfitting in fact arises not because of complexity per se, but because attempting a large number of models leads to a high probability of finding a model that fits the training data well purely by chance. Attempting 10 complex models incurs a smaller risk of overfitting than attempting 100 simple ones. Overfitting is thus best combated not by the second razor, but by taking this multiple testing phenomenon into account when scoring candidate models.”

    This counter-argument does seem to break the argument that simplicity is better because it stops overfitting, but I would argue that it still leaves room for the argument that simplicity discourages overfitting. Occam’s Razor does not have any specific magic that helps make simplicity the winner all the time, but it does aid the argument that if you are working with simplicity then you are likely to have less generalization error.

