Cutting Away The Misconceptions About Occam’s Razor

Introduction

\(\newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}}\) Life is full of uncertainties, yet it constantly requires decisions and explanations. The Parsimony Principle is a heuristic in philosophy for approaching life’s uncertainties. This principle is commonly phrased as “Do not multiply entities beyond necessity,” known as Ockham’s Razor. Although this exact phrasing does not appear in the texts of William of Ockham, it is closely associated with him because of his philosophical views. In the following, we will clarify Ockham’s Razor and refer to it as the razor in our interpretation. Then, we will show some instances of the razor in machine learning (ML), a fast-growing field that needs mathematical formulations for its complex models.

Ockham’s razor does not state that “from two explanations, the simpler one is more likely to be true.” The razor is not a tool for deciding what is true. Nevertheless, some adopt the razor from philosophy in exactly that way. For example, one such account states, “A simple theory is always preferable to a complex theory if the complex theory does not offer a better explanation,” and tries to justify this with probabilistic arguments. The problem with that approach is that, as its author admits, the resulting razor is so trivially correct that, ironically, calling it a razor seems pointless: it merely describes how things obviously work. Adopting the principle this way is not very helpful and conflicts with the principle itself. Furthermore, the principle of parsimony is meant to be a heuristic, not a decision tool. Therefore, in the following, we will define the razor as an abstract tool that we use in science in its different forms.

Definition

Let us define the simplicity of a hypothesis as the cost of testing it. This cost can be the computational resources needed to test the hypothesis or the resources spent on training the people who test it. We can only prove the hypotheses we can test; therefore, any hypothesis we manage to prove is, by definition, simple enough relative to our resources. So we have no choice but to adopt simplicity when developing hypotheses. We do not like simplicity because it is inherently good or faithful; we like it because it is affordable. In this spirit, we will define the razor as the tool that simplifies the test of a hypothesis in each case. In science, the razor is sometimes the experimental setup we create, which makes it simpler to see whether our hypothesis is correct. This way, even the most complicated truths can be understood with time. This reading also fits the razor analogy better, as the tool helps us cut away the parts that make it harder for the collective eyes of science to see the truth.

In our current definition, a hypothesis does not necessarily have to be easy to explain or interpret; what makes it easier to test might simply be the short runtime of the testing software. This is why one could try to make the definition of simplicity more specific. Ideally, one would define simplicity as the relative ease of explaining a concept to a layperson, but we do not think this kind of simplicity is independent of efficiency; in other words, we may find something easier to understand only because we were introduced to similar concepts earlier, and those concepts were introduced because they were so profitable from a resource-cost point of view.

Parsimony Principle In Action

Although the definition provided for the razor was very abstract, we can see its instances in some mathematical formulations. For example, consider robust ML, a field concerned with providing guarantees for the reliability of models; such a guarantee can take the form of proving whether a model will change its prediction under a minimal change in the input.

Let \(f:\R^m\rightarrow \R\) be a fully connected neural network for binary classification with \(\operatorname{ReLU}\) activations on all layers except the last. Let \(f(\mathbf{x})\geq 0\) indicate class A (labeled \(1\)) and \(f(\mathbf{x})<0\) indicate class B (labeled \(-1\)). For every input \(\mathbf{x}\in \R^m\) and every perturbation \(\mathbf{\tilde x}\) such that \(\|\mathbf{\tilde x}-\mathbf{x}\|_{\infty} \leq \epsilon\), where \(\epsilon\) is the allowed distance from the given input, we try to prove \(\operatorname{sign} f(\mathbf{x}) = \operatorname{sign} f( \mathbf{\tilde x}) = c\), where \(c\) is the label of the input \(\mathbf{x}\) in our dataset. To guarantee this, we solve

\[\min_{\mathbf{\tilde x}} \operatorname{sign}{(f(\mathbf{x}))}f(\mathbf{\tilde x})\]

subject to

\[\|\tilde{\mathbf{x}}-\mathbf{x}\|_\infty \leq \epsilon.\]

Note that \(\operatorname{sign}{(f(\mathbf{x}))}\) is a constant in this minimization. After finding an optimal \(\mathbf{\tilde x}\), we check whether the prediction has changed, as the optimum gives us the worst-case margin. If \(f(\mathbf{x})f(\mathbf{\tilde x}) >0\), the model is robust for that input; if \(f(\mathbf{x})f(\mathbf{\tilde x}) <0\), there is an adversarial example for that input.
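For intuition, consider the special case where \(f\) is linear, \(f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b\). There the inner minimization has a closed form: each coordinate of the perturbation moves by \(\epsilon\) in the direction that hurts the margin most, so the worst-case value is \(f(\mathbf{x}) - \operatorname{sign}(f(\mathbf{x}))\,\epsilon\|\mathbf{w}\|_1\). A short sketch (with made-up weights and input, purely illustrative) verifies this against the explicit worst-case corner:

```python
import numpy as np

# Hypothetical linear "network" f(x) = w @ x + b (no hidden layers).
# Under the l-infinity ball of radius eps, the inner minimization of
# sign(f(x)) * f(x + d) is achieved at d = -sign(f(x)) * eps * sign(w),
# giving the worst-case margin f(x) - sign(f(x)) * eps * ||w||_1.
w = np.array([2.0, -1.0, 0.5])
b = 0.25
x = np.array([1.0, 0.0, -2.0])
eps = 0.1

fx = w @ x + b                                     # clean prediction
worst = fx - np.sign(fx) * eps * np.abs(w).sum()   # worst-case margin

# The explicit worst-case corner of the ball agrees with the formula:
x_adv = x - np.sign(fx) * eps * np.sign(w)
assert np.isclose(w @ x_adv + b, worst)

robust = fx * worst > 0   # sign unchanged for every allowed perturbation
```

For a ReLU network the same minimization no longer has a closed form, which is exactly why the certification problem below becomes hard.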

This is called \emph{exact certification}, and it turns out to be an NP-complete problem. However, in practice, the situation is not that bad: we can currently verify relatively small datasets and models by formulating a generalized version of this problem as a mixed-integer linear program (MILP).
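The key step in such a MILP formulation is encoding each ReLU unit with a binary variable. As a standard sketch (the bounds and variable names here are illustrative, not a specific paper’s notation), assume the pre-activation \(z\) has known bounds \(l \leq z \leq u\) with \(l < 0 < u\); then the nonlinearity \(y = \max(0, z)\) can be replaced by the linear constraints

\[y \geq z, \qquad y \geq 0, \qquad y \leq z - l(1-a), \qquad y \leq u\,a, \qquad a \in \{0,1\}.\]

When \(a = 1\), the constraints force \(y = z\) (the unit is active); when \(a = 0\), they force \(y = 0\). The number of such binary variables drives the solver’s worst-case cost, which is why techniques that reduce or eliminate them pay off.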

After introducing this as a MILP, the authors provided an algorithm to improve efficiency by reducing the number of binary variables in the optimization problem. They also reported a heuristic sparsification algorithm that sets some small weights of the model to zero to decrease the runtime of the verification. Setting some weights to zero improves the runtime because the solver has fewer variables to optimize over; however, setting large portions of the weights to zero decreases accuracy. This heuristic sparsification algorithm might be an example of applying the parsimony principle: for a complicated problem like this, one might need to sacrifice accuracy to give mathematical guarantees for an ML model. Giving such guarantees makes ML models less of a black box, and the overall goal is to minimize these sacrifices over time. For example, later work proposed ways of encouraging sparsity during training, which can increase both the accuracy of the model and the efficiency of verifying it.
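The sparsification heuristic can be sketched as simple magnitude pruning: zero out the weights with the smallest absolute values. This is a minimal illustration of the idea, not the exact criterion used in the work discussed above, and the function name and threshold rule are assumptions for the example:

```python
import numpy as np

def sparsify(weights, frac=0.5):
    """Magnitude-pruning sketch: zero out the fraction `frac` of
    entries with the smallest absolute value. Fewer nonzero weights
    mean fewer terms (and constraints) for the verifier to handle,
    at the cost of some accuracy if `frac` is too large."""
    w = weights.copy()
    k = int(frac * w.size)
    if k == 0:
        return w
    # Threshold at the k-th smallest magnitude; ties may prune extra.
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    w[np.abs(w) <= thresh] = 0.0
    return w

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))          # a toy weight matrix
W_sparse = sparsify(W, frac=0.5)
assert (W_sparse == 0).sum() >= 8    # at least half the entries pruned
```

The trade-off the text describes shows up directly in `frac`: larger values shrink the verification problem but move the pruned model further from the original.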

Conclusion

Putting complex phenomena into formal and testable settings aligns with the parsimony principle, and having to make some reductions along the way reinforces the sentiment of not multiplying entities without necessity. To be more specific, in the last example we presented works that try to make ML models less of a black box by putting them into a formal, testable setting. There are also some notably related applications in ML. For example, the field of differential privacy is focused on giving probabilistic guarantees that a model will not memorize sensitive information. Another example is explainable artificial intelligence, which provides tools to explain the behavior of ML models. These are all tools for ultimately understanding ML models; and if we understand them, they will be simpler.