The Unexpected Lesson Within A Jelly Bean Jar


At a livestock fair in Plymouth, England, in 1906, a statistician named Francis Galton collected around 800 guesses of the weight of an ox that was on display. He then calculated the median of all the estimates, which came out at 1207 lbs¹. To his surprise, the measured weight of the ox was 1198 lbs, which put the median estimate less than 1% off (about 0.8%) from the real weight. As Galton himself noted¹:

…the middlemost estimate expresses the vox populi, every other estimate being condemned as too low or too high by a majority of the voters

This effectively means that as a group, or as a collection of independent thinkers, we are very, very good estimators.

Since I love data and science, I wanted to replicate this experiment myself, so not long ago I did it at my office in my own way: I ran the Jelly Bean Jar game, which you might have heard of before.

I bought a jar and filled it with exactly 490 jelly beans (yes, I counted them all). Then, like Sir Francis Galton did, I asked 30 of my co-workers to estimate the number of jelly beans in the jar. To my surprise, the distribution of estimates looked like this:

[Figure: distribution of the 30 estimates of the number of jelly beans in the jar]

The mean estimate was 487, only three jelly beans off from the ground truth! With this simple experiment I became more and more convinced that the vox populi, or the Wisdom of the Crowds¹ ², is a real thing.

As a group, we are very good estimators; individually, not so much.

NOTE: Patient individuals outperformed those who made wild guesses. In my experiment, some individuals measured the volume of the jar and estimated the volume of a single jelly bean, then extrapolated from these to the number of jelly beans in the jar. Others simply went: “Hmm, I don’t know… 1000” (see the figure). Nonetheless, all estimates were centered around a single value: the ground truth. Keep this in mind.
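As a rough illustration of that “patient” strategy, here is a minimal sketch of the volume-based estimate. The jar volume, bean volume, and packing fraction below are made-up illustrative numbers (the packing fraction, in particular, is my own assumption and was not part of my colleagues’ reasoning), not measurements from my actual jar.

```python
# A minimal sketch of the "patient" estimation strategy described above.
# All numbers are made-up illustrative values, NOT measurements from my jar.
jar_volume_ml = 1000.0     # hypothetical volume of the jar
bean_volume_ml = 1.5       # hypothetical volume of a single jelly bean
packing_fraction = 0.7     # beans leave air gaps; a rough assumption

estimate = jar_volume_ml * packing_fraction / bean_volume_ml
print(f"Estimated number of beans: {estimate:.0f}")
```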

In the rest of this essay I will compare this vox populi principle with a topic that has held my interest for a long time. It might sound crazy, but I think Artificial Neural Networks³ share common ground with it, especially because in both cases a collection of parts is given one single task and works together to solve it. I hope that you too feel this way by the end of the text.

A good way to start this comparison is probably to define what neurons do in Artificial Neural Networks. I found this description rather compelling and simple to understand⁴:

Each neuron receives one or more input signals x₁, x₂, …, xₘ and outputs a value y to neurons of the next layer and so forth. The output y is a nonlinear weighted sum of input signals.

From this point of view, the neurons in an ANN are the individuals in a collective thinking process. In fact, the de facto architecture of ANNs is a collection of connected individual regressors³. The output of a neuron with n input neurons is defined by⁵:

h_W,b(x) = f( W₁x₁ + W₂x₂ + … + Wₙxₙ + b ) = f( Σᵢ Wᵢxᵢ + b )

Each output h is then a function, with parameters W and b, of the sum of individual linear regressions over all inputs x, and it in turn becomes the input (after an activation function, usually non-linear³ ⁶) of the next layer. The neurons collectively, and only collectively, solve tasks. Try building an ANN classifier for a complex task with a single neuron; you are most probably going to fail. It would be like Galton asking one single person to estimate the ox’s weight: the estimate is probably going to be wrong. It is here where ANNs really work collectively.
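To make the formula above concrete, here is a minimal NumPy sketch of a single neuron computing f(Σᵢ Wᵢxᵢ + b) with a sigmoid as the non-linearity; the weights, bias, and inputs are arbitrary illustrative values, not taken from any trained model.

```python
import numpy as np

def sigmoid(z):
    # A common (non-linear) activation function f
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, W, b):
    # h_{W,b}(x) = f( sum_i W_i * x_i + b ):
    # a non-linear function of a weighted sum of the inputs
    return sigmoid(np.dot(W, x) + b)

# Arbitrary illustrative values
x = np.array([0.5, -1.2, 3.0])    # inputs coming from the previous layer
W = np.array([0.4, 0.1, -0.6])    # one weight per input
b = 0.2

print(neuron_output(x, W, b))     # single output passed on to the next layer
```

This concept can be visualized in the next example: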

[Figure: a trained neural network taking the 784 pixels of a handwritten “2” as input and classifying the digit]

In the image above, the trained NN takes as input 784 features from the image of a “2” and classifies it accordingly. The complexity of the system increases drastically with each added neuron, but so does the number of possible feature combinations, which effectively pushes up the performance of the classifier. Add too many, though, and you will be a victim of overfitting⁷. I recommend visiting Google’s TensorFlow Playground to understand these and other concepts better; there you can see the effect each added (or removed) neuron has on a simple classifier. Try training the model with only the first two features (X₁ and X₂) and see the results. Now do it with more. Can you find the minimum number of neurons needed to get good results? Do you need many neurons/layers for simple tasks? The answer is no. I will get back to this in a moment.
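If you prefer to play along in code rather than in the browser, here is a rough, hedged stand-in for that exercise using scikit-learn. The circular toy dataset and the specific neuron counts are my own choices, not the Playground’s exact setup.

```python
# Train the same kind of tiny network with an increasing number of hidden
# neurons on a simple 2-feature "circle" dataset and watch when the test
# accuracy stops improving.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_neurons in (1, 2, 4, 8, 16):
    clf = MLPClassifier(hidden_layer_sizes=(n_neurons,), max_iter=2000,
                        random_state=0)
    clf.fit(X_train, y_train)
    print(f"{n_neurons:2d} hidden neurons -> test accuracy: "
          f"{clf.score(X_test, y_test):.2f}")
```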

Going back to oxen and jelly beans, this is like finding the minimum number of individuals required for a very good estimate. Surely asking 10,000 people about the weight of the ox would reduce the error further, but with around 800 we were already within roughly 1% of the ground truth. Increasing the complexity of an algorithm is useful only while the desired output has not been reached; beyond that point, computationally speaking, it is best to reduce the number of estimators to the minimum required for the desired performance. The vox populi reduces the cost of the computation once this balance is found. To understand this, we can look at the next figure, which I quickly made in Python:

[Figure: mean of normally distributed samples (μ = 1, σ = 0.1) as the sample size grows from 10 to 1000]

We can create a set of random normal distributions with μ = 1 and σ = 0.1 while increasing the number of samples from 10 to 1000. Because we know by design that the ground-truth mean is 1, we can compute the average of each set and see how close it gets to μ. As you might have guessed, the more data we have the better: our estimate gets closer and closer to the ground truth. With infinitely many samples we would reach μ exactly, but that is impractical for obvious reasons. It might even be that 1000 samples is too costly for whatever reason and we decide to use the set with 500 points for our analysis, which yields an error that satisfies our needs. That is our sweet spot: enough samples to keep the error acceptably low, but few enough to keep the cost down. Artificial Neural Networks follow a similar (albeit not identical, mind you) principle.
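For reference, here is a minimal sketch of the simulation behind the figure above; the exact sample sizes and the random seed are arbitrary choices of mine.

```python
# Draw normally distributed samples (mu = 1, sigma = 0.1) at increasing
# sample sizes and check how far the sample mean drifts from the ground truth.
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 1.0, 0.1

for n in (10, 50, 100, 500, 1000):
    samples = rng.normal(loc=mu, scale=sigma, size=n)
    print(f"n = {n:4d}   mean = {samples.mean():.4f}   "
          f"|error| = {abs(samples.mean() - mu):.4f}")
```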

Although there are some general rules of thumb for how many neurons and layers you should use⁸, choosing these numbers is a problem I frequently run into while building Deep Neural Networks. Too many neurons and/or layers for a rather simple problem will probably cause severe overfitting (asking 10,000 individuals about the ox’s weight, or using 1000 points or more in our previous example). Too few and you will not be able to generalize your model to blind test data. In a way, then (and very generally speaking), ANNs feel most comfortable with a balance of simplicity and complexity.

Going back to Google’s TensorFlow Playground, we can see that a simple ANN takes very little time to reach low loss values on a very simple classification task:

[Figure: TensorFlow Playground, a simple network reaching low training and test loss on a simple classification task]

Although trivial, this exemplifies perfectly the point I am trying to convey. Test and training loss reach ~0.05 in about 350 epochs (see the values just above the scatter plot). Now let’s see what happens with an overly complex ANN classifying the same data with the same parameters:

[Figure: TensorFlow Playground, an overly complex network on the same data, still at higher loss after roughly 200 epochs]

After almost 200 epochs, the loss values have still not reached the levels of the previous example. If we wait long enough, the network does the job, but recall our normal-distribution example from the previous paragraphs: you could use thousands of points to get a “better” estimate of μ, yet the small reduction in error would not compensate for the extra cost. The same is happening here. Even before we consider overfitting⁹, the latter architecture is simply too costly for such a task. The first architecture does the job perfectly at very low cost, so choosing it over the other is the wiser decision.
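As a rough scikit-learn analogue of the two Playground runs, here is a sketch comparing the training cost of a small network and an intentionally oversized one on the same simple task; the dataset and architectures are illustrative choices of mine, not the Playground’s exact configuration.

```python
# Compare training cost: a small network vs. an oversized one on the same
# simple 2-feature classification task.
import time
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)

for name, layers in (("small", (4,)), ("oversized", (128, 128, 128))):
    clf = MLPClassifier(hidden_layer_sizes=layers, max_iter=1000,
                        random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"{name:>9}: training time = {elapsed:.2f} s, "
          f"final training loss = {clf.loss_:.3f}")
```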

I like to think of AI models, and specifically Deep Neural Networks, as complex systems that should be built as simply as possible. Believe it or not, my Jelly Bean Jar experiment helped me understand this principle. Both cases require partitioning a task (just enough) so that it can be solved collectively, and this seems to be the best solution. As Albert Einstein himself noted in a lecture in 1933¹⁰:

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

I can’t argue with that. Can you?

Thank you for reading!

References:

[1] Galton, F. Vox populi (1907), Nature, 75(7), 450–451.

[2] Text on the story of Wisdom of the Crowds: https://towardsdatascience.com/on-the-wisdom-of-crowds-collective-predictive-analytics-302b7ca1c513

[3] ANN resources: https://towardsdatascience.com/nns-aynk-c34efe37f15a

[4] Koutsoukas, A., Monaghan, K. J., Li, X., & Huan, J. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data (2017), Journal of cheminformatics, 9(1), 42.

[5] Great text on the foundations of Multilayer Neural Networks: http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/.

[6] Activation functions: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6.

[7] A word on overfitting: https://www.jeremyjordan.me/deep-neural-networks-preventing-overfitting/

[8] https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

[9] https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a

[10] Robinson, A. Did Einstein really say that? (2018) Nature, 557(7703), 30–31.
