Probability is a fascinating yet often perplexing field because it deals with uncertainty and randomness, concepts that can be challenging to intuitively grasp. Many probability problems appear simple on the surface but reveal surprising and counterintuitive results upon closer inspection. For example, the Monty Hall problem, where switching doors in a game show leads to better odds of winning, defies most people’s gut instincts. This highlights the importance of rigorous mathematical analysis in probability. Without careful calculation and logical reasoning, our intuitive judgments can lead us astray, underscoring the need for a systematic approach to understanding and applying probability in real-world situations. Today, we will tackle such a problem.
Problem Statement
Quoting from Quanta magazine, [1]
“Imagine that you have an urn filled with 100 balls, some red and some green. You can’t see inside; all you know is that someone determined the number of red balls by picking a number between zero and 100 from a hat. You reach into the urn and pull out a ball. It’s red. If you now pull out a second ball, is it more likely to be red or green (or are the two colors equally likely)?”
Let us define the number of balls in the urn as \(N\). Originally, there are \(N+1\) different possible configurations for the urn, with the number of red balls ranging from \(0\) to \(N\). Let us label the urns as \(u_i\), with \(i\in \{0,1,2,\cdots,N\}\). Since the number of red balls is pulled from a uniform distribution, the probability of getting any \(u_i\) is simply \(\frac{1}{N+1}\).
First, we will shut up and apply the rigorous machinery of probability theory. Second, we will run a simulation to double-check the results.
Shut up and calculate
Given that the first ball is red, the probability that it came from urn \(u_i\) (i.e., an urn with \(i\) red balls) is
\[\begin{eqnarray}
P(u_i|x_1=R)=\frac{P(x_1=R|u_i) P(u_i)}{P(x_1=R)},
\tag{1}
\end{eqnarray}\]
which is Bayes' formula. We can compute the denominator as
\[\begin{eqnarray}
P(x_1=R)=\sum_{i=0}^N P(x_1=R|u_i)P(u_i)=\sum_{i=0}^N \frac{i}{N}\frac{1}{N+1}=\frac{1}{2}.
\tag{2}
\end{eqnarray}\]
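As a quick sanity check before moving on, the sum in Eq. (2) can be evaluated numerically (a minimal R sketch; the variable names are my own, and \(N=100\) matches the original problem):

```r
# Numerically verify Eq. (2): P(x1 = R) = sum_i (i/N) * 1/(N+1) = 1/2
N <- 100
i <- 0:N
p_first_red <- sum((i / N) * (1 / (N + 1)))
p_first_red  # 0.5
```

The value is exactly \(1/2\) for any \(N\), as the closed-form sum shows.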
And the numerator as
\[\begin{eqnarray}
P(x_1=R|u_i) P(u_i)= \frac{i}{N}\frac{1}{N+1}=\frac{i}{N(N+1)}.
\tag{3}
\end{eqnarray}\]
Putting Eqs. (2) and (3) back into Eq. (1) gives
\[\begin{eqnarray}
P(u_i|x_1=R)=\frac{2 i}{N(N+1)}.
\tag{4}
\end{eqnarray}\]
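A quick check that Eq. (4) is a proper posterior distribution, i.e., that it sums to one (a small R sketch with hypothetical variable names):

```r
# The posterior of Eq. (4), P(u_i | x1 = R) = 2i / (N(N+1)), sums to 1
N <- 100
i <- 0:N
posterior <- 2 * i / (N * (N + 1))
sum(posterior)  # 1
```

Note that the posterior grows linearly in \(i\): urns with more red balls are proportionally more likely once we have seen a red ball.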
This result is important. Originally, the probability of getting an urn with \(i\) red balls was flat at \(\frac{1}{N+1}\). However, the fact that the first ball is red implies that it more likely came from an urn with more red balls, see Fig. 1. Now that we have a posterior distribution over \(u_i\), we can compute the probability of getting a red ball in the second draw:
\[\begin{eqnarray}
P(x_2=R|x_1=R)&=&\sum_{i=0}^N P(x_2=R|u_i,x_1=R)P(u_i|x_1=R)= \sum_{i=0}^N \frac{i-1}{N-1} \frac{2 i}{N(N+1)}=\frac{2 }{(N-1)N(N+1)}\sum_{i=0}^N \left( i^2-i\right) \nonumber\\
&=&\frac{2 }{(N-1)N(N+1)}\left[ \frac{N(N+1)(2N+1)}{6}-\frac{N(N+1)}{2}\right]
=\frac{2N+1}{3(N-1)}-\frac{1}{N-1}\nonumber\\
&=&\frac{2}{3}
\tag{5}
\end{eqnarray}\]
Therefore, getting a red ball in the second draw is twice as likely as getting a green ball, independent of how many balls are in the urn.
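To see that the \(2/3\) result really is independent of the urn size, we can evaluate the sum in Eq. (5) numerically for several values of \(N\) (an illustrative R sketch; `second_red` is a helper name I made up):

```r
# Check Eq. (5): P(x2 = R | x1 = R) = 2/3 for any urn size N >= 2
second_red <- function(N) {
  i <- 0:N
  posterior <- 2 * i / (N * (N + 1))    # Eq. (4)
  sum(((i - 1) / (N - 1)) * posterior)  # second-draw red probability
}
sapply(c(2, 10, 100, 1000), second_red)  # all equal 2/3
```

The \(i=0\) term contributes nothing since the posterior vanishes there, so the negative \((i-1)\) factor causes no trouble.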
A tweak
The statement “All you know is that someone determined the number of red balls by picking a number between zero and 100 from a hat” is critical. To show why, let’s revise the preparation of the urn as follows:
“The urn is prepared by adding balls to the urn one by one by flipping a coin. If the coin is heads, you add a red ball; otherwise, a green one.” This completely transforms the question. The number of red balls in the urn will now follow the binomial probability mass function:
\[\begin{eqnarray}
f_b(i, N)={N \choose i}p^i(1-p)^{N-i}
\tag{6}.
\end{eqnarray}\]
Now, we just need to redo the math. Given that the first ball is red, the probability that it came from urn \(u_i\) (i.e., an urn with \(i\) red balls) is
\[\begin{eqnarray}
P(u_i|x_1=R)=\frac{P(x_1=R|u_i) P(u_i)}{P(x_1=R)},
\tag{7}
\end{eqnarray}\]
which is again Bayes' formula. We can compute the denominator as
\[\begin{eqnarray}
P(x_1=R)=\sum_{i=0}^N P(x_1=R|u_i)P(u_i)=\sum_{i=0}^N f_b(i, N)\frac{i}{N}=\frac{1}{N}\langle i \rangle=p
\tag{8}.
\end{eqnarray}\]
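Eq. (8) can also be checked numerically with R's built-in binomial pmf `dbinom` (a minimal sketch; \(N\) and \(p\) are assumptions matching the text):

```r
# Check Eq. (8): under binomial preparation, P(x1 = R) = <i>/N = p
N <- 100
p <- 0.5
i <- 0:N
sum(dbinom(i, size = N, prob = p) * i / N)  # 0.5
```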
For a fair coin used in the preparation \(p=1/2\).
The numerator reads
\[\begin{eqnarray}
P(x_1=R|u_i) P(u_i)=\frac{i}{N} f_b(i, N).
\tag{9}
\end{eqnarray}\]
Putting Eqs. (8) and (9) back into Eq. (7) gives
\[\begin{eqnarray}
P(u_i|x_1=R)=\frac{1}{N p} i f_b(i, N).
\tag{10}
\end{eqnarray}\]
Now that we have a probability distribution for having \(u_i\), we can compute the probability of getting a red ball in the second draw:
\[\begin{eqnarray}
P(x_2=R|x_1=R)&=&\sum_{i=0}^N P(x_2=R|u_i,x_1=R)P(u_i|x_1=R)= \frac{1}{pN(N-1)} \sum_{i=0}^N (i-1)i f_b(i, N)
=\frac{1}{pN(N-1)}\left( \langle i^2 \rangle -\langle i\rangle \right)\nonumber\\
=\frac{1}{pN(N-1)}\left( \langle i^2 \rangle -\langle i\rangle \right)\nonumber\\
&=&\frac{1}{pN(N-1)}\left( \langle i \rangle^2 +\sigma^2 - N p\right)
=\frac{1}{pN(N-1)}\left(N^2 p^2+N p(1-p)- N p\right)\nonumber\\
&=&\frac{1}{pN(N-1)}N(N-1)p^2\nonumber\\
&=&p,
\tag{11}
\end{eqnarray}\]
which is simply the heads probability of the coin used to build the urn! Therefore, getting a red ball in the second draw is as probable as getting a green one, provided that the coin was unbiased.
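The chain of identities in Eq. (11) can be verified numerically as well (an R sketch building the posterior of Eq. (10) from `dbinom`; it also works for a biased coin, e.g. \(p=0.3\)):

```r
# Check Eq. (11): with binomial preparation, P(x2 = R | x1 = R) = p
second_red_binom <- function(N, p) {
  i <- 0:N
  prior <- dbinom(i, size = N, prob = p)  # Eq. (6)
  posterior <- (i / N) * prior / p        # Eq. (10)
  sum(((i - 1) / (N - 1)) * posterior)
}
second_red_binom(100, 0.5)  # 0.5
second_red_binom(100, 0.3)  # 0.3
```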
Simulation
Here is a simulation code written in R. If you are interested in playing with the simulation, you can copy it below.
# A code to simulate the problem https://www.quantamagazine.org/perplexing-the-web-one-probability-puzzle-at-a-time-20240829/
# Find the math at https://tetraquark.netlify.app/post/redballgreenball/redballgreenball/index.html?src=githubrepo
library(plotly)
colorize <- function(x) { if (x<0){"R"}else{"G"}} # will use this to map +1,-1 to colors
probBalls=0.5 # probability of a red ball per coin flip; set as a global variable
Nballs=100;Nsim=40000; #set the number of balls in an urn, and the number of simulations
t2 <- list(size = 20)
wbinomProb<- function(x) { x/Nballs* dbinom(x, size=Nballs, prob=probBalls) } # posterior weight: (i/N) times the binomial pmf, cf. Eq. (10)
binomProb<- function(x) { dbinom(x, size=Nballs, prob=probBalls) } # binomial pmf of Eq. (6), the prior over red-ball counts
firstBall=c();secondBall=c();gBalls=c();rBalls=c();BallsOneTwo=c() # initialize arrays to store various values
simulator <- function(drawMode) {
for(s in c(1:Nsim)){
if(drawMode=="uniform"){
randomRedCount=sample.int(Nballs+1, 1)-1 # this is how the urn is set up: number of red balls is pulled uniformly from 0..Nballs
balls=c(rep(-1,randomRedCount ),rep(1,Nballs-randomRedCount )) # repeat -1 randomRedCount times to get reds, and +1 for greens
}
if(drawMode=="binomial"){
balls=2*rbinom(Nballs, 1, probBalls)-1
}
ballsC=sapply(balls,colorize) # map numbers to colors
NumOfGreens=round(Nballs/2+sum(balls)/2) #compute the number of Green balls
gBalls=c(gBalls,NumOfGreens)# log the value for the green balls in this urn
rBalls=c(rBalls,Nballs-NumOfGreens)# log the value for the red balls in this urn
randomIndex=sample.int(length(balls), 1) # draw a random index, this will select a ball from the urn
thisfirstBall=ballsC[randomIndex] # color of the first ball
firstBall=c(firstBall,thisfirstBall) # log the first ball color
ballsC=ballsC[-randomIndex]#remove this ball from the urn.
randomIndex=sample.int(length(ballsC), 1) # draw another random index for the second ball
thissecondBall=ballsC[randomIndex] # color of the second ball
secondBall=c(secondBall,thissecondBall)# log the second ball
BallsOneTwo=c(BallsOneTwo,paste0(thisfirstBall,"_",thissecondBall)) # Create a pair for later use
}
dtBall=data.frame("gBallsC"=gBalls,"rBallsC"=rBalls,"firstBall"=firstBall,"secondBall"=secondBall,"BallsOneTwo"=BallsOneTwo) # build a data frame
dtBallS=dtBall[dtBall$firstBall=="R",] # we are told that the first ball is red, so, just keep these outcomes.
redRatio=nrow(dtBallS[dtBallS$secondBall=="R",])/nrow(dtBallS) # simply compute the ratio of counts of second ball color= R to the total count (with B1=R)
xv=c(0:Nballs);
if(drawMode=="uniform"){
titleText="Uniform Preparation"
yv=2*xv/(Nballs*(Nballs+1))
rAVGtheory=2/3
yv0=rep(1/(1+Nballs),length(xv ))
} # theoretical formula; https://tetraquark.netlify.app/post/redballgreenball/redballgreenball/index.html#eq:bayesf?src=githubrepo2
if(drawMode=="binomial"){
titleText="Binomial Preparation"
yv=sapply(xv ,wbinomProb)
yv0=sapply(xv ,binomProb);yv0=yv0/sum(yv0)
yv=yv/sum(yv)
rAVGtheory=probBalls
} # theoretical formula; https://tetraquark.netlify.app/post/redballgreenball/redballgreenball/index.html#eq:bayesf?src=githubrepo2
m <- list(l = 20,r = 10,b = 20, t = 20,pad = 8)
figOut <- plot_ly(alpha = 0.5) %>%
add_histogram(x = dtBallS$rBallsC, name="Simulation", histnorm = "probability", nbinsx =Nballs)%>%
add_trace(x = xv,y=yv, name="Revised", type='scatter', mode="markers+lines")%>%
layout(hovermode = 'x',xaxis=list(hoverformat = '.0f',title="TeX($$\\Large{\\text{Number of Red Balls}}$$)"),
yaxis=list(hoverformat = '.2r', title="TeX($$\\Large{\\text{prob. density}}$$)")) %>%
layout(title=list(y=0.95,x=0.2,text=paste0( titleText),font = t2))%>%
layout(legend = list(x = 0.1, y = 0.85,orientation = 'h',font = t2),margin=m) #,plot_bgcolor = "rgba(0, 0, 0, 0)",paper_bgcolor = "rgba(0, 0, 0, 0)"
#figOut%>%config(mathjax = "cdn") # enable this to render MathJax
figOut<-figOut%>% add_trace(x = xv,y=yv0, name="Original", type='scatter', mode="markers+lines")
output<-list(figOut,redRatio, rAVGtheory)
return(output)
}
if(FALSE){
drawMode="uniform";
#drawMode="binomial";
returns=simulator(drawMode)
redRatioFormatted=formatC(signif(returns[[2]], digits=3), digits=3, format="fg", flag="#")
redRatioThFormatted=formatC(signif(returns[[3]], digits=3), digits=3, format="fg", flag="#")
print(paste0("Nballs:",Nballs,";",drawMode, "-->probability of second ball being red is:",redRatioFormatted,". Predicted value=",redRatioThFormatted, "."))
returns[[1]]
}
The code simulates the original problem with 100 balls and repeats it 40000 times.
We can now simply count the cases of the second ball being red; the simulation result is \(0.663\pm 0.005\) at the \(95\%\) confidence level, which is very close to the \(0.667\) value we predicted in Eq. (5).
Below is the tweaked problem where the ball colors are decided by coin flips. We can again count the cases of the second ball being red; the simulation result is \(0.499\pm 0.005\) at the \(95\%\) confidence level, which is very close to the \(0.5\) value we predicted in Eq. (11). The color of the first ball isn’t really useful information; it barely moves the distribution. In fact, the shift toward higher red-ball counts exactly cancels the fact that we removed a red ball in the first draw. It is a perfect cancellation!
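The cancellation can be made explicit with a short computation (an R sketch under the binomial-preparation setup; variable names are my own). The posterior mean of the red-ball count after a red first draw is \(\langle i \rangle_{\text{post}} = (N-1)p + 1\), so after removing the drawn red ball the expected red fraction of the remaining \(N-1\) balls is exactly \(p\):

```r
# The "perfect cancellation": posterior mean red count after a red draw
N <- 100
p <- 0.5
i <- 0:N
posterior <- (i / N) * dbinom(i, size = N, prob = p) / p  # Eq. (10)
m <- sum(i * posterior)  # posterior mean, equals (N-1)*p + 1
(m - 1) / (N - 1)        # expected red fraction of the remaining balls: p
```

The posterior shifts up by exactly one expected red ball, which is precisely the ball we took out.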