We all love wine, right? We also hate it when it gets poisoned. Looks like the wine is ruined again. If you've taken any math, CS, or statistics course, you've probably come across some variant of the poisoned wine problem. It goes like this: some king named Henry XXXIIIV is throwing a massive party for his very important daughter Elizabeth IIVIXYZ (Roman numerals aren't my strong point). He then finds out that one of his 1000 wine bottles is poisoned, but the poison takes a while to kill. Obviously he can't stop the party or throw all the wine away over just one poisoned bottle. But he also doesn't have much time, or enough people to have each one taste a single bottle. So he wants to figure out the minimum number of people necessary to taste the bottles and pinpoint which one is poisoned. He doesn't care how many die (they're commoners after all), but he doesn't have a hundred people to spare for these shenanigans.
If you haven't seen this problem before, you should definitely take some time to figure it out. Otherwise, here's a quick answer and explanation. The king only needs 10 people to pinpoint which bottle is poisoned! If he had a million bottles he would only need 20 people. If this seems counterintuitive, think of it this way: one person sips a little bit from 500 of the bottles and either dies or doesn't. Either way, you know the poison is in either the 500 bottles he drank from or the 500 he didn't, so you've narrowed it down to 500. Then a second person drinks from 250 of the bottles the first person drank from and 250 he didn't; if they both die you know it's one of 250 bottles, and the other cases work out the same way. Each person halves the number of possible bottles, so we get a logarithmic relationship: $\lceil \log_2 1000 \rceil = 10$. A more elegant way of thinking about it is to number the bottles in binary and have the nth person drink from every bottle whose binary representation has a 1 in the nth digit; the pattern of deaths then spells out the poisoned bottle's number.
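Here's a minimal sketch of that binary scheme in Python. The bottle index 473 is just a made-up example; in reality it's the unknown we're solving for.

```python
import math

NUM_BOTTLES = 1000
NUM_TASTERS = math.ceil(math.log2(NUM_BOTTLES))  # 10 for 1000; 20 for a million

poisoned = 473  # made-up for the demo; in reality this is the unknown

# Taster t sips from every bottle whose binary label has a 1 in bit t,
# so taster t dies exactly when bit t of the poisoned bottle's label is 1.
deaths = [(poisoned >> t) & 1 == 1 for t in range(NUM_TASTERS)]

# Reading the deaths back as bits reconstructs the poisoned bottle's label.
recovered = sum(1 << t for t, died in enumerate(deaths) if died)
print(NUM_TASTERS, recovered)  # 10 473
```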
Anyway, that's kinda cool in my opinion. However, kings don't usually have the magical power of knowing that exactly one bottle is poisoned. A much more interesting question: given a large number of bottles of wine, can you estimate the proportion of them that are poisoned? Just like the original problem, you want to minimize the number of people sampling the wine.
Before I go deep into the math, I want to provide a real-life scenario where this could happen and where this information would be useful. Let's say you have a certain set of documents that could have been corrupted. Bob also has the same set of documents 1000 miles away. You might want to check with Bob to find out whether your set of documents is uncorrupted.
One way to do this is to just send him each of your documents, and he can respond with the ones that are corrupted (the ones that differ). However, that ends up being fairly expensive in many cases. Often, you want to figure out roughly which documents are corrupted, and one way to do this is to combine, say, n documents at a time with some function and send the hashed value to Bob, who does the same on his end and compares. If he ends up with a different hashed value, he can be very confident that at least one of those n documents was corrupt, which narrows things down more cheaply than sending one document at a time.
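A rough sketch of the idea, with illustrative names (the group size and grouping scheme are whatever the two sides agree on):

```python
import hashlib

def group_digests(documents, n):
    """Hash the documents in consecutive groups of n, one digest per group."""
    digests = []
    for i in range(0, len(documents), n):
        h = hashlib.sha256()
        for doc in documents[i:i + n]:
            h.update(doc)
        digests.append(h.hexdigest())
    return digests

# Each side computes digests locally; only the short digests get exchanged.
mine = group_digests([b"doc0", b"doc1", b"doc2", b"doc3"], n=2)
bobs = group_digests([b"doc0", b"docX", b"doc2", b"doc3"], n=2)  # one corrupt doc
suspects = [i for i, (a, b) in enumerate(zip(mine, bobs)) if a != b]
print(suspects)  # [0]: at least one document in group 0 differs
```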
I'm going to set the metaphor back to wine because I like it more. Let's say you could just ask about one bottle at a time. One big question in statistics is how many bottles you need to sample to be 95% confident that your answer is within 1% (or any epsilon) of the true answer. Luckily for us, this question was solved hundreds of years ago. To simplify things, let's assume the population is very, very large (in general we only care about minimizing people when the inputs are large, but we will come back to this assumption at the end). Then we know by the central limit theorem that the sampling distribution becomes normal with a known variance. We get the following equation:
$$m = \frac{Z^2\,\hat{p}\,(1-\hat{p})}{\epsilon^2}$$

where $m$ is the sample size we need, $Z$ is the confidence value corresponding to a certain confidence level (in our case constant at 1.96), $\hat{p}$ is the predicted population proportion, and $\epsilon$ is the % error we want to be inside of (originally 1%).
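As a quick sanity check, here's the formula in code, plugging in the worst case $\hat{p} = 0.5$ with $\epsilon = 0.01$:

```python
def required_samples(p_hat, epsilon, z=1.96):
    """Sample size for a proportion estimate within epsilon at confidence z."""
    return z**2 * p_hat * (1 - p_hat) / epsilon**2

print(round(required_samples(0.5, 0.01)))  # 9604 samples
```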
So what are two ways we can lower the number of required samples here? Either we can push $\hat{p}$ close to 1 or 0, or we can increase the value of $\epsilon$. Well, what if you could combine, say, n bottles of wine and drink them all at once? Using this model, I got some really cool results.
To figure out exactly how drawing one sample relates to drawing multiple, it is important to remember how the probability $x$ of one wine bottle being poisoned relates to the probability that at least 1 of $n$ bottles is poisoned. We can do this elegantly with complements. The probability that none of them is poisoned is $(1-x)^n$, and we just take the complement of that to get

$$f(x) = 1 - (1-x)^n$$
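A quick Monte Carlo sanity check of that formula (the trial count is arbitrary):

```python
import random

def at_least_one_poisoned(x, n, trials=100_000):
    """Estimate P(at least 1 of n bottles is poisoned) by simulation."""
    hits = sum(any(random.random() < x for _ in range(n)) for _ in range(trials))
    return hits / trials

x, n = 0.05, 10
print(at_least_one_poisoned(x, n), 1 - (1 - x)**n)  # both come out near 0.40
```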
Taking advantage of the fact that $f$ is an increasing function on the interval $[0,1]$, we just need to figure out where the endpoints of the confidence interval map to:

$$[\hat{p} - \epsilon,\; \hat{p} + \epsilon] \;\mapsto\; [f(\hat{p} - \epsilon),\; f(\hat{p} + \epsilon)]$$
Now we know that the new value of $\epsilon$ is the distance from the midpoint to the endpoint, which ends up being

$$\epsilon' = \frac{f(\hat{p}+\epsilon) - f(\hat{p}-\epsilon)}{2} = \frac{(1-\hat{p}+\epsilon)^n - (1-\hat{p}-\epsilon)^n}{2}$$

And the new value of $\hat{p}$ is the average of the two endpoints:

$$\hat{p}' = \frac{f(\hat{p}-\epsilon) + f(\hat{p}+\epsilon)}{2} = 1 - \frac{(1-\hat{p}+\epsilon)^n + (1-\hat{p}-\epsilon)^n}{2}$$

Which gives the overall equation for the mapping:

$$m' = \frac{Z^2\,\hat{p}'\,(1-\hat{p}')}{\epsilon'^2}$$
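Putting the whole mapping into one function (a sketch; the example values at the bottom just illustrate the small-$\hat{p}$ win):

```python
def mapped_samples(p_hat, epsilon, n, z=1.96):
    """Samples needed when drinking n bottles at a time, via the mapping above."""
    lo = 1 - (1 - (p_hat - epsilon))**n   # f(p_hat - epsilon)
    hi = 1 - (1 - (p_hat + epsilon))**n   # f(p_hat + epsilon)
    p_new = (lo + hi) / 2                 # midpoint of the mapped interval
    eps_new = (hi - lo) / 2               # half-width of the mapped interval
    return z**2 * p_new * (1 - p_new) / eps_new**2

# Sampling 2 bottles at a time beats 1 at a time when p_hat is small:
print(mapped_samples(0.10, 0.02, n=1))  # ~864
print(mapped_samples(0.10, 0.02, n=2))  # ~456
```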
So this is all swell and good, but what do these convoluted equations mean? Basically, we have mapped one confidence interval to another, and one of them may require more or fewer samples. For small $\hat{p}$, making $n$ larger stretches the interval: $\epsilon'$ grows roughly like $n\epsilon$ while $\hat{p}'$ grows roughly like $n\hat{p}$, and since $m$ scales like $\hat{p}/\epsilon^2$, the required number of samples drops roughly like $1/n$. Another great thing is that as we make $n$ large (up to half the total size), the number of subsets of that size grows basically exponentially, which means the central limit theorem holds even for relatively large values of $n$. I experimented with various values in the Jupyter notebook below (a link to it is posted here), and we get some interesting results.
First let's just look at sampling one bottle vs. two bottles at a time. The x-axis is the percentage value of $\hat{p}$ and the y-axis is the necessary number of samples; we fix $\epsilon$ at .02.
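A sketch of how that comparison plot is generated (assuming numpy and matplotlib are available, and reusing `mapped_samples` from the sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.05, 0.90, 200)
for n in (1, 2):
    plt.plot(p, [mapped_samples(x, 0.02, n) for x in p], label=f"n = {n}")
plt.xlabel("p-hat (true proportion poisoned)")
plt.ylabel("required number of samples")
plt.legend()
plt.show()
```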
There are a couple of interesting things here. First, sampling two bottles performs better up to around 65%. But why does it get so much worse afterwards? This is because (a) the mapped proportion $\hat{p}'$ gets squeezed up toward 1 and (b) $\epsilon'$, which for $n = 2$ works out to $2\epsilon(1-\hat{p})$, shrinks toward 0, and since $\epsilon'$ enters the formula squared, the shrinking interval eventually wins. (In fact, for $n = 2$ the ratio of the two sample sizes comes out to $\frac{1+q}{4q}$ with $q = 1-\hat{p}$, which crosses 1 exactly at $\hat{p} = 2/3$.) However, when $p$ is small, two bottles have a clear advantage, and the trend continues as we increase $n$. Looking only at smaller values of $p$, we get the following graph
OK, so we see that performing these operations is definitely helpful when the proportion of poisoned bottles is relatively small. But why do we care? It is much more common for corruptions in data to be relatively sparse. If your data is significantly corrupted, it could be more practical to just throw it away than to try to reconstruct it. Similarly with wine: if most of it is poisoned, one might as well dump it all out. However, when the poisoned bottles are few and far between, we care a lot more about exactly how many there are. Thus we set $\epsilon$ a lot smaller, and the required sample size gets large.
So that concludes my initial exploration into poisoned bottles. I have some other ideas about extending this to interesting ideas in hashing and minimalistic encoding, as well as creating an algorithm that, given any proportion, can narrow down which proportion it is with the minimum number of samples by using a variable sample size.
Thanks for bearing with me!