How do I determine the probability that one running mean is higher/lower than another running mean?

Let's say I have two bags, each with an infinite number of marbles.

In the "red" bag, exactly 2/3 are red, the rest are blue.
In the "blue" bag, exactly 1/3 are red, the rest are blue.

Let's say I draw 1 marble from each bag at a time and keep a running average of the fraction of red marbles drawn from each.

Suppose after the first draw, I have:

Red bag: 0% red, 1 draw (got a blue)
Blue bag: 100% red, 1 draw (got a red)

After 1 marble, blue (100%) > red (0%), which is obviously wrong since we know the real distribution. However, if we didn't know the actual distribution, we couldn't say for sure we were wrong --- the probability that blue > red is not 0%.
What I am trying to do is to figure out when I can stop drawing marbles. That is, I want to keep drawing marbles until I am 99.9% certain that mean_red > mean_blue. (Or, if I get really unlucky, 99.9% certain that mean_red < mean_blue!)

Thibault St john Cholmondeley-ffeatherstonehaugh the 2nd Commented:
From the red bag you should expect two out of three draws to be red. One third of the bag is blue, but the bag contains an infinite number of marbles. One third of infinity is still infinite so it is possible that you will never draw a red marble from the red bag.
To decide that you want a certain level of confidence, 99.9, I think you have to state a finite population.
Define a test as taking one ball from the Red bag and one ball from the Blue bag.

There are four possible results from the test.  
Start by assuming we know the distributions in each bag.
  P( R, B)  = 4/9   ==> Correct
  P( R, R)  = 2/9   ==> No Information
  P( B, B)  = 2/9   ==> No Information
  P( B, R)  = 1/9   ==> Incorrect
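
The four probabilities above come from multiplying the per-bag probabilities, assuming the two draws are independent. A minimal sketch that reproduces them exactly (the function name `paired_outcomes` is made up for illustration):

```python
from fractions import Fraction

def paired_outcomes(p1, p2):
    """Probabilities of the four (bag 1, bag 2) results for one paired draw.

    p1 and p2 are the chances of drawing Red from bag 1 and bag 2.
    Assumes the two draws are independent.
    """
    return {
        ("R", "B"): p1 * (1 - p2),        # agrees with bag 1 being redder
        ("R", "R"): p1 * p2,              # tie, no information
        ("B", "B"): (1 - p1) * (1 - p2),  # tie, no information
        ("B", "R"): (1 - p1) * p2,        # points the wrong way
    }

probs = paired_outcomes(Fraction(2, 3), Fraction(1, 3))
for outcome, p in probs.items():
    print(outcome, p)
```

Using `Fraction` keeps the results as exact ninths instead of rounded decimals.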

Note that the difference in P(R) for these two bags is dP = 1/3.

If all you need to know is which bag has more Red balls, then I think you would have the correct answer after N = (1/dP)² = 9 tests.

Look at another case, where the first bag has P(R) = 0.9  and the second bag has P(R) = 0.8.
There are four possible results and probabilities are:
  P( R, B)  = 0.18   ==> Correct
  P( R, R)  = 0.72   ==> No Information
  P( B, B)  = 0.02   ==> No Information
  P( B, R)  = 0.08   ==> Incorrect

I think 100 trials would give you the correct answer with significant confidence.

If you don't know the difference in P(R) for the two bags, you can't say how many tests you have to run to determine which bag has the larger value.

But you can say something like:
If I run N tests, I will be able to detect a difference in P(R) of 1/sqrt(N) with TBD confidence level.
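
One way to put a rough number on that rule of thumb is to express the true difference in units of the standard error of the observed difference. This is only a sketch: it assumes independent draws and the usual binomial variance formula, and the function name `z_for_difference` is made up for illustration.

```python
import math

def z_for_difference(p1, p2, n):
    """Size of the true difference p1 - p2, measured in standard errors
    of the observed difference after n draws from each bag."""
    # Variance of a sample proportion is p(1-p)/n; the two bags are independent,
    # so the variances of the two sample proportions add.
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se

# The 2/3 vs 1/3 case with N = (1/dP)^2 = 9 tests
z = z_for_difference(2 / 3, 1 / 3, 9)
print(round(z, 3))
```

For the 2/3 vs 1/3 bags this gives 1.5 standard errors at N = 9, so the rule of thumb corresponds to a moderate confidence level, not a very high one.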

This problem would be an excellent candidate for Monte Carlo methods.
Here is a Monte Carlo analysis of the problem in Excel.

You can set the probability of drawing a Red ball from each bag.  Bag 1 should be higher than Bag 2.
The spreadsheet is set to run 1000 Trials of 100 Tests each.

If you read across a row,  cell entries keep track of the cumulative results for each test.

A cell entry of 14.07 in the Test=50 column would mean
     14 tests have produced the expected/correct [Red, Blue] result
       7 tests have produced the incorrect [Blue, Red] result.
     29 tests have produced inconclusive [Red, Red] or [Blue, Blue] results.

You can rerun the 1000 trials by hitting Alt-Ctrl-F9 which does a RECALC.

For mid range values with a difference of 0.1:
     0.55 vs 0.45  ==>  100 Tests give the correct result approx 90% of the time.

For extreme values with a difference of 0.1:
     0.95 vs 0.85  ==>  100 Tests give the correct result approx 99% of the time.
     0.15 vs 0.05  ==>  100 Tests give the correct result approx 99% of the time.
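
For readers without Excel, the same kind of simulation can be sketched in Python. This is not the spreadsheet itself, only an approximation of what it is described as doing: a trial counts as "correct" when the [Red, Blue] results outnumber the [Blue, Red] results after all tests. The function names and the fixed seed are assumptions made for the example.

```python
import random

def run_trial(p1, p2, n_tests, rng):
    """Run n_tests paired draws; return (correct, incorrect) counts."""
    correct = incorrect = 0
    for _ in range(n_tests):
        first_red = rng.random() < p1
        second_red = rng.random() < p2
        if first_red and not second_red:
            correct += 1      # [Red, Blue]: bag 1 looks redder
        elif second_red and not first_red:
            incorrect += 1    # [Blue, Red]: bag 2 looks redder
        # [Red, Red] and [Blue, Blue] are inconclusive and not counted
    return correct, incorrect

def fraction_correct(p1, p2, n_tests=100, n_trials=1000, seed=0):
    """Fraction of trials in which bag 1 ends up strictly ahead."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        correct, incorrect = run_trial(p1, p2, n_tests, rng)
        if correct > incorrect:
            wins += 1
    return wins / n_trials

mid_range = fraction_correct(0.55, 0.45)
extreme = fraction_correct(0.95, 0.85)
print(mid_range, extreme)
```

Ties count as "not correct" here, so the reported fraction is slightly pessimistic.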


You cannot tell in advance how many draws it will take to achieve 99.9% confidence that one bag has more or fewer red marbles than the other.

But if you pay attention while you are drawing, you will decide when you have enough information.

I have rearranged my Excel sheet to calculate the confidence level after 1, 2, 3, ... 50 draws.

With your original probabilities of 2/3 and 1/3, you need 9 or 10 draws to determine which bag has more red balls with 90% confidence.

You would need 23 to 25 draws to determine the answer with 99% confidence.  
But in the 1% of cases where you don't have the correct answer after 25 draws, you don't necessarily have the wrong answer.

Find a column headed by 0.990 and look down at the elements.  Many of the 0's (which indicate a failure to get the answer right) will be the result of ties (6.06 or 3.03).
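
When the bag probabilities are assumed known, the chance of being ahead, tied, or behind after n draws can also be computed exactly instead of simulated. The sketch below does that by combining two binomial distributions. It is not the spreadsheet's method, and the function names are made up for illustration.

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k reds in n independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def ahead_tie_behind(p1, p2, n):
    """Probabilities that bag 1 has more, the same, or fewer reds than
    bag 2 after n draws from each bag."""
    pmf1 = [binom_pmf(n, k, p1) for k in range(n + 1)]
    pmf2 = [binom_pmf(n, k, p2) for k in range(n + 1)]
    ahead = sum(pmf1[i] * pmf2[j] for i in range(n + 1) for j in range(i))
    tie = sum(pmf1[k] * pmf2[k] for k in range(n + 1))
    return ahead, tie, 1 - ahead - tie

for n in (1, 10, 25):
    ahead, tie, behind = ahead_tie_behind(2 / 3, 1 / 3, n)
    print(n, round(ahead, 4), round(tie, 4), round(behind, 4))
```

Keeping the tie probability separate shows how much of the "not correct" share is a tie as opposed to an actual wrong answer.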
cwm9 (Author) Commented:
I found someone with a background in statistics to answer the question for me.

The correct solution is to use the 'Two-Sample t-test for Equal Means'.  (In my actual use case, I really want the Unknown Variances version, 'Welch’s t-test'.)
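
For reference, a minimal sketch of the Welch statistic using only the Python standard library, with each draw coded as 1 for red and 0 for blue. The sample data is hypothetical. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom only; turning those into a p-value needs the Student t distribution, which a statistics package such as SciPy provides (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom.

    Each sample needs at least two values, and at least one sample must
    have nonzero variance (otherwise the statistic is undefined).
    """
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na   # squared standard error of mean A
    vb = variance(sample_b) / nb   # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (na - 1) + vb**2 / (nb - 1))
    return t, df

# Hypothetical draws: 1 = red, 0 = blue
red_bag = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
blue_bag = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
t_stat, dof = welch_t(red_bag, blue_bag)
print(round(t_stat, 3), round(dof, 1))
```

Note that checking the test after every draw and stopping at the first significant result inflates the error rate, so a fixed-sample test is not by itself a stopping rule.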

cwm9 (Author) Commented:
I've requested that this question be closed as follows:

Accepted answer: 0 points for cwm9's comment #a40950972

for the following reason:

Found an IRL expert to answer the question.  Posted her answer here.
The t-test can be used for testing simple hypotheses on existing data.

But your original question concerned how much data you had to collect for a particular, very high confidence level:
         I want to keep drawing marbles until I am 99.9% certain that mean_red > mean_blue.

I don't see how the t-test answers that question.

The Monte Carlo technique I described gives a specific answer in the case where you think you know the probabilities for each bag:
     9 or 10 draws for 90% confidence
     23 to 25 draws for 99% confidence