Informática geek, matemáticas, pensamiento crítico y alguna otra cosa de vez en cuando.

2010-01-28

Solution to the probability of lengths problem

In the previous IMAP upload post, I asked about the probability of each possible length of the resulting string in the random words generator. The loop had this structure:

  for ($i = 0; $i < 7 + mt_rand(0, 4); $i++) /* add one character */ ;

The key is that the loop exit condition is checked against (probably) different values each time. It's different from a similar loop in which the random number is fixed beforehand, because here the intersection of events comes into play, Let's see why and how.

As soon as $i reaches 7, if the value of mt_rand is 0 then the loop finishes; if it's any other value, it goes on. It's obvious that the probability of the length being 7 is 1/5 = 20%.

When the value of $i is 8, it's necessary for mt_rand to return either 0 or 1 for the loop to end, which will happen one two out of five times. However, for the length to reach the value 8 it's a condition that first mt_rand returns a value between 1 and 4, which will happen 4 out of 5 times. They are not independent events. Therefore, the combined probability is 4/5 · 2/5 = 8/25 = 32%.

For $i to reach 9, it must happen simultaneously that mt_rand returns a value between 1 and 4 the first time, and a value between 2 and 4 the second time. Furthermore, the loop will end when mt_rand is between 0 and 2, which wil happen 3 out of 5 times. Combining the probabilities, we have P(length 9) = 4/5 · 3/5 · 3/5 = 36/125 = 28.8%.

By the same reasoning, the probability for it to reach 10 is 4/5 · 3/5 · 2/5 · 4/5 = 96/625 = 15.36%.

The remaining one can be calculated either following the same reasoning, now that we got the trick, or by subtracting all the previous probabilities from 1: 4/5 · 3/5 · 2/5 · 1/5 · 5/5 = 24/625 = 3.84% = 100% - 20% - 32% - 28.8% - 15.36%.

To sum up:

  • P(length 7) = 20%
  • P(length 8) = 32%
  • P(length 9) = 28.8%
  • P(length 10) = 15.36%
  • P(length 11) = 3.84%

That is, it will be 8 almost one out of three times, closely followed by 9, then 7, then 10, and a few times it will reach 11. As I have always said, probability is not intuitive.

This program helps verifying the correctness of the numbers:

<?php

  $MAX = 10000000;

  $arr = array(0, 0, 0, 0, 0);

  for ($n = 0; $n < $MAX; $n++)
  {
     for ($i = 0; $i < mt_rand(0, 4); $i++)
       ;
     $arr[$i]++;
  }

  $l = strlen($MAX);
  for ($i = 0; $i < 5; $i++)
  {
    printf("%2d -> %{$l}d/$MAX = %f%%\n", $i+7, $arr[$i], $arr[$i]/$MAX*100);
  }

?>

Sample output:

 7 ->  2000626/10000000 = 20.006260%
 8 ->  3198181/10000000 = 31.981810%
 9 ->  2879481/10000000 = 28.794810%
10 ->  1538148/10000000 = 15.381480%
11 ->   383564/10000000 = 3.835640%

No comments: