Unfortunately, Google’s implementation is quite buggy. Here are a few problems that I ran into while testing:
While testing, Google Analytics shows a “Probability of Outperforming Original”. This number can is often incorrect, and does not match with what is actually used for the homepage A/B split ratio. It seems that the numbers in the Analytics UI is calculated based on data that is only partly up-to date. In reality the probability should is at least 99%, but the UI shows 84.1%. The number is updated about twice a day, and it jumps around wildly even though there is not much change in the relative ratio between A and B test.
The above graph is sometimes not up to date. It seems that the graph data can be delayed for 1-2 days. Sometimes it shows the current days as no data, even though it’s already recorded in the “Adsense” view.
By the way, when optimizing for Adsense Revenue, it seems that the “AdSense” view is much more accurate than the “Conversions” view. What really counts is the “AdSense eCPM” value (which by the way should be renamed to RPM).
Incorrect/Inaccurate Results with Multiple Variants
So even though the screenshot says a 84.1% probability of outperforming, the file has a selection probability of the variant of 98.1% which should be more correct. In my test with 20 variants, about 10 variants had a weight of 0, which means that they were never shown to the user. When I noticed this, I added my own selection code (in 50% of the cases, just randomly present a variation to the user) so that the 0 probability variants have a chance to be actually visible to some users, and while this had some effect on the results, it still seemed that the multi-armed bandit just can not deal with so many variants at the same time. Some variants which are actually very similar got a completely different probability of outperforming the original.
All in all Google Analytics is really excellent for A/B testing, but has a few quirks that one should be aware of. When A/B testing, a Chi-Square test for comparison is always helpful. E.g. a Chi-Squared test of click through ratio of the above example (ok, it’s not the same as testing AdSense Revenue like google does, but the difference should be marginal in my case) shows that with a confidence level of 99% we can say that the variant is more successful.