# Glossary

# Experiment

To compare the A/B testing results of two different solutions for a feature, you need to create an experiment.

- For example, you need to create an experiment to figure out which of the two text variations for the Sign Up button ("Sign up in One Click" and "Sign up Now") delivers the higher conversion rate.

# Orthogonal array test

- Every test/experiment runs on an independent traffic layer. All traffic layers have the same amount of traffic.

**Understanding orthogonal array test**

For example, suppose two tests need to run. Experiment A (the experimental variation is marked A1 and the control variation A2) runs in layer 1, using 100% of the traffic in that layer. Experiment B (experimental variation B1, control variation B2) runs in layer 2, also using 100% of the traffic in that layer. (Note that layer 1 and layer 2 carry exactly the same traffic, meaning the same users are in both layers; layer 2 simply reuses all the traffic from layer 1.)

Now imagine splitting the traffic in A1 into two groups, sending one group to B1 and the other to B2, and splitting the traffic in A2 the same way, with one group going to B1 and the other to B2. Experiment A and experiment B are then said to be **orthogonal**.

**Why is an orthogonal array test needed?**

- Since half of A1's users are in B1 and the other half are in B2, any effect that the A1 strategy has on experiment B is split evenly between the two variations of experiment B;
- In this case, if the metrics of B1 rise, we can rule out the possibility that the rise is due to the effect of A1. This is why orthogonal array tests are needed.
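The layered split described above can be sketched with per-layer hashing: each layer hashes the same user population with its own salt, so a user's bucket in one layer is statistically independent of their bucket in the other. The salts, the 50/50 split, and the user IDs below are illustrative assumptions, not DataTester's actual implementation.

```python
# Sketch of orthogonal traffic layers: each layer hashes the same users
# with its own salt, so assignments in layer 1 and layer 2 are uncorrelated.
import hashlib

def bucket(user_id: str, layer_salt: str) -> str:
    """Assign a user to a variation within one layer (illustrative 50/50 split)."""
    h = int(hashlib.md5(f"{layer_salt}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if h % 100 < 50 else "control"

# The same users flow through both layers; only the salt differs.
users = [f"user_{i}" for i in range(10000)]
a1 = {u for u in users if bucket(u, "layer1_experiment_A") == "treatment"}
b1 = {u for u in users if bucket(u, "layer2_experiment_B") == "treatment"}

# Roughly half of A1's users land in B1 and half in B2:
overlap = len(a1 & b1) / len(a1)
print(f"Share of A1 users that fall into B1: {overlap:.2f}")
```

Because the two layers use independent salts, the A1 strategy's effect is spread evenly across B1 and B2, which is exactly the orthogonality property described above.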

# Mutually exclusive tests

Mutually exclusive tests: experiments in the same mutual group never share users. That means if a user/device is participating in Experiment A, it will not participate in any other experiment within that mutual group.

- For example, if you want to run experiments on both the button's color and the button's shape, you would put these two experiments into one mutual group.
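In contrast to orthogonal layers, a mutual group can be sketched as one shared hash that carves the population into disjoint slices, one per experiment. The group salt, experiment names, and even split below are hypothetical, not DataTester's real mechanism.

```python
# Sketch of a mutually exclusive group: one shared hash splits users into
# disjoint slices, and each experiment in the group owns exactly one slice.
import hashlib

EXPERIMENTS = ["button_color", "button_shape"]  # experiments in the mutual group

def assign_experiment(user_id: str, group_salt: str = "mutual_group_1") -> str:
    h = int(hashlib.md5(f"{group_salt}:{user_id}".encode()).hexdigest(), 16)
    return EXPERIMENTS[h % len(EXPERIMENTS)]

# A user lands in exactly one experiment of the group, never both.
users = [f"u{i}" for i in range(1000)]
color_users = {u for u in users if assign_experiment(u) == "button_color"}
shape_users = {u for u in users if assign_experiment(u) == "button_shape"}
print(color_users.isdisjoint(shape_users))  # True
```

Because both experiments share the same hash, their user sets are disjoint by construction, which is the defining property of a mutual group.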

# Variations

An experiment compares a control variation against one or more experimental variations.

- For example, if the first text solution of the Sign Up button ("Sign up in One Click") is the control variation, the second solution ("Sign up Now") will be the experimental variation.

In DataTester, multiple control variations and multiple experimental variations are supported in an experiment.

# Parameters, parameter types, and parameter values

When creating an experiment, you need to differentiate the experimental and control variations with an identifier. We use parameters to do this. In an A/B test, each control or experimental variation may contain one or more parameters, and each parameter belongs to a type (String, Number, and Boolean are currently supported). Each parameter is also assigned a value.

- For example, for the Sign Up button text case, we can create a string parameter (named "register_name"). The parameter value in the control group is "Sign up in One Click", and that in the test group is "Sign up Now".
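The parameter model above can be sketched as follows. The class and field names are assumptions for illustration, not DataTester's real API; only the parameter name "register_name" and its values come from the example.

```python
# Hypothetical model of typed variation parameters: each variation carries
# a list of parameters, and the client renders whichever value it receives.
from dataclasses import dataclass

@dataclass
class Parameter:
    name: str
    type: str      # "String", "Number", or "Boolean"
    value: object

control = [Parameter("register_name", "String", "Sign up in One Click")]
experimental = [Parameter("register_name", "String", "Sign up Now")]

def button_text(params):
    """Look up the button label, falling back to a default if missing."""
    return next((p.value for p in params if p.name == "register_name"), "Sign up")

print(button_text(control))       # Sign up in One Click
print(button_text(experimental))  # Sign up Now
```

The fallback default matters in practice: a client that cannot reach the experiment service should still render something sensible.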

# Metrics

The objective of creating an experiment is to compare one or more metrics between the control and experimental variations.

- For example, to analyze clicks on the Sign Up button, the button's click events need to be reported, and then the metric can be configured in DataFinder.

# Significance

When an experimental variation's metrics are significant, the test result has a high probability of being trustworthy. A confidence level of 95% means that the system is 95 percent confident that the test result is accurate.

Baidu Baike, Baidu's online encyclopedia, defines statistical significance as follows: being statistically significant means that deviations between two groups result from systematic factors rather than chance. We assume that all other deviation-inducing factors between the two groups are under control, so the remaining explanation is the contributing factor we infer. However, our confidence in that factor is not 100 percent, and that is where the probability, or significance level, comes in.

# Traffic allocation

A/B testing usually starts with low-traffic tests. After an experimental group proves effective, high-traffic tests follow until the final full launch.

# Filter criteria

If you need to run a test on target users, you can configure the desired users to be involved in the experiment in the filter criteria. System default properties include the operating system, system version, channel, phone brand, device model, resolution, and operator. If you need to filter users by more properties, such as custom user properties, please contact your Customer Success Manager or click the "Customer Service" button in the lower-right corner to activate DataTester.

# Retention rate

Retention rate in the reports refers to the "retention rate by inclusion time to group", and the statistical pattern is as follows:

| Rule | Processing logic |
| --- | --- |
| Grouping | Users who participate in the variation for the first time (not necessarily new users) |
| Attribution | Retained users are attributed to the cohort of their first participation date |
| Return rule | A revisit to the app is regarded as a return action |

To name an example:

- The number of users in Variation A on Day 1 is 10,000, so the "base_user" of Day 1 is "10000".
- The number of users in Variation A on Day 2 is 10,400: 9,200 of them were among the Day 1 users, and 1,200 are newly included users. The "base_user" of Day 2 is "1200", and the Day 2 retention rate of the Day 1 cohort is 9200/10000 = 92%.
- The number of users in Variation A on Day 3 is 10,200: 8,000 were among the Day 1 users, 1,100 were newly included on Day 2, and 1,100 are newly included on Day 3. The "base_user" of Day 3 is "1100", the Day 3 retention rate of the Day 1 cohort is 8000/10000 = 80%, and the Day 2 retention rate of the Day 2 cohort is 1100/1200 = 91.67%.
- Then we can work out the weighted mean of each day's metrics, using the "base_user" values as weights, to obtain the overall Day 2 retention rate, Day 3 retention rate, and so on.
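The weighted mean in the last step can be computed directly from the example's numbers: each cohort contributes its retained count, weighted by its "base_user" value.

```python
# Overall Day 2 retention as a base_user-weighted mean of cohort rates.
# Each entry is (base_user, users from that cohort who returned the next day).
cohorts_day2 = [
    (10000, 9200),  # Day 1 cohort: 9,200 of 10,000 returned on Day 2
    (1200, 1100),   # Day 2 cohort: 1,100 of 1,200 returned on Day 3
]

def weighted_retention(cohorts):
    retained = sum(r for _, r in cohorts)
    base = sum(b for b, _ in cohorts)
    return retained / base

day2_rate = weighted_retention(cohorts_day2)
print(f"Overall Day 2 retention: {day2_rate:.2%}")  # 91.96%
```

Note that summing retained and base counts before dividing is equivalent to weighting each cohort's rate by its "base_user" value.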

# Statistical significance

- Statistical significance refers to the probability that a real performance deviation exists between your **experimental variation and the control variation**; that is, that the deviations in the measurable goals (the configured metrics) between the two variations are not a result of random chance.
- Statistical significance helps you judge when the results can be trusted. For most companies, **the test result can be regarded as correct when the confidence level is above 95%**.

# Significance level

Will the user experience be improved or impaired after a button is changed from blue to red, or after a window is moved from the left side to the right side? We are not sure of the answer, so we use A/B testing to help resolve this uncertainty: we observe the performance of the old and new strategies in a small flow experiment to determine the pros and cons of each.

But is this approach enough to fully eliminate uncertainty? The answer is no, because **sampling errors** may exist.

- It is known that Swiss per capita income is ten times that of China's (data source). If we randomly select three Swiss people and three Chinese people, can we guarantee that the average income of the three Swiss people in the sample is ten times that of the three Chinese people? What if the three Chinese people happen to be Jack Ma, Wang Jianlin, and me?
- Think the other way around. Assume that, with a traffic weight of 1%, Group A (red button) has a higher purchase rate than Group B (blue button). Then we increase the traffic weight to 100%. Can we guarantee that Strategy A still outperforms Strategy B? Obviously, the uncertainty remains.

**The uncertainty brought about by sampling errors prohibits any guarantees on the total correctness of small flow experiment conclusions. Fortunately, we have statistical methods to quantify the extent of sampling uncertainty, and this is where significance level (α) plays its part.**

# Statistical power

The statistical model behind A/B testing is a (two-sample) hypothesis test. Because sampling is required during testing, and sampling introduces sampling errors, our experiment will always "make mistakes". Statistics tell us what mistakes we might make and the odds of making them.

In the hypothesis testing process, we may make two types of mistakes: the Type I error of rejecting a null hypothesis when it is actually true, and the Type II error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature.

**A manifestation of the Type I error is that my strategy is actually non-helpful, but the test result demonstrates otherwise.**

- The significance level depicts the probability of such errors, that is, the probability of rejecting the null hypothesis when it is actually true.

**A manifestation of the Type II error is that my strategy is actually helpful, but the test fails to demonstrate so.**

- That is to say, the tester might accept the null hypothesis when it is actually false. The probability of such errors is marked by β.
- Statistical power (also known as test power) is defined as 1 − β, indicating "the probability that the test supports the effectiveness of my new strategy, given that the new strategy actually works".
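As a rough illustration of the α/β trade-off, the power of a two-sided two-proportion z-test can be sketched with the standard library under the normal approximation. The baseline rate, lift, and sample size below are made-up numbers, not DataTester defaults.

```python
# Approximate power (1 - beta) of a two-sided two-proportion z-test,
# using only the standard library's normal distribution.
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Power with n users per variation, normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    shift = abs(p2 - p1) / se             # standardized effect size
    return (1 - nd.cdf(z_alpha - shift)) + nd.cdf(-z_alpha - shift)

# Baseline 10% conversion, hoping to detect a lift to 11%,
# with 10,000 users per variation at alpha = 0.05:
power = power_two_proportions(0.10, 0.11, 10_000)
print(f"power = {power:.2f}")
```

Increasing the sample size per variation raises the power, which is why low-traffic tests (small n) are more likely to miss a genuinely helpful strategy, i.e. to commit a Type II error.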
