What does this article contain?
This article briefly explains the value and methodology behind statistical hypothesis testing, and introduces how you can leverage and interpret the results of KnowledgeHound’s statistical testing features on the analysis page. It also explains the types of statistical tests we perform (column proportion and means testing), and the situations in which we do and do not perform them.
Who should read this article?
You should read this article if you use KnowledgeHound and want to learn how to interpret the results of the statistical tests available to you during analysis, or how to better leverage the platform. It is also for users who want details on which statistical tests we perform and how KnowledgeHound can best serve their needs.
A brief introduction to statistical testing
In inferential statistics, statistical hypothesis testing is a class of methods used to evaluate hypotheses about the properties of a statistical population, often based on the extent to which a population might conform to a particular statistical distribution and how probabilities derived from sample data might be used to infer trends in the greater population.
Put more simply, statistical testing helps us identify differences in our sample data that are unlikely to be explained by chance alone, and that are therefore likely to reflect real differences in the population the sample represents.
Let’s take a look at a simple use case. In the following example, we have a survey containing two questions: the first asks about the respondent's gender and the second asks what their favorite color is.
Question | Type | Possible Answers |
Gender of Respondent | Single Choice | Male, Female |
What is your favorite color? | Single Choice | Red, Yellow, Blue, Green, Purple |
When analyzing the results, we might want to understand whether gender influences having a specific favorite color. To do that, we can crosstab (compare) the results of the favorite color question by gender:
 | | Gender of Respondent | |
 | | Male | Female |
Base | | 672 | 426 |
What is your favorite color? | Red | 31.99% | 32.86% |
 | Yellow | 13.24% | 16.43% |
 | Blue | 8.93% | 13.38% |
 | Green | 10.12% | 24.18% |
 | Purple | 35.71% | 13.15% |
In the table above, we can see that the percentage of women whose favorite color is red is higher than that of men, but just barely. Bearing in mind that this data is only a sample of a larger population, is this difference likely to reflect the population as a whole, or is it just noise?
To help us understand which differences are significant rather than just noise, we use hypothesis testing, also commonly referred to as stat testing.
TIP: It is also important not to look only at the percentages. Oftentimes the actual counts are too low to support a decision we can trust. For example, 75% can easily be 3 out of 4, and a base size of 4 is too low to draw any real conclusions from. Keep an eye on base sizes, and use the count view as a backup to ensure enough respondents are included in the base.
With a statistical test, we frame the question as a null hypothesis and an alternative hypothesis. We assume the null hypothesis is true unless the data provide sufficient evidence against it, in which case we reject it in favor of the alternative hypothesis. Here, our null hypothesis is that the proportions are the same in the population, and our alternative hypothesis is that they are not the same. Which test to apply depends on several factors. Learn more about those factors in our blog post What Is the Difference Between a T-Test and a Z-Test?
After running our calculations (a T-test in this example), we might find that the comparison of men and women whose favorite color is red yields a p-value of 0.764. The p-value is the probability of observing a difference at least as large as the one in our sample if the null hypothesis were true, that is, if the two proportions were actually the same in the population.
On the other hand, if we apply the same test to the proportions of men and women whose favorite color is blue, we get a much smaller p-value of 0.020.
It is up to us to choose the threshold below which we reject the null hypothesis. This threshold is called the significance level. The default significance level for analysis in KnowledgeHound is 0.05, which corresponds to a 95% confidence level.
Applying this threshold to our two test cases: the red comparison has a p-value of 0.764, well above 0.05, so we do not reject the null hypothesis, and the small difference between men and women could easily be noise. The blue comparison has a p-value of 0.020, below 0.05, so we reject the null hypothesis and treat the difference between men and women whose favorite color is blue as statistically significant.
For more information on p-value and significance level, see this article by Jim Frost.
In the following use case, we add another question to our existing survey, asking respondents to rate their experience using a mobile app to order food:
Question | Type | Possible Answers |
Rate your experience on a scale of 1 to 5. | Single Choice | 1 - Lowest, 2 - Low, 3 - Normal, 4 - High, 5 - Highest |
When we crosstab/compare these results we see the following:
 | | Gender of Respondent | |
 | | Male | Female |
Base | | 672 | 426 |
Rate your experience on a scale of 1 to 5. | Mean | 3.45 | 3.72 |
 | 5 - Highest | 31.40% | 42.72% |
 | 4 - High | 25.89% | 13.38% |
 | 3 - Normal | 7.89% | 20.89% |
 | 2 - Low | 25.89% | 18.78% |
 | 1 - Lowest | 8.93% | 4.23% |
Because the response options to this question use a Likert scale, we can assign numeric values to them as shown above (5, 4, 3, 2, 1). KnowledgeHound extracts these scale values automatically, which enables us to calculate means, commonly referred to as averages. Means are calculated for each column and can be compared in statistical tests.
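As a quick illustration of how a column mean follows from the scale values, the sketch below weights each scale value by its column percentage from the table above. The helper name `likert_mean` is hypothetical, and the sketch works from the rounded percentages shown in the table; it is not KnowledgeHound's implementation.

```python
# Scale values assigned to each response option (5 = Highest ... 1 = Lowest).
SCALE = [5, 4, 3, 2, 1]

# Column percentages copied from the crosstab above, in the same order as SCALE.
MALE_PCT = [31.40, 25.89, 7.89, 25.89, 8.93]
FEMALE_PCT = [42.72, 13.38, 20.89, 18.78, 4.23]


def likert_mean(scale_values, percentages):
    """Hypothetical helper: mean of a Likert item from column percentages.

    Each scale value is weighted by its share of the column. Dividing by the
    sum of the percentages (about 100) absorbs rounding in the displayed values.
    """
    total = sum(percentages)
    return sum(v * p for v, p in zip(scale_values, percentages)) / total


print(f"Male mean:   {likert_mean(SCALE, MALE_PCT):.2f}")
print(f"Female mean: {likert_mean(SCALE, FEMALE_PCT):.2f}")
```

Both results round to the means shown in the table, 3.45 for men and 3.72 for women.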
In the example above, we might compare the mean experience rating for men with the mean rating for women. Running the test gives a p-value of 0.0015, well below our 0.05 significance level, so we can conclude that the mean rating for women is significantly higher than the mean rating for men.
KnowledgeHound uses a T-test when comparing Means. Continue reading to learn more about column means testing.
Leveraging statistical tests in KnowledgeHound
While analyzing your data in KnowledgeHound, you can enable column proportions and means testing at any time. Simply click the Stat Testing button, marked with a Σ symbol, in the action bar. Statistical tests will immediately begin running for your analysis and will continue to update as you add additional variables or select and deselect response options. Click the button again to turn off statistical testing.
If you are in any of the visual chart type views, you will also need to ensure labels are enabled by clicking the Data Labels button in order to see the results of statistical testing.
You must have at least one series variable added to your analysis to leverage statistical tests in KnowledgeHound, either as a breakout or as an individual variable.
Bear in mind that testing will not be performed if the number of comparisons needed to test all columns in your analysis exceeds 10 million. The count is calculated as the number of cell pairs (comparisons) across all rows, multiplied by the number of respondents. This count is displayed above the study sample size and current base size.
Interpreting the results of statistical tests in KnowledgeHound
In the spreadsheet view, the results of statistical testing are displayed as letters, highlighted in yellow, that sit below the percent or count.
The yellow-highlighted letter tells you which other column or columns are statistically significantly different from the highlighted cell in the same row. For convenience, the column IDs at the top of the spreadsheet can help identify which letter corresponds to which column.
In the example above, cell A6 is the intersection of the Male column and the Purple row, so the percentage shown in the cell, 35.7%, is the proportion of men in the sample whose favorite color was purple. The letter b, highlighted in yellow, indicates that there was a statistically significant difference between this cell and the cell in the same row of the column marked b, and that this cell contains the greater of the two proportions. The corresponding cell is B6, the intersection of Female and Purple.
In other words, at a 95% confidence level, the proportion of men whose favorite color is purple is significantly larger than the proportion of women whose favorite color is purple.
Similarly, we can also conclude that the proportion of women whose favorite color was blue was larger than that of men whose favorite color was blue, and that the proportion of women whose favorite color was green was larger than that of men whose favorite color was green.
We perform pairwise statistical tests for column proportions and means. In other words, we only determine statistically significant differences between pairs of cells in the same row.
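To make the pairwise idea concrete, the sketch below tests every pair of columns in a row and, when p < 0.05, appends the letter of the smaller column to the cell with the larger proportion, mirroring how the letters are described above. It assumes a pooled two-proportion t-test with no overlap and DOF = w1 + w2 - 2, evaluates the t-distribution CDF by Simpson's-rule integration so only the standard library is needed, and uses counts reconstructed from the percentages and bases in the favorite color table (for example, 35.71% of 672 is 240). All function names are hypothetical.

```python
import math
from itertools import combinations

ALPHA = 0.05  # KnowledgeHound's default significance level


def t_two_sided_p(t, dof, steps=4000):
    """Two-sided p-value 2 * (1 - F(|t|)) for Student's t with `dof` degrees of freedom.

    F(|t|) - 0.5 is approximated by integrating the t density from 0 to |t|
    with Simpson's rule (`steps` must be even); standard library only.
    """
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def density(x):
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(x * x / dof))

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    area = total * h / 3  # approximately F(|t|) - 0.5
    return min(1.0, max(0.0, 1.0 - 2.0 * area))


def proportion_p_value(r1, w1, r2, w2):
    """Hypothetical pooled two-proportion t-test (no overlap, DOF = w1 + w2 - 2)."""
    p12 = (r1 + r2) / (w1 + w2)  # pooled proportion estimate
    se = math.sqrt(p12 * (1 - p12) * (1 / w1 + 1 / w2))
    if se == 0:
        return 1.0  # both proportions are 0% or both are 100%: nothing to test
    t = (r1 / w1 - r2 / w2) / se
    return t_two_sided_p(t, w1 + w2 - 2)


def significance_letters(row, alpha=ALPHA):
    """Hypothetical helper: letters to display under each cell of one row.

    `row` maps a column letter to (count, base). A cell gets the letter of every
    column whose proportion is significantly smaller than its own.
    """
    letters = {col: "" for col in row}
    for a, b in combinations(sorted(row), 2):
        (ra, wa), (rb, wb) = row[a], row[b]
        if proportion_p_value(ra, wa, rb, wb) < alpha:
            if ra / wa > rb / wb:
                letters[a] += b
            else:
                letters[b] += a
    return letters


# Column a = Male (base 672), column b = Female (base 426).
rows = {
    "Red": {"a": (215, 672), "b": (140, 426)},
    "Yellow": {"a": (89, 672), "b": (70, 426)},
    "Blue": {"a": (60, 672), "b": (57, 426)},
    "Green": {"a": (68, 672), "b": (103, 426)},
    "Purple": {"a": (240, 672), "b": (56, 426)},
}
for option, row in rows.items():
    print(option, significance_letters(row))
```

With these inputs, the sketch marks Purple with b in the Male column and marks Blue and Green with a in the Female column, matching the significant differences described above; Red and Yellow get no letters.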
For visual chart types, the letters are shown in the data labels. For example, in this image, which shows the same data and statistical test results as above, the data labels on each column indicate whether that column is significantly different from another column in the same grouping. Each option in the legend is also labeled with its letter to help with context.
Details about column proportion testing
The most common statistical test performed in KnowledgeHound is a pairwise column proportion test. The test performed on each pair of column proportions is a T-test as defined below.
r1 | Weighted count for column 1 |
w1 | Weighted base for column 1 |
r2 | Weighted count for column 2 |
w2 | Weighted base for column 2 |
wo | Weighted base for overlap |
p12 is a pooled proportion estimate for columns 1 and 2. F is the cumulative distribution function of the t-distribution applied to our t-score and DOF degrees of freedom.
Note that this expression for t-score includes a term to account for cases in which there is overlap between column 1 and column 2. This can happen, for instance, when the column variables you have added to your analysis contain a ‘multiple response’ or ‘pick many’ variable. In these cases, the weighted base for the respondents contributing to the overlap is included as wo. In cases where there is no overlap, wo=0 and the expressions reduce to:
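Assuming the standard pooled two-proportion test, which is consistent with the variable definitions above and reproduces the t-score of roughly 2.33 and the 1096 degrees of freedom in the example that follows, the no-overlap case can be sketched as below. This is a reconstruction under that assumption; KnowledgeHound's exact treatment of weighted data may differ.

```latex
p_{12} = \frac{r_1 + r_2}{w_1 + w_2}

t = \frac{\dfrac{r_1}{w_1} - \dfrac{r_2}{w_2}}
         {\sqrt{p_{12}\,(1 - p_{12})\left(\dfrac{1}{w_1} + \dfrac{1}{w_2}\right)}}

\mathrm{DOF} = w_1 + w_2 - 2, \qquad p = 2\,\bigl(1 - F(|t|)\bigr)
```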
Using our example from earlier (men and women whose favorite color is blue), the column proportions are 60/672 and 57/426 with no overlap. Applying the proportions test above gives an absolute t-score of roughly 2.33 with 1096 degrees of freedom. These values yield the p-value we saw earlier, roughly 0.02.
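The calculation above can be checked with the following sketch of the unweighted, no-overlap case (every weight is 1, so weighted counts and bases are simply counts). It assumes a pooled proportion estimate and DOF = w1 + w2 - 2, which reproduce the 2.33 and 1096 figures, and it does not implement the overlap term. The function names are hypothetical, and F is evaluated by numerical integration so the code needs only the standard library (`scipy.stats.t.sf` would be a common library alternative). The red counts, 215 of 672 and 140 of 426, are reconstructed from the percentages in the favorite color table.

```python
import math


def t_two_sided_p(t, dof, steps=4000):
    """Two-sided p-value 2 * (1 - F(|t|)) for Student's t with `dof` degrees of freedom.

    F(|t|) - 0.5 is approximated by integrating the t density from 0 to |t|
    with Simpson's rule (`steps` must be even); standard library only.
    """
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def density(x):
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(x * x / dof))

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    area = total * h / 3  # approximately F(|t|) - 0.5
    return min(1.0, max(0.0, 1.0 - 2.0 * area))


def column_proportion_test(r1, w1, r2, w2):
    """Hypothetical pooled column proportion t-test, no overlap (sketch).

    r = weighted count, w = weighted base, as defined above. Returns (t, dof, p).
    """
    p12 = (r1 + r2) / (w1 + w2)  # pooled proportion estimate
    se = math.sqrt(p12 * (1 - p12) * (1 / w1 + 1 / w2))
    t = (r1 / w1 - r2 / w2) / se
    dof = w1 + w2 - 2
    return t, dof, t_two_sided_p(t, dof)


# Blue: 60 of 672 men vs 57 of 426 women.
t, dof, p = column_proportion_test(60, 672, 57, 426)
print(f"blue: |t| = {abs(t):.2f}, DOF = {dof}, p = {p:.3f}")

# Red: 215 of 672 men vs 140 of 426 women.
print(f"red:  p = {column_proportion_test(215, 672, 140, 426)[2]:.3f}")
```

Run as-is, this should report a t-score of about 2.33 with 1096 degrees of freedom and a p-value of about 0.020 for blue, and a p-value of about 0.764 for red, matching the values earlier in this article.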
Details about column means testing
For rows that contain arithmetic means on numeric data or response options for which numeric values are available and configured, we perform pairwise column means testing.
The test performed on each pair of column means is a T-test as defined below.
w1 | Weighted count of responses included in the mean for column 1 |
q1 | Sum of the squared weights included in the mean for column 1 |
X1 | The weighted sum of the values included in the mean for column 1 |
Y1 | The weighted sum of the squared values included in the mean for column 1 |
w2 | Weighted count of responses included in the mean for column 2 |
q2 | Sum of the squared weights included in the mean for column 2 |
X2 | The weighted sum of the values included in the mean for column 2 |
Y2 | The weighted sum of the squared values included in the mean for column 2 |
wo | Weighted count of the responses contributing to overlap |
qo | Sum of the squared weights contributing to overlap |
S² is a pooled population variance estimate for columns 1 and 2. F is the cumulative distribution function of the t-distribution applied to our t-score and DOF degrees of freedom.
In cases where there is no overlap, wo = qo = 0, and the expressions reduce to:
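For unweighted data, where every weight is 1 (so q1 equals w1 and q2 equals w2), a standard pooled two-sample t-test that is consistent with the variables above and reproduces the t-score of roughly 3.17 and the 1096 degrees of freedom in the example that follows can be sketched as below. This is an assumption; KnowledgeHound's weighted expressions, which use q1 and q2, may differ.

```latex
\bar{x}_1 = \frac{X_1}{w_1}, \qquad \bar{x}_2 = \frac{X_2}{w_2}

S^2 = \frac{\left(Y_1 - \dfrac{X_1^2}{w_1}\right) + \left(Y_2 - \dfrac{X_2^2}{w_2}\right)}{w_1 + w_2 - 2}

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S^2\left(\dfrac{1}{w_1} + \dfrac{1}{w_2}\right)}}

\mathrm{DOF} = w_1 + w_2 - 2, \qquad p = 2\,\bigl(1 - F(|t|)\bigr)
```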
Using our means example from above (mean ratings for men and women), the means test above gives an absolute t-score of roughly 3.17 with 1096 degrees of freedom. These values yield the p-value we saw earlier, roughly 0.0015.
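The means example can be reproduced with the sketch below. It assumes unweighted data (so q1 equals w1 and q2 equals w2, and this sketch does not use q at all), no overlap, a pooled variance estimate, and DOF = w1 + w2 - 2; it is not KnowledgeHound's weighted formula. The counts per rating are reconstructed from the percentages and bases in the rating table (for example, 31.40% of 672 is 211), and the function names are hypothetical.

```python
import math


def t_two_sided_p(t, dof, steps=4000):
    """Two-sided p-value 2 * (1 - F(|t|)) for Student's t with `dof` degrees of freedom.

    F(|t|) - 0.5 is approximated by integrating the t density from 0 to |t|
    with Simpson's rule (`steps` must be even); standard library only.
    """
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def density(x):
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(x * x / dof))

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    area = total * h / 3  # approximately F(|t|) - 0.5
    return min(1.0, max(0.0, 1.0 - 2.0 * area))


def rating_sums(counts_by_value):
    """Hypothetical helper: (w, X, Y) = (count, sum of values, sum of squared values)."""
    w = sum(counts_by_value.values())
    x = sum(v * n for v, n in counts_by_value.items())
    y = sum(v * v * n for v, n in counts_by_value.items())
    return w, x, y


def column_means_test(w1, x1, y1, w2, x2, y2):
    """Hypothetical pooled two-sample t-test on column means, unweighted, no overlap."""
    mean1, mean2 = x1 / w1, x2 / w2
    dof = w1 + w2 - 2
    # Y - X^2 / w is each column's sum of squared deviations from its mean.
    s2 = ((y1 - x1 * x1 / w1) + (y2 - x2 * x2 / w2)) / dof  # pooled variance
    t = (mean1 - mean2) / math.sqrt(s2 * (1 / w1 + 1 / w2))
    return t, dof, t_two_sided_p(t, dof)


# Respondents per rating (5 = Highest ... 1 = Lowest), reconstructed from the table.
male = {5: 211, 4: 174, 3: 53, 2: 174, 1: 60}
female = {5: 182, 4: 57, 3: 89, 2: 80, 1: 18}

t, dof, p = column_means_test(*rating_sums(male), *rating_sums(female))
print(f"|t| = {abs(t):.2f}, DOF = {dof}, p = {p:.4f}")
```

The reconstructed counts also reproduce the column means in the table (2318 / 672 ≈ 3.45 and 1583 / 426 ≈ 3.72), and the test gives an absolute t-score of about 3.17 with 1096 degrees of freedom and a p-value of about 0.0015.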
Unsupported statistical tests
We do not currently have support for the following tests and/or variable types.
Statistical testing of means on grid (matrix) variables
ANOVA tests
Regression analysis
Conjoint analysis
Factor analysis
Cluster analysis