Sample Size Matters

An App Exploring Sample Size and Power

This app features several simulators, accessible on the top navigation bar, or by clicking below:

Simulator 1: The effect of sample size on normality tests and the precision of summary statistics

Simulator 2: Power Exploration

Simulator 3: Effect Sizes

Simulator 4: Publication Bias

Click on the blue text in each simulator to show additional content, including questions and visualizations.

This simulator was supported by Grant Number UL1 TR002377 from the National Center for Advancing Translational Sciences (NCATS). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. This app and the demonstrations herein were developed by Ethan Heinzen, Dr. Tracey Weissgerber and Dr. Stacey Winham, except where other sources are cited.

Do you like what you see here? This app is part of an online course on data visualization and statistical analysis for small sample size studies. To learn more, check out Sample Size Matters: Misconceptions about Graphs and Statistical Analyses in Lab and Clinical Research.

App version 3.1.8

Simulator 1: The effect of sample size on normality tests and the precision of summary statistics

How many data points do you need to determine the data distribution? A visual approach

The data distribution is one factor that we consider when deciding which statistical test to use. When we have many observations, it is easy to determine the data distribution. As the sample size decreases, it becomes increasingly difficult or impossible to identify the data distribution. This tool allows you to visually examine how our ability to identify the data distribution changes with sample size.
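
If you would like to try this outside the app, here is a minimal Python sketch that draws samples of several sizes from three illustrative distributions (normal, skewed, and bimodal) and plots histograms for visual comparison. The distribution parameters and sample sizes are illustrative choices, not the app's exact settings.

```python
# Draw samples of different sizes from three illustrative distributions and
# plot histograms; the parameters are illustrative, not the app's settings.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
distributions = {
    "normal":  lambda n: rng.normal(loc=5, scale=1, size=n),
    "skewed":  lambda n: rng.lognormal(mean=1.5, sigma=0.4, size=n),
    "bimodal": lambda n: np.concatenate([rng.normal(3, 0.5, n // 2),
                                         rng.normal(7, 0.5, n - n // 2)]),
}
sample_sizes = [5, 25, 250]

fig, axes = plt.subplots(len(distributions), len(sample_sizes), figsize=(9, 6))
for i, (name, draw) in enumerate(distributions.items()):
    for j, n in enumerate(sample_sizes):
        axes[i, j].hist(draw(n), bins=15)
        axes[i, j].set_title(f"{name}, n = {n}")
fig.tight_layout()
plt.show()
```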

Show questions >>

Change the values of 'n' in the boxes to view samples of different sizes.

  1. How many observations do you need to be confident that you can distinguish between the different data distributions?
  2. At what sample size does it become impossible for you to tell the difference between the data distributions?
  3. What sample sizes do you usually work with? With samples of this size, can you tell the difference between the distributions?

Population

How many data points do you need to determine the data distribution? A statistical approach

In the first tab, you saw how our ability to visually identify the data distribution changes with sample size. One way to statistically determine whether the data distribution is normal is to use a normality test, such as the Shapiro-Wilk or Kolmogorov-Smirnov test. (Show details >>) This tool allows you to examine how the percentage of samples that fail a Shapiro-Wilk normality test changes with the data distribution and the sample size.
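
For readers who want to reproduce this idea outside the app, here is a minimal Python sketch that estimates the percentage of simulated samples failing a Shapiro-Wilk test (p < 0.05) for several distributions and sample sizes. The distributions, their parameters, and the number of repetitions are illustrative choices, not the app's exact settings.

```python
# Estimate the fraction of simulated samples that "fail" a Shapiro-Wilk
# normality test (p < 0.05) for each distribution and sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_reps = 1000  # simulated samples per combination (illustrative)

def draw(dist, n):
    if dist == "normal":
        return rng.normal(size=n)
    if dist == "skewed":
        return rng.lognormal(sigma=0.6, size=n)
    if dist == "bimodal":
        return np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(2, 1, n - n // 2)])

for dist in ["normal", "skewed", "bimodal"]:
    for n in [5, 10, 25, 100, 500]:
        fails = sum(stats.shapiro(draw(dist, n)).pvalue < 0.05 for _ in range(n_reps))
        print(f"{dist:>8}, n={n:>3}: {100 * fails / n_reps:5.1f}% fail (p < 0.05)")
```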

Show questions >>

The Shapiro-Wilk normality test is used here. The null hypothesis is that the data come from a normal distribution.

  1. Start by thinking about what results you would expect. If the data distribution is normal, what percentage of samples would you expect to fail a normality test (p < 0.05)? What if the data distribution is skewed or bimodal?
  2. How many observations do you need to get results that look similar to your expected values?
  3. How do your results change when your sample size falls below the number that you identified in question 2?
  4. When you have very small samples (e.g., n = 5 or 10), can the normality test distinguish between normal, skewed and bimodal distributions?
  5. Enter values for the range of sample sizes that you typically use in your research. How useful are normality tests for these sample sizes? If the p-value for the normality test is >0.05 in your sample, can you be confident that your data are normally distributed?

Expected Results

Sample size and the precision of summary statistics

Visualizations 1 and 2 allow you to examine how summary statistics, such as the mean and standard deviation (SD), would change if you repeated the same experiment 100 times. Visualization 3 examines how the cumulative mean changes as your sample size increases. Explore these three visualizations to examine the relationship between sample size and the precision of summary statistics; then use what you've learned to answer the questions below.

Show questions >>

The visualizations below show how your summary statistics might change if you repeated the same experiment over and over again. Most of the time we only perform one experiment.

  1. Let's assume that you performed one experiment with a sample size of 250 per group.
    1. How confident would you be that your summary statistics are close to the true value for the population?
    2. How confident would you be that you would get very similar summary statistics if you repeated the same experiment a second time, with the same number of observations?
  2. How would your answers change if you performed one experiment with an n of 5?
  3. What would your answers be for the smallest and largest sample sizes that are used in your lab?

Show visualization 1 >>

  • Enter two values for different sample sizes in the text boxes. One hundred samples of each of the two sizes that you specified are drawn from a normally distributed population.
  • The graphs show the mean (black dots) and SD (black bars) of each of the 100 samples. The horizontal line in the middle of the graph shows the true mean for the population, while the shaded region shows the true SD. (A short code sketch of this simulation appears below.)

The shaded area here represents the true population SD.
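
Here is a minimal Python sketch of this simulation: it draws 100 samples at two sample sizes from a normal population and summarizes how far the sample means and SDs stray from the true values. The population mean and SD used here are illustrative.

```python
# Draw 100 samples at two sample sizes from a normal population and compare
# each sample's mean and SD to the true values (mu and sigma are illustrative).
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 10, 2          # true population mean and SD
n_experiments = 100

for n in [5, 250]:
    samples = rng.normal(mu, sigma, size=(n_experiments, n))
    means = samples.mean(axis=1)
    sds = samples.std(axis=1, ddof=1)
    print(f"n = {n:3d}: sample means range {means.min():.2f} to {means.max():.2f}, "
          f"sample SDs range {sds.min():.2f} to {sds.max():.2f}")
```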


Show visualization 2 >>

100 samples of size N are drawn from the distribution indicated in the drop-down menu, and the sample statistic selected in the second drop-down menu is computed for each. The red line represents the population mean or median. Adjust the sample sizes across the top of the table to examine how N affects these statistics.


Show visualization 3 >>

  • This figure shows how cumulative means (or medians) change with increasing sample size. You start off by recording measurements for three participants; then calculate the mean for your sample (n = 3). You then add a fourth participant and calculate the mean (n = 4). You keep adding one new participant to your sample and calculating the mean each time, until your sample includes the full sample size 'n'. You then create a line graph showing how your sample mean changed each time you added another observation. The black line represents the mean or median for the population.
  • The figure below shows what would happen if you repeated this experiment 100 times. 100 samples of the size that you specify are drawn, and the cumulative means are calculated and plotted. (A short code sketch of this simulation follows this list.)
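
Here is a minimal Python sketch of the cumulative-mean simulation described above; the population parameters, final sample size, and number of repeats are illustrative.

```python
# Cumulative means as each new observation is added, repeated for many
# independent samples (population parameters are illustrative).
import numpy as np

rng = np.random.default_rng(11)
mu, sigma = 10, 2
n_max = 100            # final sample size
n_repeats = 100        # how many times the "experiment" is repeated

samples = rng.normal(mu, sigma, size=(n_repeats, n_max))
# cumulative mean after 1, 2, ..., n_max observations for every repeat
cum_means = np.cumsum(samples, axis=1) / np.arange(1, n_max + 1)
cum_means = cum_means[:, 2:]   # start at n = 3, as in the app

print("spread of cumulative means across the 100 repeats:")
for n in [3, 10, 30, 100]:
    col = cum_means[:, n - 3]
    print(f"n = {n:3d}: min {col.min():.2f}, max {col.max():.2f} (true mean {mu})")
```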

The terms "Corridor of Stability" and "Seas of Uncertainty" were borrowed from previous papers examining the stability of correlation coefficients
(Schönbrodt FD, Perugini M. At what sample size do correlations stabilize? Journal of Research in Personality 2013; 47(5): 609-612. doi:10.1016/j.jrp.2013.05.009)
and effect sizes
(Lakens D, Evers ERK. Sailing From the Seas of Chaos Into the Corridor of Stability: Practical Recommendations to Increase the Informational Value of Studies. Perspectives on Psychological Science 2014; 9(3): 278-292. doi:10.1177/1745691614528520).
We encourage users to consult these papers for more information on how sample size affects the uncertainty surrounding different types of estimates.

Simulator 2: Power Exploration

What is Power?

Power is the probability that you will detect a significant effect if your hypothesis, also called the alternative hypothesis, is true. This simulator repeats the same experiment 100 times under two different scenarios:

  1. There is no effect (the null hypothesis is true)
  2. There is an effect (your hypothesis, also called the alternative hypothesis, is true)

The simulator records the p-value for each of the 100 experiments; then creates a histogram showing the distribution of the p-values for each scenario. You can adjust power by changing the number in the power box. The red bar in the top panels shows you the percentage of p-values that are less than the significance level for each scenario. The zoomed-in panels show you the distribution of p-values between 0 and the significance level.
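
The sketch below reproduces this idea in Python under simplifying assumptions: instead of setting power directly, it fixes the per-group sample size and the true difference (chosen so that power is roughly 80%) and tallies how often p falls below the significance level under each scenario.

```python
# Repeat a two-sample t-test 100 times under the null (no true difference) and
# under an alternative (a true difference chosen to give roughly 80% power),
# then tally how often p falls below the significance level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_per_group, alpha = 20, 0.05
n_experiments = 100

def p_values(true_diff):
    pvals = []
    for _ in range(n_experiments):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(true_diff, 1, n_per_group)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return np.array(pvals)

for label, true_diff in [("null hypothesis true (no effect)", 0.0),
                         ("alternative hypothesis true", 0.9)]:
    p = p_values(true_diff)
    print(f"{label}: {np.mean(p < alpha):.0%} of experiments give p < {alpha}")
```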

Show questions about power >>

  1. Set power to 80%. Click "Draw new sample" 10 times.
    1. If there is no effect (the null hypothesis is true), approximately what proportion of experiments give p-values <0.05? What proportion of samples give p-values >0.95?
    2. If your hypothesis is correct (the alternative hypothesis is true), approximately what proportion of experiments give p-values <0.05? What proportion of samples give p-values >0.95?
  2. Repeat question 1 with 50% power. How do the answers change?
  3. Repeat question 1 with 20% power. How do the answers change?
  4. Based on your experiments, what is power?

Show questions about type 1 error rate >>

  1. Change the significance level, or type 1 error rate, to 0.10. Set power to 80%. Click "Draw new sample" 10 times. If there is no effect (the null hypothesis is true), approximately what proportion of experiments give p-values below the significance level (p < 0.10)?
  2. How does this compare with your previous observations, when the type 1 error rate was 0.05?
  3. What does the type 1 error rate represent?

Show questions about distribution of significant p-values >>

If there is an effect (the alternative hypothesis is true), which p-values are more common: p < 0.005 or p = 0.045?

To answer this question using the simulator, show the lower set of graphs that present p-values between 0 and 0.05. Does your answer change if power is 80% vs. 50% vs. 20%?

Show questions about p-curve shape >>

How would you describe the shape of the p-curve in each of the following scenarios?

  1. There is no effect (null hypothesis is true)
  2. There is an effect (alternative hypothesis is true) and you have high power
  3. There is an effect (alternative hypothesis is true) and you have low power

How Can I Increase Power?

Use this simulator to explore different strategies for increasing power. The tool compares two independent groups using a two-sample unpaired t-test. Within each section of the simulator, you'll be able to enter values for group A and group B. As in the previous tab, the simulator performs 100 experiments and creates a histogram of p-values for two different scenarios:

  1. There is no effect (the null hypothesis is true)
  2. There is an effect (your hypothesis, also called the alternative hypothesis, is true)

The first section examines the effect of sample size. The other factors that affect power are fixed and cannot be changed - we'll look at these factors in later sections. The second section examines changes in the effect size, or difference between the means. The third section examines changes in variability, or the standard deviation (SD) for each group. The fourth section allows you to adjust the sample sizes, difference between means (effect size) and variability (standard deviations) for each group simultaneously.
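
As a rough companion to these tools, the following Python sketch computes analytic power for a two-sample t-test while varying each factor in turn. It uses statsmodels' TTestIndPower; all numeric values are illustrative and may not match the app's defaults.

```python
# Analytic power for a two-sample t-test while varying sample size, mean
# difference, and SD one at a time; numeric values are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

def power(n_per_group, mean_diff, sd, alpha=0.05):
    # Standardized effect size (Cohen's d) = mean difference / SD.
    return analysis.power(effect_size=mean_diff / sd, nobs1=n_per_group,
                          ratio=1.0, alpha=alpha)

print("varying n per group:", [round(power(n, 1.0, 2.0), 2) for n in (5, 50, 100)])
print("varying mean diff:  ", [round(power(25, d, 2.0), 2) for d in (1.0, 1.5, 2.5)])
print("varying SD:         ", [round(power(25, 1.0, sd), 2) for sd in (2.0, 1.5, 1.0)])
```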


Show effects of sample size >>

Show questions >>

  1. Enter sample sizes of 5 for each group, then click "Draw new sample" 10 times. What is the approximate power for this sample size?
  2. Enter sample sizes of 50 for each group, then click "Draw new sample" 10 times. What is the approximate power for this sample size?
  3. Enter sample sizes of 100 for each group, then click "Draw new sample" 10 times. What is the approximate power for this sample size?
  4. Enter a sample size of 50 for group A and 150 for group B, then click "Draw new sample" 10 times. What is the approximate power for this sample size?
  5. What happens to power as sample size increases?

Show effects of differences in means (effect size) >>

Show questions >>

  1. Enter an effect size (difference in means) of 1, then click "Draw new sample" 10 times. What is the approximate power for this effect size?
  2. Enter an effect size (difference in means) of 1.5, then click "Draw new sample" 10 times. What is the approximate power for this effect size?
  3. Enter an effect size (difference in means) of 2.5, then click "Draw new sample" 10 times. What is the approximate power for this effect size?
  4. What happens to power as effect size (difference in means) increases?

Show effects of variability (standard deviation) >>

Show questions >>

  1. Enter SDs of 2 for each group, then click "Draw new sample" 10 times. What is the approximate power for this standard deviation?
  2. Enter SDs of 1.5 for each group, then click "Draw new sample" 10 times. What is the approximate power for this standard deviation?
  3. Enter SDs of 1 for each group, then click "Draw new sample" 10 times. What is the approximate power for this standard deviation?
  4. What happens to power as standard deviation increases?

Show effects of all three >>

Simulator 3: Effect Sizes

Effect Size: how much is the difference between groups?

An effect size measures the magnitude of the difference between groups. Examples include the difference between group means, or the difference between group medians. The effect size can be measured as the raw difference, or as a standardized effect size, which scales the difference relative to the variability. A commonly used standardized effect size is Cohen's d (the difference in group means divided by the pooled standard deviation); by convention, d = 0.2 is a small effect, d = 0.5 is a medium effect, and d = 0.8 is a large effect.
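
As a concrete illustration, here is a small Python sketch that computes Cohen's d using the pooled standard deviation (a common convention; the app's exact formula may differ). The simulated group values are illustrative.

```python
# Cohen's d: difference in group means divided by the pooled SD
# (a common convention; the app's exact formula may differ).
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)
group_a = rng.normal(10.0, 2.0, 25)
group_b = rng.normal(11.0, 2.0, 25)   # true standardized difference is 0.5
print(f"Cohen's d = {cohens_d(group_b, group_a):.2f}")
```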

This simulator draws a sample from two independent groups each of size n=25, from a population with a given true effect size (raw difference in means). The first tool allows you to visually assess how the effect size measures the difference between groups; you can adjust the raw difference in means in the boxes labeled 'Effect Size'. The second tool allows you to visually assess how the estimate of the effect size is related to the sample size; you can adjust the standardized difference in means in the boxes labeled 'Effect Size'.

Show questions >>

What is effect size?

  1. What is an effect size?
  2. What is the difference between an effect size and a standardized effect size?
  3. Why is it important for scientists to think about effect sizes?

Smaller p ≠ bigger effect

P-values are affected by sample size and standard deviation, in addition to effect size. This simulator allows you to explore the relationship between p-values and sample size, standard deviation, and effect size. Each figure displays the results of one experiment comparing two independent groups. The p-value of a two-sample independent t-test comparing the means of Group A and Group B is displayed in blue above the graph. The input boxes allow you to control the sample size, standard deviation, or effect size. Values of the effect size, standard deviation, and sample size are displayed on the right.
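
The following Python sketch illustrates the same point under simplifying assumptions: the true effect is held fixed while the per-group sample size changes, and the resulting p-values are printed. All numeric values are illustrative.

```python
# The same true effect size can give very different p-values depending on the
# sample size; the effect size and SD here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
true_diff, sd = 0.5, 1.0

for n in [5, 25, 50]:
    a = rng.normal(0, sd, n)
    b = rng.normal(true_diff, sd, n)
    observed_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:2d} per group: observed d = {observed_d:5.2f}, p = {p:.3f}")
```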

Show questions >>

Click "Show effects of sample size" below.

  1. How do the effect sizes compare between the three different examples (n=5, n=25, and n=50)?
  2. What happens to the p-values as the sample size increases?

Click "Show effects of variability (standard deviation)" below

  1. How do the effect sizes compare between the three different examples (SD=3, SD=2, and SD=1)?
  2. What happens to the p-values as the standard deviation increases?

Click "Show effects of differences in means (effect size)" below.

  1. What happens to the p-values as the effect size increases?

After examining these three simulators, why should we avoid assuming that a smaller p-value means that we've found a larger effect?

In basic biomedical science, scientists routinely report p-values; however, they seldom report effect sizes. Why is this a problem? How might we interpret data differently if scientists focused on effect sizes, rather than p-values?


Show effects of sample size >>

Assumptions: Effect Size = 0.5, SD = 1.0

Show effects of variability (standard deviation) >>

Assumptions: n per group = 25, Effect Size = 0.5

Show effects of differences in means (effect size) >>

Assumptions: n per group = 25, SD = 1.0

Effect size estimates for small samples: Winner's curse

When a study is underpowered, samples with significant results tend to overestimate the effect size ("winner's curse"). This simulator allows you to examine the relationship between effect size, p-value, and sample size. The simulator repeats an experiment 100 times to compare two independent groups. The figure shows the estimates of the effect size (mean difference) plotted against the p-value from a two-sample t-test for each of these 100 experiments. The true effect size (mean difference) is 1.0, denoted by the red line. You can control the sample size for each group using the input boxes. The power of the test is displayed above each graph. You can select the "Show Histograms" option to see the distribution of p-values and effect sizes. Samples shown in red have p < 0.05. Below the graph is a table summarizing the samples with p < 0.05 compared to those with p ≥ 0.05.
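
Here is a minimal Python sketch of the winner's curse idea, under illustrative settings (true difference 1.0, SD 2.0, n = 5 per group): it repeats the experiment 100 times and compares the average estimated difference in the "significant" samples to the true value.

```python
# Winner's curse: with an underpowered design, the "significant" samples tend
# to overestimate the true difference. Settings here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_diff, sd, n_per_group = 1.0, 2.0, 5
n_experiments = 100

diffs, pvals = [], []
for _ in range(n_experiments):
    a = rng.normal(0, sd, n_per_group)
    b = rng.normal(true_diff, sd, n_per_group)
    diffs.append(b.mean() - a.mean())
    pvals.append(stats.ttest_ind(a, b).pvalue)
diffs, pvals = np.array(diffs), np.array(pvals)

sig = pvals < 0.05
print(f"true difference: {true_diff}")
print(f"mean estimated difference, p < 0.05:  {diffs[sig].mean():.2f}")
print(f"mean estimated difference, p >= 0.05: {diffs[~sig].mean():.2f}")
```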

Show questions >>

  1. Look at the power for each of the three examples (n=5, n=50, n=100). Based on what you learned about power in simulator 2, what percentage of samples would you expect to have p-values < 0.05 for each sample? Click the "Draw new sample" button a few times to see if the reported estimates shown in the p-value histogram are close to your prediction.
  2. Look at the effect size histograms. What happens to the shape of the histogram as sample size and power increase? How do observed effect sizes for the simulated samples compare to the true effect size of 1.0?
  3. When n=5 and power = 8.7%, how does the average observed effect size for samples in which p < 0.05 compare to the true effect size of 1.0?
  4. Consider samples where p < 0.05. How does the difference between the average observed effect size and the true effect size change as sample size and power increase?
  5. Consider samples where p > 0.05. How does the difference between the average observed effect size and the true effect size change as sample size and power increase?
  6. You are reading a paper where the authors report a significant difference in a variable that you are interested in. However, the study had a small sample size and was underpowered. Is the reported effect size most likely to be larger, smaller, or the same as the true effect size?
  7. Your colleague wants to use the data from the study described in question 6 to perform a power calculation for a future study. Is it more likely that the future study would be adequately powered, underpowered, or overpowered?

Simulator 4: Publication Bias

Publication Bias

Publication bias is the tendency for studies with statistically significant results to be published more readily than studies without significant results, which leads to overestimated effect sizes in the literature. In this activity, 20 studies of various sample sizes, each testing the same hypothesis by comparing the means of two groups, are simulated. A standardized effect size is calculated for each study, and if the result is statistically significant (p < 0.05), the study is "published". If it is not statistically significant, the study is published with the probability specified below. A meta-analysis is a technique used to combine results across studies, and the results are usually visualized using a "forest plot". The forest plot below displays the results from all studies, as well as the combined estimates from the meta-analysis of all studies (in black) and of only the published studies (in red), to illustrate the effect of publication bias. For more information on how to read this forest plot, please see this video explanation.
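
For readers who want to experiment outside the app, here is a simplified Python sketch of the publication-bias idea. It simulates 20 two-group studies, "publishes" every significant study and a fraction of the non-significant ones, and compares plain (unweighted) average effect sizes; a real meta-analysis would use inverse-variance weighting, and all parameter values here are illustrative.

```python
# Publication bias: simulate 20 two-group studies, "publish" all significant
# studies and a fraction of the non-significant ones, and compare the plain
# average effect size of all studies vs published studies only.
# (A real meta-analysis would weight studies; values here are illustrative.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
true_d = 0.2             # small true standardized effect
pub_prob_nonsig = 0.2    # chance that a non-significant study is still published
n_studies = 20

effects, published = [], []
for _ in range(n_studies):
    n = int(rng.integers(10, 100))     # per-group sample size varies by study
    a = rng.normal(0, 1, n)
    b = rng.normal(true_d, 1, n)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    p = stats.ttest_ind(a, b).pvalue
    effects.append(d)
    published.append(p < 0.05 or rng.random() < pub_prob_nonsig)
effects, published = np.array(effects), np.array(published)

print(f"true standardized effect size: {true_d}")
print(f"mean effect size, all {n_studies} studies:       {effects.mean():.2f}")
print(f"mean effect size, published studies only: {effects[published].mean():.2f}")
```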

It's recommended that you reset to the defaults when you're done with each question.

Show questions >>

  1. Reduce the probability that a non-significant study gets published. What happens to the effect size estimates of all studies and the published studies?
  2. Change the effect size from a small effect (d=0.2) to a large effect (d=0.8). How do the effect size estimates of all studies and the published studies change?
  3. Change the effect size to no true effect (d=0), and draw a new sample and republish studies multiple times. How do the effect size estimates of all studies and the published studies change?
  4. When the proportion of small sample size studies decreases (i.e., there are more large sample size studies), what happens to the effect size estimates of all studies and the published studies?

How does sample size differ between:

  • significant studies vs. not significant studies?
  • published vs. not published studies?

How does the percent of significant studies differ by sample size?

How does the percent of published studies differ by sample size?

DISCLAIMER: The content on the site is NOT medical advice. Although some content may be provided by medical professionals, users acknowledge that access or use of the content does not create a provider-patient relationship and does not constitute medical advice, treatment, diagnosis or services of any kind. The information is provided for educational purposes only and as such is not a substitute for professional medical attention and treatment by medical professionals. Users are solely responsible and accept all liability resulting from use of the content and any related services or products.