We received a lot of questions during this webinar and weren't able to answer them all live, but Bob and Kate wanted to get to as many as possible and have answered more here!
How can we advocate for estimation statistics when faced with pushback from proponents of traditional statistical tests?
Bob: One strategy that can be effective is to point out that an effect size with a confidence interval contains all the information needed to conduct a statistical test, so by reporting them you are providing more information while still enabling the reader to conduct statistical tests against any desired null hypothesis. There are also some very good professional guidelines for reporting, many of which emphasize the importance of reporting and interpreting effect sizes and uncertainty. The APA, for example, suggests “wherever possible, base discussion and interpretation of results on point and interval estimates” (APA Publication Manual, 6th edition, p. 35). Perhaps
Kate: I tend to report effect sizes and confidence intervals alongside my p-values, as each provides its own information and people looking for p-values are then placated. The important thing is that effect sizes and confidence intervals are reported, as these are important for others to be able to interpret your results and for those wishing to use your results to inform their own research. A p-value is the least informative of these pieces of information. I also encourage my students to report descriptive statistics such as means and standard deviations, as these are really helpful for anyone wishing to use or replicate their research.
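To make Bob's point concrete, here is a minimal sketch of how a reader could recover a test against any null value from a reported estimate and its confidence interval. It assumes a symmetric, normal-theory interval (estimate plus or minus z times the standard error). The function name and the numbers are hypothetical and not from the webinar.

```python
from statistics import NormalDist

def p_value_from_ci(estimate, ci_low, ci_high, null_value=0.0, level=0.95):
    """Two-sided p-value against null_value, recovered from an estimate and its CI.

    Assumption: the interval is symmetric and normal-theory based,
    i.e. estimate +/- z * SE.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = (ci_high - ci_low) / (2 * z_crit)          # back out the standard error
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical result: mean difference of 5.0 with 95% CI [1.0, 9.0]
p_vs_zero = p_value_from_ci(5.0, 1.0, 9.0, null_value=0.0)  # 0 is outside the CI
p_vs_four = p_value_from_ci(5.0, 1.0, 9.0, null_value=4.0)  # 4 is inside the CI
print(f"p against 0: {p_vs_zero:.3f}, p against 4: {p_vs_four:.3f}")
```

Any null value outside the 95% interval gives p < 0.05 and any value inside gives p > 0.05, so the interval summarizes the tests against every possible null at once.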
When doing power analysis, sometimes one might want to estimate the effect size based on studies of similar phenomena. How similar is "similar enough"?
This is a hard question to answer, and as an expert in your field you may be best placed to judge. I would approach this by looking at the range of effect sizes published on a similar phenomenon, giving more weight to the estimates from studies which seem closest to the one I'm planning. It’s also important to consider publication bias, so recently I’ve started reducing published effect sizes by ⅓ in my sample size calculations to account for this. I would then run a series of sample size calculations based on the upper and lower effect estimates. In my sample size justification I would include a statement justifying my choice of effect size, and if there were few relevant studies and I had to cast the net quite widely to find ‘similar’ studies then I would state this in my justification. The important thing is to be transparent in your reporting so that the reader/reviewer can follow your thinking.
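The approach described above can be sketched as follows. This is only an illustration: it assumes a two-group comparison of means, a standardized effect size (Cohen's d), and the usual normal approximation to the sample size formula, and the published effect sizes are made up. A dedicated power package would give slightly larger numbers because it uses the t distribution.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Hypothetical lower and upper published effect sizes for a 'similar' phenomenon
published = {"lower": 0.4, "upper": 0.8}

for label, d in published.items():
    shrunk = d * 2 / 3  # reduce by one third to allow for publication bias
    print(f"{label}: published d = {d:.2f} -> n = {n_per_group(d)} per group; "
          f"shrunk d = {shrunk:.2f} -> n = {n_per_group(shrunk)} per group")
```

Because the required n scales with 1/d², shrinking the effect size by a third roughly doubles the sample size, which is worth stating explicitly in the justification.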
Could you address the question of power, pilot studies, and how to address the issues you mentioned in grant applications?
There is a tension between current culture and the utility of pilot studies. Funders want to be sure the research they fund will be fruitful so encourage proof-of-concept or pilot data. This can be useful for showing the feasibility of the research procedure, a new method, or of being able to recruit relevant participants. What these small pilot studies are less informative for is estimating likely effect sizes as the results they produce will be imprecise and may be misleading. So I would focus on using the pilot study to evidence the feasibility of the study procedure, and then draw more widely on previously published effect sizes and data and/or minimal clinically important differences to base my sample size calculation on.
Many top Journals publish so much normalized data it is impossible to get a feel for the data. How do we fix this?
It is important to talk with editors and to advocate for clear and forward-looking policies on reporting and sharing data. It can be helpful to point out the NIH’s guidelines on pre-clinical reporting (https://www.nih.gov/research-training/rigor-reproducibility/principles-guidelines-reporting-preclinical-research). It can also be helpful to share policies from other journals, to help spur thought and discussion. For example, The Journal of Neurophysiology has extensively updated its author guidelines (and published accompanying tutorials/perspectives) to promote strong data transparency (https://journals.physiology.org/author-info.promoting-transparent-reporting). Remember that editors and reviewers are often overtaxed and contributing essentially volunteer work to running the journal--so it can be helpful to partner and ally with them in helping to envision improved policies for reporting.
How valuable can it be to discuss trends in data even if they miss statistical significance? (Happy to hear an answer from all.) Thank you all, this talk is very interesting!
Bob - I would argue that it is essential that we banish the ‘significant’ vs. ‘non-significant’ dichotomy. We should report and interpret all of our results, countenancing both the effect observed in the sample and our uncertainty about generalizing. Just because a finding has 0 in its confidence interval does not mean we should ignore it, or (worse) omit it from our manuscript, or (even worse) claim it as proof of a negligible effect. By the same token, just because a finding does *not* have 0 in its confidence interval does not mean we should take it as meaningful, certain, or established.
Kate - I tend to follow the approach advocated here https://www.bmj.com/content/322/7280/226.1 and report the point estimate, the confidence interval, and the associated p-value (always the exact p-value unless it is very small, i.e. p < 0.0001). I then try to avoid using the words ‘statistical significance’ and instead focus on describing what the point estimate and confidence interval are telling me.
To develop an initial estimate of sample size needed to achieve a certain power, can one use sequential statistical analysis to achieve an a priori defined significance level to efficiently identify a minimum sample size needed for power analysis?
I think there are two things conflated here. We can design a study based on a given effect size and calculate the sample size required to test that effect with a given alpha and power. This gives us certainty in planning and cost, as we know how many participants or samples we will need. But if we are uncertain about the likely effect size and we can be flexible about how many participants we can test, we could choose a different approach and opt for a sequential testing method where we recruit and test until we reach some predefined stopping criterion. This is often done in a Bayesian framework where repeated testing doesn’t incur a type I penalty. For example, we might set the criterion to a Bayes factor of 3 or ⅓, which could be interpreted as substantial evidence for your alternative hypothesis over the null, or vice versa, respectively.
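As a rough sketch of the sequential idea, the snippet below adds one observation at a time and stops once a Bayes factor reaches 3 or ⅓, or a maximum n is hit. The model is a deliberately simple assumption chosen so the Bayes factor has a closed form: a normal mean with known standard deviation, a point null of 0, and a normal prior on the mean under the alternative. It is not the method used in any particular package, and all names and settings are hypothetical.

```python
import random
from statistics import NormalDist, mean

def bf10(sample, sigma=1.0, tau=1.0):
    """Bayes factor for H1: mu ~ Normal(0, tau) versus H0: mu = 0.

    Assumes observations are Normal(mu, sigma) with sigma known, so the
    sample mean is Normal(0, se) under H0 and Normal(0, sqrt(tau^2 + se^2))
    under H1.
    """
    se = sigma / len(sample) ** 0.5
    xbar = mean(sample)
    like_h1 = NormalDist(0.0, (tau ** 2 + se ** 2) ** 0.5).pdf(xbar)
    like_h0 = NormalDist(0.0, se).pdf(xbar)
    return like_h1 / like_h0

def sequential_test(draw, min_n=5, max_n=200, upper=3.0, lower=1 / 3):
    """Collect data one observation at a time until a stopping criterion is met."""
    sample = [draw() for _ in range(min_n)]
    while True:
        bf = bf10(sample)
        if bf >= upper or bf <= lower or len(sample) >= max_n:
            return len(sample), bf
        sample.append(draw())

random.seed(7)  # fixed seed so the sketch is repeatable
# Hypothetical data source with a true mean of 0.5 and SD of 1
n_used, bf = sequential_test(lambda: random.gauss(0.5, 1.0))
print(f"stopped at n = {n_used} with BF10 = {bf:.2f}")
```

The trade-off is the one described above: the final sample size is not known in advance, so a maximum n is worth setting for budgeting purposes.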
What are the best inferential and descriptive statistics for describing the biomarker results of patients and controls to clinicians, if not methods like p-values?
It can be helpful to think about how a statistic can be used by both clinicians and other researchers. In this regard a p-value is arguably the least informative statistic we can provide. It simply tells us whether there is statistical evidence against the null; e.g., whether treatment A is better than treatment B. A clinician might also be interested in the size of a treatment effect (how much better drug A is than drug B), as they may wish to weigh any gain in effectiveness against any side-effects and so on. So at a minimum I would always report the point estimates from my statistical test and the corresponding confidence intervals, as these tell the clinician how precise the estimate of the treatment effect is. I would then think about what other statistics I could use to put the effect into context. One example is the Number Needed to Treat (NNT), that is, the average number of patients who need to receive treatment A for one additional positive outcome. The closer it is to 1, the more effective the treatment is. For example, if the NNT for drug A compared with drug B for recovery from depression is 4, it means you have to treat 4 people with drug A to get one additional recovery. As for descriptive statistics, I always encourage reporting means and standard deviations for continuous variables or, for categorical variables, the number in a given category and the percentage of the overall sample. These descriptive statistics can be invaluable to other researchers, particularly for meta-analyses and sample size calculations.
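For readers who have not calculated an NNT before, it is the reciprocal of the absolute difference in outcome rates between the two treatments. A minimal sketch, using made-up recovery rates chosen to reproduce the NNT of 4 in the example above:

```python
from math import ceil

def number_needed_to_treat(rate_treatment, rate_comparator):
    """NNT = 1 / (absolute difference in the rate of the positive outcome)."""
    absolute_benefit = rate_treatment - rate_comparator
    if absolute_benefit <= 0:
        raise ValueError("treatment shows no benefit over the comparator")
    return 1 / absolute_benefit

# Hypothetical recovery rates: 50% on drug A versus 25% on drug B
nnt = number_needed_to_treat(0.50, 0.25)
print(f"NNT = {nnt:.1f} (conventionally rounded up to {ceil(nnt)})")
```

In practice the NNT should be reported with a confidence interval too, since it inherits the uncertainty of the two rates it is built from.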
For people who do memory work with mice, would it be plausible to borrow effect sizes from human memory work?
Probably not. If the measurements are on different scales, you would be relying on standardized effect sizes. One difficulty with standardized effect sizes, though, is that they are normalized to the variation observed in the study. This can make it challenging to compare standardized effect sizes across contexts, because not only the effect but also the variation could be different. For example, eye-blink conditioning in humans might be more variable than eye-blink conditioning in an inbred strain of mice, and this would make comparison of standardized effect sizes problematic.
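A small numerical illustration of why this matters, with invented numbers: the same raw difference produces very different standardized effect sizes when the variability differs between contexts.

```python
def cohens_d(mean_difference, pooled_sd):
    """Standardized effect size: the raw difference divided by the pooled SD."""
    return mean_difference / pooled_sd

raw_difference = 10.0  # hypothetical raw effect, identical in both contexts

# Hypothetical standard deviations: a more variable human sample and a
# less variable inbred mouse strain
d_human = cohens_d(raw_difference, pooled_sd=20.0)
d_mouse = cohens_d(raw_difference, pooled_sd=5.0)
print(f"human d = {d_human:.1f}, mouse d = {d_mouse:.1f}")
```

Here the underlying effect is the same size in raw units, yet the standardized values differ four-fold purely because of the difference in variability.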
What if you want to report on a phenotype that you keep observing but are having issues qualifying it? When is it okay to report observations? Will there be scientists who completely disregard your data?
Is it always mandatory to use the "estimation method" as compared with the "frequentist" one?
Frequentist statistics is the philosophy of statistics that defines probability in terms of event counts and outcomes. A p-value is a frequentist statistic. But so is a confidence interval. That is, frequentists can conduct tests *and* they can make estimates. So it’s not an either/or--it’s just how you want to apply frequentist statistics.
When reviewing literature, what statistical information is most important to record? I'd like to collect relevant information the first time I read a paper rather than missing something important and having to dig through the study a second or third time.
The PRISMA guidelines might be useful for helping you think through what you want to extract from each study as you work through a literature review: http://www.prisma-statement.org/
What is the difference between bootstrap and jackknife techniques?
Are there advantages for bootstrapping over permutations testing?
Is there a significant difference between bootstrapping and generating artificial data based on the data you have available?
There is a lot of confusion over the names/procedures for resampling-based techniques. I like the taxonomy by Rodgers (1999) for helping to clarify the space of different resampling approaches.
Wikipedia also has a pretty clear explanation of the different approaches: https://en.wikipedia.org/wiki/Resampling_(statistics)
I’ve found Hesterberg (2015) to be the most clear and complete single source for understanding more about bootstrap techniques.
Rodgers, J. L. (1999). The bootstrap, the jackknife, and the randomization test: A sampling taxonomy. Multivariate Behavioral Research, 34(4), 441–456. https://doi.org/10.1207/S15327906MBR3404_2
Hesterberg, T. C. (2015). What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. American Statistician, 69(4), 371–386. https://doi.org/10.1080/00031305.2015.1089789
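As a rough illustration of how the three techniques differ mechanically, here is a minimal sketch on made-up data. The bootstrap resamples with replacement to build an interval, the jackknife leaves out one observation at a time to estimate a standard error, and the permutation test shuffles group labels to build a null distribution. The function names are hypothetical and these are the simplest textbook variants, not necessarily the ones recommended by Rodgers or Hesterberg.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical measurements from two independent groups
group_a = [4.1, 5.3, 6.0, 5.5, 4.8, 6.2, 5.9, 5.1]
group_b = [5.9, 6.8, 7.1, 6.4, 7.5, 6.0, 6.9, 7.3]
observed = mean(group_b) - mean(group_a)

def bootstrap_ci(a, b, n_boot=2000, level=0.95):
    """Percentile bootstrap CI for the difference in means (b - a)."""
    diffs = sorted(
        mean(random.choices(b, k=len(b))) - mean(random.choices(a, k=len(a)))
        for _ in range(n_boot)
    )
    alpha = 1 - level
    low = diffs[int(n_boot * alpha / 2)]
    high = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high

def jackknife_se(x):
    """Jackknife standard error of the mean (leave one observation out at a time)."""
    n = len(x)
    leave_one_out = [mean(x[:i] + x[i + 1:]) for i in range(n)]
    centre = mean(leave_one_out)
    return ((n - 1) / n * sum((m - centre) ** 2 for m in leave_one_out)) ** 0.5

def permutation_p(a, b, n_perm=2000):
    """Two-sided permutation p-value for the difference in means."""
    pooled = list(a) + list(b)
    target = abs(mean(b) - mean(a))
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if abs(diff) >= target:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # count the observed labelling as well

ci_low, ci_high = bootstrap_ci(group_a, group_b)
se_a = jackknife_se(group_a)
p_perm = permutation_p(group_a, group_b)
print(f"difference = {observed:.2f}, bootstrap 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"jackknife SE of group A mean = {se_a:.3f}, permutation p = {p_perm:.4f}")
```

One way to read the comparison is that the bootstrap and jackknife are estimation tools (intervals and standard errors), while the permutation test is a testing tool built around a null of no group difference.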
In neuroscience research, we can only afford to use small numbers of animals or slices, 4 pairs to 12 pairs at most. Therefore, the sample size is always small. Bayesian and bootstrap methods do not apply since they assume a large sample size.
Changing statistical approaches cannot change the fact that, all else being equal, smaller samples provide less certain information than larger samples.
The goal in research is to match the sample size to the research question--to obtain enough data to reliably generalize from your results. When you study patterns/trends that are abundantly clear and qualitative, it can be possible to use relatively small sample sizes. There is not a firm definition of what counts as a qualitative effect, but a rule of thumb is that there would be essentially no overlap in the distributions between groups and that group means would be 3 or more standard deviations apart. With very large and qualitative differences like this, it can be reasonable to use relatively small samples, though knowing that subtle differences and effects will not be resolvable.
In general, the best proof that a sample size is reasonable is consistency of results across independent replications. If you can regularly obtain the same result with n = 6 across independent replications, then n = 6 is fine. If you see substantive fluctuations in results across independent replications, then either there are procedural problems or the sample size is not adequate.
What is the best statistical approach when you have two unbalanced populations? In other words, two populations with very different sample sizes?
Regarding Christophe's question about cultural change, what role should funding agencies play?
Funders could have a tremendous influence, not only by setting standards for proposing and reporting research, but also by funding the training needed to improve rigor and reproducibility in science. For example, in the U.S. the NIH has a program to fund development of free training programs: https://www.nigms.nih.gov/training/pages/clearinghouse-for-training-modules-to-enhance-data-reproducibility.aspx
Kate: Funders can be hugely influential in promoting culture change. What they say, we often do. Many funders in the UK have updated their funding application procedures to ask for evidence of reproducible research methods, and funders such as the MRC, in conjunction with the NC3Rs, offer training for their grant review panelists that focuses on what to look for in a sound sample size justification.
Do you know if reviewers in NIH or NSF panels are trained in Estimation Statistics? This could be a starting point.
Bob: I’m not aware of any specific effort to ensure NIH or NSF panels know estimation statistics. Anecdotally, I’ve emphasized estimation in my recent neuroscience papers and NIH proposals and this has never been a problem. I do often mention how to interpret a confidence interval as a statistical test to help those who are focused on thinking only in terms of statistical significance.
Why is an interval null superior?
Could you comment on the importance of null models when testing hypotheses?
Could you give an example of how one would use interval nulls?
Many researchers who use hypothesis testing use a point null hypothesis. For example, in comparing the means of two groups, it is common to test against a null of exactly no difference (Hnull: Mdiff = 0). It is possible (though currently less common) to use an interval null--a range of values to be considered ‘essentially 0’. For example, we could compare the means of two groups using an interval null of +/- 10% (Hnull: -10% < Mdiff < +10%).
Interval nulls are better:
Point nulls are overwhelmed by sample size--with very large sample sizes, even extremely trivial differences will be flagged as statistically significant. Interval nulls allow the researcher to specify the range of effects that are trivial and to test only for effects that are clearly non-negligible. They work well even with very large sample sizes.
Interval nulls unify statistical and practical significance - you know the old warning that ‘statistical significance does not mean practical significance’. That’s true when we test against a point null because we’re only rejecting a single parameter value. But with an interval null, the test is significant only if the data are incompatible with the entire null interval--only if it is reasonable to rule out all negligible/trivial effects. Note that defining what counts as negligible/trivial can be a challenge, but that would still have been an issue when conducting a point-null test, and with an interval null one is challenged to think through and define this *before* seeing the results.
Testing against point nulls provides no way to accept the null. A null of exactly 0.00000 is probably never true, and also there is no procedure with a point null to ‘accept’ the null. This makes for a poor set of decision-making tools for scientists. With an interval null, we get expanded options for testing: when the entire CI is outside the interval null we have clear evidence the effect is meaningful (non-negligible), when the entire CI is inside the interval null we have clear evidence the effect is negligible, and when the CI is partially in the interval null we have an indeterminate result. So an interval null gives us 3 decision-making options: demonstrate meaningfulness, demonstrate negligibility, or an indeterminate test.
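The three-way decision rule described above can be sketched in a few lines. The function name and the example intervals are hypothetical; the interval null of -10% to +10% is the one from the example earlier in this answer.

```python
def interval_null_decision(ci_low, ci_high, null_low, null_high):
    """Compare a confidence interval with an interval null.

    Returns 'meaningful' if the CI lies entirely outside the null interval,
    'negligible' if it lies entirely inside, and 'indeterminate' otherwise.
    """
    if ci_low > null_high or ci_high < null_low:
        return "meaningful"
    if ci_low >= null_low and ci_high <= null_high:
        return "negligible"
    return "indeterminate"

NULL_LOW, NULL_HIGH = -10.0, 10.0  # interval null: -10% < Mdiff < +10%

# Hypothetical 95% CIs for a mean difference, in percent
examples = {
    "large effect": (12.0, 30.0),
    "small study": (-5.0, 25.0),
    "huge study, tiny effect": (0.5, 1.5),
}
for label, (low, high) in examples.items():
    decision = interval_null_decision(low, high, NULL_LOW, NULL_HIGH)
    print(f"{label}: CI [{low}, {high}] -> {decision}")
```

The last example is the case where a point null is overwhelmed by sample size: the interval excludes 0, so a point-null test would be statistically significant, yet the whole interval sits inside the range defined as trivial.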
Can you use the variability of a well-known method to predict the minimum effect size you can detect for a given sample size?
Bob: Yes. If you have a good estimate of variability in a measure, you could use that to a) estimate the margin of error for any given sample size of that measure and/or b) obtain power curves (expected power by effect size for a range of different sample sizes).
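Both options can be sketched with the normal approximation. This is an illustration under stated assumptions rather than a definitive implementation: the standard deviation is treated as known, the design is a two-group comparison of means, and the function names and numbers are hypothetical. Exact t-based calculations would differ slightly at small n.

```python
from statistics import NormalDist

Z = NormalDist()

def margin_of_error(sd, n, level=0.95):
    """Expected half-width of the CI for a single mean, treating sd as known."""
    return Z.inv_cdf(0.5 + level / 2) * sd / n ** 0.5

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test for standardized effect d."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the z statistic
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the requested power."""
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) * (2 / n_per_group) ** 0.5

# a) Margin of error for a measure with a hypothetical SD of 10 units
for n in (6, 12, 25, 100):
    print(f"n = {n}: margin of error = {margin_of_error(10.0, n):.2f} units")

# b) Power curve: expected power by effect size for several sample sizes
for n in (6, 12, 63):
    curve = ", ".join(f"d={d}: {power_two_sample(d, n):.2f}" for d in (0.2, 0.5, 0.8, 1.5))
    print(f"n = {n} per group -> {curve}; minimum detectable d = {min_detectable_d(n):.2f}")
```

To convert the minimum detectable d back into raw units, multiply it by the standard deviation of the measure.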