Section 8: Reliability and Open Science
Sadly, it has become clear that many published scientific studies fail to replicate; that is, other scientists are unable to produce the same results, so the reported findings cannot be relied upon by others (and therefore do not represent a genuine contribution to human knowledge). While this problem partly reflects inherent challenges in scientific discovery, repeated and pervasive failures of reproducibility call the value of science into question (see Organizing Principles). Lack of reproducibility has many sources, which we should understand so that we can minimize their influence in our work (and avoid being misled when reading papers from other labs):
- Dependence on unreliable methods. This includes “p-hacking” (exploiting excessive researcher degrees of freedom in study methods), “HARKing” (hypothesizing after results are known), and widespread misunderstanding of which statistical approaches are appropriate for exploratory as opposed to confirmatory analyses; also, use of underpowered study designs and related confusion about the difference between power and positive predictive value (see the short sketches after this list). For more background, see Simmons, Nelson & Simonsohn 2011, Button et al. 2013, and Ioannidis 2005 in the Resources\Papers-methods-and-reliability folder of the R: drive.
- Human error. Every time someone touches or manipulates a dataset, there is a chance of introducing unintended errors. We expect human beings to make errors, so we have to be vigilant about checking our own work and asking other members of our team to help check it. This means that we have to use methods that make it possible to identify and diagnose errors, such as tracking and logging transformations of data. This is one reason that we strongly prefer that manipulations be performed in code rather than manually in programs like Excel (see again Deprecated/discouraged tools in Section 3). We also always want to preserve and archive our raw, unmanipulated datasets so that we can start over if absolutely necessary.
- “Bad” luck. Suppose that 20 different research teams are investigating the same false hypothesis. Because the null hypothesis is true (there is no real effect), at a conventional significance threshold of 0.05 it is likely that at least one of these teams will reach a statistically significant result by chance (see the short calculation after this list). Given the publishing bias for positive over negative studies (see the next point), this false finding is likely to be published while the 19 correct, negative studies go unreported. Publishing a paper based on a false chance finding is a hollow accomplishment at best; because labs tend to base future studies on previous findings, it can also lead to years of wasted work pursuing scientific dead ends.
- Perverse incentives in science. Researchers face tremendous career pressures to publish in order to keep their jobs or secure funding, and publishers generally favor positive over negative results and “surprising” findings over those that follow more directly from what is already known. All of these incentives promote the production and dissemination of false positive findings.
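To make the arithmetic behind the “bad luck” point concrete, here is a minimal R sketch. The numbers simply mirror the 20-team scenario above; nothing here is specific to any of our studies.

```r
# Chance that at least one of 20 teams testing a true null hypothesis
# obtains p < .05 purely by chance.
n_teams <- 20
alpha   <- 0.05
1 - (1 - alpha)^n_teams
#> [1] 0.6415141

# The same answer by simulation: draw 20 "null" p-values many times and
# count how often at least one falls below the threshold.
set.seed(1)
mean(replicate(10000, any(runif(n_teams) < alpha)))  # should be close to 0.64
```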
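Relatedly, the difference between power and positive predictive value (PPV) mentioned in the first bullet can be computed directly. The sketch below follows the logic laid out in Ioannidis 2005 and Button et al. 2013; the prior probabilities and power values are illustrative assumptions, not lab defaults.

```r
# Positive predictive value (PPV): the probability that a "significant"
# result reflects a true effect, given the prior probability of a true
# effect, statistical power, and the significance threshold.
ppv <- function(prior, power, alpha = 0.05) {
  true_positives  <- prior * power
  false_positives <- (1 - prior) * alpha
  true_positives / (true_positives + false_positives)
}

# A well-powered study of a plausible hypothesis:
ppv(prior = 0.5, power = 0.80)   # ~0.94
# An underpowered study of a long-shot hypothesis:
ppv(prior = 0.1, power = 0.20)   # ~0.31
```

The point is that a “significant” result from an underpowered test of an unlikely hypothesis can easily be more likely wrong than right, even when the analysis itself is done honestly.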
Finally, there’s one more sense of “reproducibility” that is important for the organization of our lab. As emphasized earlier, most members of the lab stay for only 2-3 years, so much of the work you’re doing now will be completed by future members. You should document your work in a way that will be intelligible to future lab members, who can then pick it up without needing to rebuild your analyses from scratch.
8.1 Approaches to promote reliability
There are some broad principles and practices that we use in the lab to promote the reliability of our work, which everyone (particularly those engaged in quantitative research) should understand:
Distinguishing exploratory vs. hypothesis-driven design. One of the major problems in quantitative research is that many scientists’ aims are in fact exploratory (e.g., looking for new phenomena and elucidating processes that haven’t yet been discovered), but our accepted statistical thresholds for publication (e.g., p values and confidence intervals) implicitly assume confirmatory (hypothesis-driven) methods. As a result, many scientists, often without really understanding the statistical issues involved, report exploratory work as if it were hypothesis-driven (e.g., reporting p values) even though the hypotheses being tested were not actually formulated prior to looking at the data. This violates the assumptions of significance testing and so contributes to false positive reports (the short simulation below illustrates why). Unfortunately, how to properly report exploratory research remains an open question that the scientific community is still working out. Still, we should be careful in our thinking about when our aims are exploratory and when they’re hypothesis-driven, and we should not try to pass off exploratory work as hypothesis-driven.
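A quick way to see why this matters: the simulation below (a hypothetical sketch, not one of our analyses) tests ten candidate predictors that are all pure noise and then “reports” only the strongest result, which is roughly what happens when hypotheses are chosen after looking at the data.

```r
# Test 10 candidate predictors that are all pure noise, then keep only the
# smallest p-value, as if that "hypothesis" had been planned all along.
set.seed(123)
n_subjects   <- 50
n_candidates <- 10
n_sims       <- 2000

smallest_p <- replicate(n_sims, {
  outcome    <- rnorm(n_subjects)
  predictors <- matrix(rnorm(n_subjects * n_candidates), ncol = n_candidates)
  p_values   <- apply(predictors, 2, function(x) cor.test(x, outcome)$p.value)
  min(p_values)
})

# Nominal false positive rate is 5%, but cherry-picking the best of 10
# null tests yields "significance" far more often.
mean(smallest_p < 0.05)
```

With ten null tests, the chance that the best-looking one clears p < .05 is about 1 − 0.95^10, roughly 40%, which is eight times the nominal error rate.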
Study pre-registration when feasible; in all cases, pre-planning. If we are conducting a hypothesis-driven study, it generally makes good sense to pre-register our study design; we typically use the Open Science Framework for this. Pre-registration has been standard practice in randomized controlled trials for many years and is now being extended to other kinds of experimental research. Even when we are not pre-registering a study, or we are planning purely exploratory work, it is a useful exercise to think in advance about what questions we are interested in and what the best approaches to answering them would be, before we start poking around in datasets. This can help us avoid treating noise in our data as signal.
Documentation. This is something I struggle with, but it is crucially important both for other members of the lab and for yourself. It’s natural to assume, when you’re totally cognitively engaged in a problem, that it will be easy to recall later what you did and why. In fact, we all tend to overestimate our recall for these details once we move on to new problems. It’s really important to use our electronic lab notebooks in the private Github repo to document our thought processes. (This will help you in the future as well.) This is particularly crucial for keeping a paper trail of analytic decisions, data cleaning, and other choices that affect findings (e.g., excluding outliers, and the reasons why).
“Literate programming.” This refers to a different way of thinking about work that we do in code. As expressed by the computer scientist Donald Knuth:
“Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.”
- In other words, think of your code (whether in R, Stata, Matlab, or any other language) as an essay intended to communicate to other human beings. It should follow the order and logic of human thought, not just the forms imposed by the computer. It is not only important that the code “works” (in that it makes the computer perform the operations you intend); it’s also important that people who read your scripts (including your future self) can understand the flow of your thinking and what each piece of code is intended to do. Section 9: RStudio Analysis Pathway provides a detailed tutorial on one implementation of literate programming, and a short, hypothetical example appears at the end of this subsection.
- Publishing our code. Increasingly, there is an expectation that when someone publishes a scientific study, they also make their datasets and code publicly accessible so that other scientists can check their work. After all, if you’re not confident enough in your code to let other scientists review it, should you be confident enough to publish findings based on that code? And if you’re reading a study whose authors are not willing to publish their code, should you believe what they report? As noted in Section 3.7: Open Science Framework, we publicly post our code when we publish papers. This is another reason to work hard to keep your code clean and readable: we expect reviewers, editors, and readers to be looking at it.
- Coming soon: code review. As above, human error is common and expected. We should all be diligent about checking our own work, but in most cases we also need other people to help us find the errors that we can’t find ourselves. Software developers don’t rely solely on individual programmers to find errors in their code; instead, they use collaborative tools so that teams can check one another’s work. This is one of the reasons why all of our code is meant to be shared in the Github repository, rather than being distributed across different people’s desktop computers. In the coming year, I’m hoping to start using collaborative tools in Github that facilitate review of our code by other lab members. This is another reason why it’s important to write “literate code” that can be readily understood by others, and not just a series of computational operations.
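To give a flavor of what “literate” code looks like in practice, here is a short, entirely hypothetical R script; the dataset, variable names, and exclusion rule are invented for illustration, and Section 9: RStudio Analysis Pathway describes the workflow we actually use.

```r
# ---------------------------------------------------------------
# Research question (hypothetical): do participants who score lower on a
# financial-decision task report more money-management errors?
# This script: (1) loads the cleaned dataset, (2) applies the exclusion
# rule documented in the lab notebook, (3) fits the primary model.
# ---------------------------------------------------------------

library(dplyr)

# (1) Load the cleaned data. The file name is a placeholder.
dat <- read.csv("example_cleaned_data.csv")

# (2) Exclusion rule (documented in the notebook entry for this analysis):
#     drop participants missing the task score, and note how many were dropped.
n_before <- nrow(dat)
dat      <- dat %>% filter(!is.na(task_score))
message("Excluded ", n_before - nrow(dat), " participants with missing task scores")

# (3) Primary model: money-management errors as a function of task score,
#     adjusting for age.
primary_model <- lm(money_errors ~ task_score + age, data = dat)
summary(primary_model)
```

Note how the comments follow the narrative of the analysis (question, exclusions, model) rather than simply restating what each line of code does.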
8.2 Open science
The term “open science” is used in many different ways by different scientists, but in general the ideas above follow many of the broad principles of open science. We want to produce work that can be relied upon, used, and extended by other scientists. When feasible, we will pre-register studies, particularly for hypothesis-driven work. We will also aim to post our code when we publish our studies, in part to encourage ourselves to check our code carefully. When permitted by publishers’ policies, we will post free versions of our papers on our lab website so that scholars everywhere can read them.
One respect in which our science may not be fully open has to do with our datasets. Much of our work involves stigmatized disorders such as dementia, or potentially sensitive topics such as money management or vulnerability to financial errors in aging. Before sharing data from a published study, we should review that study’s consent forms as well as MAC data sharing policies to see what forms of data sharing are consistent with center policies and with participants’ reasonable expectations regarding the privacy of potentially sensitive data. When feasible and ethically permissible, we should aim to share our datasets (potentially in redacted form).