Double Dipping in Machine Learning: Problems and Solutions (2024)

Biol Psychiatry Cogn Neurosci Neuroimaging. Author manuscript; available in PMC 2020 Aug 12.

Published in final edited form as:

Biol Psychiatry Cogn Neurosci Neuroimaging. 2020 Mar; 5(3): 261–263.

Published online 2019 Sep 16. doi:10.1016/j.bpsc.2019.09.003

PMCID: PMC7422774

NIHMSID: NIHMS1615959

The Problem of Double Dipping

Building valuable predictive models requires finding the right level of complexity to avoid overfitting or underfitting the data. A modeling procedure that closely matches every nuance of the data (i.e., overfitting) will have high accuracy for the original dataset to which it is applied but will fail to replicate in new datasets. In contrast, an overly general model (i.e., underfitting) may perform consistently across datasets yet provide low accuracy and thus low utility. Double dipping is a term for overfitting a model through both building and evaluating the model on the same dataset, yielding inappropriately high statistical significance and circular logic.

Making predictions with or without machine learning involves two steps. The first step is to determine which variables to use in predicting the outcome (i.e., feature selection). The second step is to assess how accurately a model predicts the outcome (i.e., model evaluation). Double dipping occurs when features are selected using the same criteria, and in the same sample, as model evaluation. In other words, if one first searches for predictor variables that relate to the outcome within a sample of subjects, and then builds a predictive model containing only those variables, the model will unavoidably demonstrate high accuracy in that sample. This accuracy may be based on chance relationships specific to the sample, and such a double-dipped model will therefore be highly biased, consistently overestimating how the model will perform outside the specific sample (2).
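The inflation described above can be reproduced with pure noise. The following is a minimal sketch rather than the authors' analysis: it assumes NumPy, an invented dataset of 100 subjects and 1000 random predictors, and a simple nearest-centroid rule swapped in for a full machine learning model. The helper names (`select_features`, `fit_centroids`, `predict`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_features, n_keep = 100, 1000, 20  # illustrative sizes (assumed)

# Pure noise: no predictor truly relates to the outcome.
X = rng.standard_normal((n_subjects, n_features))
y = np.repeat([0, 1], n_subjects // 2)


def select_features(X, y, k):
    """Indices of the k predictors with the largest absolute group mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(np.abs(diff))[-k:]


def fit_centroids(X, y):
    """Mean predictor vector of each outcome group."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)


def predict(X, centroids):
    """Assign each subject to the group with the nearer centroid."""
    c0, c1 = centroids
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


# Double dipping: select, fit, and evaluate in the same sample.
keep = select_features(X, y, n_keep)
model = fit_centroids(X[:, keep], y)
double_dipped_acc = (predict(X[:, keep], model) == y).mean()

# The same model applied to a fresh sample of noise.
X_new = rng.standard_normal((n_subjects, n_features))
y_new = np.repeat([0, 1], n_subjects // 2)
new_sample_acc = (predict(X_new[:, keep], model) == y_new).mean()

print(f"same-sample accuracy:  {double_dipped_acc:.2f}")
print(f"fresh-sample accuracy: {new_sample_acc:.2f}")
```

Because nothing in these data is predictive, accuracy in the fresh sample should hover near chance, whereas the same-sample accuracy is expected to sit well above it.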

Double Dipping in Random Forests: An Example

One promising and widely used machine learning algorithm is random forest modeling (3), which protects against double dipping by building hundreds of decision “trees” to predict an outcome. Each tree is created using a different random subsample of predictor variables and a different random subsample of subjects. Crucially, each tree is evaluated using the subjects that were not included in its creation. In this way, the performance of the full “forest,” aggregated across all trees, is estimated without double dipping.
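The out-of-bag evaluation described above can be sketched with scikit-learn, which is an assumption of this example and not necessarily the software used in the original analysis. Note that scikit-learn's `RandomForestClassifier` draws a bootstrap sample of subjects for each tree and a random subset of predictors at each split, which differs slightly from the per-tree description in the text. The data below are invented noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((136, 50))  # random "predictors" (assumed sizes)
y = np.repeat([0, 1], 68)           # balanced outcome unrelated to X

forest = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,   # each tree sees a bootstrap sample of subjects
    oob_score=True,   # score each subject using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

train_acc = forest.score(X, y)  # evaluated on the subjects used for fitting
oob_acc = forest.oob_score_     # evaluated on out-of-bag subjects only

print(f"training accuracy:   {train_acc:.2f}")
print(f"out-of-bag accuracy: {oob_acc:.2f}")
```

On noise, the training accuracy is expected to be near perfect while the out-of-bag accuracy stays close to chance, which is the protection the paragraph above describes.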

However, even random forest’s internal protection against double dipping cannot prevent all double-dipping issues. For example, double dipping will occur if one applies random forest modeling twice using the same subjects: first on the full set of variables to identify those that perform best and then on the subset of the best-performing variables to evaluate the performance of a more parsimonious model. This, in fact, was what we initially tried when applying random forest using clinical, demographic, neuropsychological, and magnetic resonance imaging variables to predict which youths would transition to moderate to heavy alcohol use (4). We use that dataset and procedure as an example to show how to detect and avoid double dipping in machine learning–based predictive models.

Recommendations to Detect and Avoid Double Dipping

Given that issues of double dipping can plague even the most well-intentioned researcher, it is important to implement checks to detect inadvertent double dipping (e.g., random data testing and permutation testing) and strategies to avoid double dipping (e.g., feature selection only, model evaluation only, and cross-validation). We describe each of these in detail using the example dataset and procedure described above.

Strategies to detect double dipping include random data testing and permutation testing. Running the analysis using completely random data (i.e., random data testing) is a quick way to detect whether an analytic procedure involves double dipping. A hallmark of double dipping is that a greater number of predictor variables yields greater accuracy, because double dipping capitalizes on chance associations, which is easier with a greater number of variables (Figure 1A).

Figure 1.

Strategies for detecting double dipping. (A) Results of random data test generated using a dataset of entirely random numbers representing a varying number of “predictor variables” (first column), and a random binary “outcome,” evenly distributed in 136 “subjects.” Because the data are random noise, model performance should be approximately 50% and should not improve dramatically with an increasing number of random predictors, as in the fair model with all variables (second column). However, with a 2-step random forest procedure that includes double dipping to select a subset of variables (third column), the model based on fully random data shows high accuracy, especially with a large number of predictors (final column). (B) Results of a permutation test on a random forest analysis procedure that included double dipping. The red line indicates expected average accuracy of permuted outcome data if no double dipping were present (outcome base rate). The blue line indicates average accuracy of permuted data using double-dipped analysis procedure. The green line indicates observed accuracy in double-dipped analysis with real data. The black line indicates range of accuracy with 2-tailed p < .05.
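A random data test of this kind might look like the sketch below. It is an illustration under stated assumptions, not a reproduction of Figure 1A: scikit-learn is assumed, the predictor counts (20, 200, 1000), the number of retained variables (10), and the use of impurity-based `feature_importances_` for ranking are all choices made for this example, and `oob_accuracy` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_accuracy(X, y, seed):
    """Fit a random forest; return its out-of-bag accuracy and variable importances."""
    forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=seed)
    forest.fit(X, y)
    return forest.oob_score_, forest.feature_importances_


rng = np.random.default_rng(0)
n_subjects, n_keep = 136, 10
y = np.repeat([0, 1], n_subjects // 2)  # balanced binary "outcome"

results = []
for n_predictors in (20, 200, 1000):
    X = rng.standard_normal((n_subjects, n_predictors))  # pure noise

    # Fair model: one forest on all variables, scored out of bag.
    fair_acc, importances = oob_accuracy(X, y, seed=0)

    # Double-dipped 2-step procedure: keep the top-ranked variables,
    # then refit and score on the same subjects.
    top = np.argsort(importances)[-n_keep:]
    double_dipped_acc, _ = oob_accuracy(X[:, top], y, seed=1)

    results.append((n_predictors, fair_acc, double_dipped_acc))

for n_predictors, fair_acc, double_dipped_acc in results:
    print(f"{n_predictors:5d} predictors: fair={fair_acc:.2f}, "
          f"double-dipped={double_dipped_acc:.2f}")
```

If the procedure is free of double dipping, both columns should stay near 50% regardless of the number of predictors; a double-dipped column that climbs as predictors are added is the warning sign described in the caption.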

A more computationally intensive test to detect double dipping is a permutation test, in which real data are used but the outcome variable is permuted (i.e., scrambled), resulting in an estimate of the null distribution of predictive accuracy based on one’s procedure and data structure. In our example, we used real data from 137 subjects and 203 clinical, demographic, neuropsychological, structural magnetic resonance imaging, and functional magnetic resonance imaging variables, permuted the outcome variable (here, the initiation of alcohol use in adolescence), and ran the double-dipped analysis procedure. We repeated this process 1000 times, repermuting the outcome and recording the accuracy each time (Figure 1B).

This permutation test can provide evidence either for or against double dipping. If there is no double dipping, the null distribution obtained through permutation testing should roughly match the base rate of the outcome in the sample. In this example, 50% of adolescents in our sample became moderate to heavy alcohol users (Figure 1B, red line). If there is double dipping, the null distribution obtained through permutation testing will not match the base rate (Figure 1B; compare blue and red lines). Permutation testing also has the benefit of providing a range of accuracy values that can be considered better than chance at p < .05 (Figure 1B, black arrow). This allows the accuracy estimates obtained on real, nonpermuted data (Figure 1B, green line) to be placed in context.
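The logic of the permutation test can be sketched as follows. This is a hedged illustration, assuming NumPy: the data are random stand-ins with the same dimensions as the example (137 subjects, 203 variables), and the double-dipped random forest procedure is replaced by a much simpler, hypothetical stand-in (`double_dipped_accuracy`) so that the block runs quickly. The p value shown is one-sided, an assumption of this sketch.

```python
import numpy as np


def double_dipped_accuracy(X, y, k=10):
    """Stand-in procedure (assumed): select the k variables with the largest
    group mean difference, then score a nearest-centroid rule on the same subjects."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    diff = c1 - c0
    keep = np.argsort(np.abs(diff))[-k:]
    midpoint = (c0[keep] + c1[keep]) / 2
    score = (X[:, keep] - midpoint) @ diff[keep]  # > 0 means nearer to group 1
    return ((score > 0).astype(int) == y).mean()


def permutation_null(procedure, X, y, n_permutations, rng):
    """Accuracy of the full procedure under repeatedly scrambled outcomes."""
    return np.array([procedure(X, rng.permutation(y)) for _ in range(n_permutations)])


rng = np.random.default_rng(0)
X = rng.standard_normal((137, 203))  # stand-in for the real predictors
y = rng.integers(0, 2, size=137)     # stand-in for the real outcome

observed = double_dipped_accuracy(X, y)
null = permutation_null(double_dipped_accuracy, X, y, 1000, rng)

base_rate = max(y.mean(), 1 - y.mean())               # accuracy of always guessing the majority
low, high = np.percentile(null, [2.5, 97.5])          # central 95% of the null distribution
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))  # one-sided

print(f"base rate {base_rate:.2f}, null mean {null.mean():.2f}, "
      f"95% null range [{low:.2f}, {high:.2f}], observed {observed:.2f}, p={p_value:.3f}")
```

A null mean that sits well above the base rate indicates that the procedure itself manufactures accuracy, independent of any real signal.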

We suggest three strategies to avoid double dipping: feature selection only approaches, model evaluation only approaches, and cross-validation approaches.

Double dipping can be avoided by focusing exclusively on feature selection. In our example, we corrected our manuscript to focus entirely on the best variables selected by the random forest models (4). Random forest modeling is an extremely useful tool for feature selection because of its ability to identify complex interactions among variables. A thorough understanding of the variables and combinations of variables that most contribute to predicting an outcome allows future research to focus on evaluating predictive models grounded in this understanding.

When “feature selection only” approaches have already been conducted, or when there are strong theoretical grounds for feature selection, model evaluation only approaches may be appropriate. By selecting features based on either theory or previous research, models can be evaluated in a single sample. The preregistration of selected features and the analytic plan is particularly important for model evaluation approaches to ensure that models are not being altered based on the same data on which they are evaluated.

Cross-validation is a powerful empirical approach to avoid double dipping (1). In cross-validation, the dataset is split into a training set that is exclusively used for feature selection and a held-out validation set that is exclusively used for model evaluation. This is often done iteratively (k-fold cross-validation), avoiding double dipping by splitting the sample into k subsamples and running the analysis k times, each time using k − 1 subsamples for feature selection and the final subsample to evaluate the resulting model. Although k-fold cross-validation generates multiple slightly different models rather than a single model, it provides a set of consensus predictors for future model building and evaluation and an unbiased cross-validated estimate of predictive accuracy (averaged across the k models). This procedure, however, requires sufficient power for a stable estimate of accuracy in only a fraction of the sample. One could decrease k to increase the proportion of the sample on which accuracy is estimated; however, this decreases the proportion of the sample available to identify consensus variables. Thus, with a greater k, the feature selection stabilizes, while the model accuracy estimates begin to vary widely. In contrast, with smaller k there is less consensus regarding the best-performing variables. Because of this, we recommend that k-fold cross-validation of random forest analysis be attempted only in samples of ≥200 subjects, with smaller samples focusing their power on feature selection.
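The k-fold procedure can be written out explicitly. The sketch below assumes NumPy, invented noise data (200 subjects, 500 variables), and a nearest-centroid classifier standing in for random forest; the function names and the definition of "consensus" as variables selected in every fold are assumptions made for illustration.

```python
import numpy as np


def select_features(X, y, k):
    """Indices of the k variables with the largest absolute group mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(np.abs(diff))[-k:]


def fit_predict(X_train, y_train, X_test):
    """Nearest-centroid stand-in classifier: fit on training data, predict test data."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    score = (X_test - (c0 + c1) / 2) @ (c1 - c0)
    return (score > 0).astype(int)


def cross_validated_accuracy(X, y, n_folds, n_keep, rng):
    """Feature selection and fitting use k - 1 folds; evaluation uses the held-out fold."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    fold_acc, selected = [], []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        keep = select_features(X[train], y[train], n_keep)  # training folds only
        pred = fit_predict(X[train][:, keep], y[train], X[test_idx][:, keep])
        fold_acc.append((pred == y[test_idx]).mean())
        selected.append(keep)
    counts = np.bincount(np.concatenate(selected), minlength=X.shape[1])
    return float(np.mean(fold_acc)), counts


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))  # pure noise, so chance accuracy is expected
y = np.repeat([0, 1], 100)

cv_acc, selection_counts = cross_validated_accuracy(X, y, n_folds=5, n_keep=10, rng=rng)
consensus = np.flatnonzero(selection_counts == 5)  # selected in every fold

print(f"cross-validated accuracy: {cv_acc:.2f}")
print(f"variables selected in all folds: {consensus.tolist()}")
```

Because selection never sees the held-out fold, the cross-validated accuracy on noise should remain near chance, in contrast to a procedure that selects and evaluates on the same subjects.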

The Bigger Picture

Overall, machine learning offers great potential to contribute significantly to the most difficult problems in psychiatry by leveraging complex datasets to predict disorder onset and recovery. However, enthusiasm for machine learning should be tempered by caution and consultation to ensure that models are methodologically sound and can generalize outside the data on which they were developed. Machine learning has the promise to generate clinically relevant predictive models, but this is useful only if models are valid, sensitive, specific, and generalizable, and yield actionable predictions that have real-world benefits.

Acknowledgments and Disclosures

This work was supported by National Institute of Mental Health Grant Nos. T32-MH019938 and K23-MH113708 (to TMB).

We thank Matthias Guggenmos for identifying the double-dipping issue in our random forest analysis and for bringing it to our attention.

The authors report no biomedical financial interests or potential conflicts of interest.

References

1. Kriegeskorte N, Simmons WK, Bellgowan PS, Baker CI (2009): Circular analysis in systems neuroscience: The dangers of double dipping. Nat Neurosci 12:535–540.

2. Fortmann-Roe S (2012): Understanding the bias-variance tradeoff. Available at: http://scott.fortmann-roe.com/docs/BiasVariance.html. Accessed September 16, 2019.

3. Breiman L (2001): Random forests. Machine Learning 45:5–32.

4. Squeglia LM, Ball TM, Jacobus J, Brumback T, McKenna BS, Nguyen-Louie TT, et al. (2017): Neural predictors of alcohol use initiation during adolescence. Am J Psychiatry 174:172–185.
