Machine Learning: Finding the signal or fitting the noise?
Before machine learning came along, a typical approach to building a predictive model was to develop a model that best fit the data. But will a model that best fits your data provide a good prediction? Not necessarily. Fortunately, there are machine learning practices that can help us estimate and optimize the predictive performance of models. But before we delve into that, let’s illustrate the potential problem of “overfitting” your data.
Fitting the Trend vs. Overfitting the Data
For a given dataset, we could fit a simple model to the data (e.g., linear regression) and have a decent chance of representing the overall trend. Alternatively, we could apply a very complex model (e.g., a high-degree polynomial) and likely “overfit” the data: rather than representing the trend, we would fit the noise. If we then apply the polynomial model to new data, we can expect poor predictions, because it is not really modeling the general trend. The example above illustrates the difference between modeling the trend (the red straight line) and overfitting the data (the blue line). The red line has a better chance of predicting values outside the dataset presented.
Due to the powerful ability of machine learning techniques to model complexity within a dataset, we need to be extremely careful about overfitting a model to the data. We want to find the “signal” in the data, rather than fitting the noise.
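To make the trend-versus-noise contrast concrete, here is a minimal sketch in plain Python. It is not the analysis behind the figure: the straight-line "signal", the noise level, the sample size and all function names are assumptions chosen for illustration. It fits a least-squares line and a polynomial that passes through every training point, then compares their errors on the training points and on new points drawn from the same trend.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a predict(x) function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a predict(x) function on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(42)  # fixed seed so the sketch is repeatable

def noisy_observation(x):
    # Assumed signal (a straight line) plus assumed Gaussian noise.
    return 2.0 * x + 1.0 + random.gauss(0.0, 1.0)

train_x = [float(i) for i in range(16)]
train_y = [noisy_observation(x) for x in train_x]
test_x = [x + 0.5 for x in train_x[:-1]]  # new points between the training points
test_y = [noisy_observation(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
poly_train, poly_test = mse(poly, train_x, train_y), mse(poly, test_x, test_y)

print(f"line:       train MSE = {line_train:.3f}, test MSE = {line_test:.3f}")
print(f"polynomial: train MSE = {poly_train:.3f}, test MSE = {poly_test:.3f}")
```

The polynomial reproduces the training points exactly, so its training error is zero, yet it has learned the noise along with the trend and its error on the new points is far larger than the line's.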
The Goal: Optimize Predictive Performance
Machine learning algorithms are well suited to large, complex datasets: they can learn a task automatically, and improve with experience, without being explicitly programmed for that task. In oil and gas, where datasets are large, complex and highly uncertain, what we really want out of a machine learning model is good predictions on new data. Predictive performance is the true measure of a model’s usefulness in the real world. The problem is that we can’t really know how well a model will perform on new data until we actually test it. Thankfully, there are machine learning techniques that can help us prevent overfitting and reliably estimate the predictive performance of a model.
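One widely used way to estimate predictive performance before new data arrives is k-fold cross-validation: hold out part of the data, fit on the rest, score on the held-out part, and repeat. This post does not say which technique was used in our study, so treat the following as a generic sketch. The function names, the two toy models and the synthetic data are all assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds.

    `fit(train_xs, train_ys)` must return a function predict(x).
    """
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

def fit_mean(xs, ys):
    """Baseline model: always predict the average training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

# Synthetic, noise-free data purely for illustration: y = 2x + 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

cv_mean = cross_validated_mse(fit_mean, xs, ys)
cv_line = cross_validated_mse(fit_line, xs, ys)
print(f"baseline CV MSE = {cv_mean:.3f}, line CV MSE = {cv_line:.3g}")
```

Because every score comes from points the model did not see while fitting, the averaged error is an estimate of performance on new data rather than a measure of fit.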
A Real World Example
To illustrate the importance of striving for a good prediction, as opposed to a best fit, we applied some of our machine learning techniques to the following problem:
Target = predict the production performance (12-month cumulative production) of a horizontal well using U.S. public production and completion data (data source: Drillinginfo)
Study Area = North Dakota Bakken (Upper Three Forks formation)
- Train the model: use wells from 2011 to 2014
  - Model 1: optimized to best fit the data (note: we kept some scatter to simulate an empirical model)
  - Model 2: optimized for predictive performance
- Test the predictive performance of the model: use wells from 2015 and 2016 to see how well each model predicts production performance (note: we can measure predictive performance because the models have never seen this data and we know the 2015 and 2016 answers)
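The train/test design above amounts to splitting wells by year so that the test wells are strictly later than anything the model was trained on. A minimal sketch of that split follows. The `Well` record, its field names (including the choice of first-production year as the split date) and the handful of values are hypothetical placeholders, not the study data.

```python
from dataclasses import dataclass

@dataclass
class Well:
    well_id: str          # hypothetical identifier
    first_prod_year: int  # hypothetical field: year the well came on production
    cum_oil_12m: float    # 12-month cumulative oil (the target), made-up units

def split_by_year(wells, train_years, test_years):
    """Return (train, test) lists; wells outside both periods are dropped."""
    train = [w for w in wells if w.first_prod_year in train_years]
    test = [w for w in wells if w.first_prod_year in test_years]
    return train, test

def target_range(wells):
    """Smallest and largest target value in a list of wells."""
    values = [w.cum_oil_12m for w in wells]
    return min(values), max(values)

# Made-up records, only to exercise the functions.
wells = [
    Well("A", 2011, 80.0),
    Well("F", 2010, 70.0),   # outside both periods, so it is excluded
    Well("B", 2013, 150.0),
    Well("C", 2014, 60.0),
    Well("D", 2015, 120.0),
    Well("E", 2016, 95.0),
]

train, test = split_by_year(wells, range(2011, 2015), range(2015, 2017))

# Simple check that the test targets fall inside the range seen in training.
train_lo, train_hi = target_range(train)
test_lo, test_hi = target_range(test)
test_within_train_range = train_lo <= test_lo and test_hi <= train_hi

print([w.well_id for w in train], [w.well_id for w in test])
print("test targets within training range:", test_within_train_range)
```

Splitting by time rather than at random mimics how the model would be used in practice: trained on wells already drilled, then asked about wells that come later.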
The chart below shows the distribution of 12-month cumulative oil production for the Training dataset and the Testing dataset, and indicates that the Testing dataset adequately represents the geographical range and range of production values used in the Training dataset. This is also a great illustration of how visual analysis tools can help you understand your data and build confidence in the design of your machine learning pursuits.
The results are clear:
Training results: The “fit-optimized” model’s predictions fit the target values more closely, while the “prediction-optimized” model’s predictions show noticeably more scatter.
Testing results: When the models are applied to new (i.e., unseen) data, the 2015 and 2016 wells, the “prediction-optimized” model yields a much better result.
The fit-optimized model learned too much from the noise in the training data, while the prediction-optimized model did a better job of generalizing from the data and identifying the signal.
Here are a few tips that can help protect you from overfitting:
- Get more data
  - More sample points give the model more support in learning to separate the signal from the noise.
  - While too many inputs (sometimes called “features”) can contribute to overfitting, additional relevant inputs, such as geological, petrophysical, seismic, pressure or fluid composition data, can improve the predictive capabilities of a machine learning model.
- Make sure the data you are using is representative of the objective/target values
  - Data in oil and gas is inherently noisy and imperfect, but we should at least do everything we can to make sure the data we use to develop our models is consistent and representative of the predictions we’re trying to make.
- Keep it simple
  - If a simple model fits the data quite well, adding complexity for a small improvement in predictive power could increase the chance of overfitting.
- Sanity checks (the importance of domain expertise)
  - Do the predictions appear too good to be true? If so, they probably are, especially with the noisy and inconsistent data we often see in oil and gas.
  - Does it make sense that the inputs we fed to the model should be useful in predicting our target? If not, the model is probably overfitting.
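The “keep it simple” tip can be turned into a selection rule: among candidate models, prefer the least complex one whose validation error is close to the best. The sketch below is one possible way to express that. The relative tolerance, the complexity scores and the error values are assumptions made up for illustration, not results from our study.

```python
def pick_simplest_adequate(candidates, tolerance=0.05):
    """Pick the least complex model that is nearly as good as the best one.

    candidates: list of (name, complexity, validation_error) tuples.
    tolerance:  allowed relative gap to the best validation error (assumed rule).
    """
    best_error = min(error for _, _, error in candidates)
    threshold = best_error * (1.0 + tolerance)
    adequate = [c for c in candidates if c[2] <= threshold]
    return min(adequate, key=lambda c: c[1])

# Placeholder numbers: complexity is a parameter count, error is a
# validation error in arbitrary units.
candidates = [
    ("linear", 2, 1.04),
    ("quadratic", 3, 1.00),
    ("degree-9 polynomial", 10, 1.60),
]

chosen = pick_simplest_adequate(candidates)
print("chosen model:", chosen[0])
```

Here the quadratic has the lowest validation error, but the linear model is within the tolerance and has fewer parameters, so it is the one selected.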
Acknowledgements: Thank you to Drillinginfo for North Dakota production and completion data used in the analysis presented in this blog.