For Data Scientists and ML Engineers
Data scientists often accept AutoML outputs and AI-generated code without testing them against domain knowledge, then deploy models that pass metrics but fail in production. The pressure to move fast with ChatGPT and GitHub Copilot can override the statistical intuition that catches when a model is technically correct but practically broken.
These are observations, not criticism. Recognising the pattern is the first step.
AutoML platforms and your own experiments show you accuracy, F1, or AUC numbers. You pick the winner and move on. This misses that the metric itself may measure something unrelated to your actual business outcome, leaving you with a model that looks good in isolation but fails when it matters.
The fix
Before selecting any model, write down what business outcome you need and which metric actually measures that outcome, then check if the winning model makes sense for that specific measurement.
Tools like H2O or TPOT test dozens of algorithms and show you rankings. You trust the ranking because the tool looks rigorous. But each algorithm carries different assumptions about your data (linearity, independence, class balance) that the benchmark never examines.
The fix
After AutoML ranks models, manually inspect the top three to understand what each one assumes about your data, then test whether those assumptions hold in your actual dataset.
A 2 percent improvement looks good on a slide and the code ran without errors. You do not ask whether that improvement is large enough to matter in production or whether it might reverse on data the benchmark never saw.
The fix
Calculate the business impact of a 2 percent change in your metric (how many decisions change, how much money moves, how many users are affected), then decide whether the improvement justifies the added complexity.
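As a rough sketch of that calculation, with every number below a placeholder assumption rather than a figure from this text:

```python
# Hypothetical back-of-envelope: what a 2 percent accuracy gain is worth.
# Volume and cost figures are placeholder assumptions.
daily_predictions = 50_000      # decisions the model makes per day
baseline_accuracy = 0.90
improved_accuracy = 0.92        # the "2 percent improvement" on the slide
cost_per_error = 4.0            # assumed cost of one wrong decision

errors_before = daily_predictions * (1 - baseline_accuracy)
errors_after = daily_predictions * (1 - improved_accuracy)
daily_saving = (errors_before - errors_after) * cost_per_error

print(f"Errors avoided per day: {errors_before - errors_after:.0f}")
print(f"Estimated daily saving: {daily_saving:.0f}")
```

If the saving comes out smaller than the cost of maintaining a more complex model, the 2 percent gain does not justify the switch.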
You split data into five folds and the model scores well across all of them. But if your data has time dependence, geographic clusters, or repeated measurements from the same user, standard cross-validation lies to you about real performance.
The fix
Before validating any model, identify whether your data has structure (time series, spatial clusters, repeated subjects), then choose a validation strategy that respects that structure.
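A minimal sketch of why the structure matters, using synthetic repeated measurements per user (the data shape and fold counts are arbitrary choices): plain KFold can place the same user on both sides of a split, while scikit-learn's GroupKFold cannot.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
n_users, obs_per_user = 20, 5
groups = np.repeat(np.arange(n_users), obs_per_user)   # 5 rows per user
X = rng.normal(size=(len(groups), 3))

def shares_users(splitter):
    # True if any fold has the same user in both train and test
    for train_idx, test_idx in splitter.split(X, groups=groups):
        if set(groups[train_idx]) & set(groups[test_idx]):
            return True
    return False

kfold_leaks = shares_users(KFold(n_splits=5, shuffle=True, random_state=0))
group_leaks = shares_users(GroupKFold(n_splits=5))
print(f"Plain KFold mixes users across folds: {kfold_leaks}")
print(f"GroupKFold mixes users across folds: {group_leaks}")
```

For time-dependent data, `TimeSeriesSplit` plays the analogous role: it never trains on the future to predict the past.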
Your model shows that feature X is most important, so you explain this to stakeholders and build decisions around it. But feature importance can shift dramatically when you retrain on new data or slightly change hyperparameters, making your explanation fragile.
The fix
Train the same model type multiple times on different subsets of your training data and check whether the top five features stay the same across runs; if they do not, note that instability when you report importance.
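One way to sketch that stability check on synthetic data (the dataset shape and run count are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Retrain on random half-subsets and compare the top-5 feature sets.
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)
top_sets = []
for run in range(5):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    model = RandomForestClassifier(n_estimators=100, random_state=run)
    model.fit(X[idx], y[idx])
    top5 = set(np.argsort(model.feature_importances_)[-5:])  # 5 largest
    top_sets.append(top5)

stable = set.intersection(*top_sets)
print(f"Features in the top 5 of every run: {sorted(stable)}")
```

If the intersection is much smaller than five, report the importance ranking with that caveat attached.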
You ask GitHub Copilot to write preprocessing code or Claude to generate feature engineering logic. The code runs, so you assume it is correct. But the tool may normalise when it should standardise, or drop rows when it should impute, and you only notice when your model performs worse than expected in production.
The fix
For any preprocessing or feature generation code from an AI assistant, write a simple test case with known input and manually verify the output is statistically what you intended.
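A minimal example of such a test, using the standardise-versus-normalise confusion mentioned above: hand-compute the expected output for a tiny input and assert the code produces it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardising [1, 2, 3] should give mean 0 and std 1, i.e. roughly
# [-1.2247, 0, 1.2247] -- not the [0, 0.5, 1] that min-max scaling gives.
data = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(data)

expected = np.array([[-1.22474487], [0.0], [1.22474487]])
assert np.allclose(scaled, expected, atol=1e-6), "not standardised as intended"
print(scaled.ravel())
```

The same pattern works for imputation or encoding: three rows with known values, one assert per claim the AI-generated code makes.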
You use ChatGPT to generate dozens of polynomial, interaction, and lag features to feed into your model. More features seem like they should help. The model fits well on training data but overfits badly, and you end up with a brittle model that breaks on new data.
The fix
Generate features intentionally based on domain knowledge, then use at least one formal feature selection method (permutation importance, recursive elimination, or stability analysis) to keep only the ones that matter.
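A sketch of the permutation-importance route on synthetic data; the keep/drop threshold of two standard deviations is an assumption, not a standard:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
# Keep features whose mean importance clearly exceeds its own noise.
keep = [i for i in range(X.shape[1])
        if result.importances_mean[i] - 2 * result.importances_std[i] > 0]
print(f"Selected features: {keep}")
```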
You know trees ignore scaling, so you skip normalisation when you build an ensemble. Later you need to add a linear model or neural network to your pipeline, but your features are still unscaled and the performance suffers.
The fix
Scale all numerical features at the start of your pipeline in a way that fits only on training data, then apply that same scaling to validation and test sets.
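In scikit-learn a Pipeline gives you this for free: the scaler is fitted only on the training portion, and the same statistics are reused for anything you later score. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)            # scaler statistics come from X_tr alone
score = pipe.score(X_te, y_te)  # test data is scaled with the training stats
print(f"Test accuracy: {score:.3f}")
```

Fitting the scaler on the full dataset before splitting is itself a mild form of leakage, which the Pipeline structure makes impossible by construction.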
AutoML tools create derived features automatically and you trust them because the tool is designed for this. But you have no idea what features were actually created or whether they introduce data leakage from your target variable.
The fix
After running AutoML, demand a list of all engineered features, inspect the top ten for leakage (whether they could have been calculated only after knowing the outcome), and remove any that leak information.
You set a random seed to make your work reproducible, but you use the same seed for data splitting, model initialisation, and feature selection across every experiment. This creates a hidden dependency where your results all come from one particular random shuffle.
The fix
Use the same seed for data splitting so splits stay reproducible, but vary seeds for model initialisation and sampling to check whether your results are stable across different random starting points.
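A sketch of that split of responsibilities: one fixed seed for the data split, several seeds for the model (the model choice and seed values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# One fixed seed so the split itself stays reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Vary the model seed; subsample < 1 makes the randomness actually matter
scores = []
for model_seed in (0, 1, 2, 3, 4):
    model = GradientBoostingClassifier(subsample=0.8, random_state=model_seed)
    scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

print(f"Scores across model seeds: {np.round(scores, 3)}")
print(f"Spread: {max(scores) - min(scores):.3f}")
```

If the spread is comparable to the improvement you are reporting, the improvement may just be a lucky seed.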
Your model validated well so you deploy it to production. Six weeks in, you notice it performs worse than expected, but you have no mechanism to identify which types of examples it fails on or whether your data has shifted.
The fix
Before deploying, define which metrics you will monitor in production (accuracy on subgroups, prediction confidence distribution, feature value ranges), and set alerts for when they change by more than five percent.
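A toy version of one such alert, here on a single feature's mean; the simulated shift, the sample sizes, and the tolerance of 0.2 training standard deviations are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # training baseline
prod_feature = rng.normal(loc=0.4, scale=1.0, size=1000)   # simulated shift

baseline_mean = train_feature.mean()
baseline_std = train_feature.std()

# Drift expressed in units of the training standard deviation
drift = abs(prod_feature.mean() - baseline_mean) / baseline_std
ALERT_THRESHOLD = 0.2   # assumed tolerance

if drift > ALERT_THRESHOLD:
    print(f"ALERT: feature mean drifted by {drift:.2f} training std")
```

The same pattern extends to subgroup accuracy and prediction-confidence distributions: compute the statistic at training time, store it, compare against it on a schedule.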
Your model relies on features A and B which are highly correlated. Feature importance says A matters more, so you tell stakeholders A is the driver of the decision. But the model might equally well have used B, making your explanation misleading.
The fix
When reporting feature importance, also report the correlation between top features, and flag any high correlations as a sign that feature importance may not be stable.
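A small sketch with a deliberately near-duplicate feature; the 0.8 correlation cutoff is an arbitrary choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
df = pd.DataFrame({
    "feature_a": a,
    "feature_b": a + rng.normal(scale=0.1, size=1000),  # near-duplicate of a
    "feature_c": rng.normal(size=1000),                 # independent
})

# Flag any pair of features with high absolute correlation
corr = df.corr().abs()
high_pairs = [(i, j) for i in corr.columns for j in corr.columns
              if i < j and corr.loc[i, j] > 0.8]
print(f"Highly correlated pairs: {high_pairs}")
```

Any importance ranking that separates `feature_a` from `feature_b` here is telling you about the model's arbitrary choice, not about the data.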
You choose to optimise accuracy because it is the default metric in scikit-learn, even though your business cares about precision (reducing false alarms) or recall (catching all cases). Your model hits high accuracy but fails at the actual goal.
The fix
Define your metric before you start modelling by asking stakeholders what a false positive and false negative cost in real terms, then choose or weight your metric to match that cost.
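A sketch of turning those answers into a single cost to optimise; the two cost figures are placeholders for whatever your stakeholders actually say. Note that two models with identical accuracy can differ twentyfold in cost:

```python
import numpy as np

# Placeholder costs from a hypothetical stakeholder conversation
COST_FALSE_POSITIVE = 5.0     # e.g. wasted review time per false alarm
COST_FALSE_NEGATIVE = 100.0   # e.g. one missed fraud case

def expected_cost(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 8/10 accuracy, 2 false negatives
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 8/10 accuracy, 2 false positives
print(expected_cost(y_true, model_a))       # 2 * 100 = 200
print(expected_cost(y_true, model_b))       # 2 * 5 = 10
```

Accuracy scores these two models identically; the cost function prefers model B by a factor of twenty.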
You train on six months of historical data and validate on the following month. This looks like a realistic test. But production data comes from a different source, a different time window, or a different user population, and your model fails because the distribution shifted.
The fix
Before deploying, sample a small batch of real production data if possible and test your model on it; if production data is unavailable, explicitly document what distribution shifts you are assuming will not happen.
Your model outputs a probability for each prediction. A stakeholder sees 87 percent confidence and assumes it is right 87 percent of the time. But many models are not calibrated, meaning their confidence scores do not match their actual accuracy.
The fix
After training any model that outputs probabilities, plot calibration curves (predicted probability versus observed frequency) to check whether a 70 percent prediction is actually correct 70 percent of the time.
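scikit-learn's `calibration_curve` makes this check a few lines; the model and data here are stand-ins:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)        # often poorly calibrated
proba = model.predict_proba(X_te)[:, 1]

# Bin predictions and compare predicted probability to observed frequency
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")

gap = np.mean(np.abs(frac_pos - mean_pred))  # 0 means well calibrated
print(f"Mean calibration gap: {gap:.3f}")
```

If the gap is large, `CalibratedClassifierCV` can wrap the model to recalibrate its probabilities before anyone quotes them to a stakeholder.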
You train a loan approval model on data from one region or demographic, then deploy it everywhere. The model performs well on average, but fails for groups it never saw during training because those groups have different feature distributions.
The fix
Measure performance separately for each major demographic or geographic subgroup, both during validation and after deployment, and if performance differs by more than ten percent, retrain on a balanced sample or adjust the model for each subgroup.
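A sketch of the subgroup check on simulated predictions; the regions, accuracies, and the ten percent threshold (mirroring the rule above) are all assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "west"], size=3000),
    "correct": rng.random(3000) < 0.9,      # placeholder: 90% base accuracy
})
# Simulate a model that is weaker on one region
mask = df["region"] == "west"
df.loc[mask, "correct"] = rng.random(mask.sum()) < 0.7

by_group = df.groupby("region")["correct"].mean()
print(by_group)
gap = by_group.max() - by_group.min()
if gap > 0.10:
    print(f"ALERT: subgroup accuracy gap of {gap:.2f}")
```

An average accuracy near 83 percent hides the fact that one region sits twenty points below the others; the groupby makes it visible in one line.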