40 Questions Data Scientists Should Ask Before Trusting AI Model Recommendations
AutoML tools and code-generation AI can produce models that pass benchmarks but fail in production. Your judgement about whether a model should ship depends on asking the right questions before you trust what the tools tell you.
These are suggestions. Use the ones that fit your situation.
Questions About AutoML and Automated Model Selection
1. Does the AutoML platform report which test set it used to rank models, and is that test set truly separate from the training data or did it leak through cross-validation folds?
2. When AutoML selects a neural network over a gradient boosted tree, can you see the actual metric difference, or does the platform hide the comparison because the difference is smaller than your tolerance?
3. Has AutoML tested your model on the specific edge cases your business cares about, or only on aggregate metrics like accuracy and AUC?
4. If AutoML chose a complex ensemble, do you understand which base models are in it and whether the ensemble would still work if you removed the slowest component?
5. Does the benchmark test set include the same distribution of rare classes or rare feature combinations that will appear in production?
6. When the tool compares models, does it account for inference latency in your serving environment, or only for training speed?
7. Has AutoML actually tried simple baselines like logistic regression or a decision tree, or does the platform skip them because they appear weak on paper?
8. Can you extract the exact hyperparameters AutoML selected, or is the model locked inside a proprietary format that prevents you from understanding what it chose?
9. Does the platform report confidence intervals or error bars on its performance metrics, or only point estimates that hide the noise?
10. If you retrain the exact same AutoML configuration on new data next month, will you get the same model or a different one due to randomness?
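Question 9's point about error bars can be made concrete with a bootstrap. The sketch below is plain NumPy, and the function name is mine rather than any platform's API: it resamples a shared test set to put a 95 percent interval on the accuracy gap between two candidate models. If the interval spans zero, the leaderboard winner may just be noise.

```python
import numpy as np

def bootstrap_metric_gap(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap a 95% interval on the accuracy gap between two models
    evaluated on the same test set."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test rows with replacement
        gaps[i] = (pred_a[idx] == y_true[idx]).mean() - (pred_b[idx] == y_true[idx]).mean()
    observed = (pred_a == y_true).mean() - (pred_b == y_true).mean()
    low, high = np.percentile(gaps, [2.5, 97.5])
    return observed, (low, high)
```

On small test sets, gaps that look decisive on a leaderboard often come back with intervals that comfortably span zero.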
Questions About Code Generation and LLM-Suggested Models
11. When GitHub Copilot or ChatGPT suggests a model architecture, does the code it generates match your actual data types, or did it assume tabular data when you have text or time series?
12. If Claude generated a feature engineering pipeline, have you verified that the transformations it applied do not leak information from the test set into the training set?
13. Does the LLM-suggested preprocessing code handle missing values in a way that makes sense for your domain, or did it pick a common default that is wrong for your problem?
14. When an AI tool recommends a specific library or algorithm, have you checked whether that library is actually maintained and compatible with your production environment?
15. Has the code generated by Copilot or Gemini been tested on a hold-out validation set, or does it only work on the examples in your prompt?
16. If the LLM suggested using a pre-trained model, did it tell you which training data that model was built on and whether that data matches your use case?
17. Does the generated code include any statistical tests to check assumptions, or does it assume those assumptions hold without verification?
18. When ChatGPT wrote your training loop, did it include early stopping or regularisation to prevent overfitting, or only the minimum code to train?
19. Have you run the suggested model code against your actual production feature schema, or only against sample data in a notebook?
20. If the AI tool suggested a loss function, do you understand why that loss function is appropriate for your business outcome, or did you accept it because the code looked authoritative?
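Question 12's leakage risk is easiest to see side by side. A minimal sketch with scikit-learn and synthetic data: the first variant fits the scaler on the full dataset before cross-validation, so test-fold statistics bleed into training; the second refits the scaler inside each fold via a Pipeline. For standard scaling the leak is usually mild, but the same structural mistake with target encoding or feature selection can be catastrophic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky: scaler statistics are computed on all 200 rows,
# including the rows that will later sit in each test fold.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Safe: the scaler is refit on the training fold only, inside each CV split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
safe_scores = cross_val_score(pipe, X, y, cv=5)
```

The habit to verify in generated code is structural: every fitted transformation lives inside the cross-validation loop, never before it.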
Questions About Model Behaviour and Statistical Intuition
21. If your model's accuracy improved by 2 percent when you added a new feature, did you check whether that feature is actually predictive or just correlated by chance in your training set?
22. When the model performs better on one customer segment than another, have you investigated whether the training data was imbalanced or whether the model genuinely cannot learn that segment?
23. Does your model make predictions that contradict your domain knowledge in ways that worry you, even if the metric is good?
24. Have you tested what happens when you flip the value of a key feature (for example, reversing true to false) to see if the model is truly using the feature or just using correlated noise?
25. If the model's performance dropped sharply between development and production, did you check whether the production data distribution shifted or whether you made a mistake during deployment?
26. When you look at the features the model weights most heavily, do they make causal sense or are they just statistically correlated with the target?
27. Have you examined a few individual predictions your model makes and asked yourself whether you would explain those decisions the same way if a customer asked you?
28. Does the model's confidence (predicted probability or score) match how often it is actually correct across different ranges, or is it over-confident on some types of examples?
29. If your model performs differently on subgroups of your data, have you checked whether those differences reflect real patterns or sampling variation?
30. Have you calculated what baseline accuracy you would get by always predicting the most common class, and is your model's improvement actually meaningful compared to that baseline?
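Questions 28 and 30 both have cheap numerical checks. A sketch in plain NumPy (the function names are illustrative, not from any library): the majority-class baseline tells you what "no model at all" scores, and the calibration table compares mean predicted probability with the observed positive rate per probability bin.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / len(y_true)

def calibration_table(y_true, y_prob, n_bins=5):
    """Per-bin (bin_low, bin_high, mean predicted prob, observed rate, count).
    A large gap between columns 3 and 4 in a bin means the model is
    over- or under-confident on that range of scores."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            rows.append((lo, hi, y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows
```

If your model beats the majority baseline by less than its own run-to-run variance, the "improvement" is not meaningful yet.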
Questions About Production Readiness and Real-World Failure
31. Does your model handle categorical features with new values that never appeared in training data, or will it crash when production sends an unexpected category?
32. If a feature your model depends on becomes unavailable or unreliable in production, do you have a fallback prediction strategy or will the model fail silently?
33. Have you measured model performance separately for the tail of your distribution (the 1 percent of examples the model rarely sees), or only on aggregate metrics?
34. When you sampled data to speed up training, did you check whether that sampling introduced bias toward examples that are easier for the model to learn?
35. Does your monitoring system detect when the model is making predictions on data that is too different from its training set, or do you only monitor overall accuracy?
36. If the model's prediction latency varies wildly depending on the input, have you checked whether it will meet your serving time budget on all inputs or only typical ones?
37. Have you intentionally retrained the model on stale data to see how quickly performance degrades, so you know when retraining is actually necessary?
38. When a business stakeholder asks why the model made a specific prediction, can you give them an answer that makes sense in their language, or only a SHAP value they cannot act on?
39. Does the model depend on features that are expensive or slow to compute, and have you weighed whether the performance gain is worth that cost?
40. If you had to explain to a regulator or auditor why you chose this model over simpler alternatives, would your reasoning hold up or would you struggle to articulate it?
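Question 31 is worth testing before production tests it for you. A minimal scikit-learn sketch with toy data: the default OneHotEncoder raises on a category it never saw, while handle_unknown="ignore" emits an all-zeros row instead. Neither behaviour is automatically right for your domain; the point is to know which one your pipeline has.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["blue"], ["red"]])
prod = np.array([["green"]])  # a category that never appeared in training

# Default behaviour: an unseen category kills the request.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(prod)
    crashed = False
except ValueError:
    crashed = True

# Alternative: the unseen category becomes an all-zeros row,
# which the model will silently treat as "none of the known colours".
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
row = tolerant.transform(prod).toarray()
```

Run this kind of probe with every categorical feature your model consumes, using values your upstream systems could plausibly send.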
How to use these questions
Before you run AutoML, write down the decision rule you would use to choose between models. If AutoML picks something different, stop and ask why rather than assuming the tool is right.
Treat LLM-generated code as a first draft, not a solution. Run it on a small hold-out set and deliberately look for ways it could fail before you scale it.
When a model's metric looks suspiciously good, your first instinct should be to suspect data leakage. Check feature definitions, preprocessing steps, and data split logic before you celebrate.
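One cheap leakage smoke test is to shuffle the labels and re-run your evaluation. With the true signal destroyed, any cross-validated score well above chance means the pipeline itself is leaking (duplicate rows across folds, encoders fit on the full dataset, and so on). A sketch with scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # real signal lives in feature 0

# Score on the real labels, then on a random permutation of them.
real = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
shuffled = cross_val_score(LogisticRegression(), X, rng.permutation(y), cv=5).mean()
```

Here the shuffled score collapses to roughly chance level; in a leaky pipeline it stays suspiciously high, and that gap is your red flag.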
Keep a notebook of predictions your model makes that surprise you. That surprise is your domain knowledge telling you something the metric missed.
Ask your business stakeholder to name one prediction the model could make that would be so wrong they would reject it, no matter how good the accuracy. If they cannot answer, you do not understand the actual requirement.