40 Questions Data Scientists Should Ask Before Trusting AI Model Recommendations
AutoML tools and code-generation AI can produce models that pass benchmarks but fail in production. Your judgement about whether a model should ship depends on asking the right questions before you trust what the tools tell you.
These are suggestions. Use the ones that fit your situation.
Questions About AutoML and Automated Model Selection
1. Does the AutoML platform report which test set it used to rank models, and is that test set truly separate from the training data or did it leak through cross-validation folds?
2. When AutoML selects a neural network over a gradient boosted tree, can you see the actual metric difference, or does the platform hide the comparison because the difference is smaller than your tolerance?
3. Has AutoML tested your model on the specific edge cases your business cares about, or only on aggregate metrics like accuracy and AUC?
4. If AutoML chose a complex ensemble, do you understand which base models are in it and whether the ensemble would still work if you removed the slowest component?
5. Does the benchmark test set include the same distribution of rare classes or rare feature combinations that will appear in production?
6. When the tool compares models, does it account for inference latency in your serving environment, or only for training speed?
7. Has AutoML actually tried simple baselines like logistic regression or a decision tree, or does the platform skip them because they appear weak on paper?
8. Can you extract the exact hyperparameters AutoML selected, or is the model locked inside a proprietary format that prevents you from understanding what it chose?
9. Does the platform report confidence intervals or error bars on its performance metrics, or only point estimates that hide the noise?
10. If you retrain the exact same AutoML configuration on new data next month, will you get the same model or a different one due to randomness?
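Question 9's point about error bars can be made concrete with a bootstrap. The sketch below is plain NumPy, and the function name is mine rather than any platform's API: it resamples a shared test set to put a 95 percent interval on the accuracy gap between two candidate models. If the interval spans zero, the leaderboard winner may just be noise.

```python
import numpy as np

def bootstrap_metric_gap(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap a 95% interval on the accuracy gap between two models
    evaluated on the same test set."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test rows with replacement
        gaps[i] = (pred_a[idx] == y_true[idx]).mean() - (pred_b[idx] == y_true[idx]).mean()
    observed = (pred_a == y_true).mean() - (pred_b == y_true).mean()
    low, high = np.percentile(gaps, [2.5, 97.5])
    return observed, (low, high)
```

On small test sets, gaps that look decisive on a leaderboard often come back with intervals that comfortably span zero.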
Questions About Code Generation and LLM-Suggested Models
11. When GitHub Copilot or ChatGPT suggests a model architecture, does the code it generates match your actual data types, or did it assume tabular data when you have text or time series?
12. If Claude generated a feature engineering pipeline, have you verified that the transformations it applied do not leak information from the test set into the training set?
13. Does the LLM-suggested preprocessing code handle missing values in a way that makes sense for your domain, or did it pick a common default that is wrong for your problem?
14. When an AI tool recommends a specific library or algorithm, have you checked whether that library is actually maintained and compatible with your production environment?
15. Has the code generated by Copilot or Gemini been tested on a hold-out validation set, or does it only work on the examples in your prompt?
16. If the LLM suggested using a pre-trained model, did it tell you which training data that model was built on and whether that data matches your use case?
17. Does the generated code include any statistical tests to check assumptions, or does it assume those assumptions hold without verification?
18. When ChatGPT wrote your training loop, did it include early stopping or regularisation to prevent overfitting, or only the minimum code to train?
19. Have you run the suggested model code against your actual production feature schema, or only against sample data in a notebook?
20. If the AI tool suggested a loss function, do you understand why that loss function is appropriate for your business outcome, or did you accept it because the code looked authoritative?
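Question 12's leakage risk is easiest to see side by side. A minimal sketch with scikit-learn and synthetic data: the first variant fits the scaler on the full dataset before cross-validation, so test-fold statistics bleed into training; the second refits the scaler inside each fold via a Pipeline. For standard scaling the leak is usually mild, but the same structural mistake with target encoding or feature selection can be catastrophic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky: scaler statistics are computed on all 200 rows,
# including the rows that will later sit in each test fold.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Safe: the scaler is refit on the training fold only, inside each CV split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
safe_scores = cross_val_score(pipe, X, y, cv=5)
```

The habit to verify in generated code is structural: every fitted transformation lives inside the cross-validation loop, never before it.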
Questions About Model Behaviour and Statistical Intuition
21. If your model's accuracy improved by 2 percent when you added a new feature, did you check whether that feature is actually predictive or just correlated by chance in your training set?
22. When the model performs better on one customer segment than another, have you investigated whether the training data was imbalanced or whether the model genuinely cannot learn that segment?
23. Does your model make predictions that contradict your domain knowledge in ways that worry you, even if the metric is good?
24. Have you tested what happens when you flip the value of a key feature (for example, reversing true to false) to see if the model is truly using the feature or just using correlated noise?
25. If the model's performance dropped sharply between development and production, did you check whether the production data distribution shifted or whether you made a mistake during deployment?
26. When you look at the features the model weights most heavily, do they make causal sense or are they just statistically correlated with the target?
27. Have you examined a few individual predictions your model makes and asked yourself whether you would explain those decisions the same way if a customer asked you?
28. Does the model's confidence (predicted probability or score) match how often it is actually correct across different ranges, or is it over-confident on some types of examples?
29. If your model performs differently on subgroups of your data, have you checked whether those differences reflect real patterns or sampling variation?
30. Have you calculated what baseline accuracy you would get by always predicting the most common class, and is your model's improvement actually meaningful compared to that baseline?
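Questions 28 and 30 both have cheap numerical checks. A sketch in plain NumPy (the function names are illustrative, not from any library): the majority-class baseline tells you what "no model at all" scores, and the calibration table compares mean predicted probability with the observed positive rate per probability bin.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / len(y_true)

def calibration_table(y_true, y_prob, n_bins=5):
    """Per-bin (bin_low, bin_high, mean predicted prob, observed rate, count).
    A large gap between columns 3 and 4 in a bin means the model is
    over- or under-confident on that range of scores."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            rows.append((lo, hi, y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows
```

If your model beats the majority baseline by less than its own run-to-run variance, the "improvement" is not meaningful yet.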
Questions About Production Readiness and Real-World Failure
31. Does your model handle categorical features with new values that never appeared in training data, or will it crash when production sends an unexpected category?
32. If a feature your model depends on becomes unavailable or unreliable in production, do you have a fallback prediction strategy or will the model fail silently?
33. Have you measured model performance separately for the tail of your distribution (the 1 percent of examples the model rarely sees), or only on aggregate metrics?
34. When you sampled data to speed up training, did you check whether that sampling introduced bias toward examples that are easier for the model to learn?
35. Does your monitoring system detect when the model is making predictions on data that is too different from its training set, or do you only monitor overall accuracy?
36. If the model's prediction latency varies wildly depending on the input, have you checked whether it will meet your serving time budget on all inputs or only typical ones?
37. Have you intentionally retrained the model on stale data to see how quickly performance degrades, so you know when retraining is actually necessary?
38. When a business stakeholder asks why the model made a specific prediction, can you give them an answer that makes sense in their language, or only a SHAP value they cannot act on?
39. Does the model depend on features that are expensive or slow to compute, and have you weighed whether the performance gain is worth that cost?
40. If you had to explain to a regulator or auditor why you chose this model over simpler alternatives, would your reasoning hold up or would you struggle to articulate it?
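Question 31 is worth testing before production tests it for you. A minimal scikit-learn sketch with toy data: the default OneHotEncoder raises on a category it never saw, while handle_unknown="ignore" emits an all-zeros row instead. Neither behaviour is automatically right for your domain; the point is to know which one your pipeline has.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["blue"], ["red"]])
prod = np.array([["green"]])  # a category that never appeared in training

# Default behaviour: an unseen category kills the request.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(prod)
    crashed = False
except ValueError:
    crashed = True

# Alternative: the unseen category becomes an all-zeros row,
# which the model will silently treat as "none of the known colours".
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
row = tolerant.transform(prod).toarray()
```

Run this kind of probe with every categorical feature your model consumes, using values your upstream systems could plausibly send.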
How to use these questions
Before you run AutoML, write down the decision rule you would use to choose between models. If AutoML picks something different, stop and ask why rather than assuming the tool is right.
Treat LLM-generated code as a first draft, not a solution. Run it on a small hold-out set and deliberately look for ways it could fail before you scale it.
When a model's metric looks suspiciously good, your first instinct should be to suspect data leakage. Check feature definitions, preprocessing steps, and data split logic before you celebrate.
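One cheap leakage smoke test is to shuffle the labels and re-run your evaluation. With the true signal destroyed, any cross-validated score well above chance means the pipeline itself is leaking (duplicate rows across folds, encoders fit on the full dataset, and so on). A sketch with scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # real signal lives in feature 0

# Score on the real labels, then on a random permutation of them.
real = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
shuffled = cross_val_score(LogisticRegression(), X, rng.permutation(y), cv=5).mean()
```

Here the shuffled score collapses to roughly chance level; in a leaky pipeline it stays suspiciously high, and that gap is your red flag.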
Keep a notebook of predictions your model makes that surprise you. That surprise is your domain knowledge telling you something the metric missed.
Ask your business stakeholder to name one prediction the model could make that would be so wrong they would reject it, no matter how good the accuracy. If they cannot answer, you do not understand the actual requirement.