
40 Questions DevOps Engineers Should Ask Before Trusting AI Infrastructure Outputs

When Datadog AI suggests new alert thresholds or GitHub Copilot generates a Terraform module, your split-second approval shapes what runs in production. Asking the right questions before accepting AI outputs protects your infrastructure from becoming a black box that only machines understand.

These are suggestions. Use the ones that fit your situation.


Infrastructure Configuration and Code Generation

1 When GitHub Copilot suggests a security group rule or IAM policy, can I explain to another engineer why each permission is necessary for this specific workload?
2 Does the Terraform or CloudFormation generated by Amazon CodeWhisperer reflect the actual blast radius if this resource fails, or have I just accepted defaults?
3 If ChatGPT generates a Kubernetes manifest with resource requests and limits, do those numbers reflect my actual traffic patterns or are they generic placeholders?
4 When Copilot auto-completes a Docker build step or installation command, do I know what version it pulled and whether that version has known vulnerabilities?
5 Does the networking configuration AI suggested for my microservices actually match how my services communicate, or does it assume a different deployment topology?
6 If AI generated a load balancer configuration, have I verified that the health check endpoints and timeouts work for my actual application startup time?
7 When accepting AI-suggested storage configurations (S3 bucket policies, EBS volume types, RDS parameter groups), do I understand the availability and durability trade-offs?
8 Does the logging and tracing configuration generated by AI send data to the observability tools my team actually uses?
9 If AI suggested a backup strategy, do I know how long recovery would actually take and whether that meets my RTO requirements?
10 When CodeWhisperer suggests a CI/CD pipeline step, does it assume secrets management practices that match my organisation's actual policies?
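Question 3 is one of the easiest to turn into a check. Here is a minimal sketch of comparing AI-suggested Kubernetes resource requests against observed usage; the manifest values and p95 figures are hypothetical examples, and in practice you would pull p95 CPU and memory from your metrics backend rather than hard-code them.

```python
# Sketch: sanity-check AI-generated resource requests against observed usage.
# All manifest values and p95 figures below are hypothetical examples.

def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity ('250m' or '1') to cores."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)

def parse_memory(value: str) -> int:
    """Convert a Kubernetes memory quantity ('128Mi', '1Gi') to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)

def check_requests(requests: dict, observed_p95: dict, headroom: float = 1.5) -> list:
    """Flag requests that sit far from observed p95 usage in either direction."""
    findings = []
    cpu_req = parse_cpu(requests["cpu"])
    if cpu_req < observed_p95["cpu_cores"]:
        findings.append("cpu request below observed p95 - throttling likely")
    elif cpu_req > observed_p95["cpu_cores"] * headroom * 2:
        findings.append("cpu request far above observed p95 - wasted capacity")
    mem_req = parse_memory(requests["memory"])
    if mem_req < observed_p95["memory_bytes"]:
        findings.append("memory request below observed p95 - OOM risk")
    return findings

# Copilot-style generic defaults vs. a service whose p95 is 0.4 cores / 300 MiB:
issues = check_requests(
    {"cpu": "100m", "memory": "128Mi"},
    {"cpu_cores": 0.4, "memory_bytes": 300 * 1024**2},
)
```

If a generated manifest fails a check like this, the numbers were placeholders, not a fit for your traffic.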

Incident Response and Runbook Automation

11 When PagerDuty AI suggests an automated remediation action, what happens if the root cause is something the automation was not designed for?
12 Does the incident runbook generated by ChatGPT include the specific command syntax for my version of the tools I actually run, or generic examples?
13 If AI auto-generates steps to restart a service or clear a queue, have I tested those steps in staging to confirm they do not leave the system in a broken state?
14 When an AI suggests rolling back a deployment, does it account for data migrations or state changes that happened during the incident?
15 Does the incident response runbook AI suggested include the decision points where a human needs to assess whether to continue or stop?
16 If Datadog AI recommends scaling up infrastructure during an incident, do I understand the cost implications and whether that is the correct response?
17 When AI suggests a database query to diagnose a problem, do I know whether that query will lock tables or impact production performance?
18 Does the escalation path in an AI-generated runbook match the on-call rotations and expertise levels my team actually has?
19 If AI suggests disabling alerts as a temporary measure, do I know exactly which alerts, and do I have a reminder set to re-enable them?
20 When an incident runbook tells me to check a specific log file or metric, have I verified that log file exists and is populated in my actual environment?
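Question 20 can be automated as a pre-flight check that runs before anyone follows an AI-generated runbook in anger. A minimal sketch, assuming runbook steps are stored as dicts; the file paths and command names below are deliberately fictional examples.

```python
# Sketch: pre-flight validation for an AI-generated runbook, confirming that
# the log files and commands it references exist in this environment before
# an incident forces someone to find out the hard way.
import os
import shutil

def preflight(runbook_steps: list) -> dict:
    """Return the steps whose prerequisites are missing on this host."""
    failures = {}
    for step in runbook_steps:
        missing = []
        for path in step.get("log_files", []):
            if not os.path.exists(path):
                missing.append(f"log file not found: {path}")
        for cmd in step.get("commands", []):
            if shutil.which(cmd) is None:
                missing.append(f"command not on PATH: {cmd}")
        if missing:
            failures[step["name"]] = missing
    return failures

runbook = [
    {"name": "inspect queue depth", "commands": ["ls"]},  # present on any host
    {"name": "check broker log",                          # deliberately missing
     "log_files": ["/var/log/imaginary-broker.log"],
     "commands": ["imaginary-broker-ctl"]},
]
failures = preflight(runbook)
```

Running a check like this in staging, on every runbook revision, catches the generic examples question 12 warns about before 3am does.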

Monitoring, Alerting, and Observability


21 When Datadog AI suggests new alert thresholds, does it account for the seasonal or time-based traffic patterns specific to my application?
22 If AI recommends adding a metric or changing how I collect data, does that change break any dashboards or reports other teams depend on?
23 When AWS CloudWatch or Datadog AI suggests anomaly detection, do I understand what constitutes an anomaly in my system, or am I just trusting a black box?
24 Does the alert fatigue reduction AI promised actually result in my on-call engineers responding to fewer false positives, or just different noise?
25 If AI suggests a new SLI or SLO based on my data, does that metric actually capture what matters to my users or internal stakeholders?
26 When Copilot suggests monitoring code snippets, do those snippets report to the right observability backend and in the right format?
27 Does the log aggregation configuration AI suggested actually capture errors from all the services I care about, or just the obvious ones?
28 If Datadog AI recommends correlating two metrics to predict failures, do I understand the causation or am I just acting on correlation?
29 When AI suggests a new alerting rule, have I confirmed that it will not trigger during maintenance windows or expected operational events?
30 Does the observability setup AI recommended scale to my actual volume of logs, metrics, and traces, or will costs spiral if traffic grows?
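Question 21 is worth a worked example. The sketch below contrasts a static threshold with one that respects an hourly baseline; the traffic pattern is entirely made up, and in practice the baseline would come from weeks of historical data, not a hard-coded list.

```python
# Sketch: why a single static threshold misfires on seasonal traffic.
# Hypothetical p95 requests/sec per hour-of-day for one service:
# quiet overnight, busy daytime, evening peak, late-evening tail.
hourly_baseline = [40] * 8 + [120] * 10 + [200] * 3 + [60] * 3  # 24 entries

def static_alert(value: float, threshold: float = 150.0) -> bool:
    """Fixed threshold, as an AI assistant might derive from a daily average."""
    return value > threshold

def seasonal_alert(value: float, hour: int, tolerance: float = 1.5) -> bool:
    """Alert only when traffic exceeds the expected level for that hour."""
    return value > hourly_baseline[hour] * tolerance

# 190 req/s at 19:00 is a normal evening peak (baseline 200 req/s):
peak = 190.0
fires_static = static_alert(peak)          # pages the on-call for nothing
fires_seasonal = seasonal_alert(peak, 19)  # stays quiet
```

The same static threshold that pages you during a normal evening peak will also sleep through a 3x overnight spike, which is exactly the "different noise" question 24 describes.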

System Understanding and Operational Continuity

31 After accepting AI-generated infrastructure changes, can I draw a diagram of how data flows through the system without consulting the AI again?
32 If the engineer who wrote the original system is no longer on the team, does accepting AI-generated modifications mean no one fully understands it now?
33 When GitHub Copilot suggests a pattern or approach, have I considered whether it matches the patterns already established in my codebase?
34 If AI generates configuration for a critical path service, can a junior engineer on my team understand it well enough to troubleshoot at 3am?
35 Does the infrastructure AI suggested have failure modes that are invisible until something goes wrong in production?
36 When accepting AI-optimised configurations, am I losing the design constraints and trade-off decisions that the original architect made?
37 If I stop using AI tools tomorrow, would my team still be able to modify and debug the infrastructure AI helped build?
38 Does the AI-suggested approach introduce dependencies on cloud provider features that might not be portable to other platforms?
39 When AI generates incident response procedures, does it document the reasoning and assumptions, or just the steps?
40 Have I created a process where new team members learn system reliability from understanding real decisions, or from reading AI-generated explanations?
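Question 38 lends itself to a simple audit. This is an illustrative sketch only: the denylist is tiny and the rationales are my own shorthand, but the idea is to make provider lock-in visible at review time rather than at migration time.

```python
# Sketch: a quick portability audit of AI-generated Terraform resource types,
# flagging managed services that have no close equivalent on other platforms.
# The denylist below is illustrative, not exhaustive.
PROVIDER_LOCKED = {
    "aws_dynamodb_table": "DynamoDB has no drop-in replacement elsewhere",
    "aws_lambda_function": "Lambda semantics differ from other FaaS platforms",
    "google_spanner_database": "Spanner is GCP-specific",
}

def audit_portability(resource_types: list) -> dict:
    """Map each provider-locked resource type to a short rationale."""
    return {r: PROVIDER_LOCKED[r] for r in resource_types if r in PROVIDER_LOCKED}

# Resource types as they might appear in a generated module:
flagged = audit_portability(
    ["aws_instance", "aws_security_group", "aws_dynamodb_table"]
)
```

Flagged resources are not automatically wrong; the point is that accepting them becomes a recorded decision instead of an accident.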

