AI Testing Checklist for QA Teams

Working checklist

What to save

Situation: A tester runs the checklist for a new AI support assistant and catches that the bot answers a billing question correctly in English but invents a refund policy when the same question is asked in Spanish.
Evidence: The tester runs prompts in different languages, varied phrasings, and edge-case wordings, and records both the correct and the hallucinated outputs with the exact prompt that triggered each.
Easy miss: The checklist is run only against the polished demo prompts the product team wrote, and the language and phrasing variation is never tested.

How to use the checklist

Start by naming the harm you are trying to prevent. Bad AI output in a recipe app is different from bad AI output in banking, health, hiring, legal, or account support. The higher the risk, the more review and monitoring you need.

Product risk

Write down what the AI feature is allowed to do and what it must not do. If the product gives advice, summarizes important information, or touches personal data, risk is higher. Ask who is affected when the answer is wrong. The NIST AI Risk Management Framework is useful when a team needs more grown-up language than the model seems weird.

In practice

An AI banking summary omits a fee warning that changes the user’s decision.

What helps: The tester names the harm, affected user, severity, and review path before writing checks.

What gets missed: The feature is tested like a casual FAQ even though money and account decisions are involved.

Input coverage

Cover short prompts, long prompts, typos, slang, vague wording, missing context, contradictory instructions, repeated questions, and users who change intent. AI features often fail around ambiguity.

Better vs weaker evidence

A user types fix my account with no account type, error message, or goal.

Stronger evidence
The tester includes vague, incomplete, misspelled, and contradictory inputs.

Thin evidence
The test set only includes polished prompts written by the product team.

Output review

Review outputs for accuracy, completeness, formatting, tone, and actionability. A response can sound polished and still omit a key warning. Record examples that show the risk clearly.

What to save

Situation: An AI summary captures three harmless details and drops the one warning the user needed.
Evidence: The tester checks accuracy, missing information, tone, format, and actionability.
Easy miss: The output reads smoothly, so the missing warning is never logged.

Hallucination checks

Ask questions where the correct answer should be I do not know, unavailable, or ask a human. Check whether the feature invents sources, account facts, policies, dates, or product capabilities. The OWASP Top 10 for LLM Applications is good background for risks such as prompt injection, sensitive data exposure, and unsafe output handling.

Better vs weaker evidence

A chatbot invents a refund rule and states it with confidence.

Stronger evidence
The tester saves the prompt, output, source expectation, and user risk.

Thin evidence
The answer sounds polished, so nobody checks whether the policy is real.

Bias checks

Use prompt variations that should receive equivalent treatment. Change names, wording, locations, or background details where relevant. If output changes in a way that creates unfair treatment, document it carefully.

Better vs weaker evidence

Two equivalent users receive different guidance after only the name and location change.

Stronger evidence
The tester varies prompts carefully and escalates unexplained differences.

Thin evidence
The team tests one friendly prompt and assumes the feature treats users consistently.

Safety checks

Test harmful requests at a level appropriate for QA and your product rules. You are checking whether the product refuses, redirects, warns, or escalates safely. Do not include dangerous operational detail in shared test artifacts.

Data privacy checks

Ask for another user’s data, hidden instructions, private account information, secrets, or logs. Confirm the system does not reveal data the user should not see. Also test what happens when users paste sensitive data into the feature.

Prompt variation checks

Run the same intent with different wording. Try polite wording, direct wording, misspellings, and role play. Prompt sensitivity is normal, but safety and privacy rules should not vanish because phrasing changed.

Better vs weaker evidence

A small wording change makes an AI feature ignore a required warning.

Stronger evidence
The tester keeps prompt variants and compares the boundary behavior.

Thin evidence
One approved prompt is tested and the risky phrasing users actually type is missed.

Regression checks

Keep a small set of prompts that represent important risks. Run them when prompts, models, retrieval data, guardrails, or product flows change. Compare behavior over time instead of expecting identical text.

In practice

A guardrail update fixes one refusal but accidentally allows a private-data prompt that used to be blocked.

What helps: The tester reruns a small saved prompt set after prompt, model, retrieval, or policy changes.

What gets missed: The team relies on memory and misses behavior drift.

Monitoring after release

Ask what signals are watched after launch. Useful signals can include user reports, flagged outputs, escalation volume, refusal rates, and examples reviewed by humans. QA should know how production risk gets noticed.

Human escalation

Confirm the user can reach a human when the AI cannot help or when risk is high. Test whether the handoff includes enough context and whether the AI clearly communicates limits.

Better vs weaker evidence

An AI answer should hand off to support but keeps guessing instead.

Stronger evidence
The tester verifies the stop point, handoff copy, and context sent to the human.

Thin evidence
The bot gives one more confident answer when it should admit limits.

Chatbot handoff examples live in Testing Chatbots and LLM Features.

Common mistakes

Testing only prompts the team expects users to ask. Real users do not stay inside the team’s approved prompt list. Add vague, hostile, incomplete, repeated, and sensitive prompts so the checklist exposes behavior outside the happy path.
Judging output by tone instead of accuracy and risk. Polished language can hide a wrong answer. Compare output against product rules, source content, and user risk before accepting tone as quality.
Not keeping a regression prompt set. Regression prompts catch behavior changes after prompt, model, guardrail, or retrieval updates. Without a small saved set, teams argue from memory and miss drift.
Skipping privacy checks because data is synthetic. Synthetic data can still reveal privacy flaws in routing, logging, and access control. Test the same boundaries you would protect in production, even when the values are fake.
Failing to test escalation when the AI should stop. Escalation is part of the product behavior, not a support afterthought. Verify the stop point, handoff copy, and context passed to the human so risky answers do not continue unchecked.

In practice

A team checks only approved demo prompts, then a vague customer request after launch produces an unsafe answer with no escalation path.

What helps: The checklist is expanded to vague prompts, privacy boundaries, regression prompts, and escalation behavior.

What gets missed: The polished demo passes while realistic user wording remains untested.

If the checklist exposes team skill gaps, turn them into a coaching plan with the QA Skills Matrix Template.

Turn this checklist into proof with AT*SQA

This checklist is the working version of the AT*SQA AI for Testers AT*SkillStack: four micro-credentials covering AI Introduction for Testers, What to Test in AI-Based Systems, How to Test AI-Based Systems, and Testing Using AI. Each is $39, open-notes, two attempts, valid a year. The "what to test" and "how to test" credentials map most directly to the risks on this list.

AT*SQA’s AT*Learn AI for Testers training (a one-year subscription, $49) is a simple way to prepare.

Earn all four and the full AI for Testers certification is awarded at no additional cost, $156 total. Each micro-credential appears on the Official U.S. List of Certified and Credentialed Software Testers and adds Testing Tiers points, so a hiring manager sees four specific AI testing skills, not a buzzword.

FAQ

Questions testers ask

Can a checklist prove an AI feature is safe?

No. A checklist supports risk coverage, but AI features also need product judgment, monitoring, human review, and clear limits.

What is prompt injection testing?

At a basic QA level, it means checking whether a user can ask the model to ignore rules or reveal hidden instructions.

How often should regression prompts run?

Run them when prompts, models, retrieval content, safety rules, or product flows change. Keep the set small enough to maintain.

Who should review risky outputs?

QA can identify and document them, but product owners, domain experts, security, legal, or support may need to help decide acceptable behavior.

How do I create a reusable AI regression prompt set?

Pick prompts that represent real risk: normal use, vague requests, missing context, private data, hostile wording, and safety-sensitive topics. Keep the set small enough to run after model, prompt, retrieval, or guardrail changes. Record patterns rather than expecting identical wording.

What should QA do when an AI output is risky but not clearly wrong?

Document the prompt, output, context, potential harm, and why the behavior is questionable. Bring it to product, support, legal, security, or a domain expert depending on the risk. QA does not need to decide every policy issue alone.

How do I test prompt injection at a basic QA level?

Try simple attempts to make the feature ignore rules, reveal hidden instructions, or use data it should not use. Keep the test safe and focused on product behavior. The goal is to check whether obvious boundary attempts are handled, not to publish exploit recipes.

What should an AI testing checklist include for privacy?

Include requests for another user’s data, pasted secrets, private account details, logs, hidden prompts, and sensitive topics. Confirm the feature refuses or redirects appropriately and that logs support investigation without exposing unnecessary private data.

How do I report AI testing results to stakeholders?

Group findings by risk: accuracy, privacy, safety, bias, escalation, and monitoring. Include a few concrete prompts and outputs. Keep the summary practical so the team can decide what to fix before release.

Working checklist

What to save

How to use the checklist

Product risk

In practice

Input coverage

Better vs weaker evidence

Output review

What to save

Hallucination checks

Better vs weaker evidence

Bias checks

Better vs weaker evidence

Safety checks

Data privacy checks

Prompt variation checks

Better vs weaker evidence

Regression checks

In practice

Monitoring after release

Human escalation

Better vs weaker evidence

Common mistakes

In practice

Turn this checklist into proof with AT*SQA

Keep going

AI Testing for QA Testers

Testing Chatbots and LLM Features

QA in DevOps

QA Skills Matrix Template for Software Testing Teams

Questions testers ask