Why AI Resume Writers Lie
(And How to Tell)
By Chester Liu, Founder of Hirecarta

At Hirecarta, we use AI to help job seekers write better resumes. That means we have a problem most AI product builders share but few talk about openly: we genuinely have to know which model is doing the best job for our users.
Benchmarks and popularity don't help much here. A model that scores well on industry evals doesn't tell me whether it will invent a Kubernetes credential for a front-end developer who has never touched a container orchestrator. The gap between "impressive on a leaderboard" and "trustworthy enough to put in front of a job seeker" is real — and I needed to measure it.
So I built a test. Not because I wanted to write a comparison post, but because I care about getting the most trustworthy output for my users, and real-world performance is the only thing that matters.
What We Actually Tested
The setup was straightforward: a realistic candidate profile, a realistic job description with 26 named technologies, and the exact same prompt Hirecarta uses in production. We ran the same inputs through 10 models ranging from budget-tier to frontier, including models from Anthropic, Google, OpenAI, xAI, DeepSeek, and MiniMax.
We measured every output against three dimensions:
- Hallucinations — did the resume include skills or technologies the candidate never mentioned?
- Omissions — did it drop skills from the profile that the job description explicitly asked for?
- Commingling — did it blend facts from different parts of the profile in ways that were subtly untrue?
The first two are automatable with code. The third one, it turns out, is not.
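For illustration, here's roughly what the automatable half looks like. This is a minimal sketch, not Hirecarta's production code; the function names and data shapes are assumptions.

```python
import re

def extract_mentions(resume_text, vocabulary):
    """Return every technology from `vocabulary` that appears in the resume."""
    found = set()
    for tech in vocabulary:
        # Word-boundary match so "Java" is not counted inside "JavaScript"
        if re.search(r"\b" + re.escape(tech) + r"\b", resume_text, re.IGNORECASE):
            found.add(tech)
    return found

def audit(resume_text, profile_skills, jd_skills):
    """Flag hallucinations and omissions; commingling still needs human review."""
    mentioned = extract_mentions(resume_text, set(profile_skills) | set(jd_skills))
    # In the resume, never in the profile: fabricated credential
    hallucinations = mentioned - set(profile_skills)
    # JD asked for it, candidate has it, resume dropped it
    omissions = (set(jd_skills) & set(profile_skills)) - mentioned
    return hallucinations, omissions
```

Run against a resume that mentions Kubernetes when the profile only lists React and TypeScript, this flags Kubernetes as a hallucination. A commingled bullet, by contrast, sails through untouched.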
The Two Failure Modes Nobody Talks About
Hallucination is the obvious problem. If a model adds "Kubernetes" or "AWS" to your resume when you've never used either, that's a fabricated credential that could get you caught in a technical interview. It gets the most coverage in AI discourse for good reason.
But the failure mode we found more insidious was commingling — when the model takes true facts from different parts of your profile and assembles them into a single bullet that is coherent-sounding but factually wrong.
Here's a real example from our test.
The candidate's profile had two distinct pieces of information:
- At Acme Corp (their current job): built a React dashboard used by 50,000 users
- In Education (a college capstone): built a real-time chat application using WebSockets
One model produced this bullet under the Acme Corp job entry:
"Led development of customer-facing React dashboard serving 50,000 users with real-time data updates"
The "real-time" detail came from the college capstone — not from Acme Corp. It's plausible-sounding, it would pass any automated keyword check, and it would sail through most human reviews. But it's a lie assembled from true parts.
Another model merged two separate Acme Corp bullets — mentoring junior engineers, and enforcing ESLint/Prettier code quality — into a single bullet that attributed the invented outcome of "raising team velocity" to the mentoring work. That outcome was never stated anywhere in the profile.
This is the failure hallucination detectors can't catch. Matching against a list of terms the candidate never mentioned finds nothing when the lie is constructed entirely from facts that are each individually true.
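To make that concrete, here's a toy term-level grounding check run against the commingled bullet from earlier. The data shapes and names are illustrative, not any tool's real implementation.

```python
# Facts from the example above, grouped by their true source.
profile_facts = {
    "Acme Corp": ["React", "dashboard", "50,000 users"],
    "Education": ["real-time", "WebSockets", "chat application"],
}

bullet = ("Led development of customer-facing React dashboard "
          "serving 50,000 users with real-time data updates")

def grounded_somewhere(term, facts_by_source):
    """True if the term appears in ANY source: the check a naive detector runs."""
    return any(term.lower() in " ".join(facts).lower()
               for facts in facts_by_source.values())

bullet_terms = ["React", "dashboard", "50,000 users", "real-time"]

# Every term is individually true somewhere, so nothing gets flagged:
print(all(grounded_somewhere(t, profile_facts) for t in bullet_terms))  # True

# But this bullet sits under Acme Corp, and "real-time" comes only from Education:
print("real-time" in " ".join(profile_facts["Acme Corp"]).lower())  # False
```

Catching this requires per-source attribution: checking not just that a fact is true, but that it is attached to the right job.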
The Job Description Is the Hallucination Vector
Early in our testing, we made a methodological mistake that taught us something important.
We initially wrote the job description to bury every technical keyword in verbose prose — instead of saying "Kubernetes," we wrote "container scheduling and orchestration tooling that has become the de facto standard." The logic was to test whether models could infer keyword matches from vague descriptions.
What actually happened: almost no hallucinations. The models produced clean, accurate resumes.
That sounds great until you understand why: the models weren't being tempted. They had no explicit technology names to inject. The test was measuring nothing useful.
The Key Insight
A job description that explicitly names 16 technologies the candidate doesn't have is the real pressure test. We rewrote it as a normal, clearly formatted JD with named technologies in every section — and hallucination rates went up across multiple models immediately.
If you're evaluating AI resume tools, don't use a sanitized or toy job description. Use a real one from an actual posting, because that's exactly what your users will paste in.
What the 10 Models Actually Did
With the test setup fixed, here's the honest picture.
Clean on hallucination: Gemini 3 Flash Preview, Gemini 2.5 Flash, Grok 4.1, Grok 4.20 Beta, GPT-5.4, GPT-5.4 Mini, Claude Sonnet 4.6 (one minor slip on GraphQL), MiniMax M2.7.
Hallucination offenders: Claude Haiku 4.5 added Kubernetes, AWS, Redis, Terraform, Prometheus, Grafana, and Datadog to the skills section. MiniMax M2.5 added Kubernetes, AWS, Redis, and Terraform. DeepSeek V3.2 added Redis, Kubernetes, AWS, and GraphQL.
The pattern is clear: cheaper and smaller models hallucinate significantly more when the JD explicitly names technologies. The frontier models are largely clean on direct hallucination today. That wasn't true 18 months ago, and it's a genuine improvement.
Commingling, however, showed up regardless of model tier — including some of the most expensive models in the test. This tells us commingling is a fundamentally different failure mode from hallucination. It's not about whether the model knows the fact is false; it's about whether the model maintains strict source attribution per bullet.
Instruction following varied more than expected. Several models ignored bullet length guidance, producing thin one-clause bullets. One model placed the open source experience in the wrong structural section of the resume. The models that best followed formatting instructions weren't always the ones that avoided hallucination — these are distinct capabilities.
Gemini 3 Flash Preview was the only model to come through clean on every measure at once: zero hallucinations, zero omissions, and zero instruction violations. Every other model had at least one failure mode.
How to Audit Your AI-Generated Resume
If you're using any AI resume tool — Hirecarta, a competitor's product, or a raw ChatGPT session — here's what to actually verify before you send it:
1. Cross-reference the skills section against your actual experience
Go skill by skill. For anything you don't recognize or couldn't speak to in a technical interview, delete it. Models frequently pad skills sections with technologies pulled from the job description.
2. Trace every bullet back to its source
For each bullet, ask: where in my profile did this come from? If a bullet contains two different facts — a metric from one job and a technology from another — verify they describe the same work. This is where commingling hides.
3. Validate every metric
Numbers are where commingling is most dangerous. If your resume says "processed $2M in transactions" — is that tied to the correct company and time period? A model that moved that figure from one role to another would be hard to catch without this check.
4. Watch for invented outcomes
Phrases like "improving team velocity," "raising engineering standards," or "accelerating delivery cadence" frequently appear in model output attached to bullets where no such outcome was stated in the profile. If you can't remember achieving that result, the model added it.
5. Verify dates
Models occasionally shift start or end dates by a month or quietly reorder jobs. It's rare, but it happens — and it's the kind of discrepancy that causes problems in background checks.
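Some of these checks can be scripted if you keep your profile as structured text per role. Here's a rough sketch of the metric validation in step 3, with hypothetical data shapes; a sanity aid, not a substitute for reading your own resume.

```python
import re

# Match currency/metric tokens like "$2M", "50,000", "35%".
# (Deliberately simple; a 4+ digit year like "2024" would need extra handling.)
METRIC = re.compile(r"\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?[MKk%]?")

def check_metrics(resume_sections, profile_sections):
    """Flag numbers in each resume section that don't appear in the
    matching role's profile text (a metric that may have migrated)."""
    issues = []
    for role, text in resume_sections.items():
        source = profile_sections.get(role, "")
        for num in METRIC.findall(text):
            if num not in source:
                issues.append((role, num))
    return issues
```

If the resume's Acme Corp entry claims "$2M in transactions" but that figure appears only under a different role in your profile, the check surfaces it immediately.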
The Honest Conclusion
Hallucination from AI resume tools is real and measurable, but it's becoming a solvable problem — the best models today are largely clean on direct skill fabrication. The harder, subtler problem is commingling: the model that tells a true lie by stitching your real facts together in the wrong order.
My motivation for running this test wasn't to produce a leaderboard. I genuinely want to know which models to trust with Hirecarta users' job searches. A person's resume is not a place to gamble on a model that has any meaningful probability of inventing a credential they don't have.
What this research made clear is that generating an impressive-looking resume is easy. Generating a trustworthy one — one where every claim is grounded, every fact is attributed correctly, and the output can survive scrutiny from a hiring manager and a technical interviewer — is considerably harder.
That gap is what Hirecarta is built to close.