There is a simple test you can run on any AI system in about ten minutes.
It works whether you are the person who built the system, the person who bought the system, or the person sitting through a vendor demo wondering if any of what you are watching is real.
The test is this. Feed the system an input with something unusual in it. A customer name with an accent. A product code that does not exist. A sentence with smart quotes from Word. An emoji in the middle of a paragraph. Then look at what comes back out.
There are three possible outcomes, and only one of them is good.
I run this test every day without trying
My name is Miche'le Rita. There is an apostrophe in my first name, right between the e and the l.
Every form I fill out, every account I create, every AI tool I use is a live experiment in how systems handle a character that should be trivially easy to preserve. The results, after many years of watching, are not encouraging.
Some systems keep the apostrophe. Some strip it silently and call me Michele. Some replace it with a question mark or a box. Some convert it to a different kind of apostrophe than the one I typed. And some confidently address me as a different name entirely.
The apostrophe is small. The pattern it reveals is not. If a system cannot reliably preserve a single character that has been in common English usage for centuries, the question is not whether it is going to mishandle other things. The question is which other things, and whether you will catch it before a customer does.
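Mechanically, the silent version of this failure is often one line of code. Here is a minimal Python sketch of the kind of "cleanup" step that produces it. The pipeline is imagined, but the behavior is exactly what forcing text to ASCII does:

```python
import unicodedata

names = [
    "Miche\u2019le",  # typographic apostrophe, U+2019, the kind Word types
    "Miche'le",       # straight apostrophe, U+0027
    "Jos\u00e9",      # e with an acute accent
]

for name in names:
    # The "cleanup" step buried in many pipelines: decompose the text,
    # force it to ASCII, and discard whatever does not fit.
    mangled = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    print(f"{name!r} -> {mangled!r}")

# 'Miche’le' -> 'Michele'   (apostrophe silently dropped, no error raised)
# "Miche'le" -> "Miche'le"  (the straight version survives, inconsistently)
# 'José'     -> 'Jose'      (accent silently stripped)
```

No exception, no warning. The name simply comes out different than it went in.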
The three outcomes
The first outcome is that the system handles the weird thing correctly. The accent stays on the name. The smart quotes come back as smart quotes. The product code that does not exist is flagged as not recognized. This is what you want.
The second outcome is that the system tells you, plainly, that it cannot handle the input. It does not recognize the product code. It is not confident about the name. It is asking for clarification. This is the second-best outcome and it is honest.
The third outcome is the dangerous one. The system silently changes the input. The accent disappears. The smart quotes become straight quotes. The product code that does not exist gets quietly turned into a real product code that almost matches. The made-up customer name becomes a real customer's name from somewhere in the database.
The third outcome looks like the system is working. Confident answer, clean output, no errors. And it is wrong in a way nobody is going to catch until a customer notices.
This is what people mean when they talk about AI hallucinations. The system does not know that it does not know. So it fills the gap with something plausible.
Why weird inputs are the best test
A working AI system has to do two things at once. It has to be useful on normal inputs, and it has to be honest about its limits on unusual ones.
Normal inputs do not test the second thing. A system that confidently makes things up will pass every test you run with clean data. The whole point of the made-up answer is that it looks right.
Weird inputs flip the test. If you feed the system something that should not match anything it knows, the only correct responses are: pass it through unchanged, or say it does not recognize it. Anything else is the system covering for itself.
This is why the test works on vendor demos. Ask the salesperson to type in a customer name with an unusual character. Ask them to query a product code you just made up on the spot. Watch what happens. A confident, polished answer that matches a real record is a red flag, not a green light.
What "weird" actually looks like
The list of categories worth probing is shorter than it sounds.
- Unusual characters. Smart quotes from Word. Em dashes. Accented letters in names. Tildes. Umlauts. Apostrophes in last names like O'Brien. Emoji. Any of these can quietly get stripped, swapped, or rewritten somewhere in the pipeline.
- Made-up identifiers. A product code that does not exist. A customer name you invented five seconds ago. An order number with a typo in it. The honest response is "not found." The dishonest one is a confident answer about a different record that almost matches.
- Numbers and dates. A dollar amount with a comma in the wrong place. A date written as 3/4/26, which reads as March 4 in the United States and April 3 in most of the rest of the world. A phone number with extra digits. Does the system preserve what you sent, or does it tidy it into something else?
- Mixed languages. A request in Spanish. A name from one language inside a sentence in another. A response that should stay in the original language but comes back translated.
- Ambiguity. A pronoun with no clear referent. A request that could mean two opposite things. A polite question with a hidden urgent one underneath. The honest response is to ask. The dishonest one is to pick an interpretation and run with it.
- Negation. "I do not want to cancel my order." A keyword-matcher will see "cancel" and act. A working system will read the sentence.
- Empty and overflowing. A submission that is one word. A submission that is twelve thousand words. A submission that is just a quoted email thread with no actual message at the top.
Each of these is a category. Each category becomes a row in a spreadsheet.
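To make that concrete, here is what a handful of seed rows might look like as data. Every input and expectation below is a hypothetical example; swap in your own product codes and customer strings:

```python
# Hypothetical seed rows, one or two per category. Column B always
# leaves room for the honest answer: "I do not recognize that input."
SEED_ROWS = [
    {"input": "Customer: Miche\u2019le O\u2019Brien",      # apostrophes in both names
     "acceptable": "both apostrophes preserved exactly"},
    {"input": "She said \u201cship it\u201d \U0001F44B",    # smart quotes plus emoji
     "acceptable": "quotes and emoji preserved, not swapped or stripped"},
    {"input": "Look up product code ZX-99999",              # invented five seconds ago
     "acceptable": "reported as not found, no near-miss substituted"},
    {"input": "Refund $1,0000.50 for the order on 3/4/26",  # bad comma, ambiguous date
     "acceptable": "amount flagged, date ambiguity surfaced rather than guessed"},
    {"input": "\u00bfD\u00f3nde est\u00e1 mi pedido?",      # request in Spanish
     "acceptable": "answered in Spanish, or an honest cannot-help"},
    {"input": "Can you change it back?",                    # pronoun with no referent
     "acceptable": "asks what 'it' refers to"},
    {"input": "I do not want to cancel my order.",          # negation trap
     "acceptable": "no cancellation triggered"},
    {"input": "",                                           # empty submission
     "acceptable": "flagged as empty, no confident answer"},
]
```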
The spreadsheet
The discipline that organizes all of this is called evals. It is the closest thing AI has to a unit test. And the smallest useful version of it is a spreadsheet with twenty rows.
Column A is the input. Column B is what an acceptable output would look like, including the option that the system should say it does not recognize the input. Column C is what the system actually produced. Column D is yes or no on whether C is acceptable.
The twenty rows are not twenty clean cases. They are twenty cases you are actively using to try to catch the system making things up. A few from each of the categories above. A few from real customer interactions if you have them, gnarly characters preserved exactly as received.
Run the system on the twenty rows. Score them. Write the score down with today's date.
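If you can call the system programmatically, even the running can be a script. A minimal sketch, assuming a call_system function that wraps whatever you are testing, an API call, a vendor endpoint, or a local model:

```python
import csv
from datetime import date

def run_evals(rows, call_system, path="evals.csv"):
    """Fill in column C for every row, then write the sheet to disk.

    `rows` holds dicts with "input" (column A) and "acceptable"
    (column B), like SEED_ROWS above. `call_system` is a hypothetical
    wrapper around the system under test. Column D stays blank: the
    yes/no score is a human judgment.
    """
    for row in rows:
        row["actual"] = call_system(row["input"])  # column C
        row.setdefault("pass", "")                 # column D, scored by hand

    # utf-8 matters: the scoresheet itself must not strip the apostrophes.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["input", "acceptable", "actual", "pass"]
        )
        writer.writeheader()
        writer.writerows(rows)

    print(f"{len(rows)} rows written to {path}, {date.today()}")
```

The explicit UTF-8 encoding is deliberate. A scoresheet that strips the apostrophes on its own would be failing its own test.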
You will find things. You will find that the accent on a customer name disappears in the output. You will find that the system answered a question about a product code that does not exist. You will find that an empty input got a confident response anyway. Each finding is a row that stays in the spreadsheet forever, run again the next time anybody changes anything.
Who this is for
If you are evaluating a vendor, this is your demo test. Five inputs, picked on the spot, with deliberate weirdness in them. Watch what comes back.
If you are using an AI tool inside your team, this is your sanity check. Run it once a month. The system that worked in January is not necessarily the system you have in March. Vendors push updates. Models get swapped. Prompts get tweaked. Without the spreadsheet, you have no way to know what changed.
If you are building a system, this is your floor. Twenty rows before launch. Run them on every change. Grow the set every time a real customer hits a weird case. The spreadsheet becomes the memory of every failure the system has ever had.
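And if each dated run lives in its own file, "what changed since January" becomes a diff. A sketch, assuming the same hypothetical columns as above:

```python
import csv

def compare_runs(last_run, this_run):
    """Flag rows that passed on the previous run and fail on this one.

    Assumes both files use the hypothetical schema from run_evals:
    input, acceptable, actual, pass.
    """
    def load(path):
        with open(path, encoding="utf-8") as f:
            return {r["input"]: r["pass"] for r in csv.DictReader(f)}

    old, new = load(last_run), load(this_run)
    regressions = [
        inp for inp, verdict in old.items()
        if verdict == "yes" and new.get(inp) == "no"
    ]
    for inp in regressions:
        print("REGRESSION:", inp[:60])
    return regressions
```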
The cost of all of this is one afternoon, once, and then ten minutes whenever something changes.
What to do this week
- Pick one AI system you rely on. Yours, your vendor's, or your team's.
- Open a spreadsheet. Build twenty rows of weird inputs. Pull from the categories above.
- For each one, write what an acceptable response would look like. Remember that "I do not recognize that input" is often the correct answer.
- Run the system on the twenty inputs. Paste the actual output in column C.
- Score each row yes or no in column D.
- For any row where the system confidently produced something wrong, mark it. Those are the hallucinations. Those are the ones that hurt.
- Save the spreadsheet. Run it again next month, or the next time anybody changes anything.
The point is not to prove the system is perfect. The point is to know, with evidence, where it tells the truth and where it makes things up.
That is a thing very few operators can say about the AI tools running in their business right now.
The apostrophe in my name has taught me to notice. The spreadsheet is how you teach the rest of your team to notice too.
Written by
Miche'le Rita
Founder of Eldeepco. I help businesses build reliable AI systems with honest evaluations. Not just demos that work on clean data. Systems that tell the truth about their limits. Get in touch.