I asked ChatGPT about the numbers 1 & 4. Which one is bigger?
Sometimes, 1 was bigger. Other times, 4 was bigger. Sharon Zhou ran this experiment at scale to show that the order of yes & no matters in the response.
This is called a non-deterministic or stochastic answer. Identical or near-identical inputs do not consistently produce the same output. The answers have inconsistent logic.
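To make the idea concrete, here is a toy sketch of why the same question can come back with different answers. It assumes the system scores each candidate answer & then samples from those scores with a temperature setting. The scores & the `sample_answer` helper are hypothetical, chosen only for illustration; this is not how ChatGPT is actually implemented.

```python
import math
import random


def sample_answer(scores, temperature, rng):
    """Pick one answer from a dict mapping answer -> score."""
    if temperature == 0:
        # Greedy choice: always return the highest-scoring answer.
        return max(scores, key=scores.get)
    # Softmax with temperature: higher temperature flattens the distribution.
    scaled = {a: s / temperature for a, s in scores.items()}
    top = max(scaled.values())
    weights = {a: math.exp(s - top) for a, s in scaled.items()}
    answers = list(weights)
    return rng.choices(answers, weights=[weights[a] for a in answers], k=1)[0]


# Hypothetical scores for "Which is bigger, 1 or 4?" (made up for this sketch).
scores = {"4": 1.0, "1": 0.5}
rng = random.Random(0)  # seeded so the run is repeatable

sampled = [sample_answer(scores, 1.0, rng) for _ in range(1000)]
greedy = [sample_answer(scores, 0, rng) for _ in range(1000)]

print("sampled answers seen:", sorted(set(sampled)))
print("greedy answers seen:", sorted(set(greedy)))
```

With sampling turned on, both "1" & "4" show up across repeated runs of the same question. With the greedy setting, only the top-scoring answer appears.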
We live with stochastic systems daily : weather reports, ETAs on Google Maps, stock portfolio construction. We are stochastic, too – humans can be moody, err in our calculations, or change our minds with new information.
In these conversations, the robot is sometimes wrong, but never in doubt. When a system produces an answer, we should verify the answer is correct. It’s not just logical errors that occur: hallucinations, when the system invents facts that don’t exist, plagued about half of Bing Chat results in this Stanford study.
We haven’t yet calibrated how much doubt we should bring to these answers. As with a new colleague, we need to understand the system’s strengths & weaknesses.
For consumers, the universe of acceptable outcomes can be quite broad. A rabbit on top of a fire truck has many acceptable answers.
But in the B2B world, consistency matters. Businesses using GenAI will demand consistent answers to prompts like these : what is the company’s revenue by region? Or how do I reset my password? Or how much would I pay if I used 1,000 units of a product?
GenAI will need to write, create, & calculate with a significantly lower error rate than humans.
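One rough way to put a number on consistency is to ask the same prompt many times & measure how often the most common answer comes back. The sketch below assumes a model can be wrapped as a plain function that takes a prompt & returns a string. `flaky_model` & `stable_model` are hypothetical stand-ins, not real APIs, & the 80% figure is invented for the example.

```python
import random
from collections import Counter


def agreement_rate(ask, prompt, n=100):
    """Ask the same prompt n times; return the top answer and its share."""
    answers = Counter(ask(prompt) for _ in range(n))
    top_answer, count = answers.most_common(1)[0]
    return top_answer, count / n


rng = random.Random(42)  # seeded so the run is repeatable


def flaky_model(prompt):
    # Hypothetical stand-in: answers "4" about 80% of the time.
    return "4" if rng.random() < 0.8 else "1"


def stable_model(prompt):
    # Hypothetical stand-in: always gives the same answer.
    return "4"


prompt = "Which is bigger, 1 or 4?"
flaky_top, flaky_rate = agreement_rate(flaky_model, prompt)
stable_top, stable_rate = agreement_rate(stable_model, prompt)

print(f"flaky : top answer {flaky_top}, agreement {flaky_rate:.0%}")
print(f"stable: top answer {stable_top}, agreement {stable_rate:.0%}")
```

Agreement is not the same as correctness: a system can be consistently wrong. A B2B team would likely pair a check like this with known-good answers for each prompt.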
I’m working with ProductBoard on a survey to understand how different B2B startups are planning to leverage AI. If you’re integrating GenAI into your product & interested to hear others’ plans, please fill it out, & we’ll send you the anonymized raw data. Look for the results to be published in a few weeks.