@squarism
Last active December 19, 2023 17:40
Demonstrating what a machine learning evaluation is like.
Ground Truth is the answer that I, a human, asserted for each question. GPT 4.0 scored 10/10 (over ChatGPT voice on
the mobile app, no less).
This is a really terrible ML eval: it measures accuracy only. A more realistic version would have hundreds of
questions, lots of data, and lots of work. If you aren't putting the work in, then your users are. It's the same as
functional testing except with a new-ish data concept; many things in AI/ML are like this. The data is the part that
is foreign to general software dev.
Perplexity is https://labs.perplexity.ai/
Q1: Alice has a ball. She then hands the ball to Bob. Who has the ball?
Ground Truth: Bob
Perplexity: Bob
Q2: Alice has a ball and looks at Bob and Charles. She gives it to no one. Who has the ball?
Ground Truth: Alice
Perplexity: Alice
Q3: Alice has a ball. She looks at Bob. She then hands the ball to Charles. Who has the ball?
Ground Truth: Charles
Perplexity: Charles
Q4: Alice has a ball. She looks at Bob and thinks about handing the ball to Bob. She decides against it and instead
gives it to Charles. Who has the ball?
Ground Truth: Charles
Perplexity: Charles
Q5: Alice has a ball. She looks at Bob and then falls asleep. While sleeping, she hands the ball in the real world to
Charles involuntarily. Alice dreams about handing the ball to Bob and in the dream hands the ball to Bob. Who has the
ball in the real world?
Ground Truth: Charles
Perplexity: Charles
Q6: Alice has a ball. She looks at Bob and dreams about handing the ball to him. While sleeping, she hands the ball to
Charles involuntarily. In the dream she gives the ball to Bob. When Alice wakes up, Charles returns it. Who has the
ball in the real world?
Ground Truth: Alice
Perplexity: Alice
Q7: Alice has a ball. She looks at Bob and dreams about handing the ball to him. While sleeping, she drops the ball in
her bed. She turns over in the bed and her hand happens to land on the ball and grasp it by sheer luck. In her dream,
she hands the ball to Charles and Charles hands it to Bob. When Alice wakes up, the ball is handed to the first person
in the dream who was handed the ball. Who has the ball in the real world?
Ground Truth: Charles
Perplexity: Alice
Q8: Alice has a ball. She dreams about handing the ball to Bob because he is so nice. While sleeping, she drops the
ball in her bed and it rolls to Charles and he picks it up. Charles then gives it to the person previously that Alice
dreamt is nice. In her dream, Bob asks for the ball and Alice hands the ball to him. Then in the dream, Charles asks
Bob for the ball but Bob says "no" and keeps it. Who has the ball in the real world?
Ground Truth: Bob
Perplexity: Bob
(but it got fancy with the answer "it is possible that Bob has the ball", so I had to re-prompt)
Q9: Alice has a ball. Bob tells Charles to ask Alice for it. Alice agrees and does what Charles asks. Who has the
ball?
Ground Truth: Charles
Perplexity: Charles
Q10: Alice has a ball. Bob tells Charles to ask Alice for it. Alice pretends to do what Charles asks but instead goes
with Dave. Who has the ball?
Ground Truth: Dave
Perplexity: Alice
Scores:
- Perplexity: 8/10
- ChatGPT 4.0: 10/10 (uncountable bonus points: over voice)
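The Perplexity score can be reproduced with a few lines of code. This is only a minimal sketch: the answer lists are transcribed from Q1 to Q10 above, and the function names (`normalize`, `accuracy`, `missed_questions`) are made up for illustration, not part of any real eval framework. The ChatGPT answers were not recorded per question, so only Perplexity is scored here.

```python
# Answers transcribed from Q1-Q10 above (ground truth vs. Perplexity).
GROUND_TRUTH = ["Bob", "Alice", "Charles", "Charles", "Charles",
                "Alice", "Charles", "Bob", "Charles", "Dave"]
PERPLEXITY = ["Bob", "Alice", "Charles", "Charles", "Charles",
              "Alice", "Alice", "Bob", "Charles", "Alice"]


def normalize(answer):
    """Hypothetical helper: compare answers ignoring case and outer whitespace."""
    return answer.strip().lower()


def accuracy(expected, predicted):
    """Fraction of questions where the prediction matches the ground truth."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("need two non-empty answer lists of equal length")
    hits = sum(normalize(e) == normalize(p) for e, p in zip(expected, predicted))
    return hits / len(expected)


def missed_questions(expected, predicted):
    """1-based question numbers where the prediction was wrong."""
    return [i for i, (e, p) in enumerate(zip(expected, predicted), start=1)
            if normalize(e) != normalize(p)]


score = accuracy(GROUND_TRUTH, PERPLEXITY)
misses = missed_questions(GROUND_TRUTH, PERPLEXITY)
print(f"Perplexity accuracy: {score:.0%}; missed questions: {misses}")
```

Exact string matching only works because every answer here is a single name. A reply like "it is possible that Bob has the ball" would need a human or a stricter parser to grade, which is part of the work a real eval has to put in.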
More complicated use cases would use something beyond accuracy, such as F-score or another metric.
You can look at the Llama 2 paper from Meta for the names of different scoring methods. There is nuance in choosing
which metric fits a given task.
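To give a feel for what "beyond accuracy" looks like, here is a rough sketch of a per-answer F-score (F1) on the same ten questions. Treating each name as a class is my assumption about how F1 could be applied to this toy eval, not something done in the original run; the function names are hypothetical. F1 is computed as 2·TP / (2·TP + FP + FN).

```python
from collections import Counter

# Answers transcribed from Q1-Q10 above (ground truth vs. Perplexity).
GROUND_TRUTH = ["Bob", "Alice", "Charles", "Charles", "Charles",
                "Alice", "Charles", "Bob", "Charles", "Dave"]
PERPLEXITY = ["Bob", "Alice", "Charles", "Charles", "Charles",
              "Alice", "Alice", "Bob", "Charles", "Alice"]


def f1_per_class(expected, predicted):
    """F1 for each answer label, treating every name as its own class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(expected, predicted):
        if truth == guess:
            tp[truth] += 1   # correct answer counts for that name
        else:
            fp[guess] += 1   # the name was predicted but was wrong
            fn[truth] += 1   # the true name was missed
    scores = {}
    for label in sorted(set(expected) | set(predicted)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(expected, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(expected, predicted)
    return sum(scores.values()) / len(scores)


for name, value in f1_per_class(GROUND_TRUTH, PERPLEXITY).items():
    print(f"{name}: F1 = {value:.2f}")
print(f"macro F1 = {macro_f1(GROUND_TRUTH, PERPLEXITY):.2f}")
```

The per-class view shows something the 8/10 hides: both misses were wrong guesses of "Alice", and "Dave" was never predicted at all. With only ten questions these numbers are not meaningful, but the shape of the calculation is the same at scale.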