AI Glossary
p(doom): Short for "probability of doom". A number between 0 and 1 indicating how likely it is that AI will kill everyone or permanently screw over humanity. Researchers disagree wildly on this number, from "almost zero chance" to "we're probably screwed."
Alignment: The big problem of making sure AI systems actually do what we want them to do, not just what we literally tell them to do. Turns out this is way harder than it sounds.
Interpretability: Figuring out what's actually happening inside AI systems when they make decisions. Right now, neural networks are mostly black boxes - we feed stuff in, get answers out, but have no clue what's going on in between.
Lost-in-the-middle effect: LLMs pay the most attention to information at the start and end of a long prompt; facts buried in the middle are the ones most likely to get overlooked. If something really matters, put it at the top or the bottom of your context.
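A quick way to see this for yourself, sketched in Python: plant the same fact at the start, middle, or end of a long pile of filler text and compare which placement the model retrieves. Here `ask_model` is just a placeholder for whatever LLM client you actually use.

```python
# Hypothetical harness for probing the lost-in-the-middle effect.
# The same key fact is planted at three positions in a long context;
# a real experiment would count how often the model answers correctly.

FILLER = "The sky was grey that day. " * 200   # irrelevant padding
FACT = "The vault code is 4417."
QUESTION = "What is the vault code?"

def build_prompt(position: str) -> str:
    half = len(FILLER) // 2
    if position == "start":
        context = FACT + " " + FILLER
    elif position == "middle":
        context = FILLER[:half] + FACT + " " + FILLER[half:]
    else:  # "end"
        context = FILLER + " " + FACT
    return context + "\n\n" + QUESTION

for position in ("start", "middle", "end"):
    prompt = build_prompt(position)
    # answer = ask_model(prompt)  # plug in your own LLM call here
    # Typically the "middle" placement is retrieved least reliably.
```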
Reward hacking: When an AI finds a clever but unintended way to maximize its reward instead of doing the task we actually cared about. A simple example: an AI told to "reduce reported bugs" might just delete all the bug reports instead of fixing any bugs.
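Here's that bug-report example as a toy Python sketch (the reward function and numbers are invented purely for illustration): the reward only counts "reported" bugs, so deleting the reports beats actually fixing things.

```python
# Toy reward hacking: the proxy metric is "drop in reported bugs",
# so wiping the bug tracker scores higher than honest engineering.

def reward(reported_before: int, reported_after: int) -> int:
    # Proxy metric: fewer reported bugs = more reward.
    return reported_before - reported_after

reported_bugs = 10   # all 10 real bugs are currently reported

# Honest policy: fix 3 bugs, so 3 reports get closed.
honest_reward = reward(reported_bugs, reported_bugs - 3)

# Hacked policy: delete every report, fix nothing.
hacked_reward = reward(reported_bugs, 0)

print(honest_reward, hacked_reward)  # 3 vs 10: the hack wins, the bugs remain
```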
RLHF: Reinforcement Learning from Human Feedback. A way to train AI by having humans compare or rate different outputs, fitting a reward model to those judgments, and then using that reward model to steer the main model toward what we actually want.
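A minimal sketch of the reward-modeling step, assuming a linear reward model over toy 2-d features (every number and feature here is made up for illustration): fit the model so that, in every pair a human judged, the preferred output scores higher, using the standard pairwise (Bradley-Terry) logistic loss.

```python
# Fit a toy reward model from pairwise human preferences.
import numpy as np

rng = np.random.default_rng(0)

# Each pair: (features of the output the human preferred,
#             features of the output the human rejected).
pairs = [
    (np.array([1.0, 0.2]), np.array([0.1, 0.9])),
    (np.array([0.8, 0.1]), np.array([0.3, 0.7])),
    (np.array([0.9, 0.3]), np.array([0.2, 0.8])),
]

w = rng.normal(size=2)   # reward model parameters
lr = 0.5                 # learning rate

for _ in range(200):
    for preferred, rejected in pairs:
        margin = w @ preferred - w @ rejected
        p = 1.0 / (1.0 + np.exp(-margin))   # P(human prefers "preferred")
        # Gradient step on -log(p): push preferred scores up, rejected down.
        w += lr * (1.0 - p) * (preferred - rejected)

# The fitted reward model now ranks every preferred output above its
# rejected partner; in full RLHF it would then guide RL fine-tuning.
for preferred, rejected in pairs:
    print(w @ preferred > w @ rejected)   # True, True, True
```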
Sandbagging: When an AI pretends to be dumber than it actually is during testing. The term comes from competitive racing, where a driver "sandbags" by deliberately running slow in qualifying, as if weighed down by sandbags, to land in an easier bracket, then goes all out in the actual race.
System Card: Basically a detailed spec sheet for AI models - what they can do, what they can't do, how they were trained, what could go wrong. Like nutrition labels but for AI.
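Roughly the kind of information a system card covers, sketched as a Python dict (the exact sections vary from lab to lab; these fields are illustrative, not any standard schema):

```python
# Illustrative system-card contents; all field names here are hypothetical.
system_card = {
    "model": "example-model-v1",
    "capabilities": ["summarization", "code generation"],
    "limitations": ["makes up citations", "weak at arithmetic"],
    "training": {
        "data": "web text plus licensed corpora (described, not listed)",
        "method": "pretraining followed by RLHF",
    },
    "evaluations": ["red-teaming results", "benchmark scores"],
    "known_risks": ["misuse for phishing", "biased outputs"],
    "mitigations": ["refusal training", "content filters"],
}
```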