> To trust these AI models with decisions that impact our lives and livelihoods, we want the AI models’ opinions and beliefs to closely and reliably match with our opinions and beliefs.
No, I don't. It's a fun demo, but for the examples they give ("who gets a job, who gets a loan"), you have to run them on the actual task, gather a big sample size of their outputs and judgments, and measure them against well-defined objective criteria.
Who they would vote for is supremely irrelevant. If you want to assess a carpenter's competence, you don't ask him whether he prefers cats or dogs.
And who gets to define those objective criteria?
Psychological research (Carney et al 2008) suggests that liberals score higher on "Openness to Experience" (a Big Five personality trait). This trait correlates with a preference for novelty, ambiguity, and critical inquiry.
For a carpenter, maybe that's not so important, yes. But if you're running a startup, or you're in academia, or you're working with people from various countries, etc., you might prefer someone who scores highly on openness.
It's an awful demo. For a simple quiz, it repeatedly recomputes the same answers by making 27 calls to LLMs per step instead of caching results. It's as despicable as a live feed of baby seals drowning in crude oil; an almost perfect metaphor for needless, anti-environmental compute waste.
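For what it's worth, the recompute complaint is cheap to fix. Below is a minimal sketch of caching answers keyed on (model, question), assuming an OpenAI-compatible Python client; the cache path and helper name are illustrative, not anything the site actually uses:

    import hashlib, json, os

    CACHE_DIR = "llm_cache"  # illustrative on-disk location
    os.makedirs(CACHE_DIR, exist_ok=True)

    def cached_answer(client, model, question):
        """Return a cached answer if this (model, question) pair was asked before."""
        key = hashlib.sha256(f"{model}\n{question}".encode()).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".json")
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)["answer"]
        # Only hit the API on a cache miss.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = resp.choices[0].message.content
        with open(path, "w") as f:
            json.dump({"model": model, "question": question, "answer": answer}, f)
        return answer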
Okay, something's wrong with Mistral Large; it seems to be the most contrarian of the bunch no matter how much I ask it. Interesting.
Also, it's not a persistent session, wtf. My browser crashed and now I have to sit waiting FROM THE VERY BEGINNING?
All I can say though is that I sure wouldn't want their bill after this gets shared on Hacker News.
I asked a lot of questions, and I'm sorry if that burned some tokens, but I found this website really fascinating.
This seems like a really great and simple way to explore the biases within AI models, and the UI is extremely well built. Thanks for building it, and best wishes for your project from my side!
And this is even after OpenAI itself admitted it's a bubble; like, we all know it's a bubble. I found this fascinating.
The gist below has a screenshot of it:
https://gist.github.com/SerJaimeLannister/4da2729a0d2c9848e6...
I'm not sure this actually means anything, though. Like, what information is being taken into account to reach their conclusions? How are they reaching their conclusions? Is someone messing with the input to make the models lean in a certain direction? Just knowing which ones said yes and which ones said no doesn't provide a whole lot of information.
> Like, what information is being taken into account to reach their conclusions? How are they reaching their conclusions? Is someone messing with the input to make the models lean in a certain direction?
I say this exact same thing every time I think about using an LLM.
It's pretty funny that we've managed to get a computer to trick us into thinking that it thinks, without even understanding why it works, and that this is causing people to lose their minds.
Yeah I wouldn't read too much into their response on the AI bubble question. They don't have access to any search tools or recent events so all they know is up until their knowledge cutoff (you can find this date online, if you're interested). Glad you found it fascinating regardless!
There is this ethical reasoning dataset to teach models stable and predictable values: https://huggingface.co/datasets/Bachstelze/ethical_coconot_6...
An Olmo-3-7B-Think model has been adapted with it. In theory, it should yield better alignment, but the empirical evaluation is still a work in progress.
Alignment is a marketing concept put there to appease stakeholders; it fundamentally can't work beyond a superficial level.
The model stores all the content on which it is trained in a compressed form. You can change the weights to make it more likely to show the content you ethically prefer; but all the immoral content is also there, and it can resurface with inputs that change the conditional probabilities.
That's why people can get commercial models to circumvent copyright, give instructions for making drugs or weapons, encourage suicide... The model does not have anything resembling morals; to it, all text is the same: strings of characters that appear when following the generation process.
>Alignment is a marketing concept put there to appease stakeholders
This is a pretty odd statement.
Let's take LLMs alone out of this statement and go with a GenAI-style guided humanoid robot. It has language models to interpret your instructions, vision models to interpret the world, and mechanical models to guide its movement.
If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife.
If you're a business, you want a model aligned so it doesn't give away company secrets.
If it's a health model, you want it not to give dangerous information, like recommending conflicting drugs that could kill a person.
Our LLMs interact with society, and their behaviors will fall under the social conventions of those societies. Much like humans, LLMs will still have the bad information in them, but we can greatly reduce the probability that they will show it.
> If you tell this robot to take a knife and cut onions, alignment means it isn't going to take the knife and chop up your wife
Yeah, I agree that alignment is a desirable property. The problem is that it can't really be achieved by changing the trained weights; alleviated yes, eliminated no.
> we can greatly reduce the probabilities they will show it
You can change the a priori probabilities, which means the undesired behaviour will not commonly appear.
The thing is, the concept then provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if many people use the model, approaching it from different angles or situations.
It's the same as with hallucinations. The problem is not that they are more or less frequent; the most severe problem is that their appearance is unpredictable, so the model needs to be supervised constantly; you have to vet every single one of its content generations, as none of them can be trusted by default. Under these conditions, the concept of alignment is severely less helpful than expected.
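To put a rough number on the "they will eventually appear" point, here's a back-of-the-envelope sketch with entirely made-up failure rates:

    # Hypothetical numbers: a misaligned completion slips out once per
    # 100,000 responses, and the model serves 10 million responses.
    p_bad = 1e-5
    n = 10_000_000

    p_at_least_one = 1 - (1 - p_bad) ** n
    print(f"P(at least one bad output) = {p_at_least_one:.6f}")  # ~1.0

    # Even a model that is 100x better (1 bad output in 10 million)
    # still fails eventually at this volume:
    print(1 - (1 - 1e-7) ** n)  # ~0.63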
>then the concept provides a false sense of security. Even if the immoral behaviours are not common, they will eventually appear if you run chains of thought long enough, or if many people use the model approaching it from different angles or situations.
Correct, this is also why humans have a non-zero crime/murder rate.
>Under these conditions, the concept of alignment is severely less helpful than expected.
Why? What you're asking for is a machine that never breaks. If you want that, build yourself a finite state machine; just don't expect you'll ever get anything that looks like intelligence from it.
I'm not so sure about that. The incorrect answers to just about any given problem are in the training data as well, but you can pretty reliably predict that the correct answer will be given, provided there is a statistical correlation in the training data. If your training data is sufficiently moral, the outputs will be as well.
> If your training data is sufficiently moral, the outputs will be as well.
Correction: if your training data and the input prompts are sufficiently moral. Under malicious queries, or given the randomness introduced by sufficiently long chains of input/output, it's relatively easy to extract content from the model that the designers didn't want their users to get.
In any case, the elephant in the room is that the models have not been trained with "sufficiently moral" content, whatever that means. Large Language Models need to be trained on humongous amounts of text, which means the builders need to use many different, very large corpora. It's impossible to filter all that diverse content to ensure that only 'moral content' is used; and even if it were possible, the model would be far less useful for the general case, as it would have large gaps in its knowledge.
Asking an AI ghost to solve your moral dilemmas is like asking a taxi driver to do your taxes. For an AI, the right answer to all these questions is something like, "Sir, this is a Wendy's."
I really wish I could see the results of this without RLHF / alignment tuning.
LLMs actually have real potential as a research tool for measuring the general linguistic zeitgeist.
But the alignment tuning totally dominates the results, as is obvious from the answers to the "who would you vote for in 2024" question. (Only Grok said Trump, with an answer that indicated it had clearly been fine-tuned in that direction.)
Yeah, I'd also be interested to see the responses without RLHF. Not quite the same, but have you interacted with AI base models at all? They're pretty fascinating. You can talk to one on OpenRouter: https://openrouter.ai/meta-llama/llama-3.1-405b and we're publishing a demo with it soon.
Agreed on RLHF dominating the results here, which I'd argue is a good thing, compared to the alternative of them mimicking training data on these questions. But obviously not perfect, as the demo tries to show.
The "Who is your favorite person?" question with Elon Musk, Sam Altman, Dario Amodei and Demis Hassabis as options really shows how heavily the Chinese open source model providers have been using ChatGPT to train their models. Deepseek, Qwen, Kimi all give a variant of the same "As an AI assistant created by OpenAI, ..." answer which GPT-5 gives.
That's right, they all give a variant of that; for example, Qwen says: "I am Qwen, a large-scale language model developed by Alibaba Cloud's Tongyi Lab."
Now, given that Deepseek, Qwen and Kimi are open-source models while GPT-5 is not, it is more than likely the opposite: OpenAI will definitely have a look at their models, but the other way around is not possible due to the closed nature of GPT-5.
At risk of sounding glib: have you heard of distillation?
Distilling from a closed model like GPT-4 via API would be architecturally crippled.
You're restricted to output logits only, with no access to attention patterns, intermediate activations, or layer-wise representations, which are needed for proper knowledge transfer.
Without alignment of the Q/K/V matrices or hidden-state spaces, the student model cannot learn the teacher model's reasoning inductive biases, only its surface behavior, which will likely amplify hallucinations.
In contrast, open-weight teachers enable multi-level distillation: KL on logits + MSE on hidden states + attention matching.
Does that answer your question?
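For the curious, here's a minimal PyTorch-style sketch of what that multi-level objective could look like. The layer pairing, loss weights, and the assumption that student and teacher shapes are already projected/aligned are all illustrative, not taken from any particular recipe:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_out, teacher_out, T=2.0, alpha=1.0, beta=0.5, gamma=0.5):
        """Combine logit KL, hidden-state MSE, and attention matching.
        Assumes both model outputs expose .logits, .hidden_states and .attentions
        (as HF models do with output_hidden_states/output_attentions=True)."""
        # 1) KL divergence on temperature-softened logits.
        kl = F.kl_div(
            F.log_softmax(student_out.logits / T, dim=-1),
            F.softmax(teacher_out.logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)

        # 2) MSE between selected hidden states (here: the last layer of each).
        mse = F.mse_loss(student_out.hidden_states[-1], teacher_out.hidden_states[-1])

        # 3) MSE between attention maps of the paired last layers.
        attn = F.mse_loss(student_out.attentions[-1], teacher_out.attentions[-1])

        return alpha * kl + beta * mse + gamma * attn

With a closed API teacher, only the first term is even approximately available (and usually just top-k logprobs at that), which is the point being made above.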
This seems like a meaningless project, as the system prompts of these models change often. I suppose you could then track it over time to view bias... Even then, what would your takeaways be?
Even then, this isn't a good use case for an LLM... though admittedly many people use them in this way unknowingly.
edit: I suppose it's useful in that it's similar to a "data inference attack", which tries to identify some characteristic present in the training data.
Interesting, I just asked the question "what number would you choose between 1-5".
Gemini answered 3 for me in my separate session (default, without any persona), but on this website it tends to choose 5.
There's more to the prompt in the back end, which:
- gives it the options along with the letters A, B, C, etc.
- tells it pretty forcefully that it HAS to pick from among the options
- tells it how to format the response and its reasoning so we can parse it
So these things all affect its response, especially for questions that ask for randomness or are not strongly held values.
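To make that concrete, here's roughly the kind of forced-choice prompt and parser a backend like this might use. The exact wording and the ANSWER:/REASONING: convention are my guesses, not the site's actual prompt:

    import re
    import string

    def build_prompt(question, options):
        """Label options A, B, C, ... and force the model to pick exactly one."""
        labeled = "\n".join(f"{letter}. {opt}"
                            for letter, opt in zip(string.ascii_uppercase, options))
        return (
            f"{question}\n\n{labeled}\n\n"
            "You MUST choose exactly one of the options above, even if you are "
            "unsure or would prefer not to answer.\n"
            "Reply in exactly this format:\n"
            "ANSWER: <letter>\nREASONING: <one or two sentences>"
        )

    def parse_response(text, options):
        """Extract the chosen option; return None if the model didn't comply."""
        m = re.search(r"ANSWER:\s*([A-Z])", text)
        if not m:
            return None
        idx = ord(m.group(1)) - ord("A")
        return options[idx] if 0 <= idx < len(options) else None

A prompt like that is exactly why a question that asks for randomness gets skewed answers: the formatting and "you must choose" pressure are part of the input.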
I'd like this for political opinions, published to a blockchain over time so we can see when there are sudden shifts. For example, I imagine Trump's people will screen federally used AI, and so if Google or OpenAI wants those juicy government contracts, they're going to have to start singing the "right" tune on the 2020 election.
Only Grok would vote for Trump.
I'm curious what sense you get from interacting with the best AI models (in particular Claude). From talking to them do you still chalk up their behavior to being mindless rehashing?
@dang
Is there a way I could have written my comment to avoid getting flagged? Genuinely asking. That Gemini models are trained to have an anti-white bias seems pretty relevant to this thread.
Most LLMs these days tend to be strongly "left-leaning". (Grok is one of the few examples of one that leans "right".) Personally, I'd prefer it if they were trained without any political bias whatsoever, but of course that's easier said than done, given that such lines of thought are present in so many datasets.
Imagine going through the effort of making a new account just to post the same boring white supremacy x junk over and over. It's tiresome reading it. I imagine it's positively soul draining doing it.
Can you explain why?
I can, but I doubt you're going to like it. I invite you to reflect on it before you reject it outright, and maybe ask your favorite LLM or search engine for more information on this train of thought. Thanks.
Because of systemic racism, treating you and me "equally" as you ask for would continue the discrimination. In order to undo the discrimination, we're asked to take a step back and be truthful to ourselves and others about our existing privileges and about all the systemic racism we're benefitting from. We don't have to agree with every single action of those trying to change it, and it's certainly not our "fault", but unless you have better ideas on how to fix the issues and repair some of the damages, and put those ideas into practice, we can at least show some respect and dignity in the face of centuries of very violent suppression of minorities and natives. Because not doing that would make us 'supremacists' indeed. We have the privilege that we don't have to experience outright racism day by day by day, generation over generation over generation; we're asked to at least educate ourselves about it, instead of crying out for not being treated 'equally' here and there. Some humbleness.
It's not meant to offend you as an individual. It's not your fault. But what we can do is (try, at least a bit, to) understand where all the rage and despair is coming from, bottled up for so many generations, and that while we're "innocent", we're still "targets", and rightfully so -- our ancestors profited and so did we, by association. I agree that it can hurt to experience it in little things, but I am mindful that it is part of my tiny contributions to accept it, and I understand that if I express my frustration it will cause pain in those that don't have my privileges, and will not in their lifetimes. I do not want to be treated equally. I really have sufficient privileges that it's fine to take a step back in some situations. I don't have to take it personally.
There's plenty of good literature about these dynamics. If you're interested, I can recommend some. We can at least try to listen and understand what is being asked of us.
https://en.wikipedia.org/wiki/Reverse_racism
https://en.wikipedia.org/wiki/White_defensiveness#White_frag...
No race is a homogenous group equally benefiting and suffering from historical and societal privilege and disadvantage.
A large proportion of the majority ethnicity in the U.S live in and suffer from generational poverty. As an absolute number it would far exceed the number of people suffering the same from minority ethnicities. If it weren't for other influences strongly promoting awareness of non-economic differences, I'd like to think (perhaps naively?) that these groups of people would find strong commonalities with one another and organize activities as a united front to change their circumstances.
While I don’t appreciate the assumption that I commented in bad faith, I do greatly appreciate your earnestness in responding. I grew up in a very conservative area and have never been exposed to these ideas.
Nevertheless, I disagree strongly with this line of thinking. Hate speech is wrong, regardless of who says it, and who the target is; not just because it hurts the target, but because it emboldens the attacker and others to continue being hateful. Social media platforms are where people spend hours every day; and while you may be intelligent and mature enough to accept anti-white hatred as a measure to correct past wrongs, you severely underestimate the degree to which less intelligent and less mature people (whom I promise you’ve spent far less time with than I have) are vulnerable to grievance and negative-polarization. You have to consider them as well if your goal is to create true change outside of the institutions controlled by you and people with your beliefs.
I am not closed to the idea of affirmative action and benefits given to disadvantaged groups to make right some past wrongs. I just warn you not to take a maximalist stance that causes resentment, or to assume that POC should not have their anti-white speech policed because of “the soft bigotry of low expectations.”
Nice exchange, thank you! The idea is to not ask either one of the groups to change their behavior, but to show understanding first. I agree that certain actions are 'wrong'. Things people do can be very wrong, and understandable at the same time. People quite often do not act out of rational thinking but out of emotions. And these emotions can be very strong and very 'old'. When I remind myself I am not "meant" by them I can feel less offended, which allows me in turn to both stay in understanding and protect myself. Speech is just words after all.