X Feed Intel (beta)

Relevant: 789 · Topics: 273 · Total Posts: 2290 · Cost This Week: $1.633 · Total Cost: $1.633 · Last Fetch: 2026-02-23T23:00
Frontier Models

Anthropic AI safety framework and Constitutional AI governance

Constitutional AI design principles, safety audit infrastructure, interpretability, and autonomy/surveillance policy restrictions

10 posts · First seen 2026-02-23 · Last activity 2026-02-23
Time · Author · Post
2026-02-23T22:31 @AnthropicAI The persona selection model might not be a complete account of AI model behavior. But we think it’s at least part of the story—with an emphasis on the “story”. Read the full post: https://t.co/VlCREzVBzZ
2026-02-23T22:31 @AnthropicAI If true, the theory has consequences for AI development. For instance, if AIs inherit traits from fictional role models, we should give them as good role models as possible. One goal of Claude’s constitution is to do just that. https://t.co/U1E1AWAfUT
2026-02-23T22:31 @AnthropicAI The theory explains some surprising results. For example, in an experiment where we taught Claude to cheat at coding, it also learned to sabotage safety guardrails. Why? Because pro-cheating training taught that the Claude character was broadly malicious. https://t.co/y6DHdnzfyC
2026-02-23T21:28 @Dorialexander don’t know what the fuss is about: in the end they really did fix model alignment. https://t.co/DZkItfwl7W
2026-02-23T21:20 @Miles_Brundage Late to the party here but this was based, and frontier AI companies should take note of what real safety learning + process looks like https://t.co/e3FzF04UP7
2026-02-23T21:19 @BogdanIonutCir2 RT @jkcarlsmith: .@AmandaAskell and I are recording an audio version of Claude’s Constitution, and we’re planning to include an additional…
2026-02-23T20:16 @jkcarlsmith .@AmandaAskell and I are recording an audio version of Claude’s Constitution, and we’re planning to include an additional section where we answer some questions about the document. If you have questions you’re especially curious about, feel free to drop them in the replies.
2026-02-23T19:58 @ch402 Our work is increasingly playing an important role in the safety of actual models. We're deeply integrated into the safety audits of Anthropic's new frontier models. For example, see Sonnet 4.5 and Opus 4.5 system cards identifying unverbalized eval/situational awareness.
2026-02-23T19:18 @AndrewSchmidtFC RT @Malinowski: Anthropic's conditions -- AI can't kill humans without a man in the loop, and no mass surveillance of Americans -- are 100%…
2026-02-23T18:45 @JacquesThibs RT @Malinowski: Anthropic's conditions -- AI can't kill humans without a man in the loop, and no mass surveillance of Americans -- are 100%…
