I’ll be running panels at this event and helping out. There’s still plenty of room and we are looking for any AI/ML students in the bay who want to attend!
And it’s free!
Cmon man
This is a pretty important statement about engineering and experimentation speed.
openai.com/index/harnes...
Mossad or not-Mossad, but for model evals: they need to be difficult, but not so difficult that they get written off as not a useful measurement.
Great idea
Everybody has a hard eval until gradient descent punches you in the face.
Accountability diffuses at the deployment layer, but dependency concentrates at the model supply layer.
The dominant risk is not what the models can do, but how fast capability diffuses, how it gets wired, and whether misuse feedback loops are actioned post release.
ok takeaways:
This is a huge unmanaged attack surface: 49% tool exposure plus a bunch of residential hosts is a problem waiting to happen.
Prioritizing a release to go far in this ecosystem? Go with 8-14B at 4bit quant.
22% of hosts have custom system prompts. We pulled and classified over 3k prompts; the breakdown for the top 4 was:
1. Default Identity
2. Coding Assistants
3. Roleplay
4. Uncensored
Portable weights travel far in this network.
Probably not a huge surprise, but in this dataset 8-14B parameters is the most prevalent model size and 72% of models are 4 bit quantized.
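Napkin math on why 8-14B at 4-bit dominates: the weights alone fit on commodity GPUs. A minimal sketch (sizes are illustrative arithmetic, not figures from the dataset):

```python
def approx_weight_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone (no KV cache, no overhead)."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# 8B and 14B models at 4-bit quantization vs fp16
for n in (8, 14):
    print(f"{n}B @ 4-bit: ~{approx_weight_gb(n, 4):.1f} GB, "
          f"@ fp16: ~{approx_weight_gb(n, 16):.1f} GB")
# 8B:  ~4 GB at 4-bit vs ~16 GB at fp16
# 14B: ~7 GB at 4-bit vs ~28 GB at fp16
```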
49% of hosts enable tools.
The top 10 model families control 85% of the market; the rest sit in the long tail.
This is an exposure dataset which means we are trying to study something by measuring the shadow that it casts. We can’t poll these systems directly, but we can understand the shape of the ecosystem.
New research from @silascutler.bsky.social and myself.
We tracked 175k exposed Ollama endpoints for nearly a year. Collected and analyzed custom models, sizes, quantizations, system prompts, and more.
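For context on what an exposed host advertises: Ollama's `/api/tags` endpoint lists local models with their size and quantization details. A minimal parsing sketch; the payload below is a made-up sample shaped like the real response, not data from the study:

```python
import json

# Illustrative payload mimicking Ollama's /api/tags response shape;
# field names match the API, values are invented for the example.
sample = json.loads("""
{"models": [
  {"name": "llama3:8b",
   "details": {"parameter_size": "8B", "quantization_level": "Q4_0"}}
]}
""")

# Enumerate the host's model inventory
for m in sample["models"]:
    d = m["details"]
    print(m["name"], d["parameter_size"], d["quantization_level"])
```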
*vague posts about upcoming research*
Love getting malware under TLP:AMBER+S, when the S stands for “spite”. 🫖
We about to have some Llama Drama :)
and of course it’s chatgpt slop with the rhetorical flourish of a remedial high school debate club.
“from X to Y — or worse”
“This Isn’t X it’s Y.”
“Replace X with Y and it’s Z.”
“The most sobering part? It’s X.”
“your no longer dealing with X. You’re facing Y”
“Wow this dude has a really strong opinion about code review”
*scans posts*
“Oh that’s his only opinion”
--dangerously-skip-permissions is the only thing keeping claude code installed on my machine.
As a friend said a while back: “we are fine-tuning the models and they are coarse-tuning us in turn”
A deeper problem is that nobody has time for anything but LLM-as-a-judge evaluations (often vendor-on-vendor), creating these Ouroboros loops that are easy to overfit and hard to trust.
That’s a huge gap when we’re being asked to rely on them for SOC automation or enterprise security work.
CyberSOCEval (Meta) found models can extract real signal from malware logs & CTI reports, but they remain far from reliable.
Most importantly in this domain, reasoning models do not get their usual math/coding uplift, suggesting that general capability ≠ analyst capability... yet.
The best “agentic” benchmark we saw (ExCyTIn-Bench) still shows how far we are. Even in a curated Azure-style environment models struggled with multi-hop investigations over heterogeneous logs (data be confusing like that).
Most security evals reduce workflows into MCQs/static Q&A. That bakes in unrealistic assumptions that the “right question” is already asked, evidence is pre-packaged, wrong answers are cheap, and there’s no triage/queue pressure or escalation decisions.
Benchmarks for cybersecurity are everywhere and mostly measuring the wrong thing.
We reviewed evals from Microsoft, Meta and academia and found they don't measure what matters for defenders in real IR situations. 🧵
s1.ai/benchmk1
Reviewing AI cyber benchmarking and evaluations may break me.
Y’all will really LLM-as-a-judge anything
Timely presentation from my colleague Jim on the current landscape of Hacktivism and War.
youtu.be/sNaORI-k-fY?...
✅ #LLM literacy is table stakes for defenders, CTI analysts, and #cybersecurity professionals of all stripes now.
Still looking for a way into this complex field? 🤔
LABS has got you covered!
Start here:
s1.ai/inside-llm-1
@sentinelone.com
Great post from @philofishal.bsky.social on the initial stages of the LLM training pipeline!
www.sentinelone.com/labs/inside-...