LLMs learn what programmers create, not how programmers work

I ran an experiment to see whether a CLI really is the most intuitive format for tool calling (as claimed by an ex-Manus AI backend engineer). I gave my model random scenarios and a single tool, "run", and told it that the tool worked like a CLI. Then I told it to guess commands.

It guessed great commands, but it always formatted them with a colon up front, like :help, :browser, :search, :curl.

It was trained on how terminals look, not on what you actually type (you don't type the ":").

I have since updated my code in my agent tool to stop fighting against this intuition.
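A minimal sketch of what "expecting the colon" could look like on the tool side, assuming the "run" tool receives the model's command as one raw string. The function name `parse_run_command` is hypothetical; the actual agent code is not shown in this post.

```python
import shlex


def parse_run_command(raw: str) -> list[str]:
    """Split a model-issued command into argv, tolerating one leading colon.

    Hypothetical helper: it treats ":search foo" and "search foo" the same,
    instead of rejecting the colon form as malformed.
    """
    text = raw.strip()
    # Drop a single leading colon if the model added one.
    if text.startswith(":"):
        text = text[1:].lstrip()
    if not text:
        raise ValueError("empty command")
    # shlex.split handles quoted arguments the way a POSIX shell would.
    return shlex.split(text)
```

With this in place, the dispatcher can look up `argv[0]` without caring which of the two forms the model produced.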

LLMs learn what commands look like in documentation/artifacts, not what the human actually typed on the keyboard.

Seems so obvious. This is why you have to test your LLM and see how it naturally works, so you don't have to fight it with your system prompt.
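One rough way to see how a model "naturally" formats commands is to collect its raw guesses and tally how each one starts. The sketch below assumes you already have the sampled outputs as strings; `tally_prefixes` and the category names are made up for illustration, and the sample list is placeholder input, not data from the experiment.

```python
from collections import Counter


def tally_prefixes(outputs: list[str]) -> Counter:
    """Count the leading style of each sampled command string."""
    counts: Counter = Counter()
    for raw in outputs:
        text = raw.strip()
        if not text:
            counts["empty"] += 1
        elif text.startswith(":"):
            counts["colon"] += 1      # e.g. ":help"
        elif text.startswith("`"):
            counts["backtick"] += 1   # markdown-style ticks in front
        else:
            counts["bare"] += 1       # what a human would type
    return counts


# Placeholder samples (assumed), only to show the call shape.
samples = [":help", ":search cats", "`ls`", "curl example.com"]
print(tally_prefixes(samples))
```

If one style dominates across many scenarios, it is probably cheaper to accept it in the tool than to prompt against it.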

This is Kimi K2.5, btw.

32 points | by noemit 19 hours ago

9 comments

  • shomp 15 hours ago
    Great observation. The brain of a programmer is still a "black box" to the feed-forward network of nodes. But in theory, if you pumped a lot of live-coding videos from something like YouTube into the process, you could get a bit of that "what's your approach"-erism to bleed into the model. There might not be enough material there to truly "train it to think", but it would be interesting to try to "fill the gaps" of black-box-ed-ness in the LLM with supplemental "here was the process that got us there" video feeds. The next natural move might actually be recording thousands of hours of footage of developers working with LLMs directly, in Cursor or another IDE that has LLM live pair programming. Maybe calling it "pair programming" is generous, but it might be a reasonable foray into teaching the next generation of LLMs the "thought process" behind things. In reality you'd be teaching it which files to inspect, which windows to open/close, which tools to switch to and focus on. And while it might be imperfect, it might just be enough.
  • Areena_28 8 hours ago
    We hit the same thing building internal security tooling. Our model kept formatting output like documentation, not the way we, or anyone in our place, would want to read it in a terminal at 2am during an incident.

    I am a bit curious, did you find this behavior consistent across models or is it more pronounced with certain ones?

    • stuaxo 5 hours ago
      Literate programming is about to become mainstream in the funniest way possible.
    • noemit 7 hours ago
      I ran into it while building - I should have tested different temps too - I was just trying to get CLI-style tool calls to be more reliable.
  • acters 8 hours ago
    Instead of telling the LLM that "run" works like a CLI, maybe just tell the LLM that "run" will execute sh/bash/zsh/etc. scripts?
    • noemit 8 hours ago
      I tried over 20 variations of different system prompts. Once I changed my tool to expect the colon, it also felt like it was running/calling tools faster, but I need to do a larger test to be sure.
  • seertaak 9 hours ago
    Is that really true? I would have expected that by now AI companies are doing RL on git histories, not just on the HEAD.
    • noemit 8 hours ago
      I also expected this. Please run some experiments; maybe other models are different.
    • muzani 5 hours ago
      Claude definitely does
  • mpalmer 11 hours ago
    The novice came to the master. "I have figured it out, the rules for how LLMs understand CLIs. It gives the right commands, but adds colons. It was trained on the visual shape of terminals, not keystrokes."

    "Clear the session," the master said. "Run the same prompt again."

    The novice pressed return. The model output: `ls -R /tmp`

    "The colons are gone," the novice said. "But my theory explained them perfectly."

    "You built a cage for a cloud," the master said. "Do not mistake a single roll of the dice for the rulebook."

    • noemit 7 hours ago
      I ran tests of 100 attempts with different prompt/scenario combinations. Each "attempt"/theory had 3 different system prompt wordings. Most of the prompts did not mention a colon, but it kept appearing. When I added negative instructions against using a colon, the quality went down (most of the tool calls were malformed; one common issue was markdown ticks in front). It was only when my system prompt acted like colons were normal that I kept getting 100/100 perfect expected tool calls. I ranked my system prompts by which returned the most consistent commands.
  • Art9681 12 hours ago
    Is "how programmers work" a useful and provable metric? No? Then it belongs in philosophy discussions. How you work and how I work are different. Your work may have ended up in the LLM training data and mine did not. Or vice versa.

    Can you objectively analyze how VSCode adapts to your way of working without our interference?

    Did you test your theory with the actual frontier LLMs (which Kimi K2.5 is not, BTW)?