Improving Composer through real-time RL

(cursor.com)

73 points | by ingve 1 day ago

16 comments

  • hmartin 2 hours ago
    Step 1: take an open source model with zero acknowledgement.

    Step 2: build on someone else's infrastructure innovations with zero acknowledgement.

    Step 3: Write a blog post with "unprecedented" and "100x" and "trillions" in the first paragraph.

    Seriously, this seems like cool work and I enjoyed the post. But my basic trust in them has completely tanked.

    • refulgentis 43 minutes ago
      I’m not familiar with 2 of these 3 stories, so I’m mostly just impressed by the RL turnaround (4 hours!?) and the knowledge-sharing on using fine-tuned models at scale.

      For the gossipy part, I love Kimi, but find it hard to get worked up about them not labelling their model Kimi when Kimi was the base. Especially because Kimi…has had…some issues…being able to distinguish itself from Claude…

  • pillsburycat 1 hour ago
    Important disclaimer for anyone using Cursor: make sure to disable "data sharing" in your account settings, as it is enabled by default and old accounts are automatically opted into it.
    • natpalmer1776 1 hour ago
      Do you have evidence for those claims? I don’t mean to be contrary or subversive, I’d just be interested in seeing how this is actually taking place.
      • pillsburycat 34 minutes ago
        To be very precise about session recording: you can inspect the Cursor binary and see that it comes bundled with rrweb and that full telemetry infrastructure is in place: mouse movements, clicks, scroll positions, etc., on top of the codebase and prompts being sent over the wire.

        However, I have edited my other claims for now and you can consider them provisionally retracted. My original advice about turning off data sharing stands. You are right to ask for more evidence given the severity of the claims. I think this merits a deeper dive, and a throwaway Hacker News comment might not be the best channel for it. Stay tuned ;)

  • CitrusFruits 5 hours ago
    I've been wondering how they've been able to be so generous with Composer usage with it still making business sense. Seems like this is the answer: presumably they think they'll have a competitive advantage in not just the UX space but the model space as well soon. It's a great strategy, but I do wonder if the moat will be big enough with how fast things are moving and how competitive the model landscape is.
    • ketzo 4 hours ago
      After seeing the last few releases for GPT and Claude, I’m not sure how anyone (else) is gonna build a durable advantage on proprietary model quality.

      The capabilities of the top labs’ models have improved so much in just the last few releases, and I definitely foresee a world where they gate those models away behind 1st-party harnesses/tooling.

      • hypercube33 2 hours ago
        Across my 4 different GPT subscriptions (personal, personal Cursor, GitHub Copilot, and Cursor), all GPT-5 models are junk compared to v4: they constantly ignore prompts and skills, and can't write C# or PowerShell properly on the first go, taking up to 5 tries. Qwen3 hands down beats them on a Ryzen 5800 and a 6700 XT GPU: even though it's slow, it got the code right on the first try.

        I feel like the v5.0 preview did OK, but it's slid all the way down the hill to GPT-2 or GPT-3 levels for me.

  • g3dar 31 minutes ago
    The interesting tension here is between fine-tuning the model vs fine-tuning the environment around it.

    Cursor is betting on RL to make the model better at coding tasks. But there's a parallel approach: keep the foundation model general and build better tooling around it. Claude Code and Gemini CLI already work well as coding agents — the bottleneck in my workflow isn't model quality, it's managing the interaction.

    When you're running multiple agents on different parts of a codebase, the hard problems become: knowing which agent needs your attention, preserving context across sessions, and not losing your train of thought while switching between them. Those are UX problems, not model problems.

    I think both approaches will coexist. Cursor's RL makes the in-editor experience better. But terminal-based agents with good orchestration tooling are catching up fast, especially for the kind of work where you want to run 3-4 tasks in parallel and not babysit each one.

  • crazylogger 2 hours ago
    This feels so wrong. The LLM should play the role of a very general (but empty and un-opinionated) brain: you don’t want to perform a coding-specific lobotomy on someone every day. The proper target of their RL should have been their harness, which determines the agent's trajectory as much as the base model does.

    I also wonder: since they’re doing constant RL on model weights against today's Cursor design, does that mean they can never change their system prompt and other parts of the harness?

    1) Comparisons against past trajectory data would be meaningless if the trajectories were collected under different instructions.

    2) Performance will be terrible the next time they change their tool design, since the model is now "opinionated" based on how a previous version of Cursor was designed.

    Anthropic is more sensible with their “constitution” approach to safety. The behaviors (and ultimately the values) you want your model to follow should be a document, not a lobotomy.

  • kgeist 4 hours ago
    >We used a Kimi base, with midtraining and RL on top. Going forward, we'll include the base used in our blog posts, that was a miss. Also, the license is through Fireworks. [0]

    And still no mention of Kimi in a new blog post :)

    Also, apparently the inference provider they use, Fireworks AI, already has a built-in API for RL-tuning Kimi [1], so I wonder which parts are Cursor's own effort and where Fireworks AI actually deserves credit, especially since they repeatedly brag about being able to create a new checkpoint every 5 hours, which would be largely thanks to Fireworks AI's API and training infrastructure.

    I mean, I'm genuinely curious how much effort it would actually take me to produce my own finetune, going from "here, lots of user data" to "the model gains +1% on benchmarks", assuming I already use a good existing foundation model, my inference provider already handles all the tuning infrastructure and logic, and I already have a lot of usage logs.

    [0] https://news.ycombinator.com/item?id=47459529

    [1] https://fireworks.ai/blog/kimi-k2p5

    • fzysingularity 4 hours ago
      What do you think actually happened here in the past week?

      They used Kimi and failed to acknowledge it in the original Composer announcement. The Kimi team probably reached out and asked WTF. Their only recourse was to publish their whitepaper with Kimi mentioned, winning brownie points for being open about their training pipeline while placating the Kimi team.

  • vicchenai 1 hour ago
    the rl loop here is clever but i wonder how the reward signal degrades over time. if you're optimizing for user acceptance of suggestions, you're inevitably training on a mix of "this was actually correct" and "i accepted because editing the suggestion was more work than accepting it." that second case creates a subtle bias toward suggestions that are close-enough-to-not-bother-fixing rather than actually correct.

    also curious whether they see different convergence patterns across languages. my gut says something like python, where there's more stylistic variation, would be harder to get a clean reward signal from than something like rust, where there are fewer idiomatic ways to do things.
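
    That "close enough to not bother fixing" bias could in principle be measured by discounting acceptances the user later edits heavily. A toy sketch of the idea (hypothetical reward shaping, not Cursor's actual reward; `shaped_reward` and its weighting are invented here):

```python
import difflib

def shaped_reward(suggestion: str, final_code: str, accepted: bool) -> float:
    """Toy reward: discount acceptances that the user later edited heavily.

    A raw "accepted" bit conflates "correct" with "close enough to not
    bother", so instead weight acceptance by how much of the suggestion
    survived in the code the user actually kept.
    """
    if not accepted:
        return 0.0
    # Similarity in [0, 1]: 1.0 means the suggestion survived untouched.
    survival = difflib.SequenceMatcher(None, suggestion, final_code).ratio()
    return survival

# An acceptance that was immediately rewritten earns less reward than
# one the user kept verbatim.
kept = shaped_reward("def add(a, b):\n    return a + b",
                     "def add(a, b):\n    return a + b", True)
rewritten = shaped_reward("def add(a, b):\n    return a + b",
                          "def add(a: int, b: int) -> int:\n    return a + b",
                          True)
```

    This still can't distinguish "correct" from "plausible but wrong", but it at least separates the two acceptance cases the comment describes.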

  • htrp 2 hours ago
    If the model "improves" every 5 hours, how do you have any guarantee of model consistency across long coding sessions?
  • janalsncm 3 hours ago
    Back in my day we called this real-time training on implicit user feedback.

    The engineering challenge here is an order of magnitude bigger though. An LLM is orders of magnitude bigger than a recommender system model. Kudos.

  • fzysingularity 4 hours ago
    Real-time or continuous learning is great on paper, but getting it to work without catastrophic forgetting or extremely expensive regression testing is a real challenge.

    Credit to the team for taking this on, but I’d be skeptical of announcements like this without at least 3–6 months of proven production deployments. Definitely curious how this plays out.

    • ghywertelling 1 hour ago
      Can this also be used as an attack vector? A small seed percentage of users constantly choosing a particular poisoned PyPI library for a niche task, which gets RL'ed into the model's suggestions and recommendations.
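
      A back-of-the-envelope illustration of why the niche-task part matters (all numbers invented): globally a colluding cohort is noise, but on a low-traffic task it can dominate the acceptance signal an RL loop would see:

```python
def acceptance_rate(organic_accepts: int, organic_total: int,
                    poison_accepts: int, poison_total: int) -> float:
    """Blended acceptance rate over organic and attacker-controlled events."""
    return (organic_accepts + poison_accepts) / (organic_total + poison_total)

# Globally, 50 attackers among 100,000 users barely move the signal...
global_rate = acceptance_rate(organic_accepts=30_000, organic_total=100_000,
                              poison_accepts=50, poison_total=50)

# ...but on a niche task seen by only 20 organic users, the same 50
# attackers dominate the reward the poisoned suggestion earns.
niche_rate = acceptance_rate(organic_accepts=2, organic_total=20,
                             poison_accepts=50, poison_total=50)
```

      The usual mitigations (per-user reward caps, traffic-weighted sampling, anomaly detection on acceptance spikes) all target exactly this imbalance.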
  • amazingamazing 3 hours ago
    Seems expensive. Distillation is inherently impossible to defend against: sit back and let your competitors do the hard work. They'll whine and say it's illegal, but they shouldn't complain; they'll reap what they sowed.
  • polishdude20 5 hours ago
    I'd love to see some data on how much it has improved via this process in the last week
    • heliumtera 4 hours ago
      It would be the same as Kimi K2.5, the underlying model