Ask HN: Why is it taking so long to build computer controlling agents?
I'm not a PhD, but I assume training computer-controlling agents is a fairly straightforward problem: we can define clear tasks (e.g. schedule an appointment with details xyz, or buy product xyz) on real or generated websites and just let the models figure out where to click (through a VLM) and learn through RL - roughly the loop sketched below.
What am I missing? Why isn't this a solved problem by now?
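For concreteness, here is a minimal sketch of the training setup the question assumes. Every name in it (WebTaskEnv, vlm_policy) is a hypothetical stand-in rather than a real library; the point is only the shape of the loop.

    # Toy sketch of the loop described above; nothing here is a real library.
    import random

    class WebTaskEnv:
        """Stand-in for a (real or generated) website with a defined task."""
        def reset(self):
            self.steps = 0
            return "screenshot_bytes"        # observation: a rendered page

        def step(self, action):
            self.steps += 1
            x, y = action                    # click location proposed by the agent
            done = self.steps >= 10          # fixed step budget per episode
            reward = 1.0 if done and random.random() < 0.1 else 0.0  # did the task succeed?
            return "screenshot_bytes", reward, done

    def vlm_policy(observation):
        """Stand-in for a VLM mapping a screenshot to a click location."""
        return (random.randint(0, 1279), random.randint(0, 719))

    env = WebTaskEnv()
    for episode in range(3):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(vlm_policy(obs))
            total += reward
        # In the RL setting described above, `total` would drive an update
        # to the VLM's weights (e.g. via a policy-gradient method).
        print(f"episode {episode}: return {total}")

Most of the replies below are about what this stub hides: the cost of running a large model inside that inner loop, and websites that actively resist being driven this way.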
All the major operating system vendors are dedicating their full efforts to this, so it doesn't make much sense to raise money and build it independently.
There are technical limitations, sure (getting an AI to parse a screen and interact with it via mouse and keyboard is harder than it sounds, and it sounds hard to start with), but the main limitation is still economic. Does it really make sense to train a multi-billion-parameter AI to click buttons if you could instead just make an API call?
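For contrast, the "just make an API call" path is a handful of lines when an API exists; the endpoint, payload, and token here are made up purely for illustration:

    # The cheap path: a direct API call instead of driving a rendered UI.
    # Endpoint, payload, and token are hypothetical.
    import requests

    resp = requests.post(
        "https://api.example.com/v1/appointments",
        json={"service": "dentist", "date": "2025-06-01", "time": "09:30"},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())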
There's an intersection between "high accuracy" and "low cost" that AI has not quite reached yet for this sort of task, when compared to simpler and cheaper alternatives.
People are using huge capable LLMs to answer things like "what's five percent of 250"; I don't see a big leap in using them to skip APIs.
On the other side, a lot of user-facing access methods can do more than their API equivalents; people already use things like AutoHotkey to work around such limitations. If people are already working around things that way, that must indicate the presence of some sort of market.
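A Python analogue of that AutoHotkey-style workaround might look like this (a sketch: it assumes the third-party pyautogui package, and "submit_button.png" is a placeholder screenshot of the control to find):

    # Drive the UI exactly as a user would, AutoHotkey-style, via pyautogui.
    # pip install pyautogui; "submit_button.png" is a placeholder image.
    import pyautogui

    pyautogui.PAUSE = 0.5                              # brief delay between actions

    try:
        location = pyautogui.locateCenterOnScreen("submit_button.png")
    except pyautogui.ImageNotFoundException:           # newer versions raise instead of returning None
        location = None
    if location is None:
        raise SystemExit("button not found on screen")

    pyautogui.click(location.x, location.y)            # click the matched control
    pyautogui.typewrite("hello from the workaround", interval=0.05)
    pyautogui.press("enter")

Brittle (it breaks the moment the UI changes), but it reaches functionality no API exposes, which is the point.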
Because the websites want to serve ads to humans, upsell you, and get you to sign up for their credit card too, their implementations are highly obfuscated and dynamic.
If they wanted to be easy to work with, they'd offer a simple API, or plain HTML form interface.
I can't tell if this is serious or tongue in cheek and I find that both funny and deeply discouraging about the state of the world. For some reason it's giving me Rick and Morty butter robot vibes.
Thanks for the answers.
Even the unexpected patterns like pop-ups etc. feel pretty structured to me - I would expect models to generalize and navigate any of them.
I could see more websites blocking agents in the future, but it seems like we're so early that this is not a limiting factor yet.
I have experience with a tiny part of this problem: accessing the various websites and figuring out where to click.
Presently, doing this requires a fair bit of continuous work.
Many websites don't want bots on them and are actively using countermeasures that would block operators in the same way they block scrapers. There is a ton of stuff a website can do to break those bots and they do it. Some even feed back "phantom" data to make the process less reliable.
There are a lot of businesses out there where the business model breaks if someone else can see the whole board.
https://youtube.com/watch?v=shnW3VerkiM
https://youtube.com/watch?v=VQhS6Uh4-sI
First one is more impressive looking. Second one more reliable.
I think the real hard part is nobody wants to maintain these, and nobody really wants to pay to use them either. It's a lot of work and not something people do for free. It's no surprise these emerged (and won) in hackathons.
Whilst they are a massive step forward, we still have a long way to go for that.
Why not try it yourself with Ollama, a large model, and some rented hardware? You will get something, but it will not be consistent.
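The crude version of "try it yourself" is a loop like the one below. It assumes Ollama running locally with a vision-capable model pulled (llava here), plus the ollama and pyautogui Python packages; the model choice and prompt format are my assumptions, and, as the comment says, expect inconsistent answers.

    # Screenshot -> local vision model -> click. Assumes `ollama pull llava`,
    # a running Ollama server, and `pip install ollama pyautogui`.
    import re
    import ollama
    import pyautogui

    TASK = "Find the search box and tell me where to click."

    pyautogui.screenshot().save("screen.png")           # capture the current screen

    reply = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": f"{TASK} Answer only with pixel coordinates as x,y.",
            "images": ["screen.png"],
        }],
    )["message"]["content"]

    match = re.search(r"(\d+)\s*,\s*(\d+)", reply)       # the model often won't comply
    if match:
        pyautogui.click(int(match.group(1)), int(match.group(2)))
    else:
        print("no usable coordinates in reply:", reply)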