Ask HN: Why is it taking so long to build computer controlling agents?
I'm not a PhD, but I assume training computer-controlling agents is a fairly straightforward problem: we can define clear tasks (e.g. schedule an appointment with details xyz, or buy product xyz) on real or generated websites and just let the models figure out where to click (through a VLM) and learn through RL - roughly the loop sketched below.
What am I missing? Why isn't this a solved problem by now?
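For concreteness, here is a minimal sketch of the training setup the question assumes. Every name in it (WebTaskEnv, vlm_policy) is a hypothetical stand-in rather than a real library; the point is only the shape of the loop.

    # Toy sketch of the loop described above; nothing here is a real library.
    import random

    class WebTaskEnv:
        """Stand-in for a (real or generated) website with a defined task."""
        def reset(self):
            self.steps = 0
            return "screenshot_bytes"        # observation: a rendered page

        def step(self, action):
            self.steps += 1
            x, y = action                    # click location proposed by the agent
            done = self.steps >= 10          # fixed step budget per episode
            reward = 1.0 if done and random.random() < 0.1 else 0.0  # did the task succeed?
            return "screenshot_bytes", reward, done

    def vlm_policy(observation):
        """Stand-in for a VLM mapping a screenshot to a click location."""
        return (random.randint(0, 1279), random.randint(0, 719))

    env = WebTaskEnv()
    for episode in range(3):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(vlm_policy(obs))
            total += reward
        # In the RL setting described above, `total` would drive an update
        # to the VLM's weights (e.g. via a policy-gradient method).
        print(f"episode {episode}: return {total}")

Most of the replies below are about what this stub hides: the cost of running a large model inside that inner loop, and websites that actively resist being driven this way.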
All the major operating system vendors are dedicating their full efforts to this, so it doesn't make much sense to raise money and build it independently.
There are technical limitations, sure (getting an AI to parse a screen and interact with it via mouse and keyboard is harder than it sounds, and it sounds hard to start with), but the main limitation is still economic. Does it really make sense to train a multi-billion-parameter AI to click buttons if you could instead just make an API call?
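For contrast, the "just make an API call" path is a handful of lines when an API exists; the endpoint, payload, and token here are made up purely for illustration:

    # The cheap path: a direct API call instead of driving a rendered UI.
    # Endpoint, payload, and token are hypothetical.
    import requests

    resp = requests.post(
        "https://api.example.com/v1/appointments",
        json={"service": "dentist", "date": "2025-06-01", "time": "09:30"},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())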
There's an intersection between "high accuracy" and "low cost" that AI has not quite reached yet for this sort of task, when compared to simpler and cheaper alternatives.
People are using huge capable LLMs to answer things like "what's five percent of 250"; I don't see a big leap in using them to skip APIs.
On the other side, a lot of user-facing access methods can do more than their API equivalents; people already use things like AutoHotkey to work around such limitations. If people are already working around things that way, that must indicate the presence of some sort of market.
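A Python analogue of that AutoHotkey-style workaround might look like this (a sketch: it assumes the third-party pyautogui package, and "submit_button.png" is a placeholder screenshot of the control to find):

    # Drive the UI exactly as a user would, AutoHotkey-style, via pyautogui.
    # pip install pyautogui; "submit_button.png" is a placeholder image.
    import pyautogui

    pyautogui.PAUSE = 0.5                              # brief delay between actions

    try:
        location = pyautogui.locateCenterOnScreen("submit_button.png")
    except pyautogui.ImageNotFoundException:           # newer versions raise instead of returning None
        location = None
    if location is None:
        raise SystemExit("button not found on screen")

    pyautogui.click(location.x, location.y)            # click the matched control
    pyautogui.typewrite("hello from the workaround", interval=0.05)
    pyautogui.press("enter")

Brittle (it breaks the moment the UI changes), but it reaches functionality no API exposes, which is the point.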
Because the websites want to serve ads to humans, upsell you, and get you to sign up for their credit card too, their implementations are highly obfuscated and dynamic.
If they wanted to be easy to work with, they'd offer a simple API, or plain HTML form interface.
I can't tell if this is serious or tongue in cheek and I find that both funny and deeply discouraging about the state of the world. For some reason it's giving me Rick and Morty butter robot vibes.
Thanks for the answers.
Even the unexpected patterns like pop-ups etc. feel pretty structured to me - I would expect models to generalize and navigate any of them.
I could see more websites blocking agents in the future, but it seems like we're so early that this is not a limiting factor yet.
I have experience with a tiny part of this problem: accessing the various websites and figuring out where to click.
Presently, doing this requires a fair bit of continuous work.
Many websites don't want bots on them and are actively using countermeasures that would block operators in the same way they block scrapers. There is a ton of stuff a website can do to break those bots and they do it. Some even feed back "phantom" data to make the process less reliable.
There are a lot of businesses out there where the business model breaks if someone else can see the whole board.
https://youtube.com/watch?v=shnW3VerkiM
https://youtube.com/watch?v=VQhS6Uh4-sI
First one is more impressive looking. Second one more reliable.
I think the real hard part is nobody wants to maintain these, and nobody really wants to pay to use them either. It's a lot of work and not something people do for free. It's no surprise these emerged (and won) in hackathons.
Whilst they are a massive step forward, we still have a long way to go for that.
Why not try it yourself with Ollama, a large model, and some rented hardware? You will get something, but it will not be consistent.
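The crude version of "try it yourself" is a loop like the one below. It assumes Ollama running locally with a vision-capable model pulled (llava here), plus the ollama and pyautogui Python packages; the model choice and prompt format are my assumptions, and, as the comment says, expect inconsistent answers.

    # Screenshot -> local vision model -> click. Assumes `ollama pull llava`,
    # a running Ollama server, and `pip install ollama pyautogui`.
    import re
    import ollama
    import pyautogui

    TASK = "Find the search box and tell me where to click."

    pyautogui.screenshot().save("screen.png")           # capture the current screen

    reply = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": f"{TASK} Answer only with pixel coordinates as x,y.",
            "images": ["screen.png"],
        }],
    )["message"]["content"]

    match = re.search(r"(\d+)\s*,\s*(\d+)", reply)       # the model often won't comply
    if match:
        pyautogui.click(int(match.group(1)), int(match.group(2)))
    else:
        print("no usable coordinates in reply:", reply)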