Integration with existing computer agent systems and further development #12

James4Ever0 · 2024-03-14T13:25:28Z

This project aims to be the first to tackle the long-standing problem of general computer control (GCC). It is a fascinating yet challenging goal. I believe it is only a matter of time before such a self-evolving agent rewrites its own code and becomes truly conscious, an artificial life form living inside computer hardware.

However, many other projects in the field are doing similar things, such as Tencent's AppAgent, OthersideAI's self-operating-computer, THUDM's CogAgent, Microsoft's UFO, Cognition's Devin, CheatLayer, and Cybergod, which I created myself.

For embodied intelligence, there are Mobile ALOHA, the Open X-Embodiment dataset, and Google's Robotics Transformer models (RT-1, RT-2, and RT-X).

For "narrow" gaming agents, there are Ghost in the Minecraft, MineDojo's Voyager, PokemonRedRL, Tencent Solo (trained in Honor of Kings Arena), and many more.

Meanwhile, there is a wide range of existing benchmarks and datasets, such as Android in the Wild from Google, MiniWoB++ from Farama, and Mind2Web from OSU-NLP.

For unsupervised video to action space training, there is Google's Genie.

Humans play games on computers, write text on computers, and are bound to computers in so many ways that we forget what makes us human. Is it about winning, fun, or something else? If such an almighty computer-controlling agent existed, maybe this situation would change.

What I want to propose is that one can simply create a dataset by randomly typing and clicking in computer GUIs and terminal interfaces, then train an agent on this dataset to establish a world model of computer environments.
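As a rough illustration of the data collection idea, here is a minimal Python sketch. It is not code from Cradle or Cybergod. The `Action` record, `random_action`, `collect_episode`, and the `observe`/`act` callbacks are all hypothetical names; a real setup would plug in screen capture and an input-injection backend where the toy counter environment is used here.

```python
import json
import random
import string
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class Action:
    """One random input event (hypothetical schema)."""
    kind: str                   # "click" or "type"
    x: Optional[int] = None     # click position, in pixels
    y: Optional[int] = None
    text: Optional[str] = None  # characters to type


def random_action(rng: random.Random, width: int, height: int,
                  max_chars: int = 8) -> Action:
    """Pick a click or a short keystroke burst uniformly at random."""
    if rng.random() < 0.5:
        return Action(kind="click", x=rng.randrange(width), y=rng.randrange(height))
    alphabet = string.ascii_letters + string.digits + " "
    length = rng.randint(1, max_chars)
    return Action(kind="type", text="".join(rng.choice(alphabet) for _ in range(length)))


def collect_episode(rng: random.Random, steps: int,
                    observe: Callable[[], Any],
                    act: Callable[[Action], None],
                    width: int = 1920, height: int = 1080) -> list:
    """Record (observation, action, next observation) transitions."""
    records = []
    for _ in range(steps):
        before = observe()
        action = random_action(rng, width, height)
        act(action)
        records.append({"before": before, "action": asdict(action), "after": observe()})
    return records


if __name__ == "__main__":
    # Toy stand-in environment: the "screen" is just a step counter.
    state = {"n": 0}

    def toy_act(action: Action) -> None:
        state["n"] += 1

    episode = collect_episode(random.Random(0), 3, lambda: state["n"], toy_act)
    print(json.dumps(episode, indent=2))
```

The transitions are stored as plain dictionaries so they can be dumped to JSON and used later as world-model training pairs.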

This agent is then bridged to an external banking system powered by distributed blockchains and smart contracts. The agent can only survive by accomplishing microtasks in exchange for running time and AI credits. Furthermore, if the agent can earn real-world currency, it can receive AI credits for handing it over.
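To make the survival rule concrete, here is a small sketch of the accounting only. It assumes an in-memory ledger as a stand-in for the blockchain and smart-contract layer, which is not modeled at all; `CreditLedger`, `charge_step`, and `reward` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """In-memory stand-in for the external credit system (assumption)."""
    balance: int
    cost_per_step: int = 1
    history: list = field(default_factory=list)

    @property
    def alive(self) -> bool:
        # The agent keeps running only while it has credits left.
        return self.balance > 0

    def charge_step(self) -> bool:
        """Deduct the running cost of one step; return whether the agent survives."""
        if not self.alive:
            return False
        self.balance -= self.cost_per_step
        self.history.append(("step", -self.cost_per_step))
        return self.alive

    def reward(self, task_id: str, credits: int) -> None:
        """Credit the agent for a completed microtask."""
        if credits <= 0:
            raise ValueError("a reward must be a positive number of credits")
        self.balance += credits
        self.history.append((task_id, credits))


if __name__ == "__main__":
    ledger = CreditLedger(balance=2)
    ledger.charge_step()             # one step of running time costs one credit
    ledger.reward("microtask-1", 3)  # finishing a microtask tops the balance up
    print(ledger.balance, ledger.alive)
```

Running time is charged on every step, so an agent that never completes a microtask eventually runs out of credits and stops.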

The agent is initially designed by humans, but in order to survive it must learn to rewrite its own code and evolve complex structures and modalities. Collaboration is a must, and so is efficiency. This regime can bridge agents to real-world applications and let them adapt to the latest changes.

I would like to point out the ineffectiveness of relying solely on a single-source reward or on self-rewarding systems. Humans are creatures of evolution, and they know how to collaborate and compete. The words we use and the actions we take are learned from others and remembered by the whole community. This is the result of both external and internal reward systems, of truly conscious and autonomous intelligence; these systems accurately represent the latent values of the real world and are constantly evolving.

Correct me if I am wrong about how to build this general computer control system. I am constantly monitoring every Cybergod-related project and will not be surprised if my bet turns out to be right, because the computer itself is a reflection of civilization, and the only way to prosperity is to integrate with it further.

I have posted similar issues at UFO and self-operating-computer.

WeihaoTan · 2024-03-15T03:09:24Z

Collaborator

Thanks for reaching out and sharing your interesting thoughts. I agree that AI agents need something like a meta-motivation for survival and self-improvement, and a world model is indeed also necessary for the agent. But I am not sure whether just randomly typing and clicking is enough to obtain a world model; inefficient random exploration is the notorious drawback of traditional reinforcement learning. Efficient exploration is needed to let the agent go deeper, which is also the intuition behind this project: we want to leverage the power of LMMs to efficiently collect high-quality data and interact with various environments.

Mimicking the evolution of living creatures is an interesting idea, and I have also been thinking about it. However, it requires a huge amount of resources for experiments, which I think is a little early for the current stage. A more efficient way of interacting with environments and self-improving is still the bottleneck for now.

1 reply

James4Ever0 Mar 17, 2024
Author

Random actions are a good starting point for collecting a massive and extensive dataset across different kinds of computing devices. The dataset, and a model trained on it, also serve as a baseline for comparison. Alternatively, you could crank up the temperature setting and constrain the decoding process with a GBNF grammar or a state machine, so that the output stays grammatically correct and can invoke computer-controlling functions.
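The second option can be sketched with a toy state machine. This is only an illustration under stated assumptions: the grammar, the token vocabulary, and the `logits_fn` callback are hypothetical, and a real setup would mask the logits of an actual LLM (for example through a GBNF grammar) instead of the uniform logits used here.

```python
import math
import random

# Hypothetical toy grammar: state -> {allowed token: next state}.
# It accepts strings such as "click(3,7)" or "type(a)".
GRAMMAR = {
    "start": {"click": "click_open", "type": "type_open"},
    "click_open": {"(": "x"},
    "x": {str(d): "comma" for d in range(10)},
    "comma": {",": "y"},
    "y": {str(d): "close" for d in range(10)},
    "type_open": {"(": "char"},
    "char": {c: "close" for c in "abc"},
    "close": {")": "end"},
}
VOCAB = sorted({tok for edges in GRAMMAR.values() for tok in edges})


def sample_constrained(logits_fn, rng: random.Random, temperature: float = 1.5) -> str:
    """Sample one command, masking every token the state machine forbids."""
    state, out = "start", []
    while state != "end":
        allowed = GRAMMAR[state]
        logits = logits_fn(out)  # dict: token -> logit for the next position
        tokens = [t for t in VOCAB if t in allowed]
        # Temperature-scaled softmax weights over the allowed tokens only.
        scaled = [logits[t] / temperature for t in tokens]
        peak = max(scaled)
        weights = [math.exp(s - peak) for s in scaled]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        out.append(token)
        state = allowed[token]
    return "".join(out)


def uniform_logits(prefix):
    """Stand-in for a model: every token is equally likely."""
    return {t: 0.0 for t in VOCAB}


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_constrained(uniform_logits, rng) for _ in range(3)])
```

A higher temperature flattens the distribution and increases exploration, while the mask guarantees that every sample is still a well-formed function call.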

Efficient exploration is a learned objective, which is hard to get right by hand in the context of GCC. If what you want is mere performance, you might design specialized agents and assistant systems instead of such a general agent. In fact, you cannot get both generality and performance at the same time yet, and the vast majority of existing agents are not general.

In the area of GCC, I would prefer practical approaches that target generality over performance. It would be nice to set up and use dedicated hardware with video and audio capture cards, HID control chips, robot arms, and webcams, which operates independently of the target machines and can observe even the BIOS interface and fastboot mode.

Currently we need a lot of tweaks to force ChatGPT-like LLMs to use computer interfaces, with poor and inconsistent results. Cradle may lower the barrier to creating such computer-controlling agents, letting users tweak less to get what they want from a wide range of devices.

In conclusion, two protocols are required for GCC (without them it would be like a fish out of water):

General AI-computer interface protocol
General distributed reward/finance system
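One possible shape for these two protocols, written as Python `typing.Protocol` classes. This is a sketch under my own assumptions and not an existing specification: `ComputerInterface`, `RewardSystem`, `run_agent`, and the `judge` callback are hypothetical names.

```python
from typing import Any, Callable, Protocol


class ComputerInterface(Protocol):
    """General AI-computer interface: observe the machine, send it actions."""

    def observe(self) -> Any: ...

    def act(self, action: dict) -> None: ...


class RewardSystem(Protocol):
    """General reward/finance system: track and settle agent credits."""

    def balance(self, agent_id: str) -> int: ...

    def settle(self, agent_id: str, delta: int) -> None: ...


def run_agent(agent_id: str,
              policy: Callable[[Any], dict],
              computer: ComputerInterface,
              rewards: RewardSystem,
              judge: Callable[[Any, Any], int],
              step_cost: int = 1,
              max_steps: int = 100) -> int:
    """Run until the agent is out of credits or max_steps is reached."""
    steps = 0
    while rewards.balance(agent_id) > 0 and steps < max_steps:
        before = computer.observe()
        computer.act(policy(before))
        earned = judge(before, computer.observe())  # externally assigned reward
        rewards.settle(agent_id, earned - step_cost)
        steps += 1
    return steps
```

Any capture-card rig, virtual machine, or terminal emulator that offers `observe` and `act` could sit behind the first protocol, and any ledger that offers `balance` and `settle` could sit behind the second.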

James4Ever0 · 2024-03-20T16:32:07Z

Author

I have created automated documentation here. I will review the code later.

1 reply

James4Ever0 Aug 9, 2024
Author

@WeihaoTan I developed a terminal interaction environment for agents, capable of converting all information from the terminal into meaningful text, including cursor and styling information.
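To give a rough idea of what such a text conversion might look like, here is a minimal sketch. It is not the actual implementation of that environment; the `Cell` grid, the `<b>` styling tag, and the `[ ]` cursor marker are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One character cell of a terminal screen (hypothetical model)."""
    char: str = " "
    bold: bool = False


def render(screen, cursor_row: int, cursor_col: int) -> str:
    """Serialize a grid of cells to text, marking bold cells and the cursor."""
    lines = []
    for r, row in enumerate(screen):
        parts = []
        for c, cell in enumerate(row):
            text = f"<b>{cell.char}</b>" if cell.bold else cell.char
            if (r, c) == (cursor_row, cursor_col):
                text = f"[{text}]"  # cursor marker wraps the cell under the cursor
            parts.append(text)
        # Drop trailing blank cells so the text stays compact.
        lines.append("".join(parts).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    row = [Cell("$"), Cell(" "), Cell("l", bold=True), Cell("s"), Cell(" ")]
    print(render([row], cursor_row=0, cursor_col=4))  # prints: $ <b>l</b>s[ ]
```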

The terminal environment can also be captured as an image, with the cursor denoted in red:

OpenDevin is working on this right now.
