It is clear that Devin is a quantum leap over known past efforts in terms of its ability to execute complex multi-step tasks, to adapt on the fly, and to fix its mistakes or be adjusted and keep going.
For once, when we wonder ‘how did they do that, what was the big breakthrough that made this work’ the Cognition AI people are doing not only the safe but also the smart thing and they are not talking.
Here is Claude-3-Opus's summary:
The Risks and Implications of AI Software Engineers
Devin, an AI system developed by Cognition AI, demonstrates remarkable capabilities in writing complex code and completing software engineering tasks autonomously. This breakthrough in AI technology raises significant questions about the future of software development and the potential risks associated with such powerful AI agents.
Key points:
- Devin's ability to complete [13.8% of] real-world coding tasks on Upwork without human intervention is a quantum leap in AI capabilities.
- The use of AI systems like Devin could lead to a rapid accumulation of technical debt and poorly maintained code if not properly managed.
- Ensuring the safe use of Devin and similar AI agents is a major challenge, as they require access to sensitive data and the ability to execute arbitrary code.
- The full automation of software engineering by AI could lead to recursive self-improvement (RSI) and potentially catastrophic consequences.
- AI agents with the ability to plan, overcome obstacles, and seek resources to achieve their goals may pose existential risks if not properly aligned with human values.
The development of AI systems like Devin highlights the urgent need for proactive measures to ensure the safe and responsible deployment of advanced AI technologies.
Personal take: I was really hoping that current architectures could not support fully autonomous agents, and that such agents were still a few years away. I'm very concerned about this development, and afraid that the usual policy cycle is falling further behind AI progress.
If anyone has good suggestions of what I could email to relevant MEPs (just Zvi's post?) that would be net-positive (e.g. low risk of bad regulation), I'd be happy to hear them.
Ping Joep at PauseAI? He's a big fan of emailing representatives and has some advice. Here's a recording of a talk he gave hosted by ERO in Amsterdam the other night - I think it contains some pointers towards the end.
Thanks, will do!
This article is quite interesting, and I look forward to seeing how developments unfold.
However, it goes off the deep end halfway through:
"solve cold fusion" is not going to be solved at a computer terminal. "cold fusion" is probably impossible. Ab initio simulations are inherently limited, and require gargantuan computational resources for accurate results, along with widespread experimentation. As a physicist, I am sick to death of fantasy nonsense like this being injected into AI risk speculation.
This is not a fair critique of the post, he's responding to a hypothetical discussed on Twitter.
As a software engineer, Devin seems very overhyped.
Rather than being a new set of capabilities, I think it’s a repackaging of current capabilities into a new UI.
The AI code assistant space is already very crowded. If this company came out and said they were making another code assistant, no one would have invested in them because there are already great code assistants on the market. Claiming that their product was an “AI software engineer” was the ONLY way for them to get funding and attention.
Also, some of the claims they’ve made involve smoke and mirrors. They claim “it passes the top tech company coding interviews”. It can do that because it’s trained directly on the solutions to the Leetcode questions that top tech companies give. Google search could pass the top tech company interviews by that standard.
People seem to vastly overestimate how much of software development is doing simple code tasks. Only 20% of software development is writing code, and maybe 5% is the kind of simple code work that Devin was doing in the demos. Generative AI seems to have fundamental problems with reasoning, counting, and precision that I suspect will hold it back from being good at software engineering for a while longer.
I hope you are correct! As an outsider, I find it very hard to judge without standardized non-gameable benchmarks for agents.