stingraycharles 16 hours ago [-]
I’ve come to the realization that these kinds of systems don’t work, and that a human in the loop is crucial for task planning. The LLM’s role is to identify issues, communicate the design/architecture, etc. before it’s handed off; otherwise the LLM always ends up doing not quite the correct thing.
How is this part tackled when all that you have is GH issues? Doesn’t this work only for the most trivial issues?
vidarh 8 hours ago [-]
I've come to the opposite conclusion: the big limitation of systems like this is starting and ending with human involvement at the same level, instead of directing at a higher level. You end up quibbling over details the agents can handle themselves with sufficient guardrails and process, instead of setting higher-level requirements, reviewing higher-level decisions and outcomes, and dealing with exceptions.
You can afford a lot of extra guardrails and process to ensure sufficient quality when the result is a system that gets improved autonomously 24/7.
I'm on my way home from a client, and meanwhile another project has spent the last 10 hours improving with no involvement from me. I spent a few minutes reviewing things this morning, after it's spent the whole night improving unattended.
stingraycharles 7 hours ago [-]
I find that that doesn’t work in the long run. Software agents are not yet capable of maintaining a decently active repository for extended periods of time.
I am all for delegating everything to AI agents, but it just becomes a mess over time if you don’t steer things often enough.
vidarh 6 hours ago [-]
Not my experience at all. If anything, they make it cheap enough to deal with tech debt that it is far easier to justify being strict.
EDIT: I'll add that you can't expect it to guess what you want, but you can let it manage how it delivers it. We don't expect e.g. a product manager to dictate how developers deliver the code, just what the acceptance criteria are, and that's where I'm headed.
mshark 16 hours ago [-]
Had the same realization, which inspired eforge (shameless plug) https://github.com/eforge-build/eforge - planning stays in the developer’s control, with all engineering (agent orchestration) handed off to eforge. This has been working well for a solo or siloed developer (me) who is free to plan independently. It lets the developer confidently stay in the planning plane while eforge handles the rest, using a methodology that in my experience works well. Of course, garbage in, garbage out - thorough human planning (AI-assisted, not autonomous) is key.
stingraycharles 15 hours ago [-]
To me that doesn't do enough yet in terms of up-front planning and visualization, but it's a step in the right direction. I prefer Traycer myself.
mshark 14 hours ago [-]
Hadn’t seen Traycer, that looks really polished. An important difference is that eforge is open source (Apache 2.0). I purposefully left out planning features from eforge because I don’t want the same tool that builds my code to force me into a planning methodology. Our role as developers has shifted heavily into planning (offloading implementation), and I’m still getting comfortable with that and want to be free to explore the planning space. Maybe I’ll change my mind after my planning opinions evolve.
jawiggins 11 hours ago [-]
Maybe - I do think as the models get better they'll be able to handle more and more difficult tasks. And yet, even if they can only solve the simplest issues now, why not let them, so you can focus on the more important things?
denysvitali 1 days ago [-]
FWIW, a "cheaper" version of this is triggering Claude via GitHub Actions and `@claude`ing your agents like that. If you run your CI on Kubernetes (ARC), it sounds pretty much the same.
saltpath 16 hours ago [-]
The parallel execution model makes sense for independent tickets but I'm wondering what happens when agent A is halfway through a PR touching shared/utils.py and agent B gets assigned a ticket that needs the same file.
Does the orchestrator do any upfront dependency analysis to detect that, or do you just let them both run and deal with the conflict at merge time?
vidarh 8 hours ago [-]
It's generally not worth worrying about too much other than at a very high level vs. letting them fight it out, as long as your test suite is good enough and your orchestrator is even moderately prepared to handle retries.
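For the upfront-analysis option the question raises, a minimal sketch might look like the following. This is hypothetical, not optio's actual behavior: it assumes each ticket declares (or an agent estimates) the set of files it will touch, and serializes any pair that overlaps.

```python
# Hypothetical upfront overlap check between parallel tickets.
# Assumes each ticket carries a declared/estimated set of files to touch.
from itertools import combinations

def find_conflicts(tickets):
    """Return (id_a, id_b, shared_files) for every pair of tickets
    whose declared file sets overlap, so they can be serialized."""
    conflicts = []
    for a, b in combinations(tickets, 2):
        shared = set(a["files"]) & set(b["files"])
        if shared:
            conflicts.append((a["id"], b["id"], shared))
    return conflicts

tickets = [
    {"id": "T-101", "files": {"shared/utils.py", "api/views.py"}},
    {"id": "T-102", "files": {"shared/utils.py"}},
    {"id": "T-103", "files": {"docs/README.md"}},
]
# T-101 and T-102 both touch shared/utils.py, so the orchestrator
# could run T-103 in parallel but queue T-102 behind T-101.
```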
naultic 24 hours ago [-]
I'm working on something a little similar, but mine's more a dev tool vs process automation - I love where yours is headed. The biggest issue I've run into is handling retries with agents. My current solution is to have them set checkpoints so they can revert easily; when they can't make an edit or can't get a test passing, they just restart from the earlier state. Problem is this uses up lots of tokens on retries - how did you handle this issue in your app?
jawiggins 24 hours ago [-]
Generally I've found agents are capable of self correcting as long as they can bash up against a guardrail and see the errors. So in optio the agent is resumed and told to fix any CI failures or fix review feedback.
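That resume-on-failure loop can be sketched roughly as below. The function names (`run_agent`, `check_ci`) are stand-ins, not optio's API; the point is that the agent is resumed with the failure output as feedback rather than restarted from scratch.

```python
# Sketch of a resume-until-green loop (hypothetical interface):
# run the agent, check CI, and resume the same session with the
# error output until it passes or a retry budget runs out.
def run_until_green(task, run_agent, check_ci, max_retries=3):
    result = run_agent(task, feedback=None)
    for _attempt in range(max_retries):
        ok, errors = check_ci(result)
        if ok:
            return result
        # Resume with the failure output so the agent can self-correct
        # by bashing up against the guardrail and seeing the errors.
        result = run_agent(task, feedback=errors)
    raise RuntimeError(f"CI still failing after {max_retries} retries")
```

Capping retries bounds the token spend; anything that exhausts the budget gets escalated to a human instead of burning more runs.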
MrDarcy 1 days ago [-]
Looks cool, congrats on the launch. Is there any sandbox isolation from the k8s platform layer? Wondering if this is suitable for multiple tenants or customers.
jawiggins 1 days ago [-]
Oh good question, I haven't thought deeply about this.
Right now nothing special happens, so claude/codex can access their normal tools and make web calls. I suppose that also means they could figure out they're running in a k8s pod and do service discovery and start calling things.
What kind of features would you be interested in seeing around this? Maybe a toggle to disable internet connections or other connections outside of the container?
nevon 19 hours ago [-]
Network policies controlling egress would be one thing. I haven't seen how you make secrets available to the agent, but I would imagine you would need to proxy calls through a mitm proxy to replace tokens with real secrets, or some other way to make sure the agent cannot access the secrets themselves. Specifically for an agent that works with code, I could imagine being able to run docker-in-docker will probably be requested at some point, which means you'll need gvisor or something.
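As a concrete illustration of the egress point, a default-ish NetworkPolicy for agent pods might look like this. The namespace, labels, and CIDRs are placeholders, not optio's actual config:

```yaml
# Sketch: restrict agent pod egress to DNS and outbound HTTPS,
# while blocking cluster-internal service discovery.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-allowlist
  namespace: agents
spec:
  podSelector:
    matchLabels:
      app: coding-agent
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups
    - ports:
        - protocol: UDP
          port: 53
    # Allow HTTPS out (GitHub, model APIs), but not to cluster ranges
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8   # block in-cluster services
      ports:
        - protocol: TCP
          port: 443
```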
I wonder, based on your experience, how hard would it be to improve your system to have an AI agent review the software and suggest tickets?
Like, can an AI agent use a browser, attempt to use the software, find bugs and create a ticket? Can an AI agent use a browser, try to use the software and suggest new features?
ramon156 19 hours ago [-]
I think it's more important to pin down where a human must be in order for this not to become a mess. Or have we skipped that step entirely?
pianopatrick 6 hours ago [-]
Personally my theory is that to solve the messiness we will need some new frameworks and even languages that are designed to catch AI mistakes in large code bases. For example, AIs in the past would sometimes hallucinate methods that do not exist. But in a language with a strong type system a static type checker should be able to catch that mistake and give the AI automated feedback to fix that mistake without a human in the loop.
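As a toy illustration of that feedback loop (not how a real type checker works internally, and it assumes we already know the variable's class): a static pass over the AST can flag calls to methods that don't exist, which is exactly the kind of error message an agent can act on automatically.

```python
# Toy static check: flag calls like g.method() where `method`
# isn't defined on a known class - the hallucinated-API case.
import ast

class Greeter:
    def hello(self):
        return "hi"

def undefined_method_calls(source, cls):
    """Return names of called methods that don't exist on cls."""
    known = {name for name in dir(cls) if not name.startswith("_")}
    bad = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr not in known):
            bad.append(node.func.attr)
    return bad

snippet = "g.hello()\ng.greet_warmly()"  # greet_warmly is hallucinated
```

The error ("greet_warmly does not exist") is mechanical and precise, which is what makes it useful as automated feedback rather than needing a human reviewer.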
As far as humans in the loop, the only human we ultimately cannot get rid of is the user. But I think with a combo of user feedback forms and automated metrics we can give AI a lot of feedback about how good software is just from users using the software.
vidarh 8 hours ago [-]
Yes, they can, and they do a reasonably good job at it. Hand them playwright or similar, and point them at it. The caveat is that they're often "lazy", and it takes some practice to coax them into being thorough (hot tip: have one write a list of things to probe and test, and tell it to use sub agents to address each; otherwise they tend to decide very quickly it's too tedious and start taking shortcuts)
mlsu 19 hours ago [-]
perhaps we can give the AI a bit of money, make it the customer, then we can all safely get off the computer and go outside :)
stingraycharles 16 hours ago [-]
AI agents can absolutely use web browsers to do these things, but the hard part is accurately defining the acceptance criteria.
smokeyfish 20 hours ago [-]
Datadog have a feature like that.
maxdo 13 hours ago [-]
Is the pod per repo or per task ?
jawiggins 11 hours ago [-]
One pod is an instance of a repo; you can set the number of instances of each agent/task that can be running on a pod at a time. For >1, each agent should be using its own worktree.
fhouser 11 hours ago [-]
Hot take: You should want to review your agents' output and progress.
vidarh 8 hours ago [-]
I prefer to have my agents review my agents output and progress, and have them improve the prompts for future runs.
jawiggins 10 hours ago [-]
Yeah totally, you don't have to auto-merge anything - you can review the PRs yourself
fhouser 9 hours ago [-]
Yeah, I think that's the most important part in these new types of processes. Although it is tempting to just let an agent run with it for a while.
the_real_cher 11 hours ago [-]
This project should be called the Rube Goldberg machine creator.
fhouser 11 hours ago [-]
The Hitchhiker's Guide to issue-tracking.
raised_hand 20 hours ago [-]
Why K8s? Is there a way I could run it without
conception 1 days ago [-]
What’s the most complicated, finished project you’ve done with this?
jawiggins 1 days ago [-]
Recently I used it to finish up my re-implementation of curl/libcurl in rust (https://news.ycombinator.com/item?id=47490735). At first I started by trying to have a single claude code session run in an iterative loop, but eventually I found it was way too slow.
I started tasking subagents for each remaining chunk of work, and then found I was really just repeating the need for a normal sprint tasking cycle, but where subagents completed the tasks with the unit tests as exit criteria. So optio came to mind: I asked an agent to run the test suite, see what was failing, and make tickets for each group of remaining failures. Then I used optio to manage instances of agents working on and closing out each ticket.
antihero 1 days ago [-]
And what stops it making total garbage that wrecks your codebase?
jawiggins 1 days ago [-]
There are a few things:
a) you can create CI/build checks that run in GitHub, and the agents will make sure they pass before merging anything
b) you can configure a review agent with any prompt you'd like to make sure any specific rules you have are followed
c) you can disable all the auto-merge settings and review all the agent code yourself if you'd like.
kristjansson 1 days ago [-]
> to make sure
you've really got to be careful with absolute language like this in reference to LLMs. A review agent provides no guarantees whatsoever, just shifts the distribution of acceptable responses, hopefully in a direction the user prefers.
jawiggins 1 days ago [-]
Fair, it's something like a semantic enforcement rather than a hard one. I think current AI agents are good enough that if you tell it, "Review this PR and request changes anytime a user uses a variable name that is a color", it will do a pretty good job. But for complex things I can still see them falling short.
SR2Z 23 hours ago [-]
I mean, having unit tests and not allowing PRs in unless they all pass is pretty easy (or requiring human review to remove a test!).
A software engineer takes a spec which "shifts the distribution of acceptable responses" for their output. If they're 100% accurate (snort), how good does an LLM have to be for you to accept its review as reasonable?
59nadir 20 hours ago [-]
We've seen public examples of LLMs literally disabling or removing tests in order to pass. I'm not sure that having tests and asking LLMs not to merge things before passing them being "easy" matters much when the failure modes here are so plentiful and broad in nature.
jawiggins 10 hours ago [-]
You'd want to have the tests run as a GitHub Action and then fail the check if the tests don't pass. Optio will resume agents when the actions fail and tell them to fix the failures.
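A minimal version of that gating workflow might look like the fragment below (names are illustrative, not optio's actual config). The key is that a non-zero test exit fails the check, which is the signal used to resume the agent:

```yaml
# Sketch: PR check that fails when tests fail, with no escape hatch.
name: tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest   # non-zero exit fails the check; no continue-on-error
```

Marking the check as required in branch protection then prevents any merge, human or agent, until it's green.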
ElFitz 19 hours ago [-]
My favourite so far was Claude "fixing" deployment checks with `continue-on-error: true`
SR2Z 7 hours ago [-]
So... add another presubmit test that fails when a test is removed. Require human reviews.
It's not like a human being always pushes correct code. My risk assessment for an LLM reading a small bug and just making a PR is that thinking too hard is a waste of time, and my risk assessment for a human is very similar, because actually catching issues during code review is best done by tests anyway. If the tests can't tell you whether your code is good, then it really doesn't matter whether it's a human or an LLM: you're mostly just guessing whether things are going to work, and you WILL push bad code that gets caught in prod.
jamiemallers 17 hours ago [-]
[dead]
AbanoubRodolf 19 hours ago [-]
[dead]
upupupandaway 1 days ago [-]
Ticket -> PR -> Deployment -> Incident
zvqcMMV6Zcr 13 hours ago [-]
> To make error is human. To propagate error to all server in automatic way is #devops
I'm not sure what the AI agent variation of that joke would look like. Every now and then a blog post lands on HN asking "Where are all the new apps created thanks to the LLM productivity boost?". I'm more surprised there's no news about serious fuck-ups that can be traced back to LLM usage in code.
verdverm 20 hours ago [-]
I love k8s, but having it as a requirement for my agent setup is a non-starter. Kubernetes is one method for running, not the centerpiece.
abybaddi009 23 hours ago [-]
Does this support skills and MCP?
jawiggins 21 hours ago [-]
Yup. MCP can be configured on a repo level. At task execution time, enabled MCP servers are written as a .mcp.json file into the agent's worktree. Enabled skills are written as .claude/commands/{name}.md files in the worktree, making them available as slash commands to the agent
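For readers unfamiliar with the format, a `.mcp.json` written into the worktree looks roughly like this (the server name and package here are illustrative, not what optio ships):

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```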
hmokiguess 1 days ago [-]
the misaligned columns in the Claude-made ASCII diagrams in the README really throw me off, why not fix them?
|
|
|
|
jawiggins 1 days ago [-]
Should be fixed now :)
hmokiguess 11 hours ago [-]
thank you x)
MarcelinoGMX3C 12 hours ago [-]
[dead]
bmd1905 10 hours ago [-]
[dead]
ferreyadinarta 13 hours ago [-]
[flagged]
Acacian 14 hours ago [-]
[dead]
rafaelbcs 1 days ago [-]
[dead]
QubridAI 1 days ago [-]
[flagged]
knollimar 1 days ago [-]
I don't want to accuse you of being an LLM but geez this sounds like satire