Thanks for your thorough response, and yeah, I'm broadly on board with all that. I think learning from detailed text behind decisions, not just the single-bit decision itself, is a great idea that can leverage a lot of recent work.
I don't think that using modern ML to create a model of legal text is directly promising from an alignment standpoint, but by holding out some of your dataset (e.g. a random sample, or all decisions about a specific topic, or all decisions later than 2021), you can test the generalization properties of the model, and more importantly test interventions intended to improve those properties.
I don't think we have that great a grasp right now on how to use human feedback to get models to generalize to situations the humans themselves can't navigate. This is actually a good situation for sandwiching: suppose most text about a specific topic (e.g. use of a specific technology) is held back from the training set, and the model starts out bad at predicting that text. Could we leverage human feedback from non-experts in those cases (potentially even humans who start out basically ignorant about the topic) to help the model generalize better than those humans could alone? This is an intermediate goal that it would be great to advance towards.
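To make the held-out-split idea concrete, here's a minimal Python sketch. Everything in it (the `Decision` schema, the `evaluate` / `finetune_with_feedback` placeholders) is hypothetical scaffolding for the shape of the experiment, not a reference to any existing corpus or API:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One legal decision plus the full text behind it (hypothetical schema)."""
    topic: str   # e.g. "drone surveillance"
    year: int
    text: str    # the written reasoning, not just the single-bit outcome

def split_for_generalization(decisions, held_out_topic=None, cutoff_year=None):
    """Hold out a topic and/or everything after a cutoff year as the test set.

    The test set then probes generalization (to an unseen topic, or forward
    in time) rather than memorization of the training distribution.
    """
    def held_out(d):
        if held_out_topic is not None and d.topic == held_out_topic:
            return True
        if cutoff_year is not None and d.year > cutoff_year:
            return True
        return False

    train = [d for d in decisions if not held_out(d)]
    test = [d for d in decisions if held_out(d)]
    return train, test

def sandwiching_experiment(model, decisions, topic, evaluate, finetune_with_feedback):
    """Shape of the sandwiching comparison; `evaluate` and
    `finetune_with_feedback` are placeholders supplied by the caller."""
    train, test = split_for_generalization(decisions, held_out_topic=topic)
    baseline_score = evaluate(model, test)
    tuned = finetune_with_feedback(model, train, feedback_source="non-experts")
    tuned_score = evaluate(tuned, test)
    return baseline_score, tuned_score
```

The interesting comparison is the last two lines: does cheap non-expert feedback move `tuned_score` above `baseline_score` on a topic that neither the model nor the humans handled well to begin with?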
Presumably you're aware of various Dylan Hadfield-Menell papers, e.g. https://dl.acm.org/doi/10.1145/3514094.3534130 and https://dl.acm.org/doi/10.1145/3306618.3314258
And of course Xuan's talk ( https://www.lesswrong.com/posts/Cty2rSMut483QgBQ2/what-should-ai-owe-to-us-accountable-and-aligned-ai-systems )
But, to be perfectly honest... I think there's part of this proposal that has merit, and part of this proposal that might sound good to many people but is actually bad.
First, the bad: The notion that "Law is a computational engine that converts human values into legible directives" is wrong. Legibility is not an inherent property of the directives. It is a property of the directives with respect to the one interpreting them, which in the case of law is humans. If you build an AI that doesn't try to follow the spirit of the law in a human-recognizable way, the law will not be legible in the way you want.
The notion that it would be good to build AI that humans direct by the same process that we currently use to create laws is wrong. Such a process works for laws, specifically for laws for humans, but it is tailored in many ways, large and small, to how we currently apply it, and it has numerous flaws even for that purpose (as you mention regarding expressions of power).
Then, the good: Law offers a lot of training data that directly bears on what humans value, what vague statements of standards mean in practice, and what humans think good reasoning looks like. The "legible" law can't be used directly, but it can be used as a yardstick against which to learn the illegible spirit of the law. This research direction does not look like a Bold New Way to do AI alignment; instead it looks like a Somewhat Bold New Way to apply AI alignment work that is fully contiguous with other alignment research (e.g. attempts to learn human preferences by actively asking humans).
One thing that confused me was the assumption at various points that the oracle is going to pay out the entire surplus generated. That'll get the most projects done, but it will have bad results because you'll have spent the entire surplus on charity yachts.
The oracle should be paying out what it takes to get projects done. Not in the sense of the labor theory of value; I mean that if you are having trouble attracting projects, payouts should go up, and if you have lots of competition for funding, payouts should go down.
This is actually a lot like a monopsony situation, where you can have unemployment (analogous to net-positive projects that don't get done) because the monopsonistic employer can't pay those last few workers what they want without raising wages for everyone else, which eats into its surplus.
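To make "payouts go up / down" concrete, here's a toy adjustment rule in Python. The proportional step, the numbers, and the idea of targeting a number of funded projects are my own illustration, not part of the original proposal:

```python
def adjust_payout_fraction(current_fraction, projects_applied, projects_wanted,
                           step=0.05, floor=0.05, ceiling=1.0):
    """Toy rule: the oracle pays out a fraction of estimated surplus,
    nudged up or down by how much demand it sees from projects."""
    if projects_applied < projects_wanted:
        current_fraction += step   # trouble attracting projects -> pay more
    elif projects_applied > projects_wanted:
        current_fraction -= step   # lots of competition for funding -> pay less
    return max(floor, min(ceiling, current_fraction))

# Example: heavy competition at first, so the payout fraction drifts down,
# then recovers a bit when applications dry up.
fraction = 0.8
for applied, wanted in [(30, 10), (25, 10), (12, 10), (8, 10)]:
    fraction = adjust_payout_fraction(fraction, applied, wanted)
    print(round(fraction, 2))   # 0.75, 0.7, 0.65, 0.7
```

Note this doesn't escape the monopsony problem above: if the oracle offers the marginal project's required payout to every project, some net-positive projects at the margin will still go unfunded.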
I think moderated video calls are my favorite format, as boring as that is. I.e. you have a speaker and also a moderator who picks people to ask questions, cuts people off or prompts them to keep talking depending on their judgment, etc.
Another thing I like, if it seems like people are interested in talking about multiple different things after the main talk / QA / discussion, is splitting the discussion into multiple rooms by topic. I think Discord is a good application for this. Zoom is pretty bad at it but can be cajoled into having the right functionality if you make everyone a co-host. I think Microsoft Teams is fine but other people have problems with it, and other people think GatherTown is fine but I have problems with it.
I'm curious about your takes on the value-inverted versions of the repugnant and very-repugnant conclusions. It's easy to "make sense" of a preference (e.g. for positive experiences) by deciding not to care about it after all, but doing that doesn't actually resolve the weirdness in our feelings about aggregation.
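For concreteness, here is the aggregation arithmetic behind both directions under a simple total-welfare view (my framing, and only the basic inverted conclusion - the very-repugnant variants add a block of extreme lives on top):

```latex
% One-dimensional aggregation: total welfare W = \sum_i w_i .
% Repugnant direction: N lives at barely-positive welfare \varepsilon
% outweigh n lives at high welfare H as soon as
\[
  N \varepsilon > n H \quad\Longleftrightarrow\quad N > \frac{nH}{\varepsilon}.
\]
% Value-inverted direction: N lives at barely-negative welfare -\varepsilon
% count as worse than n lives of extreme suffering at -H as soon as
\[
  N(-\varepsilon) < n(-H) \quad\Longleftrightarrow\quad N > \frac{nH}{\varepsilon}.
\]
```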
Once you let go of trying to reduce people to a 1-dimensional value first and then aggregate them second, as you seem to be advocating here in ss. 3/4, I don't see why we should try to hold onto simple rules like "minimize this one simple thing." If the possibilities we're allowed to have preferences about are not 1-dimensional aggregations, but are instead the entire self-interacting florescence of life's future, then our preferences can get correspondingly more interesting. It's like replacing preferences over the center of mass of a sculpture with preferences about its pose or theme or ornamentation.
Academics choose to work on things when they're doable, important, interesting, publishable, and fundable. Importance and interestingness seem to be the least bottlenecked parts of that list.
The root of the problem is difficulty in evaluating the quality of work. There's no public benchmark for AI safety that people really believe in (nor do I think there can be, yet - AI safety is still pre-paradigmatic), so evaluating the quality of work actually requires trusted experts sitting down and thinking hard about a paper - much harder than just checking whether it beat the state of the art. This difficulty restricts doability, publishability, and fundability. It also makes un-vetted research even less useful to you than it is in other fields.
Perhaps the solution is the production of a lot more experts, but becoming an expert on this "weird" problem takes work - work that is not particularly important or publishable, and so working academics aren't going to take a year or two off to do it. At best we could sponsor outreach events/conferences/symposia aimed at giving academics some information and context to make somewhat better evaluations of the quality of AI safety work.
Thus I think we're stuck with growing the ranks of experts not slowly per se (we could certainly be growing faster), but at least gradually, and then we have to leverage that network of trust both to evaluate academic AI safety work for fundability / publishability, and also to inform it to improve doability.
That's a good point. I'm a little worried that coarse-grained metrics like "% unemployment" or "average productivity of labor vs. capital" could fail to track AI progress if AI increases the productivity of labor. But we could pick specific tasks like making a pencil, etc. and ask "how many hours of human labor did it take to make a pencil this year?" This might be hard for diverse task categories like writing a new piece of software though.
I think the intersection with recommender algorithms - both in terms of making them, and in terms of efforts to empower people in the face of them - is interesting.
Suppose you have an interface that interacts with a human user by recommending actions (often with a moral component) in reaction to prompting (voice input seems emotionally powerful here), and that builds up a model of the user over time (or even by collecting data about the user much like every other app). How do you build this to empower the user rather than just reinforcing their most predictable tendencies? How do you avoid top-down bias pushed onto the user by the company / org making the app?
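Here's a minimal sketch of the loop that paragraph describes, with hypothetical names throughout (none of this is an existing API); the hard open questions live in `UserModel.update` and in what `recommend` optimizes:

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:
    """Hypothetical running model of the user, built up across interactions."""
    history: list = field(default_factory=list)

    def update(self, prompt, recommendation, user_response):
        # The hard part in a real system: distinguishing what the user
        # endorses on reflection from what they merely predictably do.
        self.history.append((prompt, recommendation, user_response))

def recommend(user_model, prompt):
    """Placeholder policy: map a prompt plus the user model to a suggested action."""
    return f"suggested action for: {prompt}"

def interaction_loop(get_prompt, get_response, steps=3):
    """One possible shape for the app's core loop (voice or text input)."""
    model = UserModel()
    for _ in range(steps):
        prompt = get_prompt()                # e.g. transcribed voice input
        suggestion = recommend(model, prompt)
        response = get_response(suggestion)  # accept / reject / ignore
        model.update(prompt, suggestion, response)
    return model
```

A pure engagement objective would make `recommend` pick whatever the user model predicts will be accepted; the empowerment question is what other term belongs in that choice, and who gets to set it.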