That's an interesting point: under this model, if EAGx's don't matter then we'd expect engagement to decrease for attendees, so stable engagement could be interpreted as a positive effect. A proper cohort analysis could help determine the volatility/churn, giving us a baseline to estimate the magnitude of this effect among the sort of people who might attend EAG(x) but didn't.
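To make the cohort idea concrete, here's a minimal sketch (with entirely made-up column names and wave labels, not the actual survey data) of how you could estimate that baseline churn among people who never attended an EAG(x):

```python
import pandas as pd

# Hypothetical long-format survey data: one row per respondent per survey wave.
# Invented columns: respondent_id, wave (0 = baseline, 1 = follow-up),
# engagement_score, attended_eagx (boolean).
df = pd.read_csv("engagement_waves.csv")

# Restrict to people who never attended an EAG(x) to get the "natural" trajectory.
non_attendees = df[~df["attended_eagx"]]
wide = non_attendees.pivot(index="respondent_id", columns="wave", values="engagement_score")

# Mean wave-to-wave change gives a rough baseline decay rate,
# and the share of people whose score drops gives a churn rate.
baseline_change = (wide[1] - wide[0]).mean()
churn_rate = (wide[1] < wide[0]).mean()
print(f"Mean engagement change without EAGx: {baseline_change:.2f}")
print(f"Share of non-attendees whose engagement fell: {churn_rate:.1%}")
```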
That said, I still think that any effect of EAG(x) would presumably be a lot stronger in the 6 months after a conference than in the 6 months after that (/the 6 months before a conference), so if it had a big effect and attendees' engagement was falling on average, then you'd see a bump (or stabilization) in the few months after an event and a bigger decline after that. Though this survey has obvious limitations for detecting that.
What did you mean by the last sentence? Above I've assumed that it has an effect not just on new people attending a conference for the first time (though my intuition is that this effect would be bigger) but also in maintaining (on the margin) the engagement of repeat attendees. Do you disagree?
Thanks Rudstead, I agree about the "keen beans" limitation, though if anything that makes them more similar to EAGx attendees (which they're supposed to be a comparison to). In surveys in general there are also steeply diminishing returns to getting a higher response rate with more reminders or higher cash incentives.
(2) Agreed, but hopefully we'll be able to continue following people up over time. The main limitation is that lots of people in any cohort study are going to drop out over time, but if it succeeded such a cohort study could provide a great deal of information.
Thanks for the comment, this is a really strong point.
I think this can make us reasonably confident that the EAGx didn't make people more engaged on average and even though you already expected this, I think a lot of people did expect EAGs would lead to actively higher engagement among participants. We weren't trying to measure the EA growth rate of course, we were trying to measure whether the EAGs lead to higher counterfactual engagement among attendees.
The model where an EAG matters could look something like: there are two separate populations within EA: less-engaged members who don't attend EAGs, and more-engaged members who attend EAGs at least sometimes. Attending an EAG helps push people into being even more engaged and maintains a level of engagement that would otherwise flag. So even if both populations are stable, EAG keeps the high-engagement population more engaged and/or larger.
A similar model where EAG doesn't matter is that people stay engaged for other reasons, and attend EAG either incorrectly believing it will help or as de facto recreation.
If the first model is true then we should expect EA engagement to be a lot higher in the few months after the conference and gradually fall until at least the few weeks before the next conference (then spike again during/just after it). But if the second model is true then any effects on EA engagement from the conference should disappear quickly, perhaps within a few weeks or even days.
While the survey isn't perfect for measuring this (6 months is a lot of time for the effects to decay, and it would have been better to run the initial survey weeks before the conference, since the lead-up to the conference might already have been getting people excited), I think it provides significant value since it asks about behavior over the past 6 months in total. If the conference had a big effect on maintaining motivation (which averages out to a steady state across years), you'd expect people to donate more, make more connections, attend more events etc. in the 0-5 months after a conference than in the 6-12 months after.
Given we don't see that, it seems harder to argue that EAGs have a big effect on motivation and therefore harder to argue that EAGs play an important role in maintaining the current steady-state motivation and energy of attendees.
It could still be that EAGs matter for other reasons (e.g. a few people get connections that create amazing value) but this seems to provide significant evidence against one major supposed channel of impact.
We considered it and I definitely agree that people who are attending their first EAGx are much more likely to be affected. The issue is that people in that bucket are already likely to be dramatically increasing their level of engagement, so it's hard to draw conclusions from the results on that front.
Thanks for the feedback Laura, I think the point about ceiling effects is really interesting. If we care about increasing the mean participation then that shouldn't affect the conclusions (since it would be useless for people already at the ceiling), but if (as you suggest) the value is mostly coming from a handful of people maintaining/growing their engagement and networks then our method wouldn't detect that. Detecting effects like that is hard and while it's good practice to be skeptical of unobservable explanations, it doesn't seem that implausible.
Perhaps one could systematically look at the histories of people who are working in high-impact jobs and who joined EA after ~2015, tracing through interviews with them and their friends whether they'd have ended up somewhere equally impactful if not for attending EAGs. But that would necessarily involve huge assumptions about how impactful EAGs are, so it may not add much information.
I agree that randomizing almost-accepted people would be statistically great but not informative about the impacts on non-marginal people, and randomly excluding highly qualified people would be too costly in my opinion. We specifically reached out to people who were accepted but didn't attend for various reasons (which should be a good comparison point), but there were nowhere near enough of them for EAGxAus to get statistical results. If this were done for all EAG(x)'s for a few years, we might actually get a great control group though!
We did consider having more questions, aimed more directly at the factors that are most indicative of direct impact, but we decided on this compromise for two reasons. First, every extra question reduces the response rate; given the 40% drop-out and a small sample size, I'd be reluctant to add too much. Second, questions that take time and thought for people to answer are especially likely to lead to drop-outs and inaccurate responses.
That said, leaving a text box for 'what EA connections and opportunities have you found in the last 6 months?' could be very powerful, though quantifying the results would require a lot of interpretation.
Thanks for the feedback Sam. It's definitely a limitation, but the diff-in-diff analysis still has significant value. The specific way the treatment and control groups differ constrains the stories we can tell in which the conference did have a big (hopefully positive) effect but appeared not to due to some unobserved factors. If none of these stories seem plausible then we can still be relatively confident in the results.
The post mentions that the difference in donations appears to be driven by 3 respondents, and the idea that donations would fall by ~50% without attendance but be unchanged with attendance seems unlikely (and is confounded with high-earning professionals presumably having less time to attend).
Otherwise, the control group seems to have similar beliefs but is much less likely to take EA actions. This isn't surprising given that attending EAGx is itself an EA action, but it does present a problem. Looking only at people who were planning to attend but didn't (for various reasons) would have given a very solid subgroup, but there were too few of these to do any statistical analysis. A bigger conference could look specifically at that group, though, which I'd be really excited to see.
With diff-in-diff we need the parallel trends assumption as you point out, but we don't need parallel levels: if the groups would have continued at their previous (different) rates of engagement in the absence of the conference then we should be fine. Similarly, if there's some external event affecting EA in general and we can assume it would have impacted both groups equivalently (at least in % of engagement) then the diff-in-diff methodology should account for that.
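To spell out why parallel trends (rather than parallel levels) is what matters, the basic estimator only uses the change within each group, so each group's starting level cancels out of its own difference:

$$\widehat{\text{DiD}} = \big(\bar{Y}^{\text{attendees}}_{\text{after}} - \bar{Y}^{\text{attendees}}_{\text{before}}\big) - \big(\bar{Y}^{\text{non-attendees}}_{\text{after}} - \bar{Y}^{\text{non-attendees}}_{\text{before}}\big)$$

Parallel trends just means the non-attendees' change is a reasonable stand-in for what the attendees' change would have been without the conference.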
So (excluding the donation case) we have a situation where a more engaged group and a less engaged group both didn't change their behavior.
If the conference had a big positive effect, then this would imply that in the absence of the conference the attendees (the more engaged group) would have decreased their engagement dramatically, but that the effect of the conference happened to cancel that out. It also implies that whatever factor would have led attendees to become less engaged wouldn't have affected non-attendees (or at least is strongly correlated with attendance).
You could imagine the response rates being responsible, but I'm struggling to think of a credible story for this: the 41% of attendees who dropped out of the follow-up survey would presumably be those least affected by the conference, which would make the data overestimate the impact of EAGx. Perhaps the 3% of contacted people who volunteered for the control group were much more consistent in their EA engagement than the (more engaged on average) attendees who volunteered, and so were less affected by an EA-wide downturn that conference attendance happened to cancel out? But this seems tenuous and 'just-so'.
To me the most plausible way this could happen is reversion to the mean: EA engagement is highly volatile year-to-year, only the most engaged attend EAGx, and attending results in them maintaining their high level of EA engagement for at least the next year (roughly cancelling out the decline you'd otherwise expect).
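As an illustration of how strong that selection effect could be, here's a toy simulation (all numbers invented, purely illustrative) where engagement has no real trend but only the most engaged-looking people attend:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical model: each person has a stable "true" engagement level plus
# year-to-year noise; only people who look highly engaged in year 1 attend EAGx.
true_level = rng.normal(0, 1, n)
year1 = true_level + rng.normal(0, 1, n)
year2 = true_level + rng.normal(0, 1, n)

attends = year1 > np.quantile(year1, 0.9)  # most engaged-looking 10% attend

# Without any conference effect, attendees' measured engagement regresses downward:
print("Attendees, year 1:", year1[attends].mean().round(2))
print("Attendees, year 2 (no conference effect):", year2[attends].mean().round(2))

# A real conference boost of the same magnitude as the regression would make
# year 2 look flat, i.e. the effect and the reversion to the mean cancel out.
boost = year1[attends].mean() - year2[attends].mean()
print("Attendees, year 2 (with offsetting boost):", (year2[attends] + boost).mean().round(2))
```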
This last point is the biggest issue with the analysis in my opinion. Following attendees over the long-run with multiple surveys per year (to compare results before vs. after a conference) would help a lot, but huge incentives would be needed to maintain a meaningful sample for more than a couple of years.
This can be avoided with a treaty that requires full access to be given to international inspectors. This already happens with the IAEA, which was set up even amid the far greater tensions of the Cold War. If someone like Iran tries to kick out the inspectors, everyone assumes they're trying to develop nuclear weapons and takes serious action (harsh sanctions, airstrikes, even the threat of war).
If governments think of this as an existential threat, they should agree to it for the same reasons they did with the IAEA. And while there are big incentives to defect (unless they have a very high p(doom)), there is also the knowledge that kicking out inspectors will lead to potential war and to their rivals defecting too.
On (2): I agree most are unlikely to focus on it heavily, but convincing some people at top labs to care at least slightly seems like it could have a big effect in making sure at least a little animal welfare and digital minds content is included in whatever they train AIs to aim towards. Even a small amount of empathy and open-mindedness about what could be capable of suffering should do a lot to reduce the risk of astronomical suffering.
I agree there wouldn't be new effects at that point, but we're asking about total effects over the 6 months before/since the conference. If the connections etc. persist for 6 months then it should show up in the survey, and if they have disappeared within a few months then that indicates these effects of EAGx attendance are short-lived, which presumably makes them far less significant for a person's EA engagement and impact overall.
If the EAG impacts are spiky enough that they start dissipating substantially within several months (but get re-upped by future attendance), then we should be able to detect a change with our methodology (higher engagement after). You're right that if the effects persist for many years (and don't stack much with repeat attendance) then we wouldn't be able to measure any effect on repeat attendees, but in that case the conference wouldn't be having much impact on repeat attendees anyway. On the other hand, if effects persist for many years then we should be able to detect a strong effect for first-time attendees (though you'd need a bigger sample).