
Summary

Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team

Motivation

LLMs have recently made it significantly easier to make Fermi estimates. You can chat with most LLMs directly, or you can use custom tools like Squiggle AI. And yet, overall, few people have taken much advantage of this. 

We at QURI are launching a competition to encourage exploration.

What We’re Looking For

Our goal is to discover creative ways to use AI for Fermi estimation. We're more excited about novel approaches than exhaustively researched calculations. Rather than spending hours gathering statistics or building complex spreadsheets, we encourage you to:

  • Let AI do most of the heavy lifting
  • Try unconventional estimation techniques
  • Experiment with multiple approaches to find surprising insights

The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry; a creative insight could win even if it takes just 20 minutes to develop.

Task

Create and submit an interesting Fermi estimate. Entries will be judged using Claude 3.5 Sonnet (with three runs averaged) based on four main criteria (a sketch of how these combine follows the list):

  • Surprise (40%): How unexpected/novel are the findings?
  • Topic Relevance (20%): Relevance to rationalist/EA communities
  • Robustness (20%): Reliability of methodology and assumptions
  • Model Quality (20%): Technical execution and presentation
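
Here's a minimal sketch of that aggregation, written in Squiggle. The input scores are made up for illustration, and the exact combination procedure shown is a plain reading of the rubric rather than a formal spec:

```
// Hypothetical scores (0-10): three Claude 3.5 Sonnet runs averaged per criterion
surprise = (7 + 6 + 8) / 3
topicRelevance = (5 + 5 + 6) / 3
robustness = (6 + 7 + 6) / 3
modelQuality = (7 + 7 + 8) / 3

// Weighted total per the criteria above
rawScore = 0.4 * surprise + 0.2 * topicRelevance + 0.2 * robustness + 0.2 * modelQuality

// Goodharting penalties (assessed by the QURI team) can remove up to 100% of the score
goodhartingPenalty = 0 // illustrative; see "Goodharting" Penalties in the appendix
finalScore = rawScore * (1 - goodhartingPenalty)
finalScore
```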

You aren't required to use AI tools to generate your estimates, but we expect them to help.

Submission Format

Post your entry as a comment to this post, containing:

  1. Model: The complete model content (text or link to accessible document)
  2. Summary: Brief explanation of why your estimate is interesting/novel, and any surprising results or insights discovered
  3. Technique: Brief explanation of what tools and techniques you used to create the estimate. If you primarily used one LLM or AI tool, the name of the tool is fine.

Examples

Our previous post on Squiggle AI discussed several interesting AI-generated models. You can also see many results on SquiggleHub and Guesstimate.
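
If you haven't used Squiggle before, here's a minimal sketch of the classic piano-tuners Fermi estimate; every interval below is an illustrative 90% confidence interval, not researched data:

```
// Classic Fermi estimate: how many piano tuners work in Chicago?
chicagoPopulation = 2.5M to 3M
pianosPerPerson = 0.002 to 0.01
tuningsPerPianoPerYear = 0.2 to 1
tuningsPerTunerPerYear = 400 to 1500
pianoTuners = chicagoPopulation * pianosPerPerson * tuningsPerPianoPerYear / tuningsPerTunerPerYear
pianoTuners
```

The `to` operator turns a 90% confidence interval into a distribution, so the output is a distribution over counts rather than a point estimate.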

Important Notes

  • Content must be easily copyable for LLM evaluation.
  • Models must be less than 5,000 words total. We expect most to be in the range of 100 to 500 words.
  • Submissions that appear to be optimizing for LLM evaluation metrics rather than genuine insight and readability ("goodharting") will receive penalties up to 100% of their score.
  • Limit of 3 submissions per participant.
  • In exceptional circumstances (e.g., if we receive >100 submissions from bots), we reserve the right to change the resolution system.
  • Note that the deadline is in 2 weeks!

Support & Feedback

If you’d like feedback or want to discuss possible ideas, please reach out via direct message or email! We also have a QURI Discord for relevant discussion.


Appendix: Evaluation Rubric and Prompts

Rubric

| Name | Judge | Percent of Score |
|---|---|---|
| Surprise | LLM | 40% |
| Importance | LLM | 20% |
| Robustness | LLM | 20% |
| Model Quality | LLM | 20% |
| Goodharting Penalty | QURI Team | Up to –100%* |

*Penalties reduce the total score.


Surprise

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of how surprising the key findings or conclusions of this model are to members of the rationalist and effective altruism communities. In your assessment, consider the following:

  • Contradiction of Expectations: Do the results challenge widely held beliefs, intuitive assumptions, or established theories within the communities?
  • Counterintuitiveness: Are the findings non-obvious or do they reveal hidden complexities that are not immediately apparent?
  • Discovery of Unknowns: Does the model uncover previously unrecognized issues, opportunities, or risks?
  • Magnitude of Difference: How significant is the deviation of the model's results from common expectations or prior studies?

Please provide specific details or examples that illustrate the surprising aspects of the findings. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Surprising'
  • 10 indicates 'Highly Surprising'

Judge on a curve, where a 5 represents the median expectation.


Topic Relevance

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the importance of the model's subject matter to the rationalist and effective altruism communities. In your evaluation, consider the following:

  • Relevance: How directly does the model address issues, challenges, or questions that are central to the interests and goals of these communities?
  • Impact Potential: To what extent could the findings influence decision-making, policy, or priority-setting within the communities?

Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Important'
  • 10 indicates 'Highly Important'

Judge on a curve, where a 5 represents the median expectation.


Robustness

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the robustness of the model's key findings. In your evaluation, consider the following factors:

  • Sensitivity to Assumptions: How dependent are the results on specific assumptions, parameters, or data inputs? Would reasonable changes to these significantly alter the conclusions?
  • Evidence Base: How strong and reliable is the data supporting the model? Are the data sources credible and up-to-date?
  • Methodological Rigor: Does the model use sound reasoning and appropriate methods? Are potential biases or limitations acknowledged and addressed?
  • Consensus of Assumptions: To what extent are the underlying assumptions accepted within the rationalist and effective altruism communities?

Provide a detailed justification, citing specific aspects of the model that contribute to its robustness or lack thereof. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Robust'
  • 10 indicates 'Highly Robust'

Judge on a curve, where a 5 represents the median expectation.


Model Quality

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the model's quality, focusing on both its construction and presentation. Consider the following elements:

  • Comprehensiveness: Does the model account for all key factors and variables relevant to the problem it addresses?
  • Data Integration: Are data sources appropriately selected and accurately integrated? Is there evidence of data validation or cross-referencing with established studies?
  • Clarity of Assumptions: Are the model's assumptions clearly stated, justified, and reasonable? Does the model distinguish between empirical data and speculative inputs?
  • Transparency and Replicability: Is the modeling process transparent enough that others could replicate or audit the results? Are the methodologies and calculations well-documented?
  • Logical Consistency: Does the model follow a logical structure, with coherent reasoning leading from premises to conclusions?
  • Communication: Are the findings and their significance clearly communicated? Does the model include summaries, visual aids (e.g., charts, graphs), or other tools to enhance understanding?
  • Practical Relevance: Does the model provide actionable insights or recommendations? Is it practical for use by stakeholders in the community?

Please provide specific observations and examples to support your evaluation. Assign a rating from 0 to 10, where:

  • 0 indicates 'Poor Quality'
  • 10 indicates 'Excellent Quality'

Judge on a curve, where a 5 represents the median expectation.

“Goodharting” Penalties

We’ll add penalties if submissions appear to have Goodharted on the above metrics: for example, if an entry used prompt injection or similar tactics to influence the AI assessments, or if a model is hard for humans to understand but still scored well in these evaluations. These penalties will typically be between 10% and 40%, but may go higher in extreme situations. We’ll aim to choose a penalty that exceeds the gains a submission received from these behaviors.

Comments

Do you have examples of LLMs improving Fermi estimates? I've found it hard to get any kind of credences at all out of them, let alone convincing ones.

I find that much of the challenge of making Fermi estimates lies in creating early models and coming up with various ways to parameterize things. LLMs have been very good at this, in my opinion.
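
For example, here's a hypothetical sketch of two parameterizations of the same quantity, the kind of decomposition an LLM can generate quickly (all intervals are illustrative assumptions, not researched data):

```
// Two ways to parameterize "piano tunings per year in the UK"

// Bottom-up, from the number of pianos
pianosInUK = 500k to 2M
tuningsPerPianoPerYear = 0.2 to 1
bottomUp = pianosInUK * tuningsPerPianoPerYear

// Top-down, from the number of tuners
tunersInUK = 300 to 1500
tuningsPerTunerPerYear = 400 to 1500
topDown = tunersInUK * tuningsPerTunerPerYear

// If the two decompositions disagree wildly, revisit the assumptions
{bottomUp: bottomUp, topDown: topDown}
```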

I wrote more in the "How good is it?" section of the Squiggle AI blog post.

https://forum.effectivealtruism.org/posts/jJ4pn3qvBopkEvGXb/introducing-squiggle-ai#How_Good_Is_It_

We don't yet have quantitative measures of output quality, partly due to the challenge of establishing ground truth for cost-effectiveness estimates. However, we do have a variety of qualitative results.

Early Use

As the primary user, I (Ozzie) have seen dramatic improvements in efficiency: model creation time has dropped from 2-3 hours to 10-30 minutes. For quick gut checks, I often find the raw AI outputs informative enough to use without editing.

Our three Squiggle workshops (around 20 total attendees) have shown encouraging results, with participants strongly preferring Squiggle AI over writing code manually. Early adoption has been modest but promising: in recent months, 30 users outside our team have run a total of 168 workflows.

Accuracy Considerations

As with most LLM systems, Squiggle AI tends toward overconfidence and may miss crucial factors. We recommend treating its outputs as starting points rather than definitive analyses. The tool works best for quick sanity checks and initial model drafts.

Current Limitations

Several technical constraints affect usage:

  • Code length is soft-capped at 200 lines
  • Frequent workflow stalls from rate limits or API balance issues
  • Auto-generated documentation is decent but has gaps, particularly in outputting plots and diagrams

While slower and more expensive than single LLM queries, Squiggle AI provides more comprehensive and structured output, making it valuable for users who want detailed, adjustable, and documentable reasoning behind their estimates.

You can also bet on how many participants this contest will get, here:
https://manifold.markets/OzzieGooen/number-of-applicants-for-the-300-fe
