
Summary

Task: Make an interesting and informative Fermi estimate
Prize: $300 for the top entry
Deadline: February 16th, 2025
Results Announcement: By March 1st, 2025
Judges: Claude 3.5 Sonnet, the QURI team

Motivation

LLMs have recently made it significantly easier to make Fermi estimates. You can chat with most LLMs directly, or you can use custom tools like Squiggle AI. And yet, overall, few people have taken much advantage of this. 

We at QURI are launching a competition to encourage exploration.

What We’re Looking For

Our goal is to discover creative ways to use AI for Fermi estimation. We're more excited about novel approaches than exhaustively researched calculations. Rather than spending hours gathering statistics or building complex spreadsheets, we encourage you to:

  • Let AI do most of the heavy lifting
  • Try unconventional estimation techniques
  • Experiment with multiple approaches to find surprising insights

The ideal submission might be as simple as a particularly clever prompt paired with the right AI tool. Don't feel pressured to spend days on your entry; a creative insight could win even if it takes just 20 minutes to develop.

Task

Create and submit an interesting Fermi estimate. Entries will be judged using Claude 3.5 Sonnet (with three runs averaged) based on four main criteria (a sketch of how these combine follows the list):

  • Surprise (40%): How unexpected/novel are the findings?
  • Topic Relevance (20%): Relevance to rationalist/EA communities
  • Robustness (20%): Reliability of methodology and assumptions
  • Model Quality (20%): Technical execution and presentation
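
Here's a minimal sketch of that aggregation, written in Squiggle. The input scores are made up for illustration, and the exact combination procedure shown is a plain reading of the rubric rather than a formal spec:

```
// Hypothetical scores (0-10): three Claude 3.5 Sonnet runs averaged per criterion
surprise = (7 + 6 + 8) / 3
topicRelevance = (5 + 5 + 6) / 3
robustness = (6 + 7 + 6) / 3
modelQuality = (7 + 7 + 8) / 3

// Weighted total per the criteria above
rawScore = 0.4 * surprise + 0.2 * topicRelevance + 0.2 * robustness + 0.2 * modelQuality

// Goodharting penalties (assessed by the QURI team) can remove up to 100% of the score
goodhartingPenalty = 0 // illustrative; see "Goodharting" Penalties in the appendix
finalScore = rawScore * (1 - goodhartingPenalty)
finalScore
```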

You aren't required to use AI tools to generate your estimates, but we expect them to help.

Submission Format

Post your entry as a comment to this post, containing:

  1. Model: The complete model content (text or link to accessible document)
  2. Summary: Brief explanation of why your estimate is interesting/novel, and any surprising results or insights discovered
  3. Technique: Brief explanation of what tools and techniques you used to create the estimate. If you primarily used one LLM or AI tool, the name of the tool is fine.

Examples

Our previous post on Squiggle AI discussed several interesting AI-generated models. You can also see many results on SquiggleHub and Guesstimate.
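
If you haven't used Squiggle before, here's a minimal sketch of the classic piano-tuners Fermi estimate; every interval below is an illustrative 90% confidence interval, not researched data:

```
// Classic Fermi estimate: how many piano tuners work in Chicago?
chicagoPopulation = 2.5M to 3M
pianosPerPerson = 0.002 to 0.01
tuningsPerPianoPerYear = 0.2 to 1
tuningsPerTunerPerYear = 400 to 1500
pianoTuners = chicagoPopulation * pianosPerPerson * tuningsPerPianoPerYear / tuningsPerTunerPerYear
pianoTuners
```

The `to` operator turns a 90% confidence interval into a distribution, so the output is a distribution over counts rather than a point estimate.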

Important Notes

  • Content must be easily copyable for LLM evaluation.
  • Models must be less than 5,000 words total. We expect most to be in the range of 100 to 500 words.
  • Submissions that appear to be optimizing for LLM evaluation metrics rather than genuine insight and readability ("goodharting") will receive penalties up to 100% of their score.
  • Limit of 3 submissions per participant.
  • In exceptional circumstances (e.g., if we receive >100 submissions from bots), we reserve the right to change the resolution system.
  • Note that the deadline is in 2 weeks!

Support & Feedback

If you’d like feedback or want to discuss possible ideas, please reach out via direct message or email! We also have a QURI Discord for relevant discussion.


Appendix: Evaluation Rubric and Prompts

Rubric

| Name | Judge | Percent of Score |
|---|---|---|
| Surprise | LLM | 40% |
| Importance | LLM | 20% |
| Robustness | LLM | 20% |
| Model Quality | LLM | 20% |
| Goodharting Penalty | QURI Team | Up to –100%* |

*Penalties reduce the total score.


Surprise

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of how surprising the key findings or conclusions of this model are to members of the rationalist and effective altruism communities. In your assessment, consider the following:

  • Contradiction of Expectations: Do the results challenge widely held beliefs, intuitive assumptions, or established theories within the communities?
  • Counterintuitiveness: Are the findings non-obvious or do they reveal hidden complexities that are not immediately apparent?
  • Discovery of Unknowns: Does the model uncover previously unrecognized issues, opportunities, or risks?
  • Magnitude of Difference: How significant is the deviation of the model's results from common expectations or prior studies?

Please provide specific details or examples that illustrate the surprising aspects of the findings. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Surprising'
  • 10 indicates 'Highly Surprising'

Judge on a curve, where a 5 represents the median expectation.


Topic Relevance

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the importance of the model's subject matter to the rationalist and effective altruism communities. In your evaluation, consider the following:

  • Relevance: How directly does the model address issues, challenges, or questions that are central to the interests and goals of these communities?
  • Impact Potential: To what extent could the findings influence decision-making, policy, or priority-setting within the communities?

Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Important'
  • 10 indicates 'Highly Important'

Judge on a curve, where a 5 represents the median expectation.


Robustness

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the robustness of the model's key findings. In your evaluation, consider the following factors:

  • Sensitivity to Assumptions: How dependent are the results on specific assumptions, parameters, or data inputs? Would reasonable changes to these significantly alter the conclusions?
  • Evidence Base: How strong and reliable is the data supporting the model? Are the data sources credible and up-to-date?
  • Methodological Rigor: Does the model use sound reasoning and appropriate methods? Are potential biases or limitations acknowledged and addressed?
  • Consensus of Assumptions: To what extent are the underlying assumptions accepted within the rationalist and effective altruism communities?

Provide a detailed justification, citing specific aspects of the model that contribute to its robustness or lack thereof. Assign a rating from 0 to 10, where:

  • 0 indicates 'Not Robust'
  • 10 indicates 'Highly Robust'

Judge on a curve, where a 5 represents the median expectation.


Model Quality

Prompt:

You are an expert Fermi model evaluator with extensive experience in constructing and analyzing models.

Please provide a numeric score of the model's quality, focusing on both its construction and presentation. Consider the following elements:

  • Comprehensiveness: Does the model account for all key factors and variables relevant to the problem it addresses?
  • Data Integration: Are data sources appropriately selected and accurately integrated? Is there evidence of data validation or cross-referencing with established studies?
  • Clarity of Assumptions: Are the model's assumptions clearly stated, justified, and reasonable? Does the model distinguish between empirical data and speculative inputs?
  • Transparency and Replicability: Is the modeling process transparent enough that others could replicate or audit the results? Are the methodologies and calculations well-documented?
  • Logical Consistency: Does the model follow a logical structure, with coherent reasoning leading from premises to conclusions?
  • Communication: Are the findings and their significance clearly communicated? Does the model include summaries, visual aids (e.g., charts, graphs), or other tools to enhance understanding?
  • Practical Relevance: Does the model provide actionable insights or recommendations? Is it practical for use by stakeholders in the community?

Please provide specific observations and examples to support your evaluation. Assign a rating from 0 to 10, where:

  • 0 indicates 'Poor Quality'
  • 10 indicates 'Excellent Quality'

Judge on a curve, where a 5 represents the median expectation.

“Goodharting” Penalties

We’ll add penalties if submissions appear to have Goodharted on the above metrics: for example, if an entry used prompt injection or similar tactics to influence the AI assessments, or if a model is hard for humans to understand but still scored well in these evaluations. These penalties will typically be between 10% and 40%, but may go higher in extreme situations. We’ll aim to choose a penalty that exceeds the gains a submission received from these behaviors.

Comments

Do you have examples of LLMs improving Fermi estimates? I've found it hard to get any kind of credences at all out of them, let alone convincing ones.

I find that much of the challenge of making Fermi estimates lies in creating early models and coming up with various ways to parameterize things. LLMs have been very good at this, in my opinion.
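
For example, here's a hypothetical sketch of two parameterizations of the same quantity, the kind of decomposition an LLM can generate quickly (all intervals are illustrative assumptions, not researched data):

```
// Two ways to parameterize "piano tunings per year in the UK"

// Bottom-up, from the number of pianos
pianosInUK = 500k to 2M
tuningsPerPianoPerYear = 0.2 to 1
bottomUp = pianosInUK * tuningsPerPianoPerYear

// Top-down, from the number of tuners
tunersInUK = 300 to 1500
tuningsPerTunerPerYear = 400 to 1500
topDown = tunersInUK * tuningsPerTunerPerYear

// If the two decompositions disagree wildly, revisit the assumptions
{bottomUp: bottomUp, topDown: topDown}
```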

I wrote more in the "How good is it?" section of the Squiggle AI blog post.

https://forum.effectivealtruism.org/posts/jJ4pn3qvBopkEvGXb/introducing-squiggle-ai#How_Good_Is_It_

We don't yet have quantitative measures of output quality, partly due to the challenge of establishing ground truth for cost-effectiveness estimates. However, we do have a variety of qualitative results.

Early Use

As the primary user, I (Ozzie) have seen dramatic improvements in efficiency: model creation time has dropped from 2-3 hours to 10-30 minutes. For quick gut checks, I often find the raw AI outputs informative enough to use without editing.

Our three Squiggle workshops (around 20 total attendees) have shown encouraging results, with participants strongly preferring Squiggle AI over writing code manually. Early adoption has been modest but promising: in recent months, 30 users outside our team have run a total of 168 workflows.

Accuracy Considerations

As with most LLM systems, Squiggle AI tends toward overconfidence and may miss crucial factors. We recommend treating its outputs as starting points rather than definitive analyses. The tool works best for quick sanity checks and initial model drafts.

Current Limitations

Several technical constraints affect usage:

  • Code length is soft-capped at 200 lines
  • Frequent workflow stalls from rate limits or API balance issues
  • Auto-generated documentation is decent but has gaps, particularly in outputting plots and diagrams

While slower and more expensive than single LLM queries, Squiggle AI provides more comprehensive and structured output, making it valuable for users who want detailed, adjustable, and documentable reasoning behind their estimates.

You can also bet on how many participants this contest will get, here:
https://manifold.markets/OzzieGooen/number-of-applicants-for-the-300-fe
