Using Points to Rate Different Kinds of Evidence

Ozzie Gooen

Epistemic Status: Briefly written. The specific equation here captures my quick intuition - this is meant primarily as a demonstration.

There’s a lot of discussion on the EA Forum and LessWrong about epistemics, evidence, and updating.

I don’t know of many attempts at formalizing our thinking here into concrete tables or equations. Here is one (very rough and simplistic) attempt. I’d be excited to see much better versions.

Equation

Initial Points

Scientific Evidence

20 - A simple math proof proves X

8 - A published scientific study in Economics supporting X

6 - A published scientific study in Psychology supporting X

Market Prediction

14 - Popular stock markets strongly suggest X

11 - Prediction markets claim X, with 20 equivalent hours of research

10 - A poll shows that 90% of LessWrong believe X

6 - Prediction markets claim X, with one equivalent hour of research

Expert Opinion

8 - An esteemed academic believes X, where it’s directly in their line of work

6 - The author has strong emotions about X

Reasoning

6 - There's a (20-100 node) numeric model that shows X

5 - A reasonable analogy between X and something clearly good/bad

4 - A long-standing proverb

Personal Accounts

5 - The author claims a long personal history that demonstrates X

3 - Someone in the world has strong emotions about X

2 - A clever remark, meme, or tweet

2.3 - An insanely clever remark, meme, or tweet

0 - Believing X is claimed to be personally beneficial

Tradition / Use

12 - Top businesses act as if X

8 - A long-standing social tradition about X

5 - A single statistic about X

Point Modifiers

Is this similar to existing evidence?
Subtract the overlap with existing evidence from the points this evidence would otherwise add. This will likely remove most of its value.

Is it convenient for the source to believe or say X?
-10% to -90%

Is there a lot of money or effort put behind spreading this evidence? For example, as an advertising campaign?
+5% to +40%

How credible is the author or source?
-100% to +30%

Do we suspect the source is goodharting on this scale?
-20%
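
As a minimal sketch of how the modifiers above might be applied to an initial score, the Python below scales the base points by each percentage adjustment. The function name `adjusted_points`, its parameter names, the choice to combine modifiers multiplicatively, and the treatment of similarity as a 0-to-1 overlap fraction are all assumptions made for illustration; nothing above specifies how the modifiers combine.

```python
def adjusted_points(base, similarity=0.0, convenience=0.0, promotion=0.0,
                    credibility=0.0, goodharting=False):
    """Apply the point modifiers to a base score (hypothetical helper).

    similarity:  assumed overlap with existing evidence, 0.0 to 1.0
    convenience: 0.0, or -0.10 to -0.90 if X is convenient for the source
    promotion:   0.0, or +0.05 to +0.40 if money or effort is behind spreading it
    credibility: -1.00 to +0.30 for the author or source
    goodharting: True applies the flat -20%
    """
    # Assumption: modifiers combine multiplicatively, so a -100% credibility
    # score zeroes the evidence and the result never goes negative.
    points = base * (1.0 - similarity)
    points *= 1.0 + convenience
    points *= 1.0 + promotion
    points *= 1.0 + credibility
    if goodharting:
        points *= 0.8
    return points


# A published Economics study (8 points) from a source that finds X
# somewhat convenient (-30%) but is slightly more credible than usual (+10%).
print(round(adjusted_points(8, convenience=-0.30, credibility=0.10), 2))
```

Under these assumptions the study ends up a little above 6 points. Adding the percentages together instead of multiplying them would be an equally defensible choice and would give somewhat different numbers.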

Points, In Practice

Evidence Points, as outlined, are not trying to mimic mathematical bits of information or another clean existing unit. I attempted to find a compromise between accuracy and ease of use.

Is this too complicated and speculative?

As long-time readers will know, I’m a big fan of attempting to measure highly speculative concepts. My guess is that explicit, speculative models are often preferable to standard text discussions. There’s a potential danger that some people might over-trust these numbers because they are numbers. However, the alternative to modeling is often “lots of blog posts with different undefined ontologies and tons of misunderstanding,” so I think this is often a reasonable tradeoff.

One great thing about models is that you can improve them. As we get more evidence and opinions, I’m hopeful that eventually, models emerge that wind up being pretty okay. If you kill mediocre attempts, you likely eventually kill decent ones, too.

Future Work

This is basic for now, but I think it illustrates a worthy goal. Potential future work (for someone, likely not us) would include:

If you’re reading this, post your own list! It would be good to get thoughts there.
Organize surveys in which different groups assign points to these kinds of evidence.
Instead of specific points, use probability densities for each. Even better would be functions - for example, if survey data is used, there could be a function that takes in the number and quality of the respondents and outputs a corresponding point value.
Use ML to come up with the algorithm. Its algorithm might be very complicated, but it could be helpful even if it were a black box.
Assign points to an extensive list of concrete examples of evidence. For example, “How many points of evidence do you think Tweet X provides for claim Y?”
Have people forecast what experts will think, maybe using something like Squiggle.
As you have functions that people (generally) trust, use automatic evaluations on evidence. On websites, display these points wherever evidence is presented. Reward people/analysts as a function of the points they have discovered.
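
To make the "functions instead of fixed points" idea concrete, here is a rough Python sketch of a function that takes the number and quality of survey respondents and returns a point value. Everything in it is a made-up placeholder: the name `survey_points`, the 0-to-1 `quality` score, the `max_points` ceiling of 10, and the `half_n` constant (the sample size at which half the ceiling is reached) are assumptions, not values proposed in this post.

```python
def survey_points(n_respondents, quality, max_points=10.0, half_n=100):
    """Map a survey's size and respondent quality to evidence points.

    n_respondents: number of people surveyed
    quality:       assumed 0.0 to 1.0 rating of how informed respondents are
    max_points:    hypothetical ceiling for any survey
    half_n:        hypothetical sample size that earns half of the ceiling
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    if n_respondents <= 0:
        return 0.0
    # Saturating curve: each extra respondent adds less than the one before.
    saturation = n_respondents / (n_respondents + half_n)
    return max_points * quality * saturation


# Compare a small informed group with a large, less informed one.
print(survey_points(50, quality=0.9))
print(survey_points(5000, quality=0.4))
```

The saturating shape is one arbitrary choice among many. The same interface could just as well return a probability density over points instead of a single number.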

Thanks to Nuño Sempere for discussion on this topic. You can also see a short attempt by him on something similar here.

Afterword: Quick Attempts by LLMs

I fed this post to Claude and ChatGPT to have them provide simple takes.

Claude

Scientific Evidence

Meta-analysis of multiple randomized controlled trials (n>10,000) - 10 points
Large randomized controlled trial (n>1,000) - 9 points
Observational or cohort study (n>100,000) - 8 points
Smaller randomized trial (n<1,000) - 7 points
Data from international space agency - 7 points
Government census data - 6 points
Survey by reputable organization (n>1,000) - 6 points
Computational simulation model - 5 points
Observational study (n<100,000) - 5 points
Survey (n<1,000) - 4 points

Expert Opinion

Consensus of leading experts in field - 8 points
Opinion of renowned thought leader in field - 7 points
Tenured academic in field - 6 points
Industry practitioner with 10+ years experience - 5 points

Reasoning

Simple logical deduction from accepted premises - 5 points
Complex logical reasoning with multiple steps - 4 points
Common sense heuristic - 3 points
Speculative theory - 2 points

Records

Government records - 6 points
Historical documents - 5 points
Personal diary/journal - 3 points
Unverified personal account - 2 points
Anonymous internet claim - 1 point

Culture

Widespread social tradition - 5 points
Popular opinion or "conventional wisdom" - 4 points
Commentary in major newspaper - 3 points
Remark from comedian/entertainer - 2 points
Meme or viral tweet - 1 point

GPT-4

Experimental Evidence (Evidence from controlled experiments or scientific studies)

Randomized Controlled Trial (RCT) results: 25 points
Meta-analysis of multiple RCTs: 23 points
Laboratory experiment results: 20 points
Observational study with a large sample size: 15 points
Case-control study: 12 points

Empirical Evidence (Evidence based on observation or experience)

Longitudinal study: 14 points
Cross-sectional study: 12 points
Field observations by non-experts: 9 points
Informal surveys: 7 points
Anecdotal observations with potential biases: 5 points

Expert Testimony and Opinion

Consensus of experts in the field: 20 points
Peer-reviewed publication: 19 points
Opinion of a recognized expert in the field: 16 points
Opinion of a professional not specifically in the field: 9 points
Opinion of a well-read layperson: 6 points

Anecdotal and Personal Accounts

Detailed case study: 10 points
Personal testimonies verified by third-party: 8 points
Unverified personal accounts: 5 points
Hearsay or second-hand accounts: 4 points
Popular myths or urban legends: 3 points

Historical and Archival Evidence

Primary historical documents (e.g., treaties, letters): 18 points
Secondary sources or historical analysis: 15 points
Oral histories or traditions: 9 points
Conspiracy theories based on historical events: 3 points

Logical and Theoretical Evidence

Proven mathematical or logical theorem: 22 points
Philosophical argument based on accepted premises: 13 points
Hypothetical scenarios or thought experiments: 9 points
Personal reasoning without validation: 6 points

Public Opinion and Mass Media

Large-scale, representative opinion poll: 14 points
News reports from lesser-known outlets: 8 points
Opinion pieces or editorials: 7 points
Rumors or unverified news: 4 points
Social media trends or viral content: 5 points
Individual's post or comment on social media: 2 points

Miscellaneous

Personal intuition or gut feeling: 4 points
Popular sayings or proverbs: 6 points
Unverified quotes: 5 points
Popular memes or jokes: 3 points

This system offers a broader range of evidence types, especially those under 10 points, to capture the nuances of everyday information sources.

david_reinstein (Aug 28 2023)

Could you explain this a bit more? It's very shorthand, and hard to know what you are doing and what you are asking us. I think I have a lot of comments, but for most of them I worry I might be missing the point.

E.g., a math proof is something different than scientific evidence, and it generally applies in different domains. If I have confidence in the proof itself (i.e., proof not in error), that would make any other evidence moot. However, in most relevant cases the 'math proof' is a proof of something that is only a very simplified model of the question at hand.

Aryeh Englander (Aug 25 2023)

This is great! Just wanted to mention that this kind of weighting approach works very well with the recent post A Model-based Approach to AI Existential Risk, by Sammy Martin, Lonnie Chrisman, and myself, particularly the section on Combining inside and outside view arguments. Excited to see more work in this area!

Ozzie Gooen (Aug 25 2023)

Quick flag/reminder that I'd be interested in comments from others here - try giving your own scores, or flag things that you think are wrong about mine (or an LLM's).

titotal (Aug 26 2023)

I think this is a reasonable exercise in the abstract, and could help people more easily communicate how they approach different forms of evidence.

However, if actually implemented practically, I think it would be too easily gamed to be of any use. Using your system as an example, if person A has a mathematical proof of X (20 points), but person B makes 11 clever tweets suggesting not x (2*11 = 22 points), then person B "wins" the argument.

The other problem I see is that there's no modifier here for "actually being correct". If person A presents a correct mathematical proof for X, and person B presents a mathematical proof for not X that is actually false, do they both get 20 points?

Chris Kerr (Sep 7 2023)

If you check the proofs yourself and you can see that one is obviously wrong and the other is not obviously (to you) wrong then you only give the not-obviously-wrong one 20 points. If you can't tell which is wrong then they cancel out. If a professor then comes along and says "that proof is wrong, because [reason that you can't understand], but the other one is OK" then epistemically it boils down to "tenured academic in field - 6 points" for the proof that the professor says is OK.

Ozzie Gooen (Aug 26 2023)

This equation was definitely meant as a rough initial guide. I think it's still usable as a heuristic - i.e. most of the time, you pay attention to higher point evidence than lower point evidence. It's meant to be better than other heuristics, not a complete solution.

if person A has a mathematical proof of X (20 points), but person B makes 11 clever tweets suggesting not x (2*11 = 22 points), then person B "wins" the argument.

I didn't get into adding evidence, for this reason. I think it's very clear that things are not linearly-additive like that. I think that an aggregation function would take into account the similarity of different sorts of content (two tweets that are clever, but near-identical), but also the similarity of the types of content (it's better to have a diverse set of different kinds of content, like a meta-study and "businesses commonly use it"). There would be quick leveling off - so that 50 tweets would have the evidence strength of something like 2 to 5 or so.

The other problem I see is that there's no modifier here for "actually being correct".

I thought this was fairly obvious to add. Again, I think this would need a lot more complexity, depending on how much you actually rely on it.
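
One way to picture the quick leveling-off described in this reply is the Python sketch below. It handles only the simplest case, a pile of near-identical items of the same type, and it uses geometric decay as a stand-in aggregation rule. The function name `aggregate_similar` and the `decay` value of 0.5 are assumptions picked so the numbers land in the range mentioned above; this is not a worked-out aggregation function and it ignores similarity across different types of content.

```python
def aggregate_similar(points, decay=0.5):
    """Combine points from near-identical pieces of evidence.

    The strongest item counts in full, the next counts at `decay`,
    the next at `decay` squared, and so on (a hypothetical rule).
    """
    ranked = sorted(points, reverse=True)
    return sum(p * decay ** k for k, p in enumerate(ranked))


proof = aggregate_similar([20])          # one math proof
eleven_tweets = aggregate_similar([2] * 11)
fifty_tweets = aggregate_similar([2] * 50)

print(proof, eleven_tweets, fifty_tweets)
print(eleven_tweets < proof)             # the tweets no longer "win"
```

With this rule, any number of 2-point tweets stays below 4 points in total, so 11 clever tweets do not outweigh a 20-point proof.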
