If I can modus tollens this modus ponens, it feels to me that
"Indeed, even for 100 questions [...] this would come up as significant less than 50% of the time"
is evidence that the noise level is low, and the skill difference is small.
E.g., taking the top 20 forecasters in Metaculus' last Quarterly Cup, we see average score differences of ~0.05 (equivalent to your highest noise level), and that's among the very top forecasters we had in that tournament!