Spotting a cherry-picked paper

Epistemic status: quite confident that the things said here are true, though they are not the entire story. Before you argue that a claim seems to lack nuance or that I haven't addressed something, please refer to the nuance section.

At the moment, reinforcement learning (RL) is a primarily empirical field, with a wide gap between empirical performance and theoretical guarantees. Furthermore, a large number of participants in the field are tool-driven rather than theory-driven: they read papers to find the best new algorithm that they can apply to their problem. The primary consideration in reading a paper is therefore: will the proposed algorithm work better on my problem than prior algorithms? Two concerns arise for a practitioner from this question:

Are the results in the paper true?
Are the results in the paper likely to generalize to my problem?

Here I will try to formulate the heuristics that I personally use to answer the question: how can I decide if a paper is true and worth building upon or using? This is not as easy as it might look at first glance, because:

Fields that use techniques with tons of hyperparameters do not currently have consistent standards of evidence. We do not yet have a good way to assess whether the results of a paper might flip in the other direction if a different set of hyperparameters were used, or whether there might be some lovely portion of hyperparameter space, as yet unfound, that would make a baseline work better.
The results are task-dependent. While there are theoretical measures of the "hardness" of an RL task, they generally cannot be easily applied to determine whether an algorithm that works for one task might fail for another.
We know that a fair number of fields experience reproducibility crises, so it's fairly likely that our field is experiencing one as well.
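
To make the first concern concrete, here is a minimal sketch of how a "best config versus best config" comparison can hide the fact that the ordering flips across most of the hyperparameter grid. Everything in it is an assumption for illustration: the scores are made up, the learning-rate grid is arbitrary, and `win_fraction` is a hypothetical helper, not a standard metric.

```python
from itertools import product


def win_fraction(scores_new, scores_base):
    """Fraction of (new config, baseline config) pairs where the new method scores higher."""
    pairs = list(product(scores_new.values(), scores_base.values()))
    wins = sum(1 for new, base in pairs if new > base)
    return wins / len(pairs)


# Made-up final scores keyed by learning rate (illustrative only).
new_method = {1e-4: 50.0, 3e-4: 90.0, 1e-3: 40.0}
baseline = {1e-4: 60.0, 3e-4: 70.0, 1e-3: 85.0}

# The headline comparison a paper might report: best config against best config.
best_vs_best = max(new_method.values()) > max(baseline.values())

# How often the new method wins if both configs are picked arbitrarily from the grid.
frac = win_fraction(new_method, baseline)

print(f"new method wins best-vs-best: {best_vs_best}")
print(f"new method wins {frac:.0%} of config pairs")
```

In this toy grid the new method wins the headline comparison but loses most pairings, which is the kind of flip that is hard to rule out from a paper alone.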

Still, despite these difficulties, each of us must forge on and attempt to build a foundation of tools on which to base our work. I suspect that everyone, over time, develops a set of heuristics that they use to separate the ideas worth building on from the chaff. I outline my personal set in the hope of sharing some useful advice with new researchers and, in turn, learning about the heuristics that other researchers use! Note that these are heuristics, so they will inevitably discard some papers that are true and admit some papers that are false.

Heuristics

The heuristics outlined here are things that, if I spot them, make me more suspicious or more trusting of a paper, and correspondingly less or more likely to attempt work that builds upon it.

Separation of results

It is extremely common to see graphs that look like

rather than like

While both can be indicative of a real advance in the field, I'm much more likely to spend time and energy testing the latter on my problems than the former. The former, where the curves are barely separated, makes me suspect that with a slight adjustment of hyperparameters the order of the two curves might reverse, or that the two methods are effectively equivalent. Of course, the same could be true of the latter curve, but that's the problem with heuristics: they're imperfect. In my experience, the technique with the latter curve is much more likely to work out if I apply it to some new applied problem where it hasn't been used before.

To put a fine point on it, this is basically why I've avoided transformers in reinforcement learning except as a way to handle data of inconsistent shape. The gains seem marginal at best and make me suspicious that a lack of care with hyperparameters and baseline tuning is a large source of the gain.
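
When a paper reports per-seed results, one rough way to put a number on "separation" is to bootstrap a confidence interval for each method's mean return and check whether the intervals overlap. The sketch below is only that, a sketch: the returns are made-up numbers, `bootstrap_ci` and `clearly_separated` are hypothetical helper names, and "non-overlapping 95% percentile intervals" is a crude, conservative criterion I am assuming here rather than an established standard.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def clearly_separated(returns_a, returns_b):
    """True if the two bootstrap intervals do not overlap at all."""
    lo_a, hi_a = bootstrap_ci(returns_a, seed=0)
    lo_b, hi_b = bootstrap_ci(returns_b, seed=1)
    return hi_a < lo_b or hi_b < lo_a


# Made-up final returns over 5 seeds (illustrative only).
baseline = [100.0, 110.0, 95.0, 105.0, 102.0]
marginal = [104.0, 112.0, 98.0, 108.0, 101.0]   # slightly better on average
large = [180.0, 195.0, 170.0, 190.0, 185.0]     # far better on every seed

marginal_is_separated = clearly_separated(baseline, marginal)
large_is_separated = clearly_separated(baseline, large)

print("marginal gain separated:", marginal_is_separated)
print("large gain separated:", large_is_separated)
```

With only a handful of seeds the intervals are wide, so a marginal improvement will usually fail this check, which matches the intuition that such results deserve less of my time.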

Consistency of task

RL algorithms are often evaluated by sweeping them over a set of benchmarks and reporting their performance. One thing you might often spot is that a paper will report its performance on some of the tasks of a benchmark while leaving the rest out, or will report results on a totally new set of tasks that the authors have invented for the paper. While this can be totally innocent, it can also mean that the authors attempted to run their algorithm on the full set of tasks or standard benchmarks, discovered that the technique underperformed, and started casting about for some new task where their technique works well.
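
A quick way to apply this heuristic is to write down the full task list of the benchmark and compare it against the tasks the paper actually reports. The sketch below assumes you have both lists by hand; the task names are placeholders rather than a real benchmark, and `task_coverage` is a hypothetical helper.

```python
def task_coverage(benchmark_tasks, reported_tasks):
    """Compare the tasks a paper reports against the full benchmark task list."""
    benchmark = set(benchmark_tasks)
    reported = set(reported_tasks)
    return {
        # Benchmark tasks the paper silently leaves out.
        "missing": sorted(benchmark - reported),
        # Tasks that are not part of the benchmark (possibly invented for the paper).
        "novel": sorted(reported - benchmark),
        # Share of the benchmark that is actually covered.
        "fraction_covered": len(benchmark & reported) / len(benchmark),
    }


# Placeholder task names (illustrative only).
benchmark_tasks = ["task_a", "task_b", "task_c", "task_d", "task_e"]
reported_tasks = ["task_a", "task_c", "custom_task_1"]

coverage = task_coverage(benchmark_tasks, reported_tasks)
print(coverage)
```

A low covered fraction or a long list of novel tasks proves nothing by itself, but it tells me where to look for an explanation in the paper before I invest time in the method.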