Think about letting players play some kind of mini-game before they are admitted to the arena. The games vary in difficulty and allow judging skill; they must also be designed so that it is hard to toggle cheats on and off selectively in real time, and they need to include near-impossible tasks. This may be too complex once latency is taken into account, but simply comparing statistics of how players act in those contexts could be interesting. Spinning it further: let players play the actual tournament in groups matched to their estimated skill, and relate their mini-game behavior to their arena behavior. This could also be done 'lazily' on top of past data sets, and is not limited to classic methods.
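The mini-game vs. arena comparison could be sketched as a simple outlier check, assuming per-player scores from both contexts are available. A minimal sketch (all names, scores, and thresholds below are illustrative assumptions, not an existing system):

```python
# Sketch: flag players whose arena performance is far above what their
# mini-game skill estimate would predict. A large positive gap may indicate
# cheats that are enabled in the arena but off during the mini-games.
from statistics import mean, stdev

def flag_suspects(players, z_threshold=2.0):
    # players: list of (name, minigame_score, arena_score) tuples
    suspects = []
    for i, (name, mini, arena) in enumerate(players):
        # Compare this player's (arena - minigame) gap with everyone else's
        # (leave-one-out, so an outlier does not inflate its own baseline).
        others = [a - m for j, (_, m, a) in enumerate(players) if j != i]
        mu, sigma = mean(others), stdev(others)
        if sigma == 0:
            continue  # no spread to compare against
        z = ((arena - mini) - mu) / sigma
        if z > z_threshold:  # arena play far above the mini-game prediction
            suspects.append(name)
    return suspects

honest = [("a", 50, 52), ("b", 60, 58), ("d", 45, 46),
          ("e", 55, 54), ("f", 30, 30)]
print(flag_suspects(honest))                    # no suspects
print(flag_suspects(honest + [("c", 40, 90)]))  # "c" stands out
```

A z-score is only one option; in practice a robust statistic (median/MAD) or a model trained on past tournament data would likely be needed, since a single naive threshold is easy to play around.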
This kind of mini-game only makes sense if it is easy to replace, so that cheating clients can't adapt too easily.