@scheibo
Last active February 1, 2023 06:38
5. Stats

Boggle

  • iOS infinite spinner (service worker related?)
  • Update to NWL 2020
  • Fix NaN% full score bug
  • Add "Net" scoring mode
  • Support Super Big Boggle on large screens
  • Add in all dice configurations, though don't use them
  • Don't restore application to Dictionary view (i.e. don't save the view)
  • Finish gathering updated GIFs which reflect up to date UI
  • Host canonically under boggle.scheibo.com with GitHub pages

Stats

pkmn/stats

==Marty commented out these lines https://github.com/Antar1011/Smogon-Usage-Stats/blob/master/batchMovesetCounter.py#L146-L147==

  • can make adjustment: https://www.smogon.com/forums/threads/gen-8-smogon-university-usage-statistics-discussion-thread.3657197/post-8845077

  • Design doc, discuss:

    • list of all possible improvements (address FIXME)
    • anonymizing logs (visitor pattern/logs processing framework)
    • compressed directories (ZIP > tar)
    • database store
    • generate (static) web pages instead of ASCII tables
      • table-able syntax (array of arrays? JSON?), use table and html
    • store only half of encounters/teammates pseudo-symmetric matrix
    • short circuit if weight is zero (but counts?)
    • continuous mode vs. batch/catchup mode
    • split / apply / combine
  • update to reflect Marty's changes

    • Doubles/Other Metagame rises and drops
  • tear out worker infrastructure in favor of an architecture that handles child processes OR worker threads

  • parse out additional koed+switched information required for 'human-readable' stats from moveset.txt

  • abstract out the process script

    • need high level script for setting default args, then let individual workers do the rest. process can call the sub command and have it parse the rest of the args
    • takes common (logs) options and path to worker script, passes additional options to worker
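A rough sketch of that split (the flag names and the `splitArgs` helper are illustrative assumptions, not the actual @pkmn/logs CLI): the process script keeps the flags it recognizes and forwards everything else, unparsed, to the worker script.

```typescript
// Hypothetical: the common (logs) flags the process script itself understands.
const COMMON = new Set(['--checkpoints', '--begin', '--end', '--shard']);

// Split argv into the options process consumes itself and the leftovers
// that get passed through, untouched, to the worker script.
function splitArgs(argv: string[]): {common: string[]; worker: string[]} {
  const common: string[] = [];
  const worker: string[] = [];
  for (const arg of argv) {
    const flag = arg.split('=')[0];
    (COMMON.has(flag) ? common : worker).push(arg);
  }
  return {common, worker};
}
```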
  • track win percentage (TI request for Random Battles)

  • track unique user weights

  • track pre-mega ability

  • process handles 7z/tgz: automatically extract in tmp, delete set of files after checkpoint is finalized

  • checkpoints store config information in case of changes

  • handle N months worth of logs at once

    • take begin and end timestamp (shortcut which allows for 2018-02 to be specified) and only include relevant files
    • create new checkpoint directory each time; if the date range changes and the checkpoints don't then they will be nonsensical
    • just use 2018-02 2018-03 each time and it will only incrementally update the 2018-02 reports
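The YYYY-MM shortcut above could be expanded like this (a minimal sketch; the `monthRange` name is made up): turn '2018-02' into the UTC [begin, end) pair covering that month so only the relevant files get included.

```typescript
// Hypothetical helper: expands a 'YYYY-MM' shorthand into the [begin, end)
// UTC timestamps covering that month.
function monthRange(month: string): {begin: Date; end: Date} {
  const [y, m] = month.split('-').map(Number);
  if (!y || !m || m < 1 || m > 12) throw new Error(`Invalid month: ${month}`);
  // Date.UTC months are 0-indexed; passing m unadjusted rolls over to the
  // first instant of the *next* month, which is exactly the exclusive end.
  return {begin: new Date(Date.UTC(y, m - 1, 1)), end: new Date(Date.UTC(y, m, 1))};
}
```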
  • Stats UI: index.html (+ apache rewrites) to serve pages

  • anon: worker needs to handle rename as well during reduce stage

  • stats sharding logic: input AND output shards

  • shard over cutoff, tag (pull up, instead of push down)

  • for the output shard only: if one is missing, redo the entire checkpoint

  • error if run with different batch size and only one shard is missing

  • shards:

    format/<shard>/day/file.json
    
    prepare: needs to make folders for all the shards
    restore: look for all checkpoints
    deshard: sort, notice if any are missing; if any missing = REDO whole DAY
    
    apply: ONLY GETS format, not shard info?
    
    1. shard = input and output => read log file multiple times :(, much easier
    2. shard => output only = read files once (but at least 4x memory...)
  • how to avoid reading the same log N times

    • OK to read data N times = different parts of data = desirable for report sharding
  • conditional probability tables ('bigram' support) needed for EPOke

    • can arbitrary plugins be added/removed from the process? just specify --plugins=bigrams to process?
  • Leads reports: add N (> 2; 1 good, 1 bad) battles from gen7monotype (same teams) + process tags for teams & one nonexistent

  • adjust stats report testing logic to handle monotype

  • fix memory error in zstd, use for checkpoints


Deliverables

  • design doc completed
  • verify one month of data against Smogon-Usage-Stats
  • rewritten anon on top of @pkmn/protocol
  • all known issues from Smogon-Usage-Stats fixed - no more FIXME
  • bigrams published
  • daily (hourly?) reports
  • Stats UI published
  • human stats published
  • full run across entire corpus (user with most battles?)

  • set cluster: For each analysis set, grab statistics and see how many could match (if spread, convert stats -> spread with assumptions, allow for speed creep/inexact matching)

@pkmn/stats workflows/ can't be privileged at all; they just depend on @pkmn/logs + @pkmn/stats etc. and run. The process script hardcodes logic so that anon falls back to workflows/anon.ts, but otherwise it will look for a file for the worker

  • handle stats debug.ts, config.ts, process script
  • work on stats README.md and DESIGN.md
  1. parse configuration
  2. accept(format: id) => returns (shard, weight); if weight = 0 then drop, otherwise for each batch keyed by shard spin up workers and pass the shard
    for await (const batch of batches()) {
    	for (const shard of shards) {
    		yield [shard, batch];
    	}
    }
  • what is weight doing? weight only works if we go by least loaded, not round robin...
  • don't do round robin, just pass work to whoever is ready for it
  • instead, master just sends workers jobs and it's up to them to decide how many to do at once?
  1. figure out logs that need to be processed, start yield-ing anything that matches accept
  2. workers save checkpoints for a shard and (shard or format shard)

master thread is generating work, passes to least loaded worker
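A minimal sketch of that least-loaded dispatch (the data structures are assumptions, not the actual scheduler): the master tracks outstanding jobs per worker id and hands the next batch to whichever worker currently has the fewest.

```typescript
// Pick the worker id with the fewest outstanding jobs (least loaded).
// `loads` maps worker id -> number of jobs currently in flight.
function leastLoaded(loads: Map<number, number>): number {
  let best = -1;
  let min = Infinity;
  for (const [worker, load] of loads) {
    if (load < min) {
      min = load;
      best = worker;
    }
  }
  if (best < 0) throw new Error('no workers registered');
  return best;
}
```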

problem => want to be unpacking in the most parallel way, i.e. doing multiple formats in parallel, but this means it will take longer before we are done with a format (because other formats are getting done at the same time). Need a --sequentialFormats parameter to ensure all work for a specific format gets done FIRST

  • still need to potentially account for checkpoints having happened
/tmp/checkpoints-2pA7Hjx
  WORKER (up to worker to determine if checkpoints are still valid)
  checkpoints/ - actual checkpoint files
  decompressed/ - decompressed data
  scratch/ - scratch output from worker

if --checkpoints is not passed, directory is created in /tmp and hook installed to delete ALL.

  • Worker is always responsible for cleaning up scratch/
  • checkpoints/ - if really concerned about space these can be deleted once a format is done (turned into marker)
  • decompressed/ - can be deleted when 'done' with them (= all shards have been processed) if concerned about space; otherwise useful for future runs and should mirror a hypothetical PS logs directory

DO NOT CLEAN UP CHECKPOINTS ON EXIT IF PASSED IN

  • changing input affects checkpoints? = should be able to still use if they overlap. WORKER is the main thing which affects checkpoints

  • ONLY WORKER SHOULD MATTER

  • delete checkpoints if:

    • flag not passed in and we created temp
  • worker responsible for cleaning

// 2020-08/gen1ou/2020-08-14/battle-gen1ou-24687621.log.json -> 2020-08-14_gen1ou_24687621
interface Offset {
	day: string; // 2020-08-14
	format: string; // gen1ou
	log: number; // 24687621
}
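The path-to-key mapping in the comment above could be parsed like this (`parseOffset` is an illustrative name, not an existing API):

```typescript
// Hypothetical parser for the layout above:
// '2020-08/gen1ou/2020-08-14/battle-gen1ou-24687621.log.json'
// -> {day: '2020-08-14', format: 'gen1ou', log: 24687621}
function parseOffset(path: string): {day: string; format: string; log: number} {
  const parts = path.split('/');
  const file = parts[3];
  const m = file && /^battle-.+-(\d+)\.log\.json$/.exec(file);
  if (parts.length !== 4 || !m) throw new Error(`Unparseable log path: ${path}`);
  return {day: parts[2], format: parts[1], log: Number(m[1])};
}
```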

// CLEANUP: happens at format level, how to pass up?
interface Batch {
	begin: Offset;
	end: Offset;
}

type AcceptFn = (format: ID) => string[] | undefined;
interface LogStorage {
	// Returns:
	// - offsets to pass to bar
	// - something to delete
	foo(checkpoints: XXX, accept: AcceptFn, begin?: Date, end?: Date): AsyncIterator<>;
	// Return names which can be passed to read
	bar(begin: Offset, end: Offset): AsyncIterator<string[]>;
	read(log: string): Promise<string>;
}

interface CheckpointStorage {
	read(XXX: string): Promise<string>;
	write(checkpoint: Checkpoint): Promise<void>;
}

--constrained mode = only important if concerned about space = run formats sequentially = how? need a completely different architecture to yield earlier..., need a coroutine = wait until a format is done (all shards), turn the checkpoint into a tombstone and delete decompressed data for the format = delete decompressed formats that we don't accept

want to be able to delete a day when done with it. is a day too granular? delete format instead? surely MONTH-format

month-format-day month-format month

ASSUME NOT SPACE LIMITED (can expand particular formats at will)

process(format, shard, batch); // may stretch over many months...

// need some indicator that ALL batches are done to know to call combine on the shard
// if all batches are done, can also know to delete data used by batch...

cleanup and serial formats

{ yield batches; done(); }

yield batches | done: if (Array.isArray(yielded)) it's a batch; else done()

==PROBLEM== format is under year, actually need MONTH-FORMAT; too hard to ever know it's OK to delete?

stream results to workers = only know we can finish a format when all formats are done... eg. might find more format data in future months for a worker, better if FORMAT-MONTH-DAY

SELECT format FROM battles WHERE created_at > begin AND created_at <= end;
SELECT id FROM battles WHERE format = ? AND created_at > begin AND created_at <= end;
SELECT output_log FROM battles WHERE id = ?;

SELECT id FROM battles WHERE format = ? AND id >= ? AND id <= ?; -- only works because id is ASC with time

select(format, begin, end) => ids; read(id);

2 problems:

  1. what if we don't know which formats exist
  2. if compressed, select(format, begin, end) is suboptimal because we need to wait for

with compressed logs

  • if they're compressed in the first place you probably are constrained?
  1. Must be able to store entire working set at once
  2. No cleanups, but wont delete
YYYY-MM
└── format
    └── YYYY-MM-DD
        └── battle-format-N.log.json

BFS to open all YYYY-MM => if 7z you're probably fucked. DFS on formats (fine: YYYY-MM for each format)

can do formats in parallel or serial

  1. signal from split -> master that format is done
  2. signal from master -> split that format can be cleaned up (if constrained)

WITHIN a format = can be parallel across days/months/etc

Offset -> Offset

yield [] yield [] yield {cleanup()} <= this is the signal that the format is done; master then calls done() to clean up

HOW DO SHARDS FIT = each shard might have different batches (has its own checkpoints), need to instead yield (format, shard, (begin, end)) | (format, shard, done())

==TODO== WORKER file needs to handle shards changing in stats work!

  • --shard=tag = split out tags but do all cutoffs at once
  • --shard=cutoff = split out cutoffs but do all tags at once
  • --shard=tag,cutoff = shard out everything (monowater-1500)
  • no sharding = do EVERYTHING at once
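A sketch of parsing that flag (only the --shard flag itself comes from the note above; the parsing code is an assumption): each listed dimension gets split into its own shards, unlisted dimensions are done all at once, and no flag means no sharding.

```typescript
// The dimensions stats work could be sharded over.
type ShardDim = 'tag' | 'cutoff';

// Parse a --shard value like 'tag,cutoff' into the set of dimensions to
// split on; an absent flag means no sharding (do everything at once).
function parseShard(flag?: string): Set<ShardDim> {
  const dims = new Set<ShardDim>();
  if (!flag) return dims;
  for (const dim of flag.split(',')) {
    if (dim !== 'tag' && dim !== 'cutoff') {
      throw new Error(`Unknown shard dimension: ${dim}`);
    }
    dims.add(dim);
  }
  return dims;
}
```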


parse configuration, set up workspace, make sure workspace is compatible with worker, start splitting using LogStorage + CheckpointStorage (= restore) (heavily storage layer dependent)

storage.logs.process => open up all months in parallel and find all month/formats for formats we accept. accept returns true or shards if accepted. => if constrained: process each format serially (though internally days can be processed in parallel), send out a done() for each shard after completion. => if not constrained: process each format all together

processing format => expand format dir, if constrained delete days out of range, process days in parallel

process day => find all files in range (don't bother deleting files out of range). with the range of files: for each shard, get batch from range of files + get checkpoints for shard + day, then process(format, shard, batch)

tasks = Map<string, Promise<void>[]> // keyed by `${format}|${shard}` since object keys break Map lookups


await storage.logs.process(accept, (task) => {
	// Map keys must be primitives for lookups to work, so key by format + shard
	const key = `${task.format}|${task.shard ?? ''}`;
	let remaining = tasks.get(key);
	if (task.done) {
		if (!remaining) {
			task.done();
		} else {
			Promise.all(remaining).then(() =>
				combine(task.format, task.shard).then(() => task.done()));
		}
	} else {
		if (!remaining) {
			remaining = [];
			tasks.set(key, remaining);
		}
		remaining.push(apply(task.format, task.shard, task.batch));
	}
}, begin, end);