@scheibo
Last active February 1, 2023 06:38
5. Stats

Boggle

  • iOS infinite spinner (service worker related?)
  • Update to NWL 2020
  • Fix NaN% full score bug
  • Add "Net" scoring mode
  • Support Super Big Boggle on large screens
  • Add in all dice configurations, though don't use them
  • Don't restore application to Dictionary view (i.e. don't save the view)
  • Finish gathering updated GIFs which reflect up to date UI
  • Host canonically under boggle.scheibo.com with GitHub pages

Stats

pkmn/stats

==Marty commented out these lines https://github.com/Antar1011/Smogon-Usage-Stats/blob/master/batchMovesetCounter.py#L146-L147==

  • can make adjustment: https://www.smogon.com/forums/threads/gen-8-smogon-university-usage-statistics-discussion-thread.3657197/post-8845077

  • Design doc, discuss:

    • list of all possible improvements (address FIXME)
    • anonymizing logs (visitor pattern/logs processing framework)
    • compressed directories (ZIP > tar)
    • database store
    • generate (static) web pages instead of ASCII tables
      • table-able syntax (array of arrays? JSON?), use table and html
    • store only half of encounters/teammates pseudo-symmetric matrix
    • short circuit if weight is zero (but counts?)
    • continuous mode vs. batch/catchup mode
    • split / apply / combine
  • update to reflect Marty's changes

    • Doubles/Other Metagame rises and drops
  • tear out worker infrastructure in favor of an architecture that handles child processes OR worker threads

  • parse out additional koed+switched information required for 'human-readable' stats from moveset.txt

  • abstract out the process script

    • need high level script for setting default args, then let individual workers do the rest. process can call the sub command and have it parse the rest of the args
    • takes common (logs) options and path to worker script, passes additional options to worker
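A rough sketch of that split (the flag names and the `splitArgs` helper are illustrative assumptions, not the actual @pkmn/logs CLI): the process script keeps the flags it recognizes and forwards everything else, unparsed, to the worker script.

```typescript
// Hypothetical: the common (logs) flags the process script itself understands.
const COMMON = new Set(['--checkpoints', '--begin', '--end', '--shard']);

// Split argv into the options process consumes itself and the leftovers
// that get passed through, untouched, to the worker script.
function splitArgs(argv: string[]): {common: string[]; worker: string[]} {
  const common: string[] = [];
  const worker: string[] = [];
  for (const arg of argv) {
    const flag = arg.split('=')[0];
    (COMMON.has(flag) ? common : worker).push(arg);
  }
  return {common, worker};
}
```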
  • track win percentage (TI request for Random Battles)

  • track unique user weights

  • track pre-mega ability

  • process handles 7z/tgz: automatically extract in tmp, delete set of files after checkpoint is finalized

  • checkpoints store config information in case of changes

  • handle N months worth of logs at once

    • take begin and end timestamp (shortcut which allows for 2018-02 to be specified) and only include relevant files
    • create new checkpoint directory each time; if the date range changes and the checkpoints don't then they will be nonsensical
    • just use 2018-02 2018-03 each time and it will only incrementally update the 2018-02 reports
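The YYYY-MM shortcut above could be expanded like this (a minimal sketch; the `monthRange` name is made up): turn '2018-02' into the UTC [begin, end) pair covering that month so only the relevant files get included.

```typescript
// Hypothetical helper: expands a 'YYYY-MM' shorthand into the [begin, end)
// UTC timestamps covering that month.
function monthRange(month: string): {begin: Date; end: Date} {
  const [y, m] = month.split('-').map(Number);
  if (!y || !m || m < 1 || m > 12) throw new Error(`Invalid month: ${month}`);
  // Date.UTC months are 0-indexed; passing m unadjusted rolls over to the
  // first instant of the *next* month, which is exactly the exclusive end.
  return {begin: new Date(Date.UTC(y, m - 1, 1)), end: new Date(Date.UTC(y, m, 1))};
}
```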
  • Stats UI: index.html (+ apache rewrites) to serve pages

  • anon: worker needs to handle rename as well during reduce stage

  • stats sharding logic: input AND output shards

  • shard over cutoff, tag (pull up, instead of push down)

  • for the output shard only: if one is missing, redo the entire checkpoint

  • error if run with different batch size and only one shard is missing

  • shards:

    format/<shard>/day/file.json
    
    prepare: needs to make folders for all the shards
    restore: look for all checkpoints
    deshard: sort, notice if any are missing; if any missing = REDO whole DAY
    
    apply: ONLY GETS format, not shard info?
    
    1. shard = input and output => read log file multiple times :(, much easier
    2. shard => output only = read files once (but at least 4x memory...)
  • how to avoid reading the same log N times

    • OK to read data N times = different parts of data = desirable for report sharding
  • conditional probability tables ('bigram' support) needed for EPOke

    • can arbitrary plugins be added/removed from the process? just specify --plugins=bigrams to process?
  • Leads reports: add N (> 2; 1 good, 1 bad) battles from gen7monotype (same teams) + process tags for teams & one nonexistent

  • adjust stats report testing logic to handle monotype

  • fix memory error in zstd, use for checkpoints


Deliverables

  • design doc completed
  • verify one month of data against Smogon-Usage-Stats
  • rewritten anon on top of @pkmn/protocol
  • all known issues from Smogon-Usage-Stats fixed - no more FIXME
  • bigrams published
  • daily (hourly?) reports
  • Stats UI published
  • human stats published
  • full run across entire corpus (user with most battles?)

  • set cluster: For each analysis set, grab statistics and see how many could match (if spread, convert stats -> spread with assumptions, allow for speed creep/inexact matching)

@pkmn/stats workflows/ can't be privileged at all; they just depend on @pkmn/logs + @pkmn/stats etc. and run. The process script hardcodes logic so that anon falls back to workflows/anon.ts, but otherwise it will look for a file for the worker

  • handle stats debug.ts, config.ts, process script
  • work on stats README.md and DESIGN.md
  1. parse configuration
  2. accept(format: id) => returns (shard, weight); if weight = 0 then drop, otherwise for each batch keyed by shard spin up workers and pass the shard
    for await (const batch of batches()) {
    	for (const shard of shards) {
    		yield [shard, batch];
    	}
    }
  • what is weight doing? weight only works if we go by least loaded, not round robin...
  • don't do round robin, just pass work to whoever is ready for it
  • instead, master just sends workers jobs and it's up to them to decide how many to do at once?
  1. figure out logs that need to be processed, start yield-ing anything that matches accept
  2. workers save checkpoints for a shard and (shard or format shard)

master thread is generating work, passes to least loaded worker
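A minimal sketch of that least-loaded dispatch (the data structures are assumptions, not the actual scheduler): the master tracks outstanding jobs per worker id and hands the next batch to whichever worker currently has the fewest.

```typescript
// Pick the worker id with the fewest outstanding jobs (least loaded).
// `loads` maps worker id -> number of jobs currently in flight.
function leastLoaded(loads: Map<number, number>): number {
  let best = -1;
  let min = Infinity;
  for (const [worker, load] of loads) {
    if (load < min) {
      min = load;
      best = worker;
    }
  }
  if (best < 0) throw new Error('no workers registered');
  return best;
}
```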

problem => want to be unpacking in the most parallel way, i.e. doing multiple formats in parallel, but this means it will take longer before we are done with a format (because other formats are getting done at the same time). Need a --sequentialFormats parameter to ensure all work for a specific format gets done FIRST

  • still need to potentially account for checkpoints having happened
/tmp/checkpoints-2pA7Hjx
  WORKER (up to worker to determine if checkpoints are still valid)
  checkpoints/ - actual checkpoint files
  decompressed/ - decompressed data
  scratch/ - scratch output from worker

if --checkpoints is not passed, directory is created in /tmp and hook installed to delete ALL.

  • Worker is always responsible for cleaning up scratch/
  • checkpoints/ - if really concerned about space these can be deleted once a format is done (turned into marker)
  • decompressed/ - can be deleted when 'done' with them (= all shards have been processed) if concerned about space; otherwise useful for future runs and should mirror a hypothetical PS logs directory

DO NOT CLEAN UP CHECKPOINTS ON EXIT IF PASSED IN

  • changing input affects checkpoints? = should be able to still use if they overlap. WORKER is the main thing which affects checkpoints

  • ONLY WORKER SHOULD MATTER

  • delete checkpoints if:

    • flag not passed in and we created temp
  • worker responsible for cleaning

// 2020-08/gen1ou/2020-08-14/battle-gen1ou-24687621.log.json -> 2020-08-14_gen1ou_24687621
interface Offset {
	day: string; // 2020-08-14
	format: string; // gen1ou
	log: number; // 24687621
}
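The path-to-key mapping in the comment above could be parsed like this (`parseOffset` is an illustrative name, not an existing API):

```typescript
// Hypothetical parser for the layout above:
// '2020-08/gen1ou/2020-08-14/battle-gen1ou-24687621.log.json'
// -> {day: '2020-08-14', format: 'gen1ou', log: 24687621}
function parseOffset(path: string): {day: string; format: string; log: number} {
  const parts = path.split('/');
  const file = parts[3];
  const m = file && /^battle-.+-(\d+)\.log\.json$/.exec(file);
  if (parts.length !== 4 || !m) throw new Error(`Unparseable log path: ${path}`);
  return {day: parts[2], format: parts[1], log: Number(m[1])};
}
```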

// CLEANUP: happens at format level, how to pass up?
interface Batch {
	begin: Offset;
	end: Offset;
}

type AcceptFn = (format: ID) => string[] | undefined;
interface LogStorage {
	// Returns:
	// - offsets to pass to bar
	// - something to delete
	foo(checkpoints: XXX, accept: AcceptFn, begin?: Date, end?: Date): AsyncIterator<>;
	// Return names which can be passed to read
	bar(begin: Offset, end: Offset): AsyncIterator<string[]>;
	read(log: string): Promise<string>;
}

interface CheckpointStorage {
	read(XXX: string): Promise<string>;
	write(checkpoint: Checkpoint): Promise<void>;
}

--constrained mode = only important if concerned about space = run formats sequentially = how? need a completely different architecture to yield earlier..., need a coroutine = wait until a format is done (all shards), turn the checkpoint into a tombstone and delete decompressed data for the format = delete decompressed formats that we don't accept

want to be able to delete a day when done with it. is a day too granular? delete format instead? surely MONTH-format

month-format-day month-format month

ASSUME NOT SPACE LIMITED (can expand particular formats at will)

process(format, shard, batch); // may stretch over many months...

// need some indicator that ALL batches are done to know to call combine on the shard
// if all batches are done, can also know to delete data used by batch...

cleanup and serial formats

{ yield batches; done(); }

yield batches | done: if (Array.isArray(yielded)) it's a batch; else done()

==PROBLEM== format is under year, actually need MONTH-FORMAT; too hard to ever know it's OK to delete?

stream results to workers = only know we can finish a format when all formats are done... eg. might find more format data in future months for a worker, better if FORMAT-MONTH-DAY

SELECT format FROM battles WHERE created_at > begin AND created_at <= end;
SELECT id FROM battles WHERE format = ? AND created_at > begin AND created_at <= end;
SELECT output_log FROM battles WHERE id = ?;

SELECT id FROM battles WHERE format = ? AND id >= ? AND id <= ?; -- only works because id is ASC with time

select(format, begin, end) => ids; read(id);

2 problems:

  1. what if we don't know which formats exist
  2. if compressed, select(format, begin, end) is suboptimal because we need to wait for

with compressed logs

  • if they're compressed in the first place you probably are constrained?
  1. Must be able to store entire working set at once
  2. No cleanups, but wont delete
YYYY-MM
└── format
    └── YYYY-MM-DD
        └── battle-format-N.log.json

BFS to open all YYYY-MM => if 7z you're probably fucked. DFS on formats (fine: YYYY-MM for each format)

can do formats in parallel or serial

  1. signal from split -> master that format is done
  2. signal from master -> split that format can be cleaned up (if constrained)

WITHIN a format = can be parallel across days/months/etc

Offset -> Offset

yield [] yield [] yield {cleanup()} <= this is the signal that the format is done; master then calls done() to clean up

HOW DO SHARDS FIT = each shard might have different batches (has its own checkpoints), need to instead yield (format, shard, (begin, end)) | (format, shard, done())

==TODO== WORKER file needs to handle shards changing in stats work!

  • --shard=tag = split out tags but do all cutoffs at once
  • --shard=cutoff = split out cutoffs but do all tags at once
  • --shard=tag,cutoff = shard out everything (monowater-1500)
  • no sharding = do EVERYTHING at once
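A sketch of parsing that flag (only the --shard flag itself comes from the note above; the parsing code is an assumption): each listed dimension gets split into its own shards, unlisted dimensions are done all at once, and no flag means no sharding.

```typescript
// The dimensions stats work could be sharded over.
type ShardDim = 'tag' | 'cutoff';

// Parse a --shard value like 'tag,cutoff' into the set of dimensions to
// split on; an absent flag means no sharding (do everything at once).
function parseShard(flag?: string): Set<ShardDim> {
  const dims = new Set<ShardDim>();
  if (!flag) return dims;
  for (const dim of flag.split(',')) {
    if (dim !== 'tag' && dim !== 'cutoff') {
      throw new Error(`Unknown shard dimension: ${dim}`);
    }
    dims.add(dim);
  }
  return dims;
}
```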


parse configuration, set up workspace, make sure workspace is compatible with worker, start splitting using LogStorage + CheckpointStorage (= restore) (heavily storage layer dependent)

storage.logs.process => open up all months in parallel and find all month/formats for formats we accept. accept returns true or shards if accepted. => if constrained: process each format serially (though internally days can be processed in parallel), send out a done() for each shard after completion. => if not constrained: process each format all together

processing format => expand format dir, if constrained delete days out of range, process days in parallel

process day => find all files in range (don't bother deleting files out of range). with the range of files: for each shard, get batch from range of files + get checkpoints for shard + day, then process(format, shard, batch)

tasks = Map<string, Promise<void>[]> // keyed by `${format}|${shard}` since object keys break Map lookups


await storage.logs.process(accept, (task) => {
	// Map keys must be primitives for lookups to work, so key by format + shard
	const key = `${task.format}|${task.shard ?? ''}`;
	let remaining = tasks.get(key);
	if (task.done) {
		if (!remaining) {
			task.done();
		} else {
			Promise.all(remaining).then(() =>
				combine(task.format, task.shard).then(() => task.done()));
		}
	} else {
		if (!remaining) {
			remaining = [];
			tasks.set(key, remaining);
		}
		remaining.push(apply(task.format, task.shard, task.batch));
	}
}, begin, end);