Statistical Analysis with D3.js
note taking terminology:
data array = list = set = range (programming; ie. Ruby)
value: normally means a number, but practically could be any type you can compare with eq, gt, lt operators
ordered set: sorted list
There are many kinds of Averages in the field of Statistics (see the d3-array sketch after this list):
- Mean:
Arithmetic average. The most common, and the one you are used to; ie. the mean of [1,2,4] is 2.3~
Remember as: "Mean" is what most people "mean" when they say "average".
- Median:
The value in the exact middle of the sorted/ordered data array (when the count is odd)
ie. [0,2,5,7,3] sorted is [0,2,3,5,7], so the median is 3
Or, take the mean of the middle two numbers (when the count is even); ie. [1,2,3,4] is 2.5
remember (n+1)/2 is the middle (1-based) index. if a non-whole number, the count is even, so take the floor and ceil
- Mode:
Value most frequently observed. ie. [1,1,1,10,100] is 1
The value most frequently repeated in the data array.
(if you take a sorted unique count, it's the head or max count in that list)
- Range:
The spread of the set: Max - Min. ie. Range of 2..5 is 3
- Standard Deviation:
Mean ± [N] Standard Deviations → separates the Common values from the Outliers
a.k.a. Statistically Significant (furthest from mean; outlier) vs. Expected Variation (closest to mean)
(all referring to the Empirical Rule) and 1/2/3-Sigma (represented by lowercase Sigma: σ)
side note: Sigma (uppercase Σ, lowercase σ) is the 18th letter of the Modern Greek alphabet.
on why the upper/lower look different (there are even two forms of the lower!):
https://www.quora.com/Why-does-the-lowercase-sigma-have-different-forms-at-the-beginning-or-middle-of-a-word-%CF%83-and-at-the-end-%CF%82
good video intro on std dev:
https://www.youtube.com/watch?v=WVx3MYd-Q9w
relating from Rick & Morty:
ie. "60 iterations off the central finite curve"
https://www.reddit.com/r/c137/comments/6zoca3/60_iterations_off_the_central_finite_curve_need/dmxel2v/
technobabble for:
"X standard deviations off the [bell curve] ..."
https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/stats-normal-distributions/v/ck12-org-normal-distribution-problems-empirical-rule
I think it's not quite this, but inspired by this:
"In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations"
https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
I think the Reddit commenter's explanation is most accurate for actual comic lore interpretation.
Regardless, both ideas require some background in statistical analysis to fully comprehend.
Hence lending credence to the idea that R&M is humor for/by high intellectuals.
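a minimal sketch of the Averages above, using d3-array (assuming the standard d3 bundle is loaded as the global `d3`; d3.mode requires d3-array >= 2.5):
    const data = [1, 1, 1, 10, 100];
    d3.mean(data);       // 22.6  -- arithmetic average
    d3.median(data);     // 1     -- middle of the sorted list
    d3.mode(data);       // 1     -- most frequently observed value
    const [min, max] = d3.extent(data);
    max - min;           // 99    -- the Range
    d3.deviation(data);  // ~43.4 -- sample standard deviation (n-1)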
Other terms:
- Extent: A Tuple2 containing the Minimum and Maximum value in the set. ie. Extent of 2..5 is [2,5]
- Stem Plot: not really worth knowing;
like a histogram but IMO only works well for a range of numbers 1-100
When to [or NOT to] use / How to apply the various kinds of Averages (a.k.a. "Measures of Central Tendency", "to describe the central tendency")
("Measures of/off/from Center"):
- Mean: (represented by "bar" over any symbol, ie. x̄)
discouraged when many outliers are present, as they will unfairly influence the outcome; ie. [1,2,3,1e9]
- Median:
less affected by outliers and skewed data (though not entirely immune)
ie. [1,2,3,4,1e9] still has median 3, vs. a mean >2e8 (see the sketch after this list)
therefore, it can be said that the Median is a more "robust" statistic of the center than the Mean
when the Mean - Median difference is great, it's a sign (like a large stddev) that the distribution is wide.
in these cases the Median is sometimes preferable, as a sample observed in the ordered middle of the crowd/set.
(the Mean suits normal number distributions, which have a low amount of outliers;
the Median returns the central tendency even for skewed number distributions.)
- Mode:
a.k.a. "most popular value", "most common category"
gotchas: possible to end up with multiple modes which are equally common.
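the sketch promised above -- a quick demo of Median robustness with d3-array (assuming the global `d3`):
    const data = [1, 2, 3, 4, 1e9];
    d3.mean(data);    // 200000002 -- dragged far off center by the single outlier
    d3.median(data);  // 3         -- unmoved; the "robust" measure of center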
("Measures of Spread"):
- Range:
relies solely on the two polar/most-extreme observations
for understanding largest potential variation
- [Sample] Std. Dev.:
use the n-1 version of this fn when you have a sample of a population (infinite/stream), not an entire/complete/finite set
- [Sample] Variance: is stddev squared
NOTICE: TO USE THE WRONG AVERAGE to describe the central tendency of a data set ("set of data")
IS TO BE MISLEADING.
This is how stats can be used to LEAD or MISLEAD; for good or evil.
Therefore it is CRITICAL to also understand the
spread/variation/deviation/distribution: distance from the mean
when you hear about the "average", or you may be being misled.
Good question:
"Why square the difference instead of taking the absolute value in standard deviation?"
https://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia
> Much of the field of robust statistics is an attempt to deal with the excessive sensitivity to outliers that is a consequence of choosing the variance as a measure of data spread...
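a hypothetical illustration of that sensitivity (my own sketch, not from the thread), comparing the squared-difference stddev against a mean absolute deviation:
    const data = [1, 2, 3, 4, 100];
    const m = d3.mean(data);                  // 22
    d3.deviation(data);                       // ~43.6 -- squaring amplifies the outlier
    d3.mean(data.map(x => Math.abs(x - m)));  // 31.2  -- mean absolute deviation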
Other terms:
- Robust - Less affected by outliers
- Skewed distribution - contains "extreme observations"; largest variance
- Symmetric distribution - smallest variance
- Interquartile Range (IQR) -
analog to stddev; where stddev is based on mean, IQR is based on median;
intended as more "robust" alternative to stddev.
also called the "midspread" or "middle 50%", or technically "H-spread"
https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/measuring-spread-quantitative/v/mean-and-standard-deviation-versus-median-and-iqr
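a minimal IQR computation with d3.quantile (sorting first, to be safe across d3-array versions):
    const sorted = [7, 1, 3, 9, 5, 11, 2].sort(d3.ascending);  // [1,2,3,5,7,9,11]
    const q1 = d3.quantile(sorted, 0.25);  // 2.5
    const q3 = d3.quantile(sorted, 0.75);  // 8
    q3 - q1;                               // 5.5 -- the IQR / midspread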
Aside: Interesting high theory on robust statistics in the face of adversarial data corruption
may have applications in robotics, machine learning, and security
https://simons.berkeley.edu/talks/clone-clone-clone-sketching-linear-algebra-i-basics-dim-reduction
- Quantile (In statistics and, notably, probability)
quadrants / zones / intervals
dividing the
bell curve / probability distribution / probability density / normal distribution
into equal probabilities,
commonly used when dividing the observations in a sample
ie. 1 Sigma, 2 Sigma, 3 Sigma
aside: This is probably what is really referred to by "Six Sigma":
process output staying within ±6σ of the mean, conventionally quoted as
~3.4 defects per million opportunities (≈99.99966% defect-free)
important: Common quantiles have special names!
2) Median: group of 2
3) Tertiles or Terciles (symbol: T)
4) Quartiles (symbol: Q)
when group size is known, each group can be numbered (in left-to-right order)
commonly denoted Q1, Q2, Q3, Q4 (ie. in the case of a Quartile);
basically the same as Quarters of a year
5) Quintiles (QU), 6) Sextiles (S) , 7) Septiles, 8) Octiles,
10) Deciles (D), 12) Duo-deciles or Dodeciles, 16) Hexadeciles (H),
20) Ventiles, Vigintiles, or Demi-Deciles (V)
100) Percentiles! (P)
1000) Permilles, Milliles (rare, considered obsolete for some reason)
- Quantization (mathematics, [Digital] Signal Processing or DSP)
the process of mapping input values from a large set (often a continuous set)
to output values in a smaller discrete (finite/countable) set.
Rounding and truncation are examples of common quantization functions. (see the hand-rolled sketch after this list)
- Discrete vs. Continuous
https://www.statisticshowto.datasciencecentral.com/discrete-vs-continuous-variables/
- Input Domain
the input min and max or the complete set of input data
- Output Range
the output min and max or the complete set of possible output quantiles
Remember word pairings the following way:
notice, if each word is a column in a table,
then they would appear alphabetically sorted (ascending; top-to-bottom):
---------|---------
(I)nput | (D)omain
(O)utput | (R)ange
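the hand-rolled quantizer promised above (a hypothetical sketch, not a d3 API): it maps a continuous input domain onto a small discrete output range via rounding:
    function quantize(domain, steps) {
      const [d0, d1] = domain;
      return x => {
        const t = Math.min(1, Math.max(0, (x - d0) / (d1 - d0)));  // normalize + clamp input
        return Math.round(t * (steps - 1));                        // round to a step index
      };
    }
    const q = quantize([0, 1], 4);  // input domain [0,1] -> output range {0,1,2,3}
    q(0.1);  // 0
    q(0.4);  // 1
    q(0.9);  // 3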
D3 has several scale methods built-in:
"Continuous" scales (continuous input domain to continuous output range)
- linear —
The default mapping of [n1..m1] to [n2..m2]
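a minimal example (assuming the global `d3`):
    const x = d3.scaleLinear().domain([0, 10]).range([0, 100]);
    x(5);    // 50
    x(7.5);  // 75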
- identity —
A 1:1 scale, useful primarily for pixel values
- sqrt —
applies a square-root transform to the input domain.
in general, the sqrt transform makes area-based marks visually truthful (area grows with the square of the radius), so the encoded sizes become more aligned, meaningful, and comparable.
as in this example:
https://observablehq.com/@d3/continuous-scales#scale_sqrt
> a visual legend that might be suitable for a square-root scale.
> The disk labeled ‘5’ is 4 times smaller in area than the disk labeled ‘20’,
> and its radius is half as long.
> (Finding an harmonious progression with nice and round values is half art, half science.)
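ie. sizing circles: a sqrt scale on the radius keeps AREA proportional to value, since area = π·r²:
    const r = d3.scaleSqrt().domain([0, 100]).range([0, 40]);
    r(100);  // 40 -- max value, max radius
    r(25);   // 20 -- 1/4 the value => 1/2 the radius => 1/4 the area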
- power —
input domain transformed by Math.pow() using a given exponent.
not sure why one would do this...
- radial —
Radial scales are a variant of linear scales
where the range is internally squared (you don't see squared output values),
so that an input value corresponds linearly to the squared output value;
useful when the input should map to the AREA of a mark specified by its radius.
Best to emphasize patterns in cyclical data (ie. temperatures over the year)
- log —
use a log scale when you don't want the largest value to dwarf the smallest
ie. "scale of the universe" visualizations use it a lot
https://observablehq.com/@d3/continuous-scales#scale_log
see also: the universe wikipedia article and scale image. very innovative imagery!
https://en.wikipedia.org/wiki/Order_of_magnitude
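a minimal example: each power of 10 gets equal space (50px per decade here):
    const y = d3.scaleLog().domain([1, 1e9]).range([0, 450]);
    y(1e3);  // 150
    y(1e6);  // 300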
- time —
a linear scale whose domain is Dates; ticks land on calendar-aware intervals (days, months, years)
https://observablehq.com/@d3/d3-scaletime?collection=@d3/d3-scale
- sequential —
like a continuous scale but the output range has an interpolator applied
- diverging —
visualize phenomena going in two opposite directions
https://observablehq.com/@d3/diverging-scales?collection=@d3/d3-scale
The default domain for diverging scales is [0, 0.5, 1] but
most applications will set it to [–1, 0, 1] or [minimum, neutral, maximum].
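a minimal example, colored by a d3-scale-chromatic interpolator (assumed available on the global `d3`):
    const temp = d3.scaleDiverging(d3.interpolateRdBu).domain([-1, 0, 1]);
    temp(-1);  // red end of the ramp
    temp(0);   // neutral middle
    temp(1);   // blue end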
...and these next ones, which I find particularly interesting because they can map to non-numbers;
like strings and colors and discrete [predefined, potentially fixed-size] sets...
for comparing "apples" to "oranges"
- quantize —
A linear scale with discrete values for its output range,
for when you want to sort data into “buckets”.
https://github.com/d3/d3-scale#quantize-scales
- quantile —
Similar to above, but with discrete values for its input domain
(when you already have “buckets”)
https://github.com/d3/d3-scale#quantile-scales
I would rephrase to suggest (TBD; verify) the distinction is:
when you want to clamp inputs to a set of possible/prescribed nearest values.
demo for above two:
http://bl.ocks.org/syntagmatic/29bccce80df0f253c97e
better explanation and demo:
https://observablehq.com/@d3/quantile-quantize-and-threshold-scales
though, I would supplement the Quantile explanation with:
imagine in the last example that each square were instead the shape of the state it represented
then you could see which states were earning the most,
sorted by states that earn the most appearing last.
now we can begin to see the usefulness of doing it that way.
this presumes input data roughly like the following (cleaned up here into a runnable d3 sketch; the values are hypothetical):
    const states  = ['alaska', 'alabama', 'arkansas' /* , ... */];
    const incomes = [60000, 250000, 34000 /* , ... */];
    const color = d3.scaleQuantile()
        .domain(incomes)                   // all observed values
        .range(['white', 'pink', 'red']);  // 3 buckets => terciles
    color(60000);  // ie. 'pink' -- whichever tercile alaska's income falls in
one last attempt at explaining this:
quantize
input domain is just the extent of input values (like 0.0 - 1.0)
output range is a fixed set of buckets
quantile
fixed size input set to fixed size output set
(ie. N income values : 50 states : 10 shades of blue)
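a side-by-side sketch of that distinction (hypothetical values):
    const incomes = [34000, 36000, 38000, 41000, 45000, 60000, 250000];
    // quantize: splits the EXTENT of the domain into equal-WIDTH buckets
    const byValue = d3.scaleQuantize().domain(d3.extent(incomes)).range(['white', 'pink', 'red']);
    // quantile: splits the OBSERVATIONS into equal-COUNT buckets (terciles here)
    const byRank = d3.scaleQuantile().domain(incomes).range(['white', 'pink', 'red']);
    byValue(60000);  // 'white' -- nowhere near the 250k outlier, value-wise
    byRank(60000);   // 'red'   -- but among the top third of observations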
- threshold —
between a fixed set of DOMAIN (NOT range) values (ie. 10^0, 10^1, ... 10^9)
see example: http://using-d3js.com/04_06_quantize_scales.html
Threshold scales allow the specification of precise cut values, which can emerge from personal observation,
or because special thresholds are deemed interesting (e.g. for external reasons, ie. if a law applies above a certain revenue)
Various clustering (like CIA terminology for critical thinking, no?) algorithms can provide reasonable thresholds
see algorithms like: “Jenks natural breaks” (old, but one example)
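a minimal example with hand-picked cut values:
    const size = d3.scaleThreshold()
        .domain([10, 100, 1000])                       // n cut points...
        .range(['small', 'medium', 'large', 'huge']);  // ...n+1 buckets
    size(5);     // 'small'
    size(50);    // 'medium'
    size(5000);  // 'huge'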
- ordinal —
Ordinal scales use non-quantitative values (like category names) for output;
perfect for comparing "apples" to "oranges"
api: https://github.com/d3/d3-scale#ordinal-scales
demo: https://observablehq.com/@d3/d3-scaleordinal
possibly could be used like this:
https://insights.stackoverflow.com/survey/2019#community-_-why-dont-developers-participate-on-stack-overflow
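a minimal example: category in, category out:
    const fruitColor = d3.scaleOrdinal()
        .domain(['apples', 'oranges', 'bananas'])
        .range(['red', 'orange', 'yellow']);
    fruitColor('oranges');  // 'orange'
    fruitColor('kiwis');    // 'red' -- unknown inputs are appended to the domain and recycle the range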
- band —
divides a continuous range into uniform bands, one per category (the classic bar-chart scale)
https://observablehq.com/@d3/d3-scaleband
- point —
like band, but collapses each band to a single point (useful for categorical scatter/line charts)
https://observablehq.com/@d3/d3-scalepoint
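a minimal sketch of both:
    const band = d3.scaleBand().domain(['a', 'b', 'c']).range([0, 300]);
    band('b');         // 100 -- left edge of b's band
    band.bandwidth();  // 100 -- width of each band
    const point = d3.scalePoint().domain(['a', 'b', 'c']).range([0, 300]);
    point('b');        // 150 -- bands collapsed to points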
- metric? not a scale really; just an after-the-fact number formatter.
Q: How does one apply the International System of Units (SI) ?
https://en.wikipedia.org/wiki/Metric_prefix#List_of_SI_prefixes
A: by formatting labels at render time (input value, output formatted string)
https://stackoverflow.com/questions/19928690/how-to-display-data-as-kb-mb-gb-tb-on-y-axis
and then switching to a plugin that does that, but more elaborately
https://github.com/d3/d3-format
see demo:
https://bl.ocks.org/mbostock/9764126
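ie. d3-format handles SI prefixes via the "s" format type:
    d3.format('.3s')(42e6);  // "42.0M"
    d3.format('~s')(1500);   // "1.5k" (the ~ trims insignificant trailing zeros)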
NOTICE: continuous scale functions have an .invert() fn which takes an output value and returns the input,
so you can, say, give a pixel location to a time scale and get the date out of it.
so this is useful in user interaction, but probably other ways, as well.
(spatial mapping/partitioning? BSP trees?)
NOTICE: scale functions also have a .clamp() which constrains the input to the domain,
and if you use it with .invert() it will apply the clamp to the result
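ie. pixel -> date for mouse interaction, using .invert() and .clamp():
    const x = d3.scaleTime()
        .domain([new Date(2021, 0, 1), new Date(2022, 0, 1)])
        .range([0, 800])
        .clamp(true);   // out-of-range inputs (and inverted outputs) stay inside bounds
    x.invert(400);      // ~ 2021-07-02, the date under pixel 400
    x.invert(9999);     // clamped to 2022-01-01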
It's great how scales can be applied in way more situations than just x and y axes!
- For example, could be used to determine circle width, height--but also radius, color, and infinitely more!
- Could also be used to dictate switch positions, converting joystick input to rotation, etc.
- or a farcical but not totally useless example:
mapping names to geographic 2d (lat,lon) or 3d (lat,lon,altitude) points in space (or the globe; on the planet Earth)
Other cool things:
- Application in Machine Learning
- Linear Discriminant Analysis (LDA)
https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher's_linear_discriminant
notes on D3:
Stands for Data-Driven Documents.
But it has become more like a group of statistical utility functions with a math-scientific community behind it.
The "data" being the key/central concept--Combine:
- jQuery selectors, with;
- .data([]) arrays, to get close to;
- Mithril/React-style DOM-diffing, for a;
- buffered renderer (like gl double-buffering or vim/terminal/IDE character buffer).
Ironically, NOT "data" as in Charts, although I guess it became that.
Concept of each datum having a unique identifier (UID) for use in tracking transformations, similar to .key: attributes.
Goes "too far" IMO by trying to implement the draw loop and integrating with browser-specific rendering semantics;
- I can see these becoming POTENTIALLY dated
But the statistical utilities are what I am interested in it for.
- d3.scale().*
- d3.extent()
- etc.
I was originally put off by D3 because of the myriad of community-contributed (read: varying quality, no central maintainer) work
- The variety of graphs on the front page is what misled me into thinking it was a charting-only library.
It actually has many more use cases; but it is just one simple concept.
It just ended up being used for charting because it also comes with a set of statistical utility functions.
It's ironic how that combination shaped its future away from its original stated purpose.
It just-so-happens that D3 community documentation has
some of the highest-grade of educational material on the topic;
just look at this interactive bl.ocks.org example!
https://bl.ocks.org/aviddiviner/84d905e60c6788f77ee21d35f873b236
aside: btw, something like ObservableHQ.com and bl.ocks.org
a codable scratchpad + real-time/interactive documentation is great
though i feel the UI could be even simpler.
their current implementation is "simple as in Perl" which makes it somewhat write-only.
I could probably do a true-simple which is still readable to the inexperienced / lay interpreter / unwashed masses.
PROTIP: Google Colaboratory with Tensorflow is the most modern version of this idea.
Conclusion:
- D3's collection of statistical analysis functions is nice.
It [intentionally?] does a good job of encapsulating and abstracting the academic science of data, which is unchanging,
away from the hot rendering technology of the day.
Interesting by way of comparison
- Q:
which scale is Kibana using for its y-axis?
see if i can identify it among the default d3 scales,
else prove that it's custom? ie. a list of distinct values in a threshold scale
A:
it must be the scale.threshold() because the numbers are so perfectly rounded
as to have been selected from a shortlist between like -1m to +1m or possibly broader
it seems this could be made further agnostic by treating it as an anonymous but ordered and fixed set of ordinals
so instead of KB MB GB TB, call it O1 O2 O3 O4
and then the range is just 0.0 - N, where N is the delta between ordinals
so the delta between G and T would be N=1024
now it doesn't matter what the units are! could be disk space or natural numbers
it's too bad this is not what the ordinal scale does--or does it?
hmm, no it could be a linear scale with .nice() or .round() applied
https://observablehq.com/@d3/d3-scalelinear
or hmm, it might even be (and make the most sense to be) a log scale?
so it doesn't matter how varying the extremes are?
it could even be a SymLog scale hmm
https://observablehq.com/@d3/continuous-scales#scale_symlog
this looks like a close demo:
https://bl.ocks.org/mbostock/9764126
NOTABLE:
- d3-interpolate
interpolate colors, numbers, strings--whatever!
basically these are handy any time you need to interpolate and can't use the gpu.
https://github.com/d3/d3-interpolate#interpolateCubehelix
https://github.com/d3/d3/blob/master/API.md#interpolators-d3-interpolate
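ie. d3.interpolate infers the type from its arguments: numbers, colors, strings, even objects:
    d3.interpolate(0, 100)(0.3);         // 30
    d3.interpolate('red', 'blue')(0.5);  // "rgb(128, 0, 128)"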
- d3-scale-chromatic
curated color scales that look nice
https://github.com/d3/d3-scale-chromatic/blob/master/README.md#interpolateRainbow
- d3-quadtree
recursively partitions two-dimensional space into squares
Quadtrees can accelerate various spatial operations, such as the Barnes–Hut approximation for computing
N-body forces, collision detection, and searching for nearby points.
https://github.com/d3/d3-quadtree/tree/v1.0.6
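a minimal nearest-neighbor search:
    const points = [[0, 0], [50, 30], [99, 99]];
    const tree = d3.quadtree().addAll(points);  // default accessors: d => d[0], d => d[1]
    tree.find(60, 40);  // [50, 30] -- the point nearest to (60, 40)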
- d3.zoom
provides utilities to handle PanZoom and wheelDelta stuff
- d3.hierarchy
tree/graph visualization utilities :O
includes everything from treemaps to graphviz.
possibly binpack algorithm?
possibly force-directed node layouts?
possibly spring dynamics and 2d physics?
api: https://github.com/d3/d3-hierarchy/tree/v1.1.8
demos: https://observablehq.com/collection/@d3/d3-hierarchy
- @solgenomics/d3-pedigree-tree
- rescape-geospatial-sankey
sankey graphs might be useful for visualizing where network bandwidth is going [unexpectedly]
find more at:
https://www.npmjs.com/search?q=keywords%3Ad3-module&page=1&perPage=20
QUOTES:
“Scales are functions that map from an input domain to an output range.”
That’s Mike Bostock’s definition of D3 scales. (he's the original author and primary maintainer of D3)
I wonder how functions-as-data performs versus alternative single-pass for...loop implementations?
ie. I can see how its easier for programmers to compose with chaining syntax, but--does it perform?
I wonder whether the functional approach composes better mathematically in the Haskell/Curry-style of programming?
but, so... as you learn statistics,
we see that Mike's definition of a Scale function
is: a Quantization function --
it takes any one input or data point from the source,
and converts it from the input domain
to the equivalent value in the target output range
or sink.
but Bostock said it better.
by this description, our mapper of whole documents or document key:values
from input->output values, aliasing, transforming, etc. ...
is technically a kind of scale or quantization function?
Not statistics related, but found while researching and liked:
"Being well-rested is a competitive advantage."
True in my experience, and likely of [knowledge] work in general.
Isaac Asimov's "Foundation" series
> one idea from the books has proved influential in real-world social science,
> the Uncertainty Principle of the Social Sciences:
"If a population gains knowledge of its predicted behavior,
its self-aware collective actions become unpredictable."
Seems to imply that if an NSA/CIA/NASA/other-random-3-or-4-letter-agency
were to gain knowledge of the probability of certain future events,
expected behavior, or controlled outcome,
it would behoove them to keep that knowledge secret,
if the outcome were favorable (for them, or for humanity? you decide.)
Otherwise, to expose it, if they want us to find an alternative.
> Animated transitions are pretty, but they also serve a purpose:
> they make it easier to follow the data.
> This is known as object constancy: a graphical element that represents a
> particular data point (such as Ohio)
> can be tracked visually through the transition.
> This lessens the cognitive burden by using
> preattentive processing of motion
> rather than sequential scanning of labels.