Mike Dusenberry dusenberrymw

@dusenberrymw
dusenberrymw / tmux_tips_and_tricks.md
Last active April 8, 2024 00:41
Tmux Tips & Tricks

Quick cheat sheet of helpful tmux commands

  1. tmux new - Create and attach to a new session.
  2. tmux new -s NAME_HERE - Create and attach to a new session named NAME_HERE.
  3. CTRL-b, d - Detach (i.e. exit) from the currently-opened tmux session (alternatively, tmux detach). Note, this means press and hold CTRL, press b, release both, press d.
  4. tmux ls - Show list of tmux sessions.
  5. tmux a - Attach to the previously-opened tmux session.
  6. tmux a -t NAME_HERE - Attach to the tmux session named NAME_HERE.
  7. CTRL-d - Exit the current shell, which closes its pane; if that was the last pane in the session, the session ends (to kill the session directly, use tmux kill-session).
  8. CTRL-b, [ - Enter copy mode, and enable scrolling in currently-opened tmux session. Press q to exit.
  9. CTRL-b, " - Split window horizontally (i.e. split and add a pane below).
dusenberrymw / proxy.pac
Last active February 7, 2024 23:50
Proxy PAC file template for selective SSH SOCKS proxies, plus a [re]installation script.
// Proxy PAC File
// - Used to redirect certain addresses to the server through the SOCKS ssh port (1280 for this file), i.e.
// tunnel traffic through server.
// - Useful for easily accessing webpages from services running on a server (Jupyter notebooks, TensorBoard, Spark UI, etc.)
// that is otherwise locked down by a firewall.
// - To install on OS X/MacOS, go to "Settings->Network->Advanced->Proxies->Automatic Proxy Configuration"
// and paste the local file url (`file:///absolute/path/to/proxy.pac`).
// - Alternatively, use `./reinstall_proxy.sh`.
// - SSH to the server with `ssh -D 1280 ....`.
function FindProxyForURL(url, host) {
dusenberrymw / ml_dl_scenarios.md
Last active January 3, 2024 07:14
Interesting Machine Learning / Deep Learning Scenarios

This gist aims to explore interesting scenarios that may be encountered while training machine learning models.

Increasing validation accuracy and loss

Let's imagine a scenario where the validation accuracy and loss both begin to increase. Intuitively, it seems like this scenario should not happen, since loss and accuracy seem like they would have an inverse relationship. Let's explore this a bit in the context of a binary classification problem in which a model parameterizes a Bernoulli distribution (i.e., it outputs the "probability" of the true class) and is trained with the associated negative log likelihood as the loss function (i.e., the "logistic loss" == "log loss" == "binary cross entropy").
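As a rough illustration of how this can happen, the sketch below computes accuracy and binary cross entropy for two sets of predictions on the same examples. The labels and probabilities are made-up values chosen for illustration, not results from a real model, and the function names are hypothetical.

```python
import math

def binary_cross_entropy(labels, probs):
    """Mean negative log likelihood of the labels under Bernoulli(probs)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(int(p >= threshold) == y for y, p in zip(labels, probs))
    return correct / len(labels)

# Hypothetical validation set: four examples, all of the "true" class.
labels = [1, 1, 1, 1]

# Earlier in training: hesitant predictions, two of four correct.
earlier = [0.55, 0.55, 0.45, 0.45]

# Later in training: three confident correct predictions and one
# confidently wrong prediction, which dominates the mean loss.
later = [0.90, 0.90, 0.90, 0.001]

for name, probs in [("earlier", earlier), ("later", later)]:
    print(name, accuracy(labels, probs), binary_cross_entropy(labels, probs))
```

With these assumed values, accuracy rises from 0.5 to 0.75 while the mean loss rises from roughly 0.70 to roughly 1.81, because a single confidently wrong prediction contributes a very large log-loss term.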

Imagine that when the model is predicting a probability of 0.99 for a "true" class, the model is both correct (assuming a decision threshold of 0.5) and has a low loss since it can't do much better for that example. Now, imagine that the model

dusenberrymw / mwd.sleepMac.plist
Last active December 4, 2023 20:25
Automatically force OS X / macOS to sleep when the battery is low (script + plist).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mwd.sleepMac</string>
<key>ProgramArguments</key>
<array>
<string>/path/to/sleepMac.sh</string>
</array>
dusenberrymw / spark_tips_and_tricks.md
Last active February 8, 2023 05:11
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress to use 1 byte unsigned integers, thus decreasing the size of saved DataFrame by a factor of 8.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
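The partition-sizing advice above can be sketched as simple arithmetic. This is a minimal illustration in plain Python, assuming you already have an estimate of the output size in bytes; the function name, the constant, and the 10 GiB figure are hypothetical.

```python
import math

# ~128MB target partition size, taken from the tip above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def suggested_num_partitions(estimated_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Round up so that partitions come out at or below the target size,
    i.e. err on the higher side w.r.t. the number of partitions."""
    return max(1, math.ceil(estimated_bytes / target_bytes))

# Hypothetical estimate: the expanded output is about 10 GiB.
estimated_output_bytes = 10 * 1024 ** 3
num_partitions = suggested_num_partitions(estimated_output_bytes)
print(num_partitions)
```

With PySpark, a value computed this way could then be passed to `df.repartition(num_partitions)` on the output of the flatMap step, before the memory-heavy op runs.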
dusenberrymw / 1.rsync_tips_and_tricks.md
Last active December 20, 2022 13:27
Rsync Tips & Tricks

  • rsync -auzPhv --delete --exclude-from=rsync_exclude.txt SOURCE/ DEST/ -n
    • -a -> --archive; recursively sync, preserving symbolic links and all file metadata
    • -u -> --update; skip files that are newer on the receiver; sometimes this is inaccurate (due to Git, I think...)
    • -z -> --compress; compression
    • -P -> --progress + --partial; show progress bar and resume interrupted transfers
    • -h -> --human-readable; human-readable format
    • -v -> --verbose; verbose output
    • -n -> --dry-run; dry run; use this to test, and then remove to actually execute the sync
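The dry-run-first workflow above could be wrapped in a small helper. The sketch below only builds and prints the command lines; it does not invoke rsync. The function name and its parameters are hypothetical, while the flags are the ones listed above.

```python
import shlex

def build_rsync_command(source, dest, exclude_file="rsync_exclude.txt", dry_run=True):
    """Assemble the rsync invocation from the tip above as an argument list."""
    cmd = [
        "rsync",
        "-auzPhv",
        "--delete",
        f"--exclude-from={exclude_file}",
        source,
        dest,
    ]
    if dry_run:
        # Keep -n while testing; drop it to actually execute the sync.
        cmd.append("-n")
    return cmd

preview = build_rsync_command("SOURCE/", "DEST/")
real = build_rsync_command("SOURCE/", "DEST/", dry_run=False)

print(shlex.join(preview))
print(shlex.join(real))
```

Either list could be passed to `subprocess.run(cmd, check=True)` once the dry-run output looks right.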
dusenberrymw / notebook.json
Created September 5, 2016 23:23
Jupyter 2 space indent (~/.jupyter/nbconfig/notebook.json)
{
"CodeCell": {
"cm_config": {
"indentUnit": 2
}
}
}
dusenberrymw / keras_tips_and_tricks.md
Last active July 28, 2020 11:59
Keras Tips & Tricks

fit_generator

  • Can use either threading or multiprocessing for concurrent and parallel processing, respectively, of the data generator.
  • In the threading approach (model.fit_generator(..., pickle_safe=False)), the generator can be run concurrently (but not parallel) in multiple threads, with each thread pulling the next available batch based on the shared state of the generator and placing it in a shared queue. However, the generator must be threadsafe (i.e. use locks at synchronization points).
  • Due to the Python global interpreter lock (GIL), the threading option generally does not benefit from >1 worker (i.e. model.fit_generator(..., nb_worker=1) is best). One possible use case in which >1 threads could be beneficial is the presence of exceptionally long IO times, during which the GIL will be released to enable concurrency. Note also that TensorFlow's `session.run(
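The thread-safety requirement mentioned above (locks at synchronization points) can be sketched with a lock-wrapped iterator. This is a generic standard-library illustration, not part of the Keras API; the class, the generator, and the worker function are hypothetical names, and the "batches" are just integers.

```python
import threading

class ThreadSafeIterator:
    """Wrap an iterator so that only one thread advances it at a time."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # The lock is the synchronization point: without it, two threads
        # calling next() on the same running generator raise ValueError.
        with self._lock:
            return next(self._it)

def batch_generator(num_batches):
    """Stand-in for a data generator; yields batch indices instead of data."""
    for i in range(num_batches):
        yield i

shared = ThreadSafeIterator(batch_generator(1000))
collected = []
collected_lock = threading.Lock()

def worker():
    # Each thread pulls the next available batch from the shared generator.
    for batch in shared:
        with collected_lock:
            collected.append(batch)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(collected))
```

Each batch is handed to exactly one thread, though the order in which threads receive batches is not deterministic.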
dusenberrymw / tensorflow_tips_and_tricks.md
Last active April 2, 2020 16:49
Tips and tricks for TensorFlow, Keras, CUDA, etc.

TensorFlow Tips & Tricks

GPU Memory Issues

  • nvidia-smi to check for current memory usage.
  • watch -n 1 nvidia-smi to monitor memory usage every second.
  • Often, extra Python processes can stay running in the background, maintaining a hold on the GPU memory, even if nvidia-smi doesn't show it.
    • Probably due to running Keras in a notebook, and then running the cell that starts the processes again, since this will fork the current process, which has a hold on GPU memory. In the future, restart the kernel first, and stop all processes before exiting (even though they are daemons and should stop automatically when the parent process ends).
dusenberrymw / git-cherry-pick-with-committer.sh
Created January 14, 2017 01:56
Cherry-pick commits while maintaining original information (original script from https://github.com/dingram/git-scripts/blob/master/scripts/git-cherry-pick-with-committer)
#!/bin/bash
#
# Copyright (c) 2013-2014 David Ingram
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions: