Getting into Python

So you want to program. You need a few things, but first - a few questions to ask yourself:

Do I want to learn python, or just use it? Do I want to learn to manage my own development environment, or just work in one provided to me? Think about these questions as you follow along below.

This guide is written as a launch-pad for you to research each point and make your own decisions along the way. Learning to search the internet effectively is the most valuable skill you can have as a developer!

Python

You need to get the program that executes your Python code. There are lots of places to get this, but my favourite is Anaconda Python. Your choices are either the full Anaconda suite, or the minimal Miniconda installer. Anaconda provides a GUI, while Miniconda needs you to use the command line - but don't be scared! If you want to manage your own dev env, choose Miniconda. If you want an easy-to-use, ready-to-go solution, choose Anaconda.

IDE

You need a way to write your code! There are two really fun ways to work on your code: Notebooks (IPython) or scripts (.py files). Both methods have their merits, with notebooks letting you interact, document, and link images, etc into the development experience. Scripts require fewer dependencies, and are the go-to when you just want a program to work in the background. It's really good to familiarise yourself with both!

Regardless of notebook or script, all Python code is written in plain text, meaning you can use something as simple as Notepad (but not office programs!). However, there are far better options, known as Integrated Development Environments, or IDEs. Some examples include Spyder (included with Anaconda), VS Code (language agnostic, extensive ecosystem), and PyCharm (a professional Python solution). If you use vim, you probably don't need this guide. Any of these choices is fine, and you'll find your own favourite. If you use VS Code, you will need to install the Python extension from the sidebar.

Hello world

When seeing code on the internet, there are a few ways of showing that something should be typed in as code. One of the most common is the >>> prompt, meaning everything following the >>> should be typed into the Python interpreter. Another is using specific_formatting. >>> print("hello world")

Press Run in your IDE to make the code run!
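If you'd rather start from a script instead of the interpreter, something as small as this will do (the filename hello.py is just an example):

# hello.py - save this file in your editor and press Run in your IDE
print("hello world")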

Version control

Coding is full of small changes and lots of testing. Version control keeps track of them for you. The standard tool is git, usually paired with a hosting service such as GitHub, and both should be built into your IDE. Learning version control is a super important skill! Here is a guide I have written - but it's not the best!: https://gist.github.com/LSgeo/d0ed0b07b40ed677622ced3021ded558 Version control relies on your state of mind as much as it does on your knowledge of the tools your VCS offers. At the very least, learn to write useful commit messages, and try to organise commits as a collection of modifications for a singular purpose.

Packages

Lots of pre-written code exists outside of Python's built-ins - packages for plotting figures, doing extensive maths, and much more. You need to install and import these packages to use them. Thankfully, package managers make this easy!

I recommend Mambaforge for speed and quality; however, it is based on the widely used conda package manager, so conda itself may be a better starting point.
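As a rough sketch of the workflow (numpy and matplotlib are just common examples, not requirements): you install a package once from the command line, then import and use it in your code.

# Installed once beforehand from the command line with, for example:
#   conda install numpy matplotlib    (or: mamba install numpy matplotlib)
import numpy as np               # numerical arrays and maths
import matplotlib.pyplot as plt  # plotting figures

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()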

Virtual Environments

As you progress your projects, you'll find it hard to maintain a monolithic suite of packages - some don't co-exist easily! It's good practice to have a separate venv for each of your projects - and an up-to-date list of what needs to be installed to run your code. Get in the habit of writing an environment.yml for conda and using it to maintain your venv! Self-documenting code is the best code.
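A minimal environment.yml might look like the sketch below - the environment name and package list are placeholders for whatever your own project needs:

name: my-project        # placeholder name for the venv
channels:
  - conda-forge
dependencies:
  - python=3.10         # pin the things that matter to you
  - numpy
  - matplotlib

You can then create the venv with conda env create -f environment.yml, and keep it in sync later with conda env update -f environment.yml.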

Machine Learning in Pytorch

Train/Val/Test

One of the most important things to design is your split between training data (used to train the model), validation data (used to pick the best model), and test data (used to demonstrate the performance of the best model).

"Test" set is reserved - don't even look at them until you've finished training everything else, and don't use their statistics when normalising! It's a pandoras box representing how your model will perform in the real world.

The typical Train/Val/Test split is something like 70:20:10, making sure you have a statistically reasonable amount of representative data in each. So if you have 100 synthetic samples, use 70 when calculating training loss, 20 when calculating validation loss, and keep 10 spare to run the final model on and use as your "we've never seen this data before" data.
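A minimal sketch of one way to make that split in Pytorch, using torch.utils.data.random_split (the random tensors here just stand in for your 100 synthetic samples):

import torch
from torch.utils.data import TensorDataset, random_split

# 100 stand-in samples, purely for illustration
dataset = TensorDataset(torch.randn(100, 1, 64, 64), torch.randn(100, 1))

# Fixing the generator seed keeps the split reproducible between runs
train_set, val_set, test_set = random_split(
    dataset, [70, 20, 10], generator=torch.Generator().manual_seed(42)
)
print(len(train_set), len(val_set), len(test_set))  # 70 20 10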

Tracking Training

There are a few things to consider when measuring if training is successful:

  • Training Loss (loss calculated on the training dataset)
  • Validation Loss (loss calculated on the validation dataset)
  • Non-Loss Performance metrics (This can be the same as your loss, e.g. Mean square error, etc, OR a different function you calculate between target and prediction)

Losses are differentiable and implemented in Pytorch. Performance metrics can be anything, including "does this look better, yes or no?". Thinking about your loss and performance metrics is a great way to design your experiments and models. What are you trying to achieve?

Overfitting

Ideally you track both training and validation loss together, and they should follow each other fairly consistently - eventually the validation loss stops improving at the same rate as the training loss, because the model starts to overfit the training samples. The most important part of all of this (the key to a meaningful validation loss) is tied to the .backward() call and gradient tracking. You need to tell Pytorch NOT to compute gradients or update the model weights while it is calculating validation loss - otherwise it is "training" on your validation data. Look up guides and examples on disabling gradient tracking in Pytorch for validation; one common pattern is sketched below.
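Here is a minimal sketch of that pattern - the tiny linear model, random data, and hyperparameters are placeholders; the important parts are model.train()/model.eval() and the torch.no_grad() block:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model, loss, optimiser, and random data - swap in your own
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(TensorDataset(torch.randn(80, 8), torch.randn(80, 1)), batch_size=16)
val_loader = DataLoader(TensorDataset(torch.randn(20, 8), torch.randn(20, 1)), batch_size=16)

for epoch in range(5):
    model.train()                     # training mode (dropout/batchnorm active)
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()               # gradients are computed here
        optimizer.step()              # and the weights are updated here

    model.eval()                      # evaluation mode
    with torch.no_grad():             # gradients are NOT tracked: no "training" on validation data
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
    print(epoch, val_loss)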

Tracking machine learning experiments

Tracking machine learning experiments is something that lots of people want to do. What hyperparameters did I use yesterday? What loss did I get? What did my outputs look like? These are all important things to track, and it can be hard to do. Fortunately there are lots of programs to track all these for you. Tensorboard is one example, and comet.ml is a good online service.
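As one hedged example of the Tensorboard route (it needs the tensorboard package installed alongside Pytorch; the directory, tags, and values below are placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/my_experiment")     # placeholder run directory

# Inside your training loop, log whatever you want to track against the step/epoch number:
writer.add_scalar("loss/train", 0.25, global_step=1)      # placeholder values
writer.add_scalar("loss/val", 0.30, global_step=1)
writer.add_text("hyperparameters", "lr=0.01, batch_size=16")
writer.close()

Running tensorboard --logdir runs from the terminal then gives you a local web page with your tracked curves.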

Stepping back into the Numpy-verse

When you get your model outputs, they are Tensors. Most generic programming you want to do will probably be in numpy or matplotlib, etc, which don't know how to deal with Tensors. It's a long chain of steps, but you need to use output.detach().cpu().numpy() on your output when you want to use it as a numpy array. .cpu() is included because it costs nothing, and it is required if you ever interact with GPU training. We'll see this later!
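A small sketch of that round trip - the random tensor below just stands in for a real model output:

import torch
import matplotlib.pyplot as plt

output = torch.randn(1, 1, 64, 64, requires_grad=True)  # stand-in for a model output

img = output.detach().cpu().numpy()   # stop tracking gradients, move to CPU RAM, convert to numpy
print(type(img), img.shape)           # <class 'numpy.ndarray'> (1, 1, 64, 64)

plt.imshow(img[0, 0])                 # matplotlib is happy again now it has a numpy array
plt.show()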

Hardware

Most operations you do are performed on your CPU, using your CPU RAM. Pytorch also offers simple GPU access. Keep track of what is where, using .to(), .cpu(), etc. There are many Pytorch tutorials on how to manage this. GPUs offer great performance benefits for many applications - but knowing what device you have and how to use it can be a challenge!

CUDA

It is likely you have an NVIDIA CUDA device if you are trying to learn to use it for ML. Conda can help you manage the correct CUDA and Pytorch-GPU packages; simply follow Pytorch's "getting started" guide.
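A common pattern (a sketch, not the only way) is to pick the device based on what is available, so the same code runs with or without a GPU:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)                      # "cuda" if an NVIDIA GPU and a CUDA-enabled Pytorch are found, else "cpu"
print(torch.cuda.is_available())   # a quick check that your Pytorch-GPU install worked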

Dataset and DataLoader

Write a custom Dataset class! Don't write a custom Dataloader! Define your Dataset operations to run once at the start using __init__, or to run each iteration in __getitem__. It's a balance depending on your hardware and preprocessing - especially on whether you have a GPU and dataloader workers. You want to maximise your hardware utilisation and not leave anything waiting.
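A minimal sketch of a custom Dataset - the random tensors stand in for whatever loading and preprocessing your own data actually needs:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):                          # hypothetical example class
    def __init__(self):
        # Run-once work: build a file index, load metadata, pre-load small datasets, etc.
        self.data = torch.randn(100, 1, 64, 64)    # stand-in for data loaded from disk
        self.targets = torch.randn(100, 1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Per-iteration work: load/crop/augment one sample here if it's too big to pre-load
        return self.data[idx], self.targets[idx]

if __name__ == "__main__":                         # guard needed when num_workers > 0 on Windows/macOS
    loader = DataLoader(MyDataset(), batch_size=16, shuffle=True, num_workers=2)
    x, y = next(iter(loader))
    print(x.shape, y.shape)                        # torch.Size([16, 1, 64, 64]) torch.Size([16, 1])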

Scientific "images"

Pytorch expects batches of arrays for model inputs. These are 4D arrays with shape [B, C, H, W], where B is your batch size (specified in your dataloader), C is the number of Channels in your array (3 for photos, 1 or more for scientific data), then Height and Width. You should know whether your scientific data are single channel or multi-channel, and C will change accordingly. Your Dataset class should return Tensor image rasters with 3 dimensions, no matter if the data have 1 channel or 3 (or n). Your Dataloader will add the 4th dimension by batching samples from the dataset. Many tutorials focus on colour photographs, which have 3 channels: R, G, B. Additionally, photographs are constrained to UINT8 0-255 values, whereas your own data may be unconstrained float32 values! You need to be vigilant about how this affects your code if you are following example code on the internet!
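For example (a sketch, assuming a single-channel 2D float32 grid), your Dataset should hand back a [1, H, W] tensor, and the Dataloader adds the batch dimension:

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

grid = np.random.rand(64, 64).astype(np.float32)     # one single-channel "image" of unconstrained floats
sample = torch.from_numpy(grid).unsqueeze(0)         # [H, W] -> [C=1, H, W]

dataset = TensorDataset(torch.stack([sample] * 10), torch.zeros(10))
batch, _ = next(iter(DataLoader(dataset, batch_size=4)))
print(sample.shape, batch.shape)                     # torch.Size([1, 64, 64]) torch.Size([4, 1, 64, 64])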

Some notes on running out of CUDA memory

You'll recognise the issue when you get an error along the lines of "CUDA out of memory, could not allocate 2.3 GB..." etc.

When you are using CUDA on a GPU (NVIDIA only), you are allocating storage within the GPU RAM. This RAM is independent of your CPU RAM, and is most often lower capacity. You can see how much is available (and how much is currently being used) by running nvidia-smi. A typical amount is around 8 GB, though modern cards are rapidly increasing to 12 GB or more. Your Windows desktop will be using a portion of the GPU RAM just to run itself, somewhere around 1 GB or less. Pytorch will also require some GPU RAM regardless of what you send to the device manually. When you restart the kernel, or exit a script, anything allocated to Pytorch will be released. There are some methods to deallocate and delete / garbage collect CUDA memory, but these are not a great way to deal with the issue.

When you are using CUDA for Pytorch, you will typically start with some initial setup like

device = torch.device("cuda")                 # all tensors sent here will live in GPU RAM
torch.backends.cudnn.benchmark = True         # let cuDNN pick the fastest algorithms for your sizes
torch.backends.cudnn.deterministic = False    # trade exact reproducibility for speed
....
model.to(device)                              # move the model parameters onto the GPU
loss_fn.to(device)                            # and the loss function (if it has parameters/buffers)
... etc

etc. You then send the model (i.e. its parameters, etc) and loss functions to the device (the GPU), and they will sit there happily while the model trains. These use a fairly static amount of GPU RAM.

The issue arises when you collect your data into a batch for training. The size and data type of a single dataset item, multiplied by the batch size, determines how much GPU RAM a batch uses. You can calculate it mathematically, if you dare - see the rough sketch below.
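As a back-of-the-envelope sketch (input tensors only - activations and gradients inside the model usually cost more, so treat this as a lower bound):

batch_size = 32
channels, height, width = 1, 512, 512
bytes_per_value = 4                        # float32

input_bytes = batch_size * channels * height * width * bytes_per_value
print(input_bytes / 1024**2, "MB")         # 32.0 MB just for one input batch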

It is most typically (and TL;DR) the batch size that causes the issue, followed by having too large a dataset item (i.e. an image that is 512 x 512 instead of 64 x 64 pixels). The first step in troubleshooting is to reduce the dataloader batch size to 1. If this still gives an out of memory error, you may need to reduce the size of each item in the dataset (i.e. crop the input, or sample fewer points, etc).

However, reducing the batch size only helps if each new batch and its associated losses are not kept in GPU RAM between iterations. If you never clear (or overwrite) the output tensors and loss values from previous iterations, they accumulate on the GPU and you will still run out of memory.

The typical pattern is to send your outputs and loss values back to the CPU each time they are calculated, and to overwrite the output and loss variables each iteration. That is, during your iteration loop, move anything you want to keep for tracking back to the CPU with the .cpu() or .to("cpu") methods, and store it in a list/dictionary/etc as a CPU tensor (or even just a numpy value), not as a CUDA tensor. This applies to both loss values and outputs. You will see patterns such as output_loss.detach().cpu().numpy(), which stops tracking gradients for the tensor, sends it to CPU RAM, and converts it to a numpy array, from which you can do whatever you want to track it (print, log, store in a dictionary, etc).
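A sketch of that pattern - the tiny model and random batches are placeholders; the point is that loss_history only ever holds CPU values:

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)                 # placeholder model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss_history = []                                  # lives in CPU RAM, not GPU RAM
for step in range(100):
    x = torch.randn(16, 8, device=device)          # stand-in batch on the GPU (if available)
    y = torch.randn(16, 1, device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    # Keep only a detached CPU copy for tracking; the CUDA tensors from this
    # iteration are then free to be released before the next batch arrives.
    loss_history.append(loss.detach().cpu().item())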

Hopefully after

  1. Reducing batch size / input size; and
  2. Moving old data off the GPU each iteration

your code will run without issue. While you run your code with batch_size = 1, pay attention to what nvidia-smi -l outputs in the terminal. This loops the nvidia-smi call, and will update the RAM allocation in real time. There are dozens of academic papers out there talking about the pros and cons of big and small batch sizes, but a very rough rule of thumb is to simply max out your GPU RAM as much as possible, and read those papers while your model trains. For example, if batch_size=1 uses a quarter of your RAM, try batch_size = 4. If you don't get an error, happy days!
