Without good tools and organization, it is easy for a project to get out of hand. Below I describe
the tools and workflow used for the arXiv/viXra project.
Tools and Organization
I used Google Colab Pro+ notebooks in order to access GPUs, with pytorch lightning on top of vanilla pytorch for coding efficiency, and wandb to track and visualize the results of model training.
Google Colab
Google Colab notebooks are essentially cloud-based, slightly modified Jupyter notebooks which can easily be synced up with Google Drive. The number of simultaneous notebooks one can run, their maximum runtime, and the GPU specs and amount of RAM Google provides are all determined by current demand, historical use, and subscription level. I sprang for the most expensive option, at $50/month, which nonetheless seemed economical when compared to AWS instances and other, similar options. They are not without headaches, but overall I have been very pleased with the Colab experience.

Beware: reading data directly from Drive is very slow! Files on Drive can be copied to the Colab notebook's current working directory via code such as

    from google.colab import drive
    drive.mount("/content/drive")
    !cp 'path_to_data_on_Drive' .  # Don't forget the period.

Reading from the copied file(s) will be much faster than reading from Drive.
PyTorch Lightning
pytorch appears to have established a solid lead among the various ML research frameworks available (sorry, tensorflow) due, in part, to its flexibility and highly pythonic API. pytorch lightning (pl) improves the pytorch experience further by removing much of the boilerplate code needed in developing, training, and testing models. A few features:
- pl handles moving tensors to the proper device. No explicit some_tensor.to('cuda') calls are needed to utilize GPUs.
- pl's LightningModule (a subclass of pytorch's nn.Module) encapsulates the model architecture, optimizer, and {train, val, test}-loop code in a neat package, while still allowing for massive customization and minimizing code. For example, there is no need to zero out gradients, call model.eval()/model.train(), or write optimizer.step() manually.
- pl's LightningDataModule similarly encapsulates all elements related to the {train, val, test} data, and pl's Trainer ties the preceding elements together at runtime. The Trainer also implements useful features like using 16-bit mixed precision or profiling your training runs to look for code bottlenecks. These tips (and more) are detailed by the pytorch_lightning creator in Medium posts here and here (the profiler API in the first link is deprecated; see the docs for the current API).
Weights and Biases
Weights and Biases (wandb) is a lightweight, cloud-based platform for easily tracking, organizing, and visualizing the many iterations of models and training runs in a project. Further, wandb can be used to automate hyperparameter sweeps with a menu of flexible options. It is easy to use and has pytorch lightning integration via pl's WandbLogger class.
In addition to tracking any desired statistics for training runs, one can also upload arbitrary files for each run
(such as the state_dict for a saved pytorch
model) and easily generate useful visualizations comparing runs or illustrating the results of a single run.
Below are two such examples of single-run visualizations.
First: histograms demonstrating how the probabilities assigned to a sample of papers evolve with training time, as predicted by a particular model. Here p (y-axis) is the probability that a given paper is from viXra. The predictions spread out from p=.5 in the expected manner (hover to see details of individual time-steps). (Direct link here.)
Second: predictions for specific titles generated by the same model as above. By allowing for easy inspection of specific examples, wandb makes it easier to spot patterns (and possible signs of cheating or code issues) in the results. (Direct link here.)
Training Workflow
The general workflow I used for training all of the deep-learning models detailed in these posts is relatively straightforward:
- In order to organize the deep-learning model code and minimize the amount of explicit code in my Colab notebooks, I put all pl models, data modules, helper functions, etc., into a package, arxiv_vixra_models (which can be found here), which is easily imported into Colab notebooks from Drive.
- After importing a model from arxiv_vixra_models and setting the hyperparameters which determine the model's capacity, I then performed a trial run on a subset of the training data to estimate the optimal initial learning rate (lr), à la the well-known paper by Leslie Smith. In this method, a small initial lr is increased by a constant factor for each new batch of data until some maximum lr is surpassed or the loss diverges (this process is implemented by pl's lr_find method). The location of steepest descent on the resulting loss-vs-log(lr) plot provides the suggested initial lr; see the figure below.

  A note on this prescription: the steepest-descent rule is not precisely what was advocated for in 1506.01186 (nor was it the primary focus of the paper), and the rule often seems to be stated without explanation. The essential logic is that the logarithmic learning-rate axis is a proxy for time-step, since the learning rate is being increased by some constant factor for each new batch. Explicitly, the lr at the n-th batch, $\ell_{n}$, is related to the initial rate, $\ell_{0}$, by $\ell_{n}=\alpha^{n}\ell_{0}$ for some $\alpha > 1$, meaning that $\ln \ell_{n} = \ln \ell_{0} + n \ln \alpha$ grows linearly with n (the time-step), as claimed. The point of steepest descent therefore corresponds to the configuration for which the loss was most rapidly decreasing in time, which is precisely the desired property for an initial lr. (The model weights are being updated in the usual way at each step, so comparing different points on the log(lr) axis is not exactly an apples-to-apples comparison, since they correspond to different points in the loss landscape, but I am not aware of any analysis regarding this point.)
- Set up a wandb hyperparameter sweep exploring a neighborhood of the suggested initial lr and possibly other parameters which do not significantly change the size of the model (e.g. don't scan over possible choices of hidden dimensions). The latter condition ensures that we can use a constant, large-as-possible batch size without running into CUDA error: out of memory issues due to changes in model size. The sweep parameters for each run are chosen randomly. This is a superior strategy to a grid search, essentially because a random search avoids making unnecessary assumptions about the structure of hyperparameter space, and freeing oneself from a rigidly structured scan expands the effective search volume. (Updates to pl have broken wandb sweeps recently, but the pl team is very responsive about addressing the issue, thankfully.)
- Increase the capacity of the model until it can learn the training set, while trying not to add any complications, per Karpathy's advice.
- Finally, add regularization to improve validation-set performance, toy around with various learning-rate schedulers, and save the best models to wandb.
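The random-versus-grid point in the sweep step can be illustrated with a small, standard-library-only sketch. The dictionary below is shaped like a wandb sweep configuration as I understand that format, but nothing here calls wandb. The value of suggested_lr, the metric name val_loss, and the weight_decay parameter are hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical value, standing in for the lr suggested by the range test.
suggested_lr = 1e-3

# Shaped like a wandb sweep config; metric and parameter names are assumptions.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        # Explore one decade on either side of the suggested lr.
        "lr": {
            "distribution": "log_uniform_values",
            "min": suggested_lr / 10,
            "max": suggested_lr * 10,
        },
        # A regularization knob that does not change the model size.
        "weight_decay": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-2,
        },
    },
}


def sample_random(config, n_runs, seed=0):
    """Draw each parameter log-uniformly and independently for every run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        run = {}
        for name, spec in config["parameters"].items():
            lo, hi = math.log(spec["min"]), math.log(spec["max"])
            run[name] = math.exp(rng.uniform(lo, hi))
        runs.append(run)
    return runs


def sample_grid(config, points_per_axis):
    """Log-spaced grid over lr and weight_decay, for comparison."""
    axes = {}
    for name, spec in config["parameters"].items():
        lo, hi = math.log(spec["min"]), math.log(spec["max"])
        step = (hi - lo) / (points_per_axis - 1)
        axes[name] = [math.exp(lo + i * step) for i in range(points_per_axis)]
    return [
        {"lr": lr, "weight_decay": wd}
        for lr in axes["lr"]
        for wd in axes["weight_decay"]
    ]


random_runs = sample_random(sweep_config, n_runs=16)
grid_runs = sample_grid(sweep_config, points_per_axis=4)

# Same budget of 16 runs, but very different coverage of the lr axis.
n_lr_random = len({run["lr"] for run in random_runs})
n_lr_grid = len({run["lr"] for run in grid_runs})
print(f"distinct lr values tried: random={n_lr_random}, grid={n_lr_grid}")
```

With the same budget of 16 runs, the grid only ever tries 4 distinct learning rates, while the random search tries a different one on every run. This is one way to read the claim that dropping the rigid grid structure expands the effective search volume.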
An example of Smith's method for estimating the ideal initial learning rate, as implemented by pytorch lightning. The red dot is the suggested lr.
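The learning-rate range test can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not pl's lr_find implementation: the "model" is a single weight w with loss 0.5 * w**2, and the function names, the starting weight, the growth factor alpha, and the divergence check are all hypothetical choices.

```python
import math


def lr_range_test(lr0=1e-5, alpha=1.3, max_lr=10.0, w0=5.0):
    """Toy range test: lr_n = alpha**n * lr0, one gradient step per 'batch'."""
    w, lr = w0, lr0
    lrs, losses = [], []
    while lr <= max_lr:
        loss = 0.5 * w * w
        # Simple divergence check (an assumption, not pl's exact rule).
        if not math.isfinite(loss) or (losses and loss > 4 * losses[0]):
            break
        lrs.append(lr)
        losses.append(loss)
        grad = w          # d/dw of 0.5 * w**2
        w -= lr * grad    # weights keep updating as the lr grows
        lr *= alpha       # constant factor => log(lr) is linear in the step n
    return lrs, losses


def suggest_lr(lrs, losses):
    """Return the lr at which the loss falls fastest with respect to log(lr)."""
    log_lrs = [math.log(lr) for lr in lrs]
    slopes = [
        (losses[i + 1] - losses[i - 1]) / (log_lrs[i + 1] - log_lrs[i - 1])
        for i in range(1, len(losses) - 1)
    ]
    steepest = min(range(len(slopes)), key=slopes.__getitem__)
    return lrs[steepest + 1]  # slopes[i] is centered on lrs[i + 1]


lrs, losses = lr_range_test()
suggested = suggest_lr(lrs, losses)
print(f"suggested initial lr: {suggested:.3g}")
```

Because log(lr) advances by the same amount, ln(alpha), at every step, the slope of the loss against log(lr) is proportional to the change in loss per step. Picking the most negative slope is therefore the same as picking the step at which the loss was falling fastest in time.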