arXiv/viXra - Workflow

Colab, PyTorch Lightning, and Weights and Biases.

Without good tools and organization, it is easy for a project to get out of hand. Below I describe the tools and workflow used for the arXiv/viXra project.

Tools and Organization

I used Google Colab Pro+ notebooks in order to access GPUs, with pytorch lightning on top of vanilla pytorch for coding efficiency, and wandb to track and visualize the results of model training.

Google Colab

Google Colab notebooks are essentially cloud-based, slightly modified Jupyter notebooks which can easily be synced up with Google Drive. Beware: reading data directly from Drive is very slow! Files on Drive can be copied to the Colab notebook cwd via code such as

    from google.colab import drive
    drive.mount("/content/drive")
    !cp 'path_to_data_on_Drive' .  # Don't forget the period.

Pulling from the copied file(s) will be much faster than pulling from Drive.

The number of simultaneous notebooks one can run, their maximum runtime, and the GPU specs and amount of RAM Google provides are all determined by current demand, historical use, and subscription level. I sprang for the most expensive option ($50/month), which nonetheless seemed economical when compared to AWS instances and other, similar options. Colab notebooks are not without headaches, but overall I have been very pleased with the Colab experience.

PyTorch Lightning

pytorch appears to have established a solid lead among the various ML research frameworks available (sorry tensorflow) due, in part, to its flexibility and highly pythonic API.

pytorch lightning (pl) improves the pytorch experience further by removing much of the boilerplate code needed in developing, training, and testing models. A few features:

pl handles moving tensors to the proper device. No explicit some_tensor.to('cuda') calls needed for utilizing GPUs.
pl's LightningModule (a subclass of pytorch's nn.Module) encapsulates the model architecture, optimizer, and {train, val, test}-loop code in a neat package, while still allowing for massive customization and minimizing code. For example, there is no need to zero-out gradients, call model.eval()/model.train(), or write optimizer.step() manually.
pl's LightningDataModule similarly encapsulates all elements related to the {train, val, test} data, and pl's Trainer ties the preceding elements together at runtime. The Trainer also implements useful features like 16-bit mixed precision and profiling of training runs to look for code bottlenecks. These tips (and more) are detailed by the pytorch_lightning creator in Medium posts here and here; the profiler API in the first link is deprecated, so see the docs for the current API.

Weights and Biases

Weights and Biases (wandb) is a lightweight, cloud-based platform for easily tracking, organizing, and visualizing the many iterations of models and training runs in a project. Further, wandb can be used to automate hyperparameter sweeps with a menu of flexible options. It is easy to use and has pytorch lightning integration via pl's WandbLogger module.

In addition to tracking any desired statistics for training runs, one can also upload arbitrary files for each run (such as the state_dict for a saved pytorch model) and easily generate useful visualizations comparing runs or illustrating the results of a single run.

Below are two such examples of single-run visualizations.

First: histograms demonstrating how the probabilities assigned to a sample of papers evolve with training time, as predicted by a particular model. Here p (y-axis) is the probability that a given paper is from viXra. The predictions spread out from p=.5 in the expected manner (hover to see details of individual time-steps). (Direct link here.)

Second: predictions for specific titles generated by the same model as above. By allowing for easy inspection of specific examples, wandb makes it easier to spot patterns (and possible signs of cheating or code issues) in the results. (Direct link here.)

Training Workflow

The general workflow I used for training all of the deep-learning models detailed in these posts is relatively straightforward:

In order to organize the deep-learning model code and minimize the amount of explicit code in my Colab notebooks, I put all pl models, data modules, helper functions, etc., into a package, arxiv_vixra_models (which can be found here), which is easily imported into Colab notebooks from Drive.
After importing a model from arxiv_vixra_models and setting the hyperparameters which determine the model's capacity, I then performed a trial-run on a subset of the training data to estimate the optimal initial learning rate (lr), à la the well-known paper by Leslie Smith. In this method, a small initial lr is increased by a constant factor for each new batch of data until some maximum lr is surpassed or the loss diverges (this process is implemented by pl's lr_find method). The location of steepest descent on the resulting loss-vs-log(lr) plot provides the suggested initial lr; see the figure below. Note: this steepest-descent prescription is not precisely what was advocated for in 1506.01186 (nor was it the primary focus of the paper) and the rule seems often to be stated without explanation. The essential logic is that the logarithmic learning-rate axis is a proxy for time-step, since the learning rate is being increased by some constant factor for each new batch. Explicitly, the lr at the n-th batch, \ell_{n}, is related to the initial rate, \ell_{0}, by \ell_{n} = \alpha^{n} \ell_{0} for some \alpha > 1, meaning that \ln \ell_{n} grows linearly with n (the time-step), as claimed. The point of steepest descent therefore corresponds to the configuration for which the loss was most rapidly decreasing in time, which is precisely the desired property for an initial lr. The model weights are being updated in the usual way at each step, so comparing different points on the log(lr) axis is not exactly an apples-to-apples comparison, since they correspond to different points in the loss-landscape, but I am not aware of any analysis regarding this point.
Set up a wandb hyperparameter sweep exploring a neighborhood of the suggested initial lr and possibly other parameters which do not significantly change the size of the model (e.g. don't scan over possible choices of hidden dimensions). The latter condition ensures that we can use a constant, large-as-possible batch size without running into CUDA error: out of memory issues due to changes in model size. The sweep parameters for each run are chosen randomly. This is a superior strategy to a grid-search, essentially because a random-search avoids making unnecessary assumptions about the structure of hyperparameter space, and freeing oneself from a rigidly structured scan expands the effective search volume. Note: updates to pl have broken wandb sweeps recently, but the pl team is very responsive about addressing the issue, thankfully.
Increase the capacity of the model until it can learn the training set, while trying not to add any complications, per Karpathy's advice.
Finally, add regularization to improve validation-set performance, toy around with various learning-rate schedulers, and save the best models to wandb.
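
As an illustration of the sweep step above, the following sketch builds a random-search sweep configuration centered on a suggested lr. It is not the configuration used in the project: build_sweep_config, sample_lr, the span factor, the metric name val_loss, and the fixed hidden_dim and batch_size values are all hypothetical. The dictionary follows the layout I understand wandb sweep configurations to use; sample_lr is a stdlib stand-in showing what a log-uniform random draw looks like, not wandb's own sampler.

```python
import math
import random


def build_sweep_config(suggested_lr, span=10.0, metric_name="val_loss"):
    """Random-search sweep over a log-scale neighborhood of suggested_lr.

    `span` and `metric_name` are hypothetical choices for illustration.
    """
    return {
        "method": "random",  # draw each run's parameters at random, not on a grid
        "metric": {"name": metric_name, "goal": "minimize"},
        "parameters": {
            "lr": {
                "distribution": "log_uniform_values",
                "min": suggested_lr / span,
                "max": suggested_lr * span,
            },
            # Held fixed (not swept) so model size, and hence the largest
            # batch size that fits in GPU memory, stays constant across runs.
            "hidden_dim": {"value": 256},
            "batch_size": {"value": 1024},
        },
    }


def sample_lr(config, rng):
    """Draw one lr uniformly in log-space between the configured bounds."""
    spec = config["parameters"]["lr"]
    log_lr = rng.uniform(math.log(spec["min"]), math.log(spec["max"]))
    return math.exp(log_lr)


sweep_config = build_sweep_config(suggested_lr=1e-3)
example_lrs = [sample_lr(sweep_config, random.Random(seed)) for seed in range(5)]

# With wandb installed and logged in, a dict like this could be registered via
# wandb.sweep(sweep_config, project=...) and executed with wandb.agent(...).
```

Sampling the lr in log-space rather than linearly spreads trials evenly across orders of magnitude, which is usually what one wants when the suggested value is only a rough estimate.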

An example of Smith's method for estimating the ideal initial learning rate, as implemented by pytorch lightning. The red dot is the suggested lr.
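
The steepest-descent rule can be sketched in a few lines of plain Python. This is a simplified illustration, not pl's implementation (which, as far as I know, also smooths the loss curve and discards some points at either end); lr_schedule, suggest_lr, and the toy loss curve synthetic_loss are hypothetical names introduced here.

```python
import math


def lr_schedule(lr0, alpha, num_batches):
    """Exponential schedule: the lr at batch n is alpha**n * lr0."""
    return [lr0 * alpha**n for n in range(num_batches)]


def suggest_lr(lrs, losses):
    """Return the lr where the loss falls fastest with respect to log(lr).

    Uses central finite differences at interior points, with no smoothing.
    """
    if len(lrs) != len(losses) or len(lrs) < 3:
        raise ValueError("need at least three (lr, loss) pairs of equal length")
    log_lrs = [math.log(lr) for lr in lrs]
    best_idx, best_slope = 1, math.inf
    for i in range(1, len(lrs) - 1):
        slope = (losses[i + 1] - losses[i - 1]) / (log_lrs[i + 1] - log_lrs[i - 1])
        if slope < best_slope:  # most negative slope = steepest descent
            best_idx, best_slope = i, slope
    return lrs[best_idx]


def synthetic_loss(lr, center=1e-3):
    """Toy loss curve that drops most steeply at lr == center."""
    return 1.0 - 0.8 / (1.0 + math.exp(-2.0 * (math.log(lr) - math.log(center))))


lrs = lr_schedule(lr0=1e-6, alpha=10**0.1, num_batches=61)
losses = [synthetic_loss(lr) for lr in lrs]
suggested = suggest_lr(lrs, losses)
```

Because consecutive values in lrs differ by the constant factor alpha, the spacing of log(lr) between batches is constant, so the slope computed here is proportional to the change in loss per batch.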

Acknowledgments

Thank you to the Distill team for making their article template publicly available and to the Colab, pytorch lightning, and wandb teams for their wonderful tools.

All Project Posts

Links to all posts in this series. Note: all code for this project can be found on my GitHub page.

The Data
Workflow
Baseline Models
Simple Recurrent Models
Embeddings
...in progress...
Test Set Performance and Conclusions