arXiv/viXra - Simple Recurrent Models

Working memory.

Note: The Jupyter/Colab notebooks relevant to this post are here on my GitHub page. Mypytorch_lightning implementation of the simple RNNs considered in this post can be found here.

Sequential Information

A central limitation of the baseline models considered previously is that they have no means of efficiently using the information inherent to the ordering of the text. For instance, one would expect that the logistic regression and random forest models would perform just about as well Apart from some specific minor degradations, such as the weakening of the signal which arises when a title ends in a punctuation mark (an uncommon, but strong, viXra signal). if we were to shuffle around the words in each title.

Recurrent Neural Networks (RNNs) are the basic architecture which attempts to capture the information in the relative ordering of inputs. I briefly review their properties below, some details of their implementation in pytorch, and examine their performance on the arXiv/viXra data and on my own papers.

RNNs, Briefly

General Structure

RNNs are relevant when our input data forms an ordered series x_t, t\in\{1,\ldots, \texttt{seq\_len}\}. In the context of arXiv/viXra, the x_t would be (tensorial representations of) the individual characters or words in a title with t indexing the position of these elements in the title.

The inputs x_t are consumed sequentially and are used to update an internal state of the RNN, h_t, a hidden or latent variable. The update step also takes into account the form of the previous hidden state, h_{t-1}, and this pattern continues on and on such that h_t contains information from all previous inputs and hidden states all the way back to the initial The initial hidden state h_0 is often taken to be a tensor filled with zeros. This is the default behavior pytorch, for instance. However, it is generally beneficial in Machine Learning to allow the model to figure out its own best parameters, when possible, and the best practice is to promote the initial state to a learnable parameter; see, e.g., Hinton's RNN presentation, slide 14.

It is easy enough to implement this in a pytorch model (minimal sketch): class DynamicInitialHiddenRNN(nn.Module):   """Adds a learnable initial hidden state to the vanilla torch RNN.   """   def __init__(self, input_size: int, hidden_size: int):     super().__init__()     self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)     # Initialize the hidden state randomly.     self.initial_hidden = nn.Parameter(torch.randn(1, 1, hidden_size))   def forward(self, inputs: Tensor, hidden: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]:     if hidden is None:       hidden = self.initial_hidden       # Expand hidden to match batch size and ensure contiguous.       batch_size = inputs.shape[0]       hidden = hidden.expand(-1, batch_size, -1).contiguous()     output, hidden = self.rnn(inputs, hidden)     return output, hidden
h_0. This long chain of dependencies can create difficulties in training For very long sequences of length-T, it is common to process the sequence in length- n\ll T chunks (with n determined by practical considerations), performing backpropagation only over these subsequences and passing the resulting final hidden state h_n along to be used as the initial state for the next n terms. More specifically, the computational graph is reset after each sub-sequence via calls of the form hidden.detach() in pytorch, since memory constraints forbid performing backprop over the full, length- T sequence. This procedure goes by the name of truncated backpropagation through time (TBPTT).

In pytorch_lightning TBPTT is performed automatically and there is no need for manual .detach() calls if an self.truncated_bptt_steps attribute is set for architecture (a pl.LightningModule subclass), though I found the whole process somewhat underdocumented. After every training step, pl extracts the hidden states of the architecture and recursively detaches them in the process. In my experience, getting TBPTT to function in the expected manner also requires overwriting the tbptt_split_batch method with a custom implementation, as the default seems to make some curious assumptions
.

The hidden states h_t are also the outputs of the entire network, one or more of which are then used for further purposes. In the context of arXiv/viXra text classificaiton, the simplest use would be to take the final hidden state h_{\texttt{seq\_len}}, which carries information about the entire sequence, and feed this into a fully connected layer and sigmoid function to predict the classification probabilities of a given title.

A Specific Architecture: Gated Recurrent Units

I used Gated Recurrent Units (GRUs) as the recurrent layers The other basic recurrent architectures which are commonly seen are so-called vanilla RNNs and long short-term memory (LSTM) layers. Simple trials demonstrated that GRUs performed similarly well on arXiv/viXra classification to LSTMs (both architectures faring better than vanilla RNNs, as expected), while training faster and being more compact, as they only require one latent variable in contrast to the two needed for LSTMs. See nn.RNN , nn.LSTM , and nn.GRU for the pytorch implementations of these architectures. in the arXiv/viXra models. GRUs process an input x_t via the following steps All of the x_t are generally encoded in three-dimensional tensors shaped as in (batch_size, seq_len, input_size), assuming a pytorch, batch_first = True type implementation (assumed throughout). E.g. for a batch of 50 one-hot encoded text samples, each 100 characters long and which only use lower case characters a-z, the x_t would be (50, 100, 26) -shaped. Similarly, all of the h_t are ensconced in a (b * num_layers, batch_size, hidden_size)-shaped output tensor where hidden_size is the length of the hidden state vector, num_layers is the number of stacked recurrent layers, and b = 2 if bidirectional else 1 doubles the size of the hidden state if the architecture is also bidirectional; see below. The batch dimensions are processed in parallel in the steps outlined below. Indices on x_t, h_t will generally be suppressed, as in the above. I have also compressed the separate b_{iz} and b_{hz} biases of the pytorch nn.GRU implementation into a single b_z vector for brevity and similar for b_r. :

The weights W and biases b are the learnable parameters of the model. The set of h_t's for allt form the output of the GRU.

Bells and Whistles: More Layers and Directions

Finally, one could increase the depth of the RNN architecture by stacking multiple such layers and could also process the inputs x_t in both the forward and backwards directions, resulting in a bidirectional architecture.

Both features are essentially what they sound like, though the details are important.

RNNs for arXiv/viXra

Sticking to the theme of starting simple, I analyze the performance of single-layer, uni-directional GRUs on arXiv/viXra title data Experiments with additional layers and bidirectionality only led to modest improvements upon the results reported below. .

Character- or Word-Level?

One needs to decide how exactly to encode the text as a series of tensors x_t. The natural options are as follows:

I use both options. For embeddings, there are various choices to make. One has to choose both the dimension of the embedding space The embeddings are stored in a (V, embedding_dim)-shaped matrix E . Empirically, choosing embedding_dim \sim \mathcal{O}(10^2) apparently works well on most NLP tasks and choosing a dimension much larger than this only has a weak effect on performance, if any. There has been progress in understanding this finding on a theoretical level by analyzing the so-called Parwise Inner Product (PIP) loss which is the Frobenius norm of the difference between the inner-products along the embedding_dim -sized dimension of two differnent embedding matrices: {\rm Loss} = || E_1 \cdot E_1^T - E_2 \cdot E_2^T|| This loss is a measure of the inherent similarity between the two embeddings since, for instance, if E_1 and E_2 only differed by a rotation in the embedding space, then the loss would vanish. The referenced paper finds a bias-variance decomposition for the loss between the ideal ("oracle") embedding matrix, E , and a estimator thereof, \hat{E}, and demonstrates how these contributions grow and shrink as the embedding_dim of \hat{E} varies with respect to that of E. , denoted by embedding_dim, and one might also choose to only work with a subset of the vocabulary, as the vast majority of the words in a corpus are captured in a tiny fraction of the vocabulary, a general phenomenon known as Zipf's law. See the plot below. I set embedding_dim = 256 and work with the whole V\sim \mathcal{O}(2 \times 10 ^4)-sized vocabulary for arXiv/viXra titles.

Plot of words versus vocab fraction removed.
One might excise rare words from the vocabulary, replacing all such outside-of-vocab words with <UNK>, a special token. This plot demonstrates the fraction of all words in arXiv/viXra titles that would be replaced by <UNK> (y-axis) if we were to remove various fractions of the total vocabulary size, rarest words first (x-axis). For instance, cutting the vocabulary size in half would only affect \sim5\% of all words in all titles. The different points come from discarding all words in in the vocabulary which appear fewer than min_word_count times in the training data.

Architecture Details

The recurrent models are all fairly simple: the GRU consumes the one-hot-encoded or embedded text as inputs and the ensuing hidden states (which are optionally first passed through a dropout layer) are fed into a single fully-connected layer to predict a single number, the probability that a given title comes from a viXra paper. I use hidden_size = 512 for all models.

The python code for the relevant nn.LightningModule subclasses can be found in the arxiv_vixra_models package. The code also accommodates the additional bells and whistles listed above and further features not detailed here.

An important choice is precisely what data from the GRU hidden states is passed to the feed-forward layers. While we could pass in only the hidden state from the final time step, since that is the unique state which has seen the entire title text, this choice poses the risk of missing out on important information that may have appeared early on in the title but which has faded out of the hidden state with time. So, instead of passing in this final time-step (output[:, -1]), one might instead use the maximum across all time steps for each hidden_size dimension (output.max(dim=1)), the similar mean across all time-steps (output.mean(dim=1)), or even concatenate all three options together, as suggested in the ULMFiT paper. Why be so rigid in choosing which portions of the hidden state to pass to the feed-forward network? Why not allow the architecture to dynamically choose which parts of the text to focus on by, say, using more finely-weighted averages of the different time-steps? Such an idea is a simple version of an attention mechanism, which will be explored in a later post.

The arxiv_vixra_models code accommodates each of these options and empirically the max option tended to perform best, though difference between the strategies were not large.

Performance and Interpretation

Validation Set

The one-hot and embedding-space recurrent architectures both achieved \approx 83\% accuracy on the validation data, which is significantly better than the \mathcal{O}(70\%) peformance of the simple baseline models discussed previously. Of the two models, the embedding-space architecture performed very slightly better. The ROC curve Receiver Operating Characteristic (ROC) curves plot the true-positive-rate (often called precision) on the y-axis and false-positive-rate on the x-axis as the threshold for what constitutes a positive prediction is varied. (In the context of arXiv/viXra, P({\rm pred. = viXra}|{\rm source =viXra}) is the vertical and P({\rm pred. = viXra}|{\rm source =arXiv}) is the horizontal.) By threshold, I am referring to the fact that classification models are assigning a score to every example and this score is used to set the prediction boundary. In the canonical case, the score is given by the logits and one usually declares that positive and negative logits correspond to a positive and negative predictions, respectively, but one could also consider moving this boundary away from zero. As the threshold is varied, the relative true- and false-positive-rates change, generating the below curve. The top-right corner corresponds to sending the threshold to -\infty (classifying all positive cases correctly, but also generating many false-positives) and the bottom-left corner is the opposite limit.

Denoting the abstract score assigned to an example by s, the positive and negative examples belong to separate distributions, \rho_1(s) and \rho_0(s), respectively. The area under the curve (AUC) has a lovely, precise meaning: it is the probability that the score the model assigns to a randomly chosen positive example will be higher than that of a randomly chosen negative example, {\rm AUC} = \int_{-\infty}^{\infty}{\rm d} s_1{\rm d} s_0\,\theta(s_1 - s_0) \rho_1(s_1)\rho_0(s_0)= P[S_1 \ge S_0]\ .
for the embedding model can be found below. A disadvantage of the present architectures relative to the baseline models, however, is that they are much more opaque. What, exactly, is the recurrent architecture picking up on that was missed by the simple baselines?

ROC curve for the embedding GRU
The ROC curve for the best-performing embedding model. AUC is, roughly, a measure of how well-separated the logit-space distributions of arXiv and viXra papers are, as generated by the model.

Interpretation

The simplicity of the present models allows us to see a glimpse of what is going on, though it is hard to draw any very precise conclusions. In particular, the single number that our model predicts (the probability that a given title is from a viXra paper) arises from a dot-product between the final dimension of the relevant components of the GRU output tensor and an appropriately shaped vector. For example, if we only feed the hidden states from the final GRU time-step into the fully connected layer, then this (batch_size, 512)-shaped tensor is turned into logits by taking a dot product with a (1, 512)-shaped weight vectorw and adding the scalar bias b to the result.

This means that by re-scaling the GRU hidden states by the appropriate weights and biases, we can get a direct view of the logits More precisely, if output is (batch_size, seq_len, hidden_size)-shaped hidden state tensor of the GRU model, wherehidden_size = 512, then the plot below shows a handful of entries along the batch_size dimension of logits_output for a particular slice of neurons along the hidden_size dimension with logits_output defined as in (sketch): w, b = model.weight, model.bias logits_output = output * w + b / output.shape[-1] When the model uses only the final time-slice to make viXra-probability predictions, as in thtensor could be computed ase plot below, thenlogits_output is related to the (batch_size, )-shaped probs probability tensor via last_step_logits = logits_output.sum(dim=-1)[:, -1] probs = last_step_logits.sigmoid() . While other strategies perform better, the case where one makes predictions based on the final hidden time step is easiest to interpret. The below plot shows how various, re-scaled hidden-state neurons of one such model are pushed towards a viXra signal (light) or arXiv signal (dark) as they step through three titles from my papers. The logits in each case would come from summing up the final (rightmost) column and adding this to the similar sum of neurons not shown in the plot.

logits from the hidden state
A plot of various hidden-state neurons re-scaled by appropriate weight and bias factors as they evolve upon seeing a few of my titles (which were not included in the training data). The displayed neurons are the same in all plots and were chosen for having relatively large magnitudes in their final-time-step (right end of plot). The common, learned initial-hidden state can be seen on the far left of each figure.

Though the evolution in the model's prediction can be clearly seen in each case, it's difficult to interpret what any individual neuron is looking for (nor should such human-interpretable behavior be expected). There are some intriguing signs, but most seem to fall apart under scrutiny. For instance, the bright streak in the top image could plausibly be attuned to title length as its viXra signal grows as it sees more and more padding and terminates in a moderate viXra-leaning signal for this very short title. However, these same neurons terminate in a slightly arXiv-leaning signal for the similarly-short middle example and in the final image the neuron whiplashes back and forth upon encountering text and terminates in a moderate viXra signal for this much-longer title. Inconclusive.

My Papers

Finally, I examine the architecture's performance on my own papers. These models performed much better than the baselines when predicting the source of my own papers, thankfully, classifying 17- and 18-out-of-20 correctly for the word-embedding and one-hot models, respectively. See the figures below.

one-hot predictions
Probabilities that each of my titles are viXra, as predicted by the one-hot GRU model.
embedding predictions
Probabilities that each of my titles are viXra, as predicted by the word-embedding GRU model.

There is no longer an obvious correlation between title-length and viXra-probability, as there was for the baseline model predictions.

Acknowledgments

Thank you to the Distill team for making their article template publicly available and to the Colab, pytorch lightning, and wandb teams for their wonderful tools.

Additional Relevant Links

All Project Posts

Links to all posts in this series. Note: all code for this project can be found on my GitHub page.