arXiv/viXra - The Data

Defining the problem and diving in.

Note: The Jupyter/Colab notebooks I used to perform much of the following analysis can be found on my GitHub page.

Wading Through

"The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns."

This quote is from Andrej Karpathy's excellent post A Recipe for Training Neural Networks, and it is advice well taken. (Another sage piece of advice from the post: don't be a hero. Start simple and be conservative in your rate of adding bells and whistles.) Below, I summarize my results in applying this advice to the arXiv/viXra datasets. I detail the gross properties of the sets, their patterns and distinctions, and the process of filtering and normalizing the text. These last two processes serve multiple purposes:

Various design choices enter into the above and I explain the decisions I made below.

Gross Properties

Data Imbalance

Note: The arXiv dataset is publicly available on Kaggle. I wrote a small Python web scraper to collect the viXra data, which can be downloaded here as a .feather file (18MB). pandas can read .feather files directly with read_feather, and they are a much more efficient alternative to .csv.

There is far more data for arXiv than viXra (the ratio of papers is 50:1) and the alignment of categories is not perfect. That is, while the two repositories cover many of the same topics, such as

there also exist categories that belong to arXiv or viXra alone:

This provides a natural sanity check on the final models, since one naturally expects them to have an easier time classifying papers which belong to a category present in only one of the two repositories. In particular, this should provide a test regarding the ability of a model to discern semantic differences.


Various patterns emerge when inspecting the data.

For one, the viXra data is far more irregular:

arXiv papers also tend to be much wordier. As the graphic below demonstrates, there is a clear distinction in the distribution of arXiv and viXra papers in terms of counting statistics such as title length or the variance in word length in their abstracts. (Based on this plot alone, one should expect that even simple algorithms may be able to determine the source at a reasonable rate; see Baseline Models.) Academics are a loquacious bunch. (They also tend to be long-winded, circumlocutory, logorrheic, and pleonastic.)

Corner plot of data statistics.
Some of the statistical differences between arXiv (blue) and viXra (orange) data. Data points are a randomly selected, equally balanced subsample of the combined dataset.
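
Counting statistics of this kind are cheap to compute. Here is a minimal sketch, assuming each paper is available as a title string and an abstract string. The function name counting_stats and the particular statistics it returns are illustrative choices, not necessarily the ones used for the plot.

```python
import statistics

def counting_stats(title, abstract):
    """Return a few simple counting statistics for one paper.

    Hypothetical helper: the statistics below are examples of the kind
    discussed in the text, not the exact set used for the corner plot.
    """
    words = abstract.split()
    word_lengths = [len(w) for w in words]
    return {
        "title_chars": len(title),
        "title_words": len(title.split()),
        "abstract_words": len(words),
        # Population variance of word length; 0.0 for an empty abstract.
        "word_len_variance": statistics.pvariance(word_lengths) if word_lengths else 0.0,
    }

stats = counting_stats("A note", "aa bbbb")
print(stats)
```

Applying such a function to every row of both datasets gives the per-paper features whose distributions can then be compared across sources.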

One last pattern I can't resist mentioning is Smarandache. A simple search on viXra for "Smarandache" yields over 6,300 results, while the analogous search on arXiv gives only around 300. Simply take the arXiv/viXra quiz for a few minutes and you will likely see multiple mentions of this name (or the associated phrase "Neutrosophic"). As this Reddit commenter notes, the presence of "Smarandache" is a nearly perfect and non-trivial predictor that the paper is from viXra.
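
How good a predictor a single keyword is can be checked directly. Here is a rough sketch using made-up toy examples. The helper names mentions_keyword and keyword_precision are hypothetical, and the labels (1 for viXra, 0 for arXiv) are a convention adopted only for this illustration.

```python
def mentions_keyword(text, keywords=("smarandache", "neutrosophic")):
    """True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)

def keyword_precision(examples):
    """Fraction of keyword-matching examples that are labeled viXra (1).

    `examples` is an iterable of (text, label) pairs. Returns None when
    nothing matches, since precision is undefined in that case.
    """
    hits = [label for text, label in examples if mentions_keyword(text)]
    return sum(hits) / len(hits) if hits else None

# Made-up toy data, for illustration only.
toy = [
    ("Neutrosophic sets and their applications", 1),
    ("On Smarandache sequences", 1),
    ("Bounds on prime gaps", 0),
]
print(keyword_precision(toy))
```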

Filtering, Normalization, and Design Choices

Filters and Technical Markers

Distinguishing arXiv and viXra papers by technical clues such as the characters they use is not particularly interesting. Instead, being able to make the distinction based on a title or abstract's actual content (i.e. semantic meaning) would be much preferable. At the same time, some technical markers (primarily the presence or absence of LaTeX) do carry significant meaning and, to a researcher, can be obvious signals of a paper's source.

I have attempted to strike a balance between these concerns by filtering out papers which are very easy to classify for reasons such as not being in English (a strong viXra signal) or containing an inordinate number of uncommon characters. Some examples (notebooks here):

The above steps primarily filter out viXra articles which would be fairly trivial to classify. (The examples in the arXiv/viXra quiz have all passed this series of filters. They were randomly chosen from the balanced training set used for all ML models.)
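
A filter on uncommon characters could look roughly like the following. This is a sketch under loud assumptions: "uncommon" is taken to mean non-ASCII, the 10% threshold is arbitrary, and the function names are hypothetical. It is not the filter actually used for this project, and it only crudely approximates a language check.

```python
def uncommon_char_fraction(text):
    """Fraction of characters outside the ASCII range (assumed proxy for 'uncommon')."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)

def keep_paper(title, abstract, max_fraction=0.1):
    """Keep a paper only if both fields are mostly ASCII.

    `max_fraction` is an arbitrary illustrative threshold.
    """
    return all(uncommon_char_fraction(t) <= max_fraction for t in (title, abstract))

print(keep_paper("Bounds on prime gaps", "We prove a bound."))
print(keep_paper("Новая теория", "We prove a bound."))
```

A threshold rather than an outright ban leaves room for the occasional accented author name or stray symbol in otherwise ordinary English text.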

Text Normalization

Properly normalizing the remaining text is important both for avoiding accidental cheating (see the next section) and for making a fair comparison between model and human performance. Some examples: as can be seen in this notebook, about a third of viXra titles (and none of the arXiv titles) contain a \r carriage return character, and 40% of arXiv titles (and none of the viXra titles) have a \n newline character. Neither of these technical clues is directly accessible to the human participants in the quiz, but both would be obvious markers to an ML architecture.
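
Clues like these are easy to surface with a per-source audit of control characters. Here is a minimal sketch, assuming the titles are held in plain Python lists keyed by source. The function names and the toy data are made up for illustration.

```python
def char_rate(texts, ch):
    """Fraction of `texts` that contain the character `ch`."""
    if not texts:
        return 0.0
    return sum(ch in t for t in texts) / len(texts)

def control_char_report(texts_by_source, chars=("\r", "\n", "\t")):
    """Map each source to the rate at which each control character appears."""
    return {
        source: {repr(ch): char_rate(texts, ch) for ch in chars}
        for source, texts in texts_by_source.items()
    }

# Made-up toy titles, for illustration only.
toy_titles = {
    "arxiv": ["a title\nwith a line break", "a plain title"],
    "vixra": ["a title\r", "another title\r"],
}
report = control_char_report(toy_titles)
print(report)
```

Any character whose rate differs sharply between sources is a candidate for removal during normalization.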

For these reasons, I performed a fairly brutal and basic normalization to the text:

Cautionary Tale

If you see a plot of validation accuracy vs time like the one below (direct link here) for any non-trivial dataset, you are probably cheating. The plot corresponds to achieving near-perfect performance on the validation set after training for only a handful of epochs. Finding the cheat may not be easy, however.

This plot is from a test in which a single blank space was inserted at the end of every viXra title. After one-hot encoding and flattening the text, the ensuing vector was fed into a logistic regression model, which produced the performance shown. This test was done to demonstrate the ability of even relatively simple models to pick up on technical clues.
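
The trailing-space test can be reproduced in miniature. The sketch below is not the original experiment: it uses eight made-up titles, a hand-written logistic regression trained by batch gradient descent, and a right-aligned window of the last 8 characters so that the final character always lands in the same position. The window width, learning rate, epoch count, and padding symbol are all assumptions made for this illustration.

```python
import math

PAD = "\0"  # hypothetical padding symbol that never occurs in a title

def one_hot_flatten(text, vocab, width):
    """Right-align the last `width` characters of `text`, one-hot encode
    each position over `vocab`, and flatten into one list of 0/1 floats."""
    window = text[-width:].rjust(width, PAD)
    vec = [0.0] * (width * len(vocab))
    for pos, ch in enumerate(window):
        vec[pos * len(vocab) + vocab[ch]] = 1.0
    return vec

def predict_proba(w, b, x):
    """Logistic regression probability of the positive (viXra) class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=500, lr=0.5):
    """Plain batch gradient descent on the mean logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errs = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        b -= lr * sum(errs) / n
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
    return w, b

# Made-up toy titles, for illustration only.
arxiv_titles = ["bounds on prime gaps", "a note on graph colorings",
                "deep learning for jet tagging", "entropy of black holes"]
vixra_titles = ["a proof of the riemann hypothesis", "unified theory of everything",
                "on the nature of time", "gravity as an emergent force"]

# Inject the leak: a single trailing space on every viXra title.
texts = arxiv_titles + [t + " " for t in vixra_titles]
labels = [0.0] * len(arxiv_titles) + [1.0] * len(vixra_titles)

WIDTH = 8
vocab = {ch: i for i, ch in enumerate(sorted(set("".join(texts)) | {PAD}))}
X = [one_hot_flatten(t, vocab, WIDTH) for t in texts]
w, b = train_logreg(X, labels)

accuracy = sum((predict_proba(w, b, x) > 0.5) == (y == 1.0)
               for x, y in zip(X, labels)) / len(X)
# Weight attached to "the last character is a space".
space_weight = w[(WIDTH - 1) * len(vocab) + vocab[" "]]
print(accuracy, space_weight)
```

With this little data the model would fit the training set even without the leak, so the number to watch is the weight on the final-space feature, which the model is free to lean on. In a realistic setting the same check would be run on a held-out validation set.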

In the course of training recurrent architectures, I generated an even more extreme version of the above when building a classifier for abstracts. After seeing the training examples only once, i.e., after a single epoch, the validation set accuracy was over 99.9%.

To make matters more confusing, the analogous models for classifying titles did not show similarly surprising accuracy. While I expected to (and eventually did) find some technical cheat as the cause, it was unclear how this could occur for the abstracts, but not the titles, as both texts had the same normalization and filtering procedures applied to them.

In the end, the root cause was the fact that all raw arXiv abstracts started with a blank space and ended with a \n newline character, while none of the other datasets displayed such strong technical regularities. Due to a typo, this leading and trailing whitespace was not properly stripped: rather than the desired s = s.strip() line in my code, I had accidentally written a bare s.strip(), which does nothing to modify the to-be-returned s. Because the technical clues were in whitespace, they were difficult to detect by eye, but careful exploration of the data finally revealed the issue.
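
The bug is easy to reproduce because Python strings are immutable: str.strip() returns a new string and leaves the original untouched. The function names below are hypothetical, and the final helper is just one possible way to catch this class of leak.

```python
def normalize_buggy(s):
    s.strip()      # result is discarded; s itself is unchanged
    return s

def normalize_fixed(s):
    s = s.strip()  # rebind s to the stripped copy
    return s

def has_edge_whitespace(s):
    """True if the string starts or ends with whitespace."""
    return s != s.strip()

raw = " An abstract with a leading space and a trailing newline.\n"
print(repr(normalize_buggy(raw)))
print(repr(normalize_fixed(raw)))
```

Printing with repr, or counting how many examples per source satisfy has_edge_whitespace, makes invisible whitespace visible in a way that casual reading does not.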

Final Datasets

After filtering, approximately 30,000 viXra and 1.7 million arXiv examples remained. In order to handle the massive data imbalance (and make it somewhat easier to use the data), I first created an equally balanced arXiv/viXra set with approximately 60,000 data points, with the unused arXiv data set aside for later use. The balanced dataset was then split into train/val/test subsets in a 70:15:15 ratio. (Most of the posts in this series concern models which were trained on the balanced set. A post on the utilization of the remaining data and handling the massive imbalance is planned for a future date.)
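
The balancing and splitting procedure can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name balanced_split, the fixed seed, the label convention (0 for arXiv, 1 for viXra), and the choice to downsample the larger set uniformly at random are all hypothetical.

```python
import random

def balanced_split(arxiv, vixra, ratios=(70, 15, 15), seed=0):
    """Downsample to a balanced set, then split into train/val/test.

    `ratios` are integer percentages. Returns (train, val, test, unused),
    where `unused` holds the arXiv examples left out of the balanced set.
    """
    rng = random.Random(seed)
    n = min(len(arxiv), len(vixra))
    arxiv_shuffled = rng.sample(arxiv, len(arxiv))
    arxiv_used, unused = arxiv_shuffled[:n], arxiv_shuffled[n:]
    vixra_used = rng.sample(vixra, n)

    data = [(x, 0) for x in arxiv_used] + [(x, 1) for x in vixra_used]
    rng.shuffle(data)

    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(data) * ratios[0] // 100
    n_val = len(data) * ratios[1] // 100
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test, unused

# Made-up toy identifiers standing in for papers.
toy_arxiv = [f"a{i}" for i in range(100)]
toy_vixra = [f"v{i}" for i in range(20)]
train, val, test, unused = balanced_split(toy_arxiv, toy_vixra)
print(len(train), len(val), len(test), len(unused))
```

Shuffling before splitting keeps each subset approximately, though not exactly, balanced. A stratified split would be needed to guarantee equal class counts in every subset.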


Thank you to the Distill team for making their article template publicly available. Conversations with Matt Malloy, Thomas Schaaf, and Rami Vanguri were useful in my attempts to find and sanity check the issue found in the "Cautionary Tale" section above.
