arXiv/viXra - The Data

Defining the problem and diving in.

Note: The Jupyter/Colab notebooks I used to perform much of the following analysis can be found on my GitHub page.

Wading Through

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns.

This quote is from Andrej Karpathy's excellent post A Recipe for Training Neural Networks, and it is advice well-taken. (Another sage piece of advice from the post: don't be a hero. Start simple and be conservative in your rate of adding bells and whistles.) Below, I summarize my results in applying this advice to the arXiv/viXra datasets. I detail the gross properties of the sets, their patterns and distinctions, and the process of filtering and normalizing the text. These last two processes serve multiple purposes:

Filtering: Removing various types of outliers in the data narrows and helps to define the scope of the classification problem.
Normalizing: Processing the text both prepares the data for ML models and is important for preventing accidental cheating via undesired technical clues. A cautionary tale on this last point is provided below.

Various design choices enter into the above and I explain the decisions I made below.

Gross Properties

Data Imbalance

Note: The arXiv dataset is publicly available on Kaggle. I wrote a small Python web-scraper to collect the viXra data, which can be downloaded here as a .feather file (18MB). pandas handles .feather files natively and they are a much more efficient alternative to .csv.

There is far more data for arXiv than viXra (the ratio of papers is 50:1) and the alignment of categories is not perfect. That is, while the two repositories cover many of the same topics, such as

Number Theory
Condensed Matter
Data Structures and Algorithms

there also exist categories that belong to arXiv or viXra alone:

Mind Science (a sub-category of Biology) (viXra only)
Distributed, Parallel, and Cluster Computing (arXiv only)
Religion and Spiritualism (viXra only)

This provides a natural sanity check on the final models, since one expects them to have an easier time classifying papers which belong to a category present in only one of the two repositories. In particular, this should provide a test of a model's ability to discern semantic differences.

Patterns

Various patterns emerge when inspecting the data.

For one, the viXra data is far more irregular:

Many more duplicate title/abstract pairs exist on viXra. There are multiple viXra examples in which the same paper was seemingly submitted in different years, such as this 2016 and this 2017 submission. The arXiv data is not free of similar issues, however: for whatever reason this (now-withdrawn, with comment) submission is an exact duplicate of this paper submitted earlier the same year.
viXra papers are more likely to have very short or very long abstracts and titles, as in this extremely long viXra abstract. Some outliers were also due to the web-scraping process, in which the viXra titles and abstracts were taken directly from the paper's landing page. For instance, this article's abstract is listed simply as "1", but inspection of the pdf source shows that a longer abstract does indeed exist.
viXra papers are more likely to be written in languages other than English. For instance, taking a balanced set of training abstracts and filtering out those which have more than 3% of their characters outside of the set of English and Greek characters, punctuation, and digits removes \mathcal{O}(2000) viXra examples and only \mathcal{O}(10) arXiv ones.
The variety of Unicode characters among the viXra papers is far wider. After forcing all text to lower-case to normalize, arXiv titles and abstracts were composed primarily of the usual 69 printable, non-upper-case ASCII characters. In contrast, viXra titles and abstracts were found to use 172 and 393 distinct characters, respectively. Dead giveaways for viXra papers from either an algorithmic or human perspective include the use of Unicode Greek characters such as ϕ or ξ or mathematical symbols such as √ or ∫ (as opposed to writing these in LaTeX).
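The 3% character filter described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function names are hypothetical, and the exact allowed set is an assumption (ASCII letters, digits, punctuation, and whitespace, plus the basic Greek Unicode block).

```python
import string

# Assumed allowed set: the post only says "English and Greek characters,
# punctuation, and digits". Whitespace is included here as a guess.
ALLOWED_ASCII = set(
    string.ascii_letters + string.digits + string.punctuation + string.whitespace
)


def is_allowed(ch: str) -> bool:
    """True for allowed ASCII or a character in the Greek block U+0370..U+03FF."""
    return ch in ALLOWED_ASCII or "\u0370" <= ch <= "\u03ff"


def unusual_fraction(text: str) -> float:
    """Fraction of characters in `text` that fall outside the allowed set."""
    if not text:
        return 0.0
    return sum(not is_allowed(ch) for ch in text) / len(text)


def keep_text(text: str, max_fraction: float = 0.03) -> bool:
    """Keep a title or abstract only if at most 3% of its characters are unusual."""
    return unusual_fraction(text) <= max_fraction
```

With these assumptions, an English title containing a stray ϕ is kept, while a title written entirely in Cyrillic is dropped.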

arXiv papers also tend to be much wordier. As the graphic below demonstrates, there is a clear distinction in the distribution of arXiv and viXra papers in terms of counting statistics such as title length or the variance in word length in their abstracts. (Based on this plot alone, one should expect that even simple algorithms may be able to determine the source at a reasonable rate; see Baseline Models.) Academics are a loquacious bunch. (They also tend to be long-winded, circumlocutory, logorrheic, and pleonastic.)

Corner plot of data statistics. — Some of the statistical differences between arXiv (blue) and viXra (orange) data. Data points are a randomly selected, equally balanced subsample of the combined dataset.
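Counting statistics of the kind compared above can be computed with nothing beyond the standard library. The sketch below is an illustration only: the function name and the particular set of statistics are assumptions, and words are taken to be whitespace-separated tokens.

```python
import statistics


def counting_stats(text: str) -> dict:
    """Simple per-text statistics; words are whitespace-separated tokens."""
    words = text.split()
    word_lengths = [len(w) for w in words]
    n_chars = len(text)
    return {
        "char_len": n_chars,
        "word_count": len(words),
        "mean_word_len": statistics.fmean(word_lengths) if words else 0.0,
        # Population variance of word lengths within this one text.
        "word_len_var": statistics.pvariance(word_lengths) if words else 0.0,
        "digit_fraction": sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0,
    }
```

Applying such a function to every title and abstract gives one row of numbers per paper, which is the kind of table a corner plot is drawn from.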

One last pattern I can't resist mentioning is Smarandache. A simple search on viXra.org for "Smarandache" yields over 6,300 results while the analogous search on arXiv.org only gives around 300. Simply take the arXiv/viXra quiz for a few minutes and you will likely see multiple mentions of this name (or the associated phrase "Neutrosophic"). As this Reddit commenter notes, the presence of "Smarandache" is a nearly-perfect and non-trivial predictor that the paper is from viXra.

Filtering, Normalization, and Design Choices

Filters and Technical Markers

Distinguishing arXiv and viXra papers by technical clues such as the characters they use is not particularly interesting. Instead, being able to make the distinction based on a title or abstract's actual content (i.e. semantic meaning) would be far preferable. At the same time, some technical markers (primarily the presence or absence of LaTeX) do carry significant meaning and can represent obvious signals to a researcher as to a paper's source.

I have attempted to strike a balance between concerns like the above by filtering out papers which are very easy to classify for reasons such as not being in English (a strong viXra signal) or containing an inordinate number of uncommon characters. Some examples (notebooks here):

Filter based on the prevalence of characters outside of the usual ASCII range.
Filter extreme outliers by counting statistics, such as character length, word count, fraction of numerical characters, etc.

The above steps primarily filter out viXra articles which would be fairly trivial to classify. (The examples at garrettgoon.com/arxiv-vixra-quiz have all passed this series of filters. The examples were randomly chosen from the balanced training set used for all ML models.)
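As a rough illustration of the second kind of filter, the sketch below drops texts whose counting statistics fall outside fixed bounds. The thresholds in BOUNDS are hypothetical placeholders (the actual cutoffs are in the notebooks and are not stated here), and the function names are likewise assumptions.

```python
# Hypothetical (lower, upper) bounds; the real thresholds are not given in the post.
BOUNDS = {
    "char_len": (20, 3000),
    "word_count": (3, 500),
    "digit_fraction": (0.0, 0.2),
}


def simple_stats(text: str) -> dict:
    """Character length, word count, and fraction of numerical characters."""
    n_chars = len(text)
    return {
        "char_len": n_chars,
        "word_count": len(text.split()),
        "digit_fraction": sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0,
    }


def within_bounds(text: str, bounds: dict = BOUNDS) -> bool:
    """True if every statistic of `text` lies inside its (lower, upper) range."""
    stats = simple_stats(text)
    return all(lo <= stats[name] <= hi for name, (lo, hi) in bounds.items())
```

Under these placeholder bounds, an abstract scraped as just "1" is rejected for being too short, which is the sort of outlier the filters are meant to catch.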

Text Normalization

Properly normalizing the remaining text is important both for avoiding accidental cheating (see the next section) and for making a fair comparison between model and human performance. Some examples: as can be seen in this notebook, about a third of viXra titles (and none of the arXiv titles) contain a \r carriage return character, and 40% of arXiv titles (and none of the viXra titles) have a \n newline character. Neither of these technical clues is directly accessible to the human participants at garrettgoon.com/arxiv-vixra-quiz, but both would be obvious markers to an ML architecture.

For these reasons, I performed a fairly brutal and basic normalization to the text:

The unidecode package was used to convert all characters to the ASCII range.
All text had .strip() applied and all characters were forced lower-case with .lower().
Spaces were inserted around all punctuation marks and any ASCII control characters were replaced by blank spaces (and any consecutive blanks were condensed into a single space).
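The three steps above might be combined along the following lines. This is a sketch under assumptions rather than the original code: the function name is hypothetical, the fallback used when unidecode is not installed is only a crude stand-in (it drops characters that unidecode would transliterate), and the final strip() is an addition to remove the padding left around leading or trailing punctuation.

```python
import re
import string
import unicodedata

try:
    from unidecode import unidecode  # the package named in the post
except ImportError:
    def unidecode(text: str) -> str:
        # Rough stand-in: decompose accents, then drop anything non-ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

_PUNCT = re.compile(f"([{re.escape(string.punctuation)}])")
_CONTROL = re.compile(r"[\x00-\x1f\x7f]")  # ASCII control characters
_SPACES = re.compile(r" +")


def normalize(text: str) -> str:
    text = unidecode(text)              # map everything into the ASCII range
    text = text.strip().lower()         # note the assignment: strip() returns a new string
    text = _CONTROL.sub(" ", text)      # control characters (e.g. \r, \n) become blanks
    text = _PUNCT.sub(r" \1 ", text)    # pad every punctuation mark with spaces
    text = _SPACES.sub(" ", text)       # condense consecutive blanks
    return text.strip()                 # assumption: drop padding at the edges
```

For example, normalize("  Hello,\r\nWorld!  ") gives "hello , world !", with no carriage return or newline left for a model to latch onto.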

Cautionary Tale

If you see a plot of validation accuracy vs time like the one below (direct link here) for any non-trivial dataset, you are probably cheating. The plot corresponds to achieving near-perfect performance on the validation set after training for only a handful of epochs. Finding the cheat may not be easy, however. (This plot is from a test in which a single blank space was inserted at the end of every viXra title. After one-hot encoding and flattening the text, the ensuing vector was fed into a logistic regression model, which produced the performance shown. This test was done to demonstrate the ability of even relatively simple models to pick up on technical clues.)

In the course of training recurrent architectures, I generated an even more-extreme version of the above when building a classifier for abstracts. After seeing the training examples only once, i.e., after a single epoch, the validation set accuracy was over 99.9%.

To make matters more confusing, the analogous models for classifying titles did not show similarly surprising accuracy. While I expected to (and eventually did) find some technical cheat as the cause, it was unclear how this could occur for the abstracts but not the titles, as both texts had the same normalization and filtering procedures applied to them.

In the end, the root cause was the fact that all raw arXiv abstracts started with a blank space and ended with a \n newline character, while none of the other datasets displayed such strong technical regularities. Due to a typo, this leading and trailing whitespace was not properly stripped: rather than the desired s = s.strip() line in my code, I had accidentally written a simple s.strip(), which does nothing to modify the to-be-returned s. Because the technical clues were in whitespace, they were difficult to detect by eye, but careful exploration of the data finally revealed the issue.
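The s.strip() typo is easy to reproduce. In the sketch below the function names are hypothetical, and edge_whitespace_rate is just one possible check (not something from the original notebooks) for spotting this kind of leak by comparing the two sources.

```python
def clean_buggy(s: str) -> str:
    s.strip()  # strings are immutable: the stripped copy is discarded
    return s


def clean_fixed(s: str) -> str:
    s = s.strip()  # rebind s to the stripped copy
    return s


def edge_whitespace_rate(texts: list) -> float:
    """Fraction of texts that begin or end with whitespace."""
    if not texts:
        return 0.0
    return sum(t != t.strip() for t in texts) / len(texts)


raw = " an arxiv abstract\n"
print(repr(clean_buggy(raw)))  # leading blank and trailing newline survive
print(repr(clean_fixed(raw)))  # whitespace removed
```

Running a check like edge_whitespace_rate separately on the processed arXiv and viXra texts would flag the problem immediately if one rate is near 1 and the other near 0.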

Final Datasets

After filtering, approximately 30,000 viXra and 1.7 million arXiv examples remained. In order to handle the massive data imbalance (and make it somewhat easier to use the data), I first created an equally balanced arXiv/viXra set with approximately 60,000 data points, with the unused arXiv data set aside for later use. The balanced dataset was then split into train/val/test subsets in a 70:15:15 ratio. (Most of the posts in this series regard models which were trained on the balanced set. A post on the utilization of the remaining data and handling the massive imbalance is planned for a future date.)
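One way to build such a balanced set and split it is sketched below. The function name, the use of uniform random subsampling for arXiv, and the fixed seed are all assumptions made for illustration; the split here is a plain random one and is not stratified by source.

```python
import random


def balanced_split(arxiv: list, vixra: list, ratios=(0.70, 0.15, 0.15), seed: int = 0) -> dict:
    """Subsample the larger source to match the smaller, then split train/val/test."""
    rng = random.Random(seed)
    n = min(len(arxiv), len(vixra))

    # Shuffle arXiv, keep n examples, and set the rest aside for later use.
    arxiv_shuffled = rng.sample(arxiv, len(arxiv))
    arxiv_kept, unused = arxiv_shuffled[:n], arxiv_shuffled[n:]
    vixra_kept = rng.sample(vixra, n)

    data = [(t, "arxiv") for t in arxiv_kept] + [(t, "vixra") for t in vixra_kept]
    rng.shuffle(data)

    n_train = round(ratios[0] * len(data))
    n_val = round(ratios[1] * len(data))
    return {
        "train": data[:n_train],
        "val": data[n_train:n_train + n_val],
        "test": data[n_train + n_val:],
        "unused_arxiv": unused,
    }
```

Fixing the seed keeps the three subsets identical across runs, so that no test example can drift into the training set between experiments.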

Acknowledgments

Thank you to the Distill team for making their article template publicly available. Conversations with Matt Malloy, Thomas Schaaf, and Rami Vanguri were useful in my attempts to find and sanity check the issue found in the "Cautionary Tale" section above.

Additional Links

I first attempted to handle the arXiv data locally on my laptop, which could not load the set into memory. dask was very useful for lazily loading the data for processing, though I eventually switched to using Colab notebooks, whose computing power was sufficient to simply load the sets into memory.

All Project Posts

Links to all posts in this series. Note: all code for this project can be found on my GitHub page.

The Data
Workflow
Baseline Models
Simple Recurrent Models
Embeddings
...in progress...
Test Set Performance and Conclusions