The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly
inspecting your data.
This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through
thousands of examples, understanding their distribution and looking for patterns.
This quote is from Andrej Karpathy's excellent post, A Recipe for Training Neural Networks, and it is advice well-taken. Another sage piece of advice from the post: don't be a hero. Start simple and be conservative in your rate of adding bells and whistles.
Below, I summarize my results in applying this advice to the arXiv/viXra datasets. I detail the gross properties of the sets, their patterns and distinctions, and the process of filtering and normalizing the text. These last two processes serve multiple purposes:
Filtering: Removing various types of outliers in the data narrows and helps to define the scope of the
classification
problem.
Normalizing: Processing the text both prepares the data for ML models and is important for preventing
accidental cheating via undesired technical clues. A cautionary tale on this last point is provided below.
Various design choices enter into the above and I explain the decisions I made below.
Gross Properties
Data Imbalance
Note: the arXiv dataset is publicly available on Kaggle. I wrote a small Python web-scraper to collect the viXra data, which can be downloaded here as a .feather file (18MB). pandas handles .feather files natively and they are a much more efficient alternative to .csv.
There is far more data for arXiv than viXra (the ratio of papers is 50:1) and the alignment of categories is not perfect.
That is, while the two repositories cover many of the same topics, such as
Number Theory
Condensed Matter
Data Structures and Algorithms
there also exist categories that belong to arXiv or viXra alone:
Mind Science (a sub-category of Biology) (viXra only)
This provides a natural sanity check on the final models, since one expects them to have an easier time
classifying papers which belong to a category present in only one of the two repositories. In particular,
this should provide a test of a model's ability to discern semantic differences.
viXra papers are more likely to have very short or very long abstracts and titles, as in this extremely long
viXra abstract. Some outliers were also due to the web-scraping process, in which the viXra titles and abstracts
were taken directly from the paper's landing page. For instance, this article's abstract is listed simply as "1",
but inspection of the PDF source shows that a longer abstract does indeed exist.
viXra papers are more likely to be written in languages other than English. For instance, taking a balanced set
of training abstracts and filtering out those which have more than 3% of their characters outside of the set of
English and Greek characters, punctuation, and digits removes on the order of 2,000 viXra examples and only on the
order of 10 arXiv ones.
The variety of Unicode characters among the viXra papers is far wider. After forcing all text to lower-case to
normalize, arXiv titles and abstracts were composed primarily of the usual 69 printable, non-upper-case ASCII
characters. In contrast, viXra titles and abstracts were found to use 172 and 393 distinct characters, respectively.
Dead giveaways for viXra papers from either an algorithmic or human perspective include the use of Unicode Greek
characters such as ϕ or ξ or mathematical symbols such as √ or ∫ (as opposed to writing these in LaTeX).
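The 3% character filter described above could be sketched roughly as follows. This is a minimal illustration rather than the code used for the project: the helper names `unusual_fraction` and `keep_text` are hypothetical, and the allowed set is assumed to be ASCII letters, digits, punctuation, and whitespace plus the basic Greek and Coptic Unicode block.

```python
import string

# Assumption: "English and Greek characters, punctuation, and digits" is
# approximated by ASCII letters/digits/punctuation/whitespace plus the
# Greek and Coptic block (U+0370 to U+03FF).
ALLOWED_ASCII = set(
    string.ascii_letters + string.digits + string.punctuation + string.whitespace
)


def is_allowed(ch: str) -> bool:
    """Return True if the character belongs to the allowed set."""
    return ch in ALLOWED_ASCII or "\u0370" <= ch <= "\u03ff"


def unusual_fraction(text: str) -> float:
    """Fraction of characters in `text` that fall outside the allowed set."""
    if not text:
        return 0.0
    return sum(not is_allowed(ch) for ch in text) / len(text)


def keep_text(text: str, max_fraction: float = 0.03) -> bool:
    """Keep a text unless more than `max_fraction` of it is unusual."""
    return unusual_fraction(text) <= max_fraction
```

Applying `keep_text` to a title or abstract written mostly in a non-Latin, non-Greek script returns `False`, while ordinary English text containing the occasional Greek symbol is kept.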
arXiv papers also tend to be much wordier. As the graphic below demonstrates, there is a clear distinction in the
distribution of arXiv and viXra papers in terms of counting statistics such as title length or the variance in
word length in their abstracts. (Based on this plot alone, one should expect that even simple algorithms may be
able to determine the source at a reasonable rate; see Baseline Models.) Academics are a loquacious bunch. They
also tend to be long-winded, circumlocutory, logorrheic, and pleonastic.
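Counting statistics of the kind mentioned above are cheap to compute with the standard library. The sketch below is illustrative only: the function name `counting_stats` and the particular statistics returned are assumptions, and "words" are taken to be whitespace-separated tokens.

```python
import statistics


def counting_stats(text: str) -> dict:
    """Simple per-text statistics: length, word count, word-length variance."""
    words = text.split()
    word_lengths = [len(word) for word in words]
    return {
        "char_length": len(text),
        "word_count": len(words),
        # Population variance of the word lengths; 0.0 for empty text.
        "word_length_variance": (
            statistics.pvariance(word_lengths) if word_lengths else 0.0
        ),
    }
```

Computing these values for every title and abstract and histogramming them by source is one way to produce the type of comparison shown in the graphic.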
One last pattern I can't resist mentioning is Smarandache. A simple search on viXra.org for "Smarandache" yields over 6,300 results,
while the analogous search on arXiv.org gives only around
300. Simply take the arXiv/viXra quiz for a few minutes and you will likely see
multiple mentions of this name (or the associated term "Neutrosophic"). As
this Reddit commenter notes, the presence of "Smarandache" is a nearly perfect and non-trivial predictor
that the paper is from viXra.
Filtering, Normalization, and Design Choices
Filters and Technical Markers
Distinguishing arXiv and viXra papers by technical clues such as the characters they use is not particularly
interesting. Instead, being able to make the distinction based on a title or abstract's actual content (i.e.
semantic meaning) would be much preferable. At the same time, some technical markers (primarily the presence or
absence of LaTeX) do carry significant meaning and, to a researcher, can represent obvious signals as to a
paper's source.
I have attempted to strike a balance between concerns like the above by filtering out papers which are very easy
to classify for reasons such as not being in English (a strong viXra signal) or containing an inordinate number
of uncommon characters. Some examples (notebooks here):
Filter based on the prevalence of characters outside of the usual ASCII range.
Filter extreme outliers by counting statistics, such as character length, word count, fraction of numerical
characters, etc.
The above steps primarily filter out viXra articles which would be fairly trivial to classify. The examples at
garrettgoon.com/arxiv-vixra-quiz have all passed this series of filters; they were randomly chosen from the
balanced training set used for all ML models.
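As a rough sketch of the counting-statistics filter listed above, the function below flags texts whose length or fraction of numerical characters is extreme. The function names and all threshold values are hypothetical placeholders chosen for illustration, not the cutoffs used in this project.

```python
def digit_fraction(text: str) -> float:
    """Fraction of characters in `text` that are digits."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def is_outlier(
    text: str,
    min_chars: int = 20,  # hypothetical lower bound on length
    max_chars: int = 3000,  # hypothetical upper bound on length
    max_digit_fraction: float = 0.2,  # hypothetical cap on numerical characters
) -> bool:
    """Flag texts that are too short, too long, or dominated by digits."""
    n_chars = len(text)
    if n_chars < min_chars or n_chars > max_chars:
        return True
    return digit_fraction(text) > max_digit_fraction
```

With these placeholder bounds, a scraped abstract consisting only of "1" is flagged, while an ordinary sentence-length abstract passes.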
Text Normalization
Properly normalizing the remaining text is important both for avoiding accidental cheating (see the next section)
and for making a fair comparison between model and human performance.
Some examples: as can be seen in this notebook, about a third of viXra titles (and none of the arXiv titles)
contain a \r carriage return character, and 40% of arXiv titles (and none of the viXra titles) have a \n
newline character. Neither of these technical clues is directly accessible to the human participants at
garrettgoon.com/arxiv-vixra-quiz, but both would be obvious markers to an ML architecture.
For these reasons, I performed a fairly brutal and basic normalization to the text:
The unidecode package was used to convert all characters to the ASCII range.
All text had .strip() applied and all characters were forced lower-case with .lower().
Spaces were inserted around all punctuation marks and any ASCII control characters were replaced by blank spaces (and any consecutive blanks were condensed into a single space).
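The normalization steps above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: to stay within the standard library it uses `unicodedata` as a crude stand-in for unidecode (it drops characters that have no ASCII decomposition, whereas unidecode transliterates them), the function names are hypothetical, and the order of the steps is a guess.

```python
import re
import string
import unicodedata

# Any single ASCII punctuation mark, captured so it can be padded with spaces.
_PUNCT_RE = re.compile(f"([{re.escape(string.punctuation)}])")
# ASCII control characters (0x00-0x1F and 0x7F), which include \r, \n, and \t.
_CONTROL_RE = re.compile(r"[\x00-\x1f\x7f]")
_MULTI_SPACE_RE = re.compile(r" {2,}")


def to_ascii(text: str) -> str:
    """Stand-in for unidecode: decompose accents, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_text(text: str) -> str:
    """Convert to ASCII, pad punctuation, collapse blanks, strip, lower-case."""
    text = to_ascii(text)
    text = _CONTROL_RE.sub(" ", text)
    text = _PUNCT_RE.sub(r" \1 ", text)
    text = _MULTI_SPACE_RE.sub(" ", text)
    # str methods return new strings, so the result must be reassigned.
    return text.strip().lower()
```

Stripping is done last here so that the padding added around a leading or trailing punctuation mark does not reintroduce whitespace at the edges of the text.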
Cautionary Tale
If you see a plot of validation accuracy vs time like the one below (direct link here) for any non-trivial
dataset, you are probably cheating. The plot corresponds to achieving near-perfect performance on the validation
set after training for only a handful of epochs. (This plot is from a test in which a single blank space was
inserted at the end of every viXra title. After one-hot encoding and flattening the text, the ensuing vector was
fed into a logistic regression model, which produced the performance shown. This test was done to demonstrate the
ability of even relatively simple models to pick up on technical clues.) Finding the cheat may not be easy,
however.
In the course of training recurrent architectures, I generated an even more extreme version of the above when
building a classifier for abstracts.
After seeing the training examples only once, i.e., after a single epoch, the validation set accuracy was over
99.9%.
To make matters more confusing, the analogous models for classifying titles did not show similarly surprising
accuracy. While I expected
to (and eventually did) find some technical cheat as the cause, it was unclear how this could occur for the
abstracts, but not the titles, as both texts had the same normalization and filtering procedures applied to
them.
In the end, the root cause was the fact that all raw arXiv abstracts started with a blank space and ended with a
\n newline character, while none of the other datasets displayed such strong technical regularities. Due to a
typo, this leading and trailing whitespace was not properly stripped: rather than the desired s = s.strip() line
in my code, I had accidentally written a simple s.strip(), which does nothing to modify the to-be-returned s.
Because the technical clues were in whitespace, they were difficult to detect by eye, but careful exploration of
the data finally revealed the issue.
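The bug is easy to reproduce. The sketch below contrasts the two versions of the line and adds a small check for leading or trailing whitespace; the function names and the two sample abstracts are made up for illustration.

```python
def clean_buggy(s: str) -> str:
    s.strip()  # Bug: strings are immutable, so the stripped copy is discarded.
    return s


def clean_fixed(s: str) -> str:
    s = s.strip()  # Rebind the name to the stripped copy.
    return s


def edge_whitespace_rate(texts: list[str]) -> float:
    """Fraction of texts that begin or end with whitespace."""
    if not texts:
        return 0.0
    return sum(t != t.strip() for t in texts) / len(texts)


# Made-up examples mimicking the raw arXiv pattern: leading blank, trailing \n.
raw_abstracts = [" We study black holes.\n", " A new bound is derived.\n"]
print(edge_whitespace_rate([clean_buggy(a) for a in raw_abstracts]))  # 1.0
print(edge_whitespace_rate([clean_fixed(a) for a in raw_abstracts]))  # 0.0
```

Running a check like `edge_whitespace_rate` separately on each source, or printing examples with `repr()` so that whitespace becomes visible, is one way to surface this kind of clue.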
Final Datasets
After filtering, approximately 30,000 viXra and 1.7 million arXiv examples remained. In order to handle the
massive data imbalance (and make it somewhat easier to use the data), I first created an equally balanced
arXiv/viXra set with approximately 60,000 data points, with the unused arXiv data set aside for later use. The
balanced dataset was then split into train/val/test subsets in a 70:15:15 ratio.
Most of the posts in this series regard models which were trained on the balanced set. A post on the utilization
of the remaining data and handling the massive imbalance is planned for a future date.
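A balanced set and a 70:15:15 split of the kind described above can be sketched with the standard library alone. The function name, the seed, and the label strings are hypothetical, and downsampling the larger class uniformly at random is an assumption about how the balancing was done.

```python
import random


def balance_and_split(arxiv, vixra, seed=0, fractions=(0.70, 0.15, 0.15)):
    """Downsample to equal class sizes, shuffle, and split into train/val/test."""
    rng = random.Random(seed)
    n_per_class = min(len(arxiv), len(vixra))
    # Assumption: the larger class is downsampled uniformly at random.
    data = [(text, "arxiv") for text in rng.sample(arxiv, n_per_class)]
    data += [(text, "vixra") for text in rng.sample(vixra, n_per_class)]
    rng.shuffle(data)
    n_train = round(fractions[0] * len(data))
    n_val = round(fractions[1] * len(data))
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```

This sketch balances only the overall set; it does not stratify the individual splits, and it does not return the unused arXiv examples that the text sets aside.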
Acknowledgments
Thank you to the Distill team
for making their
article template publicly
available. Conversations with Matt Malloy, Thomas Schaaf, and Rami Vanguri were useful in my attempts to
find and sanity check the issue found in the "Cautionary Tale" section above.
Additional Links
I first attempted to handle the arXiv data locally on my laptop, which could not load the set into memory. dask was very useful
for lazily loading the data for processing, though I eventually switched to using Colab
notebooks, whose computing power was sufficient to simply load the sets into memory.
All Project Posts
Links to all posts in this series.
Note: all code for this project can be found on my GitHub page.