arXiv/viXra - Introduction

Machine Learning the difference.

Which is which?

Here are two research paper titles:

An Efficient Lattice Algorithm for the Libor Market Model

and

Atomic Entanglement vs Photonic Visibility for Quantum Criticality of Hybrid System

One example comes from arXiv and the other from viXra.

Question: Can you tell which title above came from an arXiv paper and which is from viXra?

I guessed incorrectly for both of these titles, which can be found here and here.

A screen capture of the global results from garrettgoon.com/arxiv-vixra-quiz, taken November 2021.

Humans vs. Machines

Humans

The general task of determining the source of a given paper from, say, its title or abstract alone is non-trivial, particularly for those with little exposure to technical articles.

At garrettgoon.com/arxiv-vixra-quiz I have made a small quiz where you can test your own abilities. As of November 5, 2021, there have been over 25,000 total guesses. Participants are asked to self-identify as experts if they regularly read technical research articles. Experts guessed correctly on approximately 78% of the title questions and 71% of the abstract questions. Non-experts guessed correctly on approximately 71% and 66% of these same tasks, respectively. Approximately three-fifths of all guesses came from experts.

A summarizing graphic can also be found above.
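As a rough back-of-the-envelope check, the expert and non-expert figures quoted above can be combined into a single pooled human baseline. The sketch below assumes that the roughly three-fifths expert share applies equally to title and abstract questions. The quoted numbers do not guarantee this, so treat the result as an estimate only. The names `accuracy`, `expert_share`, and `pooled_accuracy` are chosen here for illustration.

```python
# Reported quiz accuracies from the text above, written as fractions.
accuracy = {
    "title": {"expert": 0.78, "non_expert": 0.71},
    "abstract": {"expert": 0.71, "non_expert": 0.66},
}

# ASSUMPTION: the "approximately three-fifths" expert share of all guesses
# holds separately for title questions and for abstract questions.
expert_share = 3 / 5


def pooled_accuracy(expert_acc, non_expert_acc, share=expert_share):
    """Weighted average of the two groups' accuracies."""
    return share * expert_acc + (1 - share) * non_expert_acc


pooled = {
    task: pooled_accuracy(acc["expert"], acc["non_expert"])
    for task, acc in accuracy.items()
}

for task, value in pooled.items():
    print(f"{task}: {value:.3f}")
```

Under that assumption the pooled human accuracy comes out at roughly 75% for titles and 69% for abstracts, both comfortably above the 50% expected from random guessing between two sources.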

Machines

Though humans perform significantly better than random, there is room for improvement, and surely one can teach a computer to outperform the human baseline. This particular task of text classification is one of the classic problems of Machine Learning (ML), falling under the subcategory of Natural Language Processing (NLP).

As an aside, arXiv itself uses Machine Learning models to assess its submissions. Published results include studies of plagiarism detection and the correlation between submission time and citation accumulation. Paul Ginsparg (arXiv founder) gives fascinating talks on the subject.

In the series of posts below, I describe in detail the process of building increasingly sophisticated models for this classification task. In order to limit the scope of the project, most of the models focus on the (harder) problem of classifying papers based on title alone. Along the way, I also attempt to explain various ML and Data Science concepts in my own words with the hope that others may find the explanations informative.
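To make the task concrete, here is a minimal sketch of one of the simplest possible title classifiers: a bag-of-words multinomial naive Bayes model with add-one smoothing, written in plain Python. This is not necessarily one of the models from the posts. It is only meant to illustrate the shape of the problem (title in, source label out). The class name `NaiveBayesTitleClassifier`, the whitespace tokenizer, the placeholder labels `source_a` and `source_b`, and the four toy titles are all invented for illustration and are not real arXiv or viXra data.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title and split it on whitespace."""
    return title.lower().split()


class NaiveBayesTitleClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.word_counts[label].update(tokenize(title))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, title):
        """Return the unnormalized log-probability of each label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(title):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, title):
        scores = self.log_scores(title)
        return max(scores, key=scores.get)


# HYPOTHETICAL toy data: invented titles with placeholder labels.
toy_titles = [
    "fast solver for sparse linear systems",
    "sparse solver benchmarks",
    "new theory of gravity waves",
    "gravity theory revisited",
]
toy_labels = ["source_a", "source_a", "source_b", "source_b"]

model = NaiveBayesTitleClassifier().fit(toy_titles, toy_labels)
print(model.predict("a sparse linear solver"))
print(model.predict("gravity theory"))
```

In a real experiment the two labels would be arXiv and viXra, the training set would contain many thousands of titles, and accuracy would be measured on held-out titles so that it can be compared with the human quiz results.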

Project Posts

Note: all code for this project can be found on my GitHub page.

Acknowledgments

Thank you to the Distill team for making their article template publicly available. I also gratefully acknowledge useful conversations with Matt Gormley, Matt Malloy, Thomas Schaaf, and Rami Vanguri in the course of this project.

Helpful Links and Resources

This project is my first foray into Machine Learning and I have found the following links particularly helpful: