
“The General Index”: New tool allows you to search 107 million research papers for free

The creator of the index called it a public utility for accessing the “vast ocean” of human knowledge.
Key Takeaways
  • Millions of research papers get published every year, but the majority lie behind paywalls.
  • A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers.
  • Unlike databases that include the full text of research papers, the General Index only lets users access snippets of content.

A new database aims to make it easier than ever to access and search through the world’s massive trove of research papers. 

Each year, millions of scientific and academic papers get published across thousands of journals. The majority of those papers lie behind paywalls, costing $9 to $30 (or more) to read. Finding them can be difficult: Tools like Google Scholar allow you to search for paper titles and keywords, but more specialized queries are difficult. 

The General Index was designed to reduce those obstacles without breaking the law. Developed by the technologist Carl Malamud and his nonprofit foundation Public Resource, the free-to-use index contains words and phrases from more than 107 million research papers, comprising 8.5 terabytes when compressed.

The General Index includes text from paywalled papers but not the whole text — only phrases up to five words long. This cut-off point was designed to keep the project in good legal standing. (The act of uploading millions of paywalled papers may prove more legally ambiguous.)
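To make the five-word limit concrete, here is a minimal sketch of how an index of short phrases could be built. It is an illustration only: the function names, the whitespace tokenization, and the placeholder identifiers are assumptions, not a description of how the General Index was actually produced.

```python
from collections import defaultdict


def ngrams(text, max_n=5):
    """Return every run of 1..max_n consecutive words in the text."""
    # Assumption: lowercase and split on whitespace; real tokenization may differ.
    words = text.lower().split()
    phrases = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            phrases.append(" ".join(words[start:start + n]))
    return phrases


def build_index(papers, max_n=5):
    """Map each phrase to the set of paper identifiers that contain it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for phrase in ngrams(text, max_n):
            index[phrase].add(paper_id)
    return index


# Hypothetical identifiers and text, used only for illustration.
papers = {
    "paper-1": "CRISPR gene editing in rice",
    "paper-2": "Gene editing ethics",
}
index = build_index(papers)
print(sorted(index["gene editing"]))
```

Because no stored phrase is longer than five words, an index like this can point a reader to the papers that mention a phrase without reproducing the papers themselves.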

The searchable content within the General Index includes:

  • Billions of keywords (e.g., specific types of plants, genes, and materials)
  • Paper titles
  • Authors of research papers
  • DOI article identifiers

Malamud described the index as a tool for mining the “vast ocean” of the world’s accumulated knowledge.

“This is a look-up tool, a dictionary of knowledge, a map to knowledge,” Malamud said in a video. “A tool that we believe is an essential facility for the practice of science in our modern age. […] We view this as a public utility. We assert no ownership over the General Index. It is dedicated to the Public Domain — a series of unencumbered facts with which you can do what you will. There are no rights reserved.”

Should research papers be free?

The high cost of accessing research papers has long been controversial in the scientific community. Universities sometimes pay more than $10 million for an annual subscription to a suite of academic journals. Some of that money ends up going to nonprofits like the Massachusetts Medical Society, the American Medical Association, and the American Geophysical Union, and revenue is also sometimes used to fund student travel and other costs associated with institutional research.

However, the bulk of the revenue ends up in the pockets of major publishers. These for-profit companies, like Elsevier and Wiley, do not directly produce the research they publish; in fact, researchers often have to pay thousands of dollars to get published in major journals. The value that publishers bring to the table, in theory, is quality control through curation and peer review, functions that are not free.

But some in the community argue that research should be free to the public, and that the steep cost of accessing papers holds back scientific progress. That is the ethos behind the open-access movement. One key figure in the movement is the Kazakhstani computer programmer Alexandra Elbakyan. In 2011, she created Sci-Hub, an online database, or “shadow library,” that lets anyone with an internet connection access millions of research papers and books for free. 

Some considered Sci-Hub to be an altruistic tool for advancing scientific knowledge and research. But publishers considered it scientific piracy. The general argument was that Elbakyan had stolen not only the text of journal articles but also the time and expertise of editors and reviewers, not to mention the costs associated with uploading and archiving all of the papers.

In 2015, Elsevier, which owns thousands of academic journals that generate more than $1 billion annually, sued Elbakyan for copyright infringement. She wrote a letter to the judge describing how she found it “insane” that she, as a graduate student, had to pay $32 per paper “when you need to skim or read tens or hundreds of these papers to do research.” 

“Authors of these papers do not receive money,” Elbakyan wrote. “Why would they send their work to Elsevier then? They feel pressured to do this, because Elsevier is an owner of so-called ‘high-impact’ journals. If a researcher wants to be recognized, make a career — he or she needs to have publications in such journals.”

In an opinion piece published in The New York Times, Elbakyan was quoted citing part of the United Nations' Universal Declaration of Human Rights: “Everyone has the right to freely share in scientific advancement and its benefits.”

A more modest step toward open access

Although the General Index is far from an act of piracy, it is still unclear whether it will face any legal challenges. Malamud told Nature News that he is “very confident” in the legality of his project. Over time, he and his colleagues hope to add new features to the database, such as one that shows how important certain terms are in the overall literature, a metric known as term frequency-inverse document frequency (TF-IDF).
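For readers unfamiliar with the metric, the following is a rough sketch of one common TF-IDF variant: a term's share of a document multiplied by the logarithm of the total number of documents divided by the number of documents containing the term. The function name and sample data are hypothetical, and nothing here is meant to suggest how the General Index team would implement the feature.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document.

    docs maps a document identifier to its list of terms.
    Returns a dict of document identifier -> {term: score}.
    """
    n_docs = len(docs)

    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        scores[doc_id] = {
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical sample data, used only for illustration.
docs = {
    "paper-1": ["gene", "editing"],
    "paper-2": ["gene", "ethics"],
}
scores = tf_idf(docs)
print(scores["paper-1"])
```

In this toy example, "gene" appears in every document and so scores zero, while "editing" appears in only one and scores higher. That is the sense in which the metric highlights terms that are distinctive rather than merely frequent.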

“If we are to stand on the shoulders of giants, we must provide these maps to that vast world of ideas,” Malamud said in a video. “The General Index is but one tool.”

