Systematic Bulk Downloading of Articles from PubMed Central (PMC)

In this era of artificial intelligence (AI) and machine learning (ML), there is increased interest in accessing large numbers of full-text articles to train deep learning models and/or evaluate their performance. The U.S. National Library of Medicine (NLM)’s PubMed Central (PMC) full-text article repository is a popular choice among AI/ML researchers, who are often looking for a free, openly accessible source of scholarly biomedical literature. For a recent example of research carried out using the PMC Open Access Subset, see PMID: 37094464.

Although the NLM is generally accommodating of researchers who use, and even build upon, the tools and resources that it develops and supports, it expects researchers to work within its rules and restrictions. Anyone interested in “automated retrieval of articles in machine-readable formats in PubMed Central (PMC)” is encouraged to explore the “several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses)”. However, there are “Restrictions on the Systematic Downloading of Articles”; see https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/

When researchers try to bulk download a large amount of content through the regular PMC web interface, PMC’s systems notice the increased activity and block the IP range(s) responsible. Such downloading violates the terms of the PMC Copyright Notice, which states that “Systematic downloading of batches of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.”

From: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/:

PMC makes certain subsets of articles (i.e., the PMC Article Datasets) accessible through auxiliary services that may be used for automated retrieval and downloading. These are:

These services are the only services that may be used for this purpose. Do not use any other automated processes for downloading articles, even if you are only retrieving articles from the PMC Article Datasets (including the PMC Open Access Subset).
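For readers planning a compliant workflow, here is a minimal Python sketch of how one might locate a downloadable article package through one of NLM's auxiliary services instead of scraping the main PMC web site. It assumes the PMC OA Web Service is among the permitted services and is reachable at the endpoint shown. The endpoint URL, the XML layout of the response, the placeholder ID PMC0000000, and the helper names build_oa_query and extract_links are all illustrative assumptions, not details taken from the notice above, so please check NLM's current documentation before relying on any of them. The sketch makes no network requests; it only builds a query URL and parses a made-up sample response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# ASSUMPTION: endpoint of the PMC OA Web Service. Confirm the current
# address in NLM's documentation before sending any requests.
OA_SERVICE_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def build_oa_query(pmcid):
    """Return the query URL that asks the service about one PMC ID."""
    return OA_SERVICE_URL + "?" + urlencode({"id": pmcid})


def extract_links(xml_text):
    """Collect download links from a service response.

    ASSUMPTION: the response holds <record> elements carrying 'id' and
    'license' attributes, each with <link format=... href=...> children.
    """
    root = ET.fromstring(xml_text)
    links = []
    for record in root.iter("record"):
        for link in record.iter("link"):
            links.append({
                "pmcid": record.get("id"),
                "license": record.get("license"),
                "format": link.get("format"),
                "href": link.get("href"),
            })
    return links


# Made-up sample response (placeholder ID and host), used so the sketch
# runs offline; a real response may differ.
SAMPLE_RESPONSE = """<OA>
  <records>
    <record id="PMC0000000" license="CC BY">
      <link format="tgz" href="ftp://example.org/oa_package/PMC0000000.tar.gz"/>
    </record>
  </records>
</OA>"""

if __name__ == "__main__":
    print(build_oa_query("PMC0000000"))
    for entry in extract_links(SAMPLE_RESPONSE):
        print(entry)
```

Recording the license value alongside each link is a deliberate choice: reuse terms vary from article to article, so it helps to keep them attached to every file that is retrieved.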

Questions? Be sure to Ask Us at the MSK Library!

International Open Access Week: October 23–29, 2023

Mark your calendars: International Open Access Week is just around the corner. This year’s theme is “Community over Commercialization,” providing an opportunity to emphasize and discuss approaches to open scholarship that best serve the interests of the public and the academic community.

The MSK Library promotes Open Access in several ways, including:

  • adding relevant open access journals to our current list of eJournals;
  • managing the SKOAP fund which financially supports MSK authors for article processing charges;
  • providing a curated list of open access initiatives, resources, tools, directories, and repositories;
  • and sharing information about open access specific to the needs of the Memorial Sloan Kettering (MSK) community.


New eBook: Handbook of Chemical Biology of Nucleic Acids

The Handbook of Chemical Biology of Nucleic Acids is the first to comprehensively cover nucleic acids from fundamentals to recent advances and applications. It is divided into 10 sections where authors present not only basic knowledge but also recent research. Each section consists of extensive review chapters covering the chemistry, biology, and biophysics of nucleic acids as well as their applications in molecular medicine, biotechnology and nanotechnology. 

This handbook is a valuable resource not only for researchers but also for graduate students working in areas related to nucleic acids who would like to learn more about their important role and potential applications.