In this era of artificial intelligence (AI) and machine learning (ML), there is increased interest in accessing large numbers of full-text articles to train deep learning models and/or evaluate their performance. The U. S. National Library of Medicine (NLM)’s PubMed Central (PMC) full-text article repository is a popular choice with AI/ML researchers who are often looking for a free, openly accessible source of the scholarly biomedical literature. For a recent example of research carried out using the PMC Open Access Subset, see PMID: 37094464:
- Lin M, Hou B, Mishra S, Yao T, Huo Y, Yang Q, Wang F, Shih G, Peng Y. Enhancing thoracic disease detection using chest X-rays from PubMed Central Open Access. Comput Biol Med. 2023 Jun;159:106962. doi: 10.1016/j.compbiomed.2023.106962. Epub 2023 Apr 20. PMID: 37094464; PMCID: PMC10349296.
Although the NLM is generally accommodating of researchers using and even building upon all the tools and resources that it develops and supports, there is an expectation on the part of NLM that researchers will work within their rules and restrictions. Anyone interested in “automated retrieval of articles in machine-readable formats in PubMed Central (PMC)” is encouraged to explore the “several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses)”. However, there are “Restrictions on the Systematic Downloading of Articles”– see https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/.
When researchers try to bulk download a large amount of content via the regular PMC web interface on their own, PMC’s systems notice the increased activity and block the IP range(s) responsible as this is in violation of the terms of the PMC Copyright Notice which states that “Systematic downloading of batches of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.”
From: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/:
PMC makes certain subsets of articles (i.e., the PMC Article Datasets) accessible through auxiliary services that may be used for automated retrieval and downloading. These are:
- the PMC OAI-PMH service;
- the PMC FTP service; and
- the PMC Cloud Service.
These services are the only services that may be used for this purpose. Do not use any other automated processes for downloading articles, even if you are only retrieving articles from the PMC Article Datasets (including the PMC Open Access Subset).
Questions? Be sure to Ask Us at the MSK Library!