Systematic Bulk Downloading of Articles from PubMed Central (PMC)

In this era of artificial intelligence (AI) and machine learning (ML), there is increased interest in accessing large numbers of full-text articles to train deep learning models and/or evaluate their performance. The U.S. National Library of Medicine (NLM)’s PubMed Central (PMC) full-text article repository is a popular choice with AI/ML researchers, who are often looking for a free, openly accessible source of the scholarly biomedical literature. For a recent example of research carried out using the PMC Open Access Subset, see PMID: 37094464.

Although the NLM is generally accommodating of researchers using and even building upon the tools and resources that it develops and supports, NLM expects researchers to work within its rules and restrictions. Anyone interested in “automated retrieval of articles in machine-readable formats in PubMed Central (PMC)” is encouraged to explore the “several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses)”. However, there are “Restrictions on the Systematic Downloading of Articles”; see https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/

When researchers try to bulk download a large amount of content on their own via the regular PMC web interface, PMC’s systems notice the increased activity and block the IP range(s) responsible. This kind of downloading violates the terms of the PMC Copyright Notice, which states that “Systematic downloading of batches of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.”

From: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/:

PMC makes certain subsets of articles (i.e., the PMC Article Datasets) accessible through auxiliary services that may be used for automated retrieval and downloading. These are:

These services are the only services that may be used for this purpose. Do not use any other automated processes for downloading articles, even if you are only retrieving articles from the PMC Article Datasets (including the PMC Open Access Subset).
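As a rough illustration of what a request to a sanctioned service can look like, here is a minimal Python sketch that builds (but does not send) an OAI-PMH GetRecord request URL. The verb and metadataPrefix parameters come from the OAI-PMH protocol itself; the base URL, the identifier format, and the "pmc" metadata prefix are assumptions that should be checked against the current PMC documentation before use, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL of the PMC OAI-PMH service. Verify it against the
# current PMC documentation, as service addresses can change.
OAI_BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def build_oai_get_record_url(pmcid_number: int, metadata_prefix: str = "pmc") -> str:
    """Build an OAI-PMH GetRecord URL for one article (no request is sent)."""
    params = {
        "verb": "GetRecord",  # standard OAI-PMH verb
        # Assumption: identifier format used by the PMC OAI-PMH service.
        "identifier": f"oai:pubmedcentral.nih.gov:{pmcid_number}",
        "metadataPrefix": metadata_prefix,  # assumption: "pmc" requests full text
    }
    return f"{OAI_BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_oai_get_record_url(152494))
```

The sketch only assembles the URL; any real retrieval should follow the usage limits that the service documents.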

Questions? Be sure to Ask Us at the MSK Library!

New eBook: Handbook of Chemical Biology of Nucleic Acids

The Handbook of Chemical Biology of Nucleic Acids is the first to comprehensively cover nucleic acids from fundamentals to recent advances and applications. It is divided into 10 sections where authors present not only basic knowledge but also recent research. Each section consists of extensive review chapters covering the chemistry, biology, and biophysics of nucleic acids as well as their applications in molecular medicine, biotechnology and nanotechnology. 

This handbook is a valuable resource not only for researchers but also for graduate students working in areas related to nucleic acids who would like to learn more about their important role and potential applications.

Common Errors in PubMed Searches

To err is human, and that includes when we search databases. While there is always a possibility of typos and errors in searches, the chance of errors increases with the complexity and length of the search strategy used.

Most databases, including PubMed, alert a searcher to certain errors found within a search by providing “warnings”. Unfortunately, not all typos cause a functional error. In those instances the database will not provide a warning, and the user may end up with unintended search results.

Functional Errors in PubMed

Functional errors in PubMed trigger a warning because they prevent the database from running the search as written.

Quoted phrase not found in phrase index

Not all phrases (strings of words enclosed in double quotes) can be found in PubMed, due to how PubMed indexes phrases.

“progenitor cell transplantation”

The easiest solution to this error is to remove the double quotes. However, this can lead to unintended results if you are not careful. There are several things to keep in mind if you simply remove the quotation marks.

  • If you remove double quotes from a phrase not found, but are using a specific field code, the search is broadened because PubMed implies the Boolean operator AND between each word. The field code still prevents PubMed from automatically mapping the terms.

“progenitor cell transplantation”[tiab] → progenitor cell transplantation[tiab]

  • If you remove double quotes from a phrase not found, but are not using any field codes at the end of your phrase, the automatic translation by PubMed would become much broader than intended as it will add additional mapping to MeSH terms and word variations for each separate term.

“progenitor cell transplantation” → progenitor AND cell AND transplantation
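To make the implied AND concrete, here is a small Python sketch that rewrites a quoted phrase the way the example above shows. It models only the implied AND between words; it does not reproduce the additional MeSH mapping or word variations that PubMed adds, and the function names are hypothetical.

```python
def strip_phrase_quotes(phrase: str) -> str:
    """Remove straight or curly double quotes from both ends of a phrase."""
    return phrase.strip().strip('"\u201c\u201d')


def implied_and_query(phrase: str) -> str:
    """Show the AND that is implied between words once the quotes are removed."""
    words = strip_phrase_quotes(phrase).split()
    return " AND ".join(words)


if __name__ == "__main__":
    print(implied_and_query('"progenitor cell transplantation"'))
    # prints: progenitor AND cell AND transplantation
```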

PubMed recommends using proximity searching to fix this error. Proximity searching is a newer feature in PubMed that allows the user to control how close terms are to one another. In the example below it would only retrieve results in which all 3 terms were found within 3 words of one another.

“progenitor cell transplantation”[tiab:~3]
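If you build search strings programmatically, a helper along these lines can assemble the proximity syntax shown above. This is a sketch only: the function name and its defaults are hypothetical, and it simply formats the string without checking that PubMed supports proximity searching for the chosen field.

```python
def proximity_query(terms: str, field: str = "tiab", distance: int = 3) -> str:
    """Format terms as a PubMed proximity search: "terms"[field:~N]."""
    words = terms.split()
    if len(words) < 2:
        raise ValueError("a proximity search needs at least two terms")
    if distance < 0:
        raise ValueError("distance must be zero or greater")
    phrase = " ".join(words)
    return f'"{phrase}"[{field}:~{distance}]'


if __name__ == "__main__":
    print(proximity_query("progenitor cell transplantation"))
    # prints: "progenitor cell transplantation"[tiab:~3]
```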

The last and most extreme solution to this functional error is to switch to a database that does not restrict phrase searching, such as Embase.

The asterisk in your search was ignored

If you are using an asterisk as a wildcard (truncation) in a search strategy, the root word before the asterisk must have 4 or more characters.

The easiest way to fix this error is simply to lengthen the root word to at least 4 characters, then truncate to include all possible endings.
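Before running a long strategy, a quick check like the following Python sketch can flag truncated terms whose root is too short. The 4-character minimum comes from the rule above; the function name and the simple letters-and-digits pattern are assumptions, so hyphenated or otherwise punctuated roots are not handled.

```python
import re

MIN_ROOT_LENGTH = 4  # minimum characters before the asterisk, per the rule above


def short_truncations(query: str) -> list[str]:
    """Return truncated terms whose root has fewer than MIN_ROOT_LENGTH characters."""
    flagged = []
    # Simplification: a root is a run of letters/digits directly before "*".
    for match in re.finditer(r"([A-Za-z0-9]+)\*", query):
        if len(match.group(1)) < MIN_ROOT_LENGTH:
            flagged.append(match.group(0))
    return flagged


if __name__ == "__main__":
    print(short_truncations("gen* OR gene* OR transplant*"))
    # prints: ['gen*']
```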

The following term(s) were ignored:

This error is usually caused by a typo that leaves something in your search unbalanced or unpaired, such as parentheses or quotation marks, or by duplicate Boolean operators.

If you are unable to quickly locate where the issue is, go to Advanced Search and click on the ! under details. This will expand out your entire search strategy and highlight where the error is located.
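You can also pre-check a strategy before pasting it into PubMed. The Python sketch below looks for the three causes named above: unbalanced parentheses, unpaired straight double quotes, and a Boolean operator typed twice in a row. It is a rough illustration with a hypothetical function name, not a reproduction of PubMed's own parser, and it does not treat parentheses inside quoted phrases specially.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def find_unbalanced(query: str) -> list[str]:
    """Return a list of human-readable problems found in a search string."""
    problems = []
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without an opening one")
                depth = 0
    if depth > 0:
        problems.append(f"{depth} unclosed opening parenthesis(es)")
    if query.count('"') % 2 != 0:
        problems.append("odd number of double quotes")
    # Compare each token with the next one to spot a repeated operator.
    tokens = query.replace("(", " ").replace(")", " ").split()
    for prev, cur in zip(tokens, tokens[1:]):
        if prev == cur and prev in BOOLEAN_OPERATORS:
            problems.append(f"duplicate Boolean operator: {prev} {cur}")
    return problems


if __name__ == "__main__":
    for problem in find_unbalanced("(cancer OR OR tumor"):
        print(problem)
```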

Common Search Typos

Because these typos still produce a search that PubMed can run, they often do not trigger a warning, so it’s important to carefully check your search strategy to make sure everything is correct.

Boolean Operators

Boolean operators (AND, OR, NOT) must be fully capitalized. If they are not capitalized or only the first letter is capitalized, the search translates it as a term and not a Boolean operator, meaning Or would find the word Or in the record but would not OR together two terms. 
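Because PubMed gives no warning here, a simple script can help spot operators that are not fully capitalized. The Python sketch below flags any token that spells AND, OR, or NOT without being all uppercase. The function name is hypothetical, and the check can produce false positives when those words legitimately appear inside a quoted phrase.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def miscased_operators(query: str) -> list[str]:
    """Return tokens that look like Boolean operators but are not fully capitalized."""
    flagged = []
    for token in query.replace("(", " ").replace(")", " ").split():
        if token.upper() in BOOLEAN_OPERATORS and token != token.upper():
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    print(miscased_operators("neoplasms Or tumors AND therapy not surgery"))
    # prints: ['Or', 'not']
```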

If a Boolean operator is omitted, PubMed will automatically insert the AND operator. Since AND and OR produce significantly different results, an unintended AND where the search needs an OR would seriously alter the results. Because an implied AND is a legitimate search technique, PubMed gives no warning.