Systematic Bulk Downloading of Articles from PubMed Central (PMC)

In this era of artificial intelligence (AI) and machine learning (ML), there is increased interest in accessing large numbers of full-text articles to train deep learning models and/or evaluate their performance. PubMed Central (PMC), the full-text article repository of the U.S. National Library of Medicine (NLM), is a popular choice among AI/ML researchers, who are often looking for a free, openly accessible source of the scholarly biomedical literature. For a recent example of research carried out using the PMC Open Access Subset, see PMID: 37094464.

Although the NLM is generally accommodating of researchers who use, and even build upon, the tools and resources that it develops and supports, it expects researchers to work within its rules and restrictions. Anyone interested in “automated retrieval of articles in machine-readable formats in PubMed Central (PMC)” is encouraged to explore the “several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses)”. However, there are “Restrictions on the Systematic Downloading of Articles”; see https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/

When researchers try to bulk download a large amount of content on their own through the regular PMC web interface, PMC’s systems notice the increased activity and block the IP range(s) responsible. This kind of downloading violates the terms of the PMC Copyright Notice, which states that “Systematic downloading of batches of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.”

From: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/:

PMC makes certain subsets of articles (i.e., the PMC Article Datasets) accessible through auxiliary services that may be used for automated retrieval and downloading. These are:

These services are the only services that may be used for this purpose. Do not use any other automated processes for downloading articles, even if you are only retrieving articles from the PMC Article Datasets (including the PMC Open Access Subset).
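As an illustration of working through a dedicated service rather than the main PMC web site, here is a minimal Python sketch. It assumes that the PMC OA Web Service API is among the sanctioned services and that it is reachable at the URL in the code; confirm both on NLM's pages before relying on them. The function names and the sample response are hypothetical, and the sample shows only the general shape of a response (records that carry download links), not real data.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Assumed endpoint of the PMC OA Web Service API; check NLM's current documentation.
OA_SERVICE_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"

# Hypothetical response, used only to exercise the parser without network access.
SAMPLE_RESPONSE = """<OA>
  <records returned-count="1" total-count="1">
    <record id="PMC0000000" license="CC BY">
      <link format="tgz" href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/00/00/PMC0000000.tar.gz"/>
      <link format="pdf" href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/00/00/example.PMC0000000.pdf"/>
    </record>
  </records>
</OA>"""


def build_oa_query(pmcid: str) -> str:
    """Return the OA service URL that looks up a single article by PMCID."""
    return OA_SERVICE_URL + "?" + urllib.parse.urlencode({"id": pmcid})


def parse_oa_links(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Collect the download links (with license and format) listed in a response."""
    root = ET.fromstring(xml_text)
    links = []
    for record in root.iter("record"):
        for link in record.iter("link"):
            links.append({
                "pmcid": record.get("id"),
                "license": record.get("license"),
                "format": link.get("format"),
                "href": link.get("href"),
            })
    return links


def fetch_oa_links(pmcid: str) -> List[Dict[str, Optional[str]]]:
    """Query the service over the network and return the parsed links."""
    with urllib.request.urlopen(build_oa_query(pmcid), timeout=30) as response:
        return parse_oa_links(response.read().decode("utf-8"))
```

Calling parse_oa_links(SAMPLE_RESPONSE) exercises the parser offline, while fetch_oa_links contacts the live service. Keeping the license value next to each link makes it easier to respect the reuse terms attached to each article.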

Questions? Be sure to Ask Us at the MSK Library!

ChatGPT and Fake Citations: MSK Library Edition

Since the launch of ChatGPT, an artificial intelligence chatbot developed by OpenAI, we at the MSK Library have seen an uptick in requests to track down what turn out to be fake citations for studies related to cancer research.

We decided to pick a topic we were recently asked to conduct a literature search on (survival outcomes, recurrence, and pathology characteristics of poorly differentiated thyroid carcinoma) to see how ChatGPT handled it. Below are screenshots from our conversation. 

Looks pretty good, right? We asked for the full citations. 

Voilà, ChatGPT delivered! We then attempted to verify these citations. We first looked them up in databases and citation indexes like PubMed and Google Scholar. Then we checked the DOIs, or digital object identifiers. Finally, we went directly to the journals these “articles” were “published” in to see if they appeared in the same journal, issue, and volume ChatGPT cited, or if they appeared in these journals at all. These citations didn’t appear to be legitimate, so we let ChatGPT know.
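The DOI step of this check can be partly automated. The following is a minimal Python sketch, not the procedure we actually used: it assumes the DOI resolver's handle lookup at https://doi.org/api/handles/ returns JSON with a responseCode of 1 for a registered DOI and answers with HTTP 404 for an unknown one. The function names, the pattern, and the example DOI in the usage note are hypothetical.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request

# Assumed lookup endpoint of the DOI resolver; verify against doi.org documentation.
DOI_HANDLE_API = "https://doi.org/api/handles/"

# Loose syntactic check: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a leading 'doi:' or resolver URL prefix."""
    doi = raw.strip()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi, flags=re.IGNORECASE)


def looks_like_doi(raw: str) -> bool:
    """Return True if the string is at least shaped like a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))


def handle_lookup_url(raw: str) -> str:
    """Build the lookup URL for a DOI, percent-encoding unusual characters."""
    return DOI_HANDLE_API + urllib.parse.quote(normalize_doi(raw), safe="/")


def is_registered(payload: dict) -> bool:
    """Interpret a lookup response; responseCode 1 is assumed to mean 'found'."""
    return payload.get("responseCode") == 1


def check_doi(raw: str) -> bool:
    """Return True if the DOI is well formed and the resolver knows it."""
    if not looks_like_doi(raw):
        return False
    try:
        with urllib.request.urlopen(handle_lookup_url(raw), timeout=30) as response:
            return is_registered(json.load(response))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed response for an unregistered DOI
            return False
        raise
```

For example, check_doi("doi: 10.1000/xyz123") normalizes the string and then asks the resolver about it. A registered DOI is only a starting point: a fabricated citation can carry a real DOI that belongs to a different article, so the title, journal, volume, and issue still need to be compared by hand.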

ChatGPT gave the same incorrect citations again. We asked if it was fabricating this information.

Still no dice. It appeared that ChatGPT was “hallucinating.” Learn more about this phenomenon here and here

We asked ChatGPT why it was creating these fake citations, and its response was illuminating. 

Our interaction with ChatGPT isn’t surprising – it’s a large language model and not a database or citation index. ChatGPT is great for some aspects of research, but not others. Check out Duke University Libraries’ blog post ChatGPT and Fake Citations for more information. 

Learn more about AI by visiting our Artificial Intelligence guide. Need help finding evidence-based information? Ask Us.

New eJournal – Radiology: Artificial Intelligence

The Library now subscribes to the journal Radiology: Artificial Intelligence, published by the Radiological Society of North America. This eJournal highlights the emerging applications of machine learning and artificial intelligence in the field of imaging across multiple disciplines.

Other topics covered include education about AI; AI’s role in educating radiologists, referring providers, and patients; and the ethical, legal, and social issues surrounding AI.