Preprints and PubMed Version Control

As preprints become more pervasive in biomedical research, many of us may be wondering:

“How will bibliographic indexes and biomedical literature databases like PubMed be handling the version control issue presented by preprints?”

Although the introduction of preprints into PubMed is still in its pilot test phase, enough time has now passed since it began in June 2020 for some of those preprint publications to have been officially published as peer-reviewed journal articles.

How does PubMed indexing work in general?

To understand what you can expect to see in PubMed, it helps to know how PubMed indexing works in general (and has worked for many years). More than any other bibliographic index, PubMed does a terrific job of quality control. This can be attributed to its strict policy of assigning only one PMID (unique identifier) to an individual published item.

In other words, as a publication evolves from version to version, from the “Online ahead of print” or “prepub” version that the publisher might make immediately available on its website to the eventual “final published version”, only one PubMed record is created and only one PMID is assigned. For many (but not all) publishers, the PubMed record also tracks the history of how that item was processed, from manuscript submission to its release into the public scientific record.

Take for example this item:

Robilotti EV, Babady NE, Mead PA, Rolling T, Perez-Johnston R, Bernardes M, Bogler Y, Caldararo M, Figueroa CJ, Glickman MS, Joanow A, Kaltsas A, Lee YJ, Lucca A, Mariano A, Morjaria S, Nawar T, Papanicolaou GA, Predmore J, Redelman-Sidi G, Schmidt E, Seo SK, Sepkowitz K, Shah MK, Wolchok JD, Hohl TM, Taur Y, Kamboj M. Determinants of COVID-19 disease severity in patients with cancer. Nat Med. 2020 Aug;26(8):1218-1223. doi: 10.1038/s41591-020-0979-0. Epub 2020 Jun 24. PMID: 32581323.

The full PubMed citation record for this item includes dates from its history with the publisher (i.e., the dates the manuscript was received and accepted), plus the dates when the item first entered the NLM system, was indexed in PubMed, and was processed for inclusion in Medline. See:

EDAT- 2020/06/26 06:00

MHDA- 2020/08/28 06:00

CRDT- 2020/06/26 06:00

PHST- 2020/04/30 00:00 [received]

PHST- 2020/06/11 00:00 [accepted]

PHST- 2020/06/26 06:00 [pubmed]

PHST- 2020/08/28 06:00 [medline]

PHST- 2020/06/26 06:00 [entrez]

Managing the indexing in this way ensures better version and quality control, as all steps are tracked and applied to the same record (i.e., only ONE record is ever created for one published item).
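The date lines quoted above can also be read programmatically. The following is a minimal Python sketch, assuming only the "TAG- value [status]" layout visible in the lines shown for PMID 32581323; the function name parse_dates and the variable names are illustrative and are not part of any NLM tool.

```python
from datetime import datetime

# The date lines quoted above, copied from the record for PMID 32581323.
RECORD = """\
EDAT- 2020/06/26 06:00
MHDA- 2020/08/28 06:00
CRDT- 2020/06/26 06:00
PHST- 2020/04/30 00:00 [received]
PHST- 2020/06/11 00:00 [accepted]
PHST- 2020/06/26 06:00 [pubmed]
PHST- 2020/08/28 06:00 [medline]
PHST- 2020/06/26 06:00 [entrez]
"""


def parse_dates(text):
    """Return a list of (tag, status, datetime) tuples, one per date line.

    Assumes each line looks like 'TAG- YYYY/MM/DD HH:MM' with an optional
    trailing '[status]' label, as in the lines quoted above.
    """
    events = []
    for line in text.splitlines():
        if "- " not in line:
            continue  # skip blank or unrelated lines
        tag, _, value = line.partition("- ")
        status = None
        if value.endswith("]") and " [" in value:
            # Split off the bracketed label, e.g. "[received]".
            value, _, status = value[:-1].rpartition(" [")
        when = datetime.strptime(value.strip(), "%Y/%m/%d %H:%M")
        events.append((tag.strip(), status, when))
    return events


events = parse_dates(RECORD)

# Keep only the publication-history (PHST) lines, keyed by their label.
history = {status: when for tag, status, when in events if tag == "PHST"}

# Days between manuscript receipt and acceptance for this single record.
days_in_review = (history["accepted"] - history["received"]).days
print(days_in_review)
```

Because every date belongs to the same record, the whole timeline, from the publisher's receipt of the manuscript to Medline processing, can be reconstructed from one PMID.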

Note: When things are not handled in this way, as is the case with some other database vendors, you often end up with two database records for the same item. This is particularly likely if the two versions appeared in different calendar years (for example, if the ahead-of-print version appeared in December 2019 and the final published version appeared in March 2020) and the database producer “misses” the pair (i.e., does not identify the records as duplicates and purge one).
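To illustrate the kind of duplicate check described in the note, here is a small Python sketch. It is not a description of how any particular vendor works: the records, their identifiers, and the DOIs are all made up for the example, and matching on DOI is an assumption (a producer could equally match on title, authors, or other fields).

```python
from collections import defaultdict

# Hypothetical records from an imaginary database. The ids and DOIs are
# invented for illustration; they do not refer to real publications.
records = [
    {"id": "A1", "doi": "10.1234/example.001", "year": 2019, "status": "ahead of print"},
    {"id": "A2", "doi": "10.1234/EXAMPLE.001", "year": 2020, "status": "final"},
    {"id": "B1", "doi": "10.1234/example.002", "year": 2020, "status": "final"},
]


def find_duplicates(records):
    """Group record ids by normalized DOI and return the groups with more than one id.

    DOIs are case-insensitive, so they are lowercased before comparison.
    """
    by_doi = defaultdict(list)
    for rec in records:
        by_doi[rec["doi"].strip().lower()].append(rec["id"])
    return {doi: ids for doi, ids in by_doi.items() if len(ids) > 1}


duplicates = find_duplicates(records)
print(duplicates)
```

In this made-up data, records A1 and A2 describe the same item in different calendar years. A check that compared publication year alongside the identifier would treat them as distinct, which is how such pairs can be missed.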

How will PubMed indexing work in the case of preprints?

Keep in mind that dealing with preprints is still a work in progress for the National Library of Medicine (NLM), and that its cataloging policies are likely to evolve as lessons are learned from the pilot. With that caveat, below is an overview of what PubMed has been doing so far with preprints.

PubMed is essentially handling preprints the way other database vendors (those that index conference proceedings) handle meeting abstracts. Just as there is no guarantee that research presented as a conference abstract will not be added to (with new data or otherwise) if and when it appears as a peer-reviewed journal article, there is no way of ensuring that a preprint will carry exactly the same information once it appears as a final, peer-reviewed article. And so, logically, one should assume that the preprint (which by definition has not yet undergone peer review) will very likely change and improve considerably as it goes through the peer-review process.

The folks at PubMed, therefore, are creating a separate database record for the preprint and a separate record for the related journal article, each record with its own unique PMID. And because the research reported in each may not be identical (even if they may have identical titles, one could be reporting on preliminary or partial data, etc.), the two records are not being connected via citation record linking, for example, in the way that a Retraction or a Comment might be. (The preprint record, however, will appear in the results of the “Similar Articles” search algorithm and so may be brought to the readers’ attention in that way.)

The preprint citation record related to the citation above, therefore, is a unique one and looks like this in PubMed:


For more information on preprints and preprint citation records, be sure to Ask Us at the MSK Library.