Preprints and PubMed Version Control

As preprints become more pervasive in biomedical research, many of us may be wondering:

“How will bibliographic indexes and biomedical literature databases like PubMed be handling the version control issue presented by preprints?”

Although the introduction of preprints into PubMed is still in its pilot test phase, enough time has now passed since it began in June 2020 for some of those preprint publications to have been officially published as peer-reviewed journal articles.

How does PubMed indexing work in general?

To get a better understanding of what you can expect to see in PubMed, it’s useful to know how PubMed indexing works in general (and has worked for many years). More than any other bibliographic index, PubMed does a terrific job of quality control. This can be attributed to its strict policy of only ever assigning one PMID (unique identifier) to an individual published item.

In other words, as a publication evolves from version to version, going from its “Online ahead of print” or “prepub” version that the publisher might make immediately available to readers on their website to the eventual “final published version”, only one PubMed record is created and one PMID is assigned for that item. The PubMed record actually tracks (in the case of many but not all publishers) the history of how this one published item is processed from the point of manuscript submission to its release into the public scientific record.

Take for example this item:

Robilotti EV, Babady NE, Mead PA, Rolling T, Perez-Johnston R, Bernardes M, Bogler Y, Caldararo M, Figueroa CJ, Glickman MS, Joanow A, Kaltsas A, Lee YJ, Lucca A, Mariano A, Morjaria S, Nawar T, Papanicolaou GA, Predmore J, Redelman-Sidi G, Schmidt E, Seo SK, Sepkowitz K, Shah MK, Wolchok JD, Hohl TM, Taur Y, Kamboj M. Determinants of COVID-19 disease severity in patients with cancer. Nat Med. 2020 Aug;26(8):1218-1223. doi: 10.1038/s41591-020-0979-0. Epub 2020 Jun 24. PMID: 32581323.

The full PubMed catalog citation record for this item includes dates from its history with the publisher (i.e., the dates the manuscript was received and accepted), plus the dates when the item first entered the NLM system, was indexed in PubMed, and was processed for inclusion in Medline. See:

EDAT- 2020/06/26 06:00

MHDA- 2020/08/28 06:00

CRDT- 2020/06/26 06:00

PHST- 2020/04/30 00:00 [received]

PHST- 2020/06/11 00:00 [accepted]

PHST- 2020/06/26 06:00 [pubmed]

PHST- 2020/08/28 06:00 [medline]

PHST- 2020/06/26 06:00 [entrez]

Managing the indexing in this way ensures better version and quality control, as all steps are tracked and applied to the same record (i.e., only ONE record is ever created for one published item).
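As a rough illustration of how these date fields can be read programmatically, here is a minimal Python sketch. It assumes the record is available as plain text with one field per line, as shown above. The function name parse_history is made up for this example and is not part of any NLM tool.

```python
from datetime import datetime

# Date fields copied from the PubMed record shown above (PMID 32581323).
RECORD = """\
EDAT- 2020/06/26 06:00
MHDA- 2020/08/28 06:00
CRDT- 2020/06/26 06:00
PHST- 2020/04/30 00:00 [received]
PHST- 2020/06/11 00:00 [accepted]
PHST- 2020/06/26 06:00 [pubmed]
PHST- 2020/08/28 06:00 [medline]
PHST- 2020/06/26 06:00 [entrez]
"""


def parse_history(text):
    """Return {status: datetime} for every PHST line in the text.

    Hypothetical helper written for this example only.
    """
    history = {}
    for line in text.splitlines():
        tag, _, value = line.partition("- ")
        if tag.strip() != "PHST":
            continue  # skip EDAT, MHDA, CRDT and blank lines
        stamp, _, status = value.partition(" [")
        history[status.rstrip("]")] = datetime.strptime(
            stamp.strip(), "%Y/%m/%d %H:%M"
        )
    return history


history = parse_history(RECORD)

# Days between manuscript receipt and acceptance at the publisher.
days_in_review = (history["accepted"] - history["received"]).days
print(days_in_review)
```

Because every step is stored on the same record, questions such as "how long did acceptance take?" can be answered from one PMID without matching up separate records.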

Note: When things are not handled in this way, as is the case with some other database vendors, you often end up with two database records for the same item. This is particularly likely if the two versions appeared in different calendar years (for example, if the ahead-of-print version appeared in December 2019 and the final published version appeared in March 2020) and the database producer “misses” the two records (i.e., does not identify them as duplicates and purge one).
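If you export search results from such a database, one simple way to spot these duplicates is to group records by DOI. The following is a minimal sketch under stated assumptions: the record layout (id, doi, year) and the example DOIs are hypothetical, and real exports will need additional matching for items that lack a DOI.

```python
from collections import defaultdict


def find_duplicate_dois(records):
    """Group records by normalized DOI; keep groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        # DOIs are case-insensitive, so compare them in lower case.
        doi = (record.get("doi") or "").strip().lower()
        if doi:
            groups[doi].append(record)
    return {doi: group for doi, group in groups.items() if len(group) > 1}


# Hypothetical records: the IDs, DOIs, and years are made up for illustration.
records = [
    {"id": "A1", "doi": "10.1000/example.123", "year": 2019},  # ahead of print
    {"id": "A2", "doi": "10.1000/EXAMPLE.123", "year": 2020},  # final version
    {"id": "B1", "doi": "10.1000/example.456", "year": 2020},
]

duplicates = find_duplicate_dois(records)
for doi, group in duplicates.items():
    print(doi, [(r["id"], r["year"]) for r in group])
```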

How will PubMed indexing work in the case of preprints?

Keeping in mind that dealing with preprints is still a work in progress for the National Library of Medicine (NLM), and that its cataloging policies may evolve as lessons are learned from the pilot, below is an overview of what PubMed has been doing so far with preprints.

PubMed is essentially handling preprints the way other database vendors (those that index conference proceedings) handle meeting abstracts. Just as there is no guarantee that research presented as a conference abstract will not be added to (with data or otherwise) if and when it appears as a peer-reviewed journal article, there is no way of ensuring that a preprint will contain exactly the same information once it appears as a final, peer-reviewed article. Logically, one should assume that the preprint (which by definition has not yet undergone peer review) will very likely be improved or changed considerably as it goes through the peer-review process.

The folks at PubMed, therefore, are creating a separate database record for the preprint and a separate record for the related journal article, each record with its own unique PMID. And because the research reported in each may not be identical (even if they may have identical titles, one could be reporting on preliminary or partial data, etc.), the two records are not being connected via citation record linking, for example, in the way that a Retraction or a Comment might be. (The preprint record, however, will appear in the results of the “Similar Articles” search algorithm and so may be brought to the readers’ attention in that way.)

The preprint citation record related to the citation above, therefore, is a unique one and looks like this in PubMed:

For more information on preprints and preprint citation records, be sure to Ask Us at the MSK Library.


Double Screening in Systematic Reviews

As anyone who has worked on a systematic review (SR) knows, screening references for the study selection stage of the SR process can be quite time consuming and labor intensive. Ideally, the screening should be done by two people working independently, so it is a lot of work – times two! It’s not surprising, therefore, that many researchers wonder:

  • if they can get away with single screening
  • if there exists some way to automate part, or all, of the screening stage

Single Screening vs. Double Screening

An August 2020 paper by Mahtani et al. explores the latest evidence on this topic (see some examples listed below) and summarizes the guidance from leading evidence synthesis organizations/producers like the Cochrane Collaboration, the Joanna Briggs Institute, the Campbell Collaboration, and the Institute of Medicine (US) Committee on Standards for Systematic Reviews of Comparative Effectiveness Research – all of whom recommend (in their handbooks and documentation) that at least two people working independently be involved in the screening process.

Mahtani KR, Heneghan C, Aronson J. Single screening or double screening for study selection in systematic reviews? BMJ Evid Based Med. 2020 Aug;25(4):149-150. doi: 10.1136/bmjebm-2019-111269. Epub 2019 Nov 13. PMID: 31722997.

Waffenschmidt S, Knelangen M, Sieben W, Bühn S, Pieper D. Single screening versus conventional double screening for study selection in systematic reviews: a methodological systematic review. BMC Med Res Methodol. 2019 Jun 28;19(1):132. doi: 10.1186/s12874-019-0782-0. PMID: 31253092; PMCID: PMC6599339.

Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002 Jun 15;21(11):1635-40. doi: 10.1002/sim.1190. PMID: 12111924. 

Conventional vs. Automated or Semi-Automated Screening

Quite a bit of research is currently being done on automating steps of the systematic review process, particularly on using AI/machine learning or text mining/natural language processing to replace the second reviewer (i.e., semi-automated screening) and/or to reduce the number of records that need to be screened. Software tools that offer relevance prediction/screening prioritization capabilities already exist (for example, Abstrackr, DistillerSR/DistillerAI, EPPI-Reviewer, and RobotAnalyst), but their performance is largely still under evaluation.

As technology improves, it’s highly likely that leaders in the evidence synthesis field will eventually accept the use of automated screening tools for study selection in systematic reviews, but we are not there yet. Progress in this area is already being made, however, as demonstrated by the creation and efforts of the International Collaboration for the Automation of Systematic Reviews (ICASR):

Beller E, Clark J, Tsafnat G, Adams C, Diehl H, Lund H, Ouzzani M, Thayer K, Thomas J, Turner T, Xia J, Robinson K, Glasziou P; founding members of the ICASR group. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst Rev. 2018 May 19;7(1):77. doi: 10.1186/s13643-018-0740-7. PMID: 29778096; PMCID: PMC5960503.

O’Connor AM, Glasziou P, Taylor M, Thomas J, Spijker R, Wolfe MS. A focus on cross-purpose tools, automated recognition of study design in multiple disciplines, and evaluation of automation tools: a summary of significant discussions at the fourth meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst Rev. 2020 May 4;9(1):100. doi: 10.1186/s13643-020-01351-4. PMID: 32366302; PMCID: PMC7199360.

Be sure to check out the MSK Library’s Systematic Review Service LibGuide or Ask Us for more information if you are thinking about embarking on a systematic review project.

Covidence: Better SR Data Quality & Integrity

The Covidence systematic review (SR) data management software is essentially a research electronic data capture tool, similar to REDCap. In an SR, however, the “study population” consists not of patients, but rather of literature database search results (i.e., references), while the “survey” administered to each “study subject” consists of the inclusion and exclusion criteria.

Unlike a typical clinical study, the systematic review study design has a unique feature: all of the information is (ideally) captured in duplicate, by two human screeners/reviewers working independently of each other. In other words, the same “survey” is administered twice to the same “study subject” and the two data captures are then compared to identify any disagreements.

This is where REDCap differs in its functionality from Covidence. Covidence not only documents the decisions of the two reviewers but it also compares them, and then automatically separates out any conflicts that need to be resolved – providing built-in quality control.

In fact, Covidence requires that reviewers address all screening discrepancies before allowing them to move on to the next stage of the review. In the full-text review stage, where explanations for exclusions must be provided, even if both reviewers vote similarly to exclude an item, Covidence will flag any exclusion reason discrepancies and force the team to resolve the conflicts before being allowed to proceed.
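To make the comparison logic concrete, here is a minimal Python sketch of the kind of check described above. It is not Covidence’s actual implementation: the Vote class, the find_conflicts function, and the sample references and exclusion reasons are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vote:
    decision: str                 # "include" or "exclude"
    reason: Optional[str] = None  # exclusion reason (full-text stage only)


def find_conflicts(reviewer_a, reviewer_b):
    """Split references voted on by both reviewers into agreed and conflicting.

    Each argument maps a reference ID to that reviewer's Vote.
    """
    agreed, conflicts = [], []
    for ref_id in sorted(reviewer_a.keys() & reviewer_b.keys()):
        a, b = reviewer_a[ref_id], reviewer_b[ref_id]
        if a.decision != b.decision:
            conflicts.append((ref_id, "decision"))
        elif a.decision == "exclude" and a.reason != b.reason:
            # Both voted to exclude, but for different reasons.
            conflicts.append((ref_id, "exclusion reason"))
        else:
            agreed.append(ref_id)
    return agreed, conflicts


# Hypothetical votes from two independent reviewers.
votes_a = {
    "ref1": Vote("include"),
    "ref2": Vote("exclude", "wrong population"),
    "ref3": Vote("exclude", "wrong study design"),
}
votes_b = {
    "ref1": Vote("include"),
    "ref2": Vote("exclude", "wrong outcome"),
    "ref3": Vote("include"),
}

agreed, conflicts = find_conflicts(votes_a, votes_b)
print(agreed)
print(conflicts)
```

In this sketch, ref2 is flagged even though both reviewers excluded it, because their exclusion reasons differ. A review would only move on once the conflicts list is empty.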

Data integrity features are also prominent in Covidence. For example, reviewers have the ability to reverse a decision (i.e., make changes to collected data). However, if the second reviewer has already voted on that item, both reviewers will have to re-screen the record from the beginning in order to re-capture both reviewers’ judgements (i.e., reversing a decision undoes all of the votes associated with the reference from that stage).

Also, in order to minimize the introduction of bias into the review process, the individual decisions made by the two reviewers are blinded to the team so that if a conflict has to be resolved by a third party, the third party will not be influenced by knowing who made which decision (as they may unconsciously side with the more senior reviewer, etc.). Although a specific batch of records cannot be assigned to a particular reviewer, a particular task in the review process can be assigned to a specific team member (for example, resolving conflicts may be set to be handled solely by the project PI).

Another feature of Covidence that leads to better data is its quality assessment and data extraction process. If two reviewers are assessing each study for bias, a comparison of assessments and consensus of judgements will be needed to complete this stage. The data extraction completed by two reviewers independently is also followed by a consensus step. If the consensus step is skipped, data will appear blank in the data export as it is only the “consensus judgements of data extraction” that can be exported to Excel. In other words, if the data is not first “cleaned” by the team, they will literally not be able to get it out of Covidence.
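The effect of skipping the consensus step can be pictured with a small Python sketch. This is only an illustration of the behavior described above, not Covidence’s export code: the field names, study IDs, and the export_consensus function are hypothetical.

```python
import csv
import io

FIELDS = ["sample_size", "outcome"]  # hypothetical extraction fields


def export_consensus(studies):
    """Write one CSV row per study, using only the consensus values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["study_id"] + FIELDS)
    for study in studies:
        # Individual reviewers' extractions are never exported directly.
        consensus = study.get("consensus") or {}
        writer.writerow(
            [study["study_id"]] + [consensus.get(field, "") for field in FIELDS]
        )
    return buffer.getvalue()


# Hypothetical studies: S1 has a consensus, S2 skipped the consensus step.
studies = [
    {
        "study_id": "S1",
        "reviewer_a": {"sample_size": 120, "outcome": "mortality"},
        "reviewer_b": {"sample_size": 120, "outcome": "mortality"},
        "consensus": {"sample_size": 120, "outcome": "mortality"},
    },
    {
        "study_id": "S2",
        "reviewer_a": {"sample_size": 85, "outcome": "mortality"},
        "reviewer_b": {"sample_size": 58, "outcome": "mortality"},
        "consensus": None,
    },
]

csv_text = export_consensus(studies)
print(csv_text)
```

Here the row for S2 is exported with blank cells, even though both reviewers entered data, because no consensus was recorded.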

Although Covidence does not include any data visualization or data/statistical analysis functionality, it does allow you to export the data in a spreadsheet. “The goal of this format is to facilitate import of data extracted in Covidence into statistical programs such as Stata, R, or SPSS.”

To learn more about Covidence, register for an upcoming workshop or Ask Us.