Last month, the MSK Data Catalog hit 1,000 records, marking a milestone in the advancement of data discoverability at MSK. As the Research Data Management team here at the MSK Library looks back on the work we’ve done with the catalog thus far, we’d like to share some of the insights we’ve gained. We also encourage all members of the research community at MSK to reach out to us if you’d like to learn more about how we can collaborate to help promote your research data.
The MSK Data Catalog is a searchable and browsable online collection of metadata-only records describing the contents of datasets created or utilized by MSK researchers. Catalog records include enriched metadata and provide access instructions for those wishing to explore the data for their own research.

How do we identify data for inclusion in our catalog?
The way catalogers find data has shifted over time as we exhaust various “pools” of data findable by different methods. Our early cataloging methods relied on datasets cited in publications by MSK authors, which meant searching for and scouring Data Availability Statements in PubMed and PMC. We soon realized that this approach excluded data not explicitly cited and linked to in publications. The lack of consistent standards applied by publishers to Data Availability Statements also made this an onerous task. As a result, we shifted tactics to searching public repositories commonly used by MSK researchers for data storage. This new strategy allowed us to focus on the consistent metadata structure within each repository, develop discovery strategies for bulk retrieval, and streamline workflows for each one. It also presented new problems, as inconsistent metadata made it difficult to determine how much MSK-affiliated data many of the repositories held. “Affiliation” itself is a problematic field: it is infrequently recorded, has little standardization, is often excluded from indexing, and is rife with misspellings. Ultimately, we have found that we get the best results by combining these two searching methods, taking advantage of both the more consistent institutional affiliations associated with publications and the more precise descriptions available in data repositories. We are still seeking methods for refining this process, including partnering with departments, individuals, and core facilities at MSK to identify valuable research data earlier in its lifecycle.
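For readers curious what tolerating misspelled affiliations and combining the two search methods might look like in practice, here is a minimal Python sketch. It is an illustration only, not a description of our actual workflow: the list of name variants, the function names, the 0.85 similarity cutoff, and the dataset identifiers are all assumptions made up for the example.

```python
import difflib
import re

# Hypothetical list of affiliation variants, used only for this illustration.
KNOWN_NAMES = [
    "memorial sloan kettering cancer center",
    "memorial sloan kettering",
    "sloan kettering institute",
]


def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def looks_like_msk(affiliation, cutoff=0.85):
    """Return True if any comma- or semicolon-separated part of the
    affiliation contains, or closely resembles, a known name variant."""
    for part in re.split(r"[,;]", affiliation):
        segment = normalize(part)
        if not segment:
            continue
        # Exact containment catches hyphenation and punctuation differences.
        if any(name in segment for name in KNOWN_NAMES):
            return True
        # Fuzzy matching (assumed cutoff) catches simple misspellings.
        if difflib.get_close_matches(segment, KNOWN_NAMES, n=1, cutoff=cutoff):
            return True
    return False


def combine_candidates(from_publications, from_repositories):
    """Merge both pools of dataset identifiers, recording which search
    method surfaced each one."""
    found = {}
    pools = (("publication", from_publications), ("repository", from_repositories))
    for source, dataset_ids in pools:
        for dataset_id in dataset_ids:
            found.setdefault(dataset_id, set()).add(source)
    return found
```

In this sketch, `looks_like_msk("Dept. of Medicine, Memorial Sloan Ketering Cancer Center, New York, NY")` would flag the record despite the misspelling, and `combine_candidates` keeps track of whether a dataset was found through a publication, a repository search, or both. A fuzzy cutoff like this trades missed records against false positives, so flagged records would still need review by a cataloger.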
How do we enrich data?
The data catalog adds layers of enriched metadata to each dataset, providing easily searchable catalog records for MSK-affiliated data. By connecting data to their publications, we make it easier for authors to find data supporting the articles useful to their research; by cataloging data not cited in publications, we make the data easier to find, reuse, and cite. To this end, we have been leveraging the bibliographic resources in the MSK Library’s institutional authors and publications database, Synapse. We have implemented bi-directional linking between the systems to highlight the many-to-many relationship between data and publications. We further enrich the metadata in the MSK Data Catalog with information already included in the data repository record, as well as librarian-supplied fields from standardized sources and controlled vocabularies such as Medical Subject Headings (MeSH terms). Making connections between data, creators, publications, and institutions in a searchable database improves overall findability for the datasets in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. As the catalog continues to grow, articulating these interwoven, many-to-many relationships will add increasing value to this “web” of related resources.
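The many-to-many, bi-directional linking described above can be pictured with a small Python sketch. This is a simplified illustration under our own assumptions, not the way the MSK Data Catalog or Synapse is actually implemented; the class name, method names, and identifiers are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Toy index holding dataset-to-publication links in both directions."""

    def __init__(self):
        self._pubs_by_dataset = defaultdict(set)
        self._datasets_by_pub = defaultdict(set)

    def link(self, dataset_id, publication_id):
        # Record the link on both sides so either record can show the other.
        self._pubs_by_dataset[dataset_id].add(publication_id)
        self._datasets_by_pub[publication_id].add(dataset_id)

    def publications_for(self, dataset_id):
        """Publications linked to a dataset, in sorted order."""
        return sorted(self._pubs_by_dataset.get(dataset_id, ()))

    def datasets_for(self, publication_id):
        """Datasets linked to a publication, in sorted order."""
        return sorted(self._datasets_by_pub.get(publication_id, ()))


# Hypothetical identifiers: one dataset supports two articles,
# and one article draws on two datasets.
index = LinkIndex()
index.link("dataset-1", "article-A")
index.link("dataset-1", "article-B")
index.link("dataset-2", "article-A")
```

Because each link is stored on both sides, a reader starting from a dataset record can reach every related article, and a reader starting from an article can reach every related dataset.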
What are the challenges of cataloging data, and how are we responding?
A lack of widely used metadata standards, highly variable levels of searchability in data repositories, limited normalization of fields between repositories, and the sheer volume of data produced by biomedical research all contribute to the difficulty of data cataloging. Selecting research data to include also poses inherent challenges, because datasets are versioned and it is hard to pinpoint the appropriate point in the data lifecycle at which they should be captured. Some of these challenges can be mitigated by outreach to researchers through education on best practices for data management; others are being partially addressed via the new NIH Data Management and Sharing Policy and the uptick in Data Availability Statement adoption by publishers. We are still in the process of finding solutions for problems of scale. Implementing automation where possible and adding members to our cataloging team would both help us address the volume of MSK-affiliated data to be cataloged, but our current resources and technological support are limited, and research data production continues to grow rapidly. Continuing to cast a wide net and use multiple methods in our searches will make data easier to find, as will continuing to encourage researchers to deposit data in repositories that have useful search functions.
What comes next?
Confronted with these issues, we face a daunting task in tracking the use and reuse of data. As we review our trajectory thus far, however, we feel newly invigorated to continue making improvements to our cataloging process. We have recently begun an evaluation of our current catalog system and potential alternatives to ensure that we are working with a scalable, well-supported system. In the next phase of the data catalog, we hope to introduce automation and streamlining to cut down the time it takes to produce a record. This includes pursuing new integrations with external systems such as MeSH, as a supplement to our current integrations. We also hope that the next phase of the data catalog will provide better analytics, giving us insight into how users find and interact with the data records. Finally, we will continue and expand our outreach efforts to researchers and repositories. Educating data producers and stewards about the best practices for describing, storing, and sharing data will help create information infrastructure for future cataloging. Ultimately, our goal with the data catalog is to help elevate data as a product of the research process, with all the dedicated resources and information systems that may entail.
Celebration and call to action!
The data catalog is intended to be used by both MSK-affiliated and external researchers searching for previously published datasets, or for a way to enhance the findability of their own data. We encourage researchers to suggest their own datasets for cataloging. Our team will provide enhanced metadata as well as link the data records to their relevant publications in Synapse. These tools are a foundation for future measures of promotion and usage analytics for the data catalog. Join us in celebrating this milestone of 1,000 records and find out how you can participate in elevating the value of MSK research data!
- MSK Library Research Data Management Team:
- Anthony Dellureficio, Associate Librarian for Data Management Services
- Klara Pokrzywa, Data Management Librarian