Google Dataset Search, a dataset-discovery tool 

With data sharing increasingly encouraged in academic research, and with more datasets being deposited in data repositories and published on the Web, it makes sense that a Web search company like Google would dedicate resources to developing a discovery tool optimized for finding datasets.

How does it work?

Google Dataset Search uses Google’s web crawl technology to find datasets that have been made available on the Web, identifying them by their metadata (standardized descriptions that owners or publishers add to their datasets). “Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable.”
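To make the idea of metadata extraction more concrete, here is a minimal Python sketch of how a schema.org "Dataset" description embedded in a web page as JSON-LD could be read. It is an illustration only and not Google's actual crawler: the sample page, the dataset name, and the helper names (JsonLdExtractor, extract_datasets) are all hypothetical, and real pages may express their metadata in other ways.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_datasets(html):
    """Return the JSON-LD records in the page whose @type is 'Dataset'."""
    parser = JsonLdExtractor()
    parser.feed(html)
    datasets = []
    for raw in parser.blocks:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed metadata rather than failing
        if isinstance(record, dict) and record.get("@type") == "Dataset":
            datasets.append(record)
    return datasets


# Hypothetical landing page for a dataset (invented for illustration).
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/",
 "@type": "Dataset",
 "name": "Example imaging dataset",
 "description": "A made-up dataset used only to illustrate the markup.",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body>Dataset landing page</body></html>
"""

found = extract_datasets(SAMPLE_PAGE)
for record in found:
    print(record["name"], "|", record.get("license"))
```

The takeaway for dataset owners is that a dataset page carrying this kind of structured description is much easier for a crawler to recognize as a dataset than a page that describes the data only in free text.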

For an in-depth overview of how Google Dataset Search has been developed – please see:

Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. “Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4 (2024).

How can you search it?

To get started with Google Dataset Search, go to https://datasetsearch.research.google.com/

If you are looking for something specific, you can refine your results by limiting the search to a particular website domain (for example, site:nih.gov) or by adding more terms to your query. You can also filter results by when the dataset was last updated, by format, by usage rights, by topic/discipline, and by whether the dataset is freely available. In addition, you can save your search results, link out to the external source website where the dataset can be downloaded, and cite a dataset by copying the citation information generated when you click the citation button (i.e., the quotation mark button).
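For readers who like to bookmark or share a search, the domain restriction described above can be sketched in a few lines of Python. This is a rough illustration under two assumptions that are not documented in this post: that results live at a "/search" path on the Dataset Search site and that the query text is passed in a "query" URL parameter. The function name build_search_url is hypothetical. If the link does not work, typing the same terms into the search box will.

```python
from urllib.parse import urlencode

# Assumed results URL; the "/search" path and "query" parameter are assumptions.
BASE_URL = "https://datasetsearch.research.google.com/search"


def build_search_url(terms, site=None):
    """Build a search link, optionally limited to one website domain."""
    query = terms if site is None else f"{terms} site:{site}"
    return f"{BASE_URL}?{urlencode({'query': query})}"


print(build_search_url("glioma MRI", site="nih.gov"))
```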

To learn more – see:

Dataset Search Quick Start Guide –
https://newsinitiative.withgoogle.com/resources/trainings/dataset-search-quickstart-guide/

User Support Center – https://datasetsearch.research.google.com/help

Dataset Developer Page –
https://developers.google.com/search/docs/appearance/structured-data/dataset

How is it being used?

It appears that biomedical researchers have already started using Google Dataset Search in their scholarly projects. Some examples focusing on finding image datasets include:

  1. Abbad Andaloussi M, Maser R, Hertel F, Lamoline F, Husch AD. Exploring adult glioma through MRI: A review of publicly available datasets to guide efficient image analysis. Neurooncol Adv. 2025;7(1):vdae197. Epub 20250128. doi: 10.1093/noajnl/vdae197. PubMed PMID: 39877749; PMCID: PMC11773385.

  2. Rozhyna A, Somfai GM, Atzori M, DeBuc DC, Saad A, Zoellin J, Müller H. Exploring Publicly Accessible Optical Coherence Tomography Datasets: A Comprehensive Overview. Diagnostics (Basel). 2024;14(15). Epub 20240801. doi: 10.3390/diagnostics14151668. PubMed PMID: 39125544; PMCID: PMC11312046.

  3. Wen D, Khan SM, Ji Xu A, Ibrahim H, Smith L, Caballero J, Zepeda L, de Blas Perez C, Denniston AK, Liu X, Matin RN. Characteristics of publicly available skin cancer image datasets: a systematic review. Lancet Digit Health. 2022;4(1):e64-e74. Epub 20211109. doi: 10.1016/s2589-7500(21)00252-1. PubMed PMID: 34772649.

  4. Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, Keane PA, Sebire NJ, Burton MJ, Denniston AK. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51-e66. Epub 20201001. doi: 10.1016/s2589-7500(20)30240-5. PubMed PMID: 33735069.

Questions? Ask Us at the MSK Library!

Making Research Data Available on Mendeley Data When You Publish in an Elsevier Journal

Most people who are familiar with Mendeley know it as the web-based citation manager that has been around for about 15 years (owned by Elsevier since 2013) and to which MSK now has an institutional subscription. Another Elsevier product, Mendeley Data, was released in April 2016 and is “an open, free-to-use research data repository, which enables researchers to make their research data publicly available.” The tool is freely available to researchers in all disciplines and can be used either to share unpublished data privately within a research team or to upload and publish data linked to/from a published journal article.

From Elsevier Support:

“Many Elsevier journals now offer authors the ability to submit research data as part of the article submission process, and research datasets submitted in this way will be stored and independently available on Mendeley Data, linked to/from your published article. The Guide for Authors for the journal you are planning to submit to will indicate if this is available.”

For an example of what this looks like in practice – see:

Article: https://pubmed.ncbi.nlm.nih.gov/34375669/

    • Stewart JR, Lang ME, Brewer JD. Efficacy of nonexcisional treatment modalities for superficially invasive and in situ squamous cell carcinoma: A systematic review and meta-analysis. J Am Acad Dermatol. 2022 Jul;87(1):131-137. doi: 10.1016/j.jaad.2021.07.067. Epub 2021 Aug 8. PMID: 34375669.

Dataset: https://data.mendeley.com/datasets/dcvzp8y5g4/1

    • Stewart, Jacob; Lang, Margaret; Brewer, Jerry (2021), “Non-excisional treatment of SCC and SCCIS Supplemental”, Mendeley Data, V1, doi: 10.17632/dcvzp8y5g4.1

Having the option to make datasets available on Mendeley Data has several advantages, including getting around some frustrating realities of working with scholarly literature. First, not all journals can give authors unlimited space to share their research data, whether within the published article or in a Supplemental Materials section or appendices (which may or may not be offered). Second, if the journal is behind a paywall, a reader without a paid subscription who needs the datasets will generally have to request the supplemental materials separately via inter-library loan (ILL), because the supplemental materials of an article are not typically supplied by default via ILL, only by special request.

As such, having an open, independent place online where readers can easily access related datasets makes it more likely that they will consult them if a question arises while reading the research paper. Mendeley Data also assigns published datasets persistent DOIs (digital object identifiers) and provides usage metrics through integration with Plum Analytics. Furthermore, every published dataset in the repository has its own metadata, so it can be searched and discovered independently of the published paper, making it more likely to be found and potentially re-used and properly cited by other researchers.
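One practical benefit of a persistent DOI is that it can always be turned into a working link through the doi.org resolver, even if the dataset's landing page moves. The short Python sketch below shows that conversion using the DOI of the example dataset above. The function name doi_to_url is hypothetical, and the prefix handling covers only a few common ways a DOI is written.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Turn a DOI as written in a citation into a doi.org resolver link."""
    cleaned = doi.strip()
    # Drop a leading "doi:" label or resolver prefix if the citation includes one.
    for prefix in ("doi:", "https://doi.org/", "http://dx.doi.org/"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
    # Percent-encode unusual characters but keep the "/" between prefix and suffix.
    return "https://doi.org/" + quote(cleaned, safe="/")


print(doi_to_url("doi: 10.17632/dcvzp8y5g4.1"))
```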

Learn more:

Swab, M. Mendeley Data (2016). Journal of the Canadian Health Libraries Association, 37 (3), pp. 121-123. https://journals.library.ualberta.ca/jchla/index.php/jchla/article/download/28162/20988

Garcia Morgado, J. Open data – How to make the data available with Mendeley Data (2019). XVIII Workshop REBIUN de Proyectos Digitales/VIII Jornadas OS Repositorios, September 25-27, 2019, León [Online]. Available at https://buleria.unileon.es/handle/10612/11221

Haak W, García Morgado J, Rutter J, Zigoni A, Tucker D. Mendeley Data. Research Data Sharing and Valorization: Developments, Tendencies, Models: Wiley; 2022. p. 153-73.
https://onlinelibrary.wiley.com/doi/abs/10.1002/9781394163410.ch9

Questions? Ask Us at the MSK Library!