Love Data Year Round

Love Data Week (February 14-18) is an important reminder of the role that data creation and management play in research. It’s a week when we collectively recognize the value of datasets, code, analytical tools, and all the people who engage with data to provide informational support for science.

Of course, like many of you in the MSK community, the Library loves data all year round! For the past few years, we have been building a Research Data Management program focused on services and applications that integrate with researchers’ workflows, lower administrative burdens, and encourage partnerships. Our aim is to support researchers throughout the life of their experiments, from planning to publication and beyond.

Recently ResearchDataQ, a digital platform from the Association of College and Research Libraries Digital Scholarship Section, published an editorial profile written by the Associate Librarian for Data Management Services and the Library Director at MSK about the origin of our program, our new service and application initiatives, and our roadmap for future development. Read more here.

So, if you are interested in learning more about any of our initiatives, especially if you love data the same way we do, then send us an email, sign up for a class, request some one-on-one time with a data librarian, or ask us to set up a personalized data session for your lab. We hope you’ll take some time to love data, this week and throughout the year!

Citing Code via GitHub

As we were taught in school, whenever someone quotes, paraphrases, summarizes, or otherwise references another scholar’s research, they must properly attribute that research with a citation in their work. This same rule applies to code!

Citing code is not only required as part of the publication process; it also adds value by:

  • contributing to ethical and transparent science,
  • recognizing the contributions of programmers to a research project,
  • tracking reuse of code over time, and
  • reinforcing the value of non-traditional bibliographic research outputs (like code, datasets, and software).

Code can be challenging to cite because the traditional bibliographic elements are not always readily apparent. Often the only citation information in a code repo has to be garnered from a README.md file or from the original publication that references that code, if such a publication exists.

If you are maintaining your code in GitHub, you have a few options to encourage proper citation by self-identifying contributors and citation elements.

DOI for Code. In 2016, GitHub partnered with Zenodo, the CERN-operated open-source data repository, to mint Digital Object Identifiers (DOIs) for archived repos. A DOI is a persistent identifier registered in an internationally recognized database, which gives your code (or data) an unambiguous, permanent redirect. DOIs are a great first step in ensuring that the correct version of code is clearly identified with proper attribution.

To take advantage of this, create a free account with Zenodo and be prepared to archive a specific version of your code. Read more information on how to generate the webhooks between your repos and Zenodo! 
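As a rough illustration of the "permanent redirect" idea, the sketch below turns a DOI into its resolver link at https://doi.org/. The function name doi_to_url, the simple pattern check, and the sample Zenodo-style DOI are assumptions made for this example; they are not part of GitHub's or Zenodo's tooling.

```python
import re
from urllib.parse import quote

# Loose shape check (an assumption for this sketch): "10.", a numeric
# registrant prefix, a slash, then a suffix with no whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written in READMEs and reference lists.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ link for a DOI written in any common form."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"Not a recognizable DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separator.
    return "https://doi.org/" + quote(cleaned, safe="/")


# Hypothetical DOI used only as a placeholder.
print(doi_to_url("doi:10.5281/zenodo.1234567"))
```

Whatever form the DOI takes in a README or reference list, the link stays the same, which is what makes it useful for pointing readers to one specific archived version of a repo.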

Citation Support for Code. Recently (August 2021), GitHub announced enhanced support for citation, adding the ruby-cff RubyGem to its code to incorporate .cff citation files. Adding a CITATION.cff file to a GitHub repository lets the owner identify attribution elements and automatically generates a simple ‘Cite this repository’ button in the repo with APA and BibTeX citation formatting.

Some of the elements a repo owner can include are:

  • code author names,
  • author ORCID iDs,
  • preferred software name,
  • DOI, and
  • other info related to date and version.

In particular, ORCID iDs and DOIs have value as disambiguation elements which ensure that credit is correctly attributed. Read more information on how to set up citation support in GitHub, or on the schema elements for .cff files!
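To make the elements listed above concrete, here is a minimal sketch that assembles the text of a CITATION.cff file. The helper name render_citation_cff and every sample value (title, author, ORCID iD, version, DOI, date) are placeholders invented for this example. The keys follow the Citation File Format 1.2.0 schema as we understand it, so please check the schema documentation before relying on them.

```python
import json


def yaml_str(value: str) -> str:
    """Quote a string safely; JSON strings are valid YAML double-quoted scalars."""
    return json.dumps(value, ensure_ascii=False)


def render_citation_cff(title, authors, version=None, doi=None, date_released=None):
    """Build CITATION.cff text from a title, a list of authors, and optional fields."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {yaml_str(title)}",
        "authors:",
    ]
    for author in authors:
        lines.append(f"  - family-names: {yaml_str(author['family'])}")
        lines.append(f"    given-names: {yaml_str(author['given'])}")
        if author.get("orcid"):
            # ORCID iDs are written in their full URL form.
            lines.append(
                f"    orcid: {yaml_str('https://orcid.org/' + author['orcid'])}"
            )
    if version:
        lines.append(f"version: {yaml_str(version)}")
    if doi:
        lines.append(f"doi: {yaml_str(doi)}")
    if date_released:
        lines.append(f"date-released: {yaml_str(date_released)}")
    return "\n".join(lines) + "\n"


# All values below are hypothetical placeholders.
example = render_citation_cff(
    title="Example Analysis Toolkit",
    authors=[{"family": "Doe", "given": "Jane", "orcid": "0000-0000-0000-0000"}],
    version="1.0.0",
    doi="10.5281/zenodo.1234567",
    date_released="2022-02-14",
)
print(example)
```

Saving the printed text as CITATION.cff in the root of the repository's default branch is what should prompt GitHub to show the ‘Cite this repository’ button. Writing the file by hand works just as well; the script only shows which pieces of information go where.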

If you need help understanding how to set this up or want to discuss how you can get and/or give proper citation to code, data, or software, please reach out to Anthony Dellureficio, Associate Librarian, Research Data Management.

MSK Data Catalog: We’ve Reached a Milestone!

The Library’s Research Data Management team is happy to announce that thanks to the efforts of our cataloging crew, we’ve reached a milestone of 200+ datasets in the MSK Data Catalog!

The MSK Data Catalog employs enhanced metadata to help increase discoverability of MSK research data, connect researchers working on similar topics, and describe how one can access publicly available datasets. Some of the features include:

  • Application of taxonomies, such as OncoTree and MeSH (Medical Subject Headings),
  • Identification of analytical tools and software used to create or manipulate data,
  • Filters by subject/repository/author/etc.,
  • Persistent links to datasets and, wherever possible, DOIs (Digital Object Identifiers),
  • Connections to Synapse for data authors and associated publications,
  • Technical info, such as size and format of datasets,
  • Instructions on how to access datasets.

Many of the records we’ve recently added describe MSK datasets in the cBioPortal, Gene Expression Omnibus, dbGaP, and the Protein Data Bank. If you’d like to know what the Library can do to help you increase the discoverability of your research data, please reach out to us!