LEADS-4-NDP Data Science Bootcamp

Posted on Tues 12 June 2018

About the Bootcamp

To kick off the LEADS-4-NDP summer fellowship, LEADS fellows participated in a 3-day intensive data science bootcamp at Drexel University from June 7th-9th, with lectures by Drexel University Information Science faculty.

Sam Grabus, LEADS-4-NDP

Topics included data science processes; an introduction to crunching data with R, SQL, NoSQL, and NewSQL; linked data at OCLC; metadata quality and integration; data pre-processing techniques with R; data visualization and visual analytics; data mining and machine learning methods; working with Hadoop and Spark; text processing and natural language processing; and automated data analysis tools.

Each LEADS fellow is partnered with a remote fellowship library site, where they will be working with mentors on site-specific data science projects over the next 10 weeks.

LEADS-4-NDP Fellowship at Temple University's Digital Scholarship Center

Posted on Wed 2 May 2018

About the Fellowship

Sam Grabus, LEADS-4-NDP

I am thrilled to announce that I have been selected for a LEADS-4-NDP (LIS Education and Data Science for the National Digital Platform) summer fellowship with the Digital Scholarship Center at Temple University Libraries. The fellowship consists of an online preparatory curriculum, an intensive 3-day data science bootcamp, and a ten-week data science internship with the Digital Scholarship Center.

I was selected to collaborate with Temple University English Professor Peter Logan, who is taking four editions of the Encyclopedia Britannica (from 1790 to 1911) and analyzing how discrete concepts within the "official account" of the body of knowledge (from the colonizing British male empiricist perspective: no folk authors, no women, no POC, etc.) changed over time. The dataset will be about 100 million words, and we will analyze these discrete concepts through the application of Library of Congress Subject Headings and other time-appropriate ontologies.

Research Data Alliance Eleventh Plenary Meeting, Berlin, Germany

Posted on Tues 20 March 2018


As a Research Data Alliance (RDA) Data Share Fellow, I am attending my second RDA plenary conference to present research. The 11th RDA plenary meeting is taking place in Berlin, Germany, at the bcc Berlin Congress Center.

Berlin Cathedral

I arrived in Germany on Sunday the 18th, and took advantage of the opportunity to explore the city on Monday. I started off Monday morning with Frühstück at Hilde, then visited the Berliner Dom (Berlin Cathedral).

I spent some time in the impressive Dussmann-Haus bookstore, visited the Brandenburger Tor, did some somber exploration of the Memorial to the Murdered Jews of Europe, and had Abendessen at Brauereigaststätte Leibhaftig.


On Tuesday the 20th, I attended a co-located IEEE Big Data Governance and Metadata Management (BDGMM) workshop. The purpose of the workshop was to identify opportunities for the development of IEEE Standards for Big Data governance and metadata management.


NSF Science Advisor for Public Access Beth Plale spoke about PID Kernel Information: the idea of sticking a bit of provenance metadata in a persistent identifier to enable an internet-scale data client to navigate a list of 100,000,000 PIDs.

ISO Working Group 13 project leader Ismael Caballero Muñoz-Reja introduced MAMD: Modelo Alarcos de Mejora de Datos, an ISO 8000-60-compliant framework used as a guideline to improve data access and governance.

Jane Greenberg discussed our current progress with the NSF Northeast Big Data Innovation Hub's Data Sharing spoke project, "A Licensing Model and Ecosystem for Data Sharing," a collaboration between researchers at Drexel's Metadata Research Center, MIT's CSAIL, and Brown's Computer Science department. The project seeks to develop technical solutions for facilitating the sharing of restricted data in a secure environment.

Tobias Weber, of the Leibniz Supercomputing Centre, began his presentation by sharing a clip from Star Trek: Voyager, in which Captain Janeway customizes her holodeck creation by giving the computer precise specifications for what she wants. Tobias demonstrated that, in terms of accessing FAIR data, we are not yet at the point of being able to ask a computer to retrieve data so precisely, and that we need to improve compliance with standards, quality control, automatic annotation on ingest/during curation, and possibly OAI-PMH 3.0.

RDA Plenary

The Research Data Alliance (RDA) 11th plenary meeting, here in the beautiful bcc Berlin Congress Center, kicked off with some great keynote speeches. These included a fascinating talk by Prof. Dr. med. Katrin Amunts about the complications and necessity of sharing massive amounts of complex neuroimaging data as part of the Human Brain Project.

I attended a variety of meetings over the course of the three-day RDA conference, including the RDA/NISO Privacy Implications of Research Data Sets interest group, an "Ethics in FAIR data" joint meeting of several interest groups, and a "Birds of a Feather" session about "Sensitive Data for Open Science." These sessions were all essential for understanding the current landscape of issues surrounding data sharing for restricted data types.

Poster Session

While I was definitely embarrassed (and hoping to get my €40 back!), I laughed it off, and had some great conversations about the poster and subsequent paper: an overview of rights and licensing initiatives that seek to facilitate data sharing, covering rights management, licensing standardization, metadata, technological infrastructure, and community-driven efforts.

Love Data 2018 Metadata Mixer

Posted on Mon 16 Feb 2018

The Presentation

The Metadata Research Center hosted a "Metadata Mixer" at Drexel's College of Computing and Informatics on Tuesday, February 13th, in honor of "Love Data Week." I presented my current progress surveying the landscape of rights management and licensing initiatives for facilitating the data sharing process for sensitive and private data types. Initiatives were organized into 6 overlapping categories: Rights Management, Licensing Standardization, Metadata and Ontologies, Community-Driven Efforts, Technological Infrastructure and Tools, and Informational Resources.

Sam Grabus

Dr. Jane Greenberg and I also discussed the NSF spoke project, A Licensing Model and Ecosystem for Data Sharing, which seeks to facilitate the data sharing process for sensitive and private data between industry, academia, and government. The project has three interconnected components: a licensing framework/generator; a data sharing platform, called ShareDB, to enforce aspects of the licenses; and robust metadata for both the datasets and the developed licenses, to communicate data handling and rights specifications in a machine-readable way.

I provided a brief demonstration of current progress on the ShareDB platform, which is designed to facilitate the sharing of sensitive and private data by automating anonymization and de-identification techniques within a secure data sharing infrastructure. Questions and discussion of the platform followed the presentation.