Toward a Metadata Framework for Sharing Sensitive and Closed Data: An Analysis of Data Sharing Agreement Attributes

Posted on Mon 30 Nov 2017

The Presentation

Today I presented my research at the 11th International Conference on Metadata and Semantics Research, in Tallinn, Estonia.

MTSR 2017

Below you may view my presentation slidedeck, followed by my presentation notes.

View Presentation Notes

Exploring Tallinn

I was lucky to have the opportunity to explore the beauty of Tallinn on my first day in Estonia. This was my first international flight, and my first trip to Europe. Some highlights of my sightseeing were the Alexander Nevsky Cathedral, the "Wool Wall," Raekoja Plats, and the National Library of Estonia.

Alexander Nevsky Cathedral

Presenting at the 11th International Conference on Metadata and Semantics Research (MTSR)

Posted on Mon 27 Nov 2017


I will be presenting preliminary research at the 11th International Conference on Metadata and Semantics Research (MTSR), from November 28th – December 1st 2017, in Tallinn, Estonia.

I will be presenting during the track on “Open Repositories, Research Information Systems, and Data Infrastructures,” with my presentation entitled “Toward a Metadata Framework for Sharing Sensitive and Closed Data: An Analysis of Data Sharing Agreement Attributes.” This matches the title of the paper that Dr. Jane Greenberg and I wrote together, which will be published in the conference proceedings through Springer journals. The research is related to our involvement with NSF’s Northeast Big Data Innovation Hub Data Sharing project, called “A Licensing Model and Ecosystem for Data Sharing.”


I will be giving an overview of the project, which seeks to facilitate the process of data sharing among industry, academia, and government. The project has three interconnected components: a licensing framework/generator; a data sharing platform, called DataHub, to enforce aspects of the licenses; and robust metadata for the datasets and the developed licenses, to communicate data handling and rights specifications in a machine-readable way. To automate and enforce aspects of the data sharing agreement process, we first needed to understand organizational data sharing needs. We therefore collected a sample of data sharing agreements and performed a content analysis to identify general categories of data sharing needs, along with the more specific attributes of the agreements.


The six high-level categories we identified were general, privacy and protection, access, responsibility, compliance, and data handling. For example, the category of “data handling” covers the specifics of permissible interactions with the data, such as publication of the data, and the category of “responsibility” includes the legal, financial, ownership, and rights information pertaining to the data, such as establishment of data ownership, or indemnity clauses.
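To make the content analysis step concrete, here is a minimal Python sketch of how coded agreement attributes could be tallied by category. The six category names and the three example attributes come from the description above; the two coded agreements (`agreement_a`, `agreement_b`) and all function and variable names are hypothetical placeholders, not our actual sample or tooling.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    """The six high-level categories identified in the content analysis."""
    GENERAL = "general"
    PRIVACY_AND_PROTECTION = "privacy and protection"
    ACCESS = "access"
    RESPONSIBILITY = "responsibility"
    COMPLIANCE = "compliance"
    DATA_HANDLING = "data handling"


# Attribute -> category. Only the three attributes named in the post are listed;
# a real codebook would contain many more.
ATTRIBUTE_CATEGORY = {
    "publication of the data": Category.DATA_HANDLING,
    "establishment of data ownership": Category.RESPONSIBILITY,
    "indemnity clause": Category.RESPONSIBILITY,
}

# Hypothetical coding results: which attributes were found in which agreement.
CODED_AGREEMENTS = {
    "agreement_a": ["publication of the data", "indemnity clause"],
    "agreement_b": ["establishment of data ownership", "indemnity clause"],
}


def tally_categories(coded_agreements):
    """Count how many agreements contain at least one attribute per category."""
    counts = Counter()
    for attributes in coded_agreements.values():
        # A set ensures each category is counted once per agreement.
        for category in {ATTRIBUTE_CATEGORY[a] for a in attributes}:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    for category, count in tally_categories(CODED_AGREEMENTS).items():
        print(f"{category.value}: {count} agreement(s)")
```

Counting each category once per agreement, rather than once per clause, is a design choice for this sketch: it shows how widespread a need is across the sample instead of how often a single agreement repeats it.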

I will also be presenting on my current metadata-related progress, which seeks to take these agreement attributes and map them to existing metadata schemas to develop a metadata framework that could communicate these specific data sharing needs. Having a robust metadata framework can ensure that essential provenance, rights, and data handling information is conveyed throughout the entire lifecycle of the dataset, even after the data is shared downstream.
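As a rough illustration of what mapping agreement attributes to an existing schema might look like, the Python sketch below translates attribute values into Dublin Core terms and reports any attribute that has no match. The pairing of attributes to `dcterms:rightsHolder` and `dcterms:rights` is my own assumption for the example, not a finding of the research, and the sample values and function name are hypothetical.

```python
import json

# Assumed, illustrative mapping from agreement attributes to Dublin Core terms.
# Attributes with no entry here are treated as gaps in the target schema.
ATTRIBUTE_TO_TERM = {
    "establishment of data ownership": "dcterms:rightsHolder",
    "publication of the data": "dcterms:rights",
}


def build_rights_record(agreement_values):
    """Return (record, unmapped) for a dict of attribute -> agreement text."""
    record = {}
    unmapped = []
    for attribute, value in agreement_values.items():
        term = ATTRIBUTE_TO_TERM.get(attribute)
        if term is None:
            # No suitable element found: flag it for the framework to address.
            unmapped.append(attribute)
        else:
            record.setdefault(term, []).append(value)
    return record, unmapped


if __name__ == "__main__":
    # Hypothetical values pulled from a single data sharing agreement.
    sample = {
        "establishment of data ownership": "Example University",
        "publication of the data": "Recipient may not publish the raw data.",
        "indemnity clause": "Recipient indemnifies the provider.",
    }
    record, unmapped = build_rights_record(sample)
    print(json.dumps(record, indent=2))
    print("No matching element for:", unmapped)
```

The list of unmapped attributes is the useful part: it marks where an existing schema falls short and where a metadata framework would need additional elements.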

Research Data Alliance 10th Plenary

Posted on Tue 19 Sept 2017

Presenting Preliminary RDA Data Share Fellow Research

Tuesday, September 19th officially kicked off the 10th Research Data Alliance (RDA) Plenary, which is taking place at the Centre Mont-Royal in Montreal, Canada, from September 19th-21st.


My poster, titled “Advancing rights management metadata best practices across open and closed data sharing communities,” demonstrates preliminary efforts to develop a set of Rights Management Metadata best practices for Institutional Review Boards (IRB), based on researcher data sharing needs. This poster provides a literature review of existing research surrounding IRBs and Data Sharing; steps taken to identify researcher data sharing needs, through the examination of data sharing agreements; and preliminary efforts in identifying existing metadata standards that support rights management.

Of particular interest to my research were the RDA/NISO Interest Group on Privacy Implications of Research Data Sets, as well as the RDA/CODATA Legal Interoperability Interest Group. The Privacy Interest Group was particularly relevant and interesting to me, since they seek to develop a framework to assist stakeholders in understanding precautions related to research data. Part of this effort entails gathering documents related to data exchange and privacy to identify commonalities. My work in mining data sharing licenses for common attributes fits this need precisely.

The Legal Interoperability Group looked at several case studies of problematic data licensing scenarios, in which Creative Commons license options are not sufficient to address data sharing needs. The subsequent breakout session identified the need for an ontology to classify the licensing information: a formal knowledge representation that would allow for the creation of a more robust rubric to assess licenses computationally and create better decision trees for both users and creators. Another idea for the interest group was a scorecard to compare licenses against principles for legal interoperability.

Day three of the conference brought a lively discussion from the Ethics and Social Aspects of Data Interest Group. Collaborative groups discussed specific concerns among the "Dimensions of Ethics," which include issues related to data subjects, community and society, data owners/stewards, and data consumers. I'm particularly interested in remaining involved with this interest group for the long term, in relation to my potential research about the ethics of sharing metadata. Group members discussed more common concerns, such as informed consent and sufficient anonymization/de-identification processes, but also less frequently discussed concerns. These included the ambiguity of using freely available social media data that can be linked back to the individual through search engines, power imbalances that make research participants feel that they have to consent, and issues related to parents giving consent for their child's participation, including what happens to that data when the child turns 18.

Paper Accepted for the 11th International Conference on Metadata and Semantics Research

Posted on Wed 9 Aug 2017

Today I received notification that my paper with co-author Dr. Jane Greenberg, Toward a Metadata Framework for Sharing Sensitive and Closed Data: an analysis of data sharing agreement attributes, was accepted for the 2017 Metadata and Semantics Research Conference. The paper presents preliminary research findings. The conference will take place at Tallinn University, in Tallinn, Estonia, from November 28th to December 1st. The paper will subsequently be published in the forthcoming conference proceedings, through Springer Journals.


Metadata Mixer: Carolee Mitchell, from Data.World

Posted on Tues 24 May 2017

During today's Metadata Mixer, at Drexel's College of Computing and Informatics, we heard from Carolee Mitchell, the Manager of User Operations representing the Austin-based startup company, Data.World.


Carolee elaborated on the purpose of the data sharing platform, and the importance of linking not only data, but also the people who are creating the data.

We currently have both data and people in separate silos, with each data user having to perform their own data "janitorial work" on the same data set. It's an extraordinary waste of a researcher's time. Data.World addresses this by serving as a "self-service data prep" tool, and incorporates visualization tools to help create meaningful representations.

One key issue that crossed my mind, which was addressed by Dr. Jane Greenberg, is that in order for data to truly be linked (beyond the confines of Data.World's own ecosystem), they need to incorporate ORCID identifiers and other standard vocabularies, such as Name Authority Records.

More information about Data.World can be found on the site's overview or in their brief introductory video.

Research Data Alliance Fellowship

Posted on Tues 9 May 2017

RDA Fellowship Overview

In April, 2017, I was awarded a 12-month Research Data Alliance (RDA) Data Share Fellowship. My goal is to help facilitate safe and trustworthy data sharing between seemingly disparate open and closed data communities, creating a means by which researchers can share their datasets without worrying that sensitive data will be mismanaged or misused in the hands of a third-party participant.

In order to ensure compliance throughout the entire lifecycle of a dataset, even in the hands of a third party, datasets must carry comprehensive rights metadata that communicates how the dataset can be used, re-used, shared, and accessed. The work will contribute to this goal by establishing institutional rights metadata best practices, guided by the needs established in individual data sharing agreements.

My goals for this fellowship are:

  1. Evaluate a sample of current Institutional Review Board protocols for data rights management and metadata
  2. Conduct a crosswalk analysis of existing rights metadata standards
  3. Create recommendations for IRB protocol best practices for the creation of rights management metadata

RDA Fellowship Timeline

Fellow Orientation

The RDA Fellow Orientation took place at Rensselaer Polytechnic Institute (RPI) in Troy, NY on May 16th and 17th. RDA leaders spoke to the 2017 Data Share cohort about what to expect as fellows in the program, what our proposed projects plan to accomplish, and what our first steps should be. We also discussed potential collaboration opportunities with other cohort members as well as interest/working group chairs.

RDA Panel

Research Data Alliance Career Panel.

Day two of the orientation revolved around a four-person panel who spoke about data as a career. One important takeaway from the discussion was the notion that when you work in the data field, you have to be comfortable with (and even embrace) the fact that you won't be the smartest person in the room, but are nevertheless a necessary facilitator. In Beth Plale's words, "the algorithmic model plus human intuition = an open space for solutions."

Texas A & M NSF Workshop: Using Smart Grids Big Data

Posted on Tues 9 May 2017

Workshop Overview

On April 18th, 2017, Dr. Mladen Kezunovic, of Texas A & M University, hosted an NSF Workshop on using Big Data in Smart Grids.

Texas A & M Smart Grid Workshop

The goals of the workshop were to discuss and innovate ways that big data (defined by Volume, Velocity, Veracity, Variety, & Value) can be used to confront new challenges faced by the smart grid community. Panel topics included Big Data Availability & Management; International Experiences: Synchrophasors BD; Data Analytics & tools; and Future Efforts.

Poster Presentation

Sam Grabus, Jane Greenberg

Pictured from left to right: Florence Hudson, Internet2; Sam Grabus, PhD student (me); Dr. Jane Greenberg, Drexel University, Metadata Research Center.

Dr. Jane Greenberg and PhD student Sam Grabus travelled down to Texas A & M University for the Smart Grids workshop, where Sam shared her poster, “ShareDB: A Licensing Model and Ecosystem for Data Sharing.”

Many speakers throughout the day discussed the difficulties they face with data sharing, whether the barriers are proprietary rights, size, the need for real-time data, or the sensitive nature of the Critical Energy/Electric Infrastructure Information (CEII) being shared. Mark Rice, of the Pacific Northwest National Laboratory (PNNL), commented that "Nationally, we just don't share data."

2017 Joint PI Meeting: NSF BIGDATA and Big Data Hubs & Spokes

Posted on Fri 18 March 2017

Meeting Overview

NSF 2017 Joint PI Big Data Meeting

The NSF 2017 Joint PI Big Data Meeting ran from March 15th-17th in Washington, D.C., hosted at the historic Omni Shoreham Hotel.

The meeting was an opportunity to bring together PIs and students from all currently-funded NSF Big Data initiatives. The lightning talks and panels outlined progress on projects across all regional hubs and spokes, and identified current challenges.

Here is a full list of speakers and their presentation slides for all 3 days.

Data Sharing Challenges

Many project PIs across the various regional hubs and spokes stressed the difficulties that they are currently experiencing with cross-organizational data sharing, particularly in terms of licensing, intellectual property, and trust.

data sharing barriers

PI Sam Madden (MIT) spoke about our current progress on the data sharing spoke initiative within the Northeast Big Data Innovation Hub, addressing many of the barriers that researchers face when trying to share their data.

The poster session on the 16th was a great success, with several attendees engaging in discussion about the data sharing spoke initiative. We made connections for potential collaboration across the regional hubs, and the data sharing license agreement examples are starting to filter in.

Annual Northeast Big Data Innovation Hub Workshop

Posted on Fri 26 February 2017

Workshop Overview

The annual Northeast Big Data Innovation Hub Workshop was held at Columbia University on February 24th, 2017. Academic and industry professionals from the 6 spokes (health, energy, cities & regions, finance, big data, discovery) spoke about progress on current cross-sector hub initiatives and cross-hub collaboration.

Speaker Highlights

The workshop lightning talks addressed current initiatives, challenges, and upcoming events:

Chirag Patel, from Harvard Medical School, addressed finding a way to link disease and environmental data via an exposome data warehouse and OHDSI.

Carsten Binnig, from Brown University, spoke about creating a data sharing platform with built-in licensing agreements to facilitate easier/safer data sharing between industry and academia, building on top of MIT’s pre-existing data sharing technology. This work is part of the larger project I'm on with the Metadata Research Center, Drexel University, and the Northeast Big Data Innovation Hub Data Sharing spoke. Our last workshop was held at Drexel University on September 29-30th, 2016. The workshop slides and final report are available on the Metadata Research Center website. The next data sharing workshop will be in Fall 2017.

Beverly Woolf, from the University of Massachusetts Amherst, discussed creating personalized education based on predictive models, helping to create more effective training approaches via adaptive technologies.

Rebecca Wright, from Rutgers University, focused on area-specific privacy and security concerns related to data and integrating solutions into technology, and mentioned two forthcoming related workshops (April 24-25 and Fall 2017).

Next, we heard from Penn State's John Yen, who addressed the current lack of existing resources for sharing near real-time cyber threat information within the trusted community. The next related workshop will be on Nov 11th, featuring stakeholders from Penn State, Rutgers, Dartmouth, and Columbia.

Stephen Uzzo, from the New York Hall of Science, addressed approaches toward data literacy: we are collecting more data than we have the capability to analyze—there is currently not enough academic training to help close the big data divide.

In an announcement by Microsoft, Vani Mandava spoke about how cloud-based solutions can help to connect partnering hospitals for patient risk-admission predictive analysis.

Breakout sessions highlighted common themes of algorithmic bias, data sharing, privacy/security, and scale of focus (e.g., hyperlocal).