IBM’s Center for Open-Source Data and AI Technologies (CODAIT) recently unveiled a pair of carefully curated repositories designed to provide machine learning developers with models and datasets for AI projects.
MAX, or Model Assets Exchange, is an online open-source repository for trainable/deployable AI models. You don’t necessarily have to be an AI expert to use the database – there’s even a tutorial that’ll walk you through developing an AI that can write captions – but some of the models available will probably only appeal to enterprise developers.
CODAIT also launched the Data Assets Exchange (DAX). Where MAX hosts full AI models, DAX contains datasets you can use to train your own. Open-source training datasets for AI aren’t exactly rare, but well-curated ones are, so TNW reached out to Fred Reiss, Chief Architect at CODAIT, to find out what’s so special about DAX.
What does IBM mean when it says the datasets will be “carefully curated”? Are they checked for bias or accuracy?
Most of the places you might go to find a list of datasets online take a very hands-off approach to vetting. Someone creates a list or a database, and random people from the Internet submit links to data. It’s up to you, the dataset consumer, to figure out whether a given dataset is useful. You need to answer a number of questions: What is the scientific merit of a given dataset? Who owns the data? Did the person who posted it have a right to post it? Do I have the right to download it? Can I safely use the data in a business application?
We experienced frustration with this lack of vetting firsthand while training models for the Model Asset eXchange — DAX’s sister site on developer.ibm.com with state-of-the-art deep learning models. For example, we had to expend a great deal of effort to obtain a usable data set to train our Named Entity Tagger model.
Here at IBM’s CODAIT lab, we spend a good part of our time contributing to the open source software that underlies today’s AI systems — projects like Kubeflow, TensorFlow, PyTorch, Apache Spark, and Jupyter notebooks. One of the main functions of our organization is to help ensure that the code governance and quality of these open-source AI software components is up to IBM’s standards. We wanted to bring the same level of quality to the open source data that you run through this open source software. So we’re following a much more controlled approach with DAX, compared with other repositories of data sets you might find online.
Every dataset in DAX is shepherded by a member of our team and reviewed by multiple other people within IBM. We start by collecting detailed information about the origins of the dataset and what kinds of problems the dataset would be a good fit for. When possible, we reach out to the original creator of the data. We collect detailed metadata about where the data comes from. We familiarize ourselves with the research papers behind the datasets. We even look at the actual data items themselves to check for potential legal and data quality issues. Every dataset goes through IBM’s own internal legal review process. Only then does a dataset go “live” on the site.
And we don’t stop with just posting this vetted data. There are additional steps we plan to take after datasets go up on DAX to create parallel content, and you should start seeing the results of these efforts soon. We’re creating Jupyter notebooks that show how to read and analyze the contents of each dataset, either on your own laptop or on the IBM Cloud. And we’re writing ready-made scripts for training deep learning models on the data. Users will be able to try these scripts for free on IBM Watson Machine Learning, taking advantage of our GPU-accelerated cloud.
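To give a flavor of what such a notebook might do, here is a minimal sketch of loading a downloaded dataset archive into a DataFrame. The archive layout and column names are illustrative assumptions, not DAX’s actual file format; real download links live on each dataset’s page.

```python
import io
import tarfile

import pandas as pd


def load_first_csv(fileobj) -> pd.DataFrame:
    """Return the first CSV found in a .tar.gz archive as a DataFrame."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.name.endswith(".csv"):
                return pd.read_csv(archive.extractfile(member))
    raise ValueError("no CSV file found in archive")


# A tiny in-memory archive standing in for a downloaded dataset tarball
# (hypothetical contents, purely for illustration).
csv_bytes = b"x,y\n1,2\n3,4\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(csv_bytes)
    archive.addfile(info, io.BytesIO(csv_bytes))
buf.seek(0)

df = load_first_csv(buf)
print(df.shape)  # (2, 2)
```

The same few lines of exploration — load, inspect shape, run `df.describe()` — are the typical first cells of the kind of notebook described above.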
What other kinds of datasets are planned for DAX?
For every dataset currently on the site, there are roughly three more in the pipeline. For the near term, we are continuing to focus on IBM Research data. Some datasets are currently waiting for peer-reviewed articles to be published before we can post them. Most of the new data in the queue is natural language text, but there’s also some image and audio data coming up.
As it happens, one of our team members is in the middle of creating a cool demo notebook for the double pendulum dataset. It should be out soon.
Some of the data sets on DAX are for advancing core science, while others have more immediate business applications. The double pendulum dataset is more in the former category, and it has a number of interesting scientific uses. The proposed challenge from the researchers who produced the dataset is a time series prediction task: create a model that predicts the state of the chaotic pendulum system. Predicting chaotic systems is a useful task for validating new kinds of models for numeric time series prediction and natural language analysis (natural language text being a sequence of words).
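The time series prediction task described above is usually framed as supervised learning over sliding windows of the system’s state: each training example is a short run of consecutive states, and the target is the state that follows. A minimal sketch, with an illustrative two-variable trajectory standing in for the pendulum data (the actual dataset’s state variables and schema may differ):

```python
import numpy as np


def make_windows(states: np.ndarray, window: int):
    """Turn a (T, d) state trajectory into supervised (X, y) pairs:
    X[i] holds `window` consecutive states, y[i] is the state that follows."""
    X = np.stack([states[i : i + window] for i in range(len(states) - window)])
    y = states[window:]
    return X, y


# Illustrative trajectory: two state coordinates over 100 time steps.
t = np.linspace(0, 10, 100)
states = np.column_stack([np.sin(t), np.cos(2 * t)])

X, y = make_windows(states, window=5)
print(X.shape, y.shape)  # (95, 5, 2) (95, 2)
```

Any sequence model (recurrent network, transformer, or even a linear baseline) can then be fit on these `(X, y)` pairs; a chaotic system like the double pendulum makes a demanding benchmark because small prediction errors compound quickly.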
You could also use the video as a sanity check for deep pose estimation algorithms. The physical configuration of the pendulum is designed such that the parts of the pendulum can be localized with subpixel accuracy without using machine learning. A generic machine learning algorithm that doesn’t have that domain knowledge should still be able to approach the same level of precision.
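One simple way to run such a sanity check is to compare a model’s predicted joint positions against the marker-derived reference positions with a mean pixel-error metric. The array shapes and numbers below are illustrative assumptions, not the dataset’s actual annotation format:

```python
import numpy as np


def mean_keypoint_error(predicted: np.ndarray, reference: np.ndarray) -> float:
    """Mean Euclidean distance, in pixels, between predicted and reference
    (x, y) keypoints, each of shape (n_frames, n_joints, 2)."""
    return float(np.linalg.norm(predicted - reference, axis=-1).mean())


# Illustrative numbers: 3 frames, 3 pendulum joints, reference at the origin,
# predictions offset by a constant (0.3, 0.4) pixels.
reference = np.zeros((3, 3, 2))
predicted = reference + np.array([0.3, 0.4])

print(mean_keypoint_error(predicted, reference))  # 0.5
```

A pose estimator that approaches the subpixel precision of the marker-based localization should drive this error well below one pixel.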
Will developers be able to upload datasets to DAX?
We certainly plan to add that capability in the future. The key challenge there is to maintain the current level of curation and to make the entire process open. There’s a lot of depth within the company that we can draw on to expand our collection of high-quality data in the near term.
Our current focus is on enabling consumption by developers worldwide. Having this collection of vetted datasets opens up some exciting possibilities for other related parts of developer.ibm.com. Now we can add new Code Patterns that show how to use these data sets to cover end-to-end use cases. For example, the Financial Proposition Bank data set has some really cool applications for analyzing public companies’ quarterly reports. Also, we can use DAX datasets as a starting point for developers to train customized versions of our Model Asset Exchange models by mixing the DAX data with a little bit of their own local data.