The Workshop on Common Model Infrastructure at KDD 2018 will focus on infrastructure for model lifecycle management—to support discovery, sharing, reuse, and reproducibility of machine learning, data mining, and data analytics models.
The rapidly increasing use of machine learning, data mining, and data analytics techniques across a broad range of applications provides opportunities for sharing and reuse of models, algorithms, and code, to help increase speed to solution and reduce duplication of effort. The recently announced Google TensorFlow Hub (https://www.tensorflow.org/hub/) is one example of a related effort. Workshop topics include model lifecycle management; scenarios for model reuse and ease of reuse; model reproducibility; and other related issues. The tentative workshop agenda includes keynote talks; panel discussions; lightning talks; and discussion sessions.
Clemens Mewald, Product Lead, TensorFlow Extended (TFX), Research & Machine Intelligence group, Google
Talk Title: What is the code version control equivalent for ML and data science workflows?
Abstract: ML is introducing a new paradigm to software development workflows. Previously, developers primarily dealt with code that was compiled or interpreted, usually with deterministic behavior. With ML, products now rely on behaviors and patterns, often expressed as predictions, that are a function of code and evolving data + models, leading to dynamic behavior. These challenges have to be met with changes in how code, data, and all derived artifacts (including models) are indexed, updated, and shared. This talk will give an overview of specific challenges faced by researchers and software engineers, and how Google AI is addressing them through infrastructure projects such as TensorFlow Extended and TensorFlow Hub.
Bio: Clemens Mewald is a Product Manager in Google’s Research & Machine Intelligence group. He is the product lead for TensorFlow Extended (TFX), an end-to-end ML platform based on TensorFlow, and several other TensorFlow products. Clemens holds an MSc in Computer Science from UAS Wiener Neustadt (Austria) and an MBA from MIT Sloan.
Robert Grossman, University of Chicago
Talk Title: Why it is Important to Understand the Differences Between Deploying Analytic Models and Developing Analytic Models?
Abstract: There are two cultures in data science and analytics – those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
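As a rough illustration of the JSON-based model interchange the abstract refers to, the canonical minimal PFA scoring engine (one that adds 1 to a numeric input) can be written out from Python. This is a sketch of the document format only, not a runnable scoring engine:

```python
import json

# A minimal PFA ("Portable Format for Analytics") document: a scoring engine
# that adds 1 to its double-valued input. PFA expresses the entire model as
# JSON, so it can be exchanged and executed independently of the framework
# that produced it.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 1]}],
}

# Serialize the model for exchange between the "develop" and "deploy" cultures.
print(json.dumps(pfa_doc, indent=2))
```

Because the model is plain data rather than code, the deployment side needs only a generic PFA engine, not the modeler's original environment.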
Bio: Robert L. Grossman is the Frederick H. Rawson Professor of Medicine, a Professor of Computer Science, and the Jim and Karen Frank Director of the Center for Translational Data Science (CTDS) at the University of Chicago. Since 2011, he has been the Chief Research Informatics Officer (CRIO) of the Biological Sciences Division, and, since 2016, he has been the Co-Chief of the Section of Computational Biomedicine and Biomedical Data Science in the Department of Medicine at the University of Chicago. He has been a Partner of Analytic Strategy Partners LLC since 2017 and was the founder and Managing Partner of Open Data Group from 2002-2016. Today, Open Data Group provides products and associated services so that companies can deploy analytic models. He is also the Chair of the not-for-profit Open Commons Consortium that develops and operates clouds to support research in science, medicine, health care, and the environment. He has also been active in the development of the Predictive Model Markup Language (PMML) and Portable Format for Analytics (PFA) standards in analytics.
Marc Millstone, Allen Institute for Artificial Intelligence
Talk Title: Beaker: A collaborative platform for rapid and reproducible research
Abstract: A researcher’s core focus should be discovering and validating new ideas that advance their field, yet computational researchers spend a significant amount of time on ancillary details such as organizing results and scaling experiments. Researchers address these problems by stringing together tools which narrowly focus on specific engineering problems in each stage of the research process. Beaker is a robust experimentation platform for computational researchers that streamlines the reproducible training, analysis and dissemination of machine learning results. Beaker is designed collaboratively with AI2 researchers to achieve unprecedented ease-of-use for AI research workflows.
In this talk, we will present the goals and design of Beaker, including systems like Docker and Kubernetes that help reduce the cognitive overhead of infrastructure. Finally, we will touch on the philosophy and reality of “The Reproducibility Crisis” in AI and how containers are one tool that can help push computational science forward.
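The reproducibility idea this abstract describes can be illustrated with a small, hypothetical sketch (this is not Beaker's actual design or API): if every experiment records its code version, pinned dependencies, and parameters, a content hash of that manifest identifies runs that should be repeatable and comparable.

```python
import hashlib
import json

def experiment_fingerprint(code_version, dependencies, params):
    """Hypothetical example: fingerprint the inputs of an experiment so that
    identical fingerprints identify re-runnable, comparable experiments."""
    # Canonical JSON (sorted keys) so logically equal manifests hash equally.
    manifest = json.dumps(
        {"code": code_version, "deps": dependencies, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()

# Two runs with identical inputs share a fingerprint; changing any
# hyperparameter (or dependency pin) yields a different one.
fp1 = experiment_fingerprint("abc123", {"numpy": "1.14.5"}, {"lr": 0.01})
fp2 = experiment_fingerprint("abc123", {"numpy": "1.14.5"}, {"lr": 0.01})
fp3 = experiment_fingerprint("abc123", {"numpy": "1.14.5"}, {"lr": 0.1})
```

Containers extend the same principle from a manifest to the full runtime environment: the image pins the operating system and libraries, so the fingerprint covers everything the experiment touched.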
Bio: Marc Millstone leads the Aristo Engineering team at the Allen Institute for Artificial Intelligence (AI2). Before joining AI2, Marc spent two years in the Mathematical Sciences group at IBM Watson Research, as both a post-doc and a Research Staff Member. He obtained his Ph.D. from the Courant Institute of Mathematical Sciences at NYU, where he worked jointly with Lawrence Berkeley National Laboratory, specializing in methods for solving large-scale practical optimization problems. After some time in research, he decided to jump into the crazy world of startups as an early engineer at Socrata, where he eventually formed and led the Analytics and Machine Learning (AniML) team.
Context: The Missing Piece in the Machine Learning Lifecycle, Rolando Garcia, Vikram Sreekanti, Neeraja Yadwadkar, Daniel Crankshaw, Joseph E. Gonzalez, Joseph M. Hellerstein, UC Berkeley
Fighting Redundancy and Model Decay with Embeddings, Dan Shiebler, Luca Belli, Jay Baxter, Hanchen Xiong, Abhishek Tayal, Twitter Cortex
An Open Platform for Model Discoverability, Vani Mandava, Amit Arora, Yasin Hajizadeh, Microsoft
Recommender System for Machine Learning Pipelines, Raymond E. Wright, Jorge Silva, Ilknur Kaynar-Kabul, SAS Institute
Building a Reproducible Machine Learning Pipeline, Peter Sugimura, Florian Hartl, Tala
Knowledge Aggregation via Epsilon Model Spaces, Neel Guha, Stanford University
Welcome & Introduction, Chaitan Baru, University of California San Diego; Vandana Janeja, University of Maryland, Baltimore County
Keynote talk: What is the code version control equivalent for ML and data science workflows?, Clemens Mewald, TFX, Google
Keynote talk: Why it is Important to Understand the Differences Between Deploying Analytic Models and Developing Analytic Models?, Bob Grossman, University of Chicago
Short talk: Recommender systems for machine learning pipelines, Raymond Wright, Jorge Silva, Ilknur Kaynar-Kabul, SAS Institute Inc.
Short talk: Building a reproducible machine learning pipeline, Peter Sugimura and Florian Hartl, Tala
Short talk: Knowledge Aggregation via Epsilon Model Spaces, Neel Guha, Stanford University
Invited talk: Beaker: A collaborative platform for rapid and reproducible research, Marc Millstone, Allen Institute for AI
Long talk: Context: The Missing Piece in the Machine Learning Lifecycle, Rolando Garcia, Vikram Sreekanti, Neeraja Yadwadkar, Daniel Crankshaw, Joseph E. Gonzalez, Joseph M. Hellerstein, UC Berkeley
Long talk: Fighting Redundancy and Model Decay with Embeddings, Dan Shiebler, Luca Belli, Jay Baxter, Hanchen Xiong, Abhishek Tayal, Twitter Cortex
Short talk: An Open Platform for Model Discoverability, Vani Mandava, Amit Arora, Yasin Hajizadeh, Microsoft
Next steps, conclusion
May 25, 2018, 11:59 PM PST
Notification to Authors
June 8, 2018, 11:59 PM PST
June 15, 2018, 11:59 PM PST
June 22, 2018, 11:59 PM PST
August 20, 2018
Call for Papers
Motivation for the Workshop
The continuing, rapid accumulation of large amounts of data raises the question of how to manage an increasingly complex modeling process and the large numbers of data-driven models it generates. Current modeling practices are rather ad hoc, often depending on the experience and expertise of individual data scientists and on pre-processing steps that may be specific to particular domains. There is a need for cataloging, sharing, and discovering models. Different application domains and disciplines may use similar models and modeling tools, yet sharing across them is limited; modeling results often have poor reproducibility; information on when and how a model works, and when it may fail, is often not clearly recorded; model provenance and the original intent behind the knowledge discovery process are rarely well recorded; and many predictive analytics algorithms are not transparent to end users.
The need for cataloging analytics procedures and for model management has emerged as a key issue for the KDD community—this workshop will provide a forum for researchers to discuss emerging challenges and solutions in this area. We believe there is a need and opportunity for R&D and infrastructure to support discovery, sharing, and use/reuse of machine learning, data mining, statistical analysis, and analytics models. This workshop will focus on the principles, services and infrastructure needed to help data scientists of every ilk—whether scientific researchers, industry analysts, or other practitioners—share data analytics models, reproduce analysis results, support transfer learning, reuse pre-constructed models, etc.
The workshop will include invited talks, short talks and lightning talks, posters, and a panel discussion, with time provided for open discussions.
Submission Guidelines for research papers/posters
There will be two tracks for paper submissions:
Archival papers: submissions may be up to 4 pages long.
Non-archival papers: submissions should be 1-page extended abstracts.
All submissions will be peer reviewed. All accepted papers from both tracks will receive poster slots.
Our plan is to provide short talk slots for accepted archival papers, and lightning talk slots for accepted non-archival papers. Details, including the length of presentations, will be determined based on the number of accepted papers and the available time.
We invite submission of original research ideas, vision papers and descriptions of work-in-progress or case studies, which are not under review elsewhere. The submitted papers must be written in English and formatted according to the ACM Proceedings Template (Tighter Alternate style). The papers should be in PDF format and adhere to the 4-page limit for archival papers and 1-page limit for non-archival papers.
Important: If accepted, at least one of the authors must attend the workshop to present the work.
Topics of Interest
Topics of interest include (but are not limited to):
Model lifecycle management
Cataloging, searching, recommending, and discovering models
Model sharing and reuse; transfer learning
Privacy and security of sharing models
Implications for bias and misuse when sharing models
Model metadata for recording when/how a model works, and when it may fail
Model storage, versioning, exchange, and provenance management
Transparency of predictive analytics algorithms / models
Publishing and reusing data transformation and feature extraction pipelines for models
Integration with existing modeling tools and data analytics infrastructure
Reports on experiences with model management infrastructure, model exchange formats, etc., from practice in industry and elsewhere