One of the largest barriers to technology innovation in the enterprise is data silos. Some of the most common reasons for these silos are:

  • Multiple geographies with different data protection standards

  • Vendor IT systems

  • Customer bases with non-trusting parties

  • Highly sensitive data sets

These reasons are not mutually exclusive. In fact, some businesses we work with have answered "all of the above". Have you ever seen an AI initiative shut down for one of the following reasons?

  • The data owner inside your organization is unwilling to provision access

  • A POC worked but can't go to production due to insufficient data availability

  • The required databases don't talk with one another

  • "We decided to just implement RPA"

Despite ripe ROI opportunities, AI adoption in the enterprise still hasn't gained steam in real production settings. A common story among the brave who have dared:

"Despite project approval, we didn't get anything done in 6 months because we couldn't get access to the data"

The companies that have demonstrated successful production adoption are (by and large) the market leaders in their industries, and the rest of their markets are priced out of the solutions or cannot source the talent required to build internally.

A Glimmer of Hope

The research community has rallied around a set of technologies to overcome these challenges. Broadly speaking, they are private machine learning and transfer learning. Together they offer the keys to accelerating innovation while mitigating its risks. This post touches only on how they can accelerate innovation. The field is rich with exciting technologies, but we cover only those that we use in the DataFleets platform.

Private Machine Learning

This is an umbrella term used to refer to technologies that allow data to remain private while still enabling machine learning models to be developed. Some of the core subcategories include federated learning, encrypted learning and inference, and differential privacy.

Federated Learning (FL)

This technology was born at Google in 2015. Originally referred to as "federated optimization", it was created to enable a single machine learning model to be optimized across many devices without aggregating device data. It is the fastest-growing research area in private machine learning (see chart below). At DataFleets, we use federated learning to optimize machine learning models across data silos without aggregating the data.

*Google Scholar was used to perform this analysis. For each month in the time range considered (2015-2019), papers containing the phrase "Federated Learning" or "Federated Optimization" were counted.
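
To make the idea concrete, here is a minimal sketch of one common federated learning recipe, federated averaging, in plain Python. It is an illustration under stated assumptions, not the DataFleets implementation: the one-feature linear model, the synthetic silo data, the hyperparameters, and every function name (local_update, federated_average, train, make_silo) are hypothetical. Each silo trains on its own rows and only the resulting weights travel to the coordinator.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one silo's private data.

    Only the updated weights and the row count leave this function,
    never the rows in `data`.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n

def federated_average(updates):
    """Combine silo updates, weighting each by its number of examples."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)

def train(silos, rounds=200):
    """Coordinator loop: broadcast weights, collect updates, average."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in silos]
        global_weights = federated_average(updates)
    return global_weights

def make_silo(rng, n, low, high):
    """Hypothetical synthetic silo: y = 3x + 1 plus a little noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.01)) for x in xs]

rng = random.Random(0)
# Three silos of different sizes that each see a different slice of x.
silos = [make_silo(rng, 50, 0, 1), make_silo(rng, 80, 1, 2), make_silo(rng, 30, -1, 0)]
w, b = train(silos)
print(f"learned w={w:.3f}, b={b:.3f}")
```

In this toy setup the shared model should recover a slope near 3 and an intercept near 1, even though no silo ever reveals its rows. Real deployments add much more (secure transport, client sampling, handling of silos whose data follow different distributions), which this sketch leaves out.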

Encrypted Learning & Inference

Cryptography is one of the oldest disciplines in computer science, but its use in tandem with machine learning and optimization has a younger corpus of research. Encrypted learning and inference allow models to be trained on encrypted data, predictions to be made on encrypted data, and the models themselves to be encrypted to keep them safe from untrusted parties. Several cryptographic schemes have been researched for machine learning, but the most widely published are secure multi-party computation and homomorphic encryption. Both have seen steady growth in attention from the research community over the past three decades.

*Google Scholar was used to perform this analysis. For each year in the time range considered (1990-2018), papers containing the phrase "Homomorphic Encryption", "Secure Multiparty Computation", and/or "Secure MPC" were counted.
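
As a small taste of secure multi-party computation, the sketch below uses additive secret sharing, one of its basic building blocks, to let several parties compute a sum without any of them revealing its own input. It is a toy under stated assumptions: the modulus PRIME, the function names (share, reconstruct, secure_sum), and the example values are hypothetical, all parties are simulated in one process, and it is not a hardened protocol. Homomorphic encryption is not shown here.

```python
import secrets

# Field modulus for the shares; an assumption chosen for this sketch.
PRIME = 2**61 - 1

def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares; any strict subset of them reveals nothing about the value."""
    return sum(shares) % PRIME

def secure_sum(private_values):
    """Simulate n parties summing their inputs via secret sharing."""
    n = len(private_values)
    # Party i splits its value into n shares and sends share j to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received; each one looks like a random number.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Publishing only the partial sums reveals the total, not the inputs.
    return reconstruct(partial_sums)

# Hypothetical private inputs held by three different parties.
salaries = [52000, 61000, 75000]
total = secure_sum(salaries)
print(total == sum(salaries))
```

The same trick extends from sums to model training because gradient aggregation is, at its core, a sum. The arithmetic only works for totals smaller than the modulus, which is why the sketch picks a large prime.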

Differential Privacy

Machine learning methods have proven to be extremely powerful, to the point that they can memorize the data on which they're trained. This means that the models themselves can compromise data privacy. Although federated learning and encryption enable models to be trained on data that the model developer never sees, for highly sensitive data it is also necessary to ensure that the model does not learn secrets about individuals. This is where differential privacy comes in. In a nutshell, it injects noise into the analysis being performed so that population-level patterns can be learned without learning individual-specific information.

*Google Scholar was used to perform this analysis. For each year in the time range considered (2003-2018), papers containing the phrase "Differential Privacy" were counted.
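
The noise-injection idea can be sketched with the Laplace mechanism applied to a simple counting query. This is an illustration under stated assumptions rather than a production recipe: the function names (laplace_noise, private_count), the example ages, and the choice of epsilon are hypothetical. Adding or removing one person changes a count by at most 1, so noise drawn with scale 1/epsilon is enough for the released count to satisfy epsilon-differential privacy.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records matching `predicate`.

    A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(42)
# Hypothetical sensitive attribute: patient ages.
ages = [23, 35, 41, 52, 67, 29, 48, 71, 38, 55]
noisy = private_count(ages, lambda age: age > 40, epsilon=0.5, rng=rng)
print(f"noisy count of patients over 40: {noisy:.2f}")
```

Any single release is deliberately imprecise, but the noise averages out over large populations, which is what lets population-level patterns survive while any one individual's contribution stays hidden. Training a model with differential privacy applies the same principle to gradients instead of counts.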

Weak Supervision

This technology is concerned with bringing disparate knowledge sources together and using them to train machine learning models that encapsulate those knowledge sources. It offers an exciting set of tools that allow information to be "packaged" and reused across the organization. The image below, from Stanford's AI blog, summarizes the interaction of these technologies.

In this paradigm, data scientists no longer spend their time engineering complex features. Instead, developers focus on integrating cross-modal organizational resources to create vast amounts of labeled training data. The advantage of this approach is that the onus of learning feature representations falls on powerful deep learning models, and developers can spend more time bringing their organization's resources to bear.

*Google Scholar was used to perform this analysis. For each year in the time range considered (2003-2018), papers containing the phrases "Weak Supervision" AND "Machine Learning" were counted.
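
One way to picture how organizational knowledge becomes training labels is with labeling functions: small heuristics that each vote on a label or abstain. The sketch below combines them with a simple majority vote. Everything in it is a hypothetical illustration: the labeling functions, the label names, and the example messages are invented, and majority vote is a deliberately simple stand-in for the learned label models that weak supervision research typically uses to weigh noisy sources.

```python
from collections import Counter

ABSTAIN = None

# Each labeling function encodes one piece of organizational knowledge.
def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_says_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]

def weak_label(text):
    """Majority vote over the non-abstaining labeling functions.

    Returns ABSTAIN when no function fires or when the top vote is tied.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

messages = [
    "I want a refund now!!",
    "Thank you for the quick help",
    "Thanks, but I still want a refund",
    "Where is my order?",
]
for message in messages:
    print(repr(message), "->", weak_label(message))
```

The labeled examples produced this way are noisy, but they are cheap to generate at scale, and the downstream model is trained on them in place of hand-labeled data. Messages on which the heuristics disagree or stay silent are simply left unlabeled.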

Note: As we see it, the jury is still out on what the umbrella term for this area of machine learning will be, and the area also brings paradigm shifts in the way ML problems are solved. Andrej Karpathy has referred broadly to Software 2.0 as a paradigm shift in software development in which the developer's focus is curating and managing data, from which machine learning algorithms use optimization to learn the desired functionality. We take inspiration from this concept in our understanding and usage of transfer learning.

We are thrilled at the opportunities presented by these technologies. That is why we are creating the DataFleets platform: the world's first software that integrates private ML and transfer learning into a single, easy-to-use system.

If your systems could talk and you could bring all your organization's resources to bear in a single machine learning system, what would you build? We'd love to hear! Get in touch :)