Five steps to preserve data privacy and reduce risk in AI

By Camille Morhardt, Director of Security Initiatives, Intel

January 27, 2023

Taking steps to preserve the privacy and confidentiality of data used by AI and ML models is crucial to maintain customer trust and preserve reputation

From driver assistance systems to early healthcare diagnostics, humans are benefitting from insights generated by AI and ML models. To generate and iterate these ML models, algorithms often comb through copious amounts of raw data, which may include personally identifiable information (PII) or intellectual property (IP). Taking steps to preserve the privacy and confidentiality of that data within AI/ML is crucial to maintain customer trust and preserve reputation.

There are several approaches for doing this. Among them are well-established encryption algorithms for data at rest and in transit, as well as data abstraction or anonymisation. There are also relatively new methods in AI that learn from data without ever moving it. More recently, methods have emerged to better protect data while it is in use, broadly referred to as ‘confidential computing’.

There are two primary approaches to protecting data in use. One is homomorphic encryption, which allows computation directly on encrypted data and is moving from research labs to select production deployments, with the expectation of significant advances in silicon support for acceleration. The other is a trusted execution environment (TEE), which is currently a major focus in the industry. TEEs provide enhanced protection to data and workloads irrespective of how they are packaged: virtual machines, containers or even bare-metal native applications. Each of these methods operates differently, and their benefits and implementations need to be weighed along with data sensitivity and workload optimization.
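To make the idea of computing on encrypted data concrete, here is a minimal sketch of an additively homomorphic scheme (Paillier). The scheme is chosen here purely for illustration; the article does not name one. The primes are toy-sized and the function names are hypothetical, so this is not suitable for real use.

```python
import math
import secrets

# Toy parameters for illustration only; real keys use primes of 1024 bits or more.
P, Q = 293, 433
N = P * Q                      # public modulus
N_SQ = N * N
G = N + 1                      # common simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1), private
MU = pow(LAM, -1, N)           # modular inverse of LAM mod N (Python 3.8+), private


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N using the public values N and G."""
    if not 0 <= m < N:
        raise ValueError("message out of range")
    while True:
        r = secrets.randbelow(N - 1) + 1   # random value in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts yields an encryption of the sum of the plaintexts."""
    return (c1 * c2) % N_SQ


# The party doing the addition never sees 12 or 30 in the clear.
total = decrypt(add_encrypted(encrypt(12), encrypt(30)))  # 42
```

The point of the sketch is that the party holding only ciphertexts can still produce a useful result, which is why the technique is attractive for sensitive data despite its computational cost.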

To determine the level and location of risk exposure to your AI and ML, you can walk through the entire chain of data, from point of origin through inference. When doing that, consider five key questions: What? Where? Who? Why? And How? From there, you can begin to determine how best to protect your models and data. Let’s dive into these in more detail.

What collected the data?

Data is often generated at “the edge,” that is, far away from the central learning model that will process the data, learn, and make inferences from it. The edge could be, for example, an imaging machine in a hospital generating an MRI of a patient, a satellite taking pictures of Earth, or a person talking on a mobile phone while walking around the city. In each case, the device at the edge is collecting data and doing some processing of the data on-site. To trust the data, you must first trust the device that is collecting it.

To assess device security, begin at the hardware layer and work your way up. Is the device what you think it is? Do you have a mechanism to authenticate that the device, and the components within it that run active firmware, are what they claim to be? Do you have digital assurance that the device hasn’t been tampered with between manufacture and provisioning? Do you have a way to verify that the device has been provisioned correctly, and is running the version of the firmware and operating system that you expect? Is the device up to date with hardware and software patches, and is it running security and manageability software?
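One of the checks above, confirming that a device runs the firmware you expect, can be sketched as a digest comparison. This is a simplified illustration: the `ExpectedFirmware` record and `firmware_matches` function are hypothetical names, and a real device would rely on a hardware root of trust and signed measurements rather than a bare hash supplied by the device itself.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedFirmware:
    """Hypothetical manifest entry published by the device owner."""
    version: str
    sha256_hex: str


def firmware_matches(image: bytes, reported_version: str,
                     expected: ExpectedFirmware) -> bool:
    """Return True only if both the image digest and the version match."""
    digest = hashlib.sha256(image).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    digest_ok = hmac.compare_digest(digest, expected.sha256_hex)
    return digest_ok and reported_version == expected.version


# Example with made-up firmware bytes and version string.
good_image = b"example firmware image"
manifest = ExpectedFirmware("1.4.2", hashlib.sha256(good_image).hexdigest())

trusted = firmware_matches(good_image, "1.4.2", manifest)
tampered = firmware_matches(good_image + b"\x00", "1.4.2", manifest)
```

Here `trusted` is true and `tampered` is false, because changing even one byte of the image changes its digest.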

Where will the data travel?

In a Federated Learning model, raw data will remain where it originated, for example at the hospital where the patient was imaged, or locally, on the phone. Model aggregators ferry insights from the data to the central model. In this case, at minimum, you should consider how to protect the aggregators to preserve the integrity of the model itself.

However, it’s far more common in AI and ML for data to move to a central learning model for further processing and analysis. Encrypting data in transit is a well-established practice. Those planning for the long term will want to consider how long their products will operate in the world, and whether that timeframe could overlap with quantum computers capable of breaking today’s encryption. There are steps you can take now that offer some protection, including longer encryption keys or sending the encryption key over a separate, out-of-band channel from the encrypted data itself.
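The aggregation step in federated learning described above can be sketched as a weighted average of model parameters. This assumes a federated-averaging style scheme, which the article does not specify, and the function name and data layout are hypothetical. Each site sends only its locally trained weights and its sample count; the raw data never leaves the site.

```python
from typing import List, Sequence, Tuple

# Each update is (local model weights, number of local training samples).
ClientUpdate = Tuple[Sequence[float], int]


def federated_average(updates: Sequence[ClientUpdate]) -> List[float]:
    """Average client weight vectors, weighting each by its sample count."""
    if not updates:
        raise ValueError("no client updates supplied")
    total_samples = sum(count for _, count in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, count in updates:
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * count / total_samples
    return averaged


# Two hypothetical hospitals: one with 100 scans, one with 300.
global_weights = federated_average([
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
])  # [2.5, 3.5]
```

Because the aggregator sees every site's update, it is a natural target for tampering, which is why protecting it matters for the integrity of the resulting model.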

Who has access to the data?

Once you have taken the steps above, you will need to verify the identity of any person who has access to the device, including end users and administrators. Multi-factor authentication has become commonplace, and biometrics are becoming increasingly user-friendly, expanding from widely used fingerprint and facial recognition to typing patterns and even heart rhythms. Even without advanced verification tools, you can easily apply these best practices: set strong passwords, and allow access only to those people who require it. Even then, apply what is known as the principle of least privilege: give each person access only to the portion of data required to complete the task at hand. New advances in Confidential Computing are designed to let even verified administrators run public clouds without ever being able to access the data, even while it is being processed.
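The principle of least privilege can be illustrated with a small field-level filter. The roles, field names and `visible_fields` function below are hypothetical; a production system would enforce this in its access-control layer rather than in application code alone.

```python
from typing import Dict, FrozenSet, Mapping

# Hypothetical mapping of roles to the only fields each role needs.
ROLE_FIELDS: Dict[str, FrozenSet[str]] = {
    "radiologist": frozenset({"patient_id", "scan", "diagnosis"}),
    "billing": frozenset({"patient_id", "invoice_total"}),
}


def visible_fields(role: str, record: Mapping[str, object]) -> Dict[str, object]:
    """Return only the fields the role is allowed to see; unknown roles see nothing."""
    allowed = ROLE_FIELDS.get(role, frozenset())
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "patient_id": "p-001",
    "scan": "mri-2023-01.dcm",
    "diagnosis": "pending",
    "invoice_total": 1250,
}

billing_view = visible_fields("billing", record)
unknown_view = visible_fields("contractor", record)
```

The design choice worth noting is the default: a role that is not listed gets an empty view, so access must be granted explicitly rather than removed after the fact.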

Why are you collecting the data?

This is both an old-school and a bleeding-edge question. Data can be a gift or a liability. The least risky approach is never to collect or store any PII, or any proxies that could be used to infer PII. If you can abstract, differentiate or separate PII from other parts of the dataset, then do so. Other simple questions to ask are, “Do I require this information or is it a nice-to-have? What would I do differently if I did not have this information?” Ultimately, collect as little data as possible to accomplish the task, and store even less.
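Separating PII from the rest of a dataset might look like the sketch below, which keeps only the fields a task needs and replaces the direct identifier with a keyed hash. The field names and `minimise` function are hypothetical. Note that keyed hashing is pseudonymisation rather than anonymisation: anyone holding the key can re-link records, so the key needs the same care as the identifiers it replaces.

```python
import hashlib
import hmac
from typing import Dict, Mapping

# Hypothetical list of the only fields the model actually needs.
REQUIRED_FIELDS = frozenset({"age_band", "scan_result"})


def minimise(record: Mapping[str, object], secret_key: bytes,
             id_field: str = "patient_id") -> Dict[str, object]:
    """Drop everything except required fields and tokenise the identifier."""
    identifier = str(record[id_field]).encode("utf-8")
    token = hmac.new(secret_key, identifier, hashlib.sha256).hexdigest()
    reduced = {k: v for k, v in record.items() if k in REQUIRED_FIELDS}
    reduced["subject_token"] = token
    return reduced


raw = {
    "patient_id": "p-001",
    "name": "Jane Example",
    "age_band": "40-49",
    "scan_result": "normal",
}

stored = minimise(raw, secret_key=b"example-key-kept-in-a-vault")
```

After this step `stored` contains the age band, the scan result and an opaque token, but neither the name nor the original patient identifier.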

The latest thinking on responsible AI encourages data scientists to consider the reasons for data collection, who will have access to the data, and whether the questions asked are pertinent and fair to the people who are responding. Some very basic questions to start this process include, “Do I have permission to collect this data? Have I explained what the intended uses are and any risks to the people implicated? Am I able to delete the data upon request? Am I able to share my inferences from the data with those who provided the data? Am I structuring the data and asking questions of the data in a manner that is consistent with the desires of the people who are providing the data?”

Data privacy and risk assessment are key to ensuring healthy and productive AI. If you understand the risks, you can better protect against them, and use AI and ML to improve overall cybersecurity and meet Zero Trust requirements. If you want to learn more about data privacy, risk or AI, check out https://partnershiponai.org/.

How can data be protected when running in the cloud?

Most machine learning models run in a cloud environment. Clouds can be on-premises, but for many reasons, including flexibility, scalability, economics and security, organizations are increasingly moving workloads, including machine learning, to public clouds with shared infrastructure. In some cases, ML models are expressly run in public clouds to facilitate multi-party collaboration.

With the advent of Confidential Computing, data that was once processed in the clear in such environments can now be processed inside a Trusted Execution Environment (TEE). Before moving sensitive data or workloads to a public cloud environment, be sure to ask whether the cloud provider offers a Confidential Computing environment. For an industry-wide definition, check out the Confidential Computing Consortium at https://confidentialcomputing.io/.