Methods for Introducing Differential Privacy into Analytics APIs

Inspiration

The focus of my project was to prototype tools that make it easier for developers to integrate differential privacy algorithms into the applications they build. Differentially private algorithms seek to ensure that an observer of an aggregate value cannot determine if any given individual's information was used to produce that value. One method of ensuring differential privacy is by adding noise drawn from specific distributions to aggregate values.
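The noise-addition idea can be sketched in a few lines of Python. This is a minimal illustration of the classic Laplace mechanism, not code from this project's packages; the function name and parameters are my own:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return true_value perturbed with noise drawn from Laplace(0, sensitivity/epsilon).

    A smaller epsilon means a larger noise scale, and therefore a stronger
    privacy guarantee for the individuals behind the aggregate.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)
```

For a counting query (sensitivity 1) with epsilon = 0.1, each call would add noise with scale 10, so an observer of the released value learns very little about any one record.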

The math behind these systems can be challenging, but I was interested in differential privacy and spent a few days studying the algorithms. By writing packages/extensions that manage statistical noise, my hope is that developers can lean on these packages to write privacy-preserving systems. This should be possible (1) without deep knowledge of statistics and (2) with minimal changes to existing services.

What it does / How we Built It / Submission Detail and Approaches

This project introduces two possible approaches for easily integrating differential privacy (DP) into APIs dealing with sensitive data. Before discussing details, we should examine a simple service architecture. An API that aggregates and serves sensitive data from a database might have an architecture like the following:

[Diagram: api-flow]

In the diagram above:

  • (1,2) User sends a request to the URL of an API (more often a load balancer distributing traffic across multiple API instances) and waits for a response.
  • (3) API performs a query against DB and returns the aggregate value to User once the value is ready.

Where in this flow can we use a differential privacy algorithm to modify our "true" value? This repo proposes two solutions, one at the API layer, and one at the database layer.

API Side - FastAPI Differential Privacy Middleware

The first option is at the API layer (circle A): the API receives the "true" value from the DB and applies a DP algorithm to that value before sending it back to the User. This way, the User only ever observes a modified value.

In ./dp-middleware/noisemw, I wrote middleware for the FastAPI framework that applies noise drawn from the Laplace distribution to responses from configured endpoints. The middleware is pip-installable as a Python package and can be integrated into any FastAPI application with only a few new lines of code.

In the API below, an HTTP call to /population/sensitive-test-results/ calls CallToAnEHRDB and returns a sensitive test result value (K).

from fastapi import FastAPI

app = FastAPI()

# >>> START: new code to integrate Laplacian Noise Middleware
import noisemw

app.add_middleware(
    noisemw.LaplaceNoiseMiddleWare,
    privacy_budget_per_call=0.1,
    result_field="result",
    endpoints=["/population/sensitive-test-results/"],
)
# <<< END

@app.get("/population/sensitive-test-results/")
async def test_results():
    # CallToAnEHRDB stands in for a real query against an EHR database
    return CallToAnEHRDB()

With the noise middleware added, the result of the HTTP call is no longer K, but K + LaplaceNoise(*params-defined-in-middleware*). Provided the parameters are set properly, this noise makes it substantially more difficult for an attacker to derive individual test results from the aggregate value. The demo API for this project uses LaplaceNoiseMiddleWare on the /dp/stats-mw endpoint; you can try it here.
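The internals of noisemw aren't shown here, but conceptually the middleware intercepts a JSON response and perturbs one field before it leaves the server. A minimal sketch of that core step, where the function name and signature are illustrative rather than the package's real API:

```python
import json

import numpy as np

def perturb_result(body: bytes, result_field: str, epsilon: float,
                   sensitivity: float = 1.0) -> bytes:
    """Add Laplace(0, sensitivity/epsilon) noise to one field of a JSON body.

    Everything else in the payload passes through unchanged, so existing
    clients keep working; only the sensitive aggregate is perturbed.
    """
    payload = json.loads(body)
    payload[result_field] += np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return json.dumps(payload).encode()
```

A middleware class would wrap this around the response for each configured endpoint, leaving all other routes untouched.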

Database Side - PostgreSQL Functions & Extensions

An alternative approach is to push down the application of statistical noise to the database (circle B). In this case, DB returns a value that has already had noise applied before sending it to API. This has two benefits.

  • (1) - The DB, and only the DB, ever holds "true" data. This lets developers focus on securing data at rest, since the DB is the only location that ever holds result data.

  • (2) - Although FastAPI is relatively popular, developers build APIs with hundreds of frameworks across many languages, so FastAPI represents only a small share of the market. In contrast, PostgreSQL (along with MySQL and SQLite) is incredibly popular, and writing extensions for a few databases would have the same impact as writing dozens of language- and framework-specific middlewares. In short, this approach lets more developers immediately use DP algorithms in their services.

In this approach, the developer makes a one-line change to a SQL query their API issues, as in the example below:

-- Original :: without diff-priv -> liable to leak exact individual results
SELECT AVG(a_sensitive_test_result) FROM tests
WHERE age > 50 AND patient_zip = 11225;

-- Modified :: 1 LOC change that introduces statistical noise
-- Assumes `laplace_noise` extension + functions loaded onto DB
SELECT ADD_LAPLACE_NOISE(AVG(a_sensitive_test_result), 0.5) FROM tests
WHERE age > 50 AND patient_zip = 11225;

Under this approach, the API sees the value K + LaplaceNoise(*params-defined-in-db-function*) and simply returns it to the User. The demo API for this project uses this method on the /dp/stats-db endpoint, you can try it here.

Challenges we ran into / Accomplishments that we're proud of / What we learned

The math and statistics here are challenging and were tough to grasp at first. Although I picked up some of the intuition behind these concepts, I still have a long way to go. I was particularly proud of making use of the fact that for independent X, Y ~ U(0, 1), ln(X/Y) ~ Laplace(0, 1). This is a cool fact, and it let me write a really compact PostgreSQL Laplace noise function.
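The identity holds because -ln X and -ln Y are independent Exp(1) draws, and the difference of two independent Exp(1) variables is Laplace(0, 1). A quick empirical check of this, written in Python for convenience (the PostgreSQL function itself uses the same trick via random()):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(size=200_000)
y = rng.uniform(size=200_000)

# ln(X/Y) = (-ln Y) - (-ln X): a difference of two Exp(1) draws
samples = np.log(x / y)

# Laplace(0, 1) has mean 0 and standard deviation sqrt(2) ≈ 1.414
print(f"mean={samples.mean():.3f}, std={samples.std():.3f} (expect ≈ 0 and ≈ 1.414)")
```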

What's next for Differential Privacy Middlewares / Further Work

There are a few places that need immediate attention for this to become production ready. For reasons stated above, I'm more likely to continue to develop the PostgreSQL extension, but both approaches can be improved.

  • [API] Integrate With Rate Limiting - In contexts where DP algorithms are applied, there's a concept called a privacy budget: a bound on the cumulative privacy loss (epsilon) a single User is allowed to accrue across all of their queries. If a user can make hundreds of requests with bounded noise, they can average out the noise and exploit the API much as they would one without DP controls. Much like developers rate-limit the total number of calls to an API, I'd like to explore rate-limiting each User by the cumulative privacy loss of all their calls.

  • [PG] Port to C - My intention was to write the PostgreSQL implementation as an extension in C rather than as a pl/pgsql function. Right now, performance is acceptable thanks to the relationship between U(0, 1) and Laplace(0, 1) noted above. That was fine for a weekend, but writing these functions in C and compiling them into a PG extension would be a more formal, performant solution.

  • [BOTH] Calibration For Sample Size - To make correct assurances about a specific degree of privacy, epsilon must be calibrated to the sensitivity of each query. In the paper Calibrating noise to sensitivity in private data analysis (link), the authors discuss these methods. This weekend's project does not implement these adjustments, but they are a logical next step.

  • [BOTH] Verify Statistical Validity - My middleware implementation uses Python's well-tested NumPy library to generate Laplacian noise, and my PostgreSQL implementation produces results statistically indistinguishable from NumPy's. As mentioned above, there is additional context (e.g. handling a min/max is not the same as handling a median or average) that requires more study of DP algorithms and the literature.
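The privacy-budget idea from the first bullet could start as a simple per-user epsilon ledger using basic sequential composition (total privacy loss is the sum of the epsilons of all answered queries). Everything below, the class name included, is a hypothetical sketch rather than part of noisemw:

```python
class PrivacyBudget:
    """Track cumulative epsilon spent per user; refuse queries once exhausted.

    Hypothetical sketch. Assumes sequential composition: each answered query
    costs its epsilon, and a user is cut off at the total budget.
    """

    def __init__(self, total_epsilon: float = 1.0):
        self.total_epsilon = total_epsilon
        self.spent: dict[str, float] = {}

    def try_spend(self, user_id: str, epsilon_per_call: float) -> bool:
        spent = self.spent.get(user_id, 0.0)
        if spent + epsilon_per_call > self.total_epsilon:
            # Budget exhausted: reject the call (or, alternatively,
            # answer with much coarser noise)
            return False
        self.spent[user_id] = spent + epsilon_per_call
        return True
```

A rate limiter built on this would deny (or heavily perturb) requests from users who have exhausted their budget, rather than counting raw request volume.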
