“Amnesia” - A Selection of Machine Learning Models
That Can Forget User Data Very Fast
Sebastian Schelter
New York University
ABSTRACT
Software systems that learn from user data with machine
learning (ML) have become ubiquitous in recent years.
Recent law requires organisations that process personal data
to delete user data upon request (enacting the “right to be
forgotten”).
However, it is not sufficient to merely delete the user data
from databases. ML models that have been learnt from the
stored data often resemble a lossy compressed version of the
data. We therefore argue that the user data should also
be removed from ML models due to potential privacy risks.
Typically, this requires an inefficient and costly retraining
of the affected ML models from scratch, where the training
infrastructure has to re-access the original training data and
redeploy the retrained model.
We address this performance and operational issue by for-
mulating the problem of “decrementally” updating trained
ML models to “forget” the data of a user, without the need
to re-access the training data. We present efficient decre-
mental update procedures for a selection of four popular
ML algorithms, together with Rust-based single-threaded
implementations.
In an experimental evaluation on nine real world datasets,
we find that our decremental update approach is several or-
ders of magnitude faster than model retraining in the major-
ity of cases. We additionally describe parallel implementa-
tions of our update procedures in Differential Dataflow and
discuss the limitations of our approach.
1. INTRODUCTION
Software systems that learn from user data with machine
learning (ML) have become ubiquitous in recent years [24],
and participate in many critical decision making processes,
e.g., to decide about loans, job applications and medical
treatments. Recent laws such as the General Data Protec-
tion Regulation (GDPR) and the “right to be forgotten” re-
quire companies and institutions that process personal data
to delete user data upon request [11]. We argue that it is not
sufficient to merely delete this user data from primary data
stores such as databases. ML models that have been learnt
from the stored data resemble a lossy compressed version of
the data in many cases. Not updating these models might
introduce a privacy risk, e.g., when the ML model can be
leveraged to make elaborate guesses about the deleted data
(or even to restore it). We discuss an example of such a case
in Section 2.
While relational databases offer transactional deletes and
corresponding updates for materialized views [14] over the
data, there exists no such automated deletion mechanism
for ML models derived from the data. Instead, the dele-
tion of user data typically requires an inefficient and costly
retraining of the affected ML models from scratch on the
original training data, for example on a nightly basis. This
retraining is especially complicated from an operational per-
spective, as the training infrastructure has to reaccess the
training data, and redeploy the retrained model afterwards.
We propose to alleviate this problem by describing how to
“decrementally” update a selection of trained ML models in
an efficient way, such that they “forget” the user data. In
other words, the decrementally updated model is identical to
a model that would have been trained from scratch without
the data of the user to remove.
We formalize this problem in Section 3, and describe effi-
cient decremental update procedures for four ML algorithms
from different ML domains (recommendation, classification,
regression) in Section 3.2. We experimentally evaluate our
approach on nine real world datasets in Section 6. Our ap-
proach is applicable to a family of ML models with non-
iterative training, which exhibit sparse computational de-
pendencies [3, 9] between the model and the data. We dis-
cuss the generality of our approach in Section 7.
In summary, we provide the following contributions:
- We introduce the problem of making trained ML models "forget" user data via decremental updates without re-accessing the training data (Section 3).
- We describe corresponding decremental update procedures for four algorithms from different ML domains (recommendation, classification, regression) (Section 3.2).
- We provide single-threaded and dataflow implementations of our proposed algorithms (Section 4).
- We conduct an experimental evaluation on nine real-world datasets, and find that our decremental update is several orders of magnitude faster than model retraining in the majority of cases (Section 6).
2. NEGATIVE PRIVACY IMPLICATIONS
We outline the potential privacy implications of not updating ML models after data removal using a fictitious example scenario, which leverages real-world data. The Goodbooks dataset (http://fastml.com/goodbooks-10k) contains six million ratings for ten thousand books given by fifty thousand (deidentified) users. Consider
user 51881 who has read the following four books: Thirteen
Reasons Why, Speak, All the Bright Places, My Heart and
Other Black Holes. This information is potentially sensitive
as all the books are dealing with the topic of teenage suicide
(and are tagged with suicide-notes in the dataset). This
user (or the parents of the user) might ask the data owner
to remove this sensitive information from their database,
e.g., under the legal umbrella of the “right-to-be-forgotten”
as part of the European General Data Protection Regula-
tion (GDPR) [11]. The data owner would implement this
by removing the records for the user from the books_read
table as shown in Figure 1.
However, the data owner might still have ML models that
have been learnt from the version of the database prior to
the removal of our user’s records. An example could be a
recommendation model built with user-based collaborative
filtering [17]. This model would maintain a matrix of sim-
ilarities between pairs of users, which can be leveraged to
recommend books to other users (e.g., by choosing books
from the reading histories of highly similar users). The similarity of two users in this scenario could for example be computed as the Jaccard similarity between their reading histories (viewed as sets of books).
Unfortunately, such a user similarity matrix can be used to
make elaborate guesses about the (already deleted) reading
history of user 51881 as we show in Figure 1. The mean sim-
ilarity of our user to other users (as measured by a random
sample of 100 users) is only about 0.04. We can however
identify other users with significantly higher similarity by
inspecting the recommendation model: examples are user
8364 with a similarity of 0.197, user 44035 with a similarity
of 0.248 or user 46992 with a similarity of 0.207. If we in-
spect the intersection of the reading histories of these highly
similar users, we again encounter many occurrences of the
sensitive books that were contained in the deleted reading
history of user 51881: three occurrences of Thirteen Rea-
sons Why, two occurrences of Speak and one occurrence of
All the Bright Places. Based on this observation, we can
conclude that it is very likely that these books also existed
in the (now deleted) reading history of user 51881 and that
the high similarity in the recommendation model is a result
of the overlapping interest in books about teenage suicide.
3. APPROACH
In order to address the previously described privacy risks,
we develop methods to efficiently update trained ML models
after data removal. We first formalize the problem of effi-
cient “decremental” updates for having models “forget” data,
and describe decremental (and incremental) variants of four
popular ML algorithms in detail afterwards.
[Figure 1: Example scenario for negative privacy implications of not updating trained ML models after data deletion. The figure shows an excerpt of the books_read table (user_id, book_id) and the books table (book_id, title, tags), in which the books 146 (Thirteen Reasons Why), 233 (Speak), 920 (All the Bright Places) and 5278 (My Heart and Other Black Holes) are tagged with suicide-notes, next to the row of the user similarity matrix from the recommendation model for user 51881, which contains similarities of 0.197 to user 8364, 0.248 to user 44035 and 0.207 to user 46992. A recommendation model of user similarities enables elaborate guesses about sensitive deleted data (a reading history of books about teenage suicide).]

3.1 Problem Statement
Let $t_{learn}(D, \theta)$ denote a procedure to learn an ML model $f$ (such as a classifier) from training data $D$ with hyperparameters $\theta$. Let $D = \{d_1, d_2, \ldots, d_m\}$ denote the data of the $m$ users (our training data), and let $f = t_{learn}(D, \theta)$ denote the ML model learnt from the user data $D$.
Now, assume that we are required to delete the data $d_m$ of the $m$-th user from the model. That means we are interested in obtaining another ML model $f' = t_{learn}(D \setminus \{d_m\}, \theta)$, which never saw the data $d_m$ of user $m$ during its training. This new model can be trivially obtained by repeating the training procedure $t_{learn}$ on $D \setminus \{d_m\}$.
Conducting a complete model retraining might however be inefficient and costly in many cases, and we can assume that the loss of data from a single user (or a handful of users) does not significantly change the predictive power of the model, or require a different choice of hyperparameters. Instead, we are interested in an efficient procedure $t_{forget}$ that can inspect $d_m$ and update the existing model $f$ to the desired model $f'$, such that $f' = t_{forget}(f, \{d_m\}, \theta) = t_{learn}(D \setminus \{d_m\}, \theta)$. This approach is referred to as "decremental learning" in the ML literature [5].
In summary, the decremental update procedure $t_{forget}$ must satisfy the following property, and provide a runtime for $t_{forget}$ that is much smaller than the runtime of $t_{learn}$ for retraining:

$$t_{forget}(t_{learn}(D, \theta), \{d_m\}, \theta) = t_{learn}(D \setminus \{d_m\}, \theta)$$
This approach bears some resemblance to online learning,
where we incrementally update an existing model with new
observations. A major difference is that we not only need
a merge operation to integrate updates into the global re-
sult, but also an inverse operation to remove such updates.
This is problematic for learning algorithms such as gradient
descent (which naturally supports online learning), as they
compute a fixpoint using a series of global aggregations on
all input tuples in every step, where the result of each iter-
ation depends on all tuples and all previous results.
Therefore, we focus on non-iterative ML algorithms which
exhibit sparse computational dependencies [9, 27]. We fur-
ther discuss the generality of our approach in Section 7.
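To make the required interface concrete, the following is a minimal, hypothetical sketch (not part of the paper's codebase; the trait and method names are ours) of how the decremental update property could be expressed in Rust: a model supporting forgetting exposes a forget operation next to its incremental partial_fit, and applying forget after partial_fit for the same record must leave the model in the state it would have had if the record had never been observed, without re-accessing the remaining training data.

// Hypothetical interface sketch for the decremental update property from Section 3.1.
trait Forgettable<Record> {
    /// Incrementally incorporate the data of one user into the model.
    fn partial_fit(&mut self, record: &Record);

    /// Decrementally remove the data of one user; the inverse of partial_fit.
    fn forget(&mut self, record: &Record);
}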
3.2 Decremental / Incremental Models
Our approach is based on the general idea of having the
training algorithm for an ML model retain intermediate re-
sults during the model computation, which can be efficiently
updated in both an incremental manner (to incorporate new
user data) as well as a decremental manner (to remove user
data) later on. In the following, we detail four algorithms for different ML tasks (recommendation, regression, classification) and describe efficient incremental / decremental update procedures for them. For each model, we describe the decremental update via the FORGET function and the incremental update via the PARTIAL-FIT function, and detail how to derive predictions from the model via the PREDICT function.
3.2.1 Item-Based Collaborative Filtering
(Recommender Systems)
Item-based collaborative filtering [23, 25] is a popular rec-
ommendation approach, which compares user interactions
to find related items in the sense of: ‘people who like this
item also like these other items’. In a movie recommen-
dation case for example, the system records which movies
often cooccur in the viewing histories of users. These pairs
of cooccurring movies are then ranked and form the basis for
recommendations later on. In its simplest form, the input data for an item-based recommender consists of a binary history matrix $H \in \{0, 1\}^{|U| \times |I|}$ which denotes the observed interactions between a set of users $U$ and a set of items $I$. An entry $H_{ji}$ equals 1 if user $j$ interacted with item $i$, and 0 otherwise. Such binary observation data can be easily gathered by recording user actions (such as purchases in online shops or plays of a video on a movie platform).
Model & Prediction. The model of an item-based recommender consists of a similarity matrix $S \in \mathbb{R}^{|I| \times |I|}$ which denotes the interaction similarity between pairs of items. A common way to train this model is to first compute a cooccurrence matrix $C = H^\top H$ which denotes how many users interacted with each pair of items. Additionally, we need a vector $n = \sum_{j \in U} H_j$ denoting the number of interactions per item (the sum of the rows of $H$). Next, we can compute the similarity matrix $S$ by inspecting the cooccurrence counts. Many similarity measures between items can be computed from the cooccurrence matrix [25]. A popular choice is the Jaccard similarity $S_{i_1 i_2}$ between items $i_1$ and $i_2$, computed via $S_{i_1 i_2} = C_{i_1 i_2} / (n_{i_1} + n_{i_2} - C_{i_1 i_2})$. Recommendations can be retrieved by querying the similarity matrix $S$. We retrieve item-to-item recommendations by querying for the most similar items per item (e.g., by invoking the PREDICT function in Algorithm 1), and generate items to recommend for a particular user by computing preference estimates via a weighted sum between item similarities and the corresponding user history [23].
Decremental and incremental updates. We require the following intermediate data structures to enable incremental and decremental updates for the item-based recommender: the item cooccurrence matrix $C$, the item interaction count vector $n$ and the item similarity matrix $S$.
The FORGET function in Algorithm 1 details how to remove the data of the $u$-th user (which corresponds to the $u$-th row $H_u$ of the user-item interaction matrix $H$) from the model. We loop over all pairs of items $(i_1, i_2)$ in the user history $H_u$ and decrement the corresponding cooccurrence count $C_{i_1 i_2}$. Finally, we iterate over each item $i_1$ in the user history and recompute its corresponding row $S_{i_1}$ in the similarity matrix. The incremental update of the model, illustrated in the PARTIAL-FIT function in Algorithm 1, works analogously, with the difference that we increment (instead of decrement) the cooccurrence counts.
Space and complexity. In general, an item-based recommendation model consists of a similarity matrix $S \in \mathbb{R}^{|I| \times |I|}$ which requires space quadratic in the number of items. The intermediate data structures for the incremental/decremental variants double the required memory, as we additionally need to maintain the cooccurrence matrix $C \in \mathbb{N}^{|I| \times |I|}$ and the vector $n \in \mathbb{N}^{|I|}$. An update requires $|H_u|^2$ adjustments of the cooccurrence matrix, and (in the worst case) the recomputation of $|H_u| \cdot |I|$ entries in the similarity matrix $S$, which means that an update has a quadratic complexity of $O(|I|^2)$ in the worst case. The vast majority of users will however have only interacted with a very small number of items in real-world data. Additionally, both the memory required for the intermediate data structures as well as the update complexity can be reduced by only retaining the top-$k$ entries per item in $S$ and by introducing bounds on the maximum number of interactions to consider per user and item, an approach which we discussed and evaluated in detail in previous work [26].
Algorithm 1 Decremental / incremental update procedures for an item-based recommendation model.
Input: item cooccurrence matrix C, item interaction counts n, item similarity matrix S, user history H_u

1: function partial-fit(H_u, C, n, S)
2:   n ← n + H_u
3:   for item pair (i_1, i_2) ∈ H_u do
4:     C_{i_1,i_2} ← C_{i_1,i_2} + 1
5:   for item i_1 ∈ H_u do
6:     for item i_2 ∈ C_{i_1} do
7:       S_{i_1,i_2} ← C_{i_1,i_2} / (n_{i_1} + n_{i_2} − C_{i_1,i_2})

8: function forget(H_u, C, n, S)
9:   n ← n − H_u
10:  for item pair (i_1, i_2) ∈ H_u do
11:    C_{i_1,i_2} ← C_{i_1,i_2} − 1
12:  for item i_1 ∈ H_u do
13:    for item i_2 ∈ C_{i_1} do
14:      S_{i_1,i_2} ← C_{i_1,i_2} / (n_{i_1} + n_{i_2} − C_{i_1,i_2})

15: function predict(j, k, S)
16:   return top-k items from S_j
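As an illustration, the following Rust sketch mirrors the FORGET procedure of Algorithm 1 on hash-map-based data structures. It is a hypothetical simplification, not the paper's actual implementation: the struct and function names are ours, and a real implementation would keep a per-item row index for C instead of scanning all cooccurrence keys.

use std::collections::HashMap;

// Hypothetical sketch of the FORGET step for the item-based recommender.
struct ItemBasedModel {
    cooccurrences: HashMap<(u32, u32), u32>, // C[(i1, i2)]
    interaction_counts: HashMap<u32, u32>,   // n[i]
    similarities: HashMap<(u32, u32), f64>,  // S[(i1, i2)]
}

impl ItemBasedModel {
    fn forget(&mut self, user_history: &[u32]) {
        // Decrement the interaction count of every item in the removed history.
        for &item in user_history {
            if let Some(count) = self.interaction_counts.get_mut(&item) {
                *count -= 1;
            }
        }
        // Decrement the cooccurrence count of every item pair in the removed history.
        for &i1 in user_history {
            for &i2 in user_history {
                if i1 != i2 {
                    if let Some(count) = self.cooccurrences.get_mut(&(i1, i2)) {
                        *count -= 1;
                    }
                }
            }
        }
        // Recompute the affected similarity entries from the updated counts.
        // (A real implementation would keep a per-item row index instead of scanning all keys.)
        for &i1 in user_history {
            let affected: Vec<(u32, u32)> = self
                .cooccurrences
                .keys()
                .filter(|&&(a, _)| a == i1)
                .cloned()
                .collect();
            for (a, b) in affected {
                let c = self.cooccurrences[&(a, b)];
                if c == 0 {
                    // The pair no longer cooccurs in any history, drop its entries.
                    self.cooccurrences.remove(&(a, b));
                    self.similarities.remove(&(a, b));
                    continue;
                }
                let n_a = self.interaction_counts[&a] as f64;
                let n_b = self.interaction_counts[&b] as f64;
                self.similarities
                    .insert((a, b), c as f64 / (n_a + n_b - c as f64));
            }
        }
    }
}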
Restoration of deleted data from a stale model. Next, we discuss how deleted user data could be restored from a stale model. We look at the case where the data of a single user has been deleted from the database. That means that the interaction data now consists of a matrix $\hat{H}$ which is the original database $H$ with the row corresponding to the deleted user removed. If we still have access to the (now stale) model $S$ computed from $H$, we can restore the deleted user data as follows. We compute the similarity matrix $\hat{S}$ corresponding to the updated database $\hat{H}$, and compare it to the stale model $S$. We identify all items $i$ for which we observe differences in entries of the similarity matrices (i.e., $\exists j: S_{ij} \neq \hat{S}_{ij}$). The set of these items is exactly the set of items that were contained in the interaction history of the deleted user.
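The following hypothetical Rust sketch (the function name and data layout are ours) illustrates this comparison: given the stale similarity matrix and a freshly recomputed one, every item touched by a changed similarity entry must have appeared in the deleted user's history.

use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the restoration attack: items whose similarity entries
// changed between the stale and the recomputed model were part of the deleted history.
fn items_of_deleted_user(
    stale: &HashMap<(u32, u32), f64>,
    fresh: &HashMap<(u32, u32), f64>,
) -> HashSet<u32> {
    let mut items = HashSet::new();
    for (&(i, j), &stale_sim) in stale {
        if fresh.get(&(i, j)).copied().unwrap_or(0.0) != stale_sim {
            items.insert(i);
            items.insert(j);
        }
    }
    items
}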
3.2.2 Ridge Regression (Regression)
Ridge regression [15] is a widely used technique to predict a continuous target variable in regression scenarios. The input data to the regression model is a matrix $X \in \mathbb{R}^{m \times d}$ of $m$ $d$-dimensional observations, and a corresponding numeric target variable $y \in \mathbb{R}^m$. Without loss of generality, we assume that the data of a particular user $i$ is captured in the $i$-th row vector $X_i$ of the input (if more than one row would denote data about a particular user, we could simply run the decremental update procedure several times).
Model & Prediction. A common way to compute a ridge regression model in the form of the weight vector $w$ is to solve the normal equation $w = (X^\top X + \lambda I)^{-1} X^\top y$. The regularization constant $\lambda$ is typically selected via cross-validation. The regression estimate for a new observation $x_{new}$ can be computed via its dot product with the weight vector: $\hat{y}_{new} = w^\top x_{new}$ (as shown in the PREDICT function in Algorithm 2).
Decremental and incremental updates. We detail how to remove the data of a user $u$ (in the form of the $u$-th row $X_u$ of the observation matrix $X$) from a ridge regression model. We aim for an efficient way to compute:

$$w = (X^\top X - X_u^\top X_u + \lambda I)^{-1} (X^\top y - X_u^\top y_u)$$

We therefore maintain the following intermediates from the computation: the vector $z = X^\top y$ and a QR factorization$^2$ $QR = \text{qr}(X^\top X + \lambda I)$ of the regularized Gram matrix. We can now recompute $z$ by subtracting $X_u^\top y_u$, and we invoke a fast rank-one update algorithm [30] with $Q^\top X_u$ and $X_u$ as arguments to update the QR decomposition $Q, R$. Afterwards, we can efficiently solve for the updated model $w$ (see the FORGET function in Algorithm 2). If we substitute the subtractions with additions, this gives us an incremental update algorithm as well, as illustrated by the PARTIAL-FIT function of Algorithm 2.
Space and complexity. The decremental variant of the model requires us to maintain two additional matrices $Q \in \mathbb{R}^{d \times d}$ and $R \in \mathbb{R}^{d \times d}$ as well as the vector $z \in \mathbb{R}^d$. This means that the required space is quadratic in the number of features $d$ (which is typically much smaller than the number of examples $m$), but independent of the number of examples. An update requires $2d$ operations to scale and add to $z$, $d^2$ operations for the matrix-vector multiplication $Q^\top X_u$, $26d^2$ operations for the rank-one QR update [30], another $d^2$ operations for the matrix-vector multiplication $Q^\top z$, and again $d^2$ operations for solving for $w$ via back substitution. In summary, an update has a complexity of $O(d^2)$, which is an improvement over retraining, which requires $O(md^2)$ operations.
Restoration of deleted data from a stale model. In contrast to the other approaches presented here, it is more challenging to infer information about a deleted user feature vector $X_d$ from a regression model $w$. If we still had access to the complete target variable $y$, we could constrain candidate vectors to the subspace defined by $w^\top X_d = y_d$ (with some uncertainty due to the contained prediction error). Additional knowledge about $X$ (e.g., about the ranges of certain features or about row-wise normalisation techniques) would further restrict the space of candidate vectors.
$^2$The QR factorization is a decomposition of a matrix $A$ into a product $A = QR$ of an orthogonal matrix $Q$ and an upper triangular matrix $R$, which is particularly useful for solving linear least squares problems [30].
Algorithm 2 Decremental / incremental update procedures for a ridge regression model.
Input: z = X^T y, QR = qr(X^T X + λI), user observation X_u

1: function partial-fit(X_u, y_u, Q, R, z)
2:   z ← z + X_u y_u
3:   Q, R ← qr_update(Q, R, Q^T X_u, X_u)
4:   solve Rw = Q^T z for w

5: function forget(X_u, y_u, Q, R, z)
6:   z ← z − X_u y_u
7:   Q, R ← qr_update(Q, R, −Q^T X_u, X_u)
8:   solve Rw = Q^T z for w

9: function predict(x_new, w)
10:  return w^T x_new
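For illustration, here is a simplified, hypothetical Rust sketch of the FORGET step for ridge regression. It deliberately deviates from Algorithm 2: instead of maintaining and rank-one-updating a QR factorization via GSL, it maintains the regularized Gram matrix X^T X + λI and z = X^T y directly, subtracts the contribution of the removed row, and re-solves the normal equation with a small self-contained Gaussian elimination solver (which costs O(d^3) per update instead of the O(d^2) achieved by the QR update). Struct and function names are ours.

// Hypothetical simplified sketch: maintain sufficient statistics and re-solve on forget.
struct RidgeModel {
    gram: Vec<Vec<f64>>, // G = X^T X + lambda * I, shape d x d
    z: Vec<f64>,         // z = X^T y, length d
    weights: Vec<f64>,   // current model w, length d
}

impl RidgeModel {
    fn forget(&mut self, x_u: &[f64], y_u: f64) {
        let d = self.z.len();
        // Subtract the rank-one contribution of the removed observation.
        for i in 0..d {
            self.z[i] -= x_u[i] * y_u;
            for j in 0..d {
                self.gram[i][j] -= x_u[i] * x_u[j];
            }
        }
        // Re-solve G w = z for the updated weights.
        self.weights = solve(&self.gram, &self.z);
    }
}

// Naive Gaussian elimination with partial pivoting, sufficient for small d.
fn solve(a: &[Vec<f64>], b: &[f64]) -> Vec<f64> {
    let d = b.len();
    let mut m: Vec<Vec<f64>> = a.iter().cloned().collect();
    let mut rhs = b.to_vec();
    for col in 0..d {
        // Pivot on the largest absolute value in the current column.
        let pivot = (col..d)
            .max_by(|&i, &j| m[i][col].abs().partial_cmp(&m[j][col].abs()).unwrap())
            .unwrap();
        m.swap(col, pivot);
        rhs.swap(col, pivot);
        // Eliminate the column below the pivot.
        for row in (col + 1)..d {
            let factor = m[row][col] / m[col][col];
            for k in col..d {
                m[row][k] -= factor * m[col][k];
            }
            rhs[row] -= factor * rhs[col];
        }
    }
    // Back substitution.
    let mut w = vec![0.0; d];
    for row in (0..d).rev() {
        let mut acc = rhs[row];
        for k in (row + 1)..d {
            acc -= m[row][k] * w[k];
        }
        w[row] = acc / m[row][row];
    }
    w
}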
3.2.3 k-Nearest Neighbors with Locality Sensitive
Hashing (Classification)
Our next example leverages a nearest neighbor-based classification algorithm, which assigns the majority label of the $k$ nearest neighbors to an unknown observation. It applies a common approximation technique to speed up the search for the nearest neighbors called locality sensitive hashing (LSH) [13]. The input data for the classifier consists of a matrix $X \in \mathbb{R}^{m \times d}$ of $m$ $d$-dimensional observations, and the target variable $y \in \{1, \ldots, c\}^m$ which denotes the assignments of the corresponding $c$ categorical labels. We again assume that each row $X_i$ corresponds to observations for a particular user $i$.
Model & Prediction. The algorithm leverages approximate similarity search in a high-dimensional space by building an index over the data. This index consists of several hash tables where observations which are close in terms of Euclidean distance have a high chance of ending up in the same hash bucket. We leverage random projections as hash function to compute the bucket indexes. We compute $H$ hash tables $T_1, \ldots, T_H$ with $b$-dimensional bucket indexes. We generate a Gaussian random matrix $\Omega_h$ for each hash table $T_h$, and leverage this matrix to conduct a random projection of our data via $\text{sgn}(X\Omega_h)$. The resulting bit vectors denote the hash keys, and we assign each row $X_i$ from $X$ to its corresponding bucket with key $\text{sgn}(X_i\Omega_h)$ in the hash table $T_h$. In order to assign a label to an unseen observation $X_{new}$, we collect its nearest neighbors as follows. For each hash table $T_h$, we compute the bucket index $b_h = \text{sgn}(X_{new}\Omega_h)$, and collect all observations from this bucket. Next, we compute the $k$ exact nearest neighbors of $X_{new}$ among the retrieved observations, and assign the majority label from these as the predicted label $\hat{y}_{new}$ to $X_{new}$. This can be efficiently implemented by maintaining a binary heap with the $k$ closest examples, as shown in the PREDICT function of Algorithm 3.
Decremental and incremental updates. Removing an existing observation $X_u$ for a user $u$ from our index works as depicted in the FORGET function of Algorithm 3. We compute the bucket index $b_{uh}$ of $X_u$ via the random projection $\text{sgn}(X_u\Omega_h)$, and remove $X_u$ from bucket $b_{uh}$ in hash table $T_h$. We repeat this procedure for all hash tables. An incremental update works analogously, as listed in the PARTIAL-FIT function in Algorithm 3.
Space and complexity. Our approach requires no additional space, as it simply leverages the hash tables $T_1, \ldots, T_H$ and the projection matrices $\Omega_1, \ldots, \Omega_H$, which are required for deriving predictions from the model anyway. A decremental update requires $H$ matrix-vector multiplications between a projection matrix of size $d \times b$ and the feature vector of dimensionality $d$ for determining the bucket indices, plus $H$ subsequent hash table removals, leading to a complexity of $O(Hdb)$.
Algorithm 3 Decremental / incremental update procedures for approximate k-NN.
Input: hash tables T_1, ..., T_H, projection matrices Ω_1, ..., Ω_H, observation X_u

1: function partial-fit(X_u, Ω, T)
2:   for h = 1 ... H do
3:     b_uh = sgn(X_u Ω_h)
4:     add X_u to bucket b_uh in hash table T_h

5: function forget(X_u, Ω, T)
6:   for h = 1 ... H do
7:     b_uh = sgn(X_u Ω_h)
8:     remove X_u from bucket b_uh in hash table T_h

9: function predict(x_new, Ω, T)
10:  P ← heap with capacity k for closest examples
11:  for h = 1 ... H do
12:    b_h = sgn(x_new Ω_h)
13:    update P for all examples from bucket b_h in hash table T_h
14:  return majority label from k examples in P
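The following hypothetical Rust sketch (our own simplification, not the paper's code) shows the core of Algorithm 3 for a single hash table: the bucket key is the vector of sign bits of a random projection, and forgetting an observation simply removes it from the bucket it hashes to. We additionally store a row id per observation so that the entry to remove can be identified.

use std::collections::HashMap;

// Hypothetical sketch of one LSH hash table, assuming at most 64 projection directions
// so that the bit-vector bucket key fits into a u64.
struct LshTable {
    projection: Vec<Vec<f64>>,                      // Omega_h: b directions of dimensionality d
    buckets: HashMap<u64, Vec<(usize, Vec<f64>)>>,  // bucket key -> (row id, feature vector)
}

impl LshTable {
    // Bucket key: one sign bit per projection direction.
    fn bucket_key(&self, x: &[f64]) -> u64 {
        let mut key = 0u64;
        for (bit, direction) in self.projection.iter().enumerate() {
            let dot: f64 = direction.iter().zip(x).map(|(w, v)| w * v).sum();
            if dot >= 0.0 {
                key |= 1 << bit;
            }
        }
        key
    }

    fn partial_fit(&mut self, row_id: usize, x: &[f64]) {
        let key = self.bucket_key(x);
        self.buckets.entry(key).or_default().push((row_id, x.to_vec()));
    }

    fn forget(&mut self, row_id: usize, x: &[f64]) {
        let key = self.bucket_key(x);
        if let Some(bucket) = self.buckets.get_mut(&key) {
            bucket.retain(|(id, _)| *id != row_id);
        }
    }
}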
Restoration of deleted data from a stale model. The restora-
tion of deleted data from the model is trivial as the model
stores copies of the feature vectors.
3.2.4 Multinomial Naive Bayes (Classification)
The last algorithm we discuss is again a classification algorithm called Naive Bayes [22], where we focus on a variant for categorical data called Multinomial Naive Bayes (MNB). The input data for the classifier consists of a matrix $X \in \mathbb{N}^{m \times d}$ of $m$ $d$-dimensional observations, and the target variable $y \in \{1, \ldots, c\}^m$ which denotes the assignments of the corresponding $c$ categorical labels. We again assume that each row $X_i$ corresponds to observations for a particular user $i$.
Model & Prediction. The Naive Bayes model is built on the assumption that features are conditionally independent, given the class label. If we use a uniform prior estimate for the sake of simplicity, then the MNB classifier assigns its class prediction $\hat{y}$ as follows: $\hat{y} = \arg\max_y \sum_j X_{ij} \log \theta_{yj}$, where the probability $P(x_j = k \mid y) = \theta_{yj}$ of encountering the value $k$ for feature $j$ in class $y$ is estimated using a smoothed version of the maximum likelihood estimate $\hat{\theta}_{yj} = (N_{yj} + \alpha_j) / (N_y + \alpha)$. Here, $N_{yj}$ denotes the number of times the value for feature $j$ occurs in class $y$, $N_y = \sum_j N_{yj}$, $\alpha_j$ is the smoothing term per feature, and $\alpha$ is the sum of the smoothing terms.
Decremental and incremental updates. Adding or removing a user feature vector $X_u$ with a given class label $y$ for MNB is trivial, as we just need to increment or decrement the corresponding feature counts $N_{yj}$ and $N_y$ for all non-zero features with indexes $j \in X_u$. Note that this approach also works for the Gaussian variant of Naive Bayes for continuous data, where we have to maintain the mean and variance per feature and class label instead of the raw counts.
Space and complexity. Our incremental/decremental variant requires us to maintain the raw feature counts $N$ with space in $O(dc)$, which is the same space as required by the probability vectors $\theta$. The number of operations for incremental and decremental updates is linear in the number of non-zero features of the vector $X_u$ to add or remove.
Algorithm 4 Decremental / incremental update procedures for Multinomial Naive Bayes.
Input: conditional feature counts N

1: function partial-fit(X_u, y_u, N)
2:   for (j, X_uj) ∈ X_u do
3:     N_{y_u,j} ← N_{y_u,j} + X_uj
4:     N_{y_u} ← N_{y_u} + X_uj

5: function forget(X_u, y_u, N)
6:   for (j, X_uj) ∈ X_u do
7:     N_{y_u,j} ← N_{y_u,j} − X_uj
8:     N_{y_u} ← N_{y_u} − X_uj

9: function predict(x_new, N)
10:  return argmax_y Σ_{(j, x_j) ∈ x_new} x_j · log((N_{y,j} + α_j) / (N_y + α))
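The count maintenance of Algorithm 4 is equally simple in code. The following hypothetical Rust sketch (names and structure are ours, not the paper's implementation) keeps the per-class and per-class-per-feature counts in hash maps and adds or subtracts a user's sparse feature values; the class log-likelihood applies the smoothed estimate from above under a uniform prior.

use std::collections::HashMap;

// Hypothetical sketch of the count maintenance for Multinomial Naive Bayes.
struct MultinomialNb {
    feature_counts: HashMap<(usize, usize), f64>, // (label y, feature j) -> N_{y,j}
    label_counts: HashMap<usize, f64>,            // label y -> N_y
}

impl MultinomialNb {
    fn partial_fit(&mut self, features: &[(usize, f64)], label: usize) {
        for &(j, value) in features {
            *self.feature_counts.entry((label, j)).or_insert(0.0) += value;
            *self.label_counts.entry(label).or_insert(0.0) += value;
        }
    }

    fn forget(&mut self, features: &[(usize, f64)], label: usize) {
        for &(j, value) in features {
            *self.feature_counts.entry((label, j)).or_insert(0.0) -= value;
            *self.label_counts.entry(label).or_insert(0.0) -= value;
        }
    }

    // Smoothed log-likelihood of a sparse observation under one class,
    // with a constant per-feature smoothing term alpha (so its sum is d * alpha).
    fn log_likelihood(&self, features: &[(usize, f64)], label: usize, alpha: f64, num_features: usize) -> f64 {
        let n_y = self.label_counts.get(&label).copied().unwrap_or(0.0);
        features.iter().map(|&(j, value)| {
            let n_yj = self.feature_counts.get(&(label, j)).copied().unwrap_or(0.0);
            value * ((n_yj + alpha) / (n_y + alpha * num_features as f64)).ln()
        }).sum()
    }
}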
Restoration of deleted data from a stale model. We again discuss how deleted user data could be restored from a stale model, and focus on the simplified case where the data of a single user has been deleted from the database. In this case, we can recompute all parameter estimates $\hat{\theta}^{(new)}_{yj}$ and detect the features contained in the deleted data by identifying all differences $\hat{\theta}^{(new)}_{yj} \neq \hat{\theta}_{yj}$ between the new and old parameter estimates.
4. IMPLEMENTATION
We provide single-threaded and parallel implementations of our proposed algorithms in Rust (https://github.com/schelterlabs/projects-amnesia).
Single-Threaded. We depend on the fnv crate for fast
hashmaps, the ndarray crate with bindings to OpenBLAS
for the random projections in approximate k-NN, as well
as on the GSL crate with bindings to the GNU Scientific Library, which provides the routines for updating the QR factorizations required for ridge regression. Note that we implement
mini-batch versions of the PARTIAL-FIT function for efficient
retraining and apply a ‘user interaction cut’ [26] of 500 in-
teractions to our recommender implementation.
Differential Dataflow. We additionally implement dataflow
versions of our algorithms in Differential Dataflow [20, 19]
(DD) to enable parallelisation and distribution, as well as to
benefit from its automatic incrementalisation and decremen-
talisation. We find that three out of four of the algorithms
fit well into the dataflow model.
A simplified implementation of our incremental/decremental item-based recommender is shown in the following. The most expensive operation is the matrix-matrix multiplication $C = H^\top H$ of the binary user interaction history matrix $H$ to obtain the cooccurrence matrix $C$. We implement this with a self-join followed by a count aggregation. We would like to highlight that our actual implementation contains a set of performance improvements not included in the listing, e.g., the use of 'arranges' [18] to re-use intermediate results, the application of an interaction-cut [26] to handle 'power users', and thresholding of the similarity matrix.
// Maintain interactions per item n
let num_interactions_per_item = interactions
    .map(|(_user, item)| item)
    .count_total();

// Maintain the cooccurrence matrix C
let cooccurrences = interactions
    .join(&interactions,
        |_user, &item_a, &item_b| (item_a, item_b))
    .count();

// Maintain the similarity matrix S
let jaccard_similarities = cooccurrences
    .map(|record| key_by_item_a(record))
    // Find the number of interactions for item a
    .join(&num_interactions_per_item,
        |record| collect_for_item_a(record))
    .map(|record| key_by_item_b(record))
    // Find the number of interactions for item b
    .join(&num_interactions_per_item,
        |record| collect_for_item_b(record))
    // Compute the Jaccard similarity
    .map(|record| compute_jaccard(record));
Listing 1: Implementation of the proposed item-
based recommender in DD (simplified).
For the k-NN classifier, we form the Cartesian product between the projection matrices and the features, and execute the projections afterwards to obtain the bucket keys. The Naive Bayes implementation only requires us to maintain counts grouped by label, as well as by feature and label.
let features_per_label = examples
    .explode(|example| {
        let label = example.label;
        example.features.into_iter()
            .map(move |(index, value)| ((index, label), value))
    });

let feature_per_label_counts = features_per_label
    .count();

let label_counts = features_per_label
    .map(|(_, label)| label)
    .count();
Listing 2: Implementation of the proposed
multinomial naive bayes classifier in DD (simplified).
5. RELATED WORK
Ensuring that data processing technology adheres to legal
and ethical standards in a fair and transparent manner [29]
is an important research direction, which is starting to gain attention from lawmakers and governments [10]. The machine
learning community has pioneered some work on removing
data from models under the umbrella of “decremental learn-
ing” for support vector machines [5, 16].
In contrast to our work, these approaches need to re-access the training data for updating the model, which introduces operational complexity, as model training (and the corresponding data) and serving are typically handled in different systems and infrastructures [21]. Cao et al. [4] propose
an approach similar to ours to remove adversarial inputs
from ML models, however they again reaccess the training
data for model updates. For that reason, their approach
also works for iterative, gradient-descent based learning al-
gorithms, as they basically just restart gradient descent from
the iteration where the sample to forget was used first. They
adopt a popular summation form of ML algorithms (devel-
oped as a general model to parallelize training on multi-
core architectures), which for example does not capture our
QR factorization-based update methods for linear regression
well. Ginart et al. [12] explore a problem setting similar to
ours for stochastic algorithms, in particular for variants of
k-Means clustering.
An orthogonal technique to protect the privacy of user data
in machine learning use cases is differential privacy [8]. How-
ever, it is very difficult to design differentially private algo-
rithms (even for experts [7]), and this approach requires a
difficult decision on the limit of the acceptable privacy loss
in practice.
The security community studies inference attacks against
ML models, e.g., to infer whether a given record was part
of the training data of a model [28] or to infer statistical
information about the training data [2].
The approach used in this paper bears resemblance to gen-
eral approaches for incremental data processing [9, 19, 27,
14] with the difference that it requires inverse operations
for removing data, which are not covered by many existing
approaches.
6. EVALUATION
Datasets. We run experiments on nine real-world datasets, more precisely on three differently sized datasets for each proposed algorithm, which are detailed in Table 1. (Code to reproduce our experiments is available at https://github.com/schelterlabs/projects-amnesia.) For our recommendation algorithm, we leverage joke ratings (jester), movie ratings (movielens), and DVD ratings (ciaodvd), which are publicly available via the Konect network repository (http://konect.cc/networks/). We use data about mushrooms, phishing websites, and cartographic forest data (covtype) for our classification experiments. Finally, we train our regression models on data about house prices (housing, cadata) and music (YearPredictionMSD). The classification and regression datasets are available via the LibSVM dataset repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
6.1 Benefits of Decremental Updates
Setup. We experiment using our single-threaded implemen-
tations in Rust 1.38 on a machine with an Intel i7-7700HQ
CPU and 16GB of RAM, running Ubuntu Linux 16.04. We
load the input data into memory and train a model on each
dataset. Next, we measure the time to update the model to
“forget” a random user versus the time to retrain the model
from scratch without the data of that particular user. We
repeat every experiment for twenty randomly chosen users
and report the median runtime as well as the 90th and 10th
percentile of the runtime distribution in Figure 2. Note that
the runtime on the y-axis is displayed on a logarithmic scale,
and that the runtimes for retraining are overly optimistic as
we do not include the time to parse and read the training data in the measurements (which would depend on the infrastructure).
Figure 2: Comparison of the median runtimes for removing a random user's data from a trained ML model, shown for (a) collaborative filtering, (b) ridge regression, (c) k-nearest neighbors, and (d) Multinomial Naive Bayes. "Retrain" retrains a model from scratch without this user's data, while "forget" decrementally updates the trained model to forget the user data. The runtime is displayed on a logarithmic scale and the error bars denote the 90th and 10th percentiles of the observed runtime distribution.
dataset #examples #features
mushrooms 8,124 112
phishing 11,055 68
covtype 581,012 54
housing 506 13
cadata 20,640 8
YearPredictionMSD 463,715 90
dataset #interactions #users #items
jester 1,728,847 50,692 140
movielens 1,000,209 6,040 3,706
ciaodvd 1,625,480 21,019 71,633
Table 1: Datasets for evaluation.
Results. The results for our recommendation algorithm ap-
plied to the jester, movielens, and ciaodvd data are shown
in Figure 2(a). The decremental update is three to five
orders of magnitude faster than retraining for the jester
and ciaodvd data, and two orders of magnitude faster for
the movielens dataset, where the update time varies much more strongly. This difference is due to the dependence of the update time on the user history length, which is very low for the jester and ciaodvd data with 20 and 34 interactions on average, and much higher for the movielens data with a mean of 143 interactions.
The results for our ridge regression experiments on the hous-
ing, cadata and YearPredictionMSD datasets are illustrated
in Figure 2(b). We observe that the update is one to two or-
ders of magnitude faster than retraining for the small hous-
ing and cadata datasets, and more than three orders of mag-
nitude faster for the YearPredictionMSD data, which has more than 460,000 observations. This illustrates the fact that the update runtime depends on the number of features only and is independent of the number of examples (as outlined in Section 3.2). Therefore, we observe a larger runtime differ-
ence for datasets with more examples.
We show the results for the approximate k-NN model on the
mushrooms, phishing and covtype datasets in Figure 2(c).
We use 20 hash tables, set k to 10, and apply random projections as hash function to approximate Euclidean distance. The decremental update is more than three orders
of magnitude faster in all cases, which is an expected result
as the update just involves a fixed number of hash table
modifications.
We show the results for the Multinomial Naive Bayes model
on the mushrooms, phishing and covtype datasets in Fig-
ure 2(d). We binarized the input features. The decremental
update is more than three orders of magnitude faster in all
cases, and can often be conducted in less than a microsec-
ond, which is an expected result as the update just involves
a fixed number of updates to a dense array.
In short, the experimental results confirm that our algo-
rithms satisfy the requirement for the decremental update
procedures: the update is several orders of magnitude faster
than retraining in the majority of cases, and can be con-
ducted in sub-second time for the datasets from our evalua-
tion.
6.2 Benefits of Parallelisation
Next, we evaluate the scalability of our implementation in Differential Dataflow. We focus on the item-based recommender, as we found the other approaches to have update times of less than a millisecond in our previous experiments (which obviates the need for parallelisation). We train a recom-
mender model for the jester, movielens, and ciaodvd datasets,
and measure the time to remove 100 randomly chosen user
histories. We repeat each experiment 20 times for an in-
creasing number of workers (threads), and plot the resulting
runtimes in Figure 3. Note that we fix the random seeds for
comparability as the runtime is extremely sensitive to the
length of the (randomly) chosen user histories.
Figure 3: Scalability of our item-based recom-
mender implementation in differential dataflow for
a decremental update of 100 random users.
We observe that the parallelisation effectively reduces the runtime for the movielens and ciaodvd datasets. The updates for the
jester dataset can be conducted in less than 25 milliseconds
in all setups, due to the extremely low number of items in
the dataset.
7. DISCUSSION
We discuss operational implications of our approach, and describe limitations in its generality as a basis for further research.
7.1 Operational Implications
A major operational implication of our proposed approach is that our $t_{forget}$ procedure does not require access to the original training data $D$, which simplifies the integration of our approach into complex real-world ML deployments.
In such deployments, model training typically requires an
offline training process on a cluster or a powerful machine
in separate infrastructure with access to the training data
(which typically must not be accessible from serving systems
for security reasons). After the completion of the training,
the updated model has to be re-deployed to a model serv-
ing system [21, 1] via complex deployment pipelines. As
our approach does not require access to the training data,
we enable a setup with much lower operational complexity
because the model can be updated in-place in the serving
systems.
This operational perspective and the goal of avoiding re-access to the training data are novel and differentiating factors of our approach, in contrast to existing work, which falls back to re-accessing training data and retraining models in most cases [5, 16, 4].
7.2 Generality of our Approach
Our approach is based on the idea of recomputing the parts
of the model that have been affected by the user data which
we aim to remove. This is efficient if (i) only a small part of
the model is affected, and (ii) the recomputation is asymp-
totically cheaper than full retraining. These conditions are
the basis for the experimental results we observed in the pre-
vious section. A general property that a model must have for our approach to work is so-called sparse computational dependencies [9, 27]. A formal way to reason about these is via a bipartite dependency graph $G(V_i, V_m, E)$ [3], where the vertex set $V_i$ denotes partitions of our input data, the vertex set $V_m$ denotes partitions of the final model, and an edge $(v_i, v_m) \in E$ between a vertex $v_i$ and a vertex $v_m$ denotes that the input $v_i$ is necessary to compute the model part $v_m$. This graph allows us to determine which parts of the model need to be recomputed if we remove a particular input $v_u$. If the graph is sparse (e.g., the vertices in $V_i$ have a low degree), only a small part of the model must be recomputed. If the graph were fully connected, the removal of a single input would trigger a complete retraining of the model.
[Figure 4: Illustration of the sparse computational dependencies in item-based collaborative filtering, depicting a sparse user row of the interaction matrix H and the affected entries of the similarity matrix S. Each (sparse) row of the input H affects only a fraction of the model S, indicated by the non-zero entries in the row.]
The algorithms discussed in this paper exhibit such sparse computational dependencies: For approximate k-NN, the vertex set $V_i$ consists of the user observations (e.g., the rows of the input matrix $X$), and the vertex set $V_m$ consists of the buckets of the hash tables $T_1, \ldots, T_H$. The resulting dependency graph is sparse, because we hash each input $X_u$ to $H$ buckets only, which means the degree of each $v \in V_i$ is exactly $H$, and upon removal of an input, only $H$ buckets must be updated.
The dependency graph between the inputs (the user histories $H_u$) and the cooccurrences (entries of the cooccurrence matrix $C$) for item-based collaborative filtering is sparse as well, as shown in Figure 4. Each similarity $S_{ij}$ depends on all user histories in which the item pair $(i, j)$ occurs. That means that a single user history contributes to a number of cooccurrences that is quadratic in its length (all possible item pairs). The resulting dependency graph is nevertheless sparse, because the majority of users only interact with a very small subset of the overall items.
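As a minimal illustration of this reasoning (hypothetical code, not from the paper), the bipartite dependency graph can be represented as an adjacency map from input partitions to model partitions; the model parts that have to be recomputed after forgetting an input are exactly its neighbors, which stays cheap whenever the input vertices have low degree.

use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the bipartite dependency graph G(V_i, V_m, E):
// edges map each input partition to the model partitions it contributes to.
struct DependencyGraph {
    edges: HashMap<usize, HashSet<usize>>, // input partition id -> model partition ids
}

impl DependencyGraph {
    // The model partitions that must be recomputed when an input is forgotten
    // are exactly the neighbors of that input vertex.
    fn affected_model_parts(&self, removed_input: usize) -> HashSet<usize> {
        self.edges.get(&removed_input).cloned().unwrap_or_default()
    }
}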
The dependency graph model also explains why our ap-
proach is not applicable to models learnt with gradient de-
scent, as this learning algorithm exhibits dense computa-
tional dependencies. In general, we learn a supervised model
$w$ on labeled examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ as follows using gradient descent:

$$w^{(t+1)} = w^{(t)} - \lambda \sum_{(x,y)} \nabla_{w^{(t)}}\, \ell(x, y, w^{(t)})$$

We obtain $w^{(t+1)}$ by updating $w^{(t)}$ (according to the learning rate $\lambda$) in the negative direction of the gradient of the model's loss function, which we compute as the sum of the gradients over the losses $\ell$ of the individual examples $(x, y)$. That means that each intermediate model $w^{(t)}$ is dependent on all inputs with a non-zero gradient and recursively depends on all previous model versions $w^{(t-1)}, \ldots, w^{(0)}$, giving rise to a dense computational dependency graph, as sketched in Figure 5.
[Figure 5: Illustration of the dense computational dependencies in stochastic gradient descent-based learning. Each model iterate $w^{(t)}$ depends on the whole input and the previous iterate $w^{(t-1)}$.]
This illustrates that our approach works best for non-iterative learning procedures with sparse computational dependencies. While such methods are conceptually simpler than currently popular approaches (e.g., deep neural networks), they are nevertheless widely used in industry (e.g., in the Amazon SageMaker platform, https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html, which offers k-nearest neighbors, linear models, and k-Means). Furthermore, it has recently been shown [6] that classical item-based collaborative filtering outperforms recurrent neural networks for a wide variety of recommendation datasets.
7.3 Future Directions
The next directions for this line of research are two-fold: we
will explore decremental updates for more ML algorithms,
and extend the decremental updates to the whole ML pipeline
(e.g., provide decremental/incremental data preprocessing
and feature encoding operations). We additionally intend
to explore changes to the training procedures of gradient-
descent-based models to also enable decremental updates
for them.
We would furthermore like to stress that challenges with
respect to reliably deleting user data are not constrained to
ML models and pipelines; they extend to the entire data
management lifecycle, which often includes data replication
when run in cloud environments.
Acknowledgements. We would like to thank Frank Mc-
Sherry and Ted Dunning for helpful comments on improving
implementations and algorithms in this paper.
8. REFERENCES
[1] P. Andrews, A. Kalro, H. Mehanna, and A. Sidorov.
Productionizing machine learning pipelines at scale. ML
Systems workshop at ICML, 2016.
[2] G. Ateniese, G. Felici, L. V. Mancini, A. Spognardi,
A. Villani, and D. Vitali. Hacking smart machines with
smarter ones: How to extract meaningful data from
machine learning classifiers. arXiv preprint
arXiv:1306.4447, 2013.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed
computation: numerical methods, volume 23. Prentice hall
Englewood Cliffs, NJ, 1989.
[4] Y. Cao and J. Yang. Towards making systems forget with
machine unlearning. IEEE Symposium on Security and
Privacy, pages 463–480, 2015.
[5] G. Cauwenberghs and T. Poggio. Incremental and
decremental support vector machine learning. NeurIPS,
pages 409–415, 2001.
[6] M. F. Dacrema, P. Cremonesi, and D. Jannach. Are we
really making much progress? a worrying analysis of recent
neural recommendation approaches. RecSys, pages 101–109,
2019.
[7] Z. Ding, Y. Wang, G. Wang, D. Zhang, and D. Kifer.
Detecting violations of differential privacy. CCS, pages
475–489, 2018.
[8] C. Dwork, F. McSherry, K. Nissim, and A. Smith.
Calibrating noise to sensitivity in private data analysis.
Theory of Cryptography, pages 265–284, 2006.
[9] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl.
Spinning fast iterative data flows. PVLDB,
5(11):1268–1279, 2012.
[10] Executive Office of the President. White House Report on
Big Data: Seizing Opportunities, Preserving Values, 2014.
[11] GDPR.eu. Recital 65: Right of rectification and erasure.
https://gdpr.eu/recital-65-right-of-rectification-and-erasure
[12] A. Ginart, M. Y. Guan, G. Valiant, and J. Zou. Making AI
forget you: Data deletion in machine learning. CoRR,
abs/1907.05012, 2019.
[13] A. Gionis, P. Indyk, and R. Motwani. Similarity search in
high dimensions via hashing. VLDB, pages 518–529, 1999.
[14] A. Gupta, I. S. Mumick, et al. Maintenance of materialized
views: Problems, techniques, and applications. IEEE Data
Eng. Bull., 18(2):3–18, 1995.
[15] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased
estimation for nonorthogonal problems. Technometrics,
12(1):55–67, 1970.
[16] M. Karasuyama and I. Takeuchi. Multiple incremental
decremental learning of support vector machines. NeurIPS,
pages 907–915, 2009.
[17] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker,
L. R. Gordon, and J. Riedl. Grouplens: applying
collaborative filtering to usenet news. Communications of
the ACM, 40(3):77–87, 1997.
[18] F. McSherry, A. Lattuada, and M. Schwarzkopf. K-pg:
Shared state in differential dataflows, 2018.
[19] F. McSherry, D. G. Murray, R. Isaacs, and M. Isard.
Differential dataflow. CIDR, 2013.
[20] D. G. Murray, F. McSherry, R. Isaacs, M. Isard,
P. Barham, and M. Abadi. Naiad: a timely dataflow
system. SOSP, pages 439–455, 2013.
[21] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao,
F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke.
Tensorflow-serving: Flexible, high-performance ml serving.
ML Systems workshop at NeurIPS, 2017.
[22] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger.
Tackling the poor assumptions of naive bayes text
classifiers. ICML, pages 616–623, 2003.
[23] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, et al.
Item-based collaborative filtering recommendation
algorithms. WWW, 1:285–295, 2001.
[24] S. Schelter, F. Biessmann, T. Januschowski, D. Salinas,
S. Seufert, and G. Szarvas. On challenges in machine
learning model management. Data Engineering, 2018.
[25] S. Schelter, C. Boden, and V. Markl. Scalable
similarity-based neighborhood methods with mapreduce.
ACM RecSys, pages 163–170, 2012.
[26] S. Schelter, U. Celebi, and T. Dunning. Efficient
incremental cooccurrence analysis for item-based
collaborative filtering. SSDBM, 2019.
[27] S. Schelter, S. Ewen, K. Tzoumas, and V. Markl. All roads
lead to rome: optimistic recovery for distributed iterative
data processing. CIKM, pages 1919–1928, 2013.
[28] R. Shokri, M. Stronati, C. Song, and V. Shmatikov.
Membership inference attacks against machine learning
models. pages 3–18, 2017.
[29] J. Stoyanovich, S. Abiteboul, and G. Miklau. Data,
responsibly: Fairness, neutrality and transparency in data
analysis. EDBT, 2016.
[30] C. F. Van Loan and G. H. Golub. Matrix computations.
Johns Hopkins University Press, 1983.