
Transfer Learning
[Figure: hierarchical Bayes graphical model]
Transfer learning involves two interrelated learning problems with the goal of using knowledge about one set of tasks to improve performance on a related task. Standard machine learning algorithms rely on inductive bias provided by a person, whereas "transfer-aware" algorithms incorporate inductive bias learned from one or more auxiliary tasks. The graphical model above illustrates how transfer can be accomplished in a Bayesian framework by linking two data sources with a common hyper-distribution.
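To make the linking concrete, here is a minimal generative sketch, assuming Bernoulli task parameters tied through a common Beta hyper-distribution; the parameterization, values, and variable names are illustrative, not taken from the project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-level model sketch: tasks A (auxiliary) and B (target) each draw
# a Bernoulli parameter from one shared hyper-distribution, here a Beta
# with mean m and "weight" s. Values below are illustrative only.
m, s = 0.6, 50.0
theta_A = rng.beta(s * m, s * (1.0 - m))     # auxiliary task's parameter
theta_B = rng.beta(s * m, s * (1.0 - m))     # target task's parameter, tied through (m, s)
data_A = rng.binomial(1, theta_A, size=200)  # plentiful auxiliary data
data_B = rng.binomial(1, theta_B, size=20)   # scarce target data
```

Because both parameters are drawn from the same hyper-distribution, inference about (m, s) from the auxiliary data constrains beliefs about the target task, which is the transfer mechanism.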
For this project, we developed a "transfer-aware" version of the naive Bayes classification algorithm, implemented using an extension of the slice sampling method of Radford Neal (2003). We tested the algorithm on a meeting acceptance task, where the goal was to predict whether a person would accept or reject a meeting invitation given previously gathered information about the person's schedule and relationships. Twenty-one individuals participated and supplied a total of 3966 labeled examples. Each example was represented using 15 features that captured relational information about the inviter, proposed meeting, etc.
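The project's specific extension is not reproduced here, but as a reference point, a minimal univariate slice sampler in the spirit of Neal's (2003) stepping-out and shrinkage procedure might look like the following sketch; the function name, step width w, and step limit are assumptions:

```python
import numpy as np

def slice_sample(log_f, x0, w=1.0, max_steps=50, rng=None):
    """One slice-sampling update of a scalar variable with (unnormalized)
    log density log_f, using stepping out and shrinkage (Neal, 2003)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Draw the slice level: u ~ Uniform(0, f(x0)), kept on the log scale.
    log_y = log_f(x0) + np.log(rng.uniform())
    # Step out to find an interval [l, r] that brackets the slice.
    l = x0 - w * rng.uniform()
    r = l + w
    for _ in range(max_steps):
        if log_f(l) <= log_y:
            break
        l -= w
    for _ in range(max_steps):
        if log_f(r) <= log_y:
            break
        r += w
    # Shrink the interval until a draw lands inside the slice.
    while True:
        x1 = rng.uniform(l, r)
        if log_f(x1) > log_y:
            return x1
        if x1 < x0:
            l = x1
        else:
            r = x1
```

Applied repeatedly, this update yields a Markov chain whose stationary distribution is proportional to exp(log_f); in a hierarchical model it would be applied coordinate-wise to the hyperparameters.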

The learning curves below illustrate that, not surprisingly, the benefits of transfer learning depend on the similarity of the two data sources. Filled circles denote statistically significant differences (p < 0.05) from the corresponding "B-only" baseline value. The inset graph shows that, overall, many pairs of individuals yield statistically significant positive transfer (blue bars) compared with the transfer-unaware, B-only algorithm. However, about as many pairs yield negative transfer (red bars) at the smallest "B" training-set sizes. Not explicitly shown is the nearly identical performance of the two algorithms at larger training-set sizes.

In summary, hierarchical Bayesian methods are well suited to transfer learning because they can avoid negative transfer by detecting differences between data sources. For some tasks, however, the remaining challenge is to detect such differences, and thus the risk of negative transfer, when very little data from the target source is available.

[Figure: learning curves for the meeting acceptance task, with inset bar graph of transfer results across pairs of individuals]


Related Publications
To transfer or not to transfer
M.T. Rosenstein, Z. Marx, L.P. Kaelbling, and T.G. Dietterich. To appear, NIPS 2005 Workshop on Inductive Transfer: 10 Years Later, 2005. [pdf]
Transfer learning with an ensemble of background tasks
Z. Marx, M.T. Rosenstein, L.P. Kaelbling, and T.G. Dietterich. To appear, NIPS 2005 Workshop on Inductive Transfer: 10 Years Later, 2005. [pdf]


Movies
hyperposterior.gif (245 KB)
Animated GIF that illustrates how the arrival of training data changes the posterior distribution of the hyperparameters. With no data, the hyperprior is uniformly distributed along the horizontal axis but somewhat concentrated about a mean of 50 along the vertical axis. (See the figure below.) This is equivalent to setting a hyperprior that encourages transfer but is otherwise noninformative about the mean value of the lower-level parameter. (In this simple example, the lower-level parameter is the probability of observing heads when tossing a biased coin.) Roughly speaking, the value along the vertical axis acts as a weight that determines how to combine the lower-level data with the higher-level mean value. As auxiliary data arrive, the distribution becomes concentrated about the mean value for the auxiliary data. Then, when target data arrive, the distribution changes dramatically for the case where the two data sources are very different. Note that each frame of the animation shows one target data source (with a mean value depicted by the "B" vertical line) and two different auxiliary data sources (with means depicted by the "A" vertical lines).
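To make this update concrete, here is a hedged sketch of computing the hyperposterior on a grid for the biased-coin example, assuming a Beta(s·m, s·(1−m)) hyper-distribution with mean m (the horizontal axis) and weight s (the vertical axis); the head/tail counts are hypothetical, and a flat grid prior stands in for the animation's actual hyperprior:

```python
import numpy as np
from scipy.special import betaln

def log_hyperposterior(m, s, counts):
    """Unnormalized log posterior over hyperparameters (m, s) of a
    Beta(s*m, s*(1-m)) prior shared across coins, with each coin's
    bias integrated out analytically (Beta-Binomial marginal).
    counts: list of (heads, tails) pairs, one per data source."""
    a, b = s * m, s * (1.0 - m)
    lp = 0.0  # flat hyperprior over the grid (a simplifying assumption)
    for heads, tails in counts:
        lp += betaln(a + heads, b + tails) - betaln(a, b)
    return lp

# Grid over hyperparameters: mean m (horizontal axis of the animation)
# and weight s (vertical axis, where the animation's prior centers near 50).
ms = np.linspace(0.01, 0.99, 99)
ss = np.linspace(1.0, 100.0, 100)
counts = [(27, 3), (9, 21)]  # hypothetical heads/tails for sources A and B
logp = np.array([[log_hyperposterior(m, s, counts) for m in ms] for s in ss])
post = np.exp(logp - logp.max())
post /= post.sum()  # normalized posterior over the (s, m) grid
```

Because each coin's bias is integrated out analytically, only the Beta-Binomial marginal likelihood is needed; when the two sources have very different means, as in this sketch, posterior mass shifts toward small s, effectively decoupling the sources, which matches the dramatic change described above.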

[Figure: hyperprior distribution]


updated 12-Dec-2005