DokuWiki

Problem Formulation

Let $w = (w_1,...,w_n)$ denote a sentence. We represent each word $w_i$ and it's context as a feature vector $\phi(w,i)$ of dimension $d$ (this is for you to specify). This feature vector should generalize to unseen words.

Each annotated sentence w has an associated set of arcs $(i\rightarrow j)$ , and we denote this set $T(w)$ . Let $W$ be the annotated corpus. A function that maps an arc to a value is given by

$f(i, j,w) = \phi(w,i)^T A \phi(w,j) = \text{tr}(A \phi(w,j) \phi(w,i)^T) + \eta^T d(i,j)$

where $A\in \mathbb{R}^{d\times d}$ , $d(i,j)$ are features depending on the distance between word $i$ and word $j$ . Let $\mathcal{Y}(w)$ be the set of feasible parse trees for sentence $w$ . $Y(w)\in \mathcal{Y}(w)$ is a collection of arcs that forms a parse tree which associates a score

$s(Y(w)) = \sum_{(i,j) \in Y(w)} f(i,j, w) = \text{tr}(A \sum_{(i,j) \in Y(w)} \phi(w,j) \phi(w,i)^T) + \eta^T \sum_{(i,j) \in Y(w)}d(i,j)$

Let $\tilde{Y}(w) = \text{argmax}_{Y(w) \in \mathcal{Y}(w)} s(Y(w))$ . Assume we are given $m$ sentences and their Dependence parsing $(\hat{w}^i, \hat{Y}^i(w))$ , we try to solve following optimization problem,

$\begin{eqnarray*} &\min_{A,\eta} & f(A,\eta) = \frac{1}{m} \sum_{i=1}^m \ell(A,\eta;(\hat{w}^i,\hat{Y}^i)) \\ &s.t. & \|A\|_* + \lambda_\eta\|\eta\|_2 \leq C \end{eqnarray*}$

where $\ell(A,\eta;(\hat{w},\hat{Y}))$ is a convex and differentiable¹⁾ loss function. We are using a ``hard'' regularization constraint instead of regularization terms in the objective, and later we'll apply online SDP algorithm (Jaggi & Sulovsky, 2010) (Hazan, 2008).

Trace Norm Regularization to SDP

The regularization constraint of $A, \eta$ can be re-written using block matrix $W=[\ [\ A, 0\ ], [\ 0, \eta\ ]\ ]$ ,

$\|W\|_* = \left\|\left( \begin{array}{cc} A & \mathbf{0} \\ \mathbf{0} & \lambda_\eta \eta \end{array} \right) \right\|_* \leq C$

(Jaggi & Sulovsky, 2010) showed the above constraint is equivalent to the constraint using positive semi-definite matrix,

$Z = \frac{1}{2C} \left( \begin{array}{cc} P & W \\ W^T & Q \end{array} \right) \succeq 0 \quad\text{and}\quad \text{Tr}(Z)=1$

and therefore the original optimization problem can be cast as a semi-definite programming (SDP) problem of the following form,

$\min_{Z\in \mathcal{S}, Z\succeq 0, \text{Tr}(Z)=1} \hat{f}(Z)$

where $\hat{f}(Z)=f(W)$ is the objection function.

Hazan's algorithm

For the semi-definite optimization problem shown above, (Hanza, 2008) gives an gradient descent type algorithm with $O(\frac{1}{\epsilon})$ convergence rate. Another advantage of this algorithm is that the parameter matrix is maintained using a list of rank-1 component, and the rank of the matrix is at most $k$ after $k$ iterations.

$\begin{enumeration} \item Initialize $Z^{(1)} := v_0 v_0^T$ for arbitrary unit vector $v$. \item For $k = 1$ to $T$ do \item $\quad\quad$ Compute $v_k := \text{ApproxEV}\left(-\nabla f(Z^{(k)}),\frac{C_f}{k^2}\right)$. \item $\quad\quad$ Set $\alpha_k := \frac{2}{k}$. \item $\quad\quad$ Set $Z^{(k+1)} := Z^{(k)} + \alpha_k\left(v_k v_k^T - Z^{(k)} \right)$. \item End for \end{enumeration}$

where $\text{ApproxEx}(M,\epsilon)$ is a sub-routine used to calculate the $\epsilon$ -approximate largest singular value of matrix $M$ and $C_f$ is a constant called modulus of convexity (or curvature constant) of function $f$ ,

$C_f := \sup_{Z,V\in \mathcal{S}, \alpha\in \mathcal{R}, \atop Z'=Z+\alpha(V-Z)} \frac{1}{\alpha^2} \left( f(Z') - f(Z) + \langle Z'-Z,\nabla f(z) \rangle \right),$

and $\alpha\in [0,1]$ in this application and algorithm.

References

Martin Jaggi and Marek Sulovsky, A Simple Algorithm for Nuclear Norm Regularized Problems. ICML 2010 PDF
Elad Hazan, Sparse Approximate Solutions to Semidefinite Programs. LATIN'08 PDF
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007 PDF

¹⁾ discuss if non-differentiable is doable

DokuWiki

User Tools

Site Tools

Table of Contents

Problem Formulation

Trace Norm Regularization to SDP

Hazan's algorithm

References

Page Tools