We'll use notations similar to [[project:tensor_dp:old_draft|batch learning draft]]. ==== Problem Formulation ==== Let $w = (w_1,...,w_n)$ denote a sentence. We represent each word $w_i$ and it's context as a feature vector $\phi(w,i)$ of dimension $d$ (this is for you to specify). This feature vector should generalize to unseen words. Each annotated sentence w has an associated set of arcs $(i\rightarrow j)$, and we denote this set $T(w)$. Let $W$ be the annotated corpus. A function that maps an arc to a value is given by $$ f(i, j,w) = \phi(w,i)^T A \phi(w,j) = \text{tr}(A \phi(w,j) \phi(w,i)^T) + \eta^T d(i,j) $$ where $A\in \mathbb{R}^{d\times d}$, $d(i,j)$ are features depending on the distance between word $i$ and word $j$. Let $\mathcal{Y}(w)$ be the set of feasible parse trees for sentence $w$. $Y(w)\in \mathcal{Y}(w)$ is a collection of arcs that forms a parse tree which associates a score $$ s(Y(w)) = \sum_{(i,j) \in Y(w)} f(i,j, w) = \text{tr}(A \sum_{(i,j) \in Y(w)} \phi(w,j) \phi(w,i)^T) + \eta^T \sum_{(i,j) \in Y(w)}d(i,j) $$ Let $\tilde{Y}(w) = \text{argmax}_{Y(w) \in \mathcal{Y}(w)} s(Y(w))$. Assume we are given $m$ sentences and their Dependence parsing $(\hat{w}^i, \hat{Y}^i(w))$, we try to solve following optimization problem, \begin{eqnarray*} &\min_{A,\eta} & f(A,\eta) = \frac{1}{m} \sum_{i=1}^m \ell(A,\eta;(\hat{w}^i,\hat{Y}^i)) \\ &s.t. & \|A\|_* + \lambda_\eta\|\eta\|_2 \leq C \end{eqnarray*} where $\ell(A,\eta;(\hat{w},\hat{Y}))$ is a convex and differentiable((discuss if non-differentiable is doable)) loss function. We are using a ``hard'' regularization constraint instead of regularization terms in the objective, and later we'll apply online SDP algorithm [[project:tensor_dp:draft#references|(Jaggi & Sulovsky, 2010)]] [[project:tensor_dp:draft#references|(Hazan, 2008)]]. \\ ==== Trace Norm Regularization to SDP ==== The regularization constraint of $A, \eta$ can be re-written using block matrix $W=[\ [\ A, 0\ ], [\ 0, \eta\ ]\ ]$, $$ \|W\|_* = \left\|\left( \begin{array}{cc} A & \mathbf{0} \\ \mathbf{0} & \lambda_\eta \eta \end{array} \right) \right\|_* \leq C $$ [[project:tensor_dp:draft#references|(Jaggi & Sulovsky, 2010)]] showed the above constraint is equivalent to the constraint using positive semi-definite matrix, $$ Z = \frac{1}{2C} \left( \begin{array}{cc} P & W \\ W^T & Q \end{array} \right) \succeq 0 \quad\text{and}\quad \text{Tr}(Z)=1 $$ and therefore the original optimization problem can be cast as a semi-definite programming (SDP) problem of the following form, $$ \min_{Z\in \mathcal{S}, Z\succeq 0, \text{Tr}(Z)=1} \hat{f}(Z) $$ where $\hat{f}(Z)=f(W)$ is the objection function. \\ === Hazan's algorithm === For the semi-definite optimization problem shown above, [[project:tensor_dp:draft#references|(Hanza, 2008)]] gives an gradient descent type algorithm with $O(\frac{1}{\epsilon})$ convergence rate. Another advantage of this algorithm is that the parameter matrix is maintained using a list of rank-1 component, and the rank of the matrix is at most $k$ after $k$ iterations. \begin{enumeration} \item Initialize $Z^{(1)} := v_0 v_0^T$ for arbitrary unit vector $v$. \item For $k = 1$ to $T$ do \item $\quad\quad$ Compute $v_k := \text{ApproxEV}\left(-\nabla f(Z^{(k)}),\frac{C_f}{k^2}\right)$. \item $\quad\quad$ Set $\alpha_k := \frac{2}{k}$. \item $\quad\quad$ Set $Z^{(k+1)} := Z^{(k)} + \alpha_k\left(v_k v_k^T - Z^{(k)} \right)$. \item End for \end{enumeration} where $\text{ApproxEx}(M,\epsilon)$ is a sub-routine used to calculate the $\epsilon$-approximate largest singular value of matrix $M$ and $C_f$ is a constant called //**modulus of convexity**// (or //**curvature constant**//) of function $f$, $$ C_f := \sup_{Z,V\in \mathcal{S}, \alpha\in \mathcal{R}, \atop Z'=Z+\alpha(V-Z)} \frac{1}{\alpha^2} \left( f(Z') - f(Z) + \langle Z'-Z,\nabla f(z) \rangle \right), $$ and $\alpha\in [0,1]$ in this application and algorithm. \\ ==== References ==== * Martin Jaggi and Marek Sulovsky, **A Simple Algorithm for Nuclear Norm Regularized Problems**. ICML 2010 [[http://www.m8j.net/data/List/Files-149/fastRegNuclearNormOptimization.pdf|PDF]] * Elad Hazan, **Sparse Approximate Solutions to Semidefinite Programs**. LATIN'08 [[http://ie.technion.ac.il/~ehazan/papers/SparseSDP.pdf|PDF]] * Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, **Pegasos: Primal Estimated sub-GrAdient SOlver for SVM**. ICML 2007 [[http://eprints.pascal-network.org/archive/00004062/01/ShalevSiSr07.pdf|PDF]]