Formulation

Let $x$ denote a sentence and $\mathcal{Y}(x)$ the set of possible dependency trees for $x$. In first-order dependency parsing, the score of a tree $y \in \mathcal{Y}(x)$ is the sum of its arc scores, $S(x,y) = \sum_{(h,m) \in y} s(h\rightarrow m)$, where $(h,m)$ is a (head, modifier) arc in the tree.
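As a minimal illustration (the representation is an assumption: arcs given as (head, modifier) index pairs, with some arc scorer available), the tree score is just a sum over arcs:

<code python>
# Sketch: first-order tree score S(x, y) as a sum of arc scores.
# `tree` is a list of (h, m) pairs; `arc_score(h, m)` computes s(h -> m).
def score_tree(tree, arc_score):
    return sum(arc_score(h, m) for (h, m) in tree)
</code>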

We define the arc score $s(h\rightarrow m)$ as follows:

$$
s(h\rightarrow m) = \gamma \left< \theta,\  \phi_{h\rightarrow m} \right> + (1-\gamma) \left< A,\  \phi_h\otimes \phi_m \otimes dis(h-m) \right>
$$

where $\phi_{h\rightarrow m}$ is the feature vector defined on arc $(h,m)$, $\phi_h$ and $\phi_m$ are feature vectors defined on words $h$ and $m$ respectively, $dis(h-m)$ is a vector indicating the distance between the head and the modifier, and $\gamma \in [0,1]$ is a hyper-parameter controlling the relative importance of the two terms. The model parameters are the vector $\theta \in \mathbb{R}^L$ and the tensor $A \in \mathbb{R}^{N\times N \times 15}$.

Assuming $A$ has low rank, it is explicitly represented using the Kruskal form $A=\sum_{r=1}^{R} u^r\otimes v^r \otimes w^r$, where $U=[u^r]^T \in \mathbb{R}^{R \times N}$, $V=[v^r]^T \in \mathbb{R}^{R\times N}$ and $W=[w^r]^T \in \mathbb{R}^{R\times 15}$.
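A small sketch of the arc scorer under these definitions (shapes and names below are assumptions, not part of the model specification). The Kruskal form lets us evaluate the tensor term with three matrix-vector products instead of materializing $A$:

<code python>
import numpy as np

# Assumed shapes: theta (L,), U and V (R, N), W (R, 15);
# phi_arc (L,), phi_h and phi_m (N,), d = dis(h - m) (15,).
def arc_score(theta, U, V, W, gamma, phi_arc, phi_h, phi_m, d):
    # Traditional feature term: <theta, phi_{h->m}>
    traditional = theta @ phi_arc
    # Low-rank tensor term: sum_r (u^r . phi_h)(v^r . phi_m)(w^r . d)
    low_rank = np.sum((U @ phi_h) * (V @ phi_m) * (W @ d))
    return gamma * traditional + (1.0 - gamma) * low_rank
</code>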



Learning

We use passive-aggressive online updates to learn the model parameters $(\theta, U, V, W)$ in an “alternating” fashion: at each step we update $\theta$ together with one of $U$, $V$ or $W$, holding the other two fixed. For each training sentence and its annotated parse tree $(\hat{x}, \hat{y})$, we solve one of the following optimization problems:

\begin{eqnarray}
& \min_{\nabla{\theta},\ \nabla{U},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{U}\|^2 +  C \xi \\
\nonumber \\
\text{or,} & \min_{\nabla{\theta},\ \nabla{V},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{V}\|^2 +  C \xi \\
\nonumber \\
\text{or,} & \min_{\nabla{\theta},\ \nabla{W},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{W}\|^2 +  C \xi \\
\nonumber \\
& \text{s.t.} \quad S(\hat{x}, \hat{y}) \geq S(\hat{x}, \tilde{y}) + \|\hat{y}-\tilde{y}\|_1 - \xi \nonumber
\end{eqnarray}

where $\tilde{y}$ is the cost-augmented best tree under the current parameters, i.e. $\tilde{y} = \arg\max_{y \in \mathcal{Y}(\hat{x})} \left[ S(\hat{x}, y) + \|\hat{y}-y\|_1 \right]$. The above optimization problem (1) has the closed-form solution:

\begin{eqnarray*}
\nabla{\theta} &=& \min\left\lbrace C, \frac{\text{loss}}{\gamma^2 \|d\theta\|^2 + (1-\gamma)^2 \|du\|^2} \right\rbrace  \gamma d\theta \\
\nabla{U} &=& \min\left\lbrace C, \frac{\text{loss}}{\gamma^2 \|d\theta\|^2 + (1-\gamma)^2 \|du\|^2} \right\rbrace (1-\gamma) du
\end{eqnarray*}

where

\begin{eqnarray*}
\text{loss} &=& S(\hat{x},\tilde{y}) + \|\hat{y}-\tilde{y}\|_1 - S(\hat{x},\hat{y}) \\
d\theta &=& \sum_{h\rightarrow m\ \in\ \hat{y}} \phi_{h\rightarrow m} - \sum_{h\rightarrow m\ \in\ \tilde{y}} \phi_{h\rightarrow m} \\
du &=& \sum_{h\rightarrow m\ \in\ \hat{y}} \left[ (V\phi_m) * (W dis(h-m)) \right] \otimes \phi_h - \sum_{h\rightarrow m\ \in\ \tilde{y}} \left[ (V\phi_m) * (W dis(h-m)) \right] \otimes \phi_h
\end{eqnarray*}


and $*$ denotes the element-wise product, i.e. $(x * y)_i = x_i y_i$.
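The update can be sketched directly from these quantities. This is a non-authoritative sketch; the variable names, the `phi` lookup and the `dis` helper are assumptions:

<code python>
import numpy as np

# Gradient-like quantity for U: sum over arcs (h, m) of
# [(V phi_m) * (W dis(h-m))] outer phi_h, which has the shape of U (R x N).
def d_u(arcs, V, W, phi, dis):
    return sum(np.outer((V @ phi[m]) * (W @ dis(h - m)), phi[h])
               for (h, m) in arcs)

# One passive-aggressive step for problem (1);
# du = d_u(gold_arcs, ...) - d_u(pred_arcs, ...), dtheta and loss as above.
def pa_step(theta, U, dtheta, du, loss, gamma, C):
    denom = gamma**2 * np.sum(dtheta**2) + (1 - gamma)**2 * np.sum(du**2)
    if loss <= 0 or denom == 0:
        return theta, U                 # passive: constraint already satisfied
    tau = min(C, loss / denom)          # aggressive step size, capped at C
    return theta + tau * gamma * dtheta, U + tau * (1 - gamma) * du
</code>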

Parameter Averaging

Similar to the MST parser and the Turbo parser, we average parameters when evaluating/testing the model. We average $U$, $V$ and $W$ independently. Although the effect of averaging $U$ and $V$ on the term $UV^T$ is not well understood, we find that it works well empirically.
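A minimal bookkeeping sketch of independent averaging (the accumulation scheme below is an assumption; only the fact that the parameters are averaged separately comes from the text):

<code python>
import numpy as np

# Accumulate parameters after every online update; divide at test time.
class AveragedParams:
    def __init__(self, params):                        # dict: name -> array
        self.sums = {k: np.zeros_like(v) for k, v in params.items()}
        self.count = 0

    def accumulate(self, params):
        for k, v in params.items():
            self.sums[k] += v
        self.count += 1

    def average(self):                                 # e.g. averaged U, V, W
        return {k: s / self.count for k, s in self.sums.items()}
</code>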


