Formulation

Let $x$ denote a sentence and $\mathcal{Y}(x)$ the set of possible dependency trees for $x$. In first-order dependency parsing, the score of a tree $y \in \mathcal{Y}(x)$ is the sum of its arc scores, $S(x,y) = \sum_{(h,m) \in y} s(h\rightarrow m)$, where $(h,m)$ is a (head, modifier) arc in the tree.
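As a minimal illustration (the representation is an assumption: arcs given as (head, modifier) index pairs, with some arc scorer available), the tree score is just a sum over arcs:

<code python>
# Sketch: first-order tree score S(x, y) as a sum of arc scores.
# `tree` is a list of (h, m) pairs; `arc_score(h, m)` computes s(h -> m).
def score_tree(tree, arc_score):
    return sum(arc_score(h, m) for (h, m) in tree)
</code>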

We define the arc score $s(h\rightarrow m)$ as follows:

$$
s(h\rightarrow m) = \gamma \left< \theta,\  \phi_{h\rightarrow m} \right> + (1-\gamma) \left< A,\  \phi_h\otimes \phi_m \otimes dis(h-m) \right>
$$

where $\phi_{h\rightarrow m}$ is the feature vector defined on arc $(h,m)$, $\phi_h$ and $\phi_m$ are feature vectors defined on words $h$ and $m$ respectively, $dis(h-m)$ is a vector indicating the distance between the head and the modifier, and $\gamma \in [0,1]$ is a hyper-parameter controlling the relative importance of the two terms. The model parameters are the vector $\theta \in \mathbb{R}^L$ and the tensor $A \in \mathbb{R}^{N\times N \times 15}$.

Assuming $A$ has low rank, it is explicitly represented using the Kruskal form $A=\sum_{r=1}^{R} u^r\otimes v^r \otimes w^r$, where $U=[u^r]^T \in \mathbb{R}^{R \times N}$, $V=[v^r]^T \in \mathbb{R}^{R\times N}$ and $W=[w^r]^T \in \mathbb{R}^{R\times 15}$.
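A small sketch of the arc scorer under these definitions (shapes and names below are assumptions, not part of the model specification). The Kruskal form lets us evaluate the tensor term with three matrix-vector products instead of materializing $A$:

<code python>
import numpy as np

# Assumed shapes: theta (L,), U and V (R, N), W (R, 15);
# phi_arc (L,), phi_h and phi_m (N,), d = dis(h - m) (15,).
def arc_score(theta, U, V, W, gamma, phi_arc, phi_h, phi_m, d):
    # Traditional feature term: <theta, phi_{h->m}>
    traditional = theta @ phi_arc
    # Low-rank tensor term: sum_r (u^r . phi_h)(v^r . phi_m)(w^r . d)
    low_rank = np.sum((U @ phi_h) * (V @ phi_m) * (W @ d))
    return gamma * traditional + (1.0 - gamma) * low_rank
</code>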



Learning

We use passive-aggressive online updates to learn the model parameters $(\theta, U, V, W)$ in an “alternating” fashion: at each step we update $\theta$ together with one of $U$, $V$ or $W$, holding the other two fixed. For each training sentence and its annotated parse tree $(\hat{x}, \hat{y})$, we solve one of the following optimization problems:

\begin{eqnarray}
& \min_{\nabla{\theta},\ \nabla{U},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{U}\|^2 +  C \xi \\
\nonumber \\
\text{or,} & \min_{\nabla{\theta},\ \nabla{V},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{V}\|^2 +  C \xi \\
\nonumber \\
\text{or,} & \min_{\nabla{\theta},\ \nabla{W},\ \xi\geq 0} \quad \frac{1}{2} \|\nabla{\theta}\|^2 + \frac{1}{2} \|\nabla{W}\|^2 +  C \xi \\
\nonumber \\
& \text{s.t.} \quad S(\hat{x}, \hat{y}) \geq S(\hat{x}, \tilde{y}) + \|\hat{y}-\tilde{y}\|_1 - \xi \nonumber
\end{eqnarray}

where $\tilde{y}$ is the cost-augmented best tree under the current parameters, i.e. $\tilde{y} = \arg\max_{y \in \mathcal{Y}(\hat{x})} \left[ S(\hat{x}, y) + \|\hat{y}-y\|_1 \right]$. The above optimization problem (1) has the closed-form solution:

\begin{eqnarray*}
\nabla{\theta} &=& \min\left\lbrace C, \frac{\text{loss}}{\gamma^2 \|d\theta\|^2 + (1-\gamma)^2 \|du\|^2} \right\rbrace  \gamma d\theta \\
\nabla{U} &=& \min\left\lbrace C, \frac{\text{loss}}{\gamma^2 \|d\theta\|^2 + (1-\gamma)^2 \|du\|^2} \right\rbrace (1-\gamma) du
\end{eqnarray*}

where

\begin{eqnarray*}
\text{loss} &=& S(\hat{x},\tilde{y}) + \|\hat{y}-\tilde{y}\|_1 - S(\hat{x},\hat{y}) \\
d\theta &=& \sum_{h\rightarrow m\ \in\ \hat{y}} \phi_{h\rightarrow m} - \sum_{h\rightarrow m\ \in\ \tilde{y}} \phi_{h\rightarrow m} \\
du &=& \sum_{h\rightarrow m\ \in\ \hat{y}} \left[ (V\phi_m) * (W dis(h-m)) \right] \otimes \phi_h - \sum_{h\rightarrow m\ \in\ \tilde{y}} \left[ (V\phi_m) * (W dis(h-m)) \right] \otimes \phi_h
\end{eqnarray*}


and $*$ denotes the element-wise product, i.e. $(x * y)_i = x_i y_i$.
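The update can be sketched directly from these quantities. This is a non-authoritative sketch; the variable names, the `phi` lookup and the `dis` helper are assumptions:

<code python>
import numpy as np

# Gradient-like quantity for U: sum over arcs (h, m) of
# [(V phi_m) * (W dis(h-m))] outer phi_h, which has the shape of U (R x N).
def d_u(arcs, V, W, phi, dis):
    return sum(np.outer((V @ phi[m]) * (W @ dis(h - m)), phi[h])
               for (h, m) in arcs)

# One passive-aggressive step for problem (1);
# du = d_u(gold_arcs, ...) - d_u(pred_arcs, ...), dtheta and loss as above.
def pa_step(theta, U, dtheta, du, loss, gamma, C):
    denom = gamma**2 * np.sum(dtheta**2) + (1 - gamma)**2 * np.sum(du**2)
    if loss <= 0 or denom == 0:
        return theta, U                 # passive: constraint already satisfied
    tau = min(C, loss / denom)          # aggressive step size, capped at C
    return theta + tau * gamma * dtheta, U + tau * (1 - gamma) * du
</code>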

Parameter Averaging

Similar to the MST parser and the Turbo parser, we average parameters when evaluating/testing the model. We average $U$, $V$ and $W$ independently. Although the effect of averaging $U$ and $V$ on the term $UV^T$ is not well understood, we find that it works well empirically.
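A minimal bookkeeping sketch of independent averaging (the accumulation scheme below is an assumption; only the fact that the parameters are averaged separately comes from the text):

<code python>
import numpy as np

# Accumulate parameters after every online update; divide at test time.
class AveragedParams:
    def __init__(self, params):                        # dict: name -> array
        self.sums = {k: np.zeros_like(v) for k, v in params.items()}
        self.count = 0

    def accumulate(self, params):
        for k, v in params.items():
            self.sums[k] += v
        self.count += 1

    def average(self):                                 # e.g. averaged U, V, W
        return {k: s / self.count for k, s in self.sums.items()}
</code>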


