==== Problem Formulation ====
We use $x$ and $y$ to denote a sentence and a dependency tree respectively. We consider score functions of the form $S(x,y)=\theta\cdot\phi(x,y)$, in which $\phi(x,y)$ may include any local and global features.
Given the parameters $\theta$, decoding is a structured prediction task that solves the maximization problem:
$$
\tilde{y}\ =\ \text{argmax}_{\,y\in\mathcal{T}(x)}\ \ S(x,y)
$$
For an arbitrary score function $S(x,y)$, the maximization problem can be NP-hard even for second-order non-projective parsing. However, we show that a greedy hill climbing algorithm works well empirically and, under a certain condition, provably.
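As a rough illustration (not part of the formulation above), a first-order (arc-factored) instantiation of $S(x,y)=\theta\cdot\phi(x,y)$ might look like the following sketch; the feature names and hashing scheme are purely illustrative, and richer global features would be accumulated analogously.
<code python>
import numpy as np

def arc_features(x, head, mod, dim=2**16):
    """Hash a few illustrative arc features (head word, modifier word, their
    pair, and signed distance) into indices of a weight vector."""
    feats = [f"h={x[head]}", f"m={x[mod]}",
             f"hm={x[head]}|{x[mod]}", f"dist={head - mod}"]
    return [hash(f) % dim for f in feats]

def score(theta, x, y):
    """S(x, y) = theta . phi(x, y), with phi summed over the arcs of y.
    `x` is a list of words with x[0] an artificial root symbol; `y` maps each
    modifier position m (1..n) to its head position (0 denotes the root)."""
    total = 0.0
    for m, h in y.items():
        for idx in arc_features(x, h, m):
            total += theta[idx]
    return total

# usage sketch:
# theta = np.zeros(2**16)
# score(theta, ["<root>", "John", "saw", "Mary"], {1: 2, 2: 0, 3: 2})
</code>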
\\
==== Hill Climbing Algorithm ====
The Hill Climbing algorithm operates similarly to Gibbs sampling, changing the head of one modifier word at a time.
Let $\mathcal{T}(y,m)$ be the set of trees that differ from tree $y$ at most at the head of word $m$. The algorithm is as follows,
\\
^ Input: | $\theta,\quad x$ |
^ Output: | $\tilde{y}\ \approx\ \text{argmax}_{\, y\in\mathcal{T}(x)}\ \ S(x,y)$ |
^ Algorithm: |
\begin{enumerate}
\item Randomly initialize tree $y^{(0)}\sim Q$
\item $t = 0$
\item Repeat for $i=1,\ 2,\ \cdots$ do:
\item $\quad\ $ list = reverse depth-first traversal of $y^{(t)}$
\item $\quad\ $ For each word $m$ in list do:
\item $\quad\ $$\quad\ $ $y^{(t+1)} = \text{argmax}_{\,y'\in \mathcal{T}(y^{(t)},m)}\ S(x,y')$
\item $\quad\ $$\quad\ $ $t = t+1$
\item Until no change has been made in the $i$-th iteration
\item Return $\tilde{y} = y^{(t)}$
\end{enumerate}
|
\\
Note that (1) the algorithm initializes $y^{(0)}$ from a distribution $Q$; (2) it repeatedly checks each word in **//reverse DFS order//** (so words farther from the root come first); (3) when changing the head of $m$, it must keep $y^{(t)}$ a valid tree at every point $t$; and (4) it runs until no change can increase the score.
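The sketch below shows one way this procedure could be implemented, reusing the tree representation and `score` function from the earlier snippet (a dict mapping each modifier to its head, with 0 as the root); the initialization $Q$ is simplified to a random tree whose heads point to earlier words, and ties are broken by keeping the current head.
<code python>
import random

def reaches_root(y, start):
    """Follow head links from `start`; True iff we reach the root (0) without
    revisiting a node, i.e. the head change did not create a cycle."""
    seen, node = set(), start
    while node != 0:
        if node in seen:
            return False
        seen.add(node)
        node = y[node]
    return True

def reverse_dfs_order(y):
    """Reverse depth-first order of the current tree, so that words far from
    the root are listed first."""
    children = {}
    for m, h in y.items():
        children.setdefault(h, []).append(m)
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        if node != 0:
            order.append(node)
        stack.extend(children.get(node, []))
    return order[::-1]

def hill_climb(theta, x, score, seed=0):
    """Greedy hill climbing: re-attach each word (in reverse DFS order) to the
    head maximizing the score, and repeat until no change increases it."""
    n = len(x) - 1                                # x[0] is the root token
    rng = random.Random(seed)
    # y^(0) ~ Q: here simply a random tree whose heads point to earlier words
    y = {m: rng.randrange(0, m) for m in range(1, n + 1)}
    changed = True
    while changed:
        changed = False
        for m in reverse_dfs_order(y):
            old_h = y[m]
            best_h, best_s = old_h, score(theta, x, y)
            for h in range(0, n + 1):             # candidate heads, incl. root
                if h == m:
                    continue
                y[m] = h
                if reaches_root(y, m):            # keep y a valid tree
                    s = score(theta, x, y)
                    if s > best_s:
                        best_h, best_s = h, s
            y[m] = best_h                         # argmax over T(y^{(t)}, m)
            changed = changed or (best_h != old_h)
    return y
</code>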
\\
==== Justification of the Algorithm ====
Let us first denote a given sentence by $\hat{x}$ and the corresponding correct dependency tree by $\hat{y}$; the latter is provided during training. At test time, we simply assume such a correct tree $\hat{y}$ exists but is unknown.
To justify the algorithm, two questions need to be answered:
- Does the algorithm find the correct tree $\hat{y}$, and under what condition?
- What is the learning problem, and how hard is it?
\\
=== Q1: Sufficient Conditions ===
Let us first consider how a traditional structured prediction model works. It aims to learn a score function that scores $\hat{y}$ higher than all other trees:
\begin{eqnarray*}
S(\hat{x}, \hat{y}) \;\geq\; S(\hat{x}, y) + \text{err}(y) \qquad \forall\, y\in \mathcal{T}(\hat{x}) \qquad\qquad (1)
\end{eqnarray*}
where $\text{err}(y)$ is the number of wrong dependency arcs in this tree (i.e. $\ell_1$ structural loss). When condition (1) holds, solving the maximization problem would give us $\tilde{y}=\hat{y}$, but the problem is generally NP-hard.
\\
Alternatively, we can force the model to learn a similar but stronger score function, which makes a simpler maximization algorithm possible:
\begin{eqnarray*}
S(\hat{x}, y) \;\geq\; S(\hat{x}, y') + 1 \qquad \forall\, y \in \mathcal{T}(y',m) \ \text{ with } \ \text{err}(y)=\text{err}(y')-1 \qquad\qquad (2)
\end{eqnarray*}
Condition (2) states that the score increases by at least 1 whenever we correct the head of a modifier. Note that **condition (2) implies condition (1)**. If condition (2) holds, we expect the Hill Climbing algorithm to climb successfully to the global optimum $\hat{y}$.
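To see why the implication holds, take any tree $y$ with $\text{err}(y)=k$ and a sequence $y=y_0,\ y_1,\ \cdots,\ y_k=\hat{y}$ in which each step corrects the head of one modifier (such a sequence of valid trees exists by the argument in the proof below). Applying condition (2) to every step and telescoping,
$$
S(\hat{x},\hat{y}) - S(\hat{x},y) \;=\; \sum_{j=0}^{k-1}\left( S(\hat{x},y_{j+1}) - S(\hat{x},y_j) \right) \;\ge\; k \;=\; \text{err}(y),
$$
which is exactly condition (1).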
There is a non-trivial point left: we need to prove that when the algorithm corrects each head **//in some order//**, the structure **//remains a valid tree//** at every point $t$. The following proof establishes the //correctness// of the Hill Climbing algorithm.
=== Claim: ===
Suppose condition (2) holds. Then the Hill Climbing algorithm terminates in $t=O(n)$ steps and returns the correct tree $\hat{y}$, where $n$ is the number of words in $x$. Specifically, for $t=0,\ 1,\ \cdots,\ n$:
$$
\text{err}(y^{(0)}) \ge \text{err}(y^{(1)}) \ge \cdots \ge \text{err}(y^{(n)}) = 0
$$
and therefore $y^{(n)}=\hat{y}$.
++++ Proof: |
We show that, for each time $t$ and the modifier word $m$ being processed, (a) correcting the head of $m$ yields a valid tree, which therefore lies in $\mathcal{T}(y^{(t)},m)$; and (b) the argmax tree (line 6 of the algorithm) is exactly the tree obtained by correcting the head of $m$.
**(a)** Note that each $m$ is processed in reverse DFS order. When $t=0$, the word $m$ is a leaf of $y^{(0)}$, so changing its head to any other word still yields a valid tree.
More generally, at any time $t$, consider the subtree of $y^{(t)}$ rooted at $m$. A non-obvious fact is that every word in this subtree has already been processed before time $t$ and has had its head corrected. This can be shown by contradiction. Assume a word $v$ in this subtree has not been processed yet, and denote its current head by $h(v)$. The arc $h(v)\rightarrow v$ appears in both $y^{(0)}$ and $y^{(t)}$ because $v$ has not been processed. Since the algorithm processes words in reverse DFS order, $h(v)$ has not been processed either; similarly $h(h(v))$, $\cdots$, up to $m$ have not been processed, so the path from $m$ down to $v$ is unchanged and $v$ precedes $m$ in the reverse DFS list. This contradicts the fact that $m$ is the next immediate word in the list (every word before $m$ has already been processed).
Now all heads and arcs in the subtree rooted at $m$ are correct. Let $\hat{h}$ be the correct head of $m$. Then $\hat{h}$ cannot lie in this subtree; otherwise the correct arcs would form a cycle $\hat{h}\rightarrow m \rightarrow \cdots \rightarrow \hat{h}$ in $\hat{y}$. Therefore changing the head of $m$ to $\hat{h}$ does not create a cycle and yields a valid tree.
**(b)** Once changing the head of $m$ to $\hat{h}$ gives a valid tree, condition (2) tells us that its score is at least 1 higher than every other option in $\mathcal{T}(y^{(t)},m)$ (each of which has exactly one more wrong arc), i.e. this tree is the argmax.
Since the algorithm corrects the head of each processed word $m$ (or leaves it unchanged if already correct), $\text{err}(y^{(t)})$ is non-increasing and reaches 0 once every word has been processed, i.e. at $t=n$.++++
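As an illustrative sanity check (not part of the claim itself), one can plug the artificial score $S(x,y)=-\text{err}(y)$ into the `hill_climb` sketch above: this score satisfies condition (2) with margin exactly 1, so the decoder should recover the (hypothetical) gold tree from any initialization.
<code python>
# Toy check of the claim, reusing hill_climb from the earlier sketch.
gold = {1: 2, 2: 0, 3: 2, 4: 3}                    # hypothetical gold heads
x = ["<root>", "w1", "w2", "w3", "w4"]

def neg_err_score(_theta, _x, y):
    # S(x, y) = -err(y): correcting one head raises the score by exactly 1
    return -sum(1 for m, h in y.items() if h != gold[m])

assert hill_climb(None, x, neg_err_score) == gold   # climbs to the gold tree
</code>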
\\
=== Q2: Learning Problem ===
To learn the score function from training data $D=\left\lbrace(\hat{x}_i,\hat{y}_i)\right\rbrace$, the standard approach is to minimize the regularized (hinge) loss on the training data:
$$
\sum_{i=1}^{N} \mathcal{L}(\hat{x}_i,\hat{y}_i) + \|\theta\|_2^2
$$
If we were to minimize the loss corresponding to condition (2), it would be:
$$
\mathcal{L}(\hat{x}_i,\hat{y}_i) = \max_{y,y'} \left\lbrace S(\hat{x}_i, y') + 1 - S(\hat{x}_i, y) \right\rbrace \qquad\qquad (2o)
$$
where the max is taken over pairs $(y, y')$ satisfying the constraints of condition (2). Unfortunately, it is not clear whether there is an efficient way to compute this loss, i.e. to find the maximizing $y$ and $y'$. Instead, we can use the standard structured learning loss function, which corresponds to condition (1):
$$
\mathcal{L}(\hat{x}_i,\hat{y}_i) = \max_{\tilde{y}}\left\lbrace S(\hat{x}_i, \tilde{y}) + \text{err}(\tilde{y})\right\rbrace - S(\hat{x}_i, \hat{y}_i) \qquad\qquad (1o)
$$
where the maximizing $\tilde{y}$ is found (approximately) by the decoding algorithm, run with the loss-augmented score $S(\hat{x}_i, y) + \text{err}(y)$.
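A rough sketch of how (1o) might be computed in practice, reusing the earlier `hill_climb` and `score` sketches and assuming the gold tree is given as a head dict; the clamp at 0 only guards against a negative value caused by approximate decoding.
<code python>
def err(y, gold):
    """Number of wrong dependency arcs (the structural loss)."""
    return sum(1 for m, h in y.items() if h != gold[m])

def hinge_loss(theta, x, gold, score):
    """Structured hinge loss (1o), with hill climbing as the (approximate)
    loss-augmented decoder."""
    augmented = lambda th, xx, y: score(th, xx, y) + err(y, gold)
    y_tilde = hill_climb(theta, x, augmented)      # approximate argmax tree
    return max(0.0, augmented(theta, x, y_tilde) - score(theta, x, gold))
</code>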
The connection between the two loss functions is the following: condition (1) is a necessary condition of condition (2), and (1o) is a lower bound on the sum of the (2o)-style losses over consecutive pairs $(y', y)$ along any "climbing" path from $\tilde{y}$ to $\hat{y}$. Thus, by minimizing (1o) we push this lower bound toward zero. If the bound is **//tight//** (because $\text{err}(\tilde{y})$ is small and the path from $\tilde{y}$ to $\hat{y}$ is short), then minimizing (1o) effectively also drives the per-step losses toward zero, so the learned score function approximately satisfies condition (2).