==== Problem Formulation ====

We use $x$ and $y$ to denote a sentence and a dependency tree respectively. We consider score functions of the form $S(x,y)=\theta\cdot\phi(x,y)$, where $\phi(x,y)$ may include arbitrary local and global features. Given the parameters $\theta$, decoding is a structured prediction task that solves the maximization problem: $$ \tilde{y}\ =\ \text{argmax}_{\,y\in\mathcal{T}(x)}\ \ S(x,y) $$ For an arbitrary score function $S(x,y)$, this maximization problem can be NP-hard even for second-order non-projective parsing. However, we will show that a greedy hill-climbing algorithm works both empirically and, under a certain condition, provably.

\\
==== Hill Climbing Algorithm ====

The hill-climbing algorithm operates similarly to Gibbs sampling, changing the head of one modifier word at a time. Let $\mathcal{T}(y,m)$ be the set of valid trees that differ from tree $y$ only at the head of word $m$. The algorithm is as follows.

\\
^ Input: | $\theta,\quad x$ |
^ Output: | $\tilde{y}\ \approx\ \text{argmax}_{\, y\in\mathcal{Y}}\ \ S(x,y)$ |
^ Algorithm: | \begin{enumerate} \item Randomly initialize tree $y^{(0)}\sim Q$ \item $t = 0$ \item Repeat for $i=1,\ 2,\ \cdots$: \item $\quad\ $ list = reverse depth-first traversal of $y^{(t)}$ \item $\quad\ $ For each word $m$ in list: \item $\quad\ $$\quad\ $ $y^{(t+1)} = \text{argmax}_{\,y'\in \mathcal{T}(y^{(t)},m)}\ S(x,y')$ \item $\quad\ $$\quad\ $ $t = t+1$ \item Until no change is made in the $i$-th iteration \item Return $\tilde{y} = y^{(t)}$ \end{enumerate} |

\\
Note that (1) the algorithm initializes $y^{(0)}$ from a distribution $Q$; (2) it repeatedly visits each word in **//reverse DFS order//** (so a word far from the root comes first); (3) when changing the head of $m$, it keeps $y^{(t)}$ a valid tree at every step $t$; and (4) it runs until no change can increase the score.
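The loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the choice of $Q$ (attach each word to a random earlier word, which always yields a tree), the head-array representation, and all helper names are assumptions.

```python
import random

def reverse_dfs(heads):
    """Reverse depth-first order of a tree given as a head array.

    heads[m] is the head of word m (0 is the artificial root; heads[0]
    is unused). Words far from the root come first in the result.
    """
    n = len(heads) - 1
    children = {h: [] for h in range(n + 1)}
    for m in range(1, n + 1):
        children[heads[m]].append(m)
    order, stack = [], [0]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(reversed(children[v]))
    return [m for m in reversed(order) if m != 0]

def valid_heads(heads, m):
    """The set T(y, m): heads for m that keep the structure a valid tree."""
    n = len(heads) - 1
    desc, frontier = {m}, [m]      # descendants of m (attaching to one forms a cycle)
    while frontier:
        v = frontier.pop()
        for u in range(1, n + 1):
            if heads[u] == v and u not in desc:
                desc.add(u)
                frontier.append(u)
    return [h for h in range(n + 1) if h not in desc]

def hill_climb(n, score, seed=0):
    """Greedy hill climbing; `score` maps a head array to S(x, y)."""
    rng = random.Random(seed)
    # y^(0) ~ Q: attach each word to a random earlier word (always a tree)
    heads = [None] + [rng.randrange(0, m) for m in range(1, n + 1)]
    changed = True
    while changed:                  # until a full pass makes no change
        changed = False
        for m in reverse_dfs(heads):
            best = max(valid_heads(heads, m),
                       key=lambda h: score(heads[:m] + [h] + heads[m + 1:]))
            if best != heads[m]:
                heads[m] = best
                changed = True
    return heads
```

With an arc-counting score (one point per correct arc), condition (2) of the next section holds with margin exactly 1, and the climber recovers the target tree from any random start.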
\\
==== Justification of the Algorithm ====

Let us denote a given sentence by $\hat{x}$ and its correct dependency tree by $\hat{y}$, which is provided during training. At test time we simply assume that such a correct tree $\hat{y}$ exists but is unknown. To justify the algorithm, two questions need to be answered:
  - Under what condition does the algorithm find the correct tree $\hat{y}$?
  - What is the learning problem, and how hard is it?
\\
=== Q1: Sufficient Conditions ===

Consider first how a traditional structured prediction model works. It aims to learn a score function that scores $\hat{y}$ higher than every other tree: \begin{eqnarray*} S(\hat{x}, \hat{y}) \geq S(\hat{x}, y) + \text{err}(y) \quad\quad \forall\, y\in \mathcal{T}(\hat{x}) \quad\quad\quad\quad (1) \end{eqnarray*} where $\text{err}(y)$ is the number of wrong dependency arcs in $y$ (i.e. the $\ell_1$ structural loss). When condition (1) holds, solving the maximization problem gives $\tilde{y}=\hat{y}$, but the problem is NP-hard in general.
\\
Alternatively, we can force the model to learn a similar but stronger score function, which makes a simpler maximization algorithm possible: \begin{eqnarray*} S(\hat{x}, y) \geq S(\hat{x}, y') + 1 \quad\quad \forall\, y \in \mathcal{T}(y',m) \ \text{ with } \ \text{err}(y)=\text{err}(y')-1 \quad\quad\quad\quad (2) \end{eqnarray*} Condition (2) states that the score increases by at least 1 whenever we correct the head of one modifier. Note that **condition (2) implies condition (1)**. If condition (2) holds, we expect the hill-climbing algorithm to climb to the global optimum $\hat{y}$. One non-trivial point remains: we must prove that when the algorithm corrects each head **//in some order//**, the structure **//remains a valid tree//** at every step $t$.
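To see why condition (2) implies condition (1), fix any tree $y$ with $\text{err}(y)=k$ and take a chain $y=y_0,\ y_1,\ \cdots,\ y_k=\hat{y}$ of valid trees in which each $y_{j+1}\in\mathcal{T}(y_j,m_j)$ corrects exactly one head, so $\text{err}(y_{j+1})=\text{err}(y_j)-1$ (the proof of the claim below constructs such a chain via reverse DFS order). Applying condition (2) to each step and telescoping:
$$ S(\hat{x},\hat{y}) - S(\hat{x},y) \ =\ \sum_{j=0}^{k-1}\left( S(\hat{x},y_{j+1}) - S(\hat{x},y_j)\right) \ \ge\ \sum_{j=0}^{k-1} 1 \ =\ k \ =\ \text{err}(y) $$
which is exactly condition (1).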
The following proof establishes the //correctness// of the hill-climbing algorithm.

=== Claim: ===

Suppose condition (2) holds. Then the hill-climbing algorithm terminates in $t=O(n)$ steps and returns the correct tree $\hat{y}$, where $n$ is the number of words in $x$. Specifically, for $t=0,\ 1,\ \cdots,\ n$: $$ \text{err}(y^{(0)}) \ge \text{err}(y^{(1)}) \ge \cdots \ge \text{err}(y^{(n)}) = 0 $$ and therefore $y^{(n)}=\hat{y}$.

++++ Proof: |
We show that for each time $t$ and the modifier word $m$ being processed, (a) correcting the head of $m$ gives a valid tree, which is therefore in $\mathcal{T}(y^{(t)},m)$; and (b) the argmax tree (line 6 of the algorithm) is the tree obtained by correcting the head of $m$.

**(a)** Each $m$ is processed in reverse DFS order. When $t=0$, the word $m$ is a leaf of $y^{(0)}$, so changing its head to any other word still yields a valid tree. More generally, at any time $t$, consider the subtree rooted at $m$ in $y^{(t)}$. A non-obvious fact is that all words in this subtree have already been processed before time $t$ and their heads corrected. We show this by contradiction. Assume a word $v$ in the subtree has not been processed yet, and denote its current head by $h(v)$. The arc $h(v)\rightarrow v$ is in both $y^{(0)}$ and $y^{(t)}$ because $v$ has not been processed. Since the algorithm processes words in reverse DFS order, $h(v)$ has not been processed either; similarly $h(h(v))$, $\cdots$, up to $m$ are all unprocessed. This contradicts the fact that $m$ is the next word in the reverse DFS list. Hence all heads and arcs in the subtree rooted at $m$ are correct. Now let $\hat{h}$ be the correct head of $m$. $\hat{h}$ cannot lie in this subtree, since otherwise $\hat{y}$ would contain a cycle $\hat{h}\rightarrow m \rightarrow \cdots \rightarrow \hat{h}$. Therefore changing the head of $m$ to $\hat{h}$ creates no cycle and yields a valid tree.
**(b)** Since changing the head of $m$ to $\hat{h}$ gives a valid tree, condition (2) implies that its score is at least 1 higher than that of every other option, i.e. this tree is the argmax. Because the algorithm corrects the head of $m$, $\text{err}(y)$ is non-increasing and reaches 0 once all heads have been corrected (at $t=n$).
++++
\\
=== Q2: Learning Problem ===

To learn the score function from training data $D=\left\lbrace\hat{x}_i,\hat{y}_i\right\rbrace$, the standard approach is to minimize the regularized (hinge) loss on the training data: $$ \sum_{i=1}^{N} \mathcal{L}(\hat{x}_i,\hat{y}_i) + \|\theta\|_2^2 $$ The loss that corresponds to condition (2) would be: $$ \mathcal{L}(\hat{x}_i,\hat{y}_i) = \max_{y,y'} \left\lbrace S(\hat{x}_i, y') + 1 - S(\hat{x}_i, y) \right\rbrace \quad\quad\quad (2o) $$ where the max ranges over pairs with $y\in\mathcal{T}(y',m)$ and $\text{err}(y)=\text{err}(y')-1$. Unfortunately, it is not clear whether the maximizing pair $(y, y')$, and hence this loss, can be computed efficiently. Instead, we can use the standard structural learning loss, which corresponds to condition (1): $$ \mathcal{L}(\hat{x}_i,\hat{y}_i) = \max_{\tilde{y}}\left\lbrace S(\hat{x}_i, \tilde{y}) + \text{err}(\tilde{y})\right\rbrace - S(\hat{x}_i, \hat{y}_i) \quad\quad (1o) $$ where $\tilde{y}$ is the output of the decoding algorithm. The two loss functions are connected as follows: condition (1) is a necessary condition of condition (2), and (1o) is a lower bound on the sum of the losses (2o) over consecutive pairs $(y, y')$ on any "climbing" path from $\tilde{y}$ to $\hat{y}$. Thus, by minimizing (1o) we push this lower bound of (2o) toward zero. If the lower bound is **//tight//** (because $\text{err}(\tilde{y})$ is small and the path from $\tilde{y}$ to $\hat{y}$ is short), minimizing (1o) effectively minimizes (2o) as well.
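In practice, loss (1o) is computed by cost-augmented decoding: run the decoder on the augmented score $S(x,y)+\text{err}(y)$ and measure the margin violation. A minimal sketch, where `decode` stands for any (possibly approximate) argmax over trees, e.g. the hill-climbing decoder, and all names are illustrative assumptions:

```python
def err(y, gold):
    """Number of wrong dependency arcs in y (the l1 structural loss)."""
    return sum(1 for m in range(1, len(gold)) if y[m] != gold[m])

def hinge_loss(x, gold, score, decode):
    """Loss (1o): cost-augmented decoding followed by the margin violation.

    `score(x, y)` plays the role of S(x, y); `decode(x, s)` returns an
    (approximate) argmax of s(x, y) over trees.
    """
    augmented = lambda xx, y: score(xx, y) + err(y, gold)
    y_tilde = decode(x, augmented)          # cost-augmented decoding
    # max with 0 guards against an approximate decoder undershooting gold
    return max(0.0, score(x, y_tilde) + err(y_tilde, gold) - score(x, gold))
```

With an exact decoder the `max` is redundant, since the gold tree itself is a feasible $\tilde{y}$ and already achieves a non-negative objective.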