We use $x$ and $y$ to denote a sentence and a dependency tree respectively. We consider score functions of the form $s(x, y; \theta) = \theta \cdot f(x, y)$, in which the feature vector $f(x, y)$ may include any local and global features.
Given the parameters $\theta$, decoding is a structured prediction task that aims to solve the maximization problem:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} s(x, y; \theta)$$

where $\mathcal{Y}(x)$ denotes the set of valid dependency trees for $x$.
For an arbitrary score function $s(x, y; \theta)$, this maximization problem can be NP-hard even for second-order non-projective parsing. However, we are going to show that a greedy hill climbing algorithm works both empirically and, under a certain condition, theoretically.
The Hill Climbing algorithm operates similarly to Gibbs sampling: at each step it changes the head of a single modifier word.
Let $\mathcal{Y}_m(y)$ be the set of valid trees that may differ from tree $y$ only at the head of word $m$. The algorithm is as follows:
Input: a sentence $x$ and parameters $\theta$
Output: a dependency tree $\hat{y}$
Algorithm:

1. Sample an initial tree $y^{(0)}$ from a distribution $q(y \mid x)$ and set $t \leftarrow 0$.
2. For each word $m$, taken in reverse DFS order of the current tree, set $y^{(t+1)} \leftarrow \arg\max_{y' \in \mathcal{Y}_m(y^{(t)})} s(x, y'; \theta)$ and $t \leftarrow t + 1$.
3. Repeat step 2 until no head change increases the score, then return $\hat{y} = y^{(t)}$.
Note that (1) the algorithm initializes from a distribution $q(y \mid x)$; (2) it repeatedly checks each word in reverse DFS order, so a word that is far away from the root comes first; (3) when changing the head of $m$, the structure has to remain a valid tree at every point $t$; and (4) it runs until no change can further increase the score.
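As a concrete illustration, the following is a minimal Python sketch of this procedure. The tree representation (a dict mapping each modifier index to its head index, with 0 as the root), the `score` callback, and all helper names are assumptions made for illustration rather than details from the original description.

```python
def is_valid_tree(heads, n):
    """Check that every word reaches the root (index 0) without cycles."""
    for m in range(1, n + 1):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def reverse_dfs_order(heads, n):
    """Reverse of a DFS preorder from the root: each word appears before
    its ancestors, so words far from the root come first."""
    children = {h: [] for h in range(n + 1)}
    for m in range(1, n + 1):
        children[heads[m]].append(m)
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])
    return [m for m in reversed(order) if m != 0]

def hill_climb(sentence, score, init_heads):
    """Greedy hill climbing: repeatedly re-pick each word's head to maximize
    the score, keeping a valid tree at all times, until no change helps."""
    n = len(sentence)
    heads = dict(init_heads)          # y^(0), e.g. sampled from q(y | x)
    improved = True
    while improved:
        improved = False
        for m in reverse_dfs_order(heads, n):
            best_h, best_s = heads[m], score(sentence, heads)
            for h in range(0, n + 1):  # candidate heads, i.e. Y_m(y)
                if h == m or h == heads[m]:
                    continue
                candidate = dict(heads)
                candidate[m] = h
                if not is_valid_tree(candidate, n):
                    continue           # keep a valid tree at every point
                s = score(sentence, candidate)
                if s > best_s:
                    best_h, best_s = h, s
            if best_h != heads[m]:
                heads[m] = best_h
                improved = True
    return heads

# Example usage with a toy scorer (counts how many heads match some target):
# sentence = ["the", "dog", "barks"]
# target = {1: 2, 2: 3, 3: 0}
# score = lambda x, y: sum(y[m] == target[m] for m in target)
# init = {m: 0 for m in range(1, len(sentence) + 1)}  # all words on the root
# print(hill_climb(sentence, score, init))            # -> {1: 2, 2: 3, 3: 0}
```

In practice $y^{(0)}$ would be drawn from a distribution $q(y \mid x)$ (for example, a simple first-order model); attaching every word to the root, as in the usage comment, is merely a trivially valid starting point for testing the sketch.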
Let us first denote a given sentence as $x$ and the corresponding correct dependency tree as $y^*$, which is provided during training. At test time, we simply assume such a correct tree exists but is unknown.
To justify the algorithm, two questions need to be answered: (a) under what condition does the Hill Climbing algorithm return the correct tree $y^*$, and (b) how can we learn a score function that satisfies this condition?
We can first consider how things work in a traditional structural prediction model. The traditional model aims to learn a score function that grades $y^*$ higher than all other trees:

$$s(x, y^*; \theta) \ge s(x, y; \theta) + \Delta(y, y^*) \quad \forall\, y \in \mathcal{Y}(x) \qquad (1)$$
where $\Delta(y, y^*)$ is the number of wrong dependency arcs in tree $y$ (i.e. the structural loss). When condition (1) holds, solving the maximization problem gives us $y^*$, but the problem is generally NP-hard.
Alternatively, we can force the model to learn a similar but stronger score function, which makes a simpler maximization algorithm possible:

$$s(x, y'; \theta) \ge s(x, y; \theta) + 1 \quad \text{for all } y \text{ and } y' \in \mathcal{Y}_m(y) \text{ with } y'_m = y^*_m \ne y_m \qquad (2)$$

where $y_m$ denotes the head of word $m$ in tree $y$.
What condition (2) states is that the score will increase by at least 1 if we correct the head of a single modifier. Note that condition (2) implies condition (1). If condition (2) holds, we expect the Hill Climbing algorithm to successfully climb to the global optimum $y^*$.
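To make the implication explicit, here is a short derivation sketch; it assumes that the $k = \Delta(y, y^*)$ wrong heads of a tree $y$ can be corrected one at a time through valid intermediate trees $y = z^{(0)}, z^{(1)}, \dots, z^{(k)} = y^*$ (this is exactly the validity point addressed next). Applying condition (2) to each consecutive pair gives

$$s(x, y^*; \theta) - s(x, y; \theta) = \sum_{t=1}^{k} \big( s(x, z^{(t)}; \theta) - s(x, z^{(t-1)}; \theta) \big) \ge k = \Delta(y, y^*),$$

which is exactly condition (1).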
There is a non-trivial point left: we need to prove that when the algorithm tries to correct each head in some order, the structure remains a valid tree at every point $t$. The following result supports the correctness of the Hill Climbing algorithm.
Suppose condition (2) holds. Then the Hill Climbing algorithm terminates after $O(n)$ head changes and returns the correct tree $y^*$, where $n$ is the number of words in $x$. Specifically, letting $m_1, \dots, m_n$ be the reverse DFS order in which the words are visited, for $t = 1, \dots, n$:

$$y^{(t)}_{m_i} = y^*_{m_i} \quad \text{for all } i \le t,$$

and therefore $y^{(n)} = y^*$.
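As a sanity check on this statement, one artificial score function that satisfies condition (2) (with equality) is the negated structural loss; this is purely illustrative, since it presupposes knowledge of $y^*$:

$$s(x, y; \theta) = -\Delta(y, y^*), \qquad s(x, y'; \theta) - s(x, y; \theta) = \Delta(y, y^*) - \Delta(y', y^*) = 1$$

whenever $y'$ corrects exactly one wrong head of $y$. Under this score, any accepted head change must fix one arc, so, given the validity argument above, the algorithm reaches $y^*$ after at most $n$ accepted changes.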
To learn the score function given training data $\{(x^{(i)}, y^{*(i)})\}_{i=1}^{N}$, the standard approach is to minimize the (hinge) loss on the training data:

$$\min_{\theta} \ \sum_{i=1}^{N} \ell\big(x^{(i)}, y^{*(i)}; \theta\big),$$

where $\ell$ is a hinge loss that penalizes violations of the desired margin condition.
If we are to minimize the loss that corresponds to condition (2), it would be:

$$\ell_2(x, y^*; \theta) = \max_{m,\, y,\, y'} \big[\, 1 + s(x, y; \theta) - s(x, y'; \theta) \,\big]_+ \qquad (2o)$$

where the maximum is taken over all words $m$ and pairs of trees $(y, y')$ such that $y' \in \mathcal{Y}_m(y)$, $y'_m = y^*_m$, and $y_m \ne y^*_m$.
Unfortunately, it is not clear whether there is an efficient way to compute this loss function, since it requires finding the maximizing pair $y$ and $y'$. Instead, we could use the standard structural learning loss function, which corresponds to condition (1):

$$\ell_1(x, y^*; \theta) = \big[\, s(x, \hat{y}; \theta) + \Delta(\hat{y}, y^*) - s(x, y^*; \theta) \,\big]_+ \qquad (1o)$$
where $\hat{y}$ is the output of the decoding algorithm.
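As an illustration of how loss (1o) might be computed per sentence, here is a sketch that reuses the `hill_climb` decoder and dict-of-heads representation from the earlier sketch; the `hamming_cost` helper and the choice of running the decoder with the plain (rather than a cost-augmented) score are assumptions, not details from the original text.

```python
def hamming_cost(heads, gold_heads):
    """Number of modifiers whose head differs from the gold tree (Delta)."""
    return sum(1 for m in gold_heads if heads[m] != gold_heads[m])

def loss_1o(sentence, gold_heads, score, init_heads):
    # y_hat is the output of the decoding algorithm; whether decoding should
    # use the plain score or a cost-augmented one is left open here.
    y_hat = hill_climb(sentence, score, init_heads)
    violation = (score(sentence, y_hat)
                 + hamming_cost(y_hat, gold_heads)
                 - score(sentence, gold_heads))
    return max(0.0, violation)
```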
The connections between the two loss functions are the following: condition (1) is a necessary condition of condition (2), and (1o) is a lower bound on the sum of the loss (2o) over each two consecutive trees on any “climbing” path from $\hat{y}$ to $y^*$. Thus, by minimizing (1o) we are making a lower bound of (2o) close to zero. If this lower bound is tight (because $\Delta(\hat{y}, y^*)$ is small and the path from $\hat{y}$ to $y^*$ is short),