Let $x$ denote a sentence and $\mathcal{Y}(x)$ be the set of possible dependency trees for $x$. In first-order dependency parsing, the score of a tree $y \in \mathcal{Y}(x)$ is the sum of its arc scores, $S(x, y) = \sum_{(h,m) \in y} s(h, m)$, where $(h, m)$ is a (head, modifier) arc in the tree.
We define the arc score as follows:
$$s(h, m) = \gamma \, \langle \theta, \phi_{h,m} \rangle + (1 - \gamma) \sum_{i,j,k} A_{ijk} \, [\phi_h]_i \, [\phi_m]_j \, [\phi_d]_k$$
where $\phi_{h,m}$ is the feature vector defined on the arc $(h, m)$, $\phi_h$ and $\phi_m$ are feature vectors defined on words $h$ and $m$ respectively, $\phi_d$ is a vector indicating the distance between $h$ and $m$, and $\gamma \in [0, 1]$ is a hyper-parameter controlling the relative importance of the two terms. The model parameters are the vector $\theta$ and the tensor $A$.
Assuming $A$ has low rank $r$, it is explicitly represented using the Kruskal form $A = \sum_{i=1}^{r} U(i,:) \otimes V(i,:) \otimes W(i,:)$, where $U \in \mathbb{R}^{r \times n}$, $V \in \mathbb{R}^{r \times n}$ and $W \in \mathbb{R}^{r \times d}$, with $n$ and $d$ the dimensions of the word feature vectors and the distance vector. Under this form, the tensor term of the arc score equals $\sum_{i=1}^{r} [U\phi_h]_i \, [V\phi_m]_i \, [W\phi_d]_i$, so the full $n \times n \times d$ tensor never has to be materialized.
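For concreteness, here is a minimal NumPy sketch of the arc and tree scores under this factorization (function names are ours, and dense feature vectors are assumed for simplicity, although such feature vectors are typically sparse in practice):

```python
import numpy as np

def arc_score(theta, U, V, W, phi_arc, phi_h, phi_m, phi_d, gamma):
    """Score one (head, modifier) arc."""
    # Traditional first-order term: inner product of theta with the arc features.
    traditional = theta @ phi_arc
    # Low-rank tensor term: sum_i [U phi_h]_i [V phi_m]_i [W phi_d]_i,
    # computed from the factors without materializing the n x n x d tensor A.
    tensor = np.sum((U @ phi_h) * (V @ phi_m) * (W @ phi_d))
    return gamma * traditional + (1 - gamma) * tensor

def tree_score(theta, U, V, W, arcs, gamma):
    """First-order tree score: the sum of the scores of its arcs.
    `arcs` holds per-arc feature tuples (phi_arc, phi_h, phi_m, phi_d)."""
    return sum(arc_score(theta, U, V, W, *feats, gamma=gamma) for feats in arcs)
```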
We use passive-aggressive online updates to learn the model parameters in an "alternating" fashion: we update $\theta$ jointly with one of the factor matrices $U$, $V$, $W$ while keeping the other two fixed. For each training sentence $x$ and its annotated parse tree $y$, we solve one of the following optimization problems:
$$\min_{\theta', U'} \;\; \frac{1}{2}\|\theta' - \theta\|^2 + \frac{1}{2}\|U' - U\|^2 + C\xi \qquad \text{s.t.}\;\; S_{\theta',U'}(x, y) \ge S_{\theta',U'}(x, \tilde{y}) + \|y - \tilde{y}\|_1 - \xi, \;\; \xi \ge 0 \tag{1}$$
(and analogously with $V'$ or $W'$ in place of $U'$), where $\tilde{y}$ is the highest-scoring tree under the current parameters and $\|y - \tilde{y}\|_1$ counts the arcs on which the two trees differ.
The above optimization problem (1) has the closed-form solution
$$\theta' = \theta + \alpha \, d\theta, \qquad U' = U + \alpha \, dU,$$
where
$$\alpha = \min\left\{ C, \;\; \frac{\max\left\{0, \; S(x, \tilde{y}) + \|y - \tilde{y}\|_1 - S(x, y)\right\}}{\|d\theta\|^2 + \|dU\|^2} \right\},$$
$$d\theta = \gamma \sum_{(h,m) \in y} \phi_{h,m} \; - \; \gamma \sum_{(h,m) \in \tilde{y}} \phi_{h,m},$$
$$dU = (1-\gamma) \sum_{(h,m) \in y} \left( V\phi_m \odot W\phi_d \right) \phi_h^\top \; - \; (1-\gamma) \sum_{(h,m) \in \tilde{y}} \left( V\phi_m \odot W\phi_d \right) \phi_h^\top.$$
($\odot$ denotes the element-wise product, i.e. $(a \odot b)_i = a_i \, b_i$.)
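A sketch of this joint update of $\theta$ and $U$ (the function name and per-arc feature bookkeeping are our assumptions; the hinge loss is assumed to be computed beforehand from the scores of $y$ and $\tilde{y}$):

```python
import numpy as np

def pa_update_theta_U(theta, U, V, W, gold_arcs, pred_arcs, loss, gamma, C):
    """One passive-aggressive step updating theta and U jointly (V, W fixed).

    gold_arcs / pred_arcs: per-arc feature tuples (phi_arc, phi_h, phi_m, phi_d)
    for the gold tree y and the prediction y~.
    loss: the hinge loss S(x, y~) + ||y - y~||_1 - S(x, y).
    """
    d_theta = np.zeros_like(theta)
    dU = np.zeros_like(U)
    for arcs, sign in ((gold_arcs, 1.0), (pred_arcs, -1.0)):
        for phi_arc, phi_h, phi_m, phi_d in arcs:
            d_theta += sign * gamma * phi_arc
            # The tensor score is linear in U: its gradient for one arc is
            # the outer product ((V phi_m) ⊙ (W phi_d)) phi_h^T.
            dU += sign * (1.0 - gamma) * np.outer((V @ phi_m) * (W @ phi_d), phi_h)
    # PA step size: loss over the squared norm of the direction, clipped at C.
    denom = np.sum(d_theta ** 2) + np.sum(dU ** 2)
    alpha = min(C, max(0.0, loss) / denom) if denom > 0 else 0.0
    return theta + alpha * d_theta, U + alpha * dU
```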
Similar to the MST parser and the Turbo parser, we average parameters when evaluating/testing the model, averaging $\theta$, $U$, $V$ and $W$ independently. Although the effect of averaging $U$, $V$ and $W$ separately in the product term $\sum_i [U\phi_h]_i [V\phi_m]_i [W\phi_d]_i$ is not very well understood (the average of products is not the product of averages), we find that it works well empirically.
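A minimal sketch of this independent averaging, assuming a plain running average over all online updates (the exact averaging schedule is our assumption, not specified above):

```python
import numpy as np

class AveragedParams:
    """Running averages of theta, U, V, W, each averaged independently."""

    def __init__(self, params):
        # params: dict like {"theta": ..., "U": ..., "V": ..., "W": ...}
        self.sums = {k: np.zeros_like(v) for k, v in params.items()}
        self.count = 0

    def accumulate(self, params):
        # Call once after each passive-aggressive update.
        for k, v in params.items():
            self.sums[k] += v
        self.count += 1

    def average(self):
        # Averaged parameters used at evaluation / test time.
        return {k: s / self.count for k, s in self.sums.items()}
```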