Let $x$ denote a sentence and $\mathcal{Y}(x)$ be the set of possible dependency trees of $x$. In first-order dependency parsing, the score of a tree $y \in \mathcal{Y}(x)$ is the sum of its arc scores,

$$S(x, y) = \sum_{(h,m) \in y} s(h, m),$$

where $(h,m)$ is a (head, modifier) arc in the tree.
We define the arc score as follows,

$$s(h, m) = \gamma \, \langle w, \phi_{(h,m)} \rangle + (1 - \gamma) \, T(\phi_h, \phi_m, \phi_{\delta(h,m)}),$$

where $\phi_{(h,m)}$ is the feature vector defined on arc $(h,m)$, $\phi_h$ and $\phi_m$ are feature vectors defined on words $h$ and $m$ respectively, $\phi_{\delta(h,m)}$ is a vector indicating the distance between $h$ and $m$, and $\gamma$ is a hyper-parameter controlling the importance of the two terms. The model parameters are the vector $w$ and the tensor $T$.
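As a minimal illustration of this scoring function, the sketch below evaluates the arc score with NumPy. The feature dimensions (`p` arc features, `n` word features, `d` distance features) and the random parameters are assumptions for illustration, not values from the model.

```python
import numpy as np

p, n, d = 100, 50, 10         # assumed feature dimensions (illustrative)
gamma = 0.3                   # hyper-parameter weighting the two terms

w = np.random.randn(p)        # weight vector for the traditional arc features
T = np.random.randn(n, n, d)  # third-order score tensor

def arc_score(phi_arc, phi_h, phi_m, phi_dist):
    """gamma * <w, phi_arc> + (1 - gamma) * T(phi_h, phi_m, phi_dist)."""
    linear = w @ phi_arc
    tensor = np.einsum('abc,a,b,c->', T, phi_h, phi_m, phi_dist)
    return gamma * linear + (1 - gamma) * tensor
```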
Assuming $T$ is low-rank, it is explicitly represented in Kruskal form,

$$T = \sum_{i=1}^{r} U(i,:) \otimes V(i,:) \otimes W(i,:),$$

where $U \in \mathbb{R}^{r \times n}$, $V \in \mathbb{R}^{r \times n}$, and $W \in \mathbb{R}^{r \times d}$. Under this form the tensor term of the arc score can be computed as $\sum_{i=1}^{r} [U\phi_h]_i \, [V\phi_m]_i \, [W\phi_{\delta(h,m)}]_i$ without ever materializing $T$.
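A sketch of this low-rank evaluation, with a sanity check against the explicit Kruskal reconstruction of $T$; the rank and dimensions are assumed values for illustration.

```python
import numpy as np

r, n, d = 20, 50, 10                 # assumed rank and feature dimensions
U, V = np.random.randn(r, n), np.random.randn(r, n)
W = np.random.randn(r, d)

def tensor_score(phi_h, phi_m, phi_dist):
    """sum_i [U phi_h]_i * [V phi_m]_i * [W phi_dist]_i -- O(r(n+d)) per arc."""
    return float(np.sum((U @ phi_h) * (V @ phi_m) * (W @ phi_dist)))

# Check: identical to contracting the full tensor T = sum_i U(i,:) x V(i,:) x W(i,:)
T = np.einsum('ia,ib,ic->abc', U, V, W)
phi_h, phi_m, phi_dist = np.random.randn(n), np.random.randn(n), np.random.randn(d)
assert np.isclose(tensor_score(phi_h, phi_m, phi_dist),
                  np.einsum('abc,a,b,c->', T, phi_h, phi_m, phi_dist))
```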
We use passive-aggressive online updates to learn the model parameters in an “alternating” fashion: each update changes one of $w$, $U$, $V$ and $W$ while the others are kept fixed. For each training sentence $x$ and its annotated parse tree $y$, we solve one of the following optimization problems (shown here for $U$; the problems for $w$, $V$ and $W$ are analogous):

$$\min_{\Delta U,\; \xi \ge 0} \;\; \frac{1}{2}\|\Delta U\|^2 + C\xi \quad \text{s.t.} \quad S(x, y) - S(x, \tilde{y}) \ge \|y - \tilde{y}\|_1 - \xi \qquad (1)$$

where $\tilde{y}$ is the highest-scoring tree under the current parameters and $\|y - \tilde{y}\|_1$ is the number of arcs on which the two trees disagree. The above optimization problem (1) has the closed-form solution:

$$\Delta U = \min\left\{ C,\; \frac{S(x, \tilde{y}) - S(x, y) + \|y - \tilde{y}\|_1}{\|dU\|^2} \right\} dU$$

where

$$dU = (1-\gamma) \sum_{(h,m) \in y} \big( (V\phi_m) \odot (W\phi_{\delta(h,m)}) \big)\, \phi_h^{\top} \;-\; (1-\gamma) \sum_{(h,m) \in \tilde{y}} \big( (V\phi_m) \odot (W\phi_{\delta(h,m)}) \big)\, \phi_h^{\top}$$

($\odot$ denotes element-wise product, i.e. $(a \odot b)_i = a_i b_i$.)
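The sketch below implements this update for $U$ as reconstructed above. The arc lists, feature lookups, scores, and margin are hypothetical inputs supplied by the caller, not the parser's actual API.

```python
import numpy as np

def pa_update_U(U, V, W, gold_arcs, pred_arcs, score_gold, score_pred, margin,
                C=1.0, gamma=0.3):
    """Passive-aggressive update of U with w, V, W held fixed.

    gold_arcs / pred_arcs: (phi_h, phi_m, phi_dist) triples for the arcs of
    the gold tree y and the current best tree y~ (hypothetical inputs).
    margin: number of arcs on which y and y~ disagree, i.e. ||y - y~||_1.
    """
    loss = score_pred - score_gold + margin   # hinge loss of problem (1)
    if loss <= 0:                             # constraint already satisfied
        return U
    dU = np.zeros_like(U)
    for phi_h, phi_m, phi_dist in gold_arcs:  # gradient of the tensor term
        dU += np.outer((V @ phi_m) * (W @ phi_dist), phi_h)
    for phi_h, phi_m, phi_dist in pred_arcs:
        dU -= np.outer((V @ phi_m) * (W @ phi_dist), phi_h)
    dU *= (1 - gamma)
    step = min(C, loss / (np.linalg.norm(dU) ** 2 + 1e-12))
    return U + step * dU
```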
Similar to the MST parser and the Turbo parser, we average parameters when evaluating / testing the model. We average $w$, $U$, $V$ and $W$ independently. Although the effect of averaging $U$, $V$ and $W$ in the tensor term $\sum_{i=1}^{r} [U\phi_h]_i \, [V\phi_m]_i \, [W\phi_{\delta(h,m)}]_i$ is not well understood, we find that it works well empirically.
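For completeness, one common way to maintain such an average online is the standard averaged-perceptron bookkeeping, sketched below; the class name and interface are illustrative assumptions, not this parser's code.

```python
import numpy as np

class AveragedParam:
    """Running average of one parameter array over all update steps."""
    def __init__(self, shape):
        self.current = np.zeros(shape)   # parameters used during training
        self.summed = np.zeros(shape)    # sum of parameters after each step
        self.steps = 0

    def apply_update(self, delta):
        self.current += delta
        self.summed += self.current
        self.steps += 1

    def averaged(self):
        # Averaged parameters, used at evaluation / test time.
        return self.summed / max(self.steps, 1)

# w, U, V and W would each be wrapped and averaged independently, e.g.:
# U_avg = AveragedParam((r, n)); U_avg.apply_update(step * dU)
```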