Let $\mathbf{w} = (w_1, \dots, w_n)$ denote a sentence. We represent each word $w_i$ and its context as a feature vector $\phi(w_i) \in \mathbb{R}^d$ (the dimension $d$ is for you to specify). This feature vector should generalize to unseen words.
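As one hedged sketch of such a representation (the hashing scheme, the dimension, and the name `word_features` are illustrative assumptions, not a required design), character n-grams can be hashed into a fixed-size vector, so an unseen word still receives non-trivial features:

```python
import numpy as np

def word_features(word, d=32):
    # Bag of character trigrams hashed into d buckets: an unseen word
    # shares buckets with morphologically similar words seen in training.
    v = np.zeros(d)
    padded = f"^{word}$"                 # mark the word boundaries
    for k in range(len(padded) - 2):
        v[hash(padded[k : k + 3]) % d] += 1.0
    return v
```

Context features (for instance, the same trigram vectors of the neighboring words) can be concatenated onto this vector to represent the word together with its context.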
Each annotated sentence $\mathbf{w}$ has an associated set of arcs $(h, m)$ (head $h$, modifier $m$), and we denote this set $A(\mathbf{w})$. Let $\mathcal{D} = \{(\mathbf{w}_k, A_k)\}_{k=1}^{N}$ be the annotated corpus. A function that maps an arc to a score is given by

$$ f(h, m) = \phi(w_h)^\top W \, \phi(w_m) + \theta^\top \psi(h, m), $$

where $W \in \mathbb{R}^{d \times d}$ is a matrix of bilinear weights, and $\psi(h, m)$, weighted by $\theta$, are features depending on the distance between word $w_h$ and word $w_m$. Let $\mathcal{T}(\mathbf{w})$ be the set of feasible parse trees for sentence $\mathbf{w}$. A tree $T \in \mathcal{T}(\mathbf{w})$ is a collection of arcs that forms a parse tree, and it is associated with the score

$$ S(T) = \sum_{(h, m) \in T} f(h, m). $$
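Under this notation, scoring a tree is just a sum of bilinear arc scores. A minimal sketch (the function names and the random toy data are assumptions for illustration; distance features are omitted for brevity):

```python
import numpy as np

def arc_score(phi_h, phi_m, W):
    # Bilinear part of the arc score: phi(w_h)^T W phi(w_m).
    return phi_h @ W @ phi_m

def tree_score(tree, phi, W):
    # A tree is a collection of (head, modifier) arcs;
    # its score is the sum of its arc scores.
    return sum(arc_score(phi[h], phi[m], W) for h, m in tree)

rng = np.random.default_rng(0)
d = 4
phi = rng.standard_normal((3, d))   # feature vectors for a 3-word sentence
W = rng.standard_normal((d, d))
tree = [(0, 1), (1, 2)]             # arcs head -> modifier
total = tree_score(tree, phi, W)
```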
Let $T_k \in \mathcal{T}(\mathbf{w}_k)$ denote the annotated (gold) tree of sentence $\mathbf{w}_k$. Assume we are given $N$ sentences and their dependency parses $T_1, \dots, T_N$; we try to solve the following optimization problem:

$$ \min_{W,\, \theta} \ \lambda \lVert W \rVert_{*} + \sum_{k=1}^{N} \Big( \max_{T \in \mathcal{T}(\mathbf{w}_k)} \big[ S(T) + \Delta(T, T_k) \big] - S(T_k) \Big), $$

where $\Delta(T, T_k)$ is a measure of the difference between two trees, such as the Hamming distance. It is used to prevent the trivial solution that assigns zero to all parameters.
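For instance, if a parse is encoded as a head vector (`heads[m]` = index of the head of word `m`, a common convention assumed here), the Hamming distance simply counts the words whose head differs, so it is zero only at the annotated tree:

```python
def hamming(heads_a, heads_b):
    # Number of words that receive a different head in the two trees.
    return sum(a != b for a, b in zip(heads_a, heads_b))

gold = [0, 0, 1]   # word 1 attaches to word 0, word 2 to word 1
pred = [0, 2, 1]   # word 1 attaches to word 2 instead
```

Here `hamming(gold, pred)` is 1 and `hamming(gold, gold)` is 0, so the loss-augmented margin vanishes only when the predicted tree matches the annotation.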
In this section, we show how to obtain the dual problem. We start with the matrix case. Introducing a dual variable $\alpha_k$ for each sentence, we form the Lagrangian.
The original optimization problem above plays the role of the primal problem.
Strong duality holds for this optimization problem since Slater's condition (proof) can be satisfied. We can therefore solve its dual form instead.
On the other hand, the trace norm can be represented as (proof)

$$ \lVert W \rVert_{*} = \max_{M \,:\, \sigma_{\max}(M) \le 1} \operatorname{tr}\!\left(W^\top M\right), $$

where $\sigma_{\max}(M)$ is defined as the maximum singular value of $M$. Plugging this representation of $\lVert W \rVert_{*}$ into the objective, we get a joint problem over the primal variables and the auxiliary matrix $M$.
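This representation can be sanity-checked numerically: writing the SVD $W = U \Sigma V^\top$, the maximizer is $M^* = U V^\top$, whose singular values are all 1, and it attains $\operatorname{tr}(W^\top M^*) = \sum_i \sigma_i(W)$. A verification sketch (not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))

# Trace (nuclear) norm: the sum of the singular values of W.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
trace_norm = s.sum()

# The maximizer of tr(W^T M) over sigma_max(M) <= 1 is M* = U V^T,
# whose singular values are all exactly 1.
M_star = U @ Vt
spectral = np.linalg.norm(M_star, 2)   # maximum singular value
attained = np.trace(W.T @ M_star)
```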
For simplicity, we define the following auxiliary variables.
Taking the derivatives with respect to the primal variables, setting them to zero, and eliminating the primal variables yields the dual objective.
The dual problem is then a maximization over the dual variables $\alpha$ and the auxiliary matrix $M$, subject to $\sigma_{\max}(M) \le 1$.
This is again a convex problem: the objective is quadratic, and the constraints form a convex domain.
The max-norm constraint $\sigma_{\max}(M) \le 1$ can be explicitly represented as

$$ u^\top M v \le 1, $$

where $u$ and $v$ are two normalized (unit-length) vectors. The inequality must be satisfied for all normalized $u$ and $v$, so there are in fact infinitely many constraints. To solve this problem, we use the cutting-plane method, which is widely used in linear programming.
The idea behind the cutting-plane method is that only a few constraints are enough to find the global optimum. The method starts from an empty constraint set. In each iteration, the optimization problem is solved under the current constraint set, and the constraint most violated by the current solution is added to the set. The process continues until no violated constraint can be found, at which point the solution can be shown to be optimal.
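For this particular constraint family, finding the most violated constraint is cheap: $\max_{\|u\| = \|v\| = 1} u^\top M v = \sigma_{\max}(M)$, attained at the top singular-vector pair, so the separation oracle is a single SVD. A sketch in the notation above (the function name and tolerance are assumptions):

```python
import numpy as np

def most_violated_constraint(M, tol=1e-9):
    # max over unit u, v of u^T M v equals sigma_max(M), attained at
    # the top singular-vector pair; if it exceeds 1, return that pair
    # as the new cutting plane u^T M v <= 1.
    U, s, Vt = np.linalg.svd(M)
    if s[0] <= 1.0 + tol:
        return None                  # all max-norm constraints hold
    return U[:, 0], Vt[0]

rng = np.random.default_rng(0)
M = 3.0 * rng.standard_normal((4, 4))  # toy iterate violating the bound
cut = most_violated_constraint(M)
```

Each returned pair $(u, v)$ is added to the constraint set and the problem is re-solved; `None` signals that the current iterate is feasible and the loop terminates.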