
Problem Formulation

Let $w = (w_1,\dots,w_n)$ denote a sentence. We represent each word $w_i$ and its context as a feature vector $\phi(w,i)$ of dimension $d$ (the concrete feature templates are left open). This feature vector should generalize to unseen words.
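
As a purely illustrative example of what $\phi(w,i)$ could look like, the sketch below hashes a few indicator features of the word and its neighbouring tags into a fixed $d$-dimensional vector; the feature templates, the hashing trick, and the dimension are assumptions of this sketch, not part of the formulation.

```python
import zlib
import numpy as np

D = 2 ** 12  # feature dimension d (illustrative choice)

def phi(words, tags, i, d=D):
    """One possible instantiation of phi(w, i): hashed indicator features for
    the word, its POS tag, and the neighbouring tags.  Hashing keeps the
    dimension fixed, so unseen words still map onto existing coordinates."""
    v = np.zeros(d)
    feats = [
        "word=" + words[i].lower(),
        "tag=" + tags[i],
        "prev_tag=" + (tags[i - 1] if i > 0 else "<s>"),
        "next_tag=" + (tags[i + 1] if i + 1 < len(tags) else "</s>"),
    ]
    for f in feats:
        v[zlib.crc32(f.encode()) % d] = 1.0
    return v
```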

Each annotated sentence $w$ has an associated set of arcs $(i\rightarrow j)$, which we denote $T(w)$. Let $W$ be the annotated corpus. The function that maps an arc to a score is given by

$$
f(i, j, w) = \phi(w,i)^T A \phi(w,j) + \eta^T d(i,j) = \text{tr}(A\, \phi(w,j) \phi(w,i)^T) + \eta^T d(i,j)
$$

where $A\in \mathbb{R}^{d\times d}$ and $d(i,j)$ is a feature vector depending on the distance between words $i$ and $j$. Let $\mathcal{Y}(w)$ be the set of feasible parse trees for sentence $w$. A parse tree $Y(w)\in \mathcal{Y}(w)$ is a collection of arcs, and its score is

$$
s(Y(w)) = \sum_{(i,j) \in Y(w)} f(i,j, w) = \text{tr}(A \sum_{(i,j) \in Y(w)} \phi(w,j) \phi(w,i)^T) + \eta^T \sum_{(i,j) \in Y(w)}d(i,j) 
$$
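
A minimal sketch of the arc score $f(i,j,w)$ and the tree score $s(Y(w))$, assuming the feature vectors $\phi(w,k)$ and the distance features $d(i,j)$ are available through placeholder names `Phi[k]` and `dist(i, j)`:

```python
import numpy as np

def arc_score(phi_i, phi_j, A, eta, d_ij):
    """f(i, j, w) = phi(w,i)^T A phi(w,j) + eta^T d(i,j)."""
    return phi_i @ A @ phi_j + eta @ d_ij

def tree_score(arcs, Phi, A, eta, dist):
    """s(Y(w)) = sum of arc scores over the arcs (i, j) of the tree.
    Phi[k] is phi(w, k); dist(i, j) returns the distance features d(i, j)."""
    return sum(arc_score(Phi[i], Phi[j], A, eta, dist(i, j)) for (i, j) in arcs)
```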

Let $\tilde{Y}(w) = \text{argmax}_{Y(w) \in \mathcal{Y}(w)} s(Y(w))$ denote the highest-scoring tree. Given $m$ sentences and their dependency parses $(\hat{w}^i, \hat{Y}^i(w))$, we solve the following optimization problem:

\begin{eqnarray*}
 &\min_{A,\eta,\xi}&  \lambda_A\|A\|_* + \lambda_{\eta} \|\eta\|^2 + \sum_{i=1}^m \xi_i \nonumber \\
 &s.t.  & s(\hat{Y}^i(w)) \geq s( \tilde{Y}^i(w))  + F(\hat{Y}^i(w), \tilde{Y}^i(w))- \xi_i, \;\; \xi_i \geq 0, \;\; \forall i
\end{eqnarray*}

where $F(\cdot, \cdot)$ is a measure of the difference between two trees, such as the Hamming distance. It prevents the trivial solution that assigns $0$ to all parameters.
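
One common choice for $F$ is the number of words that receive a different head in the two trees; a small sketch, assuming arcs are stored as (head, modifier) pairs:

```python
def hamming_loss(gold_arcs, pred_arcs):
    """F(Y_gold, Y_pred): number of words whose predicted head differs from
    the gold head.  Arcs are (head, modifier) pairs."""
    gold_head = {m: h for (h, m) in gold_arcs}
    pred_head = {m: h for (h, m) in pred_arcs}
    return sum(1 for m in gold_head if pred_head.get(m) != gold_head[m])
```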




Primal-dual problem

In this section, we show how to obtain the dual problem. We start with the matrix case. Introducing a dual variable $\alpha_i$ for each sentence constraint, we obtain the Lagrangian:

$$
 \mathcal{L} (\alpha, A, \xi) = \lambda_A \|A\|_* + \lambda_{\eta} \|\eta\|^2 + \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i (s(\hat{Y}^i(w)) - s(\tilde{Y}^i(w)) - F(\hat{Y}^i(w), \tilde{Y}^i(w)) + \xi_i)
$$

The original optimization problem is the primal problem:

$$ 
\min_{A,\eta, \xi\geq 0} \ \max_{\alpha\geq 0} \ \mathcal{L}(\alpha, A, \xi) 
$$

Strong duality holds for this optimization problem since Slater's condition is satisfied. We can therefore solve the dual form instead:

$$ 
\max_{\alpha\geq 0} \ \min_{A, \eta, \xi\geq 0} \ \mathcal{L}(\alpha, A, \xi) 
$$

On the other hand, the trace norm has the variational representation:

$$
\|A\|_* = \max_{Q} \text{tr}(Q^TA) \;\; \text{s.t.} \;\; \|Q\|_2 \leq 1
$$

where $\|Q\|_2$ denotes the maximum singular value of $Q$. Plugging this representation of the trace norm into the Lagrangian, we get

$$
\mathcal{L}(\alpha, A, \xi) = \lambda_A \text{tr}(Q^TA) + \lambda_{\eta} \|\eta\|^2 + \sum_{i=1}^m \xi_i - \sum_{i=1}^m  \alpha_i (s(\hat{Y}^i(w)) - s( \tilde{Y}^i(w)) - F(\hat{Y}^i(w), \tilde{Y}^i(w)) + \xi_i)
$$
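
As a quick numerical sanity check of the trace-norm representation used above (an aside, not part of the derivation): the maximum is attained at $Q = UV^T$, where $A = U\Sigma V^T$ is the SVD, and then $\text{tr}(Q^TA)$ equals the sum of the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(A)
Q = U @ Vt                                       # maximizer Q = U V^T, ||Q||_2 = 1
assert np.isclose(np.trace(Q.T @ A), s.sum())    # tr(Q^T A) = ||A||_* = sum of singular values
assert np.isclose(np.linalg.norm(Q, 2), 1.0)     # spectral norm of Q is 1
```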

For simplicity, we define the following auxiliary variables:

\begin{eqnarray*}
F_i  &=& F(\hat{Y}^i(w), \tilde{Y}^i(w)) \nonumber \\
e_i  &=& \sum_{ (j,k) \in \hat{Y}^i(w)} d(j, k) - \sum_{ (j,k) \in \tilde{Y}^i(w)} d(j, k) \nonumber  \\
h_i  & = & \sum_{ (j,k) \in \hat{Y}^i(w)} \phi(w,k) \phi(w,j)^T - \sum_{ (j,k) \in \tilde{Y}^i(w)} \phi(w,k) \phi(w,j)^T
\end{eqnarray*}
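
A sketch of how these per-sentence statistics could be collected, reusing the placeholder `Phi`, `dist`, and loss functions from the earlier sketches:

```python
import numpy as np

def dual_statistics(gold_arcs, pred_arcs, Phi, dist, loss):
    """Per-sentence quantities used in the dual:
       F_i : loss between the gold and predicted trees
       e_i : difference of summed distance features
       h_i : difference of summed outer products phi(w,k) phi(w,j)^T."""
    F_i = loss(gold_arcs, pred_arcs)
    e_i = sum(dist(j, k) for (j, k) in gold_arcs) \
        - sum(dist(j, k) for (j, k) in pred_arcs)
    h_i = sum(np.outer(Phi[k], Phi[j]) for (j, k) in gold_arcs) \
        - sum(np.outer(Phi[k], Phi[j]) for (j, k) in pred_arcs)
    return F_i, e_i, h_i
```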

Taking derivatives with respect to $A$ and $\eta$, setting them to $0$, and eliminating $\xi$ yields

\begin{eqnarray*}
Q  &=& \frac{\sum_i \alpha_i h_i}{\lambda_A} \\
\eta &=& \frac{\sum_i \alpha_i e_i}{2\lambda_\eta}
\end{eqnarray*}

The dual problem is then given by,

\begin{eqnarray*}
 \max_{\alpha} &&\sum_i \alpha_i F_i-\frac{\|\sum_i \alpha_i e_i\|^2}{4\lambda_\eta} \nonumber \\
 s.t. && 0 \leq \alpha_i \leq 1 ,\;\;  \|\sum_i \alpha_i h_i\|_{2} \leq \lambda_A
\end{eqnarray*}
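
To spell out the two steps compressed above: the Lagrangian is linear in each $\xi_i$ with coefficient $(1-\alpha_i)$, so the inner minimization over $\xi_i \geq 0$ is bounded only when $\alpha_i \leq 1$ (and the term then vanishes), which gives the box constraint; substituting $\eta = \sum_i \alpha_i e_i / (2\lambda_\eta)$ back into the objective gives

\begin{eqnarray*}
\lambda_\eta \|\eta\|^2 - \eta^T \sum_i \alpha_i e_i &=& \frac{\|\sum_i \alpha_i e_i\|^2}{4\lambda_\eta} - \frac{\|\sum_i \alpha_i e_i\|^2}{2\lambda_\eta} \;=\; -\frac{\|\sum_i \alpha_i e_i\|^2}{4\lambda_\eta},
\end{eqnarray*}

which is the quadratic term in the dual objective. The $A$-dependent terms cancel at the stationary point, leaving only the norm constraint $\|\sum_i \alpha_i h_i\|_2 \leq \lambda_A$ induced by $\|Q\|_2 \leq 1$.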

This is again a convex optimization problem: the objective is a concave quadratic in $\alpha$, and the constraints define a convex feasible set.
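
Since the dual is a small convex program, one way to prototype it before implementing the cutting-plane algorithm of the next section is to hand it to an off-the-shelf solver. Below is a minimal sketch with cvxpy (the spectral-norm constraint makes the solver reformulate the problem as an SDP, which is fine at prototype scale); the arrays `F`, `E`, `H` holding the $F_i$, $e_i$, $h_i$ statistics are assumptions of this sketch.

```python
import cvxpy as cp

def solve_dual(F, E, H, lam_A, lam_eta):
    """F: (m,) losses F_i, E: (m, p) rows e_i, H: list of m (d, d) matrices h_i."""
    m = len(F)
    alpha = cp.Variable(m)
    M = sum(alpha[i] * H[i] for i in range(m))        # sum_i alpha_i h_i
    objective = cp.Maximize(F @ alpha - cp.sum_squares(E.T @ alpha) / (4 * lam_eta))
    constraints = [alpha >= 0, alpha <= 1, cp.sigma_max(M) <= lam_A]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```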




Primal-dual algorithm

The spectral-norm constraint can be written explicitly as

$$
\max_{\|u\|=\|v\| = 1} u^T (\sum_i \alpha_i h_i) v \leq \lambda_A
$$

where $u$ and $v$ are unit-norm vectors. The inequality must hold for all such $u$ and $v$, so there are in fact infinitely many constraints. To handle this, we use the cutting plane method, which is widely used in linear programming.

The idea behind the cutting plane method is that only a few constraints are typically needed to find the global optimum. The method starts from an empty constraint set. In each iteration, the optimization problem is solved under the current constraint set; then the constraint most violated by the current solution is added to the set. For the constraint above, the most violated $(u, v)$ pair is given by the leading left and right singular vectors of $\sum_i \alpha_i h_i$. The process continues until no violated constraint can be found, at which point the solution is provably optimal, as sketched below.
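
A sketch of this loop for the dual above; the restricted problems are solved with cvxpy as in the earlier sketch, and the separation step uses the fact that $\max_{\|u\|=\|v\|=1} u^T M v$ is the largest singular value of $M$.

```python
import cvxpy as cp
import numpy as np

def most_violated_pair(H, alpha):
    """max_{||u||=||v||=1} u^T M v with M = sum_i alpha_i h_i is attained at
    the leading singular-vector pair of M."""
    M = sum(a * h for a, h in zip(alpha, H))
    U, s, Vt = np.linalg.svd(M)
    return U[:, 0], Vt[0], s[0]                     # u, v, and u^T M v

def solve_restricted_dual(F, E, H, lam_A, lam_eta, cuts):
    """Dual problem with the spectral-norm constraint replaced by the linear
    cuts u^T (sum_i alpha_i h_i) v <= lam_A, one per (u, v) in `cuts`."""
    m = len(F)
    alpha = cp.Variable(m)
    objective = cp.Maximize(F @ alpha - cp.sum_squares(E.T @ alpha) / (4 * lam_eta))
    constraints = [alpha >= 0, alpha <= 1]
    for (u, v) in cuts:
        coeffs = np.array([u @ h @ v for h in H])   # u^T h_i v, one scalar per sentence
        constraints.append(coeffs @ alpha <= lam_A)
    cp.Problem(objective, constraints).solve()
    return alpha.value

def cutting_plane(F, E, H, lam_A, lam_eta, tol=1e-6, max_iter=100):
    """Solve under the current cuts, add the most violated constraint, repeat."""
    cuts = []
    for _ in range(max_iter):
        alpha = solve_restricted_dual(F, E, H, lam_A, lam_eta, cuts)
        u, v, val = most_violated_pair(H, alpha)
        if val <= lam_A + tol:                      # no violated constraint left
            break
        cuts.append((u, v))
    return alpha
```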





