project:hillclimb_dp:draft
<WRAP clear></WRAP>
  
Note that (1) the algorithm initializes $y^{(0)}$ from a distribution $Q$; (2) it repeatedly checks each word in **//reverse DFS order//** (so a word that is far from the root comes first); (3) when changing the head of a word $m$, it keeps $y^{(t)}$ a valid tree at every step $t$; and (4) it iterates until no single change can increase the score.
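To make the procedure concrete, below is a minimal, self-contained sketch of one way such a hill-climbing pass could look. It is illustrative rather than the actual implementation: it assumes an arc-factored score (the score of a tree is the sum of per-arc scores), uses a uniformly random valid tree as a stand-in for the draw from $Q$, approximates the reverse DFS order by visiting deeper words first, and all names are made up for the example.

<code python>
import random


def is_tree(heads):
    # Every word must reach the artificial root (index 0) without a cycle.
    for m in range(1, len(heads)):
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h]
    return True


def random_tree(n):
    # Rejection-sample a valid initial tree y^(0); a stand-in for a draw from Q.
    while True:
        heads = [0] + [random.randrange(n) for _ in range(1, n)]
        if is_tree(heads):
            return heads


def climb_order(heads):
    # Deepest words first, approximating the reverse DFS order in the text.
    def depth(m):
        d = 0
        while m != 0:
            m, d = heads[m], d + 1
        return d
    return sorted(range(1, len(heads)), key=depth, reverse=True)


def score(scores, heads):
    # Arc-factored score: S(x, y) = sum over words of scores[head][word].
    return sum(scores[heads[m]][m] for m in range(1, len(heads)))


def hill_climb(scores, n):
    heads = random_tree(n)                      # y^(0)
    improved = True
    while improved:                             # stop when no change helps
        improved = False
        for m in climb_order(heads):            # far-from-root words first
            best_h, best_s = heads[m], score(scores, heads)
            for h in range(n):                  # try every alternative head
                if h == m:
                    continue
                cand = heads[:m] + [h] + heads[m + 1:]
                if not is_tree(cand):           # keep y^(t) a valid tree
                    continue
                s = score(scores, cand)
                if s > best_s:
                    best_h, best_s = h, s
            if best_h != heads[m]:
                heads[m] = best_h
                improved = True
    return heads


# Toy usage: 3 words plus the artificial root (index 0).
random.seed(0)
scores = [[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]
print(hill_climb(scores, 4))
</code>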
  
\\
$$
\sum_{i=1}^{N} \mathcal{L}(\hat{x_i},\hat{y_i}) + \|\theta\|_2^2
$$
If we are to minimize the loss that corresponds to condition (2), it would be:
$$
\mathcal{L}(\hat{x_i},\hat{y_i}) = \max_{y,y'} \left\lbrace S(\hat{x_i}, y') + 1 - S(\hat{x_i}, y) \right\rbrace \quad\quad\quad (2o)
$$
Unfortunately, it is not clear whether this loss (and the maximizing pair $y$ and $y'$) can be computed efficiently. Instead, we can use the standard structural learning loss function, which corresponds to condition (1):
$$
\mathcal{L}(\hat{x_i},\hat{y_i}) = \max_{\tilde{y}} \left\lbrace S(\hat{x_i}, \tilde{y}) + \text{err}(\tilde{y}) \right\rbrace - S(\hat{x_i}, \hat{y_i}) \quad\quad (1o)
$$
where $\tilde{y}$ is the maximizer above, i.e., the output of the decoding algorithm.
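For intuition about what (1o) asks for, here is a small, self-contained sketch that evaluates it for a single toy sentence. It is illustrative only, not the training code: it assumes an arc-factored score, uses the Hamming distance to the gold heads as $\text{err}$, and replaces the decoder by brute-force enumeration over all trees, which is feasible only for very short sentences.

<code python>
from itertools import product
import random


def is_tree(heads):
    # Every word must reach the artificial root (index 0) without a cycle.
    for m in range(1, len(heads)):
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h]
    return True


def hinge_loss_1o(scores, gold_heads):
    """Loss (1o): max_y { S(x, y) + err(y) } - S(x, gold), by brute force."""
    n = len(gold_heads)

    def tree_score(heads):
        return sum(scores[heads[m]][m] for m in range(1, n))

    best = float("-inf")
    # Enumerate every head assignment for words 1..n-1 (word 0 is the root).
    for cand in product(range(n), repeat=n - 1):
        heads = (0,) + cand
        if not is_tree(heads):
            continue
        err = sum(heads[m] != gold_heads[m] for m in range(1, n))
        best = max(best, tree_score(heads) + err)
    return best - tree_score(gold_heads)


# Toy example: 3 words plus the artificial root (index 0).
random.seed(0)
scores = [[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]
gold = (0, 0, 1, 1)  # gold[m] = head of word m; gold[0] is unused
print(hinge_loss_1o(scores, gold))
</code>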
The connection between the two loss functions is the following: condition (1) is a necessary condition for condition (2), and (1o) is a lower bound on the sum of the losses (2o) over each pair of consecutive trees $(y, y')$ on any "climbing" path from $\tilde{y}$ to $\hat{y_i}$. Thus, by minimizing (1o) we push this lower bound of (2o) toward zero. If the lower bound is **//tight//** (because $\text{err}(\tilde{y})$ is small and the path from $\tilde{y}$ to $\hat{y_i}$ is short), then minimizing (1o) effectively minimizes (2o) as well.
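To spell out the lower-bound claim, here is a sketch of the argument under two assumptions that are not stated explicitly above: each climbing move changes a single head, so the length $K$ of the path $\tilde{y} = y_0, y_1, \ldots, y_K = \hat{y_i}$ satisfies $K \ge \text{err}(\tilde{y})$, and each pair in (2o) is read as $y' = y_{k-1}$ and $y = y_k$. Telescoping the sum gives
$$
\sum_{k=1}^{K} \left\lbrace S(\hat{x_i}, y_{k-1}) + 1 - S(\hat{x_i}, y_k) \right\rbrace = S(\hat{x_i}, \tilde{y}) + K - S(\hat{x_i}, \hat{y_i}) \ge S(\hat{x_i}, \tilde{y}) + \text{err}(\tilde{y}) - S(\hat{x_i}, \hat{y_i})
$$
and the right-hand side equals (1o) when $\tilde{y}$ attains the maximum there; the slack is $K - \text{err}(\tilde{y})$, which vanishes on a shortest path.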