## Manual backward copies generated derivatives

``````
Func f_filter("f_filter");
f_filter(x, ci, co) = filter(x, ci, co);

Expr width = input.dim(0).extent();
Expr kw = filter.dim(0).extent();
Expr out_chans = d_output.dim(1).extent();
Expr in_chans = input.dim(1).extent();
Expr bsize = input.dim(2).extent();

// Adjoint of the output, wrapped in repeat_edge and then constant_exterior.
// NOTE: dimension 1 of d_output is indexed by co, but the bounds below use
// in_chans; the two only agree when in_chans == out_chans.
Func f_output_1_d_def("f_output_1_d_def");
f_output_1_d_def(x, co, n) = d_output(x, co, n);
Func re1("re1");
re1(x, co, n) = Halide::BoundaryConditions::repeat_edge(f_output_1_d_def,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, co, n);
Func ce("ce");
ce(x, co, n) = Halide::BoundaryConditions::constant_exterior(re1, 0.0f,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, co, n);
Func f_output_1_d("f_output_1_d");
f_output_1_d(x, co, n) = ce(x, co, n);

// Adjoint of the input: reduce over output channels (r.x) and filter taps (r.y).
Func f_input_0_d_def("f_input_0_d_def");
RDom r(0, out_chans, 0, kw);
f_input_0_d_def(x, ci, n) = 0.0f;
f_input_0_d_def(x, ci, n) = f_input_0_d_def(x, ci, n) +
    f_output_1_d(x + kw/2 - r.y, r.x, n)*f_filter(r.y, ci, r.x);

// The generated code also wraps the input adjoint in both boundary conditions.
Func re2("re2");
re2(x, ci, n) = Halide::BoundaryConditions::repeat_edge(f_input_0_d_def,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, ci, n);
Func ce1("ce1");
ce1(x, ci, n) = Halide::BoundaryConditions::constant_exterior(re2, 0.0f,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, ci, n);
Func f_input_0_d("f_input_0_d");
f_input_0_d(x, ci, n) = ce1(x, ci, n);
d_input(x, ci, n) = f_input_0_d(x, ci, n);
``````
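For checking the Halide pipeline above against something simpler, the reduction it expresses can be sketched as plain nested loops. This is a minimal reference sketch, not the generated code: `Tensor3` and `conv1d_backward_input_ref` are hypothetical names, and the sketch assumes `d_output` has the same width as `input`, that reads of `d_output` outside `[0, width)` contribute zero (the net effect of the `constant_exterior` wrapper), and that the channel axis of `d_output` spans `out_chans`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 3-D float tensor; dimension 0 is innermost in memory.
struct Tensor3 {
    int d0, d1, d2;
    std::vector<float> data;
    Tensor3(int e0, int e1, int e2)
        : d0(e0), d1(e1), d2(e2),
          data(static_cast<std::size_t>(e0) * e1 * e2, 0.0f) {}
    std::size_t index(int i, int j, int k) const {
        return (static_cast<std::size_t>(k) * d1 + j) * d0 + i;
    }
    float &at(int i, int j, int k) { return data[index(i, j, k)]; }
    float at(int i, int j, int k) const { return data[index(i, j, k)]; }
};

// d_output is (width, out_chans, bsize); filter is (kw, in_chans, out_chans).
// Returns d_input with shape (width, in_chans, bsize).
Tensor3 conv1d_backward_input_ref(const Tensor3 &d_output, const Tensor3 &filter) {
    const int width = d_output.d0, out_chans = d_output.d1, bsize = d_output.d2;
    const int kw = filter.d0, in_chans = filter.d1;
    assert(filter.d2 == out_chans);  // filter and d_output must agree on out_chans

    Tensor3 d_input(width, in_chans, bsize);
    for (int n = 0; n < bsize; ++n) {
        for (int ci = 0; ci < in_chans; ++ci) {
            for (int x = 0; x < width; ++x) {
                float acc = 0.0f;
                // Same reduction domain as RDom r(0, out_chans, 0, kw).
                for (int co = 0; co < out_chans; ++co) {
                    for (int k = 0; k < kw; ++k) {
                        const int xo = x + kw / 2 - k;
                        if (xo < 0 || xo >= width) continue;  // zero outside the bounds
                        acc += d_output.at(xo, co, n) * filter.at(k, ci, co);
                    }
                }
                d_input.at(x, ci, n) = acc;
            }
        }
    }
    return d_input;
}
```

Comparing this output with `d_input` on small random buffers is one way to confirm that removing or changing a boundary-condition wrapper leaves the gradient unchanged.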

Schedule for all versions:

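``````
Var nc("nc");
// Fuse the batch and input-channel loops into one parallel loop,
// and vectorize along x in chunks of 8.
d_input
    .compute_root()
    .fuse(n, ci, nc)
    .parallel(nc)
    .vectorize(x, 8);
// Compute the accumulator inside d_input's x loop, vectorizing
// both its pure definition and its update.
f_input_0_d_def
    .compute_at(d_input, x)
    .vectorize(x, 8);
f_input_0_d_def
    .update()
    .vectorize(x, 8);
``````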
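To make the `fuse(n, ci, nc)` step concrete, the sketch below shows how a fused index maps back to the two original loop variables. It assumes Halide's documented `fuse(inner, outer, fused)` argument order, so `n` is the inner variable and `ci` the outer one, and it assumes both loops start at 0. `FusedIndex`, `unfuse` and `fused_extent` are hypothetical helper names for illustration only.

```cpp
#include <cassert>

// Hypothetical pair of loop indices recovered from the fused variable nc.
struct FusedIndex {
    int n;   // batch index (inner variable of the fuse)
    int ci;  // input-channel index (outer variable of the fuse)
};

// Number of iterations of the fused loop, i.e. of independent parallel tasks.
inline int fused_extent(int bsize, int in_chans) {
    return bsize * in_chans;
}

// Split a fused index back into (n, ci), assuming both loops start at 0
// and the inner loop n has extent bsize.
inline FusedIndex unfuse(int nc, int bsize) {
    assert(bsize > 0 && nc >= 0);
    return FusedIndex{nc % bsize, nc / bsize};
}
```

Under these assumptions the parallel loop runs over `bsize * in_chans` tasks, which gives the thread pool more units of work than parallelizing over `n` or `ci` alone.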

conv1d_forward
total time: 2425.257080 ms samples: 2202 runs: 20 time/run: 121.262856 ms
heap allocations: 0 peak heap usage: 0 bytes
f_output: 119.556ms (98%) threads: 7.549 stack: 32

conv1d_manual_backward
total time: 2598.328125 ms samples: 2398 runs: 10 time/run: 259.832825 ms
heap allocations: 5242880 peak heap usage: 256 bytes
f_input_0_d_def: 188.589ms (72%) threads: 7.806 peak: 256 num: 5242880 avg: 32

conv1d_backward
total time: 2604.567383 ms samples: 2408 runs: 10 time/run: 260.456726 ms
heap allocations: 5242880 peak heap usage: 256 bytes
f_input_0_d_def: 190.515ms (73%) threads: 7.841 peak: 256 num: 5242880 avg: 32

## Manual backward does not wrap d_input in repeat_edge + constant_exterior

Dropping the boundary-condition wrappers around d_input saves about 70 ms in the d_input computation.
f_input_0_d_def is now allocated on the stack instead of the heap, as it should be.
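As an analogy for the stack-versus-heap difference, the sketch below contrasts a scratch accumulator of 8 floats (32 bytes, matching the `avg: 32` reported by the profiler and the `vectorize(x, 8)` width) that is heap-allocated on every call with one whose size is a compile-time constant and therefore lives on the stack. This is plain C++ for illustration, not what Halide emits; `kLanes`, `reduce_rows_heap` and `reduce_rows_stack` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <memory>

constexpr int kLanes = 8;  // assumed to mirror vectorize(x, 8): 8 floats = 32 bytes

// Heap variant: every call pays for one small allocation and one free.
float reduce_rows_heap(const float *src, int rows) {
    assert(rows >= 0);
    std::unique_ptr<float[]> acc(new float[kLanes]());  // zero-initialized
    for (int r = 0; r < rows; ++r)
        for (int l = 0; l < kLanes; ++l)
            acc[l] += src[r * kLanes + l];
    float total = 0.0f;
    for (int l = 0; l < kLanes; ++l) total += acc[l];
    return total;
}

// Stack variant: the size is known at compile time, so no allocator call is needed.
float reduce_rows_stack(const float *src, int rows) {
    assert(rows >= 0);
    std::array<float, kLanes> acc{};  // zero-initialized, lives on the stack
    for (int r = 0; r < rows; ++r)
        for (int l = 0; l < kLanes; ++l)
            acc[l] += src[r * kLanes + l];
    float total = 0.0f;
    for (int l = 0; l < kLanes; ++l) total += acc[l];
    return total;
}
```

Both functions return the same value; the only difference is where the scratch buffer lives, which matters when the buffer is created once per inner-loop tile.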

``````
Func f_filter("f_filter");
f_filter(x, ci, co) = filter(x, ci, co);

Expr width = input.dim(0).extent();
Expr kw = filter.dim(0).extent();
Expr out_chans = d_output.dim(1).extent();
Expr in_chans = input.dim(1).extent();
Expr bsize = input.dim(2).extent();

// Adjoint of the output, still wrapped in repeat_edge and constant_exterior.
// NOTE: dimension 1 of d_output is indexed by co, but the bounds below use
// in_chans; the two only agree when in_chans == out_chans.
Func f_output_1_d_def("f_output_1_d_def");
f_output_1_d_def(x, co, n) = d_output(x, co, n);
Func re1("re1");
re1(x, co, n) = Halide::BoundaryConditions::repeat_edge(f_output_1_d_def,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, co, n);
Func ce("ce");
ce(x, co, n) = Halide::BoundaryConditions::constant_exterior(re1, 0.0f,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, co, n);
Func f_output_1_d("f_output_1_d");
f_output_1_d(x, co, n) = ce(x, co, n);

// Adjoint of the input: reduce over output channels (r.x) and filter taps (r.y).
Func f_input_0_d_def("f_input_0_d_def");
RDom r(0, out_chans, 0, kw);
f_input_0_d_def(x, ci, n) = 0.0f;
f_input_0_d_def(x, ci, n) = f_input_0_d_def(x, ci, n) +
    f_output_1_d(x + kw/2 - r.y, r.x, n)*f_filter(r.y, ci, r.x);

// d_input reads the accumulator directly, with no boundary-condition wrappers.
d_input(x, ci, n) = f_input_0_d_def(x, ci, n);
``````

conv1d_forward
total time: 2305.472412 ms samples: 2106 runs: 20 time/run: 115.273621 ms
heap allocations: 0 peak heap usage: 0 bytes
f_output: 111.098ms (96%) threads: 7.707 stack: 32

conv1d_manual_backward
total time: 1676.840332 ms samples: 1569 runs: 10 time/run: 167.684036 ms