Manual backward copies generated derivatives

Func f_filter("f_filter");
f_filter(x, ci, co) = filter(x, ci, co);

Expr width = input.dim(0).extent();
Expr kw = filter.dim(0).extent();
Expr out_chans = d_output.dim(1).extent();
Expr in_chans = input.dim(1).extent();
Expr bsize = input.dim(2).extent();

Func f_output_1_d_def("f_output_1_d_def");
f_output_1_d_def(x, co, n) = d_output(x, co, n);
Func re1("re1");
// d_output's second dimension is co, so it is bounded by out_chans
re1(x, co, n) = Halide::BoundaryConditions::repeat_edge(f_output_1_d_def,
    {{0, width}, {0, out_chans}, {0, bsize}})(x, co, n);
Func ce("ce");
ce(x, co, n) = Halide::BoundaryConditions::constant_exterior(re1, 0.0f,
    {{0, width}, {0, out_chans}, {0, bsize}})(x, co, n);
Func f_output_1_d("f_output_1_d");
f_output_1_d(x, co, n) = ce(x, co, n);

Func f_input_0_d_def("f_input_0_d_def");
RDom r(0, out_chans, 0, kw);
f_input_0_d_def(x, ci, n) = 0.0f;
f_input_0_d_def(x, ci, n) = f_input_0_d_def(x, ci, n) + 
  f_output_1_d(x + kw/2 - r.y, r.x, n)*f_filter(r.y, ci, r.x);

Func re2("re2");
re2(x, ci, n) = Halide::BoundaryConditions::repeat_edge(f_input_0_d_def,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, ci, n);
Func ce1("ce1");
ce1(x, ci, n) = Halide::BoundaryConditions::constant_exterior(re2, 0.0f,
    {{0, width}, {0, in_chans}, {0, bsize}})(x, ci, n);
Func f_input_0_d("f_input_0_d");
f_input_0_d(x, ci, n) = ce1(x, ci, n);
d_input(x, ci, n) = f_input_0_d(x, ci, n);
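
For checking the pipeline above on small inputs, the reduction it expresses can be written as plain loops. This is only a sketch under assumptions: Tensor3 and conv1d_backward_input_ref are hypothetical names (not part of Halide), dimension 0 is taken to be innermost in memory, d_output is assumed to have the same width as input, and reads of d_output outside [0, width) are treated as zero, which is what the constant_exterior wrapper yields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 3-D float tensor; dimension 0 is innermost (assumed layout).
struct Tensor3 {
    int d0, d1, d2;
    std::vector<float> data;
    Tensor3(int a, int b, int c)
        : d0(a), d1(b), d2(c),
          data(static_cast<std::size_t>(a) * b * c, 0.0f) {}
    float &at(int i, int j, int k) {
        return data[(static_cast<std::size_t>(k) * d1 + j) * d0 + i];
    }
    float at(int i, int j, int k) const {
        return data[(static_cast<std::size_t>(k) * d1 + j) * d0 + i];
    }
};

// d_output: (width, out_chans, bsize); filter: (kw, in_chans, out_chans).
// Returns d_input: (width, in_chans, bsize).
Tensor3 conv1d_backward_input_ref(const Tensor3 &d_output, const Tensor3 &filter) {
    assert(d_output.d1 == filter.d2);  // out_chans must agree
    const int width = d_output.d0, out_chans = d_output.d1, bsize = d_output.d2;
    const int kw = filter.d0, in_chans = filter.d1;
    Tensor3 d_input(width, in_chans, bsize);
    for (int n = 0; n < bsize; n++) {
        for (int ci = 0; ci < in_chans; ci++) {
            for (int x = 0; x < width; x++) {
                float acc = 0.0f;
                // Same reduction domain as RDom r(0, out_chans, 0, kw).
                for (int k = 0; k < kw; k++) {
                    const int xo = x + kw / 2 - k;
                    if (xo < 0 || xo >= width) continue;  // zero exterior
                    for (int co = 0; co < out_chans; co++) {
                        acc += d_output.at(xo, co, n) * filter.at(k, ci, co);
                    }
                }
                d_input.at(x, ci, n) = acc;
            }
        }
    }
    return d_input;
}
```

With width 3, kw 3, one channel each way, d_output = (1, 2, 3) and filter = (1, 10, 100), this returns (12, 123, 230).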

Schedule for all versions:

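Var nc("nc");
// d_input: fuse the n and ci loops into nc, run nc in parallel, vectorize x by 8
d_input
    .compute_root()
    .fuse(n, ci, nc)
    .parallel(nc)
    .vectorize(x, 8);
// f_input_0_d_def: computed inside d_input's x loop; vectorize the init stage...
f_input_0_d_def
    .compute_at(d_input, x)
    .vectorize(x, 8);
// ...and the update (reduction) stage
f_input_0_d_def
    .update()
    .vectorize(x, 8);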
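
In fuse(n, ci, nc), n is the inner and ci the outer of the two fused loops, so consecutive nc values step through the batch first. A minimal plain C++ sketch of how a fused index maps back to the two original indices, assuming both loops start at 0; unfuse_nc is a hypothetical helper for illustration, not a Halide function:

```cpp
#include <cassert>
#include <utility>

// Recover {n, ci} from the fused index nc, where n (extent bsize) is the
// inner loop and ci is the outer loop. Assumes both loops have min 0.
std::pair<int, int> unfuse_nc(int nc, int bsize) {
    assert(bsize > 0 && nc >= 0);
    const int n = nc % bsize;   // inner index
    const int ci = nc / bsize;  // outer index
    return {n, ci};
}
```

For bsize = 3, nc = 4 corresponds to n = 1, ci = 1. Each parallel task therefore handles one (n, ci) row of d_input, vectorized along x.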

conv1d_forward
total time: 2425.257080 ms samples: 2202 runs: 20 time/run: 121.262856 ms
average threads used: 7.500000
heap allocations: 0 peak heap usage: 0 bytes
output: 1.706ms (1%) threads: 4.272
f_output: 119.556ms (98%) threads: 7.549 stack: 32

conv1d_manual_backward
total time: 2598.328125 ms samples: 2398 runs: 10 time/run: 259.832825 ms
average threads used: 7.821935
heap allocations: 5242880 peak heap usage: 256 bytes
d_input: 71.243ms (27%) threads: 7.860
f_input_0_d_def: 188.589ms (72%) threads: 7.806 peak: 256 num: 5242880 avg: 32

conv1d_backward
total time: 2604.567383 ms samples: 2408 runs: 10 time/run: 260.456726 ms
average threads used: 7.843854
heap allocations: 5242880 peak heap usage: 256 bytes
d_input: 69.941ms (26%) threads: 7.849
f_input_0_d_def: 190.515ms (73%) threads: 7.841 peak: 256 num: 5242880 avg: 32

Manual backward does not wrap d_input in repeat_edge + constant_exterior

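Gain about 68 ms per run in the d_input stage (71.243 ms -> 3.078 ms); conv1d_manual_backward drops from 259.8 ms to 167.7 ms per run.
f_input_0_d_def now allocates on the stack (stack: 32), not on the heap (heap allocations: 5242880 -> 0).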

Func f_filter("f_filter");
f_filter(x, ci, co) = filter(x, ci, co);

Expr width = input.dim(0).extent();
Expr kw = filter.dim(0).extent();
Expr out_chans = d_output.dim(1).extent();
Expr in_chans = input.dim(1).extent();
Expr bsize = input.dim(2).extent();

Func f_output_1_d_def("f_output_1_d_def");
f_output_1_d_def(x, co, n) = d_output(x, co, n);
Func re1("re1");
// d_output's second dimension is co, so it is bounded by out_chans
re1(x, co, n) = Halide::BoundaryConditions::repeat_edge(f_output_1_d_def,
    {{0, width}, {0, out_chans}, {0, bsize}})(x, co, n);
Func ce("ce");
ce(x, co, n) = Halide::BoundaryConditions::constant_exterior(re1, 0.0f,
    {{0, width}, {0, out_chans}, {0, bsize}})(x, co, n);
Func f_output_1_d("f_output_1_d");
f_output_1_d(x, co, n) = ce(x, co, n);

Func f_input_0_d_def("f_input_0_d_def");
RDom r(0, out_chans, 0, kw);
f_input_0_d_def(x, ci, n) = 0.0f;
f_input_0_d_def(x, ci, n) = f_input_0_d_def(x, ci, n) + 
  f_output_1_d(x + kw/2 - r.y, r.x, n)*f_filter(r.y, ci, r.x);

d_input(x, ci, n) = f_input_0_d_def(x, ci, n);
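
Assigning d_input directly should not change any value inside the output bounds, because repeat_edge followed by constant_exterior over the same bounds only alters reads outside those bounds. A 1-D plain C++ sketch of that reasoning; repeat_edge_1d and wrapped_1d are hypothetical stand-ins for the Halide boundary conditions, not their actual implementations:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for repeat_edge: clamp the coordinate into [0, extent).
float repeat_edge_1d(const std::vector<float> &f, int x) {
    assert(!f.empty());
    const int last = static_cast<int>(f.size()) - 1;
    const int xc = std::min(std::max(x, 0), last);
    return f[xc];
}

// Stand-in for constant_exterior applied on top of repeat_edge with the
// same bounds: outside [0, extent) the constant wins, inside the value
// passes through untouched.
float wrapped_1d(const std::vector<float> &f, int x, float exterior = 0.0f) {
    if (x < 0 || x >= static_cast<int>(f.size())) return exterior;
    return repeat_edge_1d(f, x);
}
```

For every in-bounds x, wrapped_1d(f, x) equals f[x], so the wrappers only add work when d_input is realized over exactly the input's extent.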

conv1d_forward
total time: 2305.472412 ms samples: 2106 runs: 20 time/run: 115.273621 ms
average threads used: 7.687559
heap allocations: 0 peak heap usage: 0 bytes
output: 4.174ms (3%) threads: 7.160
f_output: 111.098ms (96%) threads: 7.707 stack: 32

conv1d_manual_backward
total time: 1676.840332 ms samples: 1569 runs: 10 time/run: 167.684036 ms
average threads used: 7.818993
heap allocations: 0 peak heap usage: 0 bytes
d_input: 3.078ms (1%) threads: 7.448
f_input_0_d_def: 164.605ms (98%) threads: 7.825 stack: 32

conv1d_backward
total time: 2450.099365 ms samples: 2278 runs: 10 time/run: 245.009933 ms
average threads used: 7.872695
heap allocations: 5242880 peak heap usage: 256 bytes
d_input: 74.591ms (30%) threads: 7.889
f_input_0_d_def: 170.418ms (69%) threads: 7.865 peak: 256 num: 5242880 avg: 32

Manual backward does not wrap d_input in repeat_edge + constant_exterior