How to test

  • The pdes/ directory contains the .pde files.
  • The kernels/ directory contains the custom kernels; they need to be installed into the environment via python setup.py.

Issues Faced

1. *Compilation Errors*

  • The C++ extension failed to compile because:
    • The out= variant of the function was missing or incorrectly defined.
    • There were signature mismatches in function definitions.
    • The meta function (fused_fftconv_meta) wasn’t properly registered.
    • Template errors and ATen API changes caused unexpected failures.
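For reference, a meta function computes only output metadata (shape/dtype) so the export tracer can propagate shapes without running the kernel. A torch-free sketch of that shape logic, assuming a "full" convolution along the last dimension (the actual fused_fftconv_meta may use a different output-length rule):

```python
def fused_fftconv_meta_shape(input_shape, filter_shape):
    """Output shape for a 'full' 1-D convolution along the last dim.

    Mirrors what a meta function computes: metadata only, no data.
    """
    *lead, l_in = input_shape          # leading dims pass through unchanged
    l_filt = filter_shape[-1]
    return (*lead, l_in + l_filt - 1)  # full conv: L_out = L_in + L_filt - 1
```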

Fix:

  • Defined fused_fftconv_out properly.
  • Used resize_ and copy_ to ensure the output tensor was correctly shaped without modifying inputs.
  • Registered both versions correctly:

    ```cpp
    TORCH_LIBRARY(myop, m) {
      m.def("fused_fftconv(Tensor input, Tensor filter) -> Tensor", fused_fftconv);
      m.def("fused_fftconv.out(Tensor input, Tensor filter, *, Tensor(a!) out) -> Tensor(a!)", fused_fftconv_out);
    }
    ```
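The contract behind that pair of schemas, sketched in plain Python with list-based stand-ins (not the real kernel): the functional variant allocates its result, while the out= variant resizes and fills the caller's buffer in place and returns that same buffer.

```python
def conv_full(signal, kernel):
    """Reference 'full' direct convolution on plain lists."""
    result = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            result[i + j] += s * k
    return result

class Buffer:
    """Stand-in for a tensor supporting in-place resize_ / copy_."""
    def __init__(self):
        self.data = []
    def resize_(self, n):
        self.data = [0.0] * n
        return self
    def copy_(self, values):
        self.data[:] = values
        return self

def fused_fftconv(signal, kernel):
    # Functional variant: allocates and returns a fresh result.
    return conv_full(signal, kernel)

def fused_fftconv_out(signal, kernel, out):
    # out= variant: shape the caller's buffer, fill it, and return the
    # same object -- the inputs are never modified.
    result = conv_full(signal, kernel)
    out.resize_(len(result)).copy_(result)
    return out
```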

2. *XNNPACK Runtime Error: Shape Mismatch*

Error:

  • "Attempted to change the tensor rank which is immutable: old=3, new=2"
  • The exported model expected a 2D input for the linear layer (e.g., (256, 4)) but received a 3D input (1, 4, 256).
  • This happened because the partitioner isolated the linear operation in a subgraph with shape (B*L, C), while the overall model expected (B, C, L).

Fix Options:

  • Option A (Python code change): Keep linear as a 3D operation and avoid flattening.
  • Option B (C++ kernel change): Reshape the input inside C++ before convolution to match XNNPACK's expected shape.

Most of the time, adjusting the Python side is simpler.
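For concreteness, the (B, C, L) ↔ (B*L, C) round-trip at issue, sketched with nested lists (hypothetical helpers, not the model's actual code): each position along L becomes one row of per-channel features for the linear layer, and it is exactly this rank change that XNNPACK rejected.

```python
def flatten_for_linear(x):
    """(B, C, L) nested lists -> (B*L, C): one row per (batch, position)."""
    rows = []
    for sample in x:                       # sample has shape (C, L)
        channels, length = len(sample), len(sample[0])
        for pos in range(length):
            rows.append([sample[ch][pos] for ch in range(channels)])
    return rows

def unflatten_after_linear(rows, b, c, l):
    """Invert flatten_for_linear: (B*L, C) -> (B, C, L)."""
    out = []
    for bi in range(b):
        chunk = rows[bi * l:(bi + 1) * l]  # (L, C) block for this sample
        out.append([[chunk[pos][ch] for pos in range(l)] for ch in range(c)])
    return out
```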

3. *PyTorch Export Errors (torch._dynamo.exc.Unsupported: out= op was called where output tensor was non-contiguous)*

  • The export function failed when tracing the model because out was not contiguous.
  • Fix:
    • Ensure out is contiguous before passing it to fused_fftconv_out. For example:

      ```python
      out = torch.empty_like(x, memory_format=torch.contiguous_format)
      torch.ops.fftconv.fused_fftconv.out(x, filter, out=out)
      ```
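Why a tensor can be non-contiguous in the first place: contiguity is a property of strides, and views such as transposes reorder strides without moving data. A minimal torch-free illustration, assuming row-major layout:

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def is_contiguous(shape, strides):
    """A layout is contiguous iff its strides match the row-major ones."""
    return list(strides) == contiguous_strides(shape)
```

A (2, 3) tensor has strides (3, 1); its transposed view has shape (3, 2) with strides (1, 3), which fails the check, hence the need for a freshly allocated, contiguous out.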

Next Steps & Future Work

If I had more time, I would:

  • Optimize for Android: Use NEON intrinsics to improve performance on ARM-based devices.
  • Improve Memory Efficiency: Avoid unnecessary copies and temporary tensors.
  • Explore Other Backends: Possibly consider IPEX or other compilers for further performance gains.

Summary

This project involved:

  • Writing custom PyTorch C++ extensions.
  • Exporting models to XNNPACK & ExecuTorch.
  • Debugging both compile-time and runtime errors.

While I managed to get it all running, there wasn't enough time to work on actual optimizations for the hardware.
