How to test

  • The pdes/ directory contains the .pde files.
  • The kernels/ directory contains the custom kernels; they need to be installed into the environment via python setup.py.

Issues Faced

1. *Compilation Errors*

  • The C++ extension failed to compile because:
    • The out= variant of the function was missing or incorrectly defined.
    • There were signature mismatches in function definitions.
    • The meta function (fused_fftconv_meta) wasn’t properly registered.
    • Template errors and ATen API changes caused unexpected failures.
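For reference, a meta function computes only output metadata (shape/dtype) so the export tracer can propagate shapes without running the kernel. A torch-free sketch of that shape logic, assuming a "full" convolution along the last dimension (the actual fused_fftconv_meta may use a different output-length rule):

```python
def fused_fftconv_meta_shape(input_shape, filter_shape):
    """Output shape for a 'full' 1-D convolution along the last dim.

    Mirrors what a meta function computes: metadata only, no data.
    """
    *lead, l_in = input_shape          # leading dims pass through unchanged
    l_filt = filter_shape[-1]
    return (*lead, l_in + l_filt - 1)  # full conv: L_out = L_in + L_filt - 1
```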

Fix:

  • Defined fused_fftconv_out properly.
  • Used resize_ and copy_ to ensure the output tensor was correctly shaped without modifying inputs.
  • Registered both versions correctly:

    ```cpp
    TORCH_LIBRARY(myop, m) {
      m.def("fused_fftconv(Tensor input, Tensor filter) -> Tensor", fused_fftconv);
      m.def("fused_fftconv.out(Tensor input, Tensor filter, *, Tensor(a!) out) -> Tensor(a!)", fused_fftconv_out);
    }
    ```
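The contract behind that pair of schemas, sketched in plain Python with list-based stand-ins (not the real kernel): the functional variant allocates its result, while the out= variant resizes and fills the caller's buffer in place and returns that same buffer.

```python
def conv_full(signal, kernel):
    """Reference 'full' direct convolution on plain lists."""
    result = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            result[i + j] += s * k
    return result

class Buffer:
    """Stand-in for a tensor supporting in-place resize_ / copy_."""
    def __init__(self):
        self.data = []
    def resize_(self, n):
        self.data = [0.0] * n
        return self
    def copy_(self, values):
        self.data[:] = values
        return self

def fused_fftconv(signal, kernel):
    # Functional variant: allocates and returns a fresh result.
    return conv_full(signal, kernel)

def fused_fftconv_out(signal, kernel, out):
    # out= variant: shape the caller's buffer, fill it, and return the
    # same object -- the inputs are never modified.
    result = conv_full(signal, kernel)
    out.resize_(len(result)).copy_(result)
    return out
```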

2. *XNNPACK Runtime Error: Shape Mismatch*

Error:

  • "Attempted to change the tensor rank which is immutable: old=3, new=2"
  • The exported model expected a 2D input for the linear layer (e.g., (256, 4)) but received a 3D input (1, 4, 256).
  • This happened because the partitioner isolated the linear operation in a subgraph with shape (B*L, C), while the overall model expected (B, C, L).

Fix Options:

  • Option A (Python code change): Keep linear as a 3D operation and avoid flattening.
  • Option B (C++ kernel change): Reshape the input inside C++ before convolution to match XNNPACK's expected shape.

Most of the time, adjusting the Python side is simpler.
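For concreteness, the (B, C, L) ↔ (B*L, C) round-trip at issue, sketched with nested lists (hypothetical helpers, not the model's actual code): each position along L becomes one row of per-channel features for the linear layer, and it is exactly this rank change that XNNPACK rejected.

```python
def flatten_for_linear(x):
    """(B, C, L) nested lists -> (B*L, C): one row per (batch, position)."""
    rows = []
    for sample in x:                       # sample has shape (C, L)
        channels, length = len(sample), len(sample[0])
        for pos in range(length):
            rows.append([sample[ch][pos] for ch in range(channels)])
    return rows

def unflatten_after_linear(rows, b, c, l):
    """Invert flatten_for_linear: (B*L, C) -> (B, C, L)."""
    out = []
    for bi in range(b):
        chunk = rows[bi * l:(bi + 1) * l]  # (L, C) block for this sample
        out.append([[chunk[pos][ch] for pos in range(l)] for ch in range(c)])
    return out
```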

3. *PyTorch Export Errors (torch._dynamo.exc.Unsupported: out= op was called where output tensor was non-contiguous)*

  • The export function failed when tracing the model because out was not contiguous.
  • Fix:
    • Ensure out is contiguous before passing it to fused_fftconv_out. For example:

      ```python
      out = torch.empty_like(x, memory_format=torch.contiguous_format)
      torch.ops.fftconv.fused_fftconv.out(x, filter, out=out)
      ```
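Why a tensor can be non-contiguous in the first place: contiguity is a property of strides, and views such as transposes reorder strides without moving data. A minimal torch-free illustration, assuming row-major layout:

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def is_contiguous(shape, strides):
    """A layout is contiguous iff its strides match the row-major ones."""
    return list(strides) == contiguous_strides(shape)
```

A (2, 3) tensor has strides (3, 1); its transposed view has shape (3, 2) with strides (1, 3), which fails the check, hence the need for a freshly allocated, contiguous out.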

Next Steps & Future Work

If I had more time, I would:

  • Optimize for Android: Use NEON intrinsics to improve performance on ARM-based devices.
  • Improve Memory Efficiency: Avoid unnecessary copies and temporary tensors.
  • Explore Other Backends: Possibly consider IPEX or other compilers for further performance gains.

Summary

This project involved:

  • Writing custom PyTorch C++ extensions.
  • Exporting models to XNNPACK & ExecuTorch.
  • Debugging both compile-time and runtime errors.

While I managed to get it all running, there wasn't enough time to work on actual optimizations for the hardware.
