How to test
The `pdes/` directory has the `.pde` files, and the `kernels/` directory has the custom kernels. The kernels need to be installed into the environment via `python setup.py`.
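A minimal smoke test might look like the sketch below. The extension module name, the tensor shapes, and the filter layout are all assumptions here; the op namespace (`myop`) matches the registration block shown later in this writeup:

```python
import torch
import fftconv_ext  # hypothetical module name; whatever setup.py builds  # noqa: F401

x = torch.randn(1, 4, 256)   # (B, C, L); shapes are illustrative
filt = torch.randn(4, 256)   # filter layout depends on the kernel's contract

# Functional variant of the custom op.
y = torch.ops.myop.fused_fftconv(x, filt)

# The out= variant should produce the same values.
out = torch.empty_like(y, memory_format=torch.contiguous_format)
torch.ops.myop.fused_fftconv.out(x, filt, out=out)
assert torch.allclose(y, out)
```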
Issues Faced
1. **Compilation Errors**
- The C++ extension failed to compile because:
  - The `out` version of the function was missing or incorrectly defined.
  - There were signature mismatches in function definitions.
  - The meta function (`fused_fftconv_meta`) wasn't properly registered.
  - Template errors and ATen API changes caused unexpected failures.
- The Fix:
  - Defined `fused_fftconv_out` properly.
  - Used `resize_` and `copy_` to ensure the output tensor was correctly shaped without modifying inputs.
  - Registered both versions correctly:

```cpp
TORCH_LIBRARY(myop, m) {
  m.def("fused_fftconv(Tensor input, Tensor filter) -> Tensor", fused_fftconv);
  m.def("fused_fftconv.out(Tensor input, Tensor filter, *, Tensor(a!) out) -> Tensor(a!)",
        fused_fftconv_out);
}
```
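For intuition, here is a rough Python analog of what the fixed `out=` variant does (a sketch of the contract, not the actual C++ kernel):

```python
import torch

def fused_fftconv_out_reference(input, filter, *, out):
    # Mirror of the C++ fix: compute the functional result, then
    # resize `out` to match and copy into it in place, leaving the
    # inputs untouched.
    result = torch.ops.myop.fused_fftconv(input, filter)
    out.resize_(result.shape)
    out.copy_(result)
    return out
```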
2. **XNNPACK Runtime Error: Shape Mismatch**
- Error: `Attempted to change the tensor rank which is immutable: old=3, new=2`
- The exported model expected a 2D input for the linear layer (e.g., `(256, 4)`) but received a 3D input `(1, 4, 256)`. This happened because the partitioner isolated the linear operation in a subgraph with shape `(B*L, C)`, while the overall model expected `(B, C, L)`.
Fix Options:
- Option A (Python code change): Keep linear as a 3D operation and avoid flattening (see the sketch below).
- Option B (C++ kernel change): Reshape the input inside C++ before convolution to match XNNPACK's expected shape.

Most of the time, adjusting the Python side is simpler.
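Here is a minimal sketch of Option A with a hypothetical channel-mixing layer. `nn.Linear` already broadcasts over leading dimensions, so transposing the `(B, C, L)` tensor avoids the `(B*L, C)` flatten entirely and the traced graph keeps rank 3:

```python
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    """Hypothetical layer: mixes channels of a (B, C, L) tensor."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, L) -> (B, L, C): Linear acts on the last dim and
        # broadcasts over B and L, so no (B*L, C) flatten is needed.
        x = self.proj(x.transpose(1, 2))
        return x.transpose(1, 2)  # back to (B, C, L)

y = ChannelMix(4)(torch.randn(1, 4, 256))
assert y.shape == (1, 4, 256)  # rank stays 3 throughout
```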
3. **PyTorch Export Errors** (`torch._dynamo.exc.Unsupported: out= op was called where output tensor was non-contiguous`)
- The export function failed when tracing the model because `out` was not contiguous.
- Fix:
  - Ensure `out` is contiguous before passing it to `fused_fftconv_out`. For example:

```python
out = torch.empty_like(x, memory_format=torch.contiguous_format)
torch.ops.fftconv.fused_fftconv.out(x, filter, out=out)
```
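The `memory_format=torch.contiguous_format` argument matters: `empty_like` defaults to preserving the input's layout, so if `x` is a non-contiguous view (e.g., the result of a transpose), the allocated `out` would be non-contiguous too and would trip the same Dynamo check.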
Next Steps & Future Work
If I had more time, I would:
- Optimize for Android: Use NEON intrinsics to improve performance on ARM-based devices.
- Improve Memory Efficiency: Avoid unnecessary copies and temporary tensors.
- Explore Other Backends: Possibly consider IPEX or other compilers for further performance gains.
Summary
This project involved:
- Writing custom PyTorch C++ extensions.
- Exporting models to XNNPACK & ExecuTorch.
- Debugging both compile-time and runtime errors.
While I managed to get it all running, there wasn't enough time to work on actual optimizations for the hardware.