Add unit tests that validate accuracy against analytic integrals. Add a small Python script that runs the benchmarks and plots the speedup. Optionally, implement a CUDA version if you want a GPU comparison.
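
As a rough sketch of the first suggestion, the accuracy check could compare the integrator against integrals with known closed forms. Everything named here is an assumption for illustration: `integrate_trapezoid` is a hypothetical stand-in for whatever integrator is actually under test, and the test cases and the `1e-6` tolerance are placeholders to adjust.

```python
import math


def integrate_trapezoid(f, a, b, n=10_000):
    """Hypothetical stand-in for the integrator under test (composite trapezoid rule)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Each case: (integrand, lower bound, upper bound, analytic value of the integral).
CASES = [
    (math.sin, 0.0, math.pi, 2.0),           # integral of sin(x) on [0, pi]
    (lambda x: x * x, 0.0, 1.0, 1.0 / 3.0),  # integral of x^2 on [0, 1]
    (math.exp, 0.0, 1.0, math.e - 1.0),      # integral of e^x on [0, 1]
]


def max_abs_error(integrator, cases=CASES):
    """Return the largest absolute deviation from the analytic values."""
    return max(abs(integrator(f, a, b) - exact) for f, a, b, exact in cases)


def check_accuracy(integrator, tol=1e-6):
    """Raise AssertionError if the integrator misses any analytic value by tol or more."""
    err = max_abs_error(integrator)
    assert err < tol, f"max abs error {err:.3e} exceeds tolerance {tol:.1e}"
    return err


if __name__ == "__main__":
    print(f"max abs error: {check_accuracy(integrate_trapezoid):.3e}")
```

To test a different implementation, pass any callable with the same `(f, a, b)` signature to `check_accuracy`. The same functions can be wrapped in `unittest` or `pytest` test cases if the project already uses one of those.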