Problem | Contribution | Results | Report | Presentation | Stages 1 & 2 |

## ABSTRACT

Traditionally, Graphics Processing Units (GPUs) were designed for performing graphics specific computations. However, with rapid improvements in performance and programmability, GPUs have fostered considerable interest in doing computations that go beyond computer graphics; general purpose computation on GPUs, or "GPGPU". GPUs may be viewed as data parallel compute coprocessors that can provide significant improvements in computational performance especially for algorithms which exhibit sufficiently high amount of parallelism. One such algorithm is the Fast Multipole Method (FMM).

The Fast Multipole Method (FMM) provides a robust technique for the rapid evaluation of the pairwise interactions in a large collection of particles computed to a specified accuracy. The method scales as

The report describes in detail the strategies for parallelization of all phases of the FMM and discusses several techniques to optimize its computational performance on GPUs. The heart of FMM lies in its clever use of its underlying data structure, the Octree. We present two novel algorithms for constructing octrees in parallel on GPUs and discuss their pros and cons based on memory efficiency and running time.

The Fast Multipole Method (FMM) provides a robust technique for the rapid evaluation of the pairwise interactions in a large collection of particles computed to a specified accuracy. The method scales as

*O(N)*compared to the direct evaluation method with complexity*O(N*, which allows the solution of larger problems with the given limited resources. Further, there exists a high degree of parallelism in computations of these pairwise interactions which can be exploited. Our work uses the global illumination problem to study and analyze the nuances of FMM algorithm and its parallel implementation on GPUs. Global illumination problem requires the computation of pairwise interactions among each of the surface elements (points or triangles) in the given data (usually of order >^{2})*10*) and thus naturally fits in the FMM framework.^{6}The report describes in detail the strategies for parallelization of all phases of the FMM and discusses several techniques to optimize its computational performance on GPUs. The heart of FMM lies in its clever use of its underlying data structure, the Octree. We present two novel algorithms for constructing octrees in parallel on GPUs and discuss their pros and cons based on memory efficiency and running time.

## PROBLEM DEFINITION

To provide an efficient parallel GPU based Global Illumination solution for point models which is many folds faster than its corresponding CPU implementation.

**INPUT**: A 3-d point model with attributes like 3-d coordinates, default surface diffuse color, emmisivity and surface normals.**OUTPUT**: A fast parallel Global Illumination solution showing effects like color bleeding and soft shadows.## OUR CONTRIBUTIONS

- First parallel SFC construction algorithm on the GPU which can be used for constructing octrees efficiently in parallel.
- Implementation of a fast, parallel octree on the GPU supporting queries like Parallel Post Order Traversal, Location of the cell containing the queried point, Least Common Ancestor of two cells etc.
- Implementation of the Fast Multipole Method for radiosity kernel on the GPU, thus providing an efficient Global Illumination solution for point models which shows impressive speedups over its corresponding CPU implementation.
- Construction of a Visibility Map (V-Map) data structure on the GPU for computing view independent visibility to achieve correct Global Illumination results.

## RESULTS

(Click on any image/figure to see its zoomed version)## Memory Efficient Bottom-Up Octree Costruction

Bunny with 124531 points | |
---|---|

Ganpati with 165646 points | |

## Memory Efficient Top-Down Octree Costruction

Bunny with 124531 points | |
---|---|

Ganpati with 165646 points | |

## View Independent Visibility

## Quality of Results

We validate our proposed method here using an adaptive octree structure. We remark that the user divides the octree adaptively depending on the input scene density. Increasing or decreasing the levels of subdivision for a given scene is essentially a trade-off between quality of the visibility (user driven), and the computational time.

(a) An empty Cornell room | (b) Bluish Bunny (eye on the floor) |
---|---|

(c) The Buddha (eye on the green wall) | (d) Satyavathi |

In the figure above, (a) shows a point model of an empty Cornell room with some artificial lighting
to make the model visible. Note the default colors of the walls. We now introduce a bluish
white Stanford bunny. In (b), the eye (w.r.t. which visibility is being computed) is on
the floor, marked with a cyan colored dot. The violet (purple) color indicates those portions
of the room that are visible to this eye. Notice the "shadow" of the bunny on the back wall.
The same idea is repeated with the eye (marked in cyan) placed at different locations for
various different point models (all bluish white in color) of the Buddha (c), and an
Indian Goddess Satyavathi (d). We found that an octree of height 8 gave us accurate
visibility results, however, it is more prudent to go to higher depths such as 9 or 10.

## Timing Results

The figure below shows the V-map construction times (CPU and GPU) for point models with differing octree heights. Each graph in the figure shows the only-CPU, and CPU-GPU combo running time for various octree levels (5-10). For example, (a) shows results for the Stanford bunny in the
Cornell room while (b) shows the same for Ganesha and Satyavathi in Cornell Room. The table also shows the number
of leaves (at various depths) in the various adaptive octrees. This is an important parameter
on which the degree of parallelism indirectly depends. The running times tabulated, also
depends on the number of threads per block. A block size of (16 × 16) gives the best results
with 256 threads per block.

(a) |
---|

(b) |

## Fast Multipole Method for Radiosity Kernel

## Quality of Results

We converge to the final Global Illumination solution shown in the figure below by performing the FMM algorithm (Upward and then Downlward passes) for all colors (RGB) three times. We can see that the solution converges to a good extent in 3 iterations.

Ganpati (a) CPU, (b) GPU | |
---|---|

Bunny (a) CPU, (b) GPU | |

## Timing Results

The total time taken by the FMM algorithm for all 3 iterations and all 3 colors RGB is shown in the results below. (Note that the time taken by each iteration is approximately same.)

FMM Timings for all 3 iterations (Bunny with 124531 points) | |
---|---|

FMM Timings for all 3 iterations (Ganpati with 165646 points) | |

## CONCLUSION

We have presented two different octree construction algorithms, each having its own merits and support for desired queries. When a user needs support for various different queries, the bottom-up octree construction algorithm can be used, while the top-down algorithm, although supports only a subset of those queries, has less memory requirement and is faster in terms of run-time. We have compared our algorithms with their CPU equivalent counter-parts and showed that they out-perform them in every department. Further, we have applied both our octree algorithms to a N-body problem of computing radiosity based Global Illumination solution for point models using Fast Multipole Method and demonstrated the visually pleasing results. We have accelerated the run time of the FMM algorithm on the GPU in the range 15-20 as compared to a modern CPU and showed a feasibility of putting the FMM on the GPU. This should permit the use of these architectures in a vast variety of computationally intensive domains where the Fast Multipole Method or hierarchical data structuring techniques can be profitably used.

## REFERENCES

- L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325-348, 1987.
- J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal of Scientific and Statistical Computing, 9:669-686, July 1988.
- R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
- H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
- S. Seal and S. Aluru. Spatial Domain Decomposition Methods in Parallel Scientific Computing.
*Book chapter* - N. A. Gumerov and R. Duraiswami. Fast Multipole Method on Graphics Processors. AstroGPU 2007.
- John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80-113, 2007.
- Nvidia CUDA Programming Guide. http://developer.nvidia.com/cuda
- P. Ajmera, R. Goradia, S. Chandran, and S. Aluru. Fast, parallel, gpu-based construction of space filling curves and octrees. In SI3D '08: Proceedings of the 2008 symposium on Interactive 3D graphics and games, pages 1-1, New York, NY, USA, 2008. ACM.
- S. Lefebvre, S. Hornus, and F. Neyret. GPU Gems 2, Octree Textures on the GPU, pages 595-614. Addison Wesley, 2005.
- L. Nyland, M. Harris, and J. Prins. GPU Gems 3, chapter Fast N-Body Simulation with CUDA, pages 677-696. Addison Wesley, 2007.
- A. Karapurkar and S. Chandran. Fmm-based global illumination for polygonal models. Master's thesis, Indian Institute of Technology, Bombay, 2003.
- R. Goradia, A. Kanakanti, S. Chandran, and A. Datta. Visibility map for global illumination in point clouds. Proceedings of ACM SIGGRAPH GRAPHITE, 5th International Conference on Computer Graphics and Interactive Techniques, 2007.
- T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm

## STAGES 1 & 2

You can find the work done by me in the stages 1 and 2 of this dual degree project at the following links:

[Stage 1] [Stage 2]