BVH accelerated point light shading
In deferred rendering
My engine uses a lot of modern techniques, like programmable vertex pulling and multithreaded rendering based on persistently mapped buffers with a ring buffer, and at the core of these techniques is the concept of a simple structured buffer. Experiments with compute-based ray tracing on kd-trees and octrees led me to stackless tree traversal on the GPU, which is very interesting and easy to find material about on the internet. By chance, I found this article about an alternative to all those clustered, forward plus, tiled deferred or whatever approaches for handling a massive number of lights. I highly recommend reading it and all the other nice posts over there. It caught my interest, because I had heard about light BVHs only in the context of offline renderers. And structured buffers? I have them. Compute shaders? I have them. My point lights? Yeah, maybe I have many of them, but they mostly don't move. And then again, I need light evaluation not only for my deferred rendering pass, but also for my transparency pass, a regular grid of environment probes and my voxel cone tracing grid...
Long story short, implementing a basic version was very easy, because the concept is so simple.
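To make the concept a bit more concrete, here is a minimal CPU-side sketch of a stackless traversal in Python. It is not my actual shader code. It assumes the tree is flattened in depth-first order into one array (which is what would sit in a structured buffer), that every node stores a bounding sphere, and that every node carries a hypothetical `skip` index pointing to the next node to visit when its whole subtree is rejected. The names `Node`, `skip`, `light` and `lights_affecting` are made up for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    center: Vec3   # bounding sphere center
    radius: float  # bounding sphere radius
    skip: int      # assumed layout: index of the next node if this subtree is rejected
    light: int     # light index for leaves, -1 for inner nodes


def dist_sq(a: Vec3, b: Vec3) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lights_affecting(nodes: List[Node], p: Vec3) -> Tuple[List[int], int]:
    """Return the lights whose volume contains p, plus the number of visited nodes."""
    hits: List[int] = []
    visited = 0
    i = 0
    while i < len(nodes):
        node = nodes[i]
        visited += 1
        if dist_sq(node.center, p) <= node.radius ** 2:
            if node.light >= 0:
                hits.append(node.light)  # a shader would accumulate lighting here
            i += 1          # depth-first order: next entry is the first child (or next node after a leaf)
        else:
            i = node.skip   # jump over the whole subtree, no stack needed
    return hits, visited


# Hand-built example: a root with two groups of two point lights each.
example_nodes = [
    Node((0.0, 0.0, 0.0), 7.0, 7, -1),   # 0: root
    Node((-5.0, 0.0, 0.0), 2.0, 4, -1),  # 1: left group
    Node((-6.0, 0.0, 0.0), 1.0, 3, 0),   # 2: light 0
    Node((-4.0, 0.0, 0.0), 1.0, 4, 1),   # 3: light 1
    Node((5.0, 0.0, 0.0), 2.0, 7, -1),   # 4: right group
    Node((4.0, 0.0, 0.0), 1.0, 6, 2),    # 5: light 2
    Node((6.0, 0.0, 0.0), 1.0, 7, 3),    # 6: light 3
]
```

Calling `lights_affecting(example_nodes, (-5.5, 0.0, 0.0))` only descends into the left group and rejects the right group with a single sphere test. The same loop can be reused for any shading point, whether it comes from the deferred pass, the transparency pass or a probe.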
Assuming a static tree, my implementation needs ~10 ms for 100 point lights, compared to ~34 ms for the most trivial compute shader, in the quite dense configuration above on my crappy notebook with an integrated Intel GPU. In a less dense configuration, the time goes down to ~4 ms or less. It really depends on the amount of overlapping volumes and on how efficient the tree is. 500 point lights scattered over the Sponza atrium take less than 30 ms.
BVH update: The trickiest and also the most costly part of the whole thing is probably the creation and update of the BVH, which I haven't implemented efficiently yet. The tree is rebuilt whenever any light moves, clustering lights or inner nodes recursively into buckets of 4. A bucket size of 4 gave me better performance than 8, as used in the blog post, probably because my light struct layout is not very efficient.
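The bucket idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the code from my engine: it works bottom-up, orders nodes crudely by their center coordinates, and uses a conservative (not minimal) bounding sphere per bucket. `BvhNode`, `bound`, `build` and `leaf_lights` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
BUCKET_SIZE = 4  # 4 children per inner node, as described above


@dataclass
class BvhNode:
    center: Vec3
    radius: float
    children: List["BvhNode"] = field(default_factory=list)
    light: int = -1  # light index for leaves, -1 for inner nodes


def bound(children: List[BvhNode]) -> BvhNode:
    """Conservative enclosing sphere: centroid of the child centers plus the
    farthest child extent. Simple, but generally larger than the optimal sphere."""
    n = len(children)
    center = tuple(sum(c.center[k] for c in children) / n for k in range(3))
    radius = max(math.dist(center, c.center) + c.radius for c in children)
    return BvhNode(center, radius, list(children))


def build(lights: List[Tuple[Vec3, float]]) -> Optional[BvhNode]:
    """Cluster lights, then inner nodes, into buckets of BUCKET_SIZE until one root is left."""
    level = [BvhNode(c, r, light=i) for i, (c, r) in enumerate(lights)]
    if not level:
        return None
    while len(level) > 1:
        # Assumption: lexicographic sort by center as a cheap spatial ordering.
        level.sort(key=lambda node: node.center)
        level = [bound(level[i:i + BUCKET_SIZE])
                 for i in range(0, len(level), BUCKET_SIZE)]
    return level[0]


def leaf_lights(node: BvhNode) -> List[int]:
    """Collect the light indices stored below a node."""
    if node.light >= 0:
        return [node.light]
    return [i for child in node.children for i in leaf_lights(child)]
```

A full rebuild like this is linear in the number of nodes per level plus the sort, which is fine for mostly static lights. A smarter ordering (for example along a space-filling curve) would likely produce tighter buckets, but that is beyond this sketch.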
Sphere union: The implementation to find an enclosing sphere for n spheres is from here. I'm not sure that a truly optimal sphere is found, but since I'm feeding every sphere's AABB corner points into the library, some efficiency is already wasted anyway, either on my side or in the library.
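For comparison, here is a small Python sketch that works on the spheres directly instead of on AABB corner points. This is a swapped-in alternative, not the library mentioned above: the two-sphere merge is exact, but folding it over n spheres is only a greedy approximation whose result depends on the input order and is generally not the minimal enclosing sphere. `merge_two` and `enclosing_sphere` are hypothetical names.

```python
import math
from functools import reduce
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Sphere = Tuple[Vec3, float]  # (center, radius)


def merge_two(a: Sphere, b: Sphere) -> Sphere:
    """Smallest sphere enclosing two spheres."""
    (ca, ra), (cb, rb) = a, b
    d = math.dist(ca, cb)
    if d + rb <= ra:
        return a  # b lies completely inside a
    if d + ra <= rb:
        return b  # a lies completely inside b
    # Neither contains the other, so d > 0 here.
    radius = (d + ra + rb) / 2.0
    t = (radius - ra) / d  # move from a's center towards b's center
    center = tuple(x + (y - x) * t for x, y in zip(ca, cb))
    return center, radius


def enclosing_sphere(spheres: List[Sphere]) -> Sphere:
    """Greedy union over n spheres: always enclosing, but not guaranteed optimal."""
    return reduce(merge_two, spheres)
```

For two unit spheres centered at x = 0 and x = 4, `merge_two` gives a sphere centered at x = 2 with radius 3, which is as tight as it gets. For more than two spheres the greedy fold can overshoot, so a proper minimal-enclosing-ball solver would still give tighter nodes.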