Thursday, 26 November 2009

Perlin Noise Terrain Raycasting

Here is a first attempt at raycasting Perlin noise on the fly to achieve volumetric terrain rendering. In the demo, a 128^3 volume of random data serves as the base for the scenes in the screenshots above.

By optimizing the empty-space skipping, it is possible to raycast reasonably large outdoor scenes at interactive frame rates (20-40 fps) on an Nvidia GTX 260 GPU. The advantage of this kind of landscape is that it is extremely easy to handle and also very memory friendly (it's just 128^3 RGBA voxels = 8 MB of data). The performance can also easily be adjusted for older graphics cards through the empty-space skipping configuration.
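
As a quick sanity check of the 8 MB figure, here is a small Python sketch. It assumes one byte per channel, i.e. 4 bytes per RGBA voxel; the function name is just for illustration.

```python
# Memory footprint of a dense cubic voxel volume.
# Assumption: RGBA with 1 byte per channel = 4 bytes per voxel.

def volume_bytes(side: int, bytes_per_voxel: int = 4) -> int:
    """Size in bytes of a cube with `side` voxels along each edge."""
    return side ** 3 * bytes_per_voxel

size = volume_bytes(128)
print(size, "bytes =", size / 2**20, "MiB")
```

Doubling the edge length to 256 would already cost eight times as much, which is why the small base volume matters here.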

The demo can be downloaded here. Controls are W, A, S, D.

Saturday, 14 November 2009


Here are some demos of my new sparse voxel octree (SVO) raycaster.

Technical details:

- Storage: approx. 100 bits/voxel
- Uses a variant of persistent threads

Demo download: SVO-Demo-Cuda.2.3.7z

Monday, 22 June 2009

Tile-based memory layout

After a long time, here is another update. The next logical step in development is to add a tile-based memory layout to allow large, unique, non-repeating landscapes. Here is a first screenshot showing the tiles.

Tuesday, 31 March 2009

More Videos

Here are two videos showing the Happy Buddha scene (1024x2048x1024).
High quality video here: Buddha avi [mirror]

The updated demo download from today (right side, first position in the links)
also includes the endless Buddha executable.

Monday, 30 March 2009


For those of you who cannot run the demo for some reason, I have captured a short video of it. You can watch it in the window below or download the larger, better-quality version to see more detail.

Landscape AVI [mirror]

Saturday, 28 March 2009

CUDA optimizations II

Today I would like to share a couple of interesting references on optimizing CUDA. There are many similarities among these presentations, but they are still interesting, as reading through them gives you new ideas about what's possible.

1.) Optimization Techniques for Large Data Structures on CUDA
2.) AstroGPU - CUDA Optimization Part I
3.) AstroGPU - CUDA Optimization Part II
4.) CUDA Programming Notes
5.) NVISION08: Advanced CUDA: Optimizing to Get 20x Performance
6.) Top 5 Optimization Strategies for CUDA
7.) CUDA at MIT - IAP2009

Looking at slide 3 of the first presentation, using the GPU should give an average speedup of about a factor of 10 over the CPU if the algorithm can be fully SIMD-parallelized (GPU: GTX 280, 933 GFLOPS, 141.7 GB/s memory bandwidth; CPU: Intel Core 2 QX9650, 96 GFLOPS, 12.8 GB/s memory bandwidth).

Now, looking at NVIDIA's CUDA page, I am often surprised to see that some algorithms seem to have been sped up 100x or even more compared to the CPU - this is rather hard to believe, taking the numbers above into account.
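
To make the "factor 10" concrete, the peak figures quoted above can simply be divided. This is only a back-of-the-envelope sketch in Python; the variable names are mine, and peak numbers say nothing about what a real kernel achieves.

```python
# Ratio of the peak figures quoted from the presentation.
gpu = {"gflops": 933.0, "bandwidth_gbs": 141.7}  # GTX 280
cpu = {"gflops": 96.0, "bandwidth_gbs": 12.8}    # Intel Core 2 QX9650

compute_ratio = gpu["gflops"] / cpu["gflops"]
bandwidth_ratio = gpu["bandwidth_gbs"] / cpu["bandwidth_gbs"]

print(f"compute:   {compute_ratio:.1f}x")
print(f"bandwidth: {bandwidth_ratio:.1f}x")
```

Both ratios land close to 10x, so a reported 100x speedup would have to come from somewhere other than raw peak throughput, for example from comparing against CPU code that is far from its own peak.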

Monday, 23 March 2009

New Benchmark Version

Today I ported the CUDA version to the CPU (multicore). It is included in the updated demo.

[-Download-] (CUDA 2.1 Required - Driver version 181.20 or newer )

The first results so far are:

CPU (3 GHz Pentium D) - Single/Repeated/Repeated 2xAA: 3/1.2/0.6 fps
CPU (Intel Core 2 Quad Q6600, 4x 3 GHz) - Single/Repeated/Repeated 2xAA: 15/8/5 fps
GPU (8800 GTS) - Single/Repeated/Repeated 2xAA: 33/24/17 fps
GPU (GTX 285) - Single/Repeated/Repeated 2xAA: 44/34/36 fps

This time the scene is the complex version of the one shown in the pictures below.

I guess the low CPU performance is mostly due to the many floating-point operations. Changing the calculations to integer might improve the speed. For now, however, it is the fairest possible comparison, since CPU and GPU execute the same C++ code.

Thursday, 19 March 2009


Today I compared CPU vs. GPU to see whether it was really worth the work to write everything in CUDA rather than for the CPU. [detailed pics] [-CPU-Demo-]

The opponents:
CPU: 3.0 GHz Pentium D, 1 GB vs.
GPU: NVIDIA GTX 285, 1 GB

In the first round the CPU seems to provide a good performance, compared to the GPU - the GPU is just 3x faster than the CPU.

In the second round however, the GPU already wins over CPU with a speed factor of 7.3 : 1.

In the third round the CPU loses all ground, and the GPU wins by about 20:1 (47.5:2.4).

Finally, it would be interesting to know why the GPU does not scale linearly at all. I have no idea why the frame rate is not halved when the computations are doubled, or vice versa.

Wednesday, 18 March 2009

Demo with 2x AA

Small update - the demo linked below now also includes 2xAA (not 2x2!), which significantly reduces the aliasing of distant pixels. On the 8800 GTS it's quite slow right now, but on the GTX 285 I found almost no difference from the normal version.
For the GTS, I will perhaps think about applying AA only to distant geometry to increase the speed.

Tuesday, 17 March 2009

Now the algorithm works entirely on the GPU

Today I finished moving the ray generation part to the GPU, saving another 1-4 ms as well as an unnecessary memory copy. Silhouette smoothing is also working well, together with basic anti-aliasing (so far only for GTX 2xx cards).

As for the smoothing, I tried two variants (left) and found that the one in the middle looks best so far. The unsmoothed original (top) is too jagged, and the one at the bottom smooths too much for the tree scene, which makes nearby geometry look like a 2D impostor.

The updated demo is here [-download-] (CUDA 2.1)
It now also includes softening for the Buddha and dragon scenes.

For the more experienced among you, the shader folder contains the GLSL shader (soft.frag). You can experiment a bit by modifying the smoothing.

Sunday, 15 March 2009

Silhouette Smoothing

Today I experimented with a new shader that smooths the silhouette based on the depth buffer. It doesn't look bad, but it's difficult to figure out the optimal parameters.

Saturday, 14 March 2009

Soft Voxels II

Today I improved the filtering a bit. The softening looks nicer than yesterday (it's also a little slower). [-dl-new shaders-]

I'm still not sure whether soft voxels look better than hard-edged voxels in general. They give the impression of missing detail and low resolution - both of which are unwanted.
Real filtering that approximates the surface would be better.

Friday, 13 March 2009

Soft Voxels

Today I added depth of field to soften the blocky voxels a little bit. It's very simple - the smoothing radius just depends on the current depth.
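
The post does not say how the radius depends on depth, so the following Python sketch is only one plausible mapping, not the demo's actual shader. It assumes the radius follows the projected on-screen size of a voxel (large when near, small when far), clamped to a maximum; `voxel_size`, `focal_px` and `max_radius` are made-up parameters.

```python
# Hypothetical depth-dependent smoothing radius (assumption, not the demo's code).
# Idea: a voxel at distance `depth` covers roughly voxel_size * focal_px / depth
# pixels, so smooth over about half that footprint, up to a fixed maximum.

def projected_voxel_size_px(depth: float, voxel_size: float = 1.0,
                            focal_px: float = 512.0) -> float:
    if depth <= 0.0:
        raise ValueError("depth must be positive")
    return voxel_size * focal_px / depth

def smoothing_radius_px(depth: float, max_radius: float = 8.0) -> float:
    radius = 0.5 * projected_voxel_size_px(depth)
    return min(radius, max_radius)

for d in (1.0, 64.0, 512.0):
    print(d, smoothing_radius_px(d))
```

With this choice, distant voxels that are already smaller than a pixel get almost no blur, while close-up voxels get the maximum.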

Thursday, 12 March 2009

New Release

Today it's time for a new release. Major mapping bugs are fixed and the colors look better now (I hope).

[-Demo Version v2-] ( Cuda 2.1 )

I also posted the demo as IOTD on GDev, as I think it is worth seeing.

Tuesday, 10 March 2009

Happy Buddha reloaded

This time with shading - it looks nicer.
From far away it's not possible to tell whether it's polygons or voxels - only a close-up reveals what our Buddha is made of.

Any limit?

View distance set to 4,000,000 - still interactive (18 fps). Having unique voxels everywhere is a problem in this case, however.

Here we can also see an advantage of the RLE structure - it's very easy to generate procedural mountains. It might be possible with octree raycasting too, but right now I don't have an idea how this could easily work.
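
To illustrate why a run-length layout suits procedural mountains, here is a minimal Python sketch. It is not the renderer's actual data structure: it assumes each (x, z) column is stored as a short list of (material, length) runs, and it uses a simple sine/cosine height function as a stand-in for real noise.

```python
import math

# Hypothetical height function standing in for real noise (assumption).
def height(x: int, z: int, world_height: int = 64) -> int:
    h = 0.5 + 0.25 * math.sin(x * 0.1) + 0.25 * math.cos(z * 0.13)  # in [0, 1]
    return int(h * (world_height - 1))

# One column = solid run up to the terrain height, then air to the top.
def rle_column(x: int, z: int, world_height: int = 64) -> list:
    h = height(x, z, world_height)
    runs = [("solid", h), ("air", world_height - h)]
    return [run for run in runs if run[1] > 0]  # drop zero-length runs

print(rle_column(0, 0))
```

A whole mountain column costs only two runs regardless of its height, whereas a dense grid or an octree would have to represent the volume in between.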

Monday, 9 March 2009


Here we can see the 4 variants of anti-aliasing. The distance at which the renderer switches to the next mipmap level is also very important for quality.

Friday, 6 March 2009

Maximal complexity ?

Here is another very complex scene.
RLE elements total: 15.4M
RLE elements processed: 5.8M
RLE elements rendered: 1M
Visible pixels: 0.66M

Thursday, 5 March 2009

Better Performance

Today I wrote a converter for PLY files. The first result can be seen on the left.

After several optimizations, the frame rate also increased by about 10% on average and, depending on the scene, by up to 50%.

Saturday, 28 February 2009


Here is a scene from Voxelstein.
Surprisingly, the rendering is very quick :-)

Download Demo (Cuda 1.1)

It's an alpha version,
so don't expect anything ;-)

Friday, 27 February 2009

First Color for the new Version

Today the scene got a bit more color.

Benchmarks so far:
1024x1024, 1024 rays: 40 ms / 25 fps avg.
1024x768, 1024 rays: 39 ms / 25 fps avg.
1024x768, 512 rays: 36 ms / 27 fps avg.
1024x768, 256 rays: 36 ms / 27 fps avg.
512x512, 512 rays: 21 ms / 47 fps avg.
512x512, 256 rays: 21 ms / 47 fps avg.
512x512, 128 rays: 21 ms / 47 fps avg.

So far I couldn't figure out why fewer rays do not increase the frame rate significantly, since the computation cost should be proportional to the number of rays.
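
For reference, the frame times in the list above convert to frames per second as follows. This Python sketch just redoes the arithmetic (truncating to whole frames, which matches the quoted averages); the tuple layout is my own.

```python
# Convert the measured frame times (ms) to frames per second.
def fps(frame_time_ms: float) -> float:
    return 1000.0 / frame_time_ms

benchmarks = [  # (resolution, rays, frame time in ms)
    ("1024x1024", 1024, 40),
    ("1024x768", 1024, 39),
    ("1024x768", 512, 36),
    ("1024x768", 256, 36),
    ("512x512", 512, 21),
    ("512x512", 256, 21),
    ("512x512", 128, 21),
]

for resolution, rays, ms in benchmarks:
    print(f"{resolution:>9}, {rays:4d} rays: {ms} ms -> {int(fps(ms))} fps")
```

At 1024x768, going from 1024 to 256 rays saves only 3 ms out of 39, so most of the frame time apparently does not scale with the ray count. One possible reading (not measured here) is that a per-frame cost that depends on resolution dominates.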

Thursday, 26 February 2009

Texture-mapping Works !

Here is a current screenshot
25 fps @ 1024x768 :-)
View distance: 40,000 voxels
(with mip-mapping)

Wednesday, 25 February 2009

New Download

Here you can download the demos shown below.
It's still the old version, so rendering at 1024x768 is not fast yet.

Required: Cuda 1.1
Filesize: 22MB

Tuesday, 24 February 2009


Mapping seems to work now - next step is to compute the rays accurately.

Thursday, 19 February 2009


Today I measured the performance of the current implementation. The result: the scene on the right has about 8.0M RLE elements in the view frustum, of which 4.7M are not culled and 280k are visible, rendered as 450k pixels. At a frame rate of about 25 fps, this means the renderer processes about 117M RLE elements per second.
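
The throughput number can be reproduced from the counts in this paragraph. The Python sketch below assumes that the 117M figure refers to the elements that survive culling, since 4.7M times 25 matches it; the variable names are mine.

```python
# Counts quoted above for one frame of the test scene.
in_frustum = 8_000_000   # RLE elements in the view frustum
not_culled = 4_700_000   # elements actually processed
visible = 280_000        # elements that end up visible
pixels = 450_000         # pixels they cover
frame_rate = 25          # frames per second

elements_per_second = not_culled * frame_rate
visible_fraction = visible / in_frustum
pixels_per_visible_element = pixels / visible

print(elements_per_second)                   # elements processed per second
print(f"{visible_fraction:.1%}")             # share of frustum elements visible
print(f"{pixels_per_visible_element:.2f}")   # average pixels per visible element
```

Only a few percent of the elements in the frustum end up on screen, which is why culling matters so much here.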

My graphics card's maximum untextured triangle throughput is 280M/s if the triangles share vertices, and about 133M/s if the triangles have independent vertices. The maximum vertex transform rate is about 400M/s.

This means that if the landscape were visualized using splats, each rendered as a single triangle, at least 8M triangles would be required. Without any culling, this would give about 133/8 = 16 fps. Here, the geometry shader might perhaps be used to accelerate the rendering: it would be possible to send only one vertex, from which the geometry shader generates a quad or triangle.

If we visualized each voxel inside the landscape using conventional polygons, we would need at least 2 triangles per voxel to create a quad. Taking shared vertices into account, this means we would have to render at least 16M triangles, resulting in a theoretical frame rate of 280/16 = 17.5 fps.
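
Both polygon estimates are the same division: peak triangle throughput over triangles per frame. A small Python sketch of that arithmetic, using only the numbers quoted above (the function name is mine, and real frame rates would be lower than these peak-rate bounds):

```python
# Upper bound on frame rate from peak triangle throughput.
def max_fps(triangles_per_second: int, triangles_per_frame: int) -> float:
    return triangles_per_second / triangles_per_frame

# Splats: 8M voxels, one independent triangle each, 133M triangles/s.
splat_fps = max_fps(133_000_000, 8_000_000)

# Quads: 8M voxels, two triangles each with shared vertices, 280M triangles/s.
quad_fps = max_fps(280_000_000, 2 * 8_000_000)

print(splat_fps, quad_fps)
```

Either way the polygon route stays below the roughly 25 fps measured for the RLE renderer on the same scene.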

Common technologies to render Voxels

1.) Heightmap based Voxel-Terrain. Ref
2.) Voxlap Technology. Ref1 Ref2
3.) Splatting / Rendering voxels as sprites. Ref1 Ref2
4.) Sparse Voxel Octree (SVO) Raycasting. Ref
5.) Mixed Octree / Regular Grid Raycasting. Ref
6.) Mixed Polygon / Voxel (Splatting) rendering. Ref
7.) Rendering Voxels as Polygons. Ref

Friday, 13 February 2009

It's more difficult than one would expect. Simple texture mapping can't actually do the job, so a pixel shader is necessary to do the unwrapping.

Still some work

Until the new version is complete, it's necessary to stretch out the temporary rendered buffer correctly using texture mapping. On the left we can see the debug output so far.

The CUDA output is currently 32 bits per pixel: 16-bit depth, 8-bit color and 8-bit normal.
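
A 32-bit output word like this can be packed and unpacked with a few shifts and masks. The Python sketch below assumes a particular bit order (depth in the upper 16 bits, then color, then normal), which the post does not specify; the function names are mine.

```python
# Pack 16-bit depth, 8-bit color index and 8-bit normal index into 32 bits.
# Assumed layout (not stated in the post): [depth:16][color:8][normal:8].

def pack_pixel(depth16: int, color8: int, normal8: int) -> int:
    if not (0 <= depth16 <= 0xFFFF and 0 <= color8 <= 0xFF and 0 <= normal8 <= 0xFF):
        raise ValueError("component out of range")
    return (depth16 << 16) | (color8 << 8) | normal8

def unpack_pixel(word: int) -> tuple:
    return (word >> 16) & 0xFFFF, (word >> 8) & 0xFF, word & 0xFF

word = pack_pixel(0x1234, 0xAB, 0xCD)
print(hex(word), unpack_pixel(word))
```

With only 8 bits each, color and normal would have to be palette or table indices rather than full RGB values or vectors.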

Thursday, 12 February 2009

More Speed

After adjusting a couple of parameters and doing further optimizations, I now get 20 fps at 1024x768. Update: after some more optimizations, 30 fps seems possible at 1024x768 :-)

Wednesday, 11 February 2009

Bunny Dataset

The Bunny dataset has a very complex surface and runs at only about 7-8 fps (1024x768).

Today: The Bonsai Tree Dataset

Today I played a bit with the bonsai dataset from here. The number of visible trees is about 3000.

Tips for CUDA programming

If some of you are thinking of writing a CUDA program, here are a couple of things to keep in mind:

1.) Reduce the number of registers used so that more threads can run in parallel
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code like int a[3]={1,2,3} - better use separate variables such as a0=1;a1=.. etc. if possible.
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones - it might be faster because fewer registers are used.
6.) Use textures to store your data where possible. Texture reads are cached - global memory reads aren't.
7.) Conditional jumps should branch the same way for all threads
8.) Avoid loops that are run by only a minority of threads while the others are idle
9.) Use fast math routines where possible
10.) A complex calculation is often faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (accesses get serialized - also for shared mem)
13.) Try to coalesce global memory accesses.
14.) Try to avoid bank conflicts when reading memory
15.) Small lookup tables can be stored in shared mem
16.) Experiment with the number of parallel threads to find the optimum. In case you run out of registers, use --maxrregcount=...

17.) If you can implement your method using GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, like alpha blending, fog, z-buffer testing, interpolation of variables between pixels and perhaps better thread handling too. You also don't have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.
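
Tips 1 and 16 are linked: the register count per thread limits how many threads can be resident on a multiprocessor. The Python sketch below is a deliberately simplified estimate, assuming a compute capability 1.0/1.1 GPU (8192 registers and at most 768 resident threads per multiprocessor). It ignores block and allocation granularity as well as shared memory, so treat the numbers as rough upper bounds.

```python
# Rough upper bound on resident threads per multiprocessor as a function of
# registers per thread. Defaults assume compute capability 1.0/1.1 hardware;
# for 1.2/1.3 (GTX 2xx) use regs_per_sm=16384 and thread_limit=1024.

def max_resident_threads(regs_per_thread: int, regs_per_sm: int = 8192,
                         thread_limit: int = 768) -> int:
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(thread_limit, regs_per_sm // regs_per_thread)

for regs in (10, 16, 32):
    print(regs, "registers/thread ->", max_resident_threads(regs), "threads")
```

Going from 32 to 16 registers per thread doubles the number of threads that can hide memory latency, which is the motivation for capping register usage.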

Wednesday, 4 February 2009

First Colorful Screenshots

Here is the first colorful result of what voxels can do when used with a simple L-system.

I tried to add the culling feature used in Voxlap, but unfortunately it didn't give any speedup with CUDA. So back to the previous version.

For the demo, Cuda 1.1 is required.

[ Download Link ( ca.10 MB ) ]