DaVinci Multiple GPU and CUBIX thread



Alex Hastings
01-08-2011, 09:23 AM
Hello, I would love to start a thread where users can post their multiple-GPU benchmarks and configurations, including anybody using the Cubix GPU-Xpander:
http://www.cubixgpu.com/

Since Macs are being configured, and will be configured, in so many different ways, this really opens up a lot of options.

My main questions are about mixing GPUs. Since the Mac offerings for CUDA-enabled graphics cards are weak and depressing, is mixing different types of GPUs going to slow things down?

Say I have a Quadro FX 4800 with a GeForce GTX 285 and a Quadro 4000: is that going to hinder things? Or do we have to keep the GPUs the same?

Very excited to see what users are doing out there.

Alexander Ibrahim
01-08-2011, 01:24 PM
I don't have facts... just educated conjecture.

The effect of multiple GPU types will vary based on how Resolve works... something that I don't think is public knowledge.

What I expect is that Resolve will use a single GPU until it starts to exceed that GPU's real time capacity. When it does, it will start offloading processing of entire nodes to additional GPUs.

Doing this creates a serial chain. More GPUs will take longer to start playing back in real-time, but they will keep a nice flow of footage even with a lot of nodes.
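
To put rough numbers on the serial-chain idea, here is a minimal sketch. It is conjecture in code form, not how Resolve works: the per-stage times and the `hop_ms` transfer cost are made-up values, and it assumes transfers between cards overlap with processing once the chain is full.

```python
def serial_chain_timing(stage_ms, hop_ms=5):
    """Timing of a serial chain where each GPU handles one group of nodes.

    stage_ms: hypothetical per-frame processing time (ms) of each GPU stage.
    hop_ms:   hypothetical cost (ms) of handing a frame to the next GPU.
    Returns (first_frame_latency_ms, steady_state_interval_ms).
    """
    hops = len(stage_ms) - 1
    # The first frame has to pass through every stage and every hand-off.
    first_frame_latency = sum(stage_ms) + hops * hop_ms
    # Once the chain is full, the slowest stage sets the pace
    # (assumes hand-offs overlap with processing).
    steady_interval = max(stage_ms)
    return first_frame_latency, steady_interval


one_gpu = serial_chain_timing([60])             # all nodes on one card
three_gpus = serial_chain_timing([20, 25, 15])  # same 60 ms of work, split three ways
print(one_gpu, three_gpus)
```

With these made-up numbers the three-card chain takes longer to deliver its first frame but then delivers frames at a much shorter interval, which is the trade-off described above.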

Alternatively, Resolve could attempt to parallel process entire frames. So, while GPU 1 is working on frame 1, GPU 2 gets frame 2, etc. The risk in this case is that you may get out-of-order frame completion. For example, GPU 3 may be the fastest GPU in your system, with GPU 1 being second fastest and GPU 2 being slowest. In that case GPU 3 may render 3 or 4 frames, and all three GPUs will have to delay playback until GPU 2 finishes its workload.
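
The out-of-order risk can be sketched the same way. This is only an illustration of the frame-per-GPU idea with round-robin dispatch, which is an assumption; the per-frame times are hypothetical and nothing here reflects Resolve's actual scheduler.

```python
def round_robin_frames(frame_ms, n_frames):
    """Deal frames to GPUs in turn and track when each frame can be shown.

    frame_ms: hypothetical time (ms) each GPU needs per frame.
    Returns (done, ready): done[f] is when frame f finishes rendering,
    ready[f] is when it can be displayed in order (all earlier frames done).
    """
    busy_until = [0] * len(frame_ms)
    done = []
    for f in range(n_frames):
        gpu = f % len(frame_ms)          # round-robin assignment
        busy_until[gpu] += frame_ms[gpu]
        done.append(busy_until[gpu])

    ready, latest = [], 0
    for t in done:
        latest = max(latest, t)          # playback cannot skip ahead of a late frame
        ready.append(latest)
    return done, ready


# GPU 1 is second fastest, GPU 2 is slowest, GPU 3 is fastest.
done, ready = round_robin_frames([20, 40, 10], n_frames=6)
print(done)
print(ready)
```

In this toy run the fastest card finishes its frames early, but in-order playback still waits on the slowest card's frames.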

A serial chain is probably easier to get running smoothly with heterogeneous processors, but a parallel chain can be more highly optimized with homogeneous processors.

With luck we'd see patterns in the performance results.

Alex Hastings
01-09-2011, 10:32 AM
Wow, that is a very good assessment. I think those are the kind of questions that need to be answered.

When I was looking into the Baselight system, their FOUR uses multiple GPUs, each rendering a part of the screen, for greater-than-2K real time. Also, as I recall, they were using NVIDIA GTX cards and not Quadro FX, since they write all their own drivers.

I guess since I have a Quadro FX 4800 for DaVinci, am I looking for another one?

Are the GPUs working in serial or parallel?

Jeff Kilgroe
01-09-2011, 11:48 AM
What I expect is that Resolve will use a single GPU until it starts to exceed that GPU's real time capacity. When it does, it will start offloading processing of entire nodes to additional GPUs.

Unfortunately you're just making assumptions here. We have no idea how the DaVinci folks have implemented their GPU pipeline(s) and CUDA does not have to work the way you have described.


Doing this creates a serial chain. More GPUs will take longer to start playing back in real-time, but they will keep a nice flow of footage even with a lot of nodes.

While it's possible they have created such an implementation, I would think it unlikely, mostly because this is just a flat-out bad design approach for multi-GPU software. That said, I've seen it done. :frown2:


Alternatively, Resolve could attempt to parallel process entire frames. So, while GPU 1 is working on frame 1, GPU2 gets Frame 2 etc. The risk in this case is that you may get out of order frame completion.

Highly doubtful. First of all, CUDA does not work like this -- GPU1, GPU2, etc... In a CUDA environment, each GPU presents itself as one or more CUDA nodes. Each node has a multitude of stream processors or "CUDA cores". You can also subdivide GPUs into multiple nodes for segmented cache polling and other multiprocessor operations. The GTX285 has 240 cores, the Quadro 4000 has 256 cores. While it seems somewhat logical to load up one GPU and then cascade operations over to the next GPU card, this is once again bad design and actually takes more development effort and redundant code to make it work.


A serial chain is probably easier to get running smoothly with heterogeneous processors, but a parallel chain can be more highly optimized with homogeneous processors.

Neither of those approaches is ideal. This is a cellular micro-architecture. Think about broadcasting the next shader process onto the PCIe bus and all GPU nodes pick it up and cache it. As stream processors are available, they pull the next package from the node cache and broadcast a message "hey, we're starting". The process is executed by whichever node and core subset latches onto it and the others scrub it from their cache.

The real challenge to CUDA development is writing the highly multi-threaded code. You have to be able to break down operations into increasingly smaller tasks. A GPU, or especially a collection of GPUs, is much like an ant colony, with each core as a worker performing a simple task. Once completed, the worker jumps right into the next task it can latch onto.
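
The "ant colony" picture is roughly a pull-based work queue: whichever worker is idle grabs the next small task. A minimal sketch of that idea follows, with hypothetical per-task costs for three workers of different speeds. It is a scheduling toy, not a description of CUDA internals or of Resolve.

```python
import heapq


def pull_schedule(task_cost_ms, n_tasks):
    """Simulate workers that each pull the next task as soon as they are idle.

    task_cost_ms: hypothetical time (ms) each worker needs per task.
    Returns (tasks_done_per_worker, total_time_ms).
    """
    # Heap of (time the worker becomes free, worker index).
    free_at = [(0, w) for w in range(len(task_cost_ms))]
    heapq.heapify(free_at)
    counts = [0] * len(task_cost_ms)
    finish = 0
    for _ in range(n_tasks):
        t, w = heapq.heappop(free_at)    # earliest idle worker takes the task
        end = t + task_cost_ms[w]
        counts[w] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, w))
    return counts, finish


counts, total = pull_schedule([20, 40, 10], n_tasks=7)
print(counts, total)
```

In this toy model the fast worker naturally ends up doing most of the tasks and nobody waits on a fixed assignment, which is the appeal of breaking work into many small pieces.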

The parallel and serial methodologies you have mentioned above are entirely two-dimensional concepts that are all too often used when they should not be. We see it every day with desktop CPU multiprocessing. It's frustrating to no end. And I have, unfortunately, seen too much of it in the GPU programming world.

Let's just put it this way... We don't know how BMD/DaVinci has implemented their CUDA pipelines and GPU support. I only hope that your assessments are WRONG!!! :)

@ Alex -- Don't buy another FX4800. They are two generations old and still sell for a hefty premium due to market confusion. Both the GTX285 and Quadro4000 are significantly faster and cheaper. You can pick up two Quadro4000 cards for the price of one FX4800.

It stands to reason to match GPU types: not necessarily exact GPU models, but the series. If you have one Fermi card (Quadro4000), you're probably best off sticking with Fermi-based cards, as they profile the same and have the same breakdown for specific tasks. Mixing a Quadro4000 with a GTX285 will create two separate CUDA profiles that must be meshed together, which could cause segregation within your pipeline and leave unused cores sitting idle at times.

Gavin Greenwalt
01-09-2011, 02:01 PM
One thing to keep in mind is that I'm hearing anecdotal evidence of GeForce cards burning out after a relatively short timespan under constant load. Then again... you could buy 5 GeForces of equal cores/memory for the same price and just swap them out as they die.

Peter Chamberlain
01-09-2011, 05:39 PM
Hi guys, all I can say on this subject is we recommend all the GPUs be the same model for the most optimized configuration. If you use different GPU models, with obvious differences in RAM, CUDA cores, etc., the slowest card will be the pacemaker and the faster cards will be underutilized.
BTW, the current OS X maximum is four GPUs, including the UI GPU. We have 16 image processing GPU systems running with Linux. That rocks even at 4K.
Peter
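
A back-of-the-envelope way to read the "slowest card is the pacemaker" point, as a sketch only: it assumes the work is split evenly so every card advances at the slowest card's rate, and the frame rates are invented for illustration.

```python
def lockstep_fps(per_gpu_fps):
    """Throughput if every card is held to the pace of the slowest one."""
    return len(per_gpu_fps) * min(per_gpu_fps)


def ideal_fps(per_gpu_fps):
    """Throughput if every card could run flat out (an upper bound)."""
    return sum(per_gpu_fps)


mixed = [30, 30, 12]      # hypothetical: two fast cards plus one slow card
matched = [30, 30, 30]    # hypothetical: three identical cards

for name, cards in (("mixed", mixed), ("matched", matched)):
    used = lockstep_fps(cards) / ideal_fps(cards)
    print(name, lockstep_fps(cards), ideal_fps(cards), f"{used:.0%} utilized")
```

Under those assumptions the mixed set delivers half of what the cards could do on paper, while the matched set loses nothing.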

Chris Parker
01-09-2011, 08:03 PM
hi Peter,

Does adding GPUs via the Cubix expander help file-based rendering at all, or does it only affect realtime addition of color nodes or fx for live monitoring or playout to tape?

(i.e., would having 4 GPUs render out a timeline full of ProRes 444 files into DNxHD files any faster than having only 1 GPU?)

thanks!

Alexander Ibrahim
01-09-2011, 09:14 PM
Unfortunately you're just making assumptions here. We have no idea how the DaVinci folks have implemented their GPU pipeline(s) and CUDA does not have to work the way you have described.

Jeff,

Did you read what I wrote? Because the first two sentences were:


I don't have facts... just educated conjecture.

The effect of multiple GPU types will vary based on how Resolve works... something that I don't think is public knowledge.

Is there some reason that you decided to "pick on me" after having read that, or did you just not notice it before?

As to the rest, further conjecture- yours or mine- is obviated by the guy from BM/DaVinci who actually knows what is going on posting with an actual answer.

Jeff Kilgroe
01-09-2011, 09:25 PM
Is there some reason that you decided to "pick on me" after having read that, or did you just not notice it before?

Wasn't picking on you at all, sorry if it came across that way. Mostly I was just saying that the parallel and/or serial processing approach to GPUs and CUDA is not what we should expect, nor the most efficient way for CUDA or GPU-accelerated tasks to work. Having said that, hopefully they have not developed a pipeline based on such a paradigm.


As to the rest, further conjecture- yours or mine- is obviated by the guy from BM/DaVinci who actually knows what is going on posting with an actual answer.

Yes, Peter confirmed that all GPUs should be of the same type -- only makes sense. Any information beyond that has not been divulged, nor do I expect them to do so. So we can only speculate...

Torrey Loomis
01-09-2011, 10:06 PM
Great thread. Looking forward to the progress here.

Torrey

Peter Chamberlain
01-10-2011, 12:31 AM
Resolve uses the CUDA GPUs for all image processing, but uses the CPUs for the codec work. To Resolve, a render is the same as playback or PowerMaster to tape, with just one more step: the recording to disk. So in the example of a ProRes source to a DNxHD render, the issue is: where are the bottlenecks? In the image processing or the codec work? Each clip is different depending on the amount of grading. So more GPUs will help for clips that have more nodes, but won't help if the bottleneck is the transcoding codec.
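
The bottleneck question can be written as a one-line model: the render runs at the slower of the GPU stage and the CPU codec stage. The sketch below assumes identical cards that scale ideally, and the frame rates are hypothetical, not measurements.

```python
def render_fps(n_gpus, fps_per_gpu, codec_fps):
    """Render speed is capped by whichever stage is slower.

    fps_per_gpu: hypothetical grading throughput of one card for this clip.
    codec_fps:   hypothetical CPU decode/encode throughput for this clip.
    """
    gpu_fps = n_gpus * fps_per_gpu   # assumes ideal scaling across identical cards
    return min(gpu_fps, codec_fps)


# A heavily graded clip: the GPUs are the limit until the codec takes over.
for n in (1, 2, 3, 4):
    print(n, render_fps(n, fps_per_gpu=10, codec_fps=24))
```

With these numbers the third card still helps a little and the fourth adds nothing, because the codec work on the CPUs has become the limit.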

To speculate: I think a lot of facilities will be pretty happy with two GPUs for processing, one for UI, a Rocket (or two for stereoscopic), the DeckLink HD Extreme and of course the storage controller card. Adding a third processing GPU will make some sense for high-res or very demanding grading, but the sweet spot is likely to be the config described above.
Peter