Sunday, June 11, 2006

Four Cores -- the More the Merrier

7/28/06 updated with even newer Intel performance numbers
We are now seeing a large number of users with dual-processor, dual-core systems like AMD Opteron 275s to 285s, and dual-core HyperThreaded processors like the Intel 965 Extreme Edition. Each of these systems presents four processing paths for us to play with. With quad-core single processors on the horizon, we have good reason to do a little more with our threading architecture. The main reason I haven't been blogging much is that improving threading can be a real headache, particularly as we are already pretty well threaded. Sometimes we spend a week on a particular algorithm only to realize we gained just another 5% in speed, and I want much more than that.

For encoding we have N-way threading, which is really nice, and we've had it for some time. It is made easier by the fact that we can encode many separate frames simultaneously on ingest: we launch an encoding thread for every CPU available (a real core or a logical one like HT). Threading the decoder is not so simple; after all, it has to be compatible with DirectShow, Premiere Pro, MediaPlayer, etc. Often threading tricks applied to the encoder don't work as well with the decoder. (We do now have an N-way threaded decoder for extreme playback, 4k etc., but not yet for editing; we're still working on that.)

Previously our products included an efficient 2-way threaded DirectShow decoding component, which was ideal for single dual-core processors, dual Opterons 248-254 and HT-enabled P4s. Yet on quad systems we were seeing the encoder run faster than the decoder (because of N-way threading), which is rather odd behavior for a symmetric compression technology (encoding and decoding should be similar in performance). Sorry, that was perhaps a lot of boring background about the new threaded decoder that just shipped in all our products. It will be up to 30% faster on quad systems than the previous decoder.
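For readers who want to picture the frame-parallel encoding described above, here is a minimal Python sketch of the general idea, not our actual encoder. It assumes frames are independent byte buffers and uses zlib purely as a stand-in codec; the names encode_frame and encode_frames are made up for illustration.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_frame(frame: bytes) -> bytes:
    """Stand-in for a per-frame encoder (zlib here, NOT the CineForm codec)."""
    return zlib.compress(frame, 6)


def encode_frames(frames, workers=None):
    """Encode independent frames in parallel, one worker thread per CPU.

    os.cpu_count() reports logical CPUs, so HT cores are counted too.
    pool.map returns results in input order, so the frames stay in sequence.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_frame, frames))


if __name__ == "__main__":
    # Hypothetical synthetic "frames": 8 buffers of 1280*720 bytes each.
    frames = [bytes([i]) * (1280 * 720) for i in range(8)]
    encoded = encode_frames(frames)
    print(len(encoded), "frames encoded on", os.cpu_count(), "logical CPUs")
```

This works for ingest because no frame depends on another. A decoder inside an editing host usually has to return one specific frame on demand, which is roughly why the same trick does not carry over directly.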

On a related subject, it has been suggested that we make a standard encoder/decoder test suite to characterize real-time video processing performance on various CPU/memory/drive configurations. What do you think? Here is a sample of recent decode speed measurements using a very demanding sequence captured from an XBOX 360 at 720p. (Note: consoles produce more demanding data than anything acquired through a lens. A lens adds natural anti-aliasing, which is easier to compress and play back, whereas gaming material typically has an infinite depth of field and harder edges. So HD camera playback rates are higher than those shown here.)
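As a rough idea of what the core of such a test suite might look like, here is a small Python timing sketch. It is only an illustration under stated assumptions: measure_fps is a hypothetical helper, and zlib.decompress stands in for a real video decoder.

```python
import time
import zlib


def measure_fps(decode, encoded_frames, repeats=3):
    """Return the best frames-per-second rate over several timed passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for data in encoded_frames:
            decode(data)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(encoded_frames) / elapsed)
    return best


if __name__ == "__main__":
    # Synthetic clip: 30 copies of one 921,600-byte buffer (one byte per 720p pixel).
    raw = bytes(range(256)) * 3600
    clip = [zlib.compress(raw)] * 30
    print(f"{measure_fps(zlib.decompress, clip):.1f} fps")
```

Taking the best of several passes reduces noise from disk caching and background tasks; a real suite would also need to fix the source clip and record the CPU, memory and drive configuration.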

CPU                                                   Decoder 2   Decoder 3
Yonah 2.0GHz, 667 FSB
Merom 2.0GHz, 667 FSB
Merom 2.33GHz, 667 FSB (new data)
Pentium D 840 EE 3.2GHz, 800 FSB
Pentium D 965 EE 3.73GHz, 1066 FSB                    110.83      146.61
Conroe 2.66GHz, 1066 FSB (next generation desktop)    122.81      137.59
Glidewell 3.2GHz, 1066 FSB
Woodcrest 2.66GHz, 1333 FSB (new)


All numbers are in frames per second for a full-resolution, full-quality decode. Decoder 2 is used in our Premiere Pro editing solutions (Aspect HD and Prospect HD); its preview mode literally doubles and triples these numbers. Decoder 3 is our N-way threaded presentation codec (think Digital Cinema). The Glidewell's (new Xeon) 223.75 fps is due to the 8 threads (!) that can run on HT dual-core, dual-processor Xeons (4 real cores). 720p at 223.75 fps equates to roughly 100 fps at 1080p and 30p for 4k at 2.35:1 (note: higher resolutions compress more efficiently, so these frame rate numbers are very conservative). And we have more optimization coming. :)
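The resolution conversion above is simple pixel-rate scaling, sketched here in Python. The assumptions are mine for illustration: decode cost is taken as proportional to pixel count, and "4k at 2.35:1" is taken as 4096 pixels wide with the height rounded to 1743. Under those assumptions 223.75 fps at 720p works out to about 99.4 fps at 1080p and just under 29 fps at 4k, in line with the rounded figures quoted.

```python
def equivalent_fps(src_fps, src_size, dst_size):
    """Scale a frame rate by pixel count, assuming constant pixels per second."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return src_fps * (src_w * src_h) / (dst_w * dst_h)


HD_720P = (1280, 720)
HD_1080P = (1920, 1080)
# Assumption: "4k at 2.35:1" taken as 4096 wide, height rounded to 1743.
CINE_4K_235 = (4096, round(4096 / 2.35))

if __name__ == "__main__":
    print(round(equivalent_fps(223.75, HD_720P, HD_1080P), 2))  # 99.44
    print(round(equivalent_fps(223.75, HD_720P, CINE_4K_235), 2))
```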

New Data 7/28/06: The same tests on the new Woodcrest Xeons hit 270 fps, with a lower clock speed than Glidewell and no HT. Amazing!

Note: For those who want AMD numbers, these processors are also very good (and have been for a while). However, the AMD systems we have in-house are older than these newer Intel boxes, so I can't show the best AMD vs. the best Intel. Intel is certainly bringing it on and is currently the performance leader in our office; previously we only recommended AMD for Prospect HD Ingest, but not any more.


Richard Leadbetter said...

As you know, I absolutely love the idea of CineForm encoding or decoding becoming a de facto benchmark for CPU performance, David.

Right now, I see no meaningful benchmarks for CPU performance that encompass multithreading, but more than that, the chance to bench CineForm products on certain CPUs and share results with other users could be very useful for people looking to upgrade their systems, or looking to purchase a new system to a set performance level and/or budget.

Some astounding figures there too - especially with the dual Xeon set-up. Assuming that the enhancements made in the playback can be incorporated into a Premiere Pro friendly 'Decoder 2.5', is the Conroe benchmark indicative of the performance gain we would see in a non-hyper threaded dual core CPU?

David said...

The Conroe numbers are without any further optimization targeted at that platform (which we do have coming), and they are still amazing. Just think what a Conroe Extreme Edition will do.