Wednesday, November 19, 2008

Intel Core i7 and CineForm

Wow! That was my immediate reaction when I first ran a CineForm decoder performance test on the new Core-i7 processor. I've had access to these new Intel processors for a while now and I knew they were fast; I just didn't know how fast. The system we were very honored to have early access to was an Intel Core i7-965 Extreme Edition (Nehalem architecture), a 3.2GHz quad core. When we first booted the system, we saw 8 CPUs in Task Manager, even though this is a quad core. These new chips have re-introduced Hyper-Threading: each core can be set up as 2 virtual cores -- this means we will likely see 16 virtual CPUs in upcoming dual quad workstations. Nice! With so many virtual CPUs to run on, I knew we had to upgrade our decoder for better n-way threading (which the encoder already had). This work, which I was most involved with over the last month, resulted in a 50% boost in frame rate over our already fast decoder on Core-2 architecture dual-quads. Now it is time to test the Core i7-965.

In these tests I compared my beloved HP xw8600 workstation (3.16GHz, 8 cores, 4GB RAM, running XP-Pro) with a gaming-configured Core-i7 desktop (4 cores, 3GB RAM, running Vista 64). No operations used GPU assistance.

Running with only half the number of cores, this new processor nearly doubles the average performance of CineForm HD and 2K in 4:2:2, 4:4:4 and RAW formats, and even approaches real-time full resolution playback of 4K (a workstation class Core-i7 will be playing back 4K without issue from CineForm RAW or 4:4:4 encoded sources). All this frame rate overhead greatly eases multiple stream processing and allows for huge efficiency increases in batch processing of mezzanine and image archives. It also allows for much more Active Metadata (AM) processing through CDL style color databases, 3D LUT film looks and other yet-to-be-announced AM features.

Intel and/or HP, when can I get my hands on an i7 dual Xeon? Please.

Where the i7 didn't scale as well was with high quality 4K RAW demosaicing filters. Both the 4K R3D decodes and the high quality debayer modes in CineForm RAW show only minimal speed-ups going from the 8-core Core2 to the 4-core i7 (still amazing considering the reduction in cores). Looking at our own code, the demosaic has not used much of the SIMD (media) instruction set, nor is it particularly memory I/O limited; it is just a lot of operations per pixel. It seems we do have room for more performance optimization in the demosaic.

All the transcodes were performed using the CineForm R2CF utility that comes with our NEO 4K and Prospect 4K products. R2CF has a very efficient implementation of the R3DSDK, allowing for close to 100% CPU utilization. I have included the R3D to CineForm transcode times, as REDCODE is known to be a particularly compute heavy format. These times also include the time for a CineForm encode, but this only affects the FPS numbers by around 10% as our encoder is very fast (up to ten times faster than an R3D decode plus additional processing [adding curves and color space controls]). I'm showing the combined numbers because that is the CineForm workflow for R3D: you do an R3D decode once to convert to CineForm, then work with the CineForm files (decoding them multiple times) for the extra speed and flexibility.
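The "around 10%" figure follows from simple rate arithmetic. Here is a minimal sketch, assuming decode and encode run one after the other for each frame (a simplification; the function names and the 10fps decode rate are made up for illustration, and only the ten-to-one speed ratio comes from the text above):

```python
def combined_fps(decode_fps: float, encode_fps: float) -> float:
    """Frame rate when every frame is decoded and then encoded in series.

    Per-frame times add, so the two rates combine harmonically.
    """
    return 1.0 / (1.0 / decode_fps + 1.0 / encode_fps)


def encode_overhead(decode_fps: float, encode_fps: float) -> float:
    """Fraction of the decode-only frame rate lost by adding the encode step."""
    return 1.0 - combined_fps(decode_fps, encode_fps) / decode_fps


if __name__ == "__main__":
    decode_fps = 10.0              # hypothetical R3D decode rate
    encode_fps = 10 * decode_fps   # encoder assumed ten times faster
    loss = encode_overhead(decode_fps, encode_fps)
    print(f"combined: {combined_fps(decode_fps, encode_fps):.2f} fps, "
          f"overhead: {loss:.1%}")
```

With a ten-to-one ratio the loss is 1/11 of the decode-only rate, a little over 9%, whatever the absolute decode speed.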

I expect another factor in the widening margins for CineForm decode performance on the Core-i7 is that we avoid arithmetic coding, which is tricky and compute intensive for CPUs (and nearly impossible for GPUs to do efficiently -- we are asked about GPU acceleration often). The CineForm codec was always designed for speed on Intel processors, where faster memory and faster media instructions relate almost directly to improved frame rates, as we have a compute-light entropy coding engine. While arithmetic coding would increase bit-efficiency by maybe 5-10%, the performance gain of 4-6X from not using it made this an easy choice when we started this codec work 7 years ago (on 1.7GHz P4s using MMX -- the fastest we could get could only do NTSC/PAL SD in real-time). Now someone needs to suggest a fun use for Express files running at 450 frames per second.

Tuesday, November 18, 2008

2K is still compelling

On my drive into the office I'm often listening to media related podcasts like TWIM, Red Centre, and Filmspotting. On today's drive I was catching up on Filmspotting podcast episode #235, which opens with a nice review of Slumdog Millionaire, the latest feature from Danny Boyle (Trainspotting, 28 Days Later, Sunshine, etc.) Now I knew this was a Silicon Imaging SI-2K project, shot as CineForm RAW, but other than that I've been too busy to learn anything more about it. Filmspotting is a film review show, not a techno-geek-out fest like Red Centre or elements of TWIM, so you never hear them talk about cameras. But after raves for the story, acting and narrative structure, there was high praise for the cinematography. They were amazed that it was digitally acquired, and at the types of shots the crew were able to get in India, commenting "you do not film in the streets of Mumbai...you are taking your life into your own hands, and they actually did." Clearly this points to the huge advantage of a 2K camera with a real 11 stops of dynamic range at the size of a deck of tarot cards. Yet this is "only" a super-16mm equivalent sensor, proving again that making a film has very little to do with megapixels and sensor size. Now I'll have to see the film, as it is getting a 92% freshness rating over on rottentomatoes.com.

Thursday, November 13, 2008

My Take on Red's Announcements

Very cool. The higher the bit-rates and the higher the resolution goes, the greater the need for a high performance compressed digital intermediate for post, mezzanine and long term archive, i.e. CineForm. And for anyone wishing to produce their own compressed RAW video cameras, Red and SI are not going to be in that business alone; CineForm RAW is ready when you are.


Anyone notice how the whole Red "Brain" line-up looks like chunkier versions of Silicon Imaging's dockable SI-2K Mini? Red continues to confirm SI's vision. :)

Friday, November 07, 2008

Even More Decode Speed

For the last 2-3 weeks I have found my time consumed with more decoder optimization. While the core of the CineForm codec has now been around for nearly 7 years, its enhancement has never stopped, whether we are adding new pixel formats, Active Metadata, improving quality or striving for more performance. Working on the codec core is more rewarding, as an engineering success is not dependent on the eccentricities of third-party applications like Premiere or FCP, which like to get in the way.

The decoding engine has been threaded for 8 cores for some time, but it was only efficiently using about 3-4 cores. This inefficiency was not an issue for real-time playback as the codec was already very fast, faster than necessary for real-time multi-stream playback (even on dual core systems). Each decoder during a transition or layered effect would happily use much of the available CPU. Better codec threading was needed for new markets CineForm is finding itself in: file-based film and television archives, and mezzanine storage for HD distribution. These markets have been limited by the real-time nature of tape formats like D5 and HDCAM SR. If you are going to switch to file-based storage, there is no point in limiting yourself to 1:1 real-time; you want faster than real-time for batch processing and file format conversions wherever possible. This is one reason CineForm is displacing JPEG2000 for archives: JPEG2000 is just too slow for batch processing in software (typically much slower than real-time 1:1, i.e. slower than tape).

While the current public beta is more efficiently threaded for up to 8 cores, giving up to a 50% decoder speed-up for some sources, the in-house decoder (out soon) will support up to 32 cores, ready for those new Intel powered workstations which will have 6 and 8 cores per physical part coming very shortly.

Some performance numbers from my stock HP xw8600 8-core 3GHz workstation:
444 1080p 12-bit per channel StEM footage -- 64fps.
444 1080p 12-bit per channel Stereo (3D) -- 43fps per eye (86fps total throughput.)
RAW 4K 12-bit per channel with demosaic (no GPU acceleration) -- 22fps.
RAW 4K 12-bit per channel decoding at 2K (no GPU acceleration) -- 59fps.

All testing used Build 186 of Prospect 4K beta.
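To put the frame rates above in batch-processing terms, here is a small sketch that converts each one into a multiple of real-time. The 24fps playback rate is my assumption for illustration (the numbers above do not state the source frame rate), and the labels and helper names are made up:

```python
PLAYBACK_FPS = 24.0  # assumed playback rate; not stated in the results above

# Decode rates quoted above for the 8-core xw8600 (frames per second).
DECODE_FPS = {
    "444 1080p 12-bit": 64.0,
    "444 1080p 12-bit stereo (per eye)": 43.0,
    "RAW 4K 12-bit with demosaic": 22.0,
    "RAW 4K 12-bit decoding at 2K": 59.0,
}


def realtime_multiple(decode_fps: float, playback_fps: float = PLAYBACK_FPS) -> float:
    """How many times faster than real-time a given decode rate is."""
    return decode_fps / playback_fps


def batch_minutes(footage_minutes: float, decode_fps: float,
                  playback_fps: float = PLAYBACK_FPS) -> float:
    """Minutes needed to decode `footage_minutes` of material."""
    return footage_minutes / realtime_multiple(decode_fps, playback_fps)


if __name__ == "__main__":
    for label, fps in DECODE_FPS.items():
        print(f"{label}: {realtime_multiple(fps):.2f}x real-time, "
              f"{batch_minutes(60, fps):.1f} min per hour of footage")
```

Under that 24fps assumption, only the full-resolution 4K demosaic at 22fps falls just short of real-time; the other three cases decode an hour of footage in well under an hour.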