Private:3DV Remote Rendering

Here we describe the components of a 3D video remote rendering system for mobile devices based on cloud computing services. We also discuss the main design choices and challenges that need to be addressed in such a system.

Components

The system will be composed of three main components:

Mobile receiver(s)
Adaptation proxy
View synthesis and rendering cloud service

Transmission is to be carried out via unicast over an unreliable wireless channel. A feedback channel is necessary between the receiver and the proxy. This channel would be used to send information about the current/desired viewpoint, buffer status, and network conditions, in addition to statistics about the mobile device itself (e.g. current battery level, screen resolution, expected amount of power for processing, etc.). Such a feedback channel is crucial for a fully adaptive algorithm that can quickly adapt to any change in these parameters.
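As a rough illustration of what a feedback report could carry, here is a minimal sketch in Python. The class name, field names, units, and the JSON encoding are all assumptions made for this example; the actual message format has not been decided.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ReceiverFeedback:
    """One feedback report from a mobile receiver to the adaptation proxy.

    Every field name and unit here is hypothetical.
    """
    viewpoint: float         # desired viewpoint, 0.0 = leftmost view, 1.0 = rightmost
    buffer_ms: int           # playout buffer occupancy in milliseconds
    throughput_kbps: float   # receiver-side throughput estimate
    loss_rate: float         # observed packet loss fraction in [0, 1]
    battery_pct: int         # remaining battery level, 0-100
    screen_width: int        # display resolution in pixels
    screen_height: int

    def to_json(self) -> str:
        """Serialize the report for transmission over the feedback channel."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ReceiverFeedback":
        """Rebuild a report on the proxy side."""
        return cls(**json.loads(payload))


report = ReceiverFeedback(0.5, 1200, 850.0, 0.02, 64, 800, 480)
restored = ReceiverFeedback.from_json(report.to_json())
```

A compact binary encoding would probably be preferable on a constrained uplink; JSON is used here only to keep the sketch readable.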

Because of the limited wireless bandwidth, we need efficient and adaptive compression of the transmitted views/layers. In addition, an unequal error protection (UEP) technique will be required to cope with the unreliable nature of the wireless channel. Multiple description coding (MDC) has already been used in some experiments, and the results seem quite promising for both multiview video coding and video plus depth.
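To make the MDC idea concrete, the sketch below shows one of the simplest possible schemes: temporal splitting into even and odd frames, with a lost description concealed by repeating a neighbouring frame from the surviving one. This is only an illustration under that assumption, not necessarily the scheme used in the experiments mentioned above, and the function names are hypothetical.

```python
def split_descriptions(frames):
    """Split a frame sequence into two descriptions: even and odd frames."""
    return frames[0::2], frames[1::2]


def reconstruct(even, odd, total):
    """Rebuild `total` frames from whichever descriptions arrived.

    Pass None for a lost description. Its frames are concealed by
    repeating a neighbouring frame from the description that did arrive.
    """
    if even is None and odd is None:
        raise ValueError("both descriptions were lost")
    out = []
    for i in range(total):
        primary, fallback = (even, odd) if i % 2 == 0 else (odd, even)
        if primary is not None:
            out.append(primary[i // 2])
        else:
            # Conceal with a neighbouring frame from the other description.
            j = min(i // 2, len(fallback) - 1)
            out.append(fallback[j])
    return out


frames = ["f0", "f1", "f2", "f3", "f4"]
even, odd = split_descriptions(frames)
full = reconstruct(even, odd, len(frames))
odd_lost = reconstruct(even, None, len(frames))
```

With both descriptions the sequence is recovered exactly; with one description the receiver still plays video at half the temporal resolution instead of stalling.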

It is assumed that the mobile receiver has a display capable of rendering at least two views. Interfacing with the mobile receiver's display may be an issue, since this will only be possible through a predefined driver API. Whether or not these APIs will be exposed, and which 3D image format they expect, will vary from one device to another. We believe that such autostereoscopic displays will probably come with their own IP hardware to perform rendering operations such as 3D warping, hole filling, etc. Whether or not we would have control over this process is still unknown. In the meantime, it is possible to experiment by sending data that can be rendered in 2D, as in most of the experiments we have read about so far. This would enable us to establish the feasibility of our scheme and benchmark it against previous work.

Based on receiver feedback, the adaptation proxy is responsible for selecting the best views to send, performing rate adaptation based on current network conditions, and encoding the selected views quickly and efficiently. We can utilize RD-optimization techniques for rate adaptation.
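One common way to do RD-optimized rate adaptation is Lagrangian selection: each view has a set of candidate (rate, distortion) operating points, and the proxy picks, per view, the point minimizing D + λR, tuning λ until the total rate fits the available bandwidth. The sketch below assumes such operating points are already known; the function names and the example numbers are hypothetical placeholders, not measurements.

```python
def pick(points, lam):
    """Return the (rate, distortion) point that minimizes D + lam * R."""
    return min(points, key=lambda p: p[1] + lam * p[0])


def allocate(views, budget_kbps, iters=50):
    """Choose one operating point per view so the total rate fits the budget.

    Bisects on the Lagrange multiplier: a larger value penalizes rate more,
    so the total rate is non-increasing in it. Returns None if even the
    cheapest combination found exceeds the budget.
    """
    lo, hi = 0.0, 1e6
    best = None
    for _ in range(iters):
        lam = (lo + hi) / 2
        choice = [pick(points, lam) for points in views]
        if sum(rate for rate, _ in choice) <= budget_kbps:
            best, hi = choice, lam   # feasible: try a smaller rate penalty
        else:
            lo = lam                 # over budget: penalize rate more
    return best


# Placeholder operating points: (rate in kbps, distortion) per view.
views = [
    [(200, 40.0), (400, 20.0), (800, 10.0)],
    [(200, 50.0), (400, 30.0), (800, 25.0)],
]
selection = allocate(views, budget_kbps=1200)
```

In this example the first view benefits more from extra bits than the second, so it receives the higher rate. Lagrangian selection only reaches points on the convex hull of each view's RD curve, which is usually acceptable for a fast proxy-side decision.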

It is important to distinguish between what the proxy and the cloud would each accomplish. For example, a multiview plus depth scheme could be used in order to support a broad array of devices, such as phones and tablets. From a few real, filmed views, a server might interpolate the extra views required, where the number of extra views depends on the nature of the mobile device. This would give considerable flexibility at a very low storage cost: only the original views would need to be stored. Experimentally, we would have to see whether it is feasible to generate the extra interpolated views on the fly, or whether such views need to be stored on the server as well.
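To give a feel for the work involved in interpolating a view, here is a minimal sketch of depth-based forward warping and hole filling on a single scanline. It assumes a rectified camera setup where depth has already been converted to a per-pixel disparity in pixels, and it fills holes from the background side. The function names are hypothetical, and a real implementation (for example on a GPU) would operate on whole frames.

```python
def warp_scanline(texture, disparity, alpha):
    """Forward-warp one row of pixels to a virtual view.

    texture:   list of pixel values of the reference view
    disparity: per-pixel disparity in pixels (larger means closer)
    alpha:     virtual camera position as a fraction of the baseline
    Returns the warped row (None marks holes) and the warped disparities.
    """
    width = len(texture)
    out = [None] * width
    disp_buf = [-1.0] * width
    for x in range(width):
        tx = x - int(round(alpha * disparity[x]))
        # Z-buffer test: the closer pixel wins when two land on one target.
        if 0 <= tx < width and disparity[x] > disp_buf[tx]:
            out[tx] = texture[x]
            disp_buf[tx] = disparity[x]
    return out, disp_buf


def fill_holes(row, disp):
    """Fill each run of holes from its neighbour with the smaller disparity,
    i.e. from the background side of the disocclusion."""
    filled = list(row)
    width = len(filled)
    x = 0
    while x < width:
        if filled[x] is not None:
            x += 1
            continue
        start = x
        while x < width and filled[x] is None:
            x += 1
        neighbours = [i for i in (start - 1, x) if 0 <= i < width]
        if not neighbours:
            break  # the whole row is a hole, nothing to copy from
        src = min(neighbours, key=lambda i: disp[i])
        for i in range(start, x):
            filled[i] = row[src]
    return filled


texture = [10, 11, 12, 13, 14, 15]
disparity = [0, 0, 2, 2, 0, 0]   # pixels 2 and 3 belong to a foreground object
warped, warped_disp = warp_scanline(texture, disparity, alpha=1.0)
synthesized = fill_holes(warped, warped_disp)
```

The foreground object shifts by two pixels and uncovers a two-pixel hole, which is then filled with the background value next to it. The cost of this per-pixel loop, repeated for every frame and every extra view, is what determines whether on-the-fly interpolation is practical.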

Design Choices

What is the format of the stored video files?
How many views (and possibly depth maps) need to be sent to the receiver?
- two views (receiver needs to construct a disparity/depth map and synthesize intermediate views)
- two views + two depth maps (receiver can then synthesize any intermediate view between the received ones)
- one view + depth map (yields a limited view synthesis range)
- more than two views (plus depth maps) for larger displays, such as the iPad and in-car displays/TVs
What compression format should be used for the texture images of the views? This could be driven by the resolution of the display, since a high level of texture detail might not be noticeable on a small screen.
What compression format is efficient for the depth maps without affecting the quality of the synthesized views? Should the depth maps be compressed at all?
- Will MVC be suitable for depth maps?
How much will reducing the quality of one of the views, in order to save bandwidth, affect the synthesis process at the receiver side?
- Will the effect be significant given that the receiver's display is small?
How will view synthesis and associated operations (e.g. 3D warping and hole filling) at the receiver side affect the power consumption of the device?
Should we focus only on reducing the amount of data that needs to be transmitted, as antennas consume a significant amount of power? Or should we experiment and see how those two variables are tied together, given that decoding 3D video is a much more resource-intensive process?
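The last two questions can be framed with a simple additive energy model: energy spent receiving bits plus energy spent decoding and synthesizing frames. The sketch below is only a way to reason about the trade-off. The function name, the model itself, and every number in it are placeholder assumptions, not measurements; real values would have to come from profiling actual devices.

```python
def energy_per_second(bitrate_kbps, nj_per_bit, decode_mj_per_frame,
                      synth_mj_per_frame, fps=30):
    """Estimate the energy (in millijoules) spent per second of video.

    Assumes radio energy scales linearly with received bits and compute
    energy scales linearly with the number of frames.
    """
    radio_mj = bitrate_kbps * 1000 * nj_per_bit * 1e-6   # nJ -> mJ
    compute_mj = fps * (decode_mj_per_frame + synth_mj_per_frame)
    return radio_mj + compute_mj


# Placeholder configurations, purely illustrative:
# A sends more data but leaves little synthesis work to the receiver,
# B sends less data but requires heavier synthesis on the device.
config_a = energy_per_second(1000, nj_per_bit=100, decode_mj_per_frame=2,
                             synth_mj_per_frame=1)
config_b = energy_per_second(600, nj_per_bit=100, decode_mj_per_frame=2,
                             synth_mj_per_frame=6)
```

With these placeholder numbers the lower-bitrate configuration actually costs more energy overall, which is why the two variables should be measured together rather than minimizing transmitted data alone.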

Tools

Joint Multiview Video Coding (JMVC) Reference Software (:pserver:jvtuser@garcon.ient.rwth-aachen.de:/cvs/jvt)
View Synthesis Based on Disparity/Depth (ViSBD) Reference Software
Compute Unified Device Architecture (CUDA)

References

Mobile Form Factor Will Bring 3D to Mainstream Market