Private:3DV Remote Rendering


Here we describe the components of a 3D video remote rendering system for mobile devices based on cloud computing services. We also discuss the main design choices and challenges that need to be addressed in such a system.


Components

The system will be composed of three main components:

  • Mobile receiver(s)
  • Adaptation proxy
  • View synthesis and rendering cloud service


Transmission is to be carried via unicast over an unreliable wireless channel. A feedback channel between the receiver and the proxy will be necessary. This channel would carry information about the current/desired viewpoint, buffer status, and network conditions, in addition to statistics about the mobile device itself (e.g., current battery level, screen resolution, expected power available for processing, etc.). Such a feedback channel is crucial for a fully adaptive algorithm that can quickly react to changes in any of these parameters.
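
As a concrete illustration, below is a minimal sketch of what a single feedback report from the receiver to the proxy could contain. The field names, units, and JSON encoding are our own assumptions, not a defined protocol.

    from dataclasses import dataclass, asdict
    import json
    import time

    @dataclass
    class ReceiverFeedback:
        """Hypothetical feedback report sent from the mobile receiver to the adaptation proxy."""
        timestamp: float           # seconds since the epoch
        desired_viewpoint: float   # normalized position between the two transmitted views
        buffer_level_ms: int       # playout buffer occupancy in milliseconds
        est_bandwidth_kbps: int    # receiver-side estimate of available downlink bandwidth
        loss_rate: float           # recent packet loss ratio on the wireless channel
        battery_level: float       # 0.0 (empty) .. 1.0 (full)
        screen_width: int          # display resolution in pixels
        screen_height: int
        power_budget_mw: int       # power the device is willing to spend on decoding/rendering

        def to_json(self) -> str:
            return json.dumps(asdict(self))

    # Example report the proxy would parse before its next adaptation decision.
    report = ReceiverFeedback(time.time(), 0.4, 1800, 2500, 0.02, 0.65, 800, 480, 350)
    print(report.to_json())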

Because of the limited wireless bandwidth, we need efficient and adaptive compression of the transmitted views/layers. In addition, an unequal error protection (UEP) technique will be required to overcome the unreliable nature of the wireless channel. Multiple description coding (MDC) has already been used in some experiments, and the results seem quite promising for both multiview video coding and video plus depth.
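
To make the MDC idea concrete, here is a minimal sketch of the simplest temporal variant, where odd and even frames form two independently decodable descriptions; real MDC for multiview or video-plus-depth content would operate on coded views or layers, so this is only illustrative.

    def split_into_descriptions(frames):
        """Split a frame sequence into two temporal descriptions (even and odd frames)."""
        return frames[0::2], frames[1::2]

    def merge_descriptions(desc_even, desc_odd):
        """Re-interleave whatever descriptions arrived; play a single one if the other was lost."""
        if not desc_even:   # description 1 lost: half frame rate from description 2
            return list(desc_odd)
        if not desc_odd:    # description 2 lost: half frame rate from description 1
            return list(desc_even)
        merged = []
        for even_frame, odd_frame in zip(desc_even, desc_odd):
            merged.extend([even_frame, odd_frame])
        return merged

    frames = [f"frame{i}" for i in range(8)]
    d1, d2 = split_into_descriptions(frames)
    print(merge_descriptions(d1, []))   # graceful degradation when one description is lost
    print(merge_descriptions(d1, d2))   # full frame rate when both descriptions arrive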

It is assumed that the mobile receiver has a display capable of rendering at least two views. Interfacing with the mobile receiver's display may be an issue, since this will only be possible through a predefined driver API; whether these APIs will be exposed, and which 3D image format they expect, will vary from one device to another. We believe that such autostereoscopic displays will probably come with their own IP hardware to perform rendering operations such as 3D warping, hole filling, etc.; whether we would have control over this process is still unknown. In the meantime, it is possible to experiment by sending data that can be rendered in 2D, as in most of the experiments we have read about so far. This would let us establish the feasibility of our scheme and benchmark it against previous work.

Based on receiver feedback, the adaptation proxy is responsible for selecting the best views to send, performing rate adaptation based on current network conditions, and encoding the views quickly and efficiently. We can utilize rate-distortion (RD) optimization techniques for rate adaptation.
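
As a sketch of the kind of decision the proxy would make, the following picks one operating point per transmitted stream so that total distortion is minimized within the rate budget reported by the receiver. The operating points and numbers are invented for illustration; a real implementation would use measured RD curves and a faster search (e.g. Lagrangian optimization).

    from itertools import product

    # Hypothetical per-stream operating points: (bitrate in kbps, distortion in MSE).
    options = {
        "left_view":  [(400, 40.0), (800, 22.0), (1200, 15.0)],
        "right_view": [(400, 40.0), (800, 22.0), (1200, 15.0)],
        "depth":      [(100, 12.0), (200, 7.0)],
    }

    def select_operating_points(options, rate_budget_kbps):
        """Brute-force RD selection: minimize total distortion subject to the rate budget."""
        best = None
        for combo in product(*options.values()):
            rate = sum(r for r, _ in combo)
            dist = sum(d for _, d in combo)
            if rate <= rate_budget_kbps and (best is None or dist < best[1]):
                best = (combo, dist)
        return dict(zip(options.keys(), best[0])) if best else None

    # The budget would come from the receiver's bandwidth feedback (e.g. 2000 kbps).
    print(select_operating_points(options, rate_budget_kbps=2000))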

It is important to distinguish what the proxy and the cloud service would each accomplish. For example, a multiview-plus-depth scheme could be used in order to support a broad array of devices, such as phones and tablets. From a few real, filmed views, a server could interpolate the extra views required, where the number of extra views depends on the capabilities of the mobile device. This would provide considerable flexibility at a low storage cost: only the original views would need to be stored. Experimentally, we would have to see whether it is feasible to generate the interpolated views on the fly, or whether such views need to be stored on the server as well.
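
A minimal sketch of how the cloud side might serve interpolated views on demand with a cache is shown below; synthesize_view() is a placeholder for an actual view interpolation (e.g. DIBR) routine, and the whole structure is our assumption rather than a fixed design.

    def synthesize_view(left, right, alpha):
        """Placeholder for real view interpolation between two stored views."""
        return f"interpolated({left}, {right}, alpha={alpha:.2f})"

    class ViewServer:
        def __init__(self, stored_views):
            # stored_views maps camera position (float) -> original filmed view
            self.stored = dict(sorted(stored_views.items()))
            self.cache = {}

        def get_view(self, position):
            if position in self.stored:        # original view: serve directly
                return self.stored[position]
            if position not in self.cache:     # synthesize on first request, then reuse
                left = max(p for p in self.stored if p < position)
                right = min(p for p in self.stored if p > position)
                alpha = (position - left) / (right - left)
                self.cache[position] = synthesize_view(self.stored[left],
                                                       self.stored[right], alpha)
            return self.cache[position]

    server = ViewServer({0.0: "cam0", 1.0: "cam1"})
    print(server.get_view(0.4))   # synthesized and cached on the fly
    print(server.get_view(1.0))   # original filmed view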


Design Choices

  • What is the format of the stored video files?
  • How many views (and possibly depth maps) need to be sent to the receiver?
    • two views (receiver needs to construct a disparity/depth map and synthesize intermediate views)
    • two views + two depth maps (receiver can then synthesize any intermediate view between the received ones)
    • one view + depth map (yields a limited view synthesis range)
    • more than two views (and depth maps) for larger displays, such as the iPad and in-car displays/TVs
  • What compression format should be used to compress the texture images of the views? This could be driven by the resolution of the display, where a high level of texture detail might not be noticeable.
  • What compression format is efficient for compressing the depth maps without affecting the quality of synthesized views? Should depth maps be compressed at all?
    • Will MVC be suitable for depth maps?
  • How much will reducing the quality of one of the views to save bandwidth affect the synthesis process at the receiver side?
    • Will the effect be significant given that the receiver's display is small?
  • How will view synthesis and associated operations (e.g. 3D warping and hole filling) at the receiver side affect the power consumption of the device?
  • Should we focus only on reducing the amount of data to be transmitted, since the radio consumes a significant amount of power? Or should we experiment to see how these two variables are tied together, given that decoding 3D video is a much more resource-intensive process? (A rough energy model is sketched below.)
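
The sketch below is a rough energy model for that last question; all power and bandwidth numbers are placeholders that would have to be replaced by actual measurements on the target device.

    def receive_energy_j(bitrate_kbps, duration_s, radio_power_w=1.0, link_kbps=4000):
        """Radio energy, assuming the antenna is active only while data is arriving."""
        active_fraction = min(1.0, bitrate_kbps / link_kbps)
        return radio_power_w * active_fraction * duration_s

    def compute_energy_j(decode_power_w, synthesis_power_w, duration_s):
        """Energy spent decoding the received streams and synthesizing extra views."""
        return (decode_power_w + synthesis_power_w) * duration_s

    duration = 60.0  # one minute of playback

    # Option A: send two views + two depth maps; the device synthesizes intermediate views.
    option_a = receive_energy_j(1800, duration) + compute_energy_j(0.8, 0.6, duration)

    # Option B: send more/better views; the device does little or no synthesis.
    option_b = receive_energy_j(3000, duration) + compute_energy_j(0.8, 0.0, duration)

    print(f"Option A (less data, more computation): {option_a:.1f} J")
    print(f"Option B (more data, less computation): {option_b:.1f} J")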


Readings and Thoughts

  • Almost all of the autostereoscopic mobile 3D displays available at the moment are two-view displays. Increasing the number of views in an autostereoscopic display comes at the price of reduced spatial resolution and lower brightness. For small-screen, battery-powered mobile devices, the trade-off between the number of views and spatial resolution is very important. Since a mobile device is normally watched by a single observer, two independent views are considered sufficient for satisfactory 3D perception and a good compromise with respect to spatial resolution.
  • A recent survey published in the Proceedings of the IEEE states that the video-plus-depth (V+D) coding approach is preferred over multiview video coding (MVC) for scalability, but it might be too computationally demanding for terminal devices, as it requires view rendering and hence makes the device less power efficient. This statement needs to be verified by actual implementation and measurements.
  • If we reduce the depth map resolution by down-sampling to lower the bitrate of the transmitted stream, we will need to perform depth enhancement operations after reconstruction at the receiver side. These operations can be computationally expensive and may drain the mobile receiver's battery.
  • When deciding which stream is more important, texture or depth, in order to implement a prioritization technique, we face a dilemma. A high-resolution texture stream is required for backward compatibility in case the device is only 2D capable, yet a high-quality depth map is very important to avoid shape/surface deformations after reconstructing the 3D scene.
  • Dividing the task of rendering the 3D video between the client and the server is also not trivial. For example, generating on the server part of the view that will be rendered at the client side is not possible, because we do not know in advance which viewpoint the client will render in response to user input. Moreover, doing so defeats the goal of sending the client only two neighboring views and delegating the rendering of intermediate views to it in order to reduce viewpoint-change latency.
  • Attempting to utilize the GPU to speed up the view synthesis process also has its challenges. One main issue that may hinder significant speedups is the 3D warping process. The mapping between pixels in the reference view and pixels in the target view is not one-to-one: several pixels may map to the same location in the target view, causing a conflict that must be resolved based on which pixel belongs to a foreground object and which to the background. Thus, warping the pixels in parallel will exhibit shared-resource contention. How much this affects the achievable speedups depends on the amount of contention and needs to be determined (a sketch of the conflict resolution appears after this list).
  • Scalable asymmetric coding of stereoscopic video can be achieved using SVC. Asymmetric encoding exploits the binocular suppression theory, which states that the human visual system (HVS) tolerates the lack of high-frequency components in one of the views; hence, one of the views may be presented at a lower quality without degrading the perceived quality of the 3D video. It is possible to obtain spatially and/or quality-scalable right and left views if they are simulcast coded using the SVC standard: either both views are encoded using SVC, or one view is SVC encoded and the other is encoded with H.264/AVC. Thus, when the available bandwidth exceeds the maximum rate of the video, the quality of the scalable bitstream dominates the perceived quality; otherwise, the non-scalable bitstream becomes the high-quality member of the asymmetric pair. The concept of asymmetric coding can be extended to multiple views, where, for example, every other view is scalable coded using SVC.
  • In the case of multiview-plus-depth (MVD) coding, view scalability and temporal scalability are achievable if MVC is used, while spatial scalability and quality scalability are achievable if SVC is used. The coding rate of the views and the corresponding depth maps can be modified during streaming to adapt to dynamic network conditions (depth maps generally require 15-20% of the video bitrate to produce acceptable results; see the bit-budget sketch after this list).
  • The paper by Hewage et al. uses subjective testing to measure the effect of compression artifacts and frame losses on the perceived quality (namely, overall image quality and depth perception) of the reconstructed stereoscopic video. Their results show that the VQM objective metric can be mapped so that it correlates strongly with both the viewers' overall perception of image quality and their depth perception. They conclude that VQM is a good predictor of both the perceived overall image quality and the perceived depth of 3D video, whether the channel is error free or exhibits packet losses.
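
The following is a minimal sketch of the pixel-competition problem in forward 3D warping, using a z-buffer to resolve conflicts; the simple horizontal-shift disparity model and the convention that larger depth values are nearer are assumptions for illustration. On a GPU, the conflict check is exactly where the contention mentioned above arises (e.g. it would need atomic operations in a CUDA kernel).

    import numpy as np

    def forward_warp(ref_color, ref_depth, disparity_scale=0.05):
        """Warp a reference view horizontally by a depth-proportional disparity."""
        h, w = ref_depth.shape
        target = np.zeros_like(ref_color)
        zbuffer = np.full((h, w), -1, dtype=np.int32)      # winning depth per target pixel
        for y in range(h):
            for x in range(w):
                d = int(ref_depth[y, x])
                x_t = x + int(round(disparity_scale * d))  # simple horizontal-shift model
                if 0 <= x_t < w and d > zbuffer[y, x_t]:   # nearer pixel wins the conflict
                    zbuffer[y, x_t] = d
                    target[y, x_t] = ref_color[y, x]
        holes = zbuffer < 0  # disoccluded pixels that still need hole filling
        return target, holes

    color = np.random.randint(0, 256, (4, 8), dtype=np.uint8)
    depth = np.random.randint(0, 256, (4, 8), dtype=np.uint8)
    warped, holes = forward_warp(color, depth)
    print(int(holes.sum()), "pixels left for hole filling")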
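
And a small sketch of the bit-budget split implied by the 15-20% figure above; the exact ratio and the equal split across views are assumptions that would be tuned experimentally.

    def split_bit_budget(total_kbps, num_views, depth_ratio=0.18):
        """Split a total MVD budget into per-view texture and depth rates,
        with depth taking roughly depth_ratio of the texture (video) bitrate."""
        per_view = total_kbps / num_views
        texture_kbps = per_view / (1.0 + depth_ratio)
        depth_kbps = per_view - texture_kbps
        return texture_kbps, depth_kbps

    texture, depth = split_bit_budget(total_kbps=3000, num_views=2)
    print(f"per view: {texture:.0f} kbps texture, {depth:.0f} kbps depth")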


Tools

  • Joint Multiview Video Coding (JMVC) Reference Software (:pserver:jvtuser@garcon.ient.rwth-aachen.de:/cvs/jvt)
  • View Synthesis Based on Disparity/Depth (ViSBD) Reference Software
  • Compute Unified Device Architecture (CUDA)


References

  • Mobile Form Factor Will Bring 3D to Mainstream Market
  • C. Hewage, S. Worrall, S. Dogan, S. Villette, and A. Kondoz, "Quality Evaluation of Color Plus Depth Map-Based Stereoscopic Video," IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, April 2009.