MASH: Adaptive Streaming of Multiview Videos over HTTP

From NMSL
Revision as of 15:19, 18 November 2016 by Kdiab

People

  • Khaled Diab
  • Mohamed Hefeeda


Overview

Multiview videos offer an unprecedented viewing experience by allowing users to explore scenes from different angles and perspectives. Thus, such videos have been gaining substantial interest from major content providers such as Google and Facebook. Adaptive streaming of multiview videos is, however, challenging because of Internet dynamics and the diversity of users' interests and network conditions. To address this challenge, we propose a novel rate adaptation algorithm for multiview videos, called MASH. Streaming multiview videos is more user-centric than streaming single-view videos, because it heavily depends on how users interact with the different views. To efficiently support this interactivity, MASH constructs probabilistic view switching models that capture the switching behavior of the user in the current session, as well as the aggregate switching behavior across all previous sessions of the same video. MASH then utilizes these models to dynamically assign relative importance to different views. Furthermore, MASH uses a new buffer-based approach to request video segments of various views at different qualities, such that the quality of the streamed videos is maximized while network bandwidth is not wasted. We have implemented a multiview video player and integrated MASH into it. We compare MASH against the state-of-the-art algorithm used by YouTube for streaming multiview videos. Our experimental results show that MASH produces much higher and smoother quality than the YouTube algorithm, while being more efficient in using the network bandwidth. In addition, we conduct large-scale experiments with up to 100 concurrent multiview streaming sessions, and show that MASH maintains fairness across competing sessions and does not overload the streaming server.

Details

Figure 1 shows a high-level overview of MASH, which runs at the client side. MASH combines the outputs of the global and local view switching models to produce a relative importance factor <math>\beta_i</math> for each view <math>V_i</math>. MASH also constructs a buffer-rate function <math>f_i</math> for each view <math>V_i</math>, which maps the current buffer occupancy to the segment quality to be requested. The buffer-rate functions are dynamically updated during the session whenever a view switch happens. MASH strives to produce smooth, high-quality playback for all views, while not wasting bandwidth, by carefully prefetching views that will likely be watched.


Fig. 1: High-level overview of MASH.


View Switching Models

MASH combines the outputs of two stochastic models (local and global) to estimate the likelihood of different views being watched. We define each view switching model as a discrete-time Markov chain (DTMC) with <math>N</math> states, where <math>N</math> is the number of views. View switching is allowed at discrete time steps of length <math>\Delta</math>, which reflects the physical constraint on how fast the user can interact with the video.

Local Model: This model captures the user's activities during the current streaming session, and it evolves with time. That is, the model is dynamic and is updated with every view switching event in the session. The model maintains a count matrix <math>M(t)</math> of size <math>N \times N</math>, where <math>M_{ij}(t)</math> is proportional to the number of times the user switched from view <math>V_i</math> to view <math>V_j</math> from the beginning of the session up to time <math>t</math>. The count matrix <math>M(t)</math> is initialized to all ones. Whenever a view switch occurs, the corresponding element of <math>M(t)</math> is incremented. The count matrix is then used to compute the probability transition matrix <math>L(t)</math> of the local model.
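As a concrete illustration, the local model can be maintained as follows. This is a minimal sketch; the variable names and the row-normalization step used to derive <math>L(t)</math> from <math>M(t)</math> are our assumptions, not details from the paper:

```python
import numpy as np

N = 4  # number of views

# Count matrix M(t), initialized to all ones as described above.
M = np.ones((N, N))

def record_switch(M, i, j):
    """Record a view switch from view V_i to view V_j."""
    M[i, j] += 1

def local_transition_matrix(M):
    """Derive the transition matrix L(t) by normalizing each row of M(t)."""
    return M / M.sum(axis=1, keepdims=True)

# Example: the user switches from view 0 to view 2 twice.
record_switch(M, 0, 2)
record_switch(M, 0, 2)
L = local_transition_matrix(M)
```

Initializing <math>M(t)</math> to all ones keeps every transition probability strictly positive, so views the user has not visited yet are never ruled out entirely.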

Global Model: This model aggregates users' activities across all streaming sessions that have been served by the server so far. At the beginning of a streaming session, the client downloads the global model parameters from the server. We use <math>G</math> to denote the transition matrix of the global model, where <math>G_{ij} = p(V_j | V_i)</math> is the probability of switching to <math>V_j</math> given <math>V_i</math>. If this is the first streaming session, <math>G_{ij}</math> is initialized to <math>1/N</math> for every <math>i</math> and <math>j</math>.

Combined Model: The local and global models complement each other in predicting the (complex) switching behavior of users while watching multiview videos. For example, in some streaming sessions, the user's activity may significantly deviate from the global model's expectations, because the user is exploring the video from viewing angles different from those chosen by most previous users, or because the multiview video is new and the global model has not yet captured the expected view switching pattern. On the other hand, the local model may not be very helpful when the user has not made enough view switches yet, e.g., at the beginning of a streaming session. We therefore combine the local and global models to compute an importance factor <math>\beta_i</math> for each view <math>V_i</math> by linearly combining <math>G</math> and <math>L(t)</math> using a weight factor <math>\alpha_i</math>. This weight factor is carefully computed to dynamically adjust the relative weights of the global and local models.
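One plausible way to realize this combination is sketched below. The maximum-based scaling of <math>\beta</math> and the fixed value of <math>\alpha</math> are our assumptions for illustration; the paper computes <math>\alpha_i</math> dynamically:

```python
import numpy as np

def importance_factors(G, L, current_view, alpha):
    """Combine the global (G) and local (L) transition matrices into
    per-view importance factors beta, given the currently watched view.
    alpha in [0, 1] weighs the local model against the global one."""
    combined = alpha * L[current_view] + (1 - alpha) * G[current_view]
    beta = combined / combined.max()  # scale so the most likely next view gets 1
    beta[current_view] = 1.0          # the active view always has beta = 1
    return beta

# Example with N = 3 views: a uniform global model and a local model
# in which the user has mostly stayed on (or returned to) view 0.
G = np.full((3, 3), 1 / 3)
L = np.array([[0.5, 0.3, 0.2],
              [1/3, 1/3, 1/3],
              [1/3, 1/3, 1/3]])
beta = importance_factors(G, L, current_view=0, alpha=0.5)
```

In this example the likely next views inherit larger <math>\beta</math> values, so MASH would buffer them deeper and at higher quality than the unlikely ones.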

MASH: The Proposed Algorithm

MASH is a buffer-based rate adaptation algorithm for multiview videos, which means it determines the requested segment quality based on the buffer occupancy level, and it does not need to estimate the network capacity.

Rate adaptation for multiview videos is far more complex than for single-view videos, as it needs to handle many views of different importance, while neither wasting network bandwidth nor causing many playback stalls for re-buffering. To handle this complexity, we propose employing a family of buffer-rate functions that captures the relative importance of the active and inactive views and how this relative importance dynamically changes during the streaming session. Specifically, we define a function <math>f_i (B_i(t))</math> for each view <math>V_i</math>, which maps the buffer level <math>B_i(t)</math> of that view to a target quality <math>Q_i(t)</math> based on its importance factor <math>\beta_i</math> at time <math>t</math>. We use <math>\beta_i</math> to limit the maximum buffer occupancy level of view <math>V_i</math> as <math>B_{max,i} = \beta_i \times B_{max}</math>. Since we set <math>\beta_i = 1</math> for the active view, the algorithm can request its segments up to the maximum quality <math>Q_{max,i}</math>. For inactive views, MASH can request segments up to a fraction of their maximum qualities. Figure 2 illustrates the buffer-rate functions for two views <math>V_i</math> and <math>V_j</math>, where <math>V_i</math> is the active view, so <math>B_{max,i} = B_{max}</math>. The figure shows when requests stop for both <math>V_i</math> and <math>V_j</math>, and the maximum bitrate difference that reflects the importance of each view.
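The sketch below illustrates one possible buffer-rate function. The linear ramp, the scaling of the target bitrate by <math>\beta_i</math>, and the parameter values are our assumptions; the paper defines the exact shape of <math>f_i</math>. The quality levels are those of the video used in the evaluation:

```python
def target_quality(buffer_level, beta_i, B_max, qualities):
    """Map the buffer occupancy (in seconds) of view V_i to the bitrate to
    request next. qualities lists the available bitrates (Mbps) in
    increasing order. Returns None when this view's buffer is full."""
    B_max_i = beta_i * B_max          # importance caps the buffer size
    if buffer_level >= B_max_i:
        return None                   # stop requesting segments for this view
    q_min, q_max = qualities[0], qualities[-1]
    # Illustrative linear ramp: empty buffer -> lowest bitrate,
    # nearly full buffer -> a beta_i fraction of the highest bitrate.
    target = q_min + (buffer_level / B_max_i) * (beta_i * q_max - q_min)
    # Request the highest available bitrate not exceeding the target.
    feasible = [q for q in qualities if q <= target]
    return feasible[-1] if feasible else q_min

Q = [0.5, 1, 1.6, 2.8]  # Mbps, the quality levels of the evaluation video
```

For example, with a hypothetical <math>B_{max}</math> of 30 seconds, an active view (<math>\beta_i = 1</math>) with a half-full buffer maps to 1.6 Mbps, while an inactive view with <math>\beta_i = 0.5</math> stops being requested once its buffer exceeds 15 seconds.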


Fig. 2: Proposed buffer-rate functions of views <math>V_i</math> (active) and <math>V_j</math> (inactive)


Note: We show that the global and local view switching models converge to their corresponding stationary distributions, and we calculate the number of steps needed to converge (the details are in the paper).
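The convergence behavior can be checked numerically with a standard power iteration. This is only an illustrative check on an assumed example chain, not the paper's analytical derivation:

```python
import numpy as np

def stationary_distribution(P, tol=1e-9, max_steps=10_000):
    """Iterate pi <- pi P from the uniform distribution until the total
    change falls below tol; return the distribution and the step count."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for step in range(1, max_steps + 1):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt, step
        pi = nxt
    return pi, max_steps

# Example: a 2-view chain whose stationary distribution is (5/6, 1/6).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi, steps = stationary_distribution(P)
```

Because the count matrices keep all entries strictly positive, both transition matrices describe irreducible, aperiodic chains, which is what guarantees that such a unique stationary distribution exists.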

Evaluation

We have implemented a complete multiview video player in about 4,000 lines of Java code. It consists of an HTTP client, decoders, a renderer, and a rate adapter. Each view has its own decoder and frame buffer. The rate adapter decides which segments to request and at which quality levels. Segments are then fetched from the HTTP server. Once a segment is fetched, it is decoded into frames that are stored in the corresponding frame buffer. The renderer has references to all frame buffers and renders frames from the currently active view.

Our testbed consists of multiple virtual machines (VMs) running on the Amazon cloud. We chose high-end VMs with 1 Gbps links, so that the shared cloud environment does not interfere much with our network setup. When we compare against YouTube, the HTTP server is YouTube itself and we run the YouTube multiview client embedded in the Google Chrome web browser. In other experiments, we use nginx as our HTTP server, and users run our multiview player with different rate adaptation algorithms. The bandwidth and latency of the network links connecting the VMs to the server are controlled using the Linux traffic control tool (<math>\texttt{tc}</math>). We experiment with multiple network conditions to stress our algorithm. We use a multiview video released by YouTube. The video shows a concert and has four different views shot by four cameras, covering the singer, the band, the stage, and the fans. The user is allowed to switch among the four views at any time. The video is about 350 sec long, and it has four quality levels <math>Q = \{0.5, 1, 1.6, 2.8\}</math> Mbps. YouTube did not release other multiview videos, and we cannot upload our own multiview videos to YouTube because of the proprietary nature of its multiview player.

Publications

NA