ACMM08

From NMSL

Here are some of the results from running the new code. SigThresh is hard-coded at 0.25, which seems to work well for various types of video. The frame-distance plots are given in discussion_18apr08.pdf.

Video                               # Frames   Summary Image
city                                299        city
ice                                 239        ice
foreman                             299        foreman
soccer                              299        soccer
doc_reality (from CBC)              2000       doc_reality
car surveillance video 1            490        car surveillance 1
car surveillance video 2 (night)    540        car surveillance 2
car surveillance video 3            420        car surveillance 3

As the extracted summaries show, the algorithm succeeds in detecting shots. For videos that contain only one shot, such as the city and ice sequences, only one key frame is extracted, while for multi-shot videos several frames are extracted. A good example is the doc_reality sequence provided by CBC, which has many shots. In addition, since we are using all three color components (HSV), changes in brightness are detected, e.g. the first few key frames in doc_reality. We may want to change this for surveillance applications and use only H and S.
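To illustrate why dropping V would hide brightness changes, here is a minimal sketch of a histogram-based frame distance with a selectable set of channels. The distance metric (mean L1 distance between normalized per-channel histograms), the bin count, and the names channel_histogram and frame_distance are assumptions made for illustration; the notes do not state which metric the actual code uses.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # (H, S, V), each assumed scaled to [0, 1]


def channel_histogram(pixels: Sequence[Pixel], channel: int, bins: int = 16) -> List[float]:
    """Normalized histogram of one HSV channel (pixels must be non-empty)."""
    counts = [0] * bins
    for p in pixels:
        idx = min(int(p[channel] * bins), bins - 1)  # clamp value 1.0 into the last bin
        counts[idx] += 1
    total = len(pixels)
    return [c / total for c in counts]


def frame_distance(a: Sequence[Pixel], b: Sequence[Pixel],
                   channels: Sequence[int] = (0, 1, 2), bins: int = 16) -> float:
    """Mean L1 distance between per-channel histograms (assumed metric), in [0, 2]."""
    dist = 0.0
    for ch in channels:
        ha = channel_histogram(a, ch, bins)
        hb = channel_histogram(b, ch, bins)
        dist += sum(abs(x - y) for x, y in zip(ha, hb))
    return dist / len(channels)


# Two synthetic "frames" that differ only in brightness (V).
dark = [(0.30, 0.50, 0.20)] * 100
bright = [(0.30, 0.50, 0.80)] * 100

d_hsv = frame_distance(dark, bright)                  # all three channels
d_hs = frame_distance(dark, bright, channels=(0, 1))  # H and S only
```

With all three channels the brightness change produces a nonzero distance, while with H and S only the two frames are indistinguishable.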

The algorithm fails to extract meaningful key frames for the surveillance videos, especially the last two. As the plots show, although there is a significant rise/drop in the distance, no new key frame is extracted. This is semantically correct since there is no new shot. However, we need to address this issue, either by parameter tuning or by changing the algorithm. The characteristics of surveillance videos can help us tune the algorithm for such applications. For example, since the background in surveillance videos is usually steady, we may want to subtract an average histogram (computed progressively) from the histogram of every video frame, so that small changes can be detected more easily.
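The background-subtraction idea can be sketched as follows. This is only one possible reading of "an average histogram (computed progressively)": a cumulative running mean updated one frame at a time. The class name RunningBackground and the choice to return a zero residual for the first frame are assumptions for illustration.

```python
from typing import List


class RunningBackground:
    """Progressively averaged histogram, updated one frame at a time."""

    def __init__(self, bins: int) -> None:
        self.mean = [0.0] * bins
        self.count = 0

    def residual(self, hist: List[float]) -> List[float]:
        """Return hist minus the current average, then fold hist into the average."""
        if self.count == 0:
            res = [0.0] * len(hist)  # no background estimate yet (assumed convention)
        else:
            res = [h - m for h, m in zip(hist, self.mean)]
        self.count += 1
        # Incremental mean: m_new = m_old + (h - m_old) / n
        self.mean = [m + (h - m) / self.count for h, m in zip(hist, self.mean)]
        return res


# Nine frames of a steady background, then a frame where a small object appears.
background = [0.5, 0.5, 0.0, 0.0]
with_object = [0.4, 0.5, 0.1, 0.0]

bg = RunningBackground(bins=4)
static_energy = [sum(abs(r) for r in bg.residual(background)) for _ in range(9)]
object_energy = sum(abs(r) for r in bg.residual(with_object))
```

The residual stays at zero while the scene is static and becomes nonzero only when the object enters, so a small change is no longer masked by the dominant background bins.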

Hierarchical Summarization: From what I understand, the distance of each frame is measured to the most recent key frame. This means that we trace the distance and, when it reaches its maximum, declare a new key frame. The algorithm provides us with root key frames as starting points. These frames are intuitively the ones that are farthest from each other, thus maximizing the frame coverage.

In order to add more detail to the summary, we process each shot individually. Therefore, for an online system, we need to keep all the frames between two consecutive root key frames (peaks in the plot). From all the frames in the shot, we calculate the maximum distance to the previous root key frame. From the number of frames in the current shot and the desired summarization ratio, we can calculate the number of extra frames we need, M. The distance interval (on the y-axis) is divided into M+1 equal segments and M frames are selected. This approach maximizes the distance (visual difference) between the extra key frames and the root key frames. Note that this is very different from uniform temporal sampling (equal distances on the x-axis), as it provides more detail for the parts that have more changes. In other words, it reflects the content progression of the video instead of its temporal progression, by taking more samples only if there is more visual change.

We still need to come up with a technique to allocate key frames to shots, since some shots do not need as many frames as others. The distance plot can help us identify these shots, which correspond to flat parts of the plot (e.g. the last segment in foreman).
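The level-based selection within a shot could look roughly like the sketch below. It assumes the distances of the shot's frames to the previous root key frame are already available as a list, takes M as an input, and assumes that the frame chosen for each level is the first one whose distance reaches that level; the notes do not specify this rule, and the name extra_key_frames is hypothetical.

```python
from typing import List


def extra_key_frames(distances: List[float], m: int) -> List[int]:
    """Pick up to m frame indices at equally spaced distance levels."""
    if m <= 0 or not distances:
        return []
    peak = max(distances)  # maximum distance to the previous root key frame
    picked: List[int] = []
    for k in range(1, m + 1):
        level = k * peak / (m + 1)  # k-th boundary of the M+1 equal segments
        # First frame whose distance reaches this level (assumed selection rule).
        idx = next(i for i, d in enumerate(distances) if d >= level)
        if idx not in picked:
            picked.append(idx)
    return picked


# A shot that is nearly static for four frames and then changes quickly.
shot = [0.0, 0.01, 0.02, 0.03, 0.30, 0.55, 0.60, 0.75]
selected = extra_key_frames(shot, 2)   # levels at 0.25 and 0.5
flat = extra_key_frames([0.0, 0.0, 0.0], 2)
```

Here both extra frames fall in the second half of the shot, where the visual change happens, whereas uniform temporal sampling would spend one of them on the static part. Because duplicate indices are dropped, a completely flat shot yields a single frame regardless of M.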