ACMM08

From NMSL
 
Latest revision as of 16:02, 18 April 2008

Here are some of the results from running the new code. SigThresh is hard-coded at 0.25, which seems to work well for various types of video. The plots of frame distance are given in discussion_18apr08.pdf.

Video                             #Frames   Summary Image
city                              299       city
ice                               239       ice
foreman                           299       foreman
soccer                            299       soccer
doc_reality (from CBC)            2000      doc_reality
car surveillance video 1          490       car surveillance 1
car surveillance video 2 (night)  540       car surveillance 2
car surveillance video 3          420       car surveillance 3

As can be seen from the extracted summaries, the algorithm succeeds in detecting shots. For videos that contain only one shot, such as the city and ice sequences, only one key frame is extracted, while for multi-shot videos several frames are extracted. A good example is the doc_reality sequence provided by CBC, which has a lot of shots. In addition, since we are using all three color elements (HSV), changes in brightness are also detected, e.g. the first few key frames in doc_reality. We may want to change this for surveillance applications and use only H and S.
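The detection step described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the notes only fix SigThresh = 0.25, so the bin count and the L1 histogram distance used here are assumptions.

```python
import numpy as np

SIG_THRESH = 0.25  # hard-coded threshold from the experiments

def hsv_histogram(frame_hsv, bins=8):
    """Concatenated per-channel histograms (H, S, V), normalized to sum 1.
    For surveillance video we could drop the V channel here to ignore
    brightness changes."""
    hists = []
    for ch in range(3):
        h, _ = np.histogram(frame_hsv[..., ch], bins=bins, range=(0.0, 1.0))
        hists.append(h)
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def l1_distance(h1, h2):
    """Half the L1 norm of the difference; lies in [0, 1] for
    normalized histograms."""
    return 0.5 * np.abs(h1 - h2).sum()

def extract_key_frames(frames_hsv):
    """Declare a new key frame whenever the distance to the last
    key frame exceeds SIG_THRESH."""
    key_idx = [0]
    ref = hsv_histogram(frames_hsv[0])
    for i, frame in enumerate(frames_hsv[1:], start=1):
        h = hsv_histogram(frame)
        if l1_distance(ref, h) > SIG_THRESH:
            key_idx.append(i)
            ref = h
    return key_idx
```

A single-shot sequence like city yields only the first frame as a key frame, while an abrupt content change triggers a new one.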

The algorithm fails to extract meaningful key frames for the surveillance videos, especially the last two. As can be seen from the plots, although there is a significant rise or drop in the distance, no new key frame is extracted. This is semantically correct, since there is no new shot. However, we need to address this issue either by parameter tuning or by changing the algorithm. The characteristics of surveillance videos can help us tune the algorithm for such applications. For example, since the background in surveillance videos is usually steady, we may want to subtract an average histogram (computed progressively) from all video frames, so that small changes can be detected more easily.
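The progressive average-histogram idea might look like the sketch below. The exponential update rate alpha and the clipped-residual distance are assumptions of mine; the notes only suggest subtracting a progressively computed average histogram.

```python
import numpy as np

class BackgroundAdjustedDistance:
    """Maintain a running (progressive) average histogram of the video
    and measure each frame against the residual left after subtracting
    it, so small foreground changes are not swamped by a steady
    background. alpha is an assumed update rate."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.avg = None  # background model, built progressively

    def distance(self, hist):
        if self.avg is None:
            self.avg = hist.copy()
        # part of the histogram not explained by the background
        residual = np.clip(hist - self.avg, 0.0, None)
        # progressively fold the current frame into the background model
        self.avg = (1 - self.alpha) * self.avg + self.alpha * hist
        return residual.sum()
```

A steady background drives the distance to zero, so even a small foreground change then stands out against it.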

Hierarchical Summarization: From what I understand, the distance of each frame is measured to the last key frame so far. This means that we trace the distance and, when it reaches its maximum, we declare a new key frame. The algorithm provides us with root key frames as starting points. These frames are intuitively the ones that are farthest from each other, thus maximizing the frame coverage. In order to add more detail to the summary, we process each shot individually. Therefore, for an online system, we need to keep all the frames between two subsequent root key frames (peaks in the plot). From all the frames in the shot, we calculate the maximum distance to the previous root key frame. From the number of frames in the current shot and the desired summarization ratio, we can calculate the number of extra frames we need, M. The interval (on the y-axis) is divided into M+1 equal segments and M frames are selected. This approach maximizes the distance (visual difference) between the extra key frames and the root key frames. Note that this is very different from uniform temporal sampling (equal distances on the x-axis), as it provides more detail for the parts that have more changes. In other words, it reflects the content progression of the video rather than its temporal progression, by taking more samples only where there is more visual change. We also need to come up with a technique for allocating key frames to shots: some shots do not need as many frames as others. The distance plot can help us identify these shots, which correspond to the flat parts of the plot (e.g. the last segment in foreman).
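The y-axis selection step can be sketched as below. One detail is left open by the description: which frame to take for each distance level. Here I assume we pick, for each level, the frame whose distance is closest to it; that choice is mine, not from the notes.

```python
def select_detail_frames(distances, ratio):
    """Pick M extra key frames inside one shot.

    distances[i] is the distance of frame i (within the shot) to the
    previous root key frame; ratio is the desired fraction of frames
    to keep. The peak distance is divided into M+1 equal segments on
    the y-axis, and for each of the M interior levels we select the
    frame whose distance is closest to that level."""
    n = len(distances)
    m = int(round(n * ratio))
    if m <= 0 or n == 0:
        return []
    d_max = max(distances)
    selected = []
    for k in range(1, m + 1):
        level = d_max * k / (m + 1)  # equal spacing on the y-axis
        best = min((i for i in range(n) if i not in selected),
                   key=lambda i: abs(distances[i] - level))
        selected.append(best)
    return sorted(selected)
```

Because the levels are spaced on the distance axis rather than the time axis, a shot whose distance curve climbs quickly early on contributes more key frames from that early, fast-changing part, which is the content-progression behaviour described above.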