ACMM08

Revision as of 15:50, 18 April 2008

Here are some of the results from running the new code. SigThresh is hard-coded at 0.25, which seems to work well for various types of video. The frame-distance plots are given in discussion_18apr08.pdf.

Video                              #Frames    Summary Image
city                                   299    city
ice                                    239    ice
foreman                                299    foreman
soccer                                 299    soccer
doc_reality (from CBC)                2000    doc_reality
car surveillance video 1               490    car surveillance 1
car surveillance video 2 (night)       540    car surveillance 2
car surveillance video 3               420    car surveillance 3

As can be seen from the extracted summaries, the algorithm succeeds in detecting shots. A good example is the doc_reality sequence provided by CBC, which contains many shots. In addition, since we are using all three color channels (HSV), changes in brightness are also detected, e.g., the first few key frames in doc_reality. We may want to change this for surveillance applications.
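To make the detection step concrete, here is a minimal sketch of the key-frame detection in Python with OpenCV. Only SigThresh = 0.25 and the use of all three HSV channels come from the notes above; the distance metric (L1 between normalized histograms), the bin counts, and the function names are my assumptions.

<pre>
import cv2
import numpy as np

SIG_THRESH = 0.25  # hard-coded threshold from the notes above

def hsv_hist(frame, bins=(16, 4, 4)):
    """L1-normalized 3-D histogram over all three HSV channels
    (bin counts are an assumption, not from the notes)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return hist.flatten() / hist.sum()

def extract_key_frames(video_path):
    """Declare a new key frame whenever the histogram distance to the
    last key frame exceeds SIG_THRESH (one plausible reading of the
    algorithm; the exact distance measure is assumed to be L1)."""
    cap = cv2.VideoCapture(video_path)
    key_frames, last_key_hist = [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h = hsv_hist(frame)
        if last_key_hist is None or np.abs(h - last_key_hist).sum() > SIG_THRESH:
            key_frames.append(idx)
            last_key_hist = h
        idx += 1
    cap.release()
    return key_frames
</pre>

Dropping the V channel from hsv_hist (or weighting it down) would be the simplest way to make the detector less sensitive to brightness changes for surveillance use.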

For videos that contain only one shot without much camera motion, such as the city and ice sequences, only one key frame is extracted.

The algorithm fails to extract meaningful key frames for the surveillance videos, especially the last two. As can be seen from the plots, although there is a significant rise or drop in the distance, no new shot is detected, which is semantically correct because the whole video is one shot. However, we need to either tune the parameters or change the algorithm so that it takes key frames even when there is no shot change; this suggests the algorithm needs tuning for such applications. In addition, since the background in surveillance videos is usually static, we may want to subtract an average histogram (computed progressively) from each frame's histogram, so that small changes can be detected more easily.
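A sketch of the progressive average-histogram idea, assuming the same histogram representation as in the previous sketch; the running-mean update and the BackgroundHistogram name are hypothetical, not part of the current code.

<pre>
import numpy as np

class BackgroundHistogram:
    """Progressively averaged histogram acting as a background model.
    Subtracting it from each frame histogram makes small foreground
    changes stand out (hypothetical sketch of the idea above)."""

    def __init__(self, n_bins):
        self.avg = np.zeros(n_bins)
        self.n = 0

    def residual(self, hist):
        # Distance to the background model, not to the last key frame
        d = np.abs(hist - self.avg).sum()
        # Progressive (running) mean: avg += (x - avg) / n
        self.n += 1
        self.avg += (hist - self.avg) / self.n
        return d

# Usage with the earlier hsv_hist (16*4*4 = 256 bins):
#   bg = BackgroundHistogram(256)
#   d = bg.residual(hsv_hist(frame))   # threshold d instead of the
#                                      # frame-to-key-frame distance
</pre>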

Hierarchical Summarization: From what I understand, the distance of each frame is measured to the most recent key frame. This means that we trace the distance and declare a new key frame when it reaches its maximum. The algorithm provides us with root key frames as starting points. These frames are intuitively the ones that are farthest from each other, thus maximizing the frame coverage. To add more detail to the summary, we process each shot individually. Therefore, for an online system, we need to keep all the frames between two consecutive root key frames (peaks in the plot). From all the frames in this shot we calculate the maximum distance to the root key frame. Given the number of frames in the current shot, N, and a percentage, we can calculate the number of extra frames we need, m. The maximum distance is divided by m+1, and frames are taken at equal distances (on the y-axis). This approach maximizes the distance (visual difference) between the new key frames and the root key frames selected by the algorithm. Note that this is very different from equal-distance temporal sampling (on the x-axis), as it provides more detail for the parts that have more changes. Moreover, in some sense it reflects the content progression of the video instead of its temporal progression, by taking more samples only where there is more visual change. We also need to decide how to allocate key frames to shots: some shots do not need as many frames as others. The distance plot can help us identify these shots, which correspond to flat parts of the plot (e.g., the last segment in foreman).
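The per-shot refinement described above could look like the following sketch. refine_shot is a hypothetical helper; picking the frame whose distance is closest to each level (argmin) is my assumption, since the notes do not say how the curve is sampled when it is not monotonic.

<pre>
import numpy as np

def refine_shot(distances, m):
    """Pick m extra key frames inside one shot by sampling the
    distance-to-root curve at equal steps on the y-axis.

    distances: per-frame histogram distance to the shot's root key frame
    m: number of additional key frames to extract for this shot
    """
    d = np.asarray(distances, dtype=float)
    d_max = d.max()
    picks = []
    for k in range(1, m + 1):
        level = k * d_max / (m + 1)                      # equal spacing on the distance axis
        picks.append(int(np.argmin(np.abs(d - level))))  # frame closest to that level
    return sorted(set(picks))
</pre>

Here m itself would come from the requested percentage, e.g. m = int(pct * N) for a shot of N frames; flat shots (small d_max) could be given a smaller m, matching the allocation idea above.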