Visual recognition in video currently has no clear winner, but the combination of a CNN and an LSTM is a
popular way to describe a video. In this paper, we propose a method to recognize human actions and the objects
involved in those actions. We also propose a simple attention method that focuses on the important region
where the human interacts with objects. We first extract a temporal stream from a video sequence and train an RGB
CNN and an optical-flow CNN separately. After that, an action tube that gives attention to the important region
is estimated and described by the activations from each network. From these activations, we finally apply LSTM
blocks to predict action/object classes. Our architecture is compact and fast (810 fps) yet shows competitive
performance against state-of-the-art methods on the MPII Cooking dataset.
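The pipeline above (two CNN streams fused per frame, then recurrent classification) can be sketched schematically. This is a minimal NumPy illustration, not the authors' implementation: the two CNNs are stubbed with random per-frame activations, and all dimensions are illustrative assumptions.

```python
# Schematic two-stream + LSTM pipeline: per-frame RGB and optical-flow
# activations are concatenated and fed to an LSTM whose final hidden
# state predicts action/object classes. CNNs are stubbed with random
# features; every size here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
T, D_RGB, D_FLOW, H, N_CLASSES = 16, 128, 128, 64, 10  # assumed sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates stacked as [input, forget, cell, output]."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)
    h = o * np.tanh(c)
    return h, c

D = D_RGB + D_FLOW
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
W_out = rng.normal(0, 0.1, (N_CLASSES, H))

# Stand-ins for the RGB and optical-flow CNN activations.
rgb_feats = rng.normal(size=(T, D_RGB))
flow_feats = rng.normal(size=(T, D_FLOW))

h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    x = np.concatenate([rgb_feats[t], flow_feats[t]])  # fuse the two streams
    h, c = lstm_step(x, h, c, W, U, b)

# Softmax over the final hidden state gives class probabilities.
logits = W_out @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

In the actual method, the random features would be replaced by activations pooled from the estimated action tube in each stream.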