Google Research hosted the 3rd YouTube-8M Video Understanding Challenge, which asked participants to localize video-level labels to the precise times in the video where the labels actually appear, at an unprecedented scale. Our (Team Ceshine's) mixture of context-aware and context-agnostic segment classifiers won 7th place on the final leaderboard on a relatively low budget: $150 of GCP credit and one local GTX 1070 card for the entire process.
The final leaderboard on Kaggle | The published workshop paper
On average, there are only 237 annotated segments per class, which is generally considered too few to train even moderately sophisticated models. Therefore, I adopted a transfer learning approach to avoid overfitting and improve generalization: video-level models were pre-trained on the video label prediction task from the previous year's YouTube-8M challenge, and further fine-tuning on the segments dataset helps the models pinpoint relevant frames in shorter segments more accurately.
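A minimal PyTorch sketch of this pre-train-then-fine-tune setup. The architecture, checkpoint path, and feature pooling here are stand-ins, not the actual models used in the solution; only the 1,152-dimensional frame features (1,024-d RGB + 128-d audio) and the 1,000 target classes come from the dataset itself:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a video-level model pre-trained on the
# previous challenge's video label prediction task.
model = nn.Sequential(
    nn.Linear(1152, 2048),  # 1,024-d RGB + 128-d audio features (pooled)
    nn.ReLU(),
    nn.Linear(2048, 1000),  # 1,000 target classes
)

# Placeholder checkpoint path for the pre-trained video-level weights.
checkpoint = torch.load("pretrained_video_level.pth", map_location="cpu")
model.load_state_dict(checkpoint)

# Fine-tune on segment-level examples with a small learning rate so the
# pre-trained representations are adjusted rather than overwritten.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

def finetune_step(segment_features: torch.Tensor,
                  segment_labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of (pooled) segment features."""
    optimizer.zero_grad()
    logits = model(segment_features)          # (batch, 1000)
    loss = criterion(logits, segment_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```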
Directly fine-tuned models are context-agnostic: they see only the segment and have no information about the other parts of the video. For some classes, however, that context can be used to make better predictions. To exploit it, context-aware models combine a video encoder with a segment encoder; the video and segment embedding vectors from the two encoders are concatenated and fed to a classifier that predicts class probabilities.
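A sketch of that two-encoder design in PyTorch. The encoder modules, embedding sizes, and classifier head below are illustrative assumptions; only the "concatenate video and segment embeddings, then classify" structure is taken from the description above:

```python
import torch
import torch.nn as nn

class ContextAwareSegmentClassifier(nn.Module):
    """Concatenates video-level and segment-level embeddings for classification."""

    def __init__(self, video_encoder: nn.Module, segment_encoder: nn.Module,
                 video_dim: int, segment_dim: int, n_classes: int = 1000):
        super().__init__()
        self.video_encoder = video_encoder
        self.segment_encoder = segment_encoder
        # Classifier head operating on the concatenated embeddings.
        self.classifier = nn.Sequential(
            nn.Linear(video_dim + segment_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, video_frames: torch.Tensor,
                segment_frames: torch.Tensor) -> torch.Tensor:
        video_emb = self.video_encoder(video_frames)        # (batch, video_dim)
        segment_emb = self.segment_encoder(segment_frames)  # (batch, segment_dim)
        combined = torch.cat([video_emb, segment_emb], dim=1)
        return self.classifier(combined)                    # logits for 1,000 classes

# Toy usage with simple mean-pooling encoders over 1,152-d frame features.
class MeanPoolEncoder(nn.Module):
    def __init__(self, in_dim: int = 1152, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(frames.mean(dim=1))  # (batch, n_frames, in_dim) -> (batch, out_dim)

model = ContextAwareSegmentClassifier(
    video_encoder=MeanPoolEncoder(), segment_encoder=MeanPoolEncoder(),
    video_dim=512, segment_dim=512,
)
logits = model(torch.randn(2, 300, 1152), torch.randn(2, 5, 1152))
```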
Using a mixture of context-aware and context-agnostic segment classifiers better accommodates the different characteristics of the 1,000 target classes and improves overall performance.
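One simple way to realize such a mixture is a per-class blend of the two models' predictions. The weighting scheme below (one weight per class, e.g. derived from per-class validation scores) is an illustrative assumption rather than the solution's exact combination rule:

```python
import numpy as np

def mix_predictions(context_aware: np.ndarray,
                    context_agnostic: np.ndarray,
                    class_weights: np.ndarray) -> np.ndarray:
    """Blend two (n_segments, n_classes) probability matrices.

    `class_weights` has shape (n_classes,): a weight of 1 trusts the
    context-aware model entirely for that class, 0 trusts the
    context-agnostic model.
    """
    return class_weights * context_aware + (1.0 - class_weights) * context_agnostic

# Toy usage with random numbers standing in for real model outputs.
rng = np.random.default_rng(0)
aware = rng.random((4, 1000))
agnostic = rng.random((4, 1000))
weights = rng.random(1000)
blended = mix_predictions(aware, agnostic, weights)
```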
The technologies used in this project: