An important reason for this fast-growing video consumption is information-seeking. For instance, people turn to YouTube “hungry for how-to and learning content” (oneilhart2018). Indeed, compared to traditional content formats such as text, video carries richer information to satisfy such needs. But as a content medium, videos are also inherently more difficult to skim through, making it harder to quickly target the relevant part(s) of a video. Recognizing this difficulty, search engines have started showing links to “key moments” within videos in search results, based on timestamps and short descriptions provided by the content creators themselves. This enables users to get a quick sense of what a video covers, and also to jump to a particular time in the video if so desired. This effort echoes prior work in the literature showing how users of instructional videos can benefit from human-curated meta-data, such as a timeline pointing to the successive steps of a tutorial (kim2014crowdsourcing; margulieux2012subgoal; weir2015learnersourcing). Producing such meta-data automatically would greatly scale up the effort of providing easier information access to videos.

This task is closely related to the dense video captioning task considered in prior work (zhou2018towards; Zhou2018EndtoEndDV; krishna2017densecaptioning), where an instructional video is first segmented into its main steps, followed by segment-level caption generation.

Figure 1: Dense video captioning using ViTT-trained models. For the given video scene, we show the ViTT annotation (Groundtruth) and model outputs with no pretraining and with MASS-based pretraining.

Using YouCook2 and the new ViTT dataset as benchmarks for testing model performance and generalization, we further focus on the sub-problem of video-segment-level caption generation, assuming segment boundaries are given (hessel2019case; sun2019videobert; luo2020univilm). Motivated by the high cost of collecting human annotations, we investigate pretraining a video segment captioning model using unsupervised signals: ASR (Automatic Speech Recognition) tokens and visual features from instructional videos, together with unpaired instruction steps extracted from independent sources, Recipe1M (marin2019recipe1m+) and WikiHow (koupaee2018wikihow). In contrast to prior work that focused on BERT-style pretraining of encoder networks (sun2019videobert; sun2019contrastive), our approach jointly pretrains both a multimodal encoder and a text-based decoder via MASS-style pretraining (song2019mass). Our experiments show that pretraining with either text-only or multimodal data provides significant gains over no pretraining, on both the established YouCook2 benchmark and the new ViTT benchmark.
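To make the pretraining objective concrete, here is a minimal sketch of MASS-style masked sequence-to-sequence pretraining over ASR tokens. This is an illustration under assumptions, not the paper's implementation: the stand-in encoder-decoder, the vocabulary constants, and the helper `mass_mask` with its `span_frac` ratio are all hypothetical, and the actual multimodal encoder would additionally consume visual features.

```python
# Illustrative sketch of MASS-style pretraining (song2019mass): mask a
# contiguous span of the encoder input and train the decoder to reconstruct
# exactly that span. All names below are hypothetical stand-ins.
import torch
import torch.nn as nn

PAD_ID, MASK_ID, VOCAB_SIZE = 0, 1, 30000  # assumed vocabulary constants

def mass_mask(tokens: torch.Tensor, span_frac: float = 0.5):
    """Mask a contiguous ~span_frac fraction of `tokens` for the encoder;
    return (encoder_input, decoder_input, decoder_target)."""
    seq_len = tokens.size(0)
    span_len = max(1, int(seq_len * span_frac))
    start = torch.randint(0, seq_len - span_len + 1, (1,)).item()

    enc_input = tokens.clone()
    enc_input[start:start + span_len] = MASK_ID        # encoder sees [MASK]s
    target = tokens[start:start + span_len]            # span to reconstruct
    dec_input = torch.cat([tokens.new_tensor([MASK_ID]), target[:-1]])  # shift right
    return enc_input, dec_input, target

# One toy training step with a generic stand-in encoder-decoder.
model = nn.Transformer(d_model=256, nhead=4, batch_first=True)
embed = nn.Embedding(VOCAB_SIZE, 256)
to_vocab = nn.Linear(256, VOCAB_SIZE)
loss_fn = nn.CrossEntropyLoss()

asr_tokens = torch.randint(2, VOCAB_SIZE, (48,))       # toy ASR token sequence
enc_in, dec_in, target = mass_mask(asr_tokens)
causal = model.generate_square_subsequent_mask(dec_in.size(0))
hidden = model(embed(enc_in)[None], embed(dec_in)[None], tgt_mask=causal)
loss = loss_fn(to_vocab(hidden).squeeze(0), target)    # predict the masked span
loss.backward()
```

The appeal of this objective in our setting is that the same span-reconstruction loss applies whether the encoder input is ASR text alone (text-only pretraining) or ASR text paired with visual features (multimodal pretraining), so one recipe covers both regimes compared in the experiments.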
The main thing to note is who your audience is and on what medium you are posting your media. For instance, if your audience is everybody, including the deaf and hearing-impaired, then captions would be ideal. It’s also a good idea to include captions if you’re posting on social media like Facebook or Instagram, because most people keep their sound off. If you intend to reach audiences globally, then subtitles in foreign languages are important so that people in non-English-speaking countries can also enjoy your video or film. Ideally, it’s a good idea to have both captions and subtitles if you can afford it. At the very least, captions should always be included so that you are not excluding a large part of your audience. If you want to do subtitles as well but have a tight budget, think about choosing the most common languages to reach a wider audience (e.g., Spanish and Mandarin). In summary, captions and subtitles expand your audience: viewers who would otherwise not be able to fully comprehend your video because of, say, a linguistic barrier or a hearing impairment can now enjoy it.