Story visualization (SV) is a challenging text-to-image generation task, as it requires not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with the correct characters or a proper scene background) remains a challenge. To this end, we propose a novel memory architecture for Bi-directional Transformers together with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training, for better generalization to language variation at inference. We call our model Context Memory and Online Text Augmentation, or CMOTA for short. In extensive experiments on two popular SV benchmarks, i.e., Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the art on various evaluation metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision, with similar or lower computational complexity.
The story visualization task is challenging because it requires rendering visual details in images with a convincing scene background (seasonal elements, environmental objects such as a table, the location) and the proper characters, which we refer to as context, spread across the given text sentences. To better encode such semantic context, e.g., a plausible background and characters, we propose a new memory architecture, which we call context memory. In particular, we use a Transformer architecture and explicitly connect its last layers across time steps to propagate historical information. However, not all historical information is equally important for generating the image at a given time step. Similar to masked self-attention, we attentively weight the past information to better model sparse context, i.e., an attentively weighted memory. We then fuse this contextual memory into the current state, generating a temporally coherent and semantically relevant image sequence.
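A minimal PyTorch sketch of the attentively weighted context memory idea is given below. The module and names (`ContextMemory`, `gate`) and the gated fusion are illustrative assumptions, not the authors' exact implementation: past last-layer hidden states are stored as memory, attentively weighted by the current state, and fused back into it.

```python
# Illustrative sketch (assumed interface), not the official CMOTA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMemory(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects current state to queries
        self.key = nn.Linear(dim, dim)     # projects memory entries to keys
        self.value = nn.Linear(dim, dim)   # projects memory entries to values
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_t, memory):
        """
        h_t:    (B, L, D) last-layer hidden states at the current time step
        memory: (B, T, L, D) last-layer hidden states stored from past steps
        """
        B, T, L, D = memory.shape
        mem = memory.view(B, T * L, D)                          # flatten past steps
        attn = torch.einsum('bld,bmd->blm',
                            self.query(h_t), self.key(mem)) / D ** 0.5
        attn = F.softmax(attn, dim=-1)                          # weight past info
        ctx = torch.einsum('blm,bmd->bld', attn, self.value(mem))
        # fuse the attended context into the current state with a learned gate
        g = torch.sigmoid(self.gate(torch.cat([h_t, ctx], dim=-1)))
        return g * h_t + (1 - g) * ctx
```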
We further propose to generate pseudo-texts during training as online augmentation for better linguistic generalization, without requiring large external data, by training the bi-directional Transformers in both directions: generating images from texts and vice versa.
We use a bi-directional (i.e., multi-modal) Transformer that iteratively generates images and texts in a unified architecture. Similar to DALL-E (Ramesh et al., ICML'21), the image tokens are sequentially predicted from the input text sequence by the Transformer, and the decoder of the VQ-VAE then translates the predicted image tokens into an image sequence. The text tokens are likewise sequentially predicted from the input image token sequence by the same Transformer. In particular, for bi-directional multi-modal generation, we add two embeddings: a positional embedding for the absolute position of tokens and a segment embedding for distinguishing the source and target modalities.
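The sketch below illustrates how the input sequence to such a bi-directional Transformer could be assembled. The vocabulary sizes, module names, and the 0/1 segment convention are assumptions for illustration, not the exact configuration used in the paper.

```python
# Illustrative sketch (assumed configuration), not the official CMOTA code.
import torch
import torch.nn as nn

class MultiModalEmbedding(nn.Module):
    def __init__(self, text_vocab, image_vocab, max_len, dim):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.image_emb = nn.Embedding(image_vocab, dim)   # VQ-VAE codebook indices
        self.pos_emb = nn.Embedding(max_len, dim)         # absolute token position
        self.seg_emb = nn.Embedding(2, dim)               # 0 = source, 1 = target

    def forward(self, text_ids, image_ids, text_is_source=True):
        # concatenate text and image tokens into one sequence
        tok = torch.cat([self.text_emb(text_ids), self.image_emb(image_ids)], dim=1)
        B, N, _ = tok.shape
        pos = self.pos_emb(torch.arange(N, device=tok.device)).unsqueeze(0)
        # segment ids mark which modality is the source and which is the target
        seg_ids = torch.cat([
            torch.full_like(text_ids, 0 if text_is_source else 1),
            torch.full_like(image_ids, 1 if text_is_source else 0),
        ], dim=1)
        return tok + pos + self.seg_emb(seg_ids)  # fed to the shared Transformer
```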
Thanks to our bi-directional Transformer architecture, we can naturally integrate the generation of pseudo-texts into the joint learning of the image-to-text and text-to-image models. In early epochs, less meaningful sentences are generated, but as training progresses, the generated sentences become more meaningful. As a side benefit, by supervising the model with intermediate goals at each time step, we expect faster convergence of training.
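A minimal training-step sketch of this online text augmentation is shown below. The methods `text_to_image_loss`, `image_to_text_loss`, and `generate_text` are hypothetical interfaces standing in for the two directions of the shared Transformer; the unweighted loss sum is likewise an assumption.

```python
# Illustrative training step (assumed model interface), not the official code.
import torch

def training_step(model, text_ids, image_ids, optimizer):
    # 1) standard bi-directional losses: text -> image and image -> text
    loss_t2i = model.text_to_image_loss(text_ids, image_ids)
    loss_i2t = model.image_to_text_loss(image_ids, text_ids)

    # 2) online augmentation: sample a pseudo-description from the current
    #    image-to-text direction (noisy in early epochs, more meaningful later)
    with torch.no_grad():
        pseudo_text_ids = model.generate_text(image_ids)

    # 3) use the pseudo-description as supplementary supervision for text -> image
    loss_aug = model.text_to_image_loss(pseudo_text_ids, image_ids)

    loss = loss_t2i + loss_i2t + loss_aug
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```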
Following previous works, we compare with prior arts on the Pororo-SV and Flintstones-SV datasets using various metrics, i.e., FID, Character F1 (Char.F1), Frame Accuracy (Frm.Acc), BLEU-2/3, and R-precision (R-Prc.), at resolutions of 64 × 64 and 128 × 128. CMOTA shows consistent improvements, outperforming existing methods by a large margin on both benchmark datasets. CMOTA also generates qualitatively more plausible image sequences with better visual quality than prior works. Further, we observe the advantage of the memory architecture: with the context memory, CMOTA generates a semantically more plausible image sequence with the proper context, e.g., background, whereas CMOTA without memory fails to capture the proper background context, since a single sentence can be interpreted in many ways.
For more details, please check out the paper.
@inproceedings{ahn2023cmota,
author = {Ahn, Daechul and Kim, Daneul and Song, Gwangmo and Kim, Seung Hwan and Lee, Honglak and Kang, Dongyeop and Choi, Jonghyun},
title = {Story Visualization by Online Text Augmentation with Context Memory},
booktitle = {ICCV},
year = {2023},
}