At the heart of text-to-video and text-to-image generation lies the process of diffusion. Inspired by the physical phenomenon in which substances gradually mix, like ink diffusing in water, diffusion models in machine learning involve a two-step process: adding noise to data and then learning to remove it.

During training, the model takes images or sequences of video frames and progressively adds noise over many steps until the original content becomes indistinguishable; essentially, it is turned into pure noise.

Diffusion and generation processes for the prompt "Beautiful blowing sakura tree placed on the hill during sunrise"

When generating new content, the process works in reverse. The model is trained to predict and remove noise incrementally, focusing on a random intermediate step between two timesteps, t and t+1. Because training covers every step in the progression from pure noise to a nearly clean image, the model learns to identify and reduce noise at essentially any level.

Starting from random, pure noise, the model, guided by the input text, iteratively creates video frames that are coherent and match the textual description. High-quality, detailed video content is the result of this gradual process.

Latent diffusion is what makes this computationally feasible. Instead of working directly with high-resolution images or videos, an encoder compresses the data into a latent space.

This significantly reduces the amount of data the model needs to process, accelerating generation without compromising quality. After the diffusion process has refined the latent representations, a decoder transforms them back into full-resolution images or videos.
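To make the two-step idea concrete, here is a minimal sketch in PyTorch of the training-time noising step and the generation-time denoising loop operating on latents. The linear noise schedule, the tensor shape, and the `denoiser` and `decoder` callables are illustrative assumptions for this sketch, not the architecture of any particular text-to-video system.

```python
import torch

T_STEPS = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T_STEPS)      # simple linear noise schedule (assumed)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factors

def add_noise(latent, t):
    """Forward process: blend a clean latent with Gaussian noise at step t (training time)."""
    noise = torch.randn_like(latent)
    signal = alphas_cum[t].sqrt()
    sigma = (1.0 - alphas_cum[t]).sqrt()
    return signal * latent + sigma * noise, noise   # noisy latent plus the noise that was added

@torch.no_grad()
def generate(denoiser, decoder, text_emb, shape=(1, 4, 16, 32, 32)):
    """Reverse process: start from pure noise, iteratively remove the predicted noise,
    then decode the refined latent back into full-resolution frames."""
    x = torch.randn(shape)                          # (batch, channels, frames, height, width) in latent space
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, t, text_emb)              # hypothetical text-conditioned noise predictor
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / (1.0 - alphas_cum[t]).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # keep a little randomness until the final step
    return decoder(x)                               # decoder maps latents back to pixels
```

The key points are that the same model is asked to predict noise at every level of corruption, which is why it can start from pure noise at generation time, and that the heavy iteration happens in the compressed latent space rather than on full-resolution frames.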
The issue with video generation

Unlike a single image, video requires objects and characters to remain stable throughout, without unexpected shifts or changes in appearance. We have all seen the wonders generative AI is capable of, but the occasional missing arm or indistinct facial expression is well within the norm for still images. In video, however, the stakes are higher: consistency is paramount for a fluid feel.

So, if a character appears in the first frame wearing a specific outfit, that outfit must look identical in every subsequent frame. Any change in the character's appearance, or any "morphing" of objects in the background, breaks the continuity and makes the video feel unnatural or even eerie.
Image provided by the author

Early methods approached video generation by processing frames individually, with each pixel in one frame only referencing its corresponding pixel in the others. This frame-by-frame approach often produced inconsistencies, because it could not capture the spatial and temporal relationships between frames that are essential for smooth transitions and realistic motion. Artifacts such as shifting colors, fluctuating shapes, or misaligned features are the result of this lack of coherence, and they diminish the overall quality of the video.
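As a rough sketch of that restricted interaction pattern, the snippet below (PyTorch, with projection layers omitted and shapes chosen arbitrarily) applies attention within each frame, and then across time only between tokens at the same spatial position; a pixel never directly attends to a different position in a different frame.

```python
import torch
import torch.nn.functional as F

def factorized_attention(x):
    """Simplified 'before' pattern for a video latent of shape (frames, height, width, channels):
    spatial attention inside each frame, then temporal attention only along matching positions."""
    T, H, W, C = x.shape

    # Spatial pass: sequence = the pixels of one frame, batched over frames.
    tokens = x.reshape(T, H * W, C)
    tokens = F.scaled_dot_product_attention(tokens, tokens, tokens)

    # Temporal pass: sequence = the same (h, w) position across frames, batched over positions.
    tokens = tokens.permute(1, 0, 2)                  # (H*W, T, C)
    tokens = F.scaled_dot_product_attention(tokens, tokens, tokens)

    return tokens.permute(1, 0, 2).reshape(T, H, W, C)

video_latent = torch.randn(8, 16, 16, 64)             # 8 frames of 16x16 latents, 64 channels (assumed)
out = factorized_attention(video_latent)               # same shape, but only indirect cross-frame, cross-position mixing
```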
Image provided by the author

The biggest blocker in solving this was computational demand, and the cost that comes with it. Consider a 10-second video at 10 frames per second: that is 100 frames, and because every frame has to stay coherent with every other frame, the complexity grows roughly with the square of the frame count rather than linearly. Generating those 100 frames is therefore on the order of 10,000 times more demanding than generating a single image in terms of memory, processing time, and computational resources, which often exceeds practical limits. As you can imagine, the luxury of experimenting with this process was available to only a select few in the industry.
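As a back-of-the-envelope check on those numbers, assume the cost is dominated by keeping every frame coherent with every other frame, i.e. it grows with the number of frame pairs. This is a simplification for illustration, not a precise cost model.

```python
# Rough scaling estimate under the simplifying assumption that cost grows with
# the number of frame-to-frame interactions that must stay coherent.
duration_s = 10
fps = 10
frames = duration_s * fps          # 100 frames

cost_single_image = 1              # normalize one image to unit cost
cost_video = frames ** 2           # every frame interacting with every other frame

print(frames)                      # 100
print(cost_video)                  # 10000 -> roughly 10,000x the cost of a single image
```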
This is what made OpenAI's release of SORA so significant: it demonstrated that diffusion transformers could indeed handle video generation despite the immense complexity of the task.

How diffusion transformers solved the self-consistency problem in video generation

The emergence of diffusion transformers tackled several problems at once: they enabled the generation of videos of arbitrary resolution and length while achieving high self-consistency. This is largely because they can work with long sequences, as long as those sequences fit into memory, and because of the self-attention mechanism.
In artificial intelligence, self-attention is a mechanism that computes attention weights between the elements of a sequence, determining how much each element should be influenced by the others. It lets every element consider all other elements simultaneously, allowing the model to focus on the relevant parts of the input when generating output and to capture dependencies across both space and time.

In video generation, this means that every pixel in every frame can relate to every other pixel across all frames. This interconnectedness ensures that objects and characters remain consistent throughout the whole video, from beginning to end. If a character appears in one frame, self-attention helps maintain that character's appearance in all subsequent frames and prevents unwanted changes.

Earlier models incorporated a form of self-attention within a convolutional network, but that structure limited their ability to achieve the consistency and coherence now possible with diffusion transformers.

With simultaneous spatio-temporal attention in diffusion transformers, however, the architecture can load data from different frames at once and analyze it as a unified sequence. As shown in the image below, previous methods processed interactions within each frame and only linked each pixel with its corresponding position in other frames (Figure 1). This restricted view hindered their ability to capture the spatial and temporal relationships essential for smooth and realistic motion. With diffusion transformers, everything is processed simultaneously (Figure 2).

Spatio-temporal interaction in diffusion networks before and after transformers. Image provided by the author

This holistic processing keeps details stable across frames, ensuring that scenes do not morph unexpectedly into an incoherent final product. Diffusion transformers can also handle sequences of arbitrary length and resolution, provided they fit into memory. With this advancement, generating longer videos becomes feasible without sacrificing consistency or quality, addressing challenges that previous convolution-based methods could not overcome.
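Contrast this with the factorized sketch shown earlier: with unified spatio-temporal attention, every pixel of every frame is flattened into one long token sequence, so any token can attend to any other in a single pass. As before, the shapes are arbitrary and the projection layers are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_attention(x):
    """Unified 'after' pattern: all frames and positions form one sequence,
    so every token can attend to every other token in one attention pass."""
    T, H, W, C = x.shape
    tokens = x.reshape(1, T * H * W, C)                   # one sequence of T*H*W tokens
    mixed = F.scaled_dot_product_attention(tokens, tokens, tokens)
    return mixed.reshape(T, H, W, C)

video_latent = torch.randn(8, 16, 16, 64)                  # 8 frames of 16x16 latents (assumed)
out = spatio_temporal_attention(video_latent)               # 2,048 tokens, all attending to each other
```

The price of this global view is that the attention computation scales quadratically with the total number of tokens, which is why sequence length is bounded by available memory, the constraint mentioned above.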
The arrival of diffusion transformers reshaped text-to-video generation. It enabled the production of high-quality, self-consistent videos across arbitrary lengths and resolutions, with self-attention as a key component in maintaining frame consistency and handling complex spatial and temporal relationships. OpenAI's release of SORA proved this capability and set a new standard in the industry: approximately 90% of advanced text-to-video systems are now based on diffusion transformers, with major players like Luma, Kling, and Runway Gen-3 leading the market.

Despite these breathtaking advances, diffusion transformers remain very resource-intensive, requiring nearly 10,000 times more resources than single-image generation, which still makes training high-quality models a costly undertaking. Nevertheless, the open-source community has taken significant steps to make this technology more accessible. Projects like Open-SORA and Open-SORA-Plan, as well as other initiatives such as Mira Video Generation, Cog, and Cog-2, have opened new possibilities for developers and researchers to experiment and innovate.

Backed by companies and academic institutions, these open-source projects give hope for ongoing progress and greater accessibility in video generation, benefiting not only large corporations but also independent creators and enthusiasts keen to experiment. Like any other community-driven effort, this points to a future where video generation is democratized, bringing a powerful technology to many more creatives.
by","disable_ad":"0"},"jnews_primary_category":[],"jnews_social_meta":[],"jnews_override_counter":{"view_counter_number":"0","share_counter_number":"0","like_counter_number":"0","dislike_counter_number":"0"},"footnotes":""},"categories":[408,3229],"tags":[963,10519,16537,17582],"coauthors":[17471],"class_list":["post-62584","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-contributors","category-artificial-intelligence","tag-ai","tag-openai","tag-sora","tag-text-to-video"],"_links":{"self":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/users\/10"}],"replies":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/comments?post=62584"}],"version-history":[{"count":"1","href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584\/revisions"}],"predecessor-version":[{"id":62589,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584\/revisions\/62589"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/media\/62590"}],"wp:attachment":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/media?parent=62584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/categories?post=62584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/tags?post=62584"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/coauthors?post=62584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}