At the heart of text-to-video and text-to-image generation lies the process of diffusion. Inspired by the physical phenomenon in which substances gradually mix, like ink diffusing in water, diffusion models in machine learning involve a two-step process: adding noise to data and then learning to remove it.

During training, the model takes images or sequences of video frames and progressively adds noise over many steps until the original content becomes indistinguishable; essentially, it is turned into pure noise.

Diffusion and generation processes for the prompt "Beautiful blowing sakura tree placed on the hill during sunrise"

When generating new content, the process works in reverse. The model is trained to predict and remove noise incrementally, focusing on a random intermediate step between two timesteps, t and t+1. Because training covers every step in the progression from pure noise to a nearly clean image, the model learns to identify and reduce noise at essentially any level.

Starting from random, pure noise, the model, guided by the input text, iteratively creates video frames that are coherent and match the textual description. High-quality, detailed video content is the result of this gradual process.

Latent diffusion is what makes this computationally feasible. Instead of working directly with high-resolution images or videos, an encoder compresses the data into a latent space.

This significantly reduces the amount of data the model needs to process, accelerating generation without compromising quality. After the diffusion process has refined the latent representations, a decoder transforms them back into full-resolution images or videos.
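To make the two-step idea concrete, here is a minimal sketch in PyTorch of the training-time noising step and the generation-time denoising loop operating on latents. The linear noise schedule, the tensor shape, and the `denoiser` and `decoder` callables are illustrative assumptions for this sketch, not the architecture of any particular text-to-video system.

```python
import torch

T_STEPS = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T_STEPS)      # simple linear noise schedule (assumed)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factors

def add_noise(latent, t):
    """Forward process: blend a clean latent with Gaussian noise at step t (training time)."""
    noise = torch.randn_like(latent)
    signal = alphas_cum[t].sqrt()
    sigma = (1.0 - alphas_cum[t]).sqrt()
    return signal * latent + sigma * noise, noise   # noisy latent plus the noise that was added

@torch.no_grad()
def generate(denoiser, decoder, text_emb, shape=(1, 4, 16, 32, 32)):
    """Reverse process: start from pure noise, iteratively remove the predicted noise,
    then decode the refined latent back into full-resolution frames."""
    x = torch.randn(shape)                          # (batch, channels, frames, height, width) in latent space
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, t, text_emb)              # hypothetical text-conditioned noise predictor
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / (1.0 - alphas_cum[t]).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # keep a little randomness until the final step
    return decoder(x)                               # decoder maps latents back to pixels
```

The key points are that the same model is asked to predict noise at every level of corruption, which is why it can start from pure noise at generation time, and that the heavy iteration happens in the compressed latent space rather than on full-resolution frames.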
The issue with video generation

Unlike a single image, video requires objects and characters to remain stable throughout, without unexpected shifts or changes in appearance. We have all seen the wonders generative AI is capable of, but the occasional missing arm or indistinct facial expression is well within the norm for still images. In video, however, the stakes are higher: consistency is paramount for a fluid feel.

So, if a character appears in the first frame wearing a specific outfit, that outfit must look identical in every subsequent frame. Any change in the character's appearance, or any "morphing" of objects in the background, breaks the continuity and makes the video feel unnatural or even eerie.
Image provided by the author

Early methods approached video generation by processing frames individually, with each pixel in one frame only referencing its corresponding pixel in the others. This frame-by-frame approach often produced inconsistencies, because it could not capture the spatial and temporal relationships between frames that are essential for smooth transitions and realistic motion. Artifacts such as shifting colors, fluctuating shapes, or misaligned features are the result of this lack of coherence, and they diminish the overall quality of the video.
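As a rough sketch of that restricted interaction pattern, the snippet below (PyTorch, with projection layers omitted and shapes chosen arbitrarily) applies attention within each frame, and then across time only between tokens at the same spatial position; a pixel never directly attends to a different position in a different frame.

```python
import torch
import torch.nn.functional as F

def factorized_attention(x):
    """Simplified 'before' pattern for a video latent of shape (frames, height, width, channels):
    spatial attention inside each frame, then temporal attention only along matching positions."""
    T, H, W, C = x.shape

    # Spatial pass: sequence = the pixels of one frame, batched over frames.
    tokens = x.reshape(T, H * W, C)
    tokens = F.scaled_dot_product_attention(tokens, tokens, tokens)

    # Temporal pass: sequence = the same (h, w) position across frames, batched over positions.
    tokens = tokens.permute(1, 0, 2)                  # (H*W, T, C)
    tokens = F.scaled_dot_product_attention(tokens, tokens, tokens)

    return tokens.permute(1, 0, 2).reshape(T, H, W, C)

video_latent = torch.randn(8, 16, 16, 64)             # 8 frames of 16x16 latents, 64 channels (assumed)
out = factorized_attention(video_latent)               # same shape, but only indirect cross-frame, cross-position mixing
```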
Image provided by the author

The biggest blocker in solving this was computational demand, and the cost that comes with it. Consider a 10-second video at 10 frames per second: that is 100 frames, and because every frame has to stay coherent with every other frame, the complexity grows roughly with the square of the frame count rather than linearly. Generating those 100 frames is therefore on the order of 10,000 times more demanding than generating a single image in terms of memory, processing time, and computational resources, which often exceeds practical limits. As you can imagine, the luxury of experimenting with this process was available to only a select few in the industry.
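As a back-of-the-envelope check on those numbers, assume the cost is dominated by keeping every frame coherent with every other frame, i.e. it grows with the number of frame pairs. This is a simplification for illustration, not a precise cost model.

```python
# Rough scaling estimate under the simplifying assumption that cost grows with
# the number of frame-to-frame interactions that must stay coherent.
duration_s = 10
fps = 10
frames = duration_s * fps          # 100 frames

cost_single_image = 1              # normalize one image to unit cost
cost_video = frames ** 2           # every frame interacting with every other frame

print(frames)                      # 100
print(cost_video)                  # 10000 -> roughly 10,000x the cost of a single image
```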
This is what made OpenAI's release of SORA so significant: it demonstrated that diffusion transformers could indeed handle video generation despite the immense complexity of the task.

How diffusion transformers solved the self-consistency problem in video generation

The emergence of diffusion transformers tackled several problems at once: they enabled the generation of videos of arbitrary resolution and length while achieving high self-consistency. This is largely because they can work with long sequences, as long as those sequences fit into memory, and because of the self-attention mechanism.
In artificial intelligence, self-attention is a mechanism that computes attention weights between the elements of a sequence, determining how much each element should be influenced by the others. It lets every element consider all other elements simultaneously, allowing the model to focus on the relevant parts of the input when generating output and to capture dependencies across both space and time.

In video generation, this means that every pixel in every frame can relate to every other pixel across all frames. This interconnectedness ensures that objects and characters remain consistent throughout the whole video, from beginning to end. If a character appears in one frame, self-attention helps maintain that character's appearance in all subsequent frames and prevents unwanted changes.

Earlier models incorporated a form of self-attention within a convolutional network, but that structure limited their ability to achieve the consistency and coherence now possible with diffusion transformers.

With simultaneous spatio-temporal attention in diffusion transformers, however, the architecture can load data from different frames at once and analyze it as a unified sequence. As shown in the image below, previous methods processed interactions within each frame and only linked each pixel with its corresponding position in other frames (Figure 1). This restricted view hindered their ability to capture the spatial and temporal relationships essential for smooth and realistic motion. With diffusion transformers, everything is processed simultaneously (Figure 2).

Spatio-temporal interaction in diffusion networks before and after transformers. Image provided by the author

This holistic processing keeps details stable across frames, ensuring that scenes do not morph unexpectedly into an incoherent final product. Diffusion transformers can also handle sequences of arbitrary length and resolution, provided they fit into memory. With this advancement, generating longer videos becomes feasible without sacrificing consistency or quality, addressing challenges that previous convolution-based methods could not overcome.
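Contrast this with the factorized sketch shown earlier: with unified spatio-temporal attention, every pixel of every frame is flattened into one long token sequence, so any token can attend to any other in a single pass. As before, the shapes are arbitrary and the projection layers are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_attention(x):
    """Unified 'after' pattern: all frames and positions form one sequence,
    so every token can attend to every other token in one attention pass."""
    T, H, W, C = x.shape
    tokens = x.reshape(1, T * H * W, C)                   # one sequence of T*H*W tokens
    mixed = F.scaled_dot_product_attention(tokens, tokens, tokens)
    return mixed.reshape(T, H, W, C)

video_latent = torch.randn(8, 16, 16, 64)                  # 8 frames of 16x16 latents (assumed)
out = spatio_temporal_attention(video_latent)               # 2,048 tokens, all attending to each other
```

The price of this global view is that the attention computation scales quadratically with the total number of tokens, which is why sequence length is bounded by available memory, the constraint mentioned above.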
The arrival of diffusion transformers reshaped text-to-video generation. It enabled the production of high-quality, self-consistent videos across arbitrary lengths and resolutions, with self-attention as a key component in maintaining frame consistency and handling complex spatial and temporal relationships. OpenAI's release of SORA proved this capability and set a new standard in the industry: approximately 90% of advanced text-to-video systems are now based on diffusion transformers, with major players like Luma, Kling, and Runway Gen-3 leading the market.

Despite these breathtaking advances, diffusion transformers remain very resource-intensive, requiring nearly 10,000 times more resources than single-image generation, which still makes training high-quality models a costly undertaking. Nevertheless, the open-source community has taken significant steps to make this technology more accessible. Projects like Open-SORA and Open-SORA-Plan, as well as other initiatives such as Mira Video Generation, Cog, and Cog-2, have opened new possibilities for developers and researchers to experiment and innovate.

Backed by companies and academic institutions, these open-source projects give hope for ongoing progress and greater accessibility in video generation, benefiting not only large corporations but also independent creators and enthusiasts keen to experiment. Like any other community-driven effort, this points to a future where video generation is democratized, bringing a powerful technology to many more creatives.
by","disable_ad":"0"},"jnews_primary_category":[],"jnews_social_meta":[],"jnews_override_counter":{"view_counter_number":"0","share_counter_number":"0","like_counter_number":"0","dislike_counter_number":"0"},"footnotes":""},"categories":[408,3229],"tags":[963,10519,16537,17582],"coauthors":[17471],"class_list":["post-62584","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-contributors","category-artificial-intelligence","tag-ai","tag-openai","tag-sora","tag-text-to-video"],"_links":{"self":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/users\/10"}],"replies":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/comments?post=62584"}],"version-history":[{"count":"1","href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584\/revisions"}],"predecessor-version":[{"id":62589,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/posts\/62584\/revisions\/62589"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/media\/62590"}],"wp:attachment":[{"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/media?parent=62584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/categories?post=62584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/tags?post=62584"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/dataconomy.ru\/wp-json\/wp\/v2\/coauthors?post=62584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}