sites.research.google - VideoPoet – Google Research


Go to Site

https://sites.research.google/videopoet/

Site Description

A Large Language Model for Zero-Shot Video Generation. VideoPoet demonstrates a simple modeling method that can convert any autoregressive language model into a high-quality video generator.

Example Site Content

VideoPoet – Google Research

Capabilities: Text-to-Video, Image-to-Video, Video Editing, Stylization, Inpainting

VideoPoet: a large language model for zero-shot video generation.

Example text-to-video prompts:
- A dog listening to music with headphones, highly detailed, 8k.
- A large blob of exploding splashing rainbow paint, with an apple emerging, 8k.
- A robot cat eating spaghetti, digital art.
- A pumpkin exploding, slow motion.
- Two pandas playing cards.
- A vaporwave fashion dog in Miami looks around and barks, digital art.
- An astronaut riding a galloping horse.
- A family of raccoons living in a small cabin, tilt shift, arc shot.
- A golden retriever wearing VR goggles and eating pizza in Paris.
- A tree walking through the forest, tilt shift.
- A walking figure made out of water.
- A shark with a laser beam coming out of its mouth.
- Teddy bears holding hands, walking down rainy 5th Ave.
- A chicken lifting weights.
- An origami fox walking through the forest.
- Robot emerging from a large column of billowing black smoke, high quality.
- A t-rex jumping over a cactus, with water gushing after the t-rex falls.
- A mouse eating cheese in a royal dress, arc shot.
- An alien enjoys food, 8k.
- A lion with a mane made out of yellow dandelion petals roars.
- A massive explosion on the surface of the earth.
- A horse galloping through Van Gogh's 'Starry Night'.
- A squirrel in armor riding a goose, action shot.
- A panda taking a selfie.
- An octopus attacks New York.
- A bear with the head of an owl screeches loudly.
- An astronaut typing on a keyboard, arc shot.
- A rabbit eating grass, soft lighting.
- Flag of the US on top of a tall white mountain, rotating panorama.
- Motorcyclist on a racing track, highly detailed.
- A massive tidal wave crashes dramatically against a rugged coastline, digital art.
- Humans building a highway on Mars, cinematic.
- A skeleton drinking a glass of soda.
- The Orient Express driving through a fantasy landscape, animated oil on canvas.
VideoPoet can output high-motion, variable-length videos given a text prompt.

Video-to-Audio

VideoPoet can also output audio to match an input video without using any text as guidance. Example prompts:
- A dog eating popcorn at the cinema.
- A teddy bear with a cap, sunglasses, and leather jacket playing drums.
- A teddy bear in a leather jacket, baseball cap, and sunglasses playing guitar in front of a waterfall.
- A pink cat playing piano in the forest.
- The Orient Express driving through a fantasy landscape, oil on canvas.
- A dragon breathing fire, cinematic.

Using Generative Models to Tell Visual Stories

To showcase VideoPoet's capabilities, we produced a short movie composed of many short clips generated by the model. For the script, we asked Bard to write a series of prompts detailing a short story about a traveling raccoon. We then generated a video clip for each prompt and stitched the resulting clips together to produce the final YouTube Short.

Introduction

VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It consists of a few simple components:
- A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, videos, and audio clips of variable length into sequences of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating integration with other modalities, such as text.
- An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.
- A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio.
Furthermore, such tasks can be composed for additional zero-shot capabilities (e.g., text-to-audio). This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency. VideoPoet demonstrates state-of-the-art video generation, in particular in producing a wide range of large, interesting, and high-fidelity motions. The model supports generating videos in square or portrait orientation to tailor generations toward short-form content, as well as generating audio from a video input.
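The unified-vocabulary idea described above can be sketched in a few lines: each modality's tokenizer emits codes in its own small codebook, and offsetting those codebooks into one shared token space lets a single autoregressive model read and write any modality. The sketch below is purely illustrative; the vocabulary sizes, the `toy_next_token` stand-in, and all helper names are assumptions, not VideoPoet's actual implementation (which uses MAGVIT V2 and SoundStream tokenizers with a large transformer).

```python
# Illustrative sketch of a unified multimodal token vocabulary and an
# autoregressive generation loop. All sizes and helpers are hypothetical.
import random

TEXT_VOCAB = 256      # assumed text vocabulary size
VIDEO_VOCAB = 1024    # assumed video-codebook size
AUDIO_VOCAB = 512     # assumed audio-codebook size

# Offsets pack the three codebooks into one shared token space so a single
# language model can consume and emit tokens from any modality.
VIDEO_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VIDEO_VOCAB
UNIFIED_VOCAB = TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB

def to_unified(tokens, offset):
    """Shift modality-local codes into the shared vocabulary."""
    return [t + offset for t in tokens]

def from_unified(token):
    """Map a shared-vocabulary token back to (modality, local code)."""
    if token < VIDEO_OFFSET:
        return ("text", token)
    if token < AUDIO_OFFSET:
        return ("video", token - VIDEO_OFFSET)
    return ("audio", token - AUDIO_OFFSET)

def toy_next_token(prefix, rng):
    """Stand-in for the trained LM: samples a video token uniformly.
    A real model would condition its prediction on the whole prefix."""
    return VIDEO_OFFSET + rng.randrange(VIDEO_VOCAB)

def generate(prompt_tokens, n_new, seed=0):
    """Autoregressive loop: extend the sequence one token at a time."""
    rng = random.Random(seed)
    seq = list(prompt_tokens)
    for _ in range(n_new):
        seq.append(toy_next_token(seq, rng))
    return seq

# A text prompt stays in [0, TEXT_VOCAB); the generated continuation
# decodes back to video-codebook indices for a video decoder.
prompt = [72, 101, 108, 108, 111]   # pretend-tokenized text
seq = generate(prompt, n_new=4)
decoded = [from_unified(t) for t in seq[len(prompt):]]
```

Because every training task (text-to-video, video-to-audio, inpainting, and so on) is expressed as next-token prediction over this one sequence space, composing tasks at inference time needs no architectural change, only a different prefix.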
