🐨 Koala-36M

A Large-scale Video Dataset Improving Consistency between
Fine-grained Conditions and Video Content


Qiuheng Wang1,2*, Yukai Shi1,3*, Jiarong Ou1, Rui Chen1, Ke Lin1, Jiahao Wang1, Boyuan Jiang1,
Haotian Yang1, Mingwu Zheng1, Xin Tao1, Fei Yang1†, Pengfei Wan1, Di Zhang1

1 Kuaishou Technology    2 Shenzhen University    3 Tsinghua University
* Equal contribution    † Corresponding author


Demo of Koala-36M

Koala-36M features more accurate temporal splitting, more detailed captions, and improved video filtering based on the proposed Video Training Suitability Score (VTSS).

A scuba diver is exploring a shallow coral reef underwater, observing a large shark swimming nearby. The diver, wearing a black wetsuit and fins, is seen swimming slowly and cautiously through the water, maintaining a safe distance from the shark. The scene is set in a clear, blue underwater environment with sunlight filtering through the water, illuminating the sandy seabed and the surrounding marine life. The shark moves gracefully through the water, its body undulating smoothly as it swims. The diver's movements are deliberate and slow, ensuring they do not disturb the shark. The background remains relatively static, with only minor movements from the smaller fish in the water. The overall movement is slow and steady, with the shark's direction being primarily horizontal across the frame. The main subjects are the scuba diver and the shark. The diver is positioned on the right side of the frame, wearing a black wetsuit and fins, and is seen swimming slowly and cautiously. The shark is on the left side of the frame, swimming in the same direction as the diver but at a greater distance. The shark's body is streamlined, and it moves smoothly through the water. The background consists of a sandy seabed covered with patches of green seaweed and smaller fish. The water is clear, allowing sunlight to penetrate and create a serene underwater atmosphere. The scene is set in a shallow coral reef, with the sandy bottom and marine vegetation providing a natural habitat for the marine life. The camera is stationary, providing a wide-angle view of the scene. The perspective is from the diver's point of view, capturing both the diver and the shark in the frame.

Clarity Score: 0.9014.     Aesthetic score: 5.34.     Motion score: 22.03.     VTSS: 4.48.

A white car is driving along a winding road in a rural area. The car is moving at a moderate speed, navigating the curves of the road. The surrounding landscape is characterized by rolling hills covered in grass and low shrubs. The sky is clear, suggesting a sunny day. The main subject is a white car, which is driving along the road. The car is positioned centrally in the frame, moving from the left to the right side of the screen. The car's movement is smooth and consistent, following the curves of the road. The background consists of a rural landscape with rolling hills covered in grass and low shrubs. The hills are expansive and stretch out to the horizon, creating a sense of depth. The sky is clear, indicating good weather, and the lighting suggests it is daytime. The camera is stationary, capturing a wide-angle view of the road and the surrounding landscape. The perspective is from a slightly elevated angle, providing a clear view of the car's movement and the surrounding environment.

Clarity Score: 0.8023.     Aesthetic score: 4.82.     Motion score: 3.94.     VTSS: 4.31.

A close-up view of a large, spiky-haired kiwi bird resting on a grassy surface. The bird's feathers are predominantly black with white tips, and it appears to be in a relaxed state, occasionally moving its head and body slightly. The scene is calm and serene, focusing on the bird's detailed features and the texture of its feathers. The background consists of a lush, green grassy area, suggesting a natural outdoor setting. The grass is well-maintained and appears to be healthy, providing a soft and natural backdrop for the bird. There are no other objects or animals visible in the background, keeping the focus on the kiwi bird. The main subject is a kiwi bird, characterized by its large size, black feathers with white tips, and a long, pointed beak. The bird is positioned centrally in the frame, lying on its side with its head slightly raised. Its body is mostly stationary, but it occasionally moves its head and beak, indicating slight movements. The camera is stationary, providing a close-up view of the kiwi bird. The focus remains sharp on the bird throughout the video, with no noticeable camera movement.

Clarity Score: 0.5284.     Aesthetic score: 4.58.     Motion score: 50.87.     VTSS: 4.35.

Two men in a workshop or garage setting, where one man is demonstrating how to put on a black jacket with the word "KIMM" printed on the back. The man in the black jacket is standing still while the other man, who is wearing a purple shirt, is actively pointing and gesturing to explain the process. The scene is well-lit, and the background includes shelves with various motorcycle helmets and other equipment. The main subjects are two men. The man in the black jacket is wearing a black jacket with the word "KIMM" printed on the back. He is standing still and observing the demonstration. The man in the purple shirt is actively gesturing and explaining the process of putting on the jacket. He is wearing a purple shirt, jeans, and a silver bracelet. The man in the purple shirt is actively pointing and gesturing towards the man in the black jacket, indicating specific areas of the jacket to be adjusted or fastened. His movements are deliberate and focused, with moderate amplitude and speed. The man in the black jacket remains mostly stationary, occasionally shifting his posture slightly to follow the instructions. The background remains static throughout the video. The camera is stationary, providing a medium shot that captures both men and the background clearly. The view is at eye level, focusing on the interaction between the two men.

Clarity Score: 0.9964.     Aesthetic score: 5.17.     Motion score: 0.6632.     VTSS: 4.81.

A black Ford F-150 pickup truck is driving down a snowy road in an industrial area. The truck moves steadily forward, its wheels kicking up snow as it progresses. The background features a large, dark building with visible industrial structures and machinery, indicating a cold, snowy environment. The truck's headlights and taillights are on, illuminating the snowy road ahead. The main subject is a black Ford F-150 pickup truck. It has a large, dark grille, silver trim, and black wheels. The truck is positioned centrally in the frame, moving forward. The truck's headlights and taillights are on, and it appears to be in motion, driving down a snowy road. The truck moves steadily forward at a moderate speed, kicking up snow from the road as it progresses. The background remains static, with no visible movement of other objects or changes in the environment. The camera is stationary, capturing the truck's movement from a fixed, slightly elevated angle, providing a clear view of the truck and the snowy road ahead.

Clarity Score: 0.9589.     Aesthetic score: 4.93.     Motion score: 23.80.     VTSS: 4.48.

A bustling urban scene with a wide view of a city street lined with tall buildings. The street is filled with numerous vehicles, including cars and trucks, moving in both directions. The buildings are a mix of modern and older architectural styles, with some under construction. The scene is set during the day under clear skies, with the sun casting shadows on the buildings and vehicles. The main subjects are the vehicles on the street, including cars and trucks, which are moving in both directions. The vehicles are positioned along the street, with some closer to the camera and others further away. The buildings are prominent structures, with some under construction, and they frame the street. The vehicles are in constant motion, indicating a busy city environment. The background consists of a cityscape with a mix of modern and older buildings. The buildings are tall and varied in design, with some under construction. The street is lined with trees and greenery, adding a touch of nature to the urban setting. The scene is set during the day, with clear skies and bright sunlight casting shadows on the buildings and vehicles. The camera is stationary, providing a wide, aerial view of the city street and buildings. The perspective is from a high vantage point, capturing the entire scene in a single, continuous shot.

Clarity Score: 0.9835.     Aesthetic score: 5.08.     Motion score: 5.30.     VTSS: 4.35.

Our Collection Pipeline

We propose a refined data processing pipeline that further improves the consistency between fine-grained conditions and video content. It consists of a transition detection method, a structured caption system, the Video Training Suitability Score (VTSS), and Metric Conditions.

First, we propose a more accurate and efficient transition detection method for video splitting. Then we caption the split videos using our structured caption system, producing captions with an average length of about 200 words. Subsequently, we train a Video Training Suitability Score (VTSS) for data filtering, so that high-quality data is not erroneously deleted. Finally, we feed multiple data sub-metrics into the generation model as Metric Conditions to enrich the fine-grained conditions.
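To illustrate why filtering on a single suitability score can avoid erroneous deletions, the sketch below compares two filtering rules on the six demo clips shown above. The scores are copied from the demo section; the threshold values and the names `Clip`, `filter_by_submetrics`, and `filter_by_vtss` are hypothetical and chosen only for illustration. This is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    name: str
    clarity: float
    aesthetic: float
    motion: float
    vtss: float


# Scores copied from the demo clips on this page (names are shorthand labels).
DEMO_CLIPS = [
    Clip("diver", 0.9014, 5.34, 22.03, 4.48),
    Clip("white_car", 0.8023, 4.82, 3.94, 4.31),
    Clip("kiwi", 0.5284, 4.58, 50.87, 4.35),
    Clip("jacket", 0.9964, 5.17, 0.6632, 4.81),
    Clip("truck", 0.9589, 4.93, 23.80, 4.48),
    Clip("city", 0.9835, 5.08, 5.30, 4.35),
]

# Hypothetical thresholds, NOT taken from the paper.
MIN_CLARITY = 0.6
MIN_AESTHETIC = 4.5
MIN_MOTION = 1.0
MIN_VTSS = 4.0


def filter_by_submetrics(clips):
    """Keep a clip only if every individual sub-metric passes its threshold."""
    return [
        c for c in clips
        if c.clarity >= MIN_CLARITY
        and c.aesthetic >= MIN_AESTHETIC
        and c.motion >= MIN_MOTION
    ]


def filter_by_vtss(clips):
    """Keep a clip if its single suitability score passes one threshold."""
    return [c for c in clips if c.vtss >= MIN_VTSS]


print([c.name for c in filter_by_submetrics(DEMO_CLIPS)])
# ['diver', 'white_car', 'truck', 'city']
print([c.name for c in filter_by_vtss(DEMO_CLIPS)])
# ['diver', 'white_car', 'kiwi', 'jacket', 'truck', 'city']
```

Under these assumed thresholds, the per-metric rule drops the kiwi clip (low clarity) and the jacket clip (low motion), even though both have a VTSS comparable to the other demo clips. The single-score rule keeps them.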

Comparison with Other Datasets

The Koala-36M dataset provides both a large number of videos (36M) and high-quality fine-grained text captions (over 200 words on average),
significantly improving the quality of large-scale video datasets.

Dataset      | #Videos | Average Text Length (words) | Total Video Length (hours) | Text               | Filtering    | Resolution
LSMDC        | 118K    | 7.0                         | 158                        | Manual             | Sub-metrics  | 1080p
DiDeMo       | 27K     | 8.0                         | 87                         | Manual             | Sub-metrics  | -
YouCook2     | 14K     | 8.8                         | 176                        | Manual             | Sub-metrics  | -
ActivityNet  | 100K    | 13.5                        | 849                        | Manual             | Sub-metrics  | -
MSR-VTT      | 10K     | 9.3                         | 40                         | Manual             | Sub-metrics  | 240p
VATEX        | 41K     | 15.2                        | ~115                       | Manual             | Sub-metrics  | -
WebVid-10M   | 10M     | 12.0                        | 52K                        | Alt-Text           | Sub-metrics  | 360p
HowTo100M    | 136M    | 4.0                         | 135K                       | ASR                | Sub-metrics  | 240p
HD-VILA-100M | 103M    | 17.6                        | 760.3K                     | ASR                | Sub-metrics  | 720p
VidGen-1M    | 1M      | 89.3                        | -                          | Generated          | Sub-metrics  | 720p
MiraData     | 330K    | 318.0                       | 16K                        | Generated & Struct | Sub-metrics  | 720p
Panda-70M    | 70M     | 13.2                        | 167K                       | Generated          | Sub-metrics  | 720p
Koala-36M    | 36M     | 202.1                       | 172K                       | Generated & Struct | Expert Model | 720p
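As a rough reading aid for the table, dividing total video length by the number of videos gives an approximate mean clip duration. The sketch below uses the rounded figures from the table, so the results are approximations only; the function name `mean_clip_seconds` is made up for this example.

```python
# (number of videos, total video length in hours), rounded values from the table.
DATASETS = {
    "WebVid-10M": (10e6, 52e3),
    "MiraData": (330e3, 16e3),
    "Panda-70M": (70e6, 167e3),
    "Koala-36M": (36e6, 172e3),
}


def mean_clip_seconds(num_videos, total_hours):
    """Approximate mean clip duration in seconds."""
    return total_hours * 3600 / num_videos


for name, (num_videos, total_hours) in DATASETS.items():
    print(f"{name}: ~{mean_clip_seconds(num_videos, total_hours):.1f} s per clip")
# WebVid-10M: ~18.7 s per clip
# MiraData: ~174.5 s per clip
# Panda-70M: ~8.6 s per clip
# Koala-36M: ~17.2 s per clip
```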

Generation Results Trained on Datasets

We train the same generation model from scratch on different datasets at 256x256 resolution for comparison. Each model sees 140M data samples in total.
The generation model achieves the best performance when trained on Koala-36M, in both video quality and text-video consistency.
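Because the sample budget is fixed, the number of passes over each dataset differs with its size. The sketch below estimates this under two assumptions that the page does not state: each training sample is one clip, and the full dataset is used. The name `approx_passes` is hypothetical.

```python
TOTAL_SAMPLES = 140e6  # training budget stated on this page

# Dataset sizes from the comparison table.
DATASET_SIZES = {"Koala-36M": 36e6, "Panda-70M": 70e6}


def approx_passes(total_samples, dataset_size):
    """Approximate number of passes over a dataset for a fixed sample budget."""
    return total_samples / dataset_size


for name, size in DATASET_SIZES.items():
    print(f"{name}: ~{approx_passes(TOTAL_SAMPLES, size):.1f} passes")
# Koala-36M: ~3.9 passes
# Panda-70M: ~2.0 passes
```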

Side-by-side videos (Koala-36M vs. Panda-70M) for each of the following prompts:

"A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background."

"A truck turning a corner."

"A cat wearing sunglasses at a pool."

"A teddy bear washing the dishes."

"Sunset time lapse at the beach with moving clouds and colors in the sky."

"A person is filling eyebrows."



Citation

@misc{wang2024koala36mlargescalevideodataset,
      title={Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content}, 
      author={Qiuheng Wang and Yukai Shi and Jiarong Ou and Rui Chen and Ke Lin and Jiahao Wang and Boyuan Jiang and Haotian Yang and Mingwu Zheng and Xin Tao and Fei Yang and Pengfei Wan and Di Zhang},
      year={2024},
      eprint={2410.08260},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.08260}, 
    }