How to Create a TikTok Dataset for Machine Learning Research

By The Tiklocker Team on 2026-04-08 18:58:37

Researchers in computer vision, NLP, and social media analysis increasingly need TikTok video datasets. But building one from scratch means dealing with TikTok's aggressive anti-scraping measures, constantly changing APIs, and complex authentication flows. Here's a practical approach using the Tiklocker API.

🔬 Common Research Use Cases

TikTok datasets are valuable for:

  • Trend analysis: Studying how content goes viral and spreads across communities
  • Computer vision: Training models on short-form video content, gesture recognition, or activity detection
  • NLP: Analyzing captions, hashtags, and comment patterns
  • Recommendation systems: Understanding engagement patterns and content similarity
  • Misinformation research: Tracking how false claims propagate through short-form video

📐 Planning Your Dataset

Before downloading anything, define your dataset scope:

Parameter         Example
Topic/niche       Cooking tutorials
Sample size       500-1000 videos
Time range        Videos from 2025-2026
Metadata needed   Views, likes, hashtags, captions
Video specs       MP4, 1080p, no watermark
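One way to keep these decisions explicit is to capture the scope as a config your collection scripts can read. A minimal sketch (the dict name and keys are illustrative, not part of any API; values mirror the example table above):

```python
# Hypothetical dataset scope captured as a config dict, so the collection
# scripts and the final README describe the same thing.
DATASET_SCOPE = {
    "topic": "cooking tutorials",
    "sample_size": (500, 1000),               # min/max number of videos
    "time_range": ("2025-01-01", "2026-12-31"),
    "metadata_fields": ["views", "likes", "hashtags", "captions"],
    "video_specs": {"format": "mp4", "resolution": "1080p", "watermark": False},
}
```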

🐍 Building the Dataset With Python

Step 1: Collect URLs

Gather TikTok video URLs from your target niche. Common approaches:

  • Browse hashtag pages and collect video links manually
  • Use TikTok's discover page for trending content
  • Compile URLs from academic sources or existing URL lists

Save them to a text file, one URL per line:

# urls.txt
https://www.tiktok.com/@user1/video/111111
https://www.tiktok.com/@user2/video/222222
https://www.tiktok.com/@user3/video/333333
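Before downloading, it helps to validate and de-duplicate this list so malformed or repeated lines don't waste credits. A minimal sketch (the regex covers only the standard web URL shape shown above; mobile and short-link formats would need extra handling):

```python
import re

# Standard web URL shape: https://www.tiktok.com/@username/video/<digits>
TIKTOK_URL = re.compile(r"^https://www\.tiktok\.com/@[\w.]+/video/\d+$")

def load_urls(path="urls.txt"):
    """Read URLs from a text file, dropping blanks, duplicates, and malformed lines."""
    seen, urls = set(), []
    with open(path) as f:
        for line in f:
            url = line.strip().rstrip("/")
            if url and url not in seen and TIKTOK_URL.match(url):
                seen.add(url)
                urls.append(url)
    return urls
```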

Step 2: Batch Download via API

import requests
import zipfile
import os
from pathlib import Path

API_URL = "https://tiklocker.com/api/batch-download"
API_KEY = "your_api_key_here"
BATCH_SIZE = 10
OUTPUT_DIR = "dataset"

def download_batch(urls, batch_id):
    """POST one batch of URLs to the API and extract the returned ZIP."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(API_URL, json={"urls": urls}, headers=headers, timeout=300)

    if response.status_code == 200:
        zip_path = f"{OUTPUT_DIR}/batch_{batch_id}.zip"
        with open(zip_path, "wb") as f:
            f.write(response.content)

        # Extract videos, then discard the archive
        with zipfile.ZipFile(zip_path, 'r') as z:
            z.extractall(f"{OUTPUT_DIR}/videos")
        os.remove(zip_path)
        return True

    print(f"  Batch {batch_id} failed: HTTP {response.status_code}")
    return False

def main():
    Path(f"{OUTPUT_DIR}/videos").mkdir(parents=True, exist_ok=True)

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    print(f"Downloading {len(urls)} videos...")

    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        batch_id = (i // BATCH_SIZE) + 1
        print(f"  Batch {batch_id}/{-(-len(urls) // BATCH_SIZE)}...")
        download_batch(batch, batch_id)

    video_count = len(list(Path(f"{OUTPUT_DIR}/videos").glob("*.mp4")))
    print(f"Dataset complete: {video_count} videos downloaded")

if __name__ == "__main__":
    main()

Step 3: Create a Metadata CSV

For research purposes, you'll want structured metadata alongside the videos:

import csv

def create_metadata(urls, output_path="dataset/metadata.csv"):
    """Write one CSV row per URL: url, video ID, username, expected filename."""
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "video_id", "username", "filename"])

        for url in urls:
            # URL shape: https://www.tiktok.com/@username/video/<video_id>
            parts = url.rstrip("/").split("/")
            video_id = parts[-1] if parts[-1].isdigit() else "unknown"
            username = next((p for p in parts if p.startswith("@")), "unknown")

            writer.writerow([url, video_id, username, f"{video_id}.mp4"])
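For downstream analysis you'll typically read this CSV back into memory. A minimal standard-library loader (the default path follows the layout used above):

```python
import csv

def load_metadata(path="dataset/metadata.csv"):
    """Load the metadata CSV back into a list of dicts, one per video."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```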

📊 Dataset Organization

A well-organized research dataset looks like this:

dataset/
├── videos/           (MP4 files)
├── metadata.csv      (URL, video ID, username, etc.)
├── README.md         (dataset description, collection method)
└── scripts/          (collection and analysis scripts)
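A small integrity check that cross-references metadata.csv against the files actually on disk can catch silently failed batches. A sketch under the layout above (it assumes the filename column written in Step 3):

```python
import csv
from pathlib import Path

def verify_dataset(root="dataset"):
    """Return (listed-but-missing, on-disk-but-unlisted) video filenames."""
    with open(f"{root}/metadata.csv", newline="") as f:
        listed = {row["filename"] for row in csv.DictReader(f)}
    on_disk = {p.name for p in Path(root, "videos").glob("*.mp4")}
    return sorted(listed - on_disk), sorted(on_disk - listed)
```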

⚖️ Ethical Considerations

When building research datasets from social media:

  • Respect privacy: Avoid including identifiable personal information in published datasets
  • Check your institution's IRB requirements for social media research
  • Credit the platform and creators in your research publications
  • Consider data retention policies: Only keep data as long as needed for your research
  • Review TikTok's Terms of Service and your institution's guidelines on data collection
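One common way to reduce privacy risk before sharing a dataset is to pseudonymize usernames in the metadata. A sketch using a salted SHA-256 hash (the salt value and `user_` prefix are placeholders; keep the real salt private so the mapping can't be trivially reversed):

```python
import hashlib

SALT = "replace-with-a-random-secret"  # keep private; never publish with the dataset

def pseudonymize(username, salt=SALT):
    """Replace a username with a stable, non-reversible identifier."""
    digest = hashlib.sha256((salt + username).encode()).hexdigest()
    return f"user_{digest[:12]}"
```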

🤔 FAQ

How many videos can I download with the API? As many as your credits allow. The API supports batch processing, so large datasets are feasible.

Can I get video metadata (views, likes) through the API? The Tiklocker API focuses on video downloads. For engagement metrics, you may need to combine this with other data sources.

What video quality will I get? All downloads are the highest available quality (typically 1080p) without watermarks.

Start building your research dataset at Tiklocker.com/api.
