How to Create a TikTok Dataset for Machine Learning Research
By The Tiklocker Team on 2026-04-08 18:58:37
Researchers in computer vision, NLP, and social media analysis increasingly need TikTok video datasets. But building one from scratch means dealing with TikTok's aggressive anti-scraping measures, constantly changing APIs, and complex authentication flows. Here's a practical approach using the Tiklocker API.
🔬 Common Research Use Cases
TikTok datasets are valuable for:

- Trend analysis: Studying how content goes viral and spreads across communities
- Computer vision: Training models on short-form video content, gesture recognition, or activity detection
- NLP: Analyzing captions, hashtags, and comment patterns
- Recommendation systems: Understanding engagement patterns and content similarity
- Misinformation research: Tracking how false claims propagate through short-form video
📐 Planning Your Dataset
Before downloading anything, define your dataset scope:
| Parameter | Example |
|---|---|
| Topic/niche | Cooking tutorials |
| Sample size | 500-1000 videos |
| Time range | Videos from 2025-2026 |
| Metadata needed | Views, likes, hashtags, captions |
| Video specs | MP4, 1080p, no watermark |
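Capturing this plan in code makes it easy for collection scripts to validate against it later. A minimal sketch; the field names and values below mirror the example table and are illustrative, not any fixed schema:

```python
# Hypothetical dataset plan mirroring the table above.
DATASET_CONFIG = {
    "topic": "cooking tutorials",
    "sample_size": (500, 1000),       # (min, max) number of videos
    "time_range": ("2025-01-01", "2026-12-31"),
    "metadata_fields": ["views", "likes", "hashtags", "captions"],
    "video_format": "mp4",
    "resolution": "1080p",
    "watermark": False,
}
```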
🐍 Building the Dataset With Python
Step 1: Collect URLs
Gather TikTok video URLs from your target niche. Common approaches:

- Browse hashtag pages and collect video links manually
- Use TikTok's discover page for trending content
- Compile URLs from academic sources or existing URL lists
Save them to a text file, one URL per line:
```text
# urls.txt
https://www.tiktok.com/@user1/video/111111
https://www.tiktok.com/@user2/video/222222
https://www.tiktok.com/@user3/video/333333
```
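Hand-collected URL lists tend to accumulate duplicates and malformed entries, so it's worth cleaning the file before spending download credits. A small helper, assuming the canonical `https://www.tiktok.com/@user/video/<id>` URL shape shown above:

```python
import re

# Matches the canonical TikTok video URL shape (assumed, not exhaustive).
TIKTOK_URL_RE = re.compile(r"^https://www\.tiktok\.com/@[\w.-]+/video/\d+$")

def clean_urls(path="urls.txt"):
    """Read URLs, drop malformed entries and duplicates, preserve order."""
    seen, valid = set(), []
    with open(path) as f:
        for line in f:
            url = line.strip().rstrip("/")
            if url and url not in seen and TIKTOK_URL_RE.match(url):
                seen.add(url)
                valid.append(url)
    return valid
```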
Step 2: Batch Download via API
```python
import os
import zipfile
from pathlib import Path

import requests

API_URL = "https://tiklocker.com/api/batch-download"
API_KEY = "your_api_key_here"
BATCH_SIZE = 10
OUTPUT_DIR = "dataset"

def download_batch(urls, batch_id):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    response = requests.post(API_URL, json={"urls": urls}, headers=headers)
    if response.status_code == 200:
        zip_path = f"{OUTPUT_DIR}/batch_{batch_id}.zip"
        with open(zip_path, "wb") as f:
            f.write(response.content)
        # Extract the videos, then discard the archive
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(f"{OUTPUT_DIR}/videos")
        os.remove(zip_path)
        return True
    return False

def main():
    Path(f"{OUTPUT_DIR}/videos").mkdir(parents=True, exist_ok=True)
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    print(f"Downloading {len(urls)} videos...")
    total_batches = -(-len(urls) // BATCH_SIZE)  # ceiling division
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        batch_id = (i // BATCH_SIZE) + 1
        print(f"  Batch {batch_id}/{total_batches}...")
        if not download_batch(batch, batch_id):
            print(f"  Batch {batch_id} failed (non-200 response)")
    video_count = len(list(Path(f"{OUTPUT_DIR}/videos").glob("*.mp4")))
    print(f"Dataset complete: {video_count} videos downloaded")

if __name__ == "__main__":
    main()
```
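The loop above logs a failed batch but moves on. For larger collections, transient HTTP errors are common, so a retry wrapper with exponential backoff is often worthwhile. This is a sketch layered on top of the script's own `download_batch` helper, not part of the Tiklocker API:

```python
import time

def with_retry(download_fn, urls, batch_id, max_attempts=3, base_delay=5):
    """Call download_fn(urls, batch_id) until it returns True.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts; returns False if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if download_fn(urls, batch_id):
            return True
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(f"  Batch {batch_id} failed, retrying in {delay}s...")
            time.sleep(delay)
    return False
```

In `main()`, the `download_batch(batch, batch_id)` call would become `with_retry(download_batch, batch, batch_id)`.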
Step 3: Create a Metadata CSV
For research purposes, you'll want structured metadata alongside the videos:
```python
import csv

def create_metadata(urls, output_path="dataset/metadata.csv"):
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "video_id", "username", "filename"])
        for url in urls:
            # Extract the video ID and username from the URL path
            parts = url.rstrip("/").split("/")
            video_id = parts[-1] if parts[-1].isdigit() else "unknown"
            username = [p for p in parts if p.startswith("@")]
            username = username[0] if username else "unknown"
            writer.writerow([url, video_id, username, f"{video_id}.mp4"])
```
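Once both steps have run, it's worth cross-checking the metadata against the files actually on disk, since some downloads may have failed. A sketch assuming the `metadata.csv` column layout written above:

```python
import csv
from pathlib import Path

def verify_dataset(metadata_path="dataset/metadata.csv",
                   video_dir="dataset/videos"):
    """Return the filenames listed in metadata.csv that are missing
    from the videos directory."""
    video_files = {p.name for p in Path(video_dir).glob("*.mp4")}
    missing = []
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["filename"] not in video_files:
                missing.append(row["filename"])
    return missing
```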
📊 Dataset Organization
A well-organized research dataset looks like this:
```text
dataset/
├── videos/         (MP4 files)
├── metadata.csv    (URL, video ID, username, etc.)
├── README.md       (dataset description, collection method)
└── scripts/        (collection and analysis scripts)
```
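For ML work you will usually also want reproducible train/validation/test splits over this layout. A minimal sketch; the directory path and ratios are illustrative, and the fixed seed keeps the split stable across runs:

```python
import random
from pathlib import Path

def split_dataset(video_dir="dataset/videos", seed=42,
                  ratios=(0.8, 0.1, 0.1)):
    """Deterministically assign video files to train/val/test splits."""
    files = sorted(p.name for p in Path(video_dir).glob("*.mp4"))
    rng = random.Random(seed)       # fixed seed -> reproducible split
    rng.shuffle(files)
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
```

Recording the seed and ratios in the dataset's README.md keeps the split reproducible for other researchers.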
⚖️ Ethical Considerations
When building research datasets from social media:
- Respect privacy: Avoid including identifiable personal information in published datasets
- Check your institution's IRB requirements for social media research
- Credit the platform and creators in your research publications
- Consider data retention policies: Only keep data as long as needed for your research
- Review TikTok's Terms of Service and your institution's guidelines on data collection
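To act on the privacy point above, one common technique is pseudonymizing usernames with a salted hash before a dataset is published. A sketch; the salt string is a placeholder that should be kept out of the released data:

```python
import hashlib

def pseudonymize(username, salt="replace-with-project-secret"):
    """Replace a username with a stable salted hash.

    The same (salt, username) pair always maps to the same token, so
    per-user analyses still work, but handles are not exposed. Keep
    the real salt private to make brute-forcing handles harder.
    """
    digest = hashlib.sha256((salt + username).encode()).hexdigest()
    return f"user_{digest[:12]}"
```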
🤔 FAQ
How many videos can I download with the API? As many as your credits allow. The API supports batch processing, so large datasets are feasible.
Can I get video metadata (views, likes) through the API? The Tiklocker API focuses on video downloads. For engagement metrics, you may need to combine this with other data sources.
What video quality will I get? All downloads are the highest available quality (typically 1080p) without watermarks.
Start building your research dataset at Tiklocker.com/api.