YouTube Data Collection and Coordinated Network Analysis
440K videos, 40M comments, 1.6K channels harvested via the YouTube Data API on Vertex AI and stored in BigQuery, used to study coordinated cross-platform link-sharing.
What we built and how.
The team collected YouTube video engagement and interaction data and analysed it to uncover coordinated inauthentic behaviour and to measure the influence of cross-platform link-sharing. After the team demonstrated the importance of studying these inauthentic interactions to teams at YouTube, the project was granted a 5× increase in its API rate limit.
The pipeline has three stages: data collection from tweets containing YouTube URLs, data cleaning to remove invalid URLs and extract video IDs, and data expansion using the YouTube Data API to retrieve comment data. The resulting dataset includes video IDs, comments, author details, like counts, and published dates. Asynchronous API requests, managed by Vertex AI, scale from 10K to 50K per day; the collected data is converted from CSV to Parquet and stored in BigQuery.
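The cleaning step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `extract_video_id` and `clean_video_ids` are hypothetical, the set of URL shapes handled (`watch?v=`, `youtu.be/`, `shorts/`, `embed/`, `live/`) is an assumption about what appears in tweets, and the 11-character ID pattern reflects how video IDs look in practice rather than a documented guarantee.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: video IDs are 11 characters from letters, digits, "-" and "_".
_VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if the URL is invalid."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return None  # malformed URL

    host = (parsed.hostname or "").lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]

    candidate = None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("shorts", "embed", "live"):
                candidate = parts[1]

    if candidate and _VIDEO_ID.fullmatch(candidate):
        return candidate
    return None


def clean_video_ids(urls):
    """Drop invalid URLs and return unique video IDs in first-seen order."""
    seen = {}
    for url in urls:
        video_id = extract_video_id(url)
        if video_id is not None:
            seen.setdefault(video_id, None)
    return list(seen)
```

Deduplicating before the expansion stage keeps the same video from being requested twice, which matters when the daily request budget is fixed. URLs without a scheme (for example `youtu.be/...` with no `https://`) are rejected by this sketch; a real pipeline might normalise them first.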
The collected dataset encompasses 440K videos, 40M comments, 1.6K channels, and 28K playlists, spanning 82 countries with 152.9 billion total views. The data supports filtering by video count, subscriber count, and country, with visualisations generated in Looker Studio.
The project employs network plots to represent user-video interactions: users and videos are nodes, and comments form the edges connecting them. Node size communicates the recency of the comment, with larger nodes for earlier comments and smaller nodes for delayed ones, offering insights into engagement patterns over time.
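A minimal sketch of how such a graph could be assembled from comment records is shown below. Several details are assumptions rather than the project's method: the `Comment` record and `build_network` function are hypothetical names, delay is measured from the earliest comment in the input, sizes are scaled linearly between `min_size` and `max_size`, and a node touched by several comments takes the size of its earliest one.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record: one row per comment.
Comment = namedtuple("Comment", ["author", "video_id", "published_at"])


def build_network(comments, min_size=10.0, max_size=100.0):
    """Return (nodes, edges) for a user-video comment graph.

    nodes maps a node key such as ("user", name) or ("video", id) to its size;
    edges is a list of (user_node, video_node) pairs, one per comment.
    """
    if not comments:
        return {}, []

    times = [c.published_at for c in comments]
    first, last = min(times), max(times)
    span = (last - first).total_seconds() or 1.0  # avoid division by zero

    def size_for(timestamp):
        # 0.0 for the earliest comment, 1.0 for the latest.
        delay = (timestamp - first).total_seconds() / span
        return max_size - delay * (max_size - min_size)

    nodes = {}
    edges = []
    for c in comments:
        user, video = ("user", c.author), ("video", c.video_id)
        edges.append((user, video))
        size = size_for(c.published_at)
        for node in (user, video):
            # Assumption: keep the size implied by the node's earliest comment.
            nodes[node] = max(nodes.get(node, min_size), size)
    return nodes, edges


comments = [
    Comment("alice", "vid1", datetime(2023, 1, 1)),
    Comment("bob", "vid1", datetime(2023, 1, 3)),
    Comment("alice", "vid2", datetime(2023, 1, 2)),
]
nodes, edges = build_network(comments)
```

The resulting `nodes` and `edges` can be handed to any plotting library. Prefixing node keys with `"user"` or `"video"` keeps the two node types distinct even if an author name happens to match a video ID.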





