Arpit Sharma

AI/ML Engineer, Armakuni

"From B.Tech student to Research Engineer: My transformative journey at SimPPL working on AI-driven social listening, hate speech detection, and misinformation research."

Since the start of my academic career, I have had a sincere enthusiasm for machine learning and research. During the early years of my B.Tech in Computer Science and Engineering (AI & ML) at Karnavati University, that enthusiasm grew into a serious interest in research, particularly in machine learning. My time with SimPPL turned out to be a crucial stage: it gave me the opportunity to work on projects, both product-oriented and research-driven, that stretched me technically and intellectually. Beyond honing my abilities, these experiences gave me a clear sense of direction and helped me turn a keen interest into a defined professional path.

Discovery of SimPPL

I first learned about SimPPL in my 3rd year when I was preparing for my GRE. Shortly thereafter, I was awarded a fellowship. Although my participation and engagement during that initial fellowship were limited due to final exams and a concurrent summer internship, I returned to SimPPL in August as a Research Engineering Intern.

Initial Role and Onboarding

Upon joining SimPPL as a Research Engineering Intern, I was first assigned to "Harassment in the Digital Arena: Analyzing Online Abuse Against U.S. Politicians", a project conducted in collaboration with the Center for Tech and Society. Because social media platforms function as modern "town squares," where real-time public discourse can quickly devolve into hateful content, my primary responsibility was to curate a robust YouTube dataset centered on U.S. political figures. In doing so, I systematically collected and preprocessed over 1 million comments, along with YouTube video metadata, to capture a representative sample of partisan and emotionally charged exchanges.
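
The writeup doesn't name the collection tooling, but a minimal sketch of this kind of curation using the official YouTube Data API (via google-api-python-client) might look like the following; the API key and video IDs are placeholders:

```python
from googleapiclient.discovery import build

# Hypothetical API key; the project's actual credentials and video list are not public.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

def fetch_comments(video_id, max_pages=10):
    """Page through top-level comments for a single video."""
    comments, token = [], None
    for _ in range(max_pages):
        resp = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,
            pageToken=token,
            textFormat="plainText",
        ).execute()
        for item in resp["items"]:
            s = item["snippet"]["topLevelComment"]["snippet"]
            comments.append({
                "author": s["authorDisplayName"],
                "text": s["textDisplay"],
                "published": s["publishedAt"],
            })
        token = resp.get("nextPageToken")
        if token is None:
            break
    return comments
```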

Once the dataset was in place, I performed exploratory data analysis (EDA) and identified which politicians, such as Josh Shapiro, Jeff Jackson, Kamala Harris, and Ilhan Omar, were most frequently targeted by polarizing language. By quantifying the prevalence of gendered and racialized slurs, I gained actionable insight into how hate speech clusters around specific political demographics, particularly women and minorities.
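
As a rough illustration only (not the project's actual analysis code), an EDA pass like this might flag comments against a slur lexicon and rank politicians by the share of flagged comments they receive; the column names and lexicon terms here are hypothetical:

```python
import pandas as pd

# df: one row per comment, with hypothetical "politician" and "text" columns.
df = pd.read_csv("comments.csv")
slur_lexicon = {"example_slur_1", "example_slur_2"}  # placeholder terms

def contains_slur(text):
    tokens = set(str(text).lower().split())
    return bool(tokens & slur_lexicon)

df["flagged"] = df["text"].apply(contains_slur)

# Share of flagged comments per politician, highest first.
prevalence = (df.groupby("politician")["flagged"]
                .mean()
                .sort_values(ascending=False))
print(prevalence.head(10))
```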

Working alongside a colleague, I then trained and evaluated a hate-speech classifier built on the LLaMA 3.1-Instruct model. We concentrated on fine-tuning LLaMA 3.1 to differentiate between harmless remarks and genuinely harmful content, because the nuances of natural language, such as sarcasm, dog whistles, and coded epithets, can mask underlying animosity. Despite the model's encouraging precision and recall, our evaluation revealed persistent gaps in capturing the full subtlety of interpersonal online exchanges, underscoring the importance of continuous model improvement.
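
A parameter-efficient fine-tune with LoRA is one plausible way to adapt an instruct model for binary hate-speech classification; the sketch below uses Hugging Face transformers and peft, with an assumed checkpoint id and hyperparameters, since the project's exact setup isn't described here:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed checkpoint id; the exact model variant and hyperparameters
# used in the project are not stated in this writeup.
base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token

model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

lora = LoraConfig(
    task_type="SEQ_CLS",
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, a standard transformers Trainer run on labeled
# hateful / non-hateful comments fine-tunes only the adapter weights.
```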

To complement model training, I contributed to AI-driven visualizations that mapped clusters of words and phrases most strongly associated with hateful discourse. These visual aids revealed how certain political topics—especially those involving policy debates or high-profile events—triggered spikes in aggressive rhetoric. By overlaying network graphs of user interactions, we were able to identify "hot spots" where hate speech propagated most rapidly.
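
For the network-graph side, one simple proxy for such "hot spots" is centrality over a reply graph, as in this minimal networkx sketch with made-up edges:

```python
import networkx as nx

# Made-up reply edges: (commenter, user_replied_to) pairs from comment threads.
edges = [("user_a", "user_b"), ("user_c", "user_b"),
         ("user_d", "user_b"), ("user_a", "user_e")]
G = nx.DiGraph(edges)

# Accounts drawing the most replies are a rough proxy for "hot spots"
# where hateful discourse concentrates and spreads.
centrality = nx.in_degree_centrality(G)
for user in sorted(centrality, key=centrality.get, reverse=True)[:3]:
    print(user, round(centrality[user], 2))
```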

Through this role and onboarding process, I not only gained hands-on experience in large-scale data curation and model development but also developed a deeper appreciation for AI's dual nature: its power to expose and mitigate hate speech, and the ethical imperative of deploying it responsibly in digital communities.

Transition to Research into Top Beauty Influencers

After my first contribution to SimPPL's research on hatred and harassment, I transitioned into a new project titled Research into Top Beauty Influencers and the Potentially Misleading Claims Promoted by Them, developed in collaboration with Jagran, a prominent news media organization. The project sought to investigate how beauty influencers on platforms like Instagram and YouTube propagate misleading product claims, particularly those that could adversely affect the mental health and body image of young audiences.

My responsibilities evolved significantly during this phase. I was tasked with the development of a scalable and automated scraping pipeline, which involved integrating rotating proxies in a Dockerized environment to scale the collection of public data. This infrastructure supported our broader effort to gather metadata, posts, and engagement metrics from influencer profiles.
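
In its simplest form, proxy rotation just cycles each outgoing request through a pool; the sketch below (with placeholder proxy URLs) shows the core idea, though a production pipeline would typically draw proxies from a managed rotating-proxy service:

```python
import itertools
import requests

# Placeholder pool; the Dockerized pipeline would source these from a
# managed rotating-proxy service rather than a hard-coded list.
PROXIES = itertools.cycle([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
])

def fetch(url):
    proxy = next(PROXIES)  # advance to the next proxy on every request
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=10)
```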

To facilitate targeted scraping, I contributed to the curation of a master list of 150 influential beauty creators, filtering them based on reach, engagement rates, and relevance to the beauty niche in India. I also helped in testing various data collection libraries, including Instaloader, Instagrapi API, and approaches popularized by platforms like starngage.com, which allowed us to extract deeper insights such as sentiment, follower trends, and promotional patterns.
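
As one example of the kind of library testing involved, Instaloader exposes profile metadata and posts through a small Python API; the handle below is a placeholder standing in for an entry on the curated master list:

```python
import instaloader

L = instaloader.Instaloader(download_pictures=False, download_videos=False)

# Placeholder handle; real handles came from the curated list of 150 creators.
profile = instaloader.Profile.from_username(L.context, "example_creator")
print(profile.followers, profile.mediacount)

records = []
for post in profile.get_posts():
    records.append({
        "shortcode": post.shortcode,
        "caption": post.caption,
        "likes": post.likes,
        "comments": post.comments,
        "date": post.date_utc,
    })
```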

Understanding that data integrity and consistency were critical, I worked on building and testing pipelines that ensured the robust collection of structured data from platforms like Instagram and YouTube, including posts, comments, and metadata. We tested various resilience strategies to ensure we could collect permissible data from these platforms.
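
A typical resilience building block is retrying transient failures with exponential backoff and jitter; a minimal sketch, independent of any specific platform:

```python
import random
import time
import requests

def fetch_with_retries(url, max_attempts=5):
    """Retry transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            # Back off 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
```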

Another key deliverable was the integration of Amazon S3 with EC2 using Mountpoint for S3, enabling seamless storage and retrieval of large datasets for downstream AI analysis. The data collected was later used to power NLP-driven models for identifying potentially misleading claims, clustering recurring narratives, and detecting emotional or manipulative language patterns.
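
Mountpoint for Amazon S3 exposes a bucket as a local file system, so downstream analysis can use ordinary file I/O; a sketch with hypothetical bucket and prefix names:

```python
# On the EC2 host, the bucket is first mounted with Mountpoint for Amazon S3:
#   mount-s3 example-influencer-bucket /mnt/s3
# After that, analysis code reads objects as ordinary files.
import json
from pathlib import Path

DATA_DIR = Path("/mnt/s3/influencer_posts")  # hypothetical prefix

records = []
for path in DATA_DIR.glob("*.json"):
    with path.open() as f:
        records.append(json.load(f))
print(f"loaded {len(records)} post records from S3 via Mountpoint")
```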

Throughout the project, I participated in designing AI-driven content analysis mechanisms from hashtag clustering and engagement analysis to temporal trend tracking. This enabled us to generate preliminary insights into how influencers shape product narratives and which ones are most responsible for spreading questionable beauty standards or unverified claims.
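
To make the hashtag and temporal-trend ideas concrete, a sketch like the following (with hypothetical column names) counts the most common hashtags and resamples engagement by week:

```python
from collections import Counter

import pandas as pd

# df: one row per post, with hypothetical "date", "hashtags", "likes" columns;
# hashtags are assumed to be stored as a space-separated string.
df = pd.read_csv("posts.csv", parse_dates=["date"])

tags = Counter(tag for row in df["hashtags"].dropna() for tag in row.split())
print(tags.most_common(10))  # most common hashtags across the corpus

# Weekly engagement totals reveal when a narrative gains or loses traction.
weekly_likes = df.set_index("date")["likes"].resample("W").sum()
print(weekly_likes.tail())
```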

Our collective work culminated in the creation of a robust analytical foundation for a public report on beauty misinformation, set to inform not just newsrooms but also policymakers, educators, and mental health advocates.

Nest Presentation Experience

As part of my journey at SimPPL, I had the unique opportunity to present at a two-day workshop hosted by the Nest Center for Journalism Innovation and Development in Mongolia, in collaboration with DW Akademie. This workshop was designed to empower journalists and civil society organizations in their fight against disinformation through responsible and scalable use of AI.

My session focused on practical demonstrations and conceptual frameworks surrounding Large Language Models (LLMs), emphasizing their potential in temporal analysis, sentiment detection, and machine translation. The aim was to show how these tools can be effectively deployed in fact-checking workflows and disinformation research, especially in resource-constrained settings.
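
As a stand-in for that kind of demonstration (the exact models shown at the workshop aren't specified here), an off-the-shelf sentiment pipeline from Hugging Face transformers illustrates how quickly such a tool can be put in front of a newsroom audience:

```python
from transformers import pipeline

# Default sentiment model as a stand-in for the workshop demo.
sentiment = pipeline("sentiment-analysis")

claims = [
    "This fact-check thoroughly debunks the viral post.",
    "The coverage of this story has been dishonest and misleading.",
]
for claim, result in zip(claims, sentiment(claims)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {claim}")
```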

In addition to technical content, the workshop emphasized data pipeline scalability, challenges in handling web-scale information, and the strategic application of AI in monitoring online networks. It was particularly rewarding to illustrate real-world use cases, including how uncertainty in insights from LLMs can be measured and contextualized.

This experience not only reinforced my technical capabilities but also highlighted the importance of responsible AI deployment in journalism. I'm deeply grateful to the SimPPL team for this opportunity and for their continued support. Special thanks to Swapneel Mehta and Dhara Mungra for their mentorship throughout this engagement.

Participating in this global initiative reaffirmed my belief in the transformative role of AI in journalism and information integrity, and I'm proud to have contributed to work that is already making an international impact.

Transition from Jagran to Arbiter: Laying the Foundations

In my one-on-one meetings with my mentors, Swapneel and Dhara, usually held during my assignment with one of our national newsroom clients, I was asked about my long-term objectives. Even though I had substantial experience on research-focused projects by then, I felt my profile needed more variety. As a final-year student, I was especially conscious of my lack of product-level project experience, which I knew would be crucial for placements. This stoked my interest in pursuing product-oriented work to sharpen my practical abilities and better meet industry standards.

My focus then turned to establishing the technical backbone for Arbiter, a platform that required modular, scalable, and research-oriented social listening infrastructure. This phase marked a shift from isolated task execution to a cohesive, project-based development model. I actively contributed to architecting key components of the backend, cloud infrastructure, data pipelines, and user interface.

On the infrastructure side, I designed and deployed a cost-effective ML inference architecture on AWS Inf1 EC2 instances (which use AWS Inferentia accelerators rather than GPUs), reducing model response time from 30 minutes to under 5 minutes. This not only improved performance but also established a scalable methodology for sharing accelerator resources across concurrent projects with minimal cost overhead.

On the data engineering side, I oversaw critical system setups such as PostgreSQL data transfer pipelines. These configurations enabled robust storage and retrieval of high-volume, multimodal social data.

A core deliverable was the Telegram module, developed from schema standardization through platform integration. I implemented functionality to generate sentence embeddings during data collection, introduced structured categorization of search queries, and ensured seamless data pushes to PostgreSQL for downstream processing. This work was reinforced through modular unit testing, encapsulated within a clean, test-driven folder structure. I also played a key role in building and validating the YouTube and Telegram module test pipelines, and in implementing test coverage for the Search and Auth modules.
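
A minimal sketch of the embed-and-push step, assuming a sentence-transformers model, a psycopg2 connection, and a hypothetical telegram_messages table (the module's actual schema isn't public):

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# Assumed embedding model; the module's actual choice isn't stated here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def push_messages(conn, messages):
    """Embed Telegram messages at collection time and store them."""
    embeddings = model.encode([m["text"] for m in messages])
    with conn, conn.cursor() as cur:
        for msg, emb in zip(messages, embeddings):
            cur.execute(
                "INSERT INTO telegram_messages (msg_id, text, embedding) "
                "VALUES (%s, %s, %s)",
                # Embedding stored as a float array (or a pgvector column).
                (msg["id"], msg["text"], emb.tolist()),
            )

conn = psycopg2.connect("dbname=arbiter user=ingest")  # placeholder DSN
push_messages(conn, [{"id": 1, "text": "example message"}])
```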

On the frontend, I led the restructuring of the landing page, brought motion enhancements to the main dashboard, refined UI/UX elements, and introduced visual feedback for authentication flows. These changes were systematically deployed through CI/CD workflows via GitHub Actions, giving us a fully automated, repeatable deployment pipeline.

In addition, I helped integrate an AI-powered chatbot and worked on improving the quality and contextual accuracy of summary generation outputs, enhancing user interaction and automated content comprehension features within the platform.

This phase was not just about code contributions—it was about laying down architecture that supports social listening, misinformation detection, and cross-platform data analysis at scale. The learnings and systems established during this foundational work continue to serve as a launchpad for future development cycles within the Arbiter ecosystem.

Looking Ahead

My time at SimPPL marked a significant turning point in my journey. Despite the inevitable ups and downs, it became a place where I could truly experiment and learn. It provided the environment and freedom to grow both technically and personally, shaping the way I approach challenges today. Working on real-world AI challenges, contributing to large-scale model development, and learning from my mentors taught me to approach problems with both technical precision and practical insight. I approach the future with the clarity, self-assurance, and teamwork that SimPPL helped me build, whether I'm working on scalable AI solutions or contributing to research.
