Detecting Propaganda Techniques in Code-Switched Social Media Texts

Propaganda is a planned, persuasive form of communication whose goal is to influence the opinions and mindset of a target audience, or of the public in general, towards a specific agenda. Propaganda carries a deliberate tone: it is a conscious act by an individual, group, or institution attempting to propagate their own narratives. With the advent of the internet and the growing number of social media platforms, falsified information and distorted arguments in the form of propaganda have begun to circulate on a massive scale. Propaganda can influence individuals by shaping their attitudes and beliefs about certain issues, events, or individuals. It can also influence their behavior, such as how they vote, what products they buy, or even how they view certain groups of people. Most work on propaganda detection has targeted high-resource languages such as English, while little effort has been made to detect propaganda in low-resource languages. Speakers of most low-resource languages mix multiple languages, especially on social media platforms, in posts, tweets, and comments. This phenomenon of mixing multiple languages is referred to as code-switching: switching between two or more languages within a sentence or phrase to express ideas and thoughts more accurately and convincingly. In general, code-switching brings together high-resource and low-resource languages within the same text. To contribute to a healthier online environment, we propose a novel task of detecting propaganda techniques in code-switched data involving English and Roman Urdu. We create a corpus of 1030 code-switched texts, which we manually annotate at the fragment level with 20 propaganda techniques and make publicly available.
For fragment-level annotation of our code-switched texts, we develop a web-based annotation platform with an interface that allows easy labelling of spans of text. Moreover, to perform a preliminary analysis of our newly created dataset, we run experiments using several state-of-the-art pre-trained multilingual and cross-lingual language models, namely BERT, mBERT, and XLM-RoBERTa, with different fine-tuning strategies, and find that XLM-RoBERTa fine-tuned on Roman Urdu outperforms all other models on our task and dataset. Our work brings a new perspective to the understanding of how propaganda can be detected when multiple languages are used together in social media communication.
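Fragment-level annotations of the kind described above are typically converted to token-level BIO tags before fine-tuning sequence-labelling models such as XLM-RoBERTa. The sketch below illustrates one such conversion; the example sentence, the character-span offsets, the whitespace tokenizer, and the `Name_Calling` label are illustrative assumptions, not the thesis's actual data or preprocessing pipeline.

```python
# Minimal sketch: mapping character-level annotated fragments to per-token
# BIO labels for span-based propaganda-technique detection.
# (Whitespace tokenization is used purely for illustration; a real pipeline
# would use the model's subword tokenizer and its offset mapping.)

def spans_to_bio(text, fragments):
    """fragments: list of (start, end, technique) character spans."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # character offset of this token
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)

    labels = ["O"] * len(tokens)
    for f_start, f_end, technique in fragments:
        inside = False
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start < f_end and t_end > f_start:  # token overlaps fragment
                labels[i] = ("I-" if inside else "B-") + technique
                inside = True
    return tokens, labels

# Hypothetical code-switched (English / Roman Urdu) example with one fragment:
text = "Yeh sab log traitors hain, wake up awaam!"
fragments = [(12, 20, "Name_Calling")]        # covers the word "traitors"
tokens, labels = spans_to_bio(text, fragments)
```

Multi-token fragments receive a `B-` tag on their first overlapping token and `I-` tags on the rest, which is the usual input format for token-classification fine-tuning.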

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Natural Language Processing

Advisors: Dr. Shady Shehata, Dr. Preslav Nakov

With a 2-year embargo period
