These datasets were prepared as part of a research project led by Dr Harmonie Toros, titled “Gendered Narrative Analysis of Violent Extremism in the United Kingdom”, funded by GCHQ’s Research Fellowship in National Resilience scheme. The project uses two case studies – the controversy surrounding the return to the UK of Shamima Begum from Iraq/Syria and the attack on two mosques in Christchurch, New Zealand – to investigate the gendered narratives regarding violent extremism in both mainstream and non-mainstream (extremist) online communities.
For this, we collected and prepared data from two mainstream sources and one non-mainstream source. For the former, the dataset contains comment sections from articles published by two UK newspapers, The Independent and the Daily Mail. For the non-mainstream data, we collected threads from the 4plebs archive of the 4chan /pol/ board that matched keywords related to each case study and were posted within the designated timeline of the study. A brief description of the data collection process is given below. Specific information about the dataset files, i.e. their dimensions, instances and attributes, can be requested.
Each dataset contains the comments posted on manually selected articles related to each of the two case studies, preserving the tree-like structure of the comments (through the ReplyTo and OPNumber attributes). The articles were selected from search results on Google using search phrases such as:
The comments on the selected articles were extracted using the Python requests library, and the datasets were preprocessed and saved as CSV files. Each line of these files refers to a text comment, with metadata contextualizing it: the timestamp, title and ID of the article it was posted on, the username provided by the poster (anonymised as a numeric ID), and information on the structure of the conversation (which comment it replies to, and how many replies that comment received).
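As an illustration of how the conversation structure could be recovered from such a file, the following is a minimal sketch, not the project's own code. The column names CommentID and Text, the sample rows, and the convention that an empty ReplyTo marks a top-level comment are all assumptions made for this example; the actual attribute names should be checked against the dataset files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: column names and values are assumptions,
# not the real schema of the dataset files.
SAMPLE_CSV = """CommentID,ReplyTo,Text
1,,Top-level comment
2,1,Reply to comment 1
3,1,Another reply to comment 1
4,2,Reply to comment 2
"""


def build_reply_index(rows):
    """Map each parent comment ID to the list of comment IDs replying to it."""
    children = defaultdict(list)
    for row in rows:
        # Assumption: an empty ReplyTo field marks a top-level comment.
        parent = row["ReplyTo"] or None
        children[parent].append(row["CommentID"])
    return children


def count_descendants(children, comment_id):
    """Count all direct and indirect replies below a comment."""
    return sum(
        1 + count_descendants(children, child)
        for child in children.get(comment_id, [])
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
children = build_reply_index(rows)
print(children[None])                      # top-level comment IDs
print(count_descendants(children, "1"))    # size of the subtree under comment 1
```

To use this on a real file, replace the in-memory sample with an opened CSV file and adjust the column names accordingly.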
Each dataset contains every post in threads that were considered relevant for the case study, based on keyword search. To extract the dataset for each case study from the unfiltered 4plebs data archive (an online archive of the politics board of 4chan, /pol/) for the relevant timeline, we performed a simple keyword search over the text of all comments. If any comment posted in a thread contained any of the target keywords or key phrases, the entire thread was kept in the dataset. Otherwise, the whole thread was removed, as it was deemed irrelevant to that case study.
The keyword search for Case 1 used the following strings: “Brenton”, “Tarrant”, and “Christchurch Mosque”. For Case 2, the strings were “Shamima”, “Begum” and any combination of the words “Jihadi, Beheader, Isis” and “wife, wives, bride, brides” (as these terms were used to refer to the case in the forums). The Case 1 dataset was too big to be kept in a single CSV file, so it was divided into four consecutive parts.
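The thread-level filtering described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the project: it assumes case-insensitive substring matching, reads "any combination" as one word from the first group followed by one word from the second group, and uses hypothetical field names (thread, text) for the posts.

```python
import re
from itertools import product

# Search terms as listed in the description above.
CASE1_TERMS = ["Brenton", "Tarrant", "Christchurch Mosque"]
# Assumption: "any combination" means a word from the first group
# followed by a word from the second group, e.g. "Jihadi bride".
CASE2_TERMS = ["Shamima", "Begum"] + [
    f"{first} {second}"
    for first, second in product(
        ["Jihadi", "Beheader", "Isis"],
        ["wife", "wives", "bride", "brides"],
    )
]


def compile_terms(terms):
    """Build one regex matching any term (assumed case-insensitive)."""
    # Words inside a phrase may be separated by any run of whitespace.
    parts = [r"\s+".join(re.escape(w) for w in term.split()) for term in terms]
    return re.compile("|".join(parts), re.IGNORECASE)


def filter_threads(posts, pattern):
    """Keep every post of any thread in which at least one post matches."""
    relevant = {p["thread"] for p in posts if pattern.search(p["text"])}
    return [p for p in posts if p["thread"] in relevant]


# Hypothetical posts, used only to show the behaviour.
posts = [
    {"thread": 1, "text": "News about the jihadi bride today"},
    {"thread": 1, "text": "Unrelated reply in the same thread"},
    {"thread": 2, "text": "Completely off-topic thread"},
]
kept = filter_threads(posts, compile_terms(CASE2_TERMS))
print([p["thread"] for p in kept])
```

Note that the non-matching reply in thread 1 is retained because another post in the same thread matched, while thread 2 is dropped entirely.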
The data files are encrypted zip files. If you would like to use the data, please send a brief email to Dr Harmonie Toros (firstname.lastname@example.org), detailing your name, institution and intended use of the data. After reviewing your request, we will send you the password for decrypting the zip files.
Links to the data files: