Datasets, Tools and Other Resources

iCSS researchers regularly publish datasets, tools and other resources for other researchers and users to reproduce our research work, to conduct follow-up research and to use such resources for other purposes. Please follow the instructions of each resource on how to access it and how to cite it in your work.

Datasets++

Contributors: Enes Altuncu*, Can Başkent, Sanjay Bhattacherjee, Shujun Li and Dwaipayan Roy*
* The main constructors and coders of the resource. Others contributed to the conceptual design and validation.
Release Year: 2025

This repository includes the dataset and source code reported in a Resource & Reproducibility paper “FACTors: A New Dataset for Studying Fact-checking Ecosystem” accepted to SIGIR 2025 (48th International ACM SIGIR Conference on Research and Development in Information Retrieval).

FACTors contains 118,112 claims from 117,993 fact-checking reports in English, (co-)authored by 1,953 individuals and published during the period 1995-2025 by 39 fact-checking organisations that are active signatories of the IFCN (International Fact-Checking Network) and/or EFCSN (European Fact-Checking Standards Network). It contains 7,327 overlapping claims investigated by multiple fact-checking organisations, corresponding to 2,977 unique claims.
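
As a quick sanity check on these figures, the minimal Python sketch below derives a few ratios from the reported counts. Only the four counts come from the description above. The variable names are illustrative, and reading the overlap figures as "entries per unique claim" is an assumption about how the counts relate.

```python
# Counts as reported in the FACTors description (not read from the dataset itself).
total_claims = 118_112
total_reports = 117_993
overlapping_claims = 7_327
unique_overlapping_claims = 2_977

# Average number of claims per fact-checking report.
claims_per_report = total_claims / total_reports

# Average number of overlapping claim entries per unique claim
# (assumes each overlapping entry maps to exactly one unique claim).
overlap_per_unique = overlapping_claims / unique_overlapping_claims

# Share of all claims that were investigated by more than one organisation.
overlap_share = overlapping_claims / total_claims

print(f"{claims_per_report:.3f} claims per report")
print(f"{overlap_per_unique:.2f} overlapping entries per unique claim")
print(f"{overlap_share:.1%} of claims overlap across organisations")
```

The ratios suggest that almost every report carries a single claim and that a claim covered by several organisations appears, on average, between two and three times.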

The resource can be found at the GitHub repository:

https://github.com/altuncu/FACTors

Contributors: Yuqi Niu*, Weidong Qiu, Peng Tang, Lifan Wang, Shuo Chen, Shujun Li, Nadin Kökciyan and Ben Niu
* The main constructor and coder of the resource. Others contributed to the conceptual design and validation.
Release Year: 2024

This resource includes two datasets of photo privacy images and one machine learning based classifier for detecting bystanders in a given photo. It accompanies the research work described in the following paper:

Yuqi Niu, Weidong Qiu, Peng Tang, Lifan Wang, Shuo Chen, Shujun Li, Nadin Kökciyan and Ben Niu (2025) Everyone’s Privacy Matters! An Analysis of Privacy Leakage from Real-World Facial Images on Twitter and Associated User Behaviors. Proceedings of the ACM on Human-Computer Interaction (PACMHCI), Volume 9, Issue 2, Article Number CSCW069, 38 pages, ACM, to be presented at CSCW 2025 (28th ACM Conference on Computer-Supported Cooperative Work and Social Computing), in Bergen, Norway on October 18-22, 2025.

The resource can be found at the GitHub repository:

https://github.com/Yuqi-Niu/Bystander-Detection

Contributors: Rogério de Lemos, Virginia Franqueira, Tracee Green, Marek Grzes and Robin Ayling
Release Year: 2024

The dataset is the main outcome of the PASYDA project funded by REPHRAIN (National Research Centre on Privacy, Harm Reduction and Adversarial Influence Online). It consists of 10 synthetic datasets encapsulating metadata from social media interactions centred around instances of child grooming. These datasets serve as resources for researchers seeking to evaluate and enhance their detection algorithms.

There are 3 files for each of the Perverted Justice (PJ) scenarios:

data – contains all the exchanges related to the simulated scenario
vic_data – contains the exchanges received or sent by the victim
solutions – contains only the exchanges associated with the scenario
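
One way such files could support evaluation is sketched below in Python: a detector's flagged exchanges are scored against the exchanges listed in a scenario's solutions file. This is only an illustration. The actual file format is documented in the repository, so the exchange identifiers, the `score_detector` helper and the toy values are all hypothetical.

```python
def score_detector(flagged_ids, solution_ids):
    """Return (precision, recall) of flagged exchanges against the ground truth."""
    flagged = set(flagged_ids)
    truth = set(solution_ids)
    true_positives = len(flagged & truth)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy, made-up identifiers standing in for exchanges from one scenario.
all_exchanges = ["m1", "m2", "m3", "m4", "m5", "m6"]  # role of the "data" file
solution = ["m2", "m4", "m5"]                         # role of the "solutions" file
flagged = ["m2", "m3", "m4"]                          # output of some detector

precision, recall = score_detector(flagged, solution)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Treating exchanges as sets of identifiers keeps the scoring independent of how a particular detector orders or batches its output.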

The dataset is released publicly as a GitHub repository:

https://github.com/rdelemos/PASYDA

Lead Researcher: Dr Harmonie Toros (formerly with the School of Politics & International Relations)
Data Preparation: Caio Ribeiro (School of Computing)
Release Year: 2022

These datasets were prepared as part of a research project led by Dr Harmonie Toros, titled “Gendered Narrative Analysis of Violent Extremism in the United Kingdom”, funded by GCHQ’s Research Fellowship in National Resilience scheme. The project uses two case studies – the controversy surrounding the return to the UK of Shamima Begum from Iraq/Syria and the attack on two mosques in Christchurch, New Zealand – to investigate the gendered narratives regarding violent extremism in both mainstream and non-mainstream (extremist) online communities.

For this, we collected and prepared data from two mainstream sources and one non-mainstream source. For the former, the dataset contains comment sections from articles published by two UK newspapers, The Independent and DailyMail. For the non-mainstream data, we collected threads from the 4plebs archive of the 4chan /pol/ board that matched keywords related to each case study and were posted within the designated timeline of the study. Detailed below is a brief description of the data collection process. Specific information about the dataset files, i.e. their dimensions, instances and attributes, can be requested.

For more information, visit the Kent Data Repository:

Case 1 datasets (206MB): https://data.kent.ac.uk/id/eprint/407
Case 2 datasets (40MB): https://data.kent.ac.uk/id/eprint/409

Contributors: Mohamad Imad Mahaini, Shujun Li and Rahime Belen Sağlam
Release Year: 2020

The cyber security taxonomy was constructed by applying the human-machine teaming based taxonomy construction process described in the following research paper:

Mohamad Imad Mahaini, Shujun Li and Rahime Belen Sağlam (2019) Building Taxonomies based on Human-Machine Teaming: Cyber Security as an Example. Proceedings of the 14th International Conference on Availability, Reliability and Security (ARES 2019), Article Number 30, 9 pages, ACM.

The taxonomy in five different formats (CSV, JSON, SQL, XLSX, and XML) and its five different visualisations (Flat Tree, Hyper Tree, Space Tree, Radial Graph and Sun Burst) can be found at the following dedicated page:

https://cyber.kent.ac.uk/research/cyber_taxonomy/

Tools and Other Resources

Contributors: Zhanbo Liang*, Jie Guo, Weidong Qiu, Zheng Huang and Shujun Li
* The main constructor and coder of the resource. Others contributed to the conceptual design and validation.
Release Years: 2023-24

This resource is the source code and data of the privacy disclosure detection method described in the following research paper:

Zhanbo Liang, Jie Guo, Weidong Qiu, Zheng Huang and Shujun Li (2024) When Graph Convolution Meets Double Attention: Online Privacy Disclosure Detection with Multi-Label Text Classification. Data Mining and Knowledge Discovery, Volume 38, pp. 1171-1192, Springer, accepted via and presented at ECML-PKDD 2024 (36th European Conference on Machine Learning & 28th European Conference on Principles and Practice of Knowledge Discovery in Databases) in September 2024.

The resource can be found at the GitHub repository:

https://github.com/xiztt/WGCMDA

Contributors: Ruiyuan Lin*, Sheng Liu*, Jun Jiang*, Shujun Li*, Chengqing Li and C.-C. Jay Kuo
* The main coders of the resource. Others contributed to the conceptual design and validation.
Release Years: 2022-23

This resource is the source code for the following research paper:

Ruiyuan Lin, Sheng Liu, Jun Jiang, Shujun Li, Chengqing Li and C.-C. Jay Kuo (2024) Recovering sign bits of DCT coefficients in digital images as an optimization problem. Journal of Visual Communication and Image Representation, Volume 98, Article Number 104045, 14 pages, Elsevier Inc.

The source code can be found at the GitHub repository:

https://github.com/ChengqingLi/DCT_SBR

Contributors: Sam Parker*, Haiyue Yuan and Shujun Li
* The main constructor and coder of the resource. Others contributed to the conceptual design and validation.
Release Year: 2023

This is the source code and data for the password visualisation system described in the following research paper:

Sam Parker, Haiyue Yuan and Shujun Li (2023) PassViz: A Visualisation System for Analysing Leaked Passwords. Proceedings of the 2023 20th IEEE Symposium on Visualization for Cyber Security (VizSec 2023), pp. 33-42, IEEE.

The developed software includes two components:

A GUI (graphical user-interface): https://github.com/samcparker/passviz-gui
A command-line toolkit: https://github.com/samcparker/passviz-cli

Contributors: Yang Xu*, Jie Guo, Weidong Qiu, Zheng Huang, Enes Altuncu and Shujun Li
* The main constructor and coder of the resource. Others contributed to the conceptual design and validation.
Release Year: 2022

This is the source code of the rumour detection method proposed in the following research paper:

Yang Xu, Jie Guo, Weidong Qiu, Zheng Huang, Enes Altuncu and Shujun Li (2022) “Comments Matter and The More The Better!”: Improving Rumor Detection with User Comments. Proceedings of the 2022 21st IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2022), pp. 383-390, IEEE.

The source code can be found at the GitHub repository:

https://github.com/Oraccc/Improving-Rumor-Detection-with-User-Comments

Developers: Gabriel Ruaud, Robin Leclerc and Louis Anelli
Supervisors: Shujun Li and Jack Cunliffe
Release Years: 2021-22

Nyja is a modular toolkit designed to help researchers study the Tor network. It was developed as the main outcome of an MSc student project.

It is a cross-platform and user-friendly toolkit that allows researchers to conduct massive metadata gathering, timestamp-based archiving of websites, scheduled monitoring of indexing websites for automatic discovery of .onion links, and much more.

The only dependency required to run Nyja is Docker, which must be installed and running.

The tool can be found at the GitHub repository:

https://github.com/B611/nyja

Contributors: Haiyue Yuan*, Shujun Li and Patrice Rusconi
* The main constructor and coder of the resource. Others contributed to the conceptual design and validation.
Release Year: 2021

CogTool+ is open-source software inspired by and based on CogTool (https://github.com/cogtool/cogtool). It is a major research outcome of a Singapore-UK research project “COMMANDO-HUMANS: COMputational Modelling and Automatic Non-intrusive Detection Of HUMan behAviour based iNSecurity”, jointly funded by the Engineering and Physical Sciences Research Council (EPSRC) and Singapore’s National Research Foundation (NRF).

The tool is documented in detail in the following research paper:

Haiyue Yuan, Shujun Li and Patrice Rusconi (2021) CogTool+: Modeling Human Performance at Large Scale. ACM Transactions on Computer-Human Interaction, Volume 28, Issue 2, Article Number 7, 38 pages, ACM.

An early description of CogTool+ and some example applications are included in the following monograph:

Haiyue Yuan, Shujun Li and Patrice Rusconi, Cognitive Modeling for Automated Human Performance Evaluation at Scale, part of Human–Computer Interaction Series book series (HCIS) and of SpringerBriefs in Human-Computer Interaction book sub series (BRIEFSHUMAN), ISBN 978-3-030-45704-4, Springer, 2020

The source code and other related data can be found at the GitHub repository:

https://github.com/hyyuan/cogtool_plus

Institute of Cyber Security for Society (iCSS)

Datasets, Tools and Other Resources

Datasets++

FACTors: A New Dataset for Studying Fact-checking Ecosystem

Photo privacy datasets and bystander classifier

PASYDA (Synthetic Datasets for Enhancing Online Child Protection from Grooming) dataset

GCHQ gendered narratives datasets

Cyber security taxonomy and visualisations

Tools and Other Resources

Online privacy disclosure detection with multi-label text classification

Recovering sign bits of DCT coefficients in digital images as an optimization problem

PassViz: A visualisation system for analysing leaked passwords

Rumor detection with user comments

Nyja: A dark web sensing toolkit

CogTool+: A cognitive modelling tool for automating large-scale human performance evaluation