DataFirst

Understanding the Relation between Noise and Bias
in Annotated Datasets

Motivation


Problem Statement

In the evolving landscape of machine learning, the reliability and fairness of models hinge on the quality of annotated data. However, the presence of bias, particularly in subjective tasks, has become a critical concern. This is especially prominent in sensitive domains such as hate speech recognition, where annotators who come from diverse backgrounds and hold different perspectives may introduce bias into their annotations.

Datasets

Toxicity and Hate Speech

                               SBIC [1]       Kennedy [2]    Agree to Disagree [3]
# Annotators                   307            7,912          819
# Annotations per annotator    479 ± 829.6    17.1 ± 3.8     63.7 ± 139
# Unique texts                 45,318         39,565         10,440
# Annotations per text         3.2 ± 1.2      2.3 ± 1.0      5
# Labels                       2              3              2

Methodology

Two Regimes of Classification


MODEL AND PERFORMANCE

Model

We trained two models for each dataset:

                 Majority Label Model   Multi-Annotator Model
Model            RoBERTa-base [6]       DisCo [7]
Epochs           5                      5
Learning Rate    5e-5                   2e-3
Batch Size       32                     200
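As a rough illustration, the majority-label regime can be reproduced with the Hugging Face Trainer and the hyperparameters listed above. This is only a minimal sketch, not the project's training script; the file paths and the "text" / "majority_label" column names are assumptions.

```python
# Minimal sketch of the majority-label regime with the hyperparameters above.
# File paths and column names ("text", "majority_label") are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = batch["majority_label"]
    return enc

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="majority_roberta",
    num_train_epochs=5,              # from the table above
    learning_rate=5e-5,              # from the table above
    per_device_train_batch_size=32,  # from the table above
)

Trainer(model=model, args=args,
        train_dataset=data["train"],
        eval_dataset=data["validation"]).train()
```

The multi-annotator regime uses DisCo [7] instead, which jointly models item and annotator label distributions; it is not sketched here.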

Performance

Dataset              F1 Score (majority)   F1 Score (multi-annotator)
Agree to Disagree    0.78                  0.78
Kennedy              0.68                  0.75
SBIC                 0.80                  0.78
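The table does not state how the F1 scores were averaged; as a small, hedged example, a macro-averaged F1 over binary labels can be computed with scikit-learn (the label arrays below are placeholders):

```python
from sklearn.metrics import f1_score

# Placeholder gold and predicted labels; the macro averaging choice is an assumption.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred, average="macro"))
```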

Uncertainty in Machine Learning Predictions

Results

DATASET CARTOGRAPHY
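The data maps below follow Dataset Cartography [5]: for every training example, the model's probability of the gold label is recorded at the end of each epoch; the mean of that probability across epochs is the example's confidence and its standard deviation is its variability. A minimal sketch of those two coordinates (the probability matrix here is random, purely for illustration):

```python
import numpy as np

# probs[e, i]: probability the model assigns to example i's gold label after epoch e.
# Random values stand in for the logged training dynamics.
rng = np.random.default_rng(0)
probs = rng.random((5, 1000))          # 5 epochs x 1000 examples

confidence = probs.mean(axis=0)        # mean gold-label probability across epochs
variability = probs.std(axis=0)        # spread of that probability across epochs

# In the maps of [5]: easy-to-learn = high confidence / low variability,
# hard-to-learn = low confidence, ambiguous = high variability.
```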

AGREE TO DISAGREE

[Figure: data map for Agree to Disagree]

KENNEDY

[Figure: data map for Kennedy]

SBIC

[Figure: data map for SBIC]

SINGLE GROUND TRUTH MODEL

AGREE TO DISAGREE

[Figure: annotator agreement factor vs. model confidence]

KENNEDY

[Figure: annotator agreement factor vs. model confidence]

SBIC

[Figure: annotator agreement factor vs. model confidence]

MULTI ANNOTATOR MODEL

AGREE TO DISAGREE

[Figures: majority-label disagreement vs. model confidence for low-confidence examples; group counts]

SBIC

[Figures: majority-label disagreement vs. model confidence for low-confidence examples; group counts]

KENNEDY

[Figures: majority-label disagreement vs. model confidence for low-confidence examples; group counts]

Analysis

High Confidence in the Majority Model and Low Confidence in the Multi-Annotator Model

Mislabeled annotations:

Text                                                                           Annotation
I thought you were going to win this in a landslide                            Offensive
crossing my fingers for you from Germany. Even though I am not an American.    Offensive
boy F**K you                                                                   Not Offensive
Fragility at its finest                                                        Offensive
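One way such items can be surfaced, sketched below under assumed variable names and thresholds (none of which come from the project), is to flag texts where the majority-label model is highly confident while the multi-annotator model is not:

```python
import numpy as np

# Per-example confidences of the two models on their predicted labels;
# the example values and the 0.9 / 0.5 thresholds are illustrative assumptions.
texts = ["example text 1", "example text 2"]
conf_majority = np.array([0.96, 0.58])
conf_multi = np.array([0.41, 0.55])

flagged = [t for t, cm, ca in zip(texts, conf_majority, conf_multi)
           if cm > 0.9 and ca < 0.5]
print(flagged)  # candidates for mislabeling review / re-annotation
```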

Findings

Single Ground Truth Model

Multi-Annotator Model

References

[1] Social Bias Frames: Reasoning about Social and Power Implications of Language (Sap et al., ACL 2020)

[2] Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application (Kennedy et al., 2020)

[3] Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators’ Disagreement (Leonardelli et al., EMNLP 2021)

[4] SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., SemEval 2018)

[5] Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020)

[6] RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)

[7] Disagreement Matters: Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo (Weerasooriya et al., Findings 2023)

[8] Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection (Sap et al., NAACL 2022)

[9] Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations (Mostafazadeh Davani et al., TACL 2022)

About the Team

Abhishek Anand


MS in Computer Science

Anweasha Saha


MS in Computer Science

Prathyusha Naresh Kumar


MS in Computer Science

Negar Mokhberian


PhD in Computer Science

Ashwin Rao


PhD in Computer Science

Zihao He


PhD in Computer Science