Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages
Solving the problem of abusive speech detection in 8 languages (10 language settings), using data drawn from 14 publicly available sources.
New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.
Please cite our paper in any published work that uses any of these resources.
@article{das2022data,
  title={Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages},
  author={Das, Mithun and Banerjee, Somnath and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2204.12543},
  year={2022}
}
./Dataset --> Contains the dataset-related details.
./Codes --> Contains the code.
Make sure to use Python 3 when running the scripts. The package requirements can be installed by running `pip install -r requirements.txt`.
Check out the Dataset folder to learn more about how we curated the dataset for the different languages.
- m-BERT: m-BERT is pre-trained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It is a stack of transformer encoder layers with 12 "attention heads", i.e., fully connected neural networks augmented with a self-attention mechanism. m-BERT is restricted in the number of tokens it can handle (512 at most). To fine-tune m-BERT, we add a fully connected layer on top of the output corresponding to the CLS token in the input; this CLS token output usually holds the representation of the sentence passed to the model. m-BERT has been well studied for abusive speech detection, has already surpassed existing baselines, and stands as a state-of-the-art model (a loading sketch follows this list).
- MuRIL: MuRIL (Multilingual Representations for Indian Languages) aims to improve interoperability from one language to another. This model uses a BERT base architecture pre-trained from scratch on the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages and their transliterated counterparts.
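The sketch below is a minimal illustration (not our training script) of how such an encoder can be loaded with a classification head on the [CLS] output and how inputs are truncated to the 512-token limit, assuming the Hugging Face `transformers` library; the label count and the example post are illustrative assumptions.

```python
# Minimal sketch: load m-BERT / MuRIL with a classification head on [CLS].
# The label count and example text are illustrative, not from the paper.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MAX_LEN = 512  # m-BERT/MuRIL cannot handle more than 512 tokens


def load_classifier(checkpoint: str, num_labels: int = 2):
    """Load an encoder and attach a fully connected layer on the [CLS] output."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # AutoModelForSequenceClassification adds a classification layer on top of
    # the pooled [CLS] representation, as described above.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model


# m-BERT (bert-base-multilingual-cased) and MuRIL (google/muril-base-cased)
mbert_tok, mbert_model = load_classifier("bert-base-multilingual-cased")
muril_tok, muril_model = load_classifier("google/muril-base-cased")

# Tokenize a post, truncating to the 512-token limit mentioned above.
inputs = mbert_tok("an example post", truncation=True,
                   max_length=MAX_LEN, return_tensors="pt")
outputs = mbert_model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```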
The following language settings are covered; a minimal inference sketch follows the list.
- Bengali
- Hindi
- Hindi-CodeMixed
- Kannada-CodeMixed
- Malayalam-CodeMixed
- Marathi
- Tamil-CodeMixed
- Urdu
- Urdu-CodeMixed
- English
- AllInOne
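Once you have picked one of the released fine-tuned checkpoints linked above, inference on a single post might look like the sketch below. The checkpoint path and label names here are placeholders, not the actual released model names; substitute the model you want to use.

```python
# Hypothetical inference sketch: CHECKPOINT and LABELS are placeholders --
# replace them with the released model you want to use and its label set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/or/hub-id-of-a-released-model"  # placeholder
LABELS = ["normal", "abusive"]                     # assumed binary labels

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()


def classify(text: str) -> str:
    """Return the predicted label for a single post."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]


print(classify("an example post in any of the supported languages"))
```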
Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. "Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages." In Proceedings of the ACM Conference on Hypertext and Social Media (HT '22).