Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages
Solving the problem of abusive speech detection in 8 languages (10 language settings), using data drawn from 14 publicly available sources.
New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.
Please cite our paper in any published work that uses any of these resources.
@article{das2022data,
  title={Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages},
  author={Das, Mithun and Banerjee, Somnath and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2204.12543},
  year={2022}
}
./Dataset --> Contains the dataset-related details.
./Codes --> Contains the code.
Make sure to use Python 3 when running the scripts. The package requirements can be installed by running `pip install -r requirements.txt`.
Check out the Dataset folder to learn more about how we curated the dataset for the different languages.
- m-BERT: m-BERT is pre-trained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It is a stack of transformer encoder layers with 12 "attention heads", i.e., fully connected neural networks augmented with a self-attention mechanism. m-BERT is restricted in the number of tokens it can handle (512 at most). To fine-tune m-BERT, we add a fully connected layer on top of the output corresponding to the CLS token in the input; this CLS token output usually holds the representation of the sentence passed to the model. m-BERT has been well studied for abusive speech detection, has already surpassed existing baselines, and stands as a state-of-the-art model (a loading sketch follows this list).
- MuRIL: MuRIL (Multilingual Representations for Indian Languages) aims to improve interoperability from one language to another. This model uses a BERT base architecture pre-trained from scratch on the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages and their transliterated counterparts.
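The sketch below is a minimal illustration (not our training script) of how such an encoder can be loaded with a classification head on the [CLS] output and how inputs are truncated to the 512-token limit, assuming the Hugging Face `transformers` library; the label count and the example post are illustrative assumptions.

```python
# Minimal sketch: load m-BERT / MuRIL with a classification head on [CLS].
# The label count and example text are illustrative, not from the paper.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MAX_LEN = 512  # m-BERT/MuRIL cannot handle more than 512 tokens


def load_classifier(checkpoint: str, num_labels: int = 2):
    """Load an encoder and attach a fully connected layer on the [CLS] output."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # AutoModelForSequenceClassification adds a classification layer on top of
    # the pooled [CLS] representation, as described above.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model


# m-BERT (bert-base-multilingual-cased) and MuRIL (google/muril-base-cased)
mbert_tok, mbert_model = load_classifier("bert-base-multilingual-cased")
muril_tok, muril_model = load_classifier("google/muril-base-cased")

# Tokenize a post, truncating to the 512-token limit mentioned above.
inputs = mbert_tok("an example post", truncation=True,
                   max_length=MAX_LEN, return_tensors="pt")
outputs = mbert_model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```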
The following language settings are covered; a minimal inference sketch follows the list.
- Bengali
- Hindi
- Hindi-CodeMixed
- Kannada-CodeMixed
- Malayalam-CodeMixed
- Marathi
- Tamil-CodeMixed
- Urdu
- Urdu-CodeMixed
- English
- AllInOne
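Once you have picked one of the released fine-tuned checkpoints linked above, inference on a single post might look like the sketch below. The checkpoint path and label names here are placeholders, not the actual released model names; substitute the model you want to use.

```python
# Hypothetical inference sketch: CHECKPOINT and LABELS are placeholders --
# replace them with the released model you want to use and its label set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/or/hub-id-of-a-released-model"  # placeholder
LABELS = ["normal", "abusive"]                     # assumed binary labels

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()


def classify(text: str) -> str:
    """Return the predicted label for a single post."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]


print(classify("an example post in any of the supported languages"))
```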
Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. "Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages." In Proceedings of the ACM Conference on Hypertext and Social Media (HT '22).