hate-alert/IndicAbusive

Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

This repository addresses abusive speech detection in 8 languages (10 language types, counting code-mixed variants), using data from 14 publicly available sources.

New update -- 🎉 🎉 All our BERT models are available here. Be sure to check them out 🎉 🎉.

Please cite our paper in any published work that uses any of these resources.

@article{das2022data,
  title={Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages},
  author={Das, Mithun and Banerjee, Somnath and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2204.12543},
  year={2022}
}

Folder Description 👈


./Dataset   --> Contains the dataset-related details.
./Codes     --> Contains the code.

Requirements

Make sure to use Python3 when running the scripts. The package requirements can be obtained by running pip install -r requirements.txt.


Dataset

Check out the Dataset folder to learn more about how we curated the dataset for the different languages. ⚠️ A few datasets require crawling, so we cannot guarantee retrieval of all the data points, as tweets may get deleted. ⚠️


Models used for our task

  1. m-BERT is pre-trained on the 104 languages with the largest Wikipedias, using a masked language modeling (MLM) objective. It is a stack of transformer encoder layers, each with 12 "attention heads", i.e., fully connected neural networks augmented with a self-attention mechanism. m-BERT can handle at most 512 tokens per input. To fine-tune m-BERT, we add a fully connected layer on top of the output corresponding to the CLS token in the input. This CLS token output usually holds the representation of the sentence passed to the model. The m-BERT model has been well studied in abusive speech detection, has already surpassed existing baselines, and stands as a state-of-the-art model.

  2. MuRIL stands for Multilingual Representations for Indian Languages and aims to improve interoperability from one language to another. This model uses a BERT base architecture pretrained from scratch utilizing the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages and their transliterated counterparts.
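The two mechanical details mentioned for m-BERT, the 512-token limit and the fully connected layer applied to the CLS output, can be illustrated with a small sketch. This is not the code from ./Codes: it uses plain Python with toy vectors in place of a real encoder, and the names `build_input`, `classify_from_cls`, the weights, and the hidden size of 3 are all hypothetical choices made for illustration.

```python
import math

MAX_LEN = 512  # maximum sequence length stated above for m-BERT
CLS, SEP = "[CLS]", "[SEP]"  # BERT-style special tokens


def build_input(tokens, max_len=MAX_LEN):
    """Truncate tokens so that [CLS] + tokens + [SEP] fits within max_len."""
    body = tokens[: max_len - 2]  # reserve two slots for the special tokens
    return [CLS] + body + [SEP]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(hidden_states, weights, bias):
    """Apply a fully connected layer to the vector at position 0 ([CLS]).

    hidden_states: one vector per input token (stand-in for encoder output).
    weights: one row of coefficients per class; bias: one value per class.
    """
    cls_vec = hidden_states[0]
    logits = [
        sum(w * h for w, h in zip(row, cls_vec)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy usage: a 600-token post is cut down to 512 positions.
tokens = ["tok"] * 600
inputs = build_input(tokens)

# Hypothetical encoder output: one 3-dimensional vector per position.
hidden = [[0.5, -1.0, 2.0] for _ in inputs]

# Hypothetical two-class head (e.g. normal vs. abusive), untrained toy values.
weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
bias = [0.0, 0.0]
probs = classify_from_cls(hidden, weights, bias)
```

In an actual fine-tuning run, the hidden states would come from the pre-trained encoder and the head's weights would be learned jointly with it; the sketch only shows where truncation happens and which position feeds the classifier.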

Links to the individual models 👼

  1. Bengali
  2. Hindi
  3. Hindi-CodeMixed
  4. Kannada-CodeMixed
  5. Malayalam-CodeMixed
  6. Marathi
  7. Tamil-CodeMixed
  8. Urdu
  9. Urdu-CodeMixed
  10. English
  11. AllInOne

For more details about our paper

Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. "Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages". ACM HT'22
