Commit 6e60583

add documentation archive for previous version docs
1 parent 044979a commit 6e60583

5 files changed: +1124 -1 lines changed

README.md

+13 -1
@@ -9,6 +9,8 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
## Documentation
Full documentation is available [here](./docs/README.md)

+If you are using a previous version of **bnlp**, check the documentation [archive]()
+
## Features
- Tokenization
- [Basic Tokenizer](./docs/README.md#basic-tokenizer)
@@ -60,4 +62,14 @@ raw_text = "আমি বাংলায় গান গাই।"
tokens = tokenizer(raw_text)
print(tokens)
# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
-```
+```
+
+## Contributor Guide
+
+See the [CONTRIBUTING.md](https://github.com/sagorbrur/bnlp/blob/master/CONTRIBUTING.md) page for details.
+
+
+## Thanks To
+
+* [Semantics Lab](https://www.facebook.com/lab.semantics/)
+* All the developers who are contributing to enrich Bengali NLP.

docs/archive/doc_v1.0.0.md

+148
@@ -0,0 +1,148 @@
# Bengali Natural Language Processing(BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you **tokenize Bengali text**, **embed Bengali words**, and **construct neural models** for Bengali NLP purposes.

## Installation

* PyPI package installer (tested with Python 3.6 and 3.7)

```pip install bnlp_toolkit```

## Pretrained Model

Trained on the `wikipedia dump` dataset

* [Bengali SentencePiece](https://github.com/sagorbrur/bnlp/tree/master/model)
* [Bengali Word2Vec](https://drive.google.com/open?id=13fBXPwqpP8-e_aWVognoViTeg5DxSUKR)
* [Bengali FastText](https://drive.google.com/open?id=1KRA91w6dMpuQpowOwLCRplRgSdRzyOYz)

## Tokenization

* **Bengali SentencePiece Tokenization**

- Tokenization using a trained model
```py
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"  # path to the trained SentencePiece model
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)

```
- Training SentencePiece
```py
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer(is_train=True)
data = "test.txt"  # training corpus
model_prefix = "test"  # prefix for the output model files
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)

```

* **NLTK Tokenization**

```py
from bnlp.nltk_tokenizer import NLTK_Tokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)

```
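
As a rough illustration of what sentence tokenization does for Bengali text, the sketch below splits on the danda (`।`), `?` and `!` using only the standard library. It is not how `NLTK_Tokenizer` is implemented; the name `split_sentences` and the choice of sentence-ending marks are assumptions made for this example.

```python
import re

# Sentence-ending marks assumed for this sketch: danda, question mark, exclamation mark.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Split text at whitespace that directly follows a sentence-ending mark."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
print(sentences)
```

The lookbehind keeps each mark attached to its sentence, so joining the pieces with single spaces gives back the input.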

## Word Embedding

* **Bengali Word2Vec**

- Generate Vector Using Pretrained Model

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

```

- Find Most Similar Word Using Pretrained Model

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)

```
- Train Bengali Word2Vec with your own data

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec

# The original snippet used `bwv` without creating it. is_train=True follows the
# other training examples and is an assumption for this version.
bwv = Bengali_Word2Vec(is_train=True)
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)

```
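
Most-similar lookups over word vectors are typically ranked by cosine similarity. The sketch below shows that ranking on made-up three-dimensional vectors; the helper names and the toy values are hypothetical, and it does not reproduce what `Bengali_Word2Vec.most_similar` does internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration only; real models here use 300 dimensions.
toy_vectors = {
    "village": [0.9, 0.1, 0.0],
    "city": [0.8, 0.2, 0.1],
    "song": [0.0, 0.1, 0.9],
}
ranking = most_similar("village", toy_vectors)
print(ranking)
```

With these toy values, `city` ranks above `song` because its vector points in nearly the same direction as `village`.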

* **Bengali FastText**

- Download Bengali FastText Pretrained Model From [Here](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.bn.300.bin.gz)

- Generate Vector Using Pretrained Model

```py
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "cc.bn.300.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

```
- Train Bengali FastText Model

```py
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
bft.train_fasttext(data, model_name)

```
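
fastText builds a word's vector from the vectors of its character n-grams, with the word wrapped in `<` and `>` boundary markers. The sketch below enumerates those n-grams so the idea is concrete; `char_ngrams` is a hypothetical helper, the 3 to 6 range mirrors fastText's usual defaults for unsupervised training, and nothing here is taken from the `Bengali_Fasttext` wrapper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, using < and > as boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
print(char_ngrams("গ্রাম", min_n=3, max_n=4))
```

Python strings are indexed by code point, so Bengali vowel signs and the hasanta each count as one position in the window.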

## Issue
* If you get `ModuleNotFoundError: No module named 'fasttext'`, run the following command:

```pip install fasttext```
* If an `nltk` error occurs, run the following before importing `bnlp`:

```py
import nltk
nltk.download("punkt")
```

docs/archive/doc_v1.2.0.md

+222
@@ -0,0 +1,222 @@
# Bengali Natural Language Processing(BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you **tokenize Bengali text**, **embed Bengali words**, and **construct neural models** for Bengali NLP purposes.


# Contents
- [Current Features](#current-features)
- [Installation](#installation)
- [Pretrained Model](#pretrained-model)
- [Tokenization](#tokenization)
- [Embedding](#word-embedding)
- [Issue](#issue)
- [Contributor Guide](#contributor-guide)
- [Contributor List](#contributor-list)


## Current Features
* [Bengali Tokenization](#tokenization)
- SentencePiece Tokenizer
- Basic Tokenizer
- NLTK Tokenizer
* [Bengali Word Embedding](#word-embedding)
- Bengali Word2Vec
- Bengali Fasttext
- Bengali GloVe


## Installation

* PyPI package installer (tested with Python 3.5, 3.6 and 3.7)

```pip install bnlp_toolkit```

* Local
```
$ git clone https://github.com/sagorbrur/bnlp.git
$ cd bnlp
$ python setup.py install
```


## Pretrained Model

### Download Link

* [Bengali SentencePiece](https://github.com/sagorbrur/bnlp/tree/master/model)
* [Bengali Word2Vec](https://drive.google.com/open?id=1DxR8Vw61zRxuUm17jzFnOX97j7QtNW7U)
* [Bengali FastText](https://drive.google.com/open?id=1CFA-SluRyz3s5gmGScsFUcs7AjLfscm2)
* [Bengali GloVe Wordvectors](https://github.com/sagorbrur/GloVe-Bengali)

### Training Details
* All three models were trained on the **Bengali Wikipedia Dump Dataset**
- [Bengali Wiki Dump](https://dumps.wikimedia.org/bnwiki/latest/)
* SentencePiece training vocab size = 50000
* FastText was trained with total words = 20M, vocab size = 1171011, epoch = 50 and embedding dimension = 300; the training loss was 0.318668
* Word2Vec word embedding dimension = 300
* For details on the Bengali GloVe word vectors and their training process, see [this](https://github.com/sagorbrur/GloVe-Bengali) repository


## Tokenization

* **Bengali SentencePiece Tokenization**

- Tokenization using a trained model
```py
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"  # path to the trained SentencePiece model
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)

```
- Training SentencePiece
```py
from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer(is_train=True)
data = "test.txt"  # training corpus
model_prefix = "test"  # prefix for the output model files
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)

```
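
SentencePiece supports several subword models, including unigram and byte-pair encoding (BPE); this document does not say which one `train_bsp` configures. To give a feel for how a subword vocabulary is learned from a corpus, here is a toy BPE merge loop in plain Python. It is a teaching sketch, not SentencePiece's implementation, and `learn_bpe` and `num_merges` are hypothetical names.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with each occurrence of the best pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["abab", "ab"], num_merges=5)
print(merges)
```

Each merge adds one symbol to the vocabulary, which is why a larger vocabulary budget lets frequent words survive as single tokens.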

* **Basic Tokenizer**

```py
from bnlp.basic_tokenizer import BasicTokenizer
basic_t = BasicTokenizer(False)
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_t.tokenize(raw_text)
print(tokens)

# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]

```
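
The output above separates the danda (`।`) from the preceding word. A minimal standard-library sketch of that behaviour, assuming a small fixed punctuation set, could look like the following; `simple_tokenize` is a hypothetical name and this is not the `BasicTokenizer` implementation.

```python
import re

# A token is either a run of characters that are neither whitespace nor listed
# punctuation, or a single punctuation mark. The punctuation set is an assumption.
_TOKEN = re.compile(r"[^\s।?!,;:]+|[।?!,;:]")


def simple_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return _TOKEN.findall(text)


raw_text = "আমি বাংলায় গান গাই।"
tokens = simple_tokenize(raw_text)
print(tokens)
```

The pattern deliberately avoids `\w`: Python's `re` does not count Bengali vowel signs and other combining marks as word characters, so `\w+` would break words apart.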

* **NLTK Tokenization**

```py
from bnlp.nltk_tokenizer import NLTK_Tokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)

# output
# word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
# sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]

```


## Word Embedding

* **Bengali Word2Vec**

- Generate Vector Using Pretrained Model

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/bengali_word2vec.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

```

- Find Most Similar Word Using Pretrained Model

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/bengali_word2vec.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)

```
- Train Bengali Word2Vec with your own data

```py
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec(is_train=True)
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)

```

* **Bengali FastText**

- Generate Vector Using Pretrained Model

```py
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "model/bengali_fasttext.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

```
- Train Bengali FastText Model

```py
from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
epoch = 50
# The epoch argument is not yet available in the PyPI release.
bft.train_fasttext(data, model_name, epoch)
# With the current PyPI release, call: bft.train_fasttext(data, model_name)

```

* **Bengali GloVe Word Vectors**

We trained a GloVe model on Bengali data (wiki + news articles) and published the Bengali GloVe word vectors.<br>
You can download them and use them for your own machine learning tasks.

```py
from bnlp.glove_wordvector import BN_Glove
glove_path = "bn_glove.39M.100d.txt"
word = "গ্রাম"
bng = BN_Glove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)

```
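
GloVe vector files in text form store one word per line, followed by its vector components separated by spaces. The sketch below parses that layout from an in-memory sample so it runs without the real download; `load_glove` is a hypothetical helper with made-up values, not the code behind `BN_Glove`.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines into a dict mapping each word to a list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Two made-up 3-dimensional entries standing in for a real vector file.
sample = io.StringIO("village 0.1 0.2 0.3\ncity 0.4 0.5 0.6\n\n")
vectors = load_glove(sample)
print(vectors["village"])
```

For a real file, the same function can be fed an open file handle, for example `open(glove_path, encoding="utf-8")`.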

## Issue
* If you get `ModuleNotFoundError: No module named 'fasttext'`, run the following command:

```pip install fasttext```
* If an `nltk` error occurs, run the following before importing `bnlp`:

```py
import nltk
nltk.download("punkt")
```
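
Both issues above come down to a dependency that is not installed. A small standard-library check, sketched below, reports which modules are missing before `bnlp` is imported; `missing_modules` is a hypothetical helper and the module list is an assumption based on the two errors mentioned here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["fasttext", "nltk"])
if missing:
    print("Install before using bnlp:", ", ".join(missing))
```

This only confirms that a module can be found; the `punkt` data still has to be fetched with `nltk.download("punkt")` as shown above.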
