
Commit c91a517

Markdown cleanup

1 parent a579145

File tree: 4 files changed, +250 -7 lines changed


README.md

Lines changed: 82 additions & 7 deletions
@@ -1,9 +1,11 @@
 # korean-text-sentiment-analysis
+
 This repository is part of [mercy-project](https://github.com/mercy-project) and covers components used for deep learning on Korean text.
 
 ## Introduction
-* The goal of this project is to wrap huggingface's transformers repository so that it is more convenient to use.
-* It also provides a metric, built on a pretrained language model (from huggingface), for easily finding sentences with similar meanings.
+
+- The goal of this project is to wrap huggingface's transformers repository so that it is more convenient to use.
+- It also provides a metric, built on a pretrained language model (from huggingface), for easily finding sentences with similar meanings.
 
 ## Dependency
 

@@ -21,12 +23,14 @@ git clone https://github.com/mercy-project/korean-text-sentiment-analysis
 
 cd korean-text-sentiment-analysis
 
-pip install .
+sudo python3 -m pip install .
+
+pip3 install -r requirements.txt
 ```
 
 ## Quick start
 
-* how to get similar text with latent vector
+- how to get similar text with latent vector
 
 ```python
 from mercy_transformer import models
@@ -67,7 +71,26 @@ distance = metric.cosine(latent_list, [latent])
 print(distance)
 ```
 
-* classfication
+To Run
+
+```
+python3 smilar.py
+```
+
+Result
+
+```
+[[4.17170231]
+ [6.63776399]
+ [4.8249083 ]
+ [5.71683576]]
+[[0.07849896]
+ [0.20911893]
+ [0.11204922]
+ [0.15967475]]
+```
+
+- classification
 
 ```python
 from mercy_transformer import models
@@ -136,7 +159,28 @@ for epoch in range(10):
 print(epoch, step, loss.item(), acc)
 ```
 
-* paired question
+To Run
+
+```
+python3 classfication.py
+```
+
+Result
+
+```
+0 0 0.737722635269165 0.3333333333333333
+1 0 0.6059796810150146 0.6666666666666666
+2 0 0.4729032516479492 0.7777777777777778
+3 0 0.3866463899612427 0.8888888888888888
+4 0 0.24941475689411163 1.0
+5 0 0.1359175443649292 1.0
+6 0 0.06440091878175735 1.0
+7 0 0.027132326737046242 1.0
+8 0 0.008938517421483994 1.0
+9 0 0.0025468349922448397 1.0
+```
+
+- paired question
 
 ```python
 from mercy_transformer import models
@@ -209,8 +253,39 @@ for epoch in range(20):
 print(epoch, step, loss.item(), acc)
 ```
 
+To Run
+
+```
+python3 paired_question.py
+```
+
+Result
+
+```
+0 0 0.7150420546531677 0.5
+1 0 0.8219130039215088 0.5
+2 0 0.8126139640808105 0.5
+3 0 0.6669130325317383 0.6666666666666666
+4 0 0.6324862837791443 0.6666666666666666
+5 0 0.5813004970550537 1.0
+6 0 0.45990419387817383 1.0
+7 0 0.28030848503112793 1.0
+8 0 0.1331692934036255 1.0
+9 0 0.06911587715148926 1.0
+10 0 0.030566172674298286 1.0
+11 0 0.015069677494466305 1.0
+12 0 0.008203997276723385 1.0
+13 0 0.004712474066764116 1.0
+14 0 0.0029395369347184896 1.0
+15 0 0.0020999612752348185 1.0
+16 0 0.001615511137060821 1.0
+17 0 0.0012224833481013775 1.0
+18 0 0.0008657379657961428 1.0
+19 0 0.0006126383086666465 1.0
+```
+
 ## Todo List
 
 - [ ] GPU Assign
 - [x] Classification
-- [x] Paired Question
+- [x] Paired Question
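The three scripts below are the files added by this commit. In the training Result blocks above, each line prints epoch, step, loss.item(), and per-batch accuracy, matching the print calls in classfication.py and paired_question.py.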

classfication.py

Lines changed: 64 additions & 0 deletions
from mercy_transformer import models
from mercy_transformer import metric
from mercy_transformer import datasets

import torch
import torch.nn as nn


# Binary sentiment classifier: a linear head over the first-token (CLS-style) latent.
class Classifier(nn.Module):

    def __init__(self, bert, num_class):
        super(Classifier, self).__init__()

        self.bert = bert
        self.classifier = nn.Linear(768, num_class)

    def forward(self, ids):
        latent = self.bert(ids)
        latent = latent[:, 0]  # first-token representation
        logits = self.classifier(latent)
        return logits


bert = models.LanguageModel('distilbert')
model = Classifier(
    bert=bert,
    num_class=2)

# Toy training set: Korean movie-review sentences with binary sentiment labels.
classfication_datasets = datasets.ClassificationDataset(
    text=['아 더빙.. 진짜 짜증나네요 목소리',
          '흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나',
          '너무재밓었다그래서보는것을추천한다',
          '교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정',
          '사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다',
          '막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.',
          '원작의 긴장감을 제대로 살려내지못했다.',
          '별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네',
          '액션이 없는데도 재미 있는 몇안되는 영화'],
    labels=[0, 1, 0, 0, 1, 0, 0, 0, 1],
    bert=bert,
    max_len=30)

train_loader = torch.utils.data.DataLoader(
    dataset=classfication_datasets,
    batch_size=32,
    num_workers=1)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    params=model.parameters(),
    lr=1e-4)

for epoch in range(10):

    for step, (ids, labels) in enumerate(train_loader):

        optimizer.zero_grad()
        logits = model(ids)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        # Per-batch accuracy from the argmax prediction.
        pred = torch.argmax(logits, axis=1)
        acc = pred.eq(labels).sum().item() / ids.shape[0]

        print(epoch, step, loss.item(), acc)
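datasets.ClassificationDataset is used as a black box above. A rough sketch of what it presumably does, assuming it tokenizes each sentence with the wrapped model and pads or truncates to max_len; the class name, pad_id, and tokenize call below are illustrative assumptions, not the library's actual code:

```python
import torch
from torch.utils.data import Dataset

class ToyClassificationDataset(Dataset):
    # Hypothetical stand-in for datasets.ClassificationDataset.
    def __init__(self, text, labels, bert, max_len, pad_id=0):
        self.examples = []
        for t, y in zip(text, labels):
            ids = list(bert.tokenize(t))             # assumed to return token ids
            ids = ids[:max_len]                      # truncate long sentences
            ids += [pad_id] * (max_len - len(ids))   # pad short ones to max_len
            self.examples.append((torch.tensor(ids), y))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids, y = self.examples[idx]
        return ids, torch.tensor(y, dtype=torch.long)
```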

paired_question.py

Lines changed: 68 additions & 0 deletions
from mercy_transformer import models
from mercy_transformer import metric
from mercy_transformer import datasets

import torch
import torch.nn as nn


# Sentence-pair classifier: concatenates the two first-token latents (768 * 2)
# and maps them to two classes (similar / dissimilar).
class PairedQuestion(nn.Module):

    def __init__(self, bert):
        super(PairedQuestion, self).__init__()

        self.bert = bert
        self.classifier = nn.Linear(768 * 2, 2)

    def forward(self, ids1, ids2):
        latent1 = self.bert(ids1)[:, 0]
        latent2 = self.bert(ids2)[:, 0]
        concat = torch.cat([latent1, latent2], axis=1)
        logits = self.classifier(concat)
        return logits


bert = models.LanguageModel('distilbert')
model = PairedQuestion(
    bert=bert)

# Toy question pairs labeled similar ('sim') or dissimilar ('unsim').
paired_dataset = datasets.PairedQuestionDataset(
    question1=['골프 배워야 돼',
               '많이 늦은시간인데 연락해봐도 괜찮을까?',
               '물배달 시켜야겠다.',
               '배고파 죽을 것 같아',
               '심심해',
               '나 그 사람이 좋아'],
    question2=['골프치러 가야돼',
               '늦은 시간인데 연락해도 괜찮을까?',
               '물 주문해야지',
               '배 터질 것 같아',
               '방학동안 너무 즐거웠어',
               '너무 싫어'],
    labels=['sim', 'sim', 'sim', 'unsim', 'unsim', 'unsim'],
    bert=bert,
    max_len=40)

train_loader = torch.utils.data.DataLoader(
    dataset=paired_dataset,
    batch_size=32,
    num_workers=2)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    params=model.parameters(),
    lr=1e-4)

for epoch in range(20):

    for step, (ids1, ids2, labels) in enumerate(train_loader):

        optimizer.zero_grad()
        logits = model(ids1, ids2)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        pred = torch.argmax(logits, axis=1)
        acc = pred.eq(labels).sum().item() / ids1.shape[0]

        print(epoch, step, loss.item(), acc)
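One design note: the head sees the two pooled latents side by side (768 * 2 inputs to the linear layer) rather than their difference or product. Also, CrossEntropyLoss needs integer class targets, so PairedQuestionDataset presumably converts the 'sim'/'unsim' strings; a minimal sketch of that assumed mapping, with illustrative names and id choice:

```python
import torch

# Hypothetical label mapping assumed to happen inside PairedQuestionDataset.
LABEL_TO_ID = {'unsim': 0, 'sim': 1}

def encode_labels(labels):
    # CrossEntropyLoss expects a LongTensor of class indices, not strings.
    return torch.tensor([LABEL_TO_ID[l] for l in labels], dtype=torch.long)
```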

smilar.py

Lines changed: 36 additions & 0 deletions
from mercy_transformer import models
from mercy_transformer import metric

import torch
import numpy as np

# Pick one of the two supported backbones at random.
model_name = ['bert', 'distilbert']
model_name = model_name[np.random.choice(len(model_name))]
model = models.LanguageModel(model_name)

text = [
    '안녕하세요 당신은 누구십니까?',
    '전화번호좀 알려주세요',
    '담당자가 누구인가요?',
    '같이 춤추실래요']

# Encode each sentence and mean-pool its token latents into a single vector.
latent_list = []
for t in text:
    tokens = model.tokenize(t)
    latent = model.encode(tokens)[0][0]
    latent = torch.mean(latent, axis=0)
    latent_list.append(latent.detach().cpu().numpy())

latent_list = np.stack(latent_list)

# Compare every candidate sentence against a single reference query.
reference = '안녕 너는 누구야?'

token = model.tokenize(reference)
latent = model.encode(token)[0][0]
latent = torch.mean(latent, axis=0)
latent = latent.detach().cpu().numpy()

distance = metric.euclidean(latent_list, [latent])
print(distance)
distance = metric.cosine(latent_list, [latent])
print(distance)
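The two 4x1 matrices in the README's Result block come from these two print calls. Read as pairwise distance functions over row vectors, metric.euclidean and metric.cosine could look roughly like the NumPy sketch below; this is an assumption inferred from the printed shapes, not the actual mercy_transformer implementation:

```python
import numpy as np

def euclidean(a, b):
    # Pairwise Euclidean distance: (n, d) and (m, d) inputs give an (n, m) output.
    a, b = np.asarray(a), np.asarray(b)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cosine(a, b):
    # Pairwise cosine distance (1 - cosine similarity), same shapes as above.
    a, b = np.asarray(a), np.asarray(b)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T
```

Under both metrics the smallest value in the README output is the first row, i.e. '안녕하세요 당신은 누구십니까?' is closest to the reference '안녕 너는 누구야?'.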
