Batch support for TreeLSTM #33

Open · wants to merge 2 commits into master
65 changes: 4 additions & 61 deletions README.md
@@ -1,66 +1,9 @@

# Tree-Structured Long Short-Term Memory Networks
This is a [PyTorch](http://pytorch.org/) implementation of Tree-LSTM as described in the paper [Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks](http://arxiv.org/abs/1503.00075) by Kai Sheng Tai, Richard Socher, and Christopher Manning. On the semantic similarity task using the SICK dataset, this implementation reaches:
- Pearson's coefficient: `0.8492` and MSE: `0.2842` using hyperparameters `--lr 0.010 --wd 0.0001 --optim adagrad --batchsize 25`
- Pearson's coefficient: `0.8674` and MSE: `0.2536` using hyperparameters `--lr 0.025 --wd 0.0001 --optim adagrad --batchsize 25 --freeze_embed`
- Pearson's coefficient: `0.8676` and MSE: `0.2532` are the numbers reported in the original paper.
- Known differences include the way the gradients are accumulated (normalized by batchsize or not).

### Requirements
- Python (tested on **3.6.5**, should work on **>=2.7**)
- Java >= 8 (for Stanford CoreNLP utilities)
- Other dependencies are in `requirements.txt`
Note: Currently works with PyTorch 0.4.0. Switch to the `pytorch-v0.3.1` branch if you want to use PyTorch 0.3.1.

### Usage
Before delving into how to run the code, here is a quick overview of the contents:
- Use the script `fetch_and_preprocess.sh` to download the [SICK dataset](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools), the [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml) and [Stanford POS Tagger](http://nlp.stanford.edu/software/tagger.shtml), and the [Glove word vectors](http://nlp.stanford.edu/projects/glove/) (Common Crawl 840B -- **Warning:** this is a 2GB download!), and to preprocess the data, i.e. generate dependency parses using the [Stanford Neural Network Dependency Parser](http://nlp.stanford.edu/software/nndep.shtml).
- `main.py` does the actual heavy lifting of training the model and testing it on the SICK dataset. For a list of all command-line arguments, have a look at `config.py`.
- The first run caches GLOVE embeddings for words in the SICK vocabulary. Later runs read the embeddings from this cache.
- Logs and model checkpoints are saved to the `checkpoints/` directory with the name specified by the command line argument `--expname`.
The [original implementation](https://github.com/dasguptar/treelstm.pytorch) of the paper [Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks](http://arxiv.org/abs/1503.00075) does not support batched computation of the TreeLSTM.

Next, these are the different ways to run the code here to train a TreeLSTM model.
#### Local Python Environment
If you have a working Python3 environment, simply run the following sequence of steps:
```
- bash fetch_and_preprocess.sh
- pip install -r requirements.txt
- python main.py
```
#### Pure Docker Environment
If you want to use a Docker container, simply follow these steps:
```
- docker build -t treelstm .
- docker run -it treelstm bash
- bash fetch_and_preprocess.sh
- python main.py
```
#### Local Filesystem + Docker Environment
If you want to use a Docker container, but want to persist data and checkpoints in your local filesystem, simply follow these steps:
To run the model with batch TreeLSTM:
```
- bash fetch_and_preprocess.sh
- docker build -t treelstm .
- docker run -it --mount type=bind,source="$(pwd)",target="/root/treelstm.pytorch" treelstm bash
- python main.py
- python main.py --use_batch --batchsize 25
```
**NOTE**: Setting the environment variable OMP_NUM_THREADS=1 usually gives a speedup on the CPU. Use it like `OMP_NUM_THREADS=1 python main.py`. To run on a GPU, set CUDA_VISIBLE_DEVICES instead. Usually, CUDA does not give much of a speedup here, since we are operating at a batchsize of `1`.

### Notes
- (**Apr 02, 2018**) Added Dockerfile
- (**Apr 02, 2018**) Now works on **PyTorch 0.3.1** and **Python 3.6**, removed dependency on **Python 2.7**
- (**Nov 28, 2017**) Added **frozen embeddings**, closed gap to paper.
- (**Nov 08, 2017**) Refactored model to get **1.5x - 2x speedup**.
- (**Oct 23, 2017**) Now works with **PyTorch 0.2.0**.
- (**May 04, 2017**) Added support for **sparse tensors**. Using the `--sparse` argument will enable sparse gradient updates for `nn.Embedding`, potentially reducing memory usage.
- There are a couple of caveats, however, viz. weight decay will not work in conjunction with sparsity, and results from the original paper might not be reproduced using sparse embeddings.

### Acknowledgements
Shout-out to [Kai Sheng Tai](https://github.com/kaishengtai/) for the [original LuaTorch implementation](https://github.com/stanfordnlp/treelstm), and to the [PyTorch team](https://github.com/pytorch/pytorch#the-team) for the fun library.

### Contact
[Riddhiman Dasgupta](https://researchweb.iiit.ac.in/~riddhiman.dasgupta/)

*This is my first PyTorch based implementation, and might contain bugs. Please let me know if you find any!*

### License
MIT
which should give you exactly the same results as without batching, but with much faster training and inference.
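
The claim that batching gives "exactly the same results" rests on gradient accumulation: summing the per-example losses and calling `backward()` once produces the same parameter gradients as backpropagating each example separately. A minimal, self-contained sketch of that equivalence (toy linear model and MSE loss, not the repo's TreeLSTM):

```python
# Toy check (hypothetical model/loss, not this repo's TreeLSTM): accumulating
# per-example gradients one backward() at a time matches a single backward()
# over the summed loss, which is why batching can reproduce unbatched results.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
data, target = torch.randn(8, 4), torch.randn(8, 2)
loss_fn = nn.MSELoss(reduction='none')

model.zero_grad()
for x, y in zip(data, target):          # unbatched: one backward per example
    loss_fn(model(x), y).sum().backward()
grads_unbatched = [p.grad.clone() for p in model.parameters()]

model.zero_grad()
loss_fn(model(data), target).sum().backward()   # batched: one backward over the sum
grads_batched = [p.grad for p in model.parameters()]

print(all(torch.allclose(a, b, atol=1e-6)
          for a, b in zip(grads_unbatched, grads_batched)))   # True
```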
1 change: 1 addition & 0 deletions config.py
@@ -44,6 +44,7 @@ def parse_args():
    cuda_parser = parser.add_mutually_exclusive_group(required=False)
    cuda_parser.add_argument('--cuda', dest='cuda', action='store_true')
    cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false')
    cuda_parser.add_argument('--use_batch', dest='use_batch', action='store_true')
    parser.set_defaults(cuda=True)

    args = parser.parse_args()
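
For context, the new flag is added to the mutually exclusive group that already holds `--cuda`/`--no-cuda`. A standalone toy reproduction of the fragment above (not the repo's `config.py` itself) shows what the flag parses to, and that the grouping prevents combining it with those flags:

```python
# Toy reproduction of the parser fragment above (standalone sketch).
import argparse

parser = argparse.ArgumentParser()
cuda_parser = parser.add_mutually_exclusive_group(required=False)
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true')
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false')
cuda_parser.add_argument('--use_batch', dest='use_batch', action='store_true')
parser.set_defaults(cuda=True)

print(parser.parse_args([]))                 # Namespace(cuda=True, use_batch=False)
print(parser.parse_args(['--use_batch']))    # Namespace(cuda=True, use_batch=True)
# parser.parse_args(['--no-cuda', '--use_batch']) exits with
# "argument --use_batch: not allowed with argument --no-cuda"
```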
35 changes: 24 additions & 11 deletions main.py
@@ -25,8 +25,21 @@
from treelstm import Trainer
# CONFIG PARSER
from config import parse_args
from main_test import get_avg_grad


def set_optimizer(model, lr, wd):
    if args.optim == 'adam':
        optimizer = optim.Adam(filter(lambda p: p.requires_grad,
                                      model.parameters()), lr=lr, weight_decay=wd)
    elif args.optim == 'adagrad':
        optimizer = optim.Adagrad(filter(lambda p: p.requires_grad,
                                         model.parameters()), lr=lr, weight_decay=wd)
    elif args.optim == 'sgd':
        optimizer = optim.SGD(filter(lambda p: p.requires_grad,
                                     model.parameters()), lr=lr, weight_decay=wd)
    return optimizer

# MAIN BLOCK
def main():
    global args
@@ -111,7 +124,7 @@ def main():
        args.num_classes,
        args.sparse,
        args.freeze_embed)
    criterion = nn.KLDivLoss()
    criterion = nn.KLDivLoss(reduction='none')

    # for words common to dataset vocab and GLOVE, use GLOVE vectors
    # for other words in dataset vocab, use random normal vectors
@@ -137,21 +150,14 @@ def main():
    model.emb.weight.data.copy_(emb)

    model.to(device), criterion.to(device)
    if args.optim == 'adam':
        optimizer = optim.Adam(filter(lambda p: p.requires_grad,
                                      model.parameters()), lr=args.lr, weight_decay=args.wd)
    elif args.optim == 'adagrad':
        optimizer = optim.Adagrad(filter(lambda p: p.requires_grad,
                                         model.parameters()), lr=args.lr, weight_decay=args.wd)
    elif args.optim == 'sgd':
        optimizer = optim.SGD(filter(lambda p: p.requires_grad,
                                     model.parameters()), lr=args.lr, weight_decay=args.wd)
    optimizer = set_optimizer(model, args.lr, args.wd)
    metrics = Metrics(args.num_classes)

    # create trainer object for training and testing
    trainer = Trainer(args, model, criterion, optimizer, device)

    best = -float('inf')
    best, last_dev_loss = -float('inf'), float('inf')
    curr_lr = args.lr
    for epoch in range(args.epochs):
        train_loss = trainer.train(train_dataset)
        train_loss, train_pred = trainer.test(train_dataset)
@@ -171,6 +177,13 @@ def main():
        logger.info('==> Epoch {}, Test \tLoss: {}\tPearson: {}\tMSE: {}'.format(
            epoch, test_loss, test_pearson, test_mse))

        if dev_loss > last_dev_loss:
            curr_lr = curr_lr / 5
            trainer.optimizer = set_optimizer(model, curr_lr, args.wd)
            print('reset lr to {}'.format(curr_lr))

        last_dev_loss = dev_loss

        if best < test_pearson:
            best = test_pearson
            checkpoint = {
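
The switch to `reduction='none'` is what makes batched scoring possible: `nn.KLDivLoss` then returns element-wise losses that can be summed (or averaged) per example instead of a single pre-aggregated scalar. A minimal sketch with made-up tensors, not SICK data:

```python
# Minimal sketch of reduction='none' (made-up tensors, not repo data).
import torch
import torch.nn as nn

criterion = nn.KLDivLoss(reduction='none')                # no aggregation
log_probs = torch.log_softmax(torch.randn(4, 5), dim=1)   # 4 examples, 5 classes
targets = torch.softmax(torch.randn(4, 5), dim=1)

elementwise = criterion(log_probs, targets)   # shape (4, 5)
per_example = elementwise.sum(dim=1)          # one KL divergence per example
batch_loss = per_example.sum()                # what a batched step backpropagates
print(elementwise.shape, per_example.shape, batch_loss.item())
```

The new `dev_loss > last_dev_loss` check, which divides the learning rate by 5 whenever the dev loss rises, plays a role similar to `torch.optim.lr_scheduler.ReduceLROnPlateau`, except that it rebuilds the optimizer from scratch (and so also resets any optimizer state, e.g. Adagrad accumulators) rather than adjusting the existing one.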
228 changes: 228 additions & 0 deletions main_test.py
@@ -0,0 +1,228 @@
from __future__ import division
from __future__ import print_function

import os
import random
import logging

import torch
import torch.nn as nn
import torch.optim as optim

# IMPORT CONSTANTS
from treelstm import Constants
# NEURAL NETWORK MODULES/LAYERS
from treelstm import SimilarityTreeLSTM
# DATA HANDLING CLASSES
from treelstm import Vocab
# DATASET CLASS FOR SICK DATASET
from treelstm import SICKDataset
# METRICS CLASS FOR EVALUATION
from treelstm import Metrics
# UTILITY FUNCTIONS
from treelstm import utils
# TRAIN AND TEST HELPER FUNCTIONS
from treelstm import Trainer
# CONFIG PARSER
from config import parse_args


def set_optimizer(model, lr, wd):
    if args.optim == 'adam':
        optimizer = optim.Adam(filter(lambda p: p.requires_grad,
                                      model.parameters()), lr=lr, weight_decay=wd)
    elif args.optim == 'adagrad':
        optimizer = optim.Adagrad(filter(lambda p: p.requires_grad,
                                         model.parameters()), lr=lr, weight_decay=wd)
    elif args.optim == 'sgd':
        optimizer = optim.SGD(filter(lambda p: p.requires_grad,
                                     model.parameters()), lr=lr, weight_decay=wd)
    return optimizer


def get_avg_grad(named_parameters):
    layers, avg_data, avg_grads = [], [], []
    for name, param in named_parameters:
        if (param.requires_grad) and ("bias" not in name):
            layers.append(name)
            avg_data.append(param.data.abs().mean())
            if param.grad is not None:
                avg_grads.append(param.grad.abs().mean())
    return layers, avg_data, avg_grads

# MAIN BLOCK
def main():
    global args
    args = parse_args()
    # global logger
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("[%(asctime)s] %(levelname)s:%(name)s:%(message)s")
    # file logger
    fh = logging.FileHandler(os.path.join(args.save, args.expname)+'.log', mode='w')
    fh.setLevel(logging.INFO)
    fh.setFormatter(formatter)
    logger.addHandler(fh)
    # console logger
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(formatter)
    logger.addHandler(ch)
    # argument validation
    args.cuda = args.cuda and torch.cuda.is_available()
    device = torch.device("cuda:0" if args.cuda else "cpu")
    if args.sparse and args.wd != 0:
        logger.error('Sparsity and weight decay are incompatible, pick one!')
        exit()
    logger.debug(args)
    torch.manual_seed(args.seed)
    random.seed(args.seed)
    if args.cuda:
        torch.cuda.manual_seed(args.seed)
        torch.backends.cudnn.benchmark = True
    if not os.path.exists(args.save):
        os.makedirs(args.save)

    train_dir = os.path.join(args.data, 'train/')
    dev_dir = os.path.join(args.data, 'dev/')
    test_dir = os.path.join(args.data, 'test/')

    # write unique words from all token files
    sick_vocab_file = os.path.join(args.data, 'sick.vocab')
    if not os.path.isfile(sick_vocab_file):
        token_files_b = [os.path.join(split, 'b.toks') for split in [train_dir, dev_dir, test_dir]]
        token_files_a = [os.path.join(split, 'a.toks') for split in [train_dir, dev_dir, test_dir]]
        token_files = token_files_a + token_files_b
        sick_vocab_file = os.path.join(args.data, 'sick.vocab')
        utils.build_vocab(token_files, sick_vocab_file)

    # get vocab object from vocab file previously written
    vocab = Vocab(filename=sick_vocab_file,
                  data=[Constants.PAD_WORD, Constants.UNK_WORD,
                        Constants.BOS_WORD, Constants.EOS_WORD])
    logger.debug('==> SICK vocabulary size : %d ' % vocab.size())

    # load SICK dataset splits
    train_file = os.path.join(args.data, 'sick_train.pth')
    if os.path.isfile(train_file):
        train_dataset = torch.load(train_file)
    else:
        train_dataset = SICKDataset(train_dir, vocab, args.num_classes)
        torch.save(train_dataset, train_file)
    logger.debug('==> Size of train data : %d ' % len(train_dataset))
    dev_file = os.path.join(args.data, 'sick_dev.pth')
    if os.path.isfile(dev_file):
        dev_dataset = torch.load(dev_file)
    else:
        dev_dataset = SICKDataset(dev_dir, vocab, args.num_classes)
        torch.save(dev_dataset, dev_file)
    logger.debug('==> Size of dev data : %d ' % len(dev_dataset))
    test_file = os.path.join(args.data, 'sick_test.pth')
    if os.path.isfile(test_file):
        test_dataset = torch.load(test_file)
    else:
        test_dataset = SICKDataset(test_dir, vocab, args.num_classes)
        torch.save(test_dataset, test_file)
    logger.debug('==> Size of test data : %d ' % len(test_dataset))

    # initialize model, criterion/loss_function, optimizer
    model = SimilarityTreeLSTM(
        vocab.size(),
        args.input_dim,
        args.mem_dim,
        args.hidden_dim,
        args.num_classes,
        args.sparse,
        args.freeze_embed)
    criterion = nn.KLDivLoss(reduce=False)

    # for words common to dataset vocab and GLOVE, use GLOVE vectors
    # for other words in dataset vocab, use random normal vectors
    emb_file = os.path.join(args.data, 'sick_embed.pth')
    if os.path.isfile(emb_file):
        emb = torch.load(emb_file)
    else:
        # load glove embeddings and vocab
        glove_vocab, glove_emb = utils.load_word_vectors(
            os.path.join(args.glove, 'glove.840B.300d'))
        logger.debug('==> GLOVE vocabulary size: %d ' % glove_vocab.size())
        emb = torch.zeros(vocab.size(), glove_emb.size(1), dtype=torch.float, device=device)
        emb.normal_(0, 0.05)
        # zero out the embeddings for padding and other special words if they are absent in vocab
        for idx, item in enumerate([Constants.PAD_WORD, Constants.UNK_WORD,
                                    Constants.BOS_WORD, Constants.EOS_WORD]):
            emb[idx].zero_()
        for word in vocab.labelToIdx.keys():
            if glove_vocab.getIndex(word):
                emb[vocab.getIndex(word)] = glove_emb[glove_vocab.getIndex(word)]
        torch.save(emb, emb_file)
    # plug these into embedding matrix inside model
    model.emb.weight.data.copy_(emb)

    model.to(device), criterion.to(device)
    optimizer = set_optimizer(model, args.lr, args.wd)
    metrics = Metrics(args.num_classes)

    # create trainer object for training and testing
    trainer = Trainer(args, model, criterion, optimizer, device)

    init_layers, init_avg_data, init_avg_grad = get_avg_grad(model.named_parameters())
    best, last_dev_loss = -float('inf'), float('inf')
    dataset = train_dataset

    for epoch in range(args.epochs):
        model.train()
        optimizer.zero_grad()
        total_loss = 0.0
        outputs_nobatch, losses_nobatch = [], []
        lstates_nobatch, rstates_nobatch = [], []
        for idx in range(args.batchsize):
            ltree, linput, rtree, rinput, label = dataset[idx]
            lroot, ltree = ltree[0], ltree[1]
            rroot, rtree = rtree[0], rtree[1]
            target = utils.map_label_to_target(label, dataset.num_classes)
            linput, rinput = linput.to(device), rinput.to(device)
            target = target.to(device)
            linputs = model.emb(linput)
            rinputs = model.emb(rinput)
            lstate, lhidden = model.childsumtreelstm(lroot, linputs)
            rstate, rhidden = model.childsumtreelstm(rroot, rinputs)
            output = model.similarity(lstate, rstate)
            #output = model(lroot, linput, rroot, rinput)
            outputs_nobatch.append(output)
            lstates_nobatch.append(lstate)
            rstates_nobatch.append(rstate)
            loss = criterion(output, target)
            losses_nobatch.append(loss)
            total_loss += loss.sum()
            loss.sum().backward()
        print(total_loss / args.batchsize)
        layers1, avg_data1, avg_grad1 = get_avg_grad(model.named_parameters())

        model.train()
        optimizer.zero_grad()
        total_loss = 0.0
        ltrees, linputs, rtrees, rinputs, labels = dataset.get_next_batch(0, args.batchsize)
        targets = []
        for i in range(len(linputs)):
            linputs[i] = linputs[i].to(device)
            rinputs[i] = rinputs[i].to(device)
            target = utils.map_label_to_target(labels[i], dataset.num_classes)
            targets.append(target.to(device))
        targets = torch.cat(targets, dim=0)
        linputs_tensor, rinputs_tensor = [], []
        for i in range(len(linputs)):
            linputs_tensor.append(model.emb(linputs[i]))
            rinputs_tensor.append(model.emb(rinputs[i]))
        lstates, lhidden = model.childsumtreelstm(ltrees, linputs_tensor)
        rstates, rhidden = model.childsumtreelstm(rtrees, rinputs_tensor)
        outputs = model.similarity(lstates, rstates)
        losses = criterion(outputs, targets)
        total_loss += losses.sum()
        losses.sum().backward()
        layers2, avg_data2, avg_grad2 = get_avg_grad(model.named_parameters())
        import pdb; pdb.set_trace()


if __name__ == "__main__":
    main()
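
At the `pdb.set_trace()` breakpoint, the two gradient snapshots (`avg_grad1` from the per-example loop, `avg_grad2` from the batched pass) are meant to be inspected by hand. A hypothetical helper, not part of this PR, could automate that comparison:

```python
# Hypothetical helper (not in this PR): compare the per-layer average gradients
# returned by get_avg_grad() for the unbatched and batched passes.
import torch

def grads_match(grads_a, grads_b, atol=1e-6):
    if len(grads_a) != len(grads_b):
        return False
    return all(torch.allclose(a, b, atol=atol) for a, b in zip(grads_a, grads_b))

# e.g. at the breakpoint above:
#   grads_match(avg_grad1, avg_grad2)   # True if the batched TreeLSTM matches the reference
```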