Using Natural Language Transformers for Classification #7
Hmm, so I don't know what you mean by "treatment sequence." Usually, I've seen these transformer models trained as big unsupervised predictors of the next character.
The idea would be to model it after something like the SQuAD/SWAG datasets for question answering, where you typically have a large body of text as the initial context (the virus sequence), followed by the answer and the positions of the spans for that answer, if found in the text (the vaccine/cure sequence).

Example of a BioBERT dataset formatted for SQuAD:
Additional dataset from BioASQ:

I also compiled additional sequence data which may or may not overlap with the download script you had: https://drive.google.com/drive/folders/18aAuP3OhGMLKV8jZpt_8vpLY5JSqOS9E?usp=sharing

There are 3 sets: Coronaviruses, Influenzaviruses, and SARS-related. The jsonl files contain the raw data, compiled by filtering for complete sequences and virus families; the accession codes were then used to download the sequences themselves, which are the json files, so they should match the format of your allseq.json file.
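To make the record shape concrete, here's a rough Python sketch of what one SQuAD-style entry could look like under that framing. All names and values here are illustrative, not taken from an actual dataset:

```python
# Hypothetical SQuAD-style record for sequence QA; field values are made up.
# The "context" would be the virus sequence, and the "answers" the treatment
# sequence span(s) located inside it, mirroring the BioBERT/BioASQ format.
record = {
    "context": "ATGGCGTACGTTAGC",  # full virus sequence (truncated here)
    "qas": [
        {
            "id": "cov-0001",
            "question": "Which subsequence corresponds to the candidate treatment?",
            "answers": [
                {"text": "GGCGT", "answer_start": 2}  # span within the context
            ],
        }
    ],
}
```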
@trisongz I downloaded the files and put something together. Let me know if it's similar to what you're suggesting. By the way, I am familiar with the transformers library, and I don't think you can use the pre-trained language models' vocabularies for these types of sequences. Anyway, here's the Colab link of what I put together - let me know if it's related!
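For anyone curious why the pre-trained vocabularies don't carry over, here's a quick illustration using the standard bert-base-uncased tokenizer (the exact output pieces may vary): a raw genome string just gets shredded into English subword fragments that carry no biological meaning.

```python
from transformers import BertTokenizer

# Tokenizing a raw genome fragment with a pre-trained English vocabulary:
# the output is arbitrary WordPiece fragments, not meaningful sequence units.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("ATGGCGTACGTTAGC"))
# e.g. something like ['at', '##gg', '##cg', '##tac', '##gt', '##tag', '##c']
```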
@amoux That's pretty awesome! I hadn't thought of using a node graph, mainly because I don't work with them as often as I'd like to. I've been messing around with different methods, and out of the box, transformers won't necessarily work. You pointed out the first issue, which is creating the vocabulary. There wasn't a single number that every sequence length was divisible by, so instead I processed each sequence to find the lowest prime factor of its length and split the sequence into chunks of that size.
Afterwards, I compiled all the split sequence chunks into a single list and deduplicated it, leaving a list of unique sequence chunks.
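A minimal sketch of that chunking approach as I understand it (function and variable names are mine, not from the notebook):

```python
def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (assumes n >= 2)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n  # n itself is prime

def chunk_sequence(seq: str) -> list:
    """Split seq into equal chunks sized by its length's smallest prime factor."""
    size = smallest_prime_factor(len(seq))
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Build a deduplicated vocabulary of unique chunks across all sequences.
sequences = ["ATGGCGTACG", "GGCGTTAGCATGCA"]  # placeholder data
vocab = sorted({chunk for seq in sequences for chunk in chunk_sequence(seq)})
```

Note that starting the factor search at 2 avoids the "1 as a prime" bug mentioned below, which would otherwise split everything into single characters.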
That still leaves a massive vocab for most models, so I tried using XLNet (the values here are a bit off - I realized I had treated 1 as a prime, as seen above, which led to a much smaller vocabulary size).
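If the custom-vocabulary route is pursued, one way to avoid a size mismatch (a sketch, assuming the `vocab` list from the chunking step above) is to initialize XLNet from scratch rather than loading pre-trained weights, since the pre-trained SentencePiece vocabulary doesn't cover sequence chunks:

```python
from transformers import XLNetConfig, XLNetLMHeadModel

# Sketch: build XLNet from a fresh config sized to the custom chunk vocabulary.
config = XLNetConfig(vocab_size=len(vocab))
model = XLNetLMHeadModel(config)
print(model.config.vocab_size)
```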
This is where I'm currently at. My first goal is to attempt sequence classification/entailment, but I'm stuck on how to pre-process the data into the correct format for that task; one possible shape is sketched below. Also, I realized that the flu dataset is a lot smaller than it should be, so I'll re-upload the updated version to the folder soon.
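For the entailment-style setup, one plausible pre-processing format (purely my assumption, not an established recipe; it reuses `chunk_sequence` and `vocab` from the sketch above) is to encode each virus/treatment pair as a two-segment input with a binary label:

```python
# Sketch of entailment-style pairs: label 1 if the treatment is known to
# target the virus, 0 for a random negative. All data here is placeholder.
examples = [
    {"seq_a": "ATGGCGTACG", "seq_b": "GGCGTA", "label": 1},  # matched pair
    {"seq_a": "ATGGCGTACG", "seq_b": "TTACAG", "label": 0},  # random negative
]

def encode_pair(ex, chunk_fn, token_to_id, unk_id=0, sep_id=1, cls_id=2):
    """Map a sequence pair to ids: [CLS] chunks_a [SEP] chunks_b [SEP]."""
    ids_a = [token_to_id.get(c, unk_id) for c in chunk_fn(ex["seq_a"])]
    ids_b = [token_to_id.get(c, unk_id) for c in chunk_fn(ex["seq_b"])]
    return [cls_id] + ids_a + [sep_id] + ids_b + [sep_id], ex["label"]

token_to_id = {tok: i + 3 for i, tok in enumerate(vocab)}  # ids 0-2 reserved
input_ids, label = encode_pair(examples[0], chunk_sequence, token_to_id)
```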
Glad I stumbled upon this project - I was working on a theory using the same base dataset.

Since proteins/genes are essentially sequences of letters, that led me to the idea of using Transformer models like BERT to classify sequences by their structure. If that theory held, I'd want to try a multi-task approach: pairing the valid treatment sequence with the virus sequence and seeing whether the model can predict the treatment sequence given the input virus sequence.

I haven't studied the structure as much as you guys probably have, so I'd defer to you on whether this would be plausible/feasible given what we know so far.
Here are a few other starting points I've looked at:
ReSimNet: Drug Response Similarity Prediction using Siamese Neural Networks
Jeon and Park et al., 2018
https://github.com/dmis-lab/ReSimNet
BERN is a BioBERT-based multi-type NER tool that also supports normalization of extracted entities.
https://github.com/dmis-lab/bern