The commands used for creating our LMs are in Guide.txt. With some exceptions, this is how we ran them in most cases (a rough sketch of the commands follows the notes on the in-lattice-lists below):
- First line creates a vocabulary file and an n-gram distribution file
- Second line takes these files and turns them into an n-gram LM
- Third line is used for testing perplexity
- Fourth and fifth lines rescore the lattices and create a hypothesis file
For the last step in "Guide.txt" we need two files:
- An in-lattice-list "lattice.txt", which references every lattice that needs to be rescored. You need to change the paths in "lattice.txt" to match where these files are locally on your own machine
- An in-lattice-list "upLats.txt", which references the newly rescored lattices. We set the output path of the previous step to a folder called upLats and had "upLats.txt" reference every file in there. When we needed to redo this step, we had to rename the upLats folder to something new
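The exact lines are in Guide.txt; as a rough orientation only, the five steps above typically take the following shape with SRILM. The file names, n-gram order, discounting options and the HTK lattice flags here are placeholders and assumptions, not necessarily the settings we actually used.

```sh
# 1. vocabulary file + n-gram count file from the training text
ngram-count -text pp.train.txt -order 3 -write-vocab vocab.txt -write counts.txt

# 2. turn the counts into an n-gram LM
ngram-count -read counts.txt -order 3 -vocab vocab.txt -lm model.lm -kndiscount -interpolate

# 3. perplexity on a held-out test set
ngram -lm model.lm -ppl test.txt

# 4. rescore the lattices listed in lattice.txt, writing the updates to upLats/
lattice-tool -in-lattice-list lattice.txt -read-htk -lm model.lm -out-lattice-dir upLats -write-htk

# 5. decode the rescored lattices into a hypothesis file
lattice-tool -in-lattice-list upLats.txt -read-htk -viterbi-decode > hyp.txt
```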
- Pizza palace training set provided by Andreas Søeborg Kirkedal
- Verizon provided by Andreas Søeborg Kirkedal
- Google's Taskmaster (2019-2020) dataset; we only used the food-ordering JSON files
- Ubuntu Dialogue Corpus
- Santa Barbara Corpus of Spoken American English
- MetaLWOZ; we only used the categories Phone setting, Pizza ordering and Phone plans
- pp.trains contains different versions of the pizza palace training set, according to what we saw fit after each iteration
- models_lms contains all the iterations of the pizza palace project. Each iteration contains the LMs used, the hypothesis file (both normalized and untouched), as well as a folder called upLats with the updated lattices
- Lattices contains the starting lattices, which have not yet been rescored
- ww.py is a helper module we created to clean the files. It contains the functions "remove_filler" (this is actually caught by our silencewords.txt, which makes it redundant), "remove_confirmation" and "pickUp" (which concatenates "pick up" into one word)
- normhyp.py is the first step of normalization, which removes all the extra information added by the way we create our hypothesis file in Guide.txt. This step should always be done before applying further normalization, e.g. ww.py
- pathFinder.py can be used to create the in-lattice-lists mentioned in Guide.txt
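pathFinder.py itself is the authoritative version; as a hedged sketch of the idea, an in-lattice-list like "lattice.txt" or "upLats.txt" can be generated by writing the absolute path of every file in a lattice folder, one per line (the function name here is ours for illustration):

```python
from pathlib import Path

def write_lattice_list(lattice_dir, out_file):
    """Write the absolute path of every file in lattice_dir to out_file, one per line."""
    paths = sorted(p.resolve() for p in Path(lattice_dir).iterdir() if p.is_file())
    with open(out_file, "w") as f:
        for p in paths:
            f.write(f"{p}\n")

# e.g. write_lattice_list("Lattices", "lattice.txt")
#      write_lattice_list("upLats", "upLats.txt")
```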
- We only work with lowercase words
- No punctuation allowed
- All numbers are written out, e.g. 3 = "three". Numbers outside the range 0-9 are written as a sequence of single-digit words with spaces in between (a sketch of these rules follows this list)
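A minimal sketch of these three rules; the function name and exact regexes are ours for illustration, not taken from the notebooks:

```python
import re

# digit-to-word map, so e.g. "42" becomes "four two"
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(line):
    line = line.lower()                                               # lowercase only
    line = re.sub(r"[^\w\s]", " ", line)                              # no punctuation
    line = re.sub(r"\d", lambda m: f" {DIGITS[m.group(0)]} ", line)   # digits to words
    return " ".join(line.split())                                     # collapse whitespace

print(normalize("I'd like 2 pizzas for $14.50"))
# -> "i d like two pizzas for one four five zero"
```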
Every "clean" jupyter notebook is used to normalize data according to the above given rules as well as some extra requirements depending on what we saw fit during the iterations
Most of the files in the notebooks contain paths that will not work in a general setting. Make sure the paths are set correctly for your environment
- clean_pptrain.ipynb is used to clean pp.train.txt by removing "and" and "okay" if they appear at the beginning of a sentence. It also changes occurrences of "pizza pizza" to "pizza palace", as well as concatenating "pick up", using functionality from ww.py (these steps are sketched after this list)
- clean2019-2020TM.ipynb cleans the Taskmaster dataset from Google. Some regex work was needed to fix timestamps and prices containing "$", "." and ":", so we made sure to do this before removing punctuation (sketched after this list). We also found it fitting to remove ["no", "yes", "okay", "ok", "sure", "and", "yep"] if they occur at the beginning of a sentence. It does all of this for both files and concatenates them
- OOVTest.verizon.ipynb calculates the number of out-of-vocabulary words. This was required since SRILM was giving us a wrong number (a sketch of the count follows this list)
- MetaLWOZ.pizza.ipynb cleans the subset of MetaLWOZ that contains pizza delivery
- MetaLWOZ.phone.ipynb cleans the subset of MetaLWOZ that contains Phone setting and Phone plans
- UbuntuClean.ipynb cleans the smallest of the datasets from the Ubuntu Dialogue Corpus
- Clean_SBC.ipynb cleans the Santa Barbara Corpus of Spoken American English
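The clean_pptrain.ipynb steps mentioned above look roughly like this; the function name and the spelling "pickup" for the concatenated word are assumptions for illustration:

```python
import re

LEADING = ("and", "okay")

def clean_pp_line(line):
    words = line.split()
    # drop "and"/"okay" at the beginning of the sentence
    while words and words[0] in LEADING:
        words = words[1:]
    line = " ".join(words)
    line = line.replace("pizza pizza", "pizza palace")
    # same idea as ww.py's pickUp: concatenate "pick up" into one word
    line = re.sub(r"\bpick up\b", "pickup", line)
    return line

print(clean_pp_line("okay and i want to pick up a pizza pizza order"))
# -> "i want to pickup a pizza palace order"
```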
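For clean2019-2020TM.ipynb, the reason prices and timestamps are handled before the general punctuation pass can be illustrated as below; the notebook's actual regexes differ, this only shows the idea of splitting "$14.50" and "7:30" into plain digit groups so they survive punctuation removal:

```python
import re

CONFIRMATIONS = {"no", "yes", "okay", "ok", "sure", "and", "yep"}

def pre_clean_tm(line):
    # turn prices/timestamps like "$14.50" and "7:30" into "14 50" and "7 30"
    line = re.sub(r"\$?(\d+)[.:](\d+)", r"\1 \2", line)
    words = line.split()
    # drop confirmation words at the beginning of the sentence
    while words and words[0].lower() in CONFIRMATIONS:
        words = words[1:]
    return " ".join(words)

print(pre_clean_tm("Sure that will be $14.50 ready at 7:30"))
# -> "that will be 14 50 ready at 7 30"
```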
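The OOV re-count in OOVTest.verizon.ipynb boils down to comparing test tokens against the vocabulary file; a minimal version, with placeholder file names:

```python
def count_oov(vocab_file, text_file):
    """Count tokens in text_file that are not in the vocabulary file (one word per line)."""
    with open(vocab_file) as f:
        vocab = {line.strip() for line in f if line.strip()}
    total = oov = 0
    with open(text_file) as f:
        for line in f:
            for word in line.split():
                total += 1
                if word not in vocab:
                    oov += 1
    return oov, total

# e.g. oov, total = count_oov("vocab.txt", "verizon.test.txt")
```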