Sentence Coupling Analysis and Sentence Generation - SCASG
- This is a service for analyzing and generating NLP training sentences based on TF-IDF, WordNet, and spaCy.
- The details are published in the 2023 International Computer Symposium (ICS) and accessible here.
- Best Student Paper
Simply execute the command:
python app.py
and you can use SCASG in conjunction with BOTEN.
- This research suggests that developers pay attention to the coupling between training sentences and reduce the similarity of sentences belonging to different Intents, to avoid recognition errors.
- If the meanings of the sentences under the same Intent are too similar, the trained NLP model may only be able to recognize a single sentence pattern, resulting in overfitting.
- To avoid overfitting the model, developers should write a variety of training sentence patterns, so this research does not list Cohesion improvement as a research target.
- In our research, we use TF-IDF, WordNet, and spaCy to calculate the similarity between training sentences, raise warnings for sentences with similar meanings across different Intents, and suggest that developers modify them.
- Convert all text to lowercase.
- Lemmatize each word to its base form.
- Remove stop words.
- Use TF-IDF to calculate the weight of every word in each sentence and build a Corpus Index.
- Represent each training sentence by the weight values of its constituent words, so the sentence can be converted into a one-dimensional vector.
- Calculate the Cosine Similarity between the sentence vectors to obtain the similarity value of two sentences.
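The pipeline above can be sketched in plain Python. This is a minimal illustration, not the actual service code: the stop-word list is a tiny stand-in, lemmatization is omitted, and a real deployment would likely use a library such as scikit-learn.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; the real service presumably uses a full one.
STOP_WORDS = {"the", "a", "an", "is", "to", "i", "my"}

def preprocess(sentence):
    """Lowercase and drop stop words (lemmatization omitted for brevity)."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(sentences):
    """Build a corpus index and return one TF-IDF vector per sentence."""
    docs = [preprocess(s) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) + 1 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / max(len(d), 1) * idf[w] for w in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sentences = ["book a flight to taipei", "reserve a flight to tokyo", "play some music"]
_, vecs = tfidf_vectors(sentences)
print(cosine_similarity(vecs[0], vecs[1]))  # shared word "flight" -> nonzero
print(cosine_similarity(vecs[0], vecs[2]))  # no shared words -> 0.0
```

Note the last print: the first and third sentences score exactly zero despite both being plausible user utterances, which is the weakness the next two methods address.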
This method can leave two sentences with similar meanings at a low Cosine Similarity simply because they share no common words. Therefore, we propose two improved methods:
- Use WordNet to look up the synonyms of the words used in the training sentences, and use spaCy to calculate the similarity between each synonym and its original word.
- The weight value of the original word is multiplied by the similarity between the synonym and the original word to obtain the synonym's weight value in the Corpus Index.
- We treat these synonyms as words of the sentences to which the original words belong, and recalculate the similarity between all training sentences.
- spaCy is also used to calculate the similarity between sentences directly. Since the Corpora of the two calculation methods differ, two different similarity values are obtained for each pair of sentences; these two values are then weighted with a specific ratio to give the weighted similarity between the two sentences.
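A minimal sketch of the synonym-weighting and score-combination steps. The synonym table is a hypothetical stand-in for the WordNet lookup, the attached numbers stand in for spaCy word similarities, and the 0.5 ratio is illustrative (the source does not state the actual ratio):

```python
# Hypothetical synonym table: each entry stands in for a WordNet lookup,
# paired with a spaCy-style similarity to the original word.
SYNONYMS = {"book": [("reserve", 0.8)], "flight": [("plane", 0.6)]}

def expand_with_synonyms(tokens, weights):
    """Add each synonym to the corpus index with
    weight = original word's weight x synonym similarity."""
    expanded = dict(zip(tokens, weights))
    for word, weight in zip(tokens, weights):
        for syn, sim in SYNONYMS.get(word, []):
            expanded[syn] = max(expanded.get(syn, 0.0), weight * sim)
    return expanded

def weighted_similarity(sim_expanded, sim_spacy, ratio=0.5):
    """Combine the two similarity values with a specific ratio."""
    return ratio * sim_expanded + (1 - ratio) * sim_spacy

print(expand_with_synonyms(["book", "flight"], [0.7, 0.47]))  # synonyms inherit scaled weights
print(weighted_similarity(0.18, 0.62))  # ratio-weighted combination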
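A minimal sketch of the synonym-weighting and score-combination steps. The synonym table is a hypothetical stand-in for the WordNet lookup, the attached numbers stand in for spaCy word similarities, and the 0.5 ratio is illustrative (the source does not state the actual ratio):

```python
# Hypothetical synonym table: each entry stands in for a WordNet lookup,
# paired with a spaCy-style similarity to the original word.
SYNONYMS = {"book": [("reserve", 0.8)], "flight": [("plane", 0.6)]}

def expand_with_synonyms(tokens, weights):
    """Add each synonym to the corpus index with
    weight = original word's weight x synonym similarity."""
    expanded = dict(zip(tokens, weights))
    for word, weight in zip(tokens, weights):
        for syn, sim in SYNONYMS.get(word, []):
            expanded[syn] = max(expanded.get(syn, 0.0), weight * sim)
    return expanded

def weighted_similarity(sim_expanded, sim_spacy, ratio=0.5):
    """Combine the two similarity values with a specific ratio."""
    return ratio * sim_expanded + (1 - ratio) * sim_spacy

print(expand_with_synonyms(["book", "flight"], [0.7, 0.47]))  # synonyms inherit scaled weights
print(weighted_similarity(0.18, 0.62))  # ratio-weighted combination
```

With the synonyms folded into the corpus index, "book a flight" and "reserve a plane" now share weighted vocabulary and score a nonzero TF-IDF similarity.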
By using the above two methods, this system can accurately evaluate the similarity between training sentences and raise warnings for sentences that are too similar yet belong to different Intents. The evaluation method is as follows:
- Each sentence is used as the reference sentence in turn: all other sentences are compared with the reference sentence, their similarities are calculated, and the similarities are summed per Intent.
- Compare these summed values.
- If any Intent's sum is greater than that of the Intent to which the reference sentence belongs, the sentence in that Intent contributing the maximum similarity is regarded as too similar to the reference sentence.
Through the above method, the diversity of sentences under the same Intent is preserved, while overly similar sentences across Intents are detected.
In addition to the evaluation of training sentences, this research also proposes a mechanism for generating training sentences, and it can be divided into the following steps:
- Preprocess: convert all text to lowercase, lemmatize, and remove stop words.
- Use WordNet to look up the synonyms of each meaningful Token.
- Calculate the similarity between each synonym and the original Token through WordNet. If the similarity is higher than 0.3, it is regarded as a synonym for the Token (this value was obtained through our experiments).
- According to their relative positions in the original sentence, these synonyms replace the non-stop-word tokens in turn, combining into multiple sentences with the same meaning as the original sentence.
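The replacement step can be sketched as below. The synonym table is a hypothetical stand-in for the WordNet lookup already filtered by the 0.3 threshold, and the stop-word list is illustrative:

```python
from itertools import product

STOP_WORDS = {"a", "the", "to"}
# Hypothetical WordNet lookup results, pre-filtered to similarity > 0.3.
SYNONYMS = {"book": ["reserve"], "flight": ["plane"]}

def generate_sentences(sentence):
    """For each non-stop-word token, collect the token itself plus its
    synonyms, then combine every option at its original position."""
    options = []
    for tok in sentence.lower().split():
        if tok in STOP_WORDS:
            options.append([tok])  # stop words are never replaced
        else:
            options.append([tok] + SYNONYMS.get(tok, []))
    return [" ".join(combo) for combo in product(*options)]

print(generate_sentences("book a flight"))
# -> ['book a flight', 'book a plane', 'reserve a flight', 'reserve a plane']
```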
Among the sentences generated with WordNet, some have meanings quite different from the original sentence, which could lead to training an inaccurate model, so we further use spaCy to filter the sentences in the following steps:
- Calculate the similarity of each generated sentence with the original sentence in turn.
- Preserve the sentences with a similarity greater than 0.7 as training sentences to complete the expansion (this value was obtained through our experiments).
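A sketch of the filtering step. A simple word-overlap score stands in for spaCy's vector-based sentence similarity so the example runs without a model download; the 0.7 threshold is the value from the source:

```python
def word_overlap(a, b):
    """Stand-in for spaCy sentence similarity: Jaccard overlap of tokens."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def filter_generated(original, candidates, similarity=word_overlap, threshold=0.7):
    """Keep only generated sentences whose similarity to the original
    sentence exceeds the threshold (0.7 in the source's experiments)."""
    return [c for c in candidates if similarity(original, c) > threshold]

print(filter_generated("book a flight", ["book a flight today", "play music"]))
# -> ['book a flight today']
```

Sentences that drift too far from the original meaning are dropped, and the survivors are added to the Intent's training set to complete the expansion.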