The developer guide is targeted at people who want to extend or understand the functionality of the vec4ir package. We go through the implementation details in a top-down manner (See Figure ). We start with the core functionality (Section ). Then, we describe the implementation of the matching operation (Section ), the provided retrieval models (Section ), query expansion techniques (Section ) and helpful utilities (Section ) in detail. The API design of vec4ir is heavily inspired by that of sklearn (Buitinck et al. 2013).
The core functionality covers everything required to set up and evaluate a retrieval model: a generic information retrieval pipeline implementation, embedded vectorisation (i.e. word vector aggregation), and the evaluation of retrieval models.
The Retrieval
class wraps the functionality of the matching operation, query expansion and similarity scoring.
In the constructor, references to the retrieval model, the matching operation and the query expansion instances are stored. Additionally, users may set an identifier name for their composition of retrieval models. The labels_ property is filled upon invoking fit.
def __init__(self, retrieval_model, matching=None,
             query_expansion=None, name='RM'):
    BaseEstimator.__init__(self)
    self._retrieval_model = retrieval_model
    self._matching = matching
    self._query_expansion = query_expansion
    self.name = name
    self.labels_ = None
Upon fit(X[, y]), the respective Retrieval instance fits the documents as a whole. The fit method is called on the retrieval model delegate as well as on the matching and query expansion instances (if provided). If additional document identifiers are provided as argument y to fit, they are stored in the labels_ property, such that in all further computation the indices of X may be used for accessing the documents.
def fit(self, X, y=None):
    assert y is None or len(X) == len(y)
    self.labels_ = np.asarray(y) if y is not None else np.arange(len(X))
    if self._query_expansion:
        self._query_expansion.fit(X)
    if self._matching:
        self._matching.fit(X)
    self._retrieval_model.fit(X)
The query(q[, k]) method allows querying the Retrieval instance for the k documents most relevant to the query string q. First, if a query expansion delegate is provided, it is called to transform q according to the respective query expansion strategy. In a second step, the (formally also optional) matching operation is conducted by invoking its predict method, which is expected to return the set of indices that match the query q. The result of the matching operation, ind, is used in two ways. On the one hand, it is passed to the query method of the retrieval model delegate, which may (and should) reduce its internal representation according to the matching indices. On the other hand, the view on the stored labels is reduced by labels = labels[ind]. The benefit of this reduction is that the relative indices retrieved_indices returned by the retrieval model can be used to access the labels array directly, which restores the original document identifiers. In between, we ensure that the retrieval model does not return more than k documents by slicing the top k indices: retrieved_indices = retrieved_indices[:k].
def query(self, q, k=None):
    labels = self.labels_
    if self._query_expansion:
        q = self._query_expansion.transform(q)
    if self._matching:
        ind = self._matching.predict(q)
        if len(ind) == 0:
            return []
        labels = labels[ind]  # reduce our own view
    else:
        ind = None
    retrieved_indices = self._retrieval_model.query(q, k=k, indices=ind)
    if k is not None:
        retrieved_indices = retrieved_indices[:k]
    return labels[retrieved_indices]
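To illustrate how these pieces fit together, the following hedged usage sketch composes a Retrieval instance from components that are described in later sections (the document strings and identifiers are made up for illustration):

docs = ["the quick brown fox", "lorem ipsum dolor sit amet"]
ids = ["doc1", "doc2"]
# Tfidf and Matching are described in the following sections
retrieval = Retrieval(retrieval_model=Tfidf(), matching=Matching(), name="tfidf")
retrieval.fit(docs, y=ids)
print(retrieval.query("brown fox", k=1))  # expected: ['doc1']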
Since vec4ir
is dedicated to embedding-based retrieval models, a class is provided that aggregates word vectors according to the respective term-document frequencies. We call this process “embedded vectorisation”. The EmbeddedVectorizer
extends the behaviour of sklearn.feature_extraction.text.TfidfVectorizer as a subclass.
In the constructor, the index2word
property of the word vectors embedding
is used to initialise the vocabulary of its super class TfidfVectorizer
. This results in the crucial property that the indices of the embedding model and the transformed term-document matrix of the TfidfVectorizer
correspond to each other. All additional arguments of the constructor are passed directly to the TfidfVectorizer
so that its functionality is completely retained. Most notably, passing use_idf=False
disables the discounting of terms that are frequent across the collection (which is enabled by default). With the default norm='l2', the word frequencies of each document are normalised to unit L2-norm.
def __init__(self, embedding, **kwargs):
    vocabulary = embedding.index2word
    self.embedding = embedding
    TfidfVectorizer.__init__(self, vocabulary=vocabulary, **kwargs)
The fit method call is passed on to the super class TfidfVectorizer.
def fit(self, raw_docs, y=None):
    super().fit(raw_docs)
    return self
In the transform(raw_documents, y=None) method, we also start by invoking the transform method of the super class. The obtained Xt contains the term frequencies of the documents. Depending on the parameters passed to the TfidfVectorizer super class in the constructor, the values of Xt are already re-weighted and normalised to unit L2-norm. In the final step, we aggregate the word vectors with a single matrix multiplication Xt @ syn0. Thus, word frequencies with unit L2-norm result in the centroid of the respective embedded word vectors.
def transform(self, raw_documents, y=None):
    Xt = super().transform(raw_documents)
    # Xt is a sparse matrix of word frequencies
    syn0 = self.embedding.syn0
    # syn0 holds the word vectors
    return Xt @ syn0
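As a brief usage illustration (a sketch, assuming wv is a word embedding that exposes index2word and syn0 as described above), the vectoriser maps raw documents directly to aggregated vectors in the embedding space:

vect = EmbeddedVectorizer(wv, use_idf=True)
vect.fit(["first document", "second document about foxes"])
X_emb = vect.transform(["first document", "second document about foxes"])
# X_emb has shape (n_documents, embedding_dimension)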
For convenient evaluation, we provide the RetriEvalMixin class. A retrieval model class that implements a query(q[, k]) method may inherit the provided evaluate method from the RetriEvalMixin class. The evaluate method is an interface for executing a set of queries in a single call and computing various ranking metrics (mean average precision, mean reciprocal rank, normalised discounted cumulative gain, precision, recall, F1-score, precision@5, precision@10) as well as the response time in seconds. The evaluate method expects the following arguments:
- X: List of query identifier and query string pairs.
- Y: The gold standard, either a dictionary of dictionaries, a pandas.DataFrame with hierarchical index, or a numpy/scipy.sparse array-like. The outer index is expected to be the query identifiers, while the inner index needs to correspond to the document identifiers. A value greater than zero indicates relevance; greater values indicate higher relevance.
- k: The desired amount of documents to retrieve. Except for precision@5 and precision@10, all metrics are computed with respect to this parameter.
- verbose: Controls whether intermediate results should be printed to stdout.
- replacement: The value to use for documents which could not be found in Y for the specific query identifier. The default value is zero, hence missing query-document pairs are regarded as non-relevant.
- n_jobs: If greater than 1, the evaluation is executed in parallel (not supported by all retrieval models).
The evaluate method of the RetriEvalMixin
class is implemented as follows:
def evaluate(self, X, Y, k=20, verbose=0, replacement=0, n_jobs=1):
    # ... multi-processing code ...
    values = defaultdict(list)
    for qid, query in X:
        t0 = timer()
        result = self.query(query, k=k)
        values["time_per_query"].append(timer() - t0)
        scored_result = [harvest(Y, qid, docid, replacement)
                         for docid in result]
        r = scored_result[:k] if k is not None else scored_result
        # NDCG
        gold = harvest(Y, qid)
        idcg = rm.dcg_at_k(gold[argtopk(gold, k)], k)
        ndcg = rm.dcg_at_k(scored_result, k) / idcg
        values["ndcg"].append(ndcg)
        # MAP@k
        ap = rm.average_precision(r)
        values["MAP"].append(ap)
        # MRR
        ind = np.asarray(r).nonzero()[0]
        mrr = (1. / (ind[0] + 1)) if ind.size else 0.
        values["MRR"].append(mrr)
        # ... other metrics ...
    return values
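A hedged usage sketch follows, assuming that retrieval is a fitted instance whose class mixes in RetriEvalMixin; the query identifiers and the dict-of-dicts gold standard are made up for illustration:

queries = [(1, "brown fox"), (2, "lorem ipsum")]
gold = {1: {"doc1": 1}, 2: {"doc2": 2}}   # relevance > 0 means relevant
scores = retrieval.evaluate(queries, gold, k=10)
print(np.mean(scores["MAP"]), np.mean(scores["MRR"]))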
The matching operation is implemented in the Matching
class and designed to be employed as a component of the Retrieval
class. The interface to implement for a custom matching operation is:
- In the fit(raw_documents) method, an internal representation of the documents is stored.
- The predict(query) method returns the index set of matching documents.
The generic Matching class reduces the complexity of adding a matching algorithm to the implementation of a single function. The default representation is a binary term-document matrix, which should suffice for all Boolean models. The representation is computed by an sklearn.feature_extraction.text.CountVectorizer. This process can be further customised by users, since additional keyword arguments are passed directly to the CountVectorizer.
def __init__(self, match_fn=TermMatch, binary=True, dtype=np.bool_,
             **cv_params):
    self._match_fn = match_fn
    self._vect = CountVectorizer(binary=binary, dtype=dtype,
                                 **cv_params)
Upon calling predict
, the matching function match_fn
is employed to compute the index set of the matching documents. Hence, the user can customise the behaviour (conjunctive or disjunctive) of each Matching
instance by passing a function to the constructor. The match_fn function object is expected to return the matching indices when provided with the term-document matrix X and the vectorised query q: match_fn(X, q) → matching_indices.
Disjunctive (OR
) matching of single query terms is the default query parsing method for most information retrieval systems. Thus, the framework provides a dedicated function for it, which is also the default value of the match_fn
argument for the constructor of the Matching
class.
def TermMatch(X, q):
    inverted_index = X.transpose()
    query_terms = q.nonzero()[1]
    matching_terms = inverted_index[query_terms, :]
    matching_doc_indices = np.unique(matching_terms.nonzero()[1])
    return matching_doc_indices
First, the term-document matrix is transposed to obtain an inverted index, and the indices of the query tokens are extracted by taking the np.nonzero entries along the second dimension ([1]). Second, we look up the query token indices in the inverted index. Finally, we keep only one index per matching document by calling np.unique.
The Tfidf
class implements the popular retrieval model based on TF-IDF re-weighting introduced by Salton and Buckley (1988). The term frequencies of a document are scaled by the inverse document frequency of the specific terms in the corpus D (F. Pedregosa et al. 2011):
tfidf(t, d, D)=tf(t, d)⋅idf(t, D)
Here, tf(t, d) is the number of occurrences of the word t in the document d:
tf(t, d)=freq(t, d)
The inverse document frequency is a measure for the fraction of documents in the corpus that contain the term t:
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
The Tfidf retrieval model extends the TfidfVectorizer of sklearn. The documents are stored as an L2-normalised matrix of IDF re-weighted word frequencies. After transforming the query in the same way, we can compute the cosine similarity to all matching documents. As both the query and the documents are L2-normalised, the linear kernel yields the desired cosine similarity. Finally, we use the function argtopk (See Section ) to retrieve the top-k documents with respect to the result of the linear kernel.
class Tfidf(TfidfVectorizer):
    def __init__(self, analyzer='word', use_idf=True):
        TfidfVectorizer.__init__(self,
                                 analyzer=analyzer,
                                 use_idf=use_idf,
                                 norm='l2')
        self._fit_X = None

    def fit(self, X):
        Xt = super().fit_transform(X)
        self._fit_X = Xt
        return self

    def query(self, query, k=None, indices=None):
        if self._fit_X is None:
            raise NotFittedError
        q = super().transform([query])
        if indices is not None:
            fit_X = self._fit_X[indices]
        else:
            fit_X = self._fit_X
        # both fit_X and q are l2-normalised
        D = linear_kernel(q, fit_X)
        ind = argtopk(D[0], k)
        return ind
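A brief usage sketch (the document strings are made up for illustration):

tfidf = Tfidf()
tfidf.fit(["first document", "second document about foxes"])
print(tfidf.query("foxes", k=1))   # expected: the index of the best matching document, e.g. [1]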
The word centroid similarity aggregates the word vectors of the documents to their centroids. It is implemented in the WordCentroidSimilarity class. The implementation makes extensive use of the EmbeddedVectorizer class (See Section ) for the aggregation of word vectors. Hence, it is possible to choose between IDF re-weighted word frequencies and plain word frequencies for aggregation. The parameters passed to the EmbeddedVectorizer delegate could be further extended; the current implementation limits the possible configuration to the single parameter use_idf. Providing use_idf=True results in the IDF re-weighted word centroid similarity.
class WordCentroidSimilarity(BaseEstimator):
    def __init__(self, embedding, analyzer='word', use_idf=True):
        self.vect = EmbeddedVectorizer(embedding,
                                       analyzer=analyzer,
                                       use_idf=use_idf)
        self.centroids = None

    def fit(self, X):
        Xt = self.vect.fit_transform(X)
        Xt = normalize(Xt, copy=False)
        self.centroids = Xt

    def query(self, query, k=None, indices=None):
        centroids = self.centroids
        if centroids is None:
            raise NotFittedError
        if indices is not None:
            centroids = centroids[indices]
        q = self.vect.transform([query])
        q = normalize(q, copy=False)
        # l2-normalised, so the linear kernel equals cosine similarity
        D = linear_kernel(q, centroids)
        ret = argtopk(D[0], k=k)
        return ret
Please note that the aggregated centroids need to be re-normalised to unit L2-norm (even if the word frequencies were normalised in the first place), so that the linear_kernel corresponds to the cosine similarity. Finally, the results of the linear kernel are ranked by the argtopk function (See Section ).
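A hedged composition sketch, assuming wv is a word embedding as above and docs and ids are defined as in the earlier Retrieval example:

wcs = WordCentroidSimilarity(embedding=wv, use_idf=True)
retrieval = Retrieval(retrieval_model=wcs, matching=Matching(), name="wcs")
retrieval.fit(docs, y=ids)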
In order to compute the Word Mover’s distance (Kusner et al. 2015), we employ the gensim implementation Word2Vec.wmdistance (Řehuřek and Sojka 2010). Additionally, we make the analysis function analyze_fn as well as a completeness parameter complete available as arguments to the constructor. The analyze_fn argument is expected to be a function that returns a list of tokens, given a string. The complete parameter specifies whether the Word Mover’s distance is computed to all documents or only to the top k documents returned by WordCentroidSimilarity, which is in turn customisable via its use_idf constructor argument. It is expected to be a float value between 1 (compute the Word Mover’s distance to all documents) and 0 (compute the Word Mover’s distance only to the documents returned by WCS); a fraction in between takes the corresponding amount of additional documents into account. The crucial functionality can be summarised in the following lines of code.
incomplete = complete < 1.0

# inside the fit method
docs = np.array([self.analyze_fn(doc) for doc in raw_docs])
if incomplete:
    self.wcd.fit(raw_docs)

# inside the query method
if incomplete:
    wcd_ind = self.wcd.query(query, k=n_req, indices=indices)
    docs = docs[wcd_ind]
q = self.analyze_fn(query)
dists = np.array([self.embedding.wmdistance(q, d) for d in docs])
ind = np.argsort(dists)[:k]
if incomplete:
    # stay relative to the matching indices
    ind = wcd_ind[ind]
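The underlying gensim call operates on token lists. A minimal sketch, assuming wv is a gensim word-vector model and the optional WMD backend is installed:

d = wv.wmdistance("obama speaks to the media".split(),
                  "the president greets the press".split())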
The Doc2VecInference
class implements a retrieval model based on a paragraph vector model (Le and Mikolov 2014), i.e., Doc2Vec
from gensim
(Řehuřek and Sojka 2010). Given an existing Doc2Vec
model, it is used to infer the document vector of a new document. We store these inferred document vectors and also infer a document vector for the query at query time. Afterwards, we compute the cosine similarity of all matching documents to the query and return the respective indices in descending order. The inference step considers three hyper-parameters:
- alpha: The initial learning rate.
- min_alpha: The final learning rate.
- steps: The number of training epochs.
The inference process itself (provided by gensim) runs steps training epochs with fixed weights and a learning rate decaying linearly from alpha to min_alpha. As the model operates on lists of tokens, an analyser is required to split the documents into tokens. We compute the representation of the documents (as well as the query) as follows:
analyzed_docs = (analyzed(doc) for doc in docs)
representation = [doc2vec.infer_vector(doc,
                                       steps=self.epochs,
                                       alpha=self.alpha,
                                       min_alpha=self.min_alpha)
                  for doc in analyzed_docs]
representation = normalize(np.asarray(representation), copy=False)
The final normalisation step prepares the computation of the cosine similarity with a linear kernel. At query time, the cosine similarity between the vectors of the query and the documents is computed. The documents are then ranked in descending order.
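A hedged sketch of preparing the underlying gensim model (the parameter values are arbitrary, docs is assumed to be a list of strings, and in recent gensim versions the steps argument of infer_vector is named epochs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(docs)]
doc2vec = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=1)
vector = doc2vec.infer_vector("an unseen query".split())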
All query expansion techniques should implement the following interface (a minimal no-op sketch is shown after the list):
- fit(raw_documents): Fits the query expansion technique to the raw documents.
- transform(query): Expands the query string.
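A minimal (hypothetical) no-op implementation of this interface could look as follows:

class NoopExpansion:
    """Query expansion that leaves the query unchanged."""
    def fit(self, raw_documents):
        return self

    def transform(self, query):
        return query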
As two implementations of this interface, we present the naive centroid-based expansion as well as the embedding-based query language models by Zamani and Croft (2016). Please note that for research it is strongly recommended to reduce the employed word embedding to the tokens that actually appear in the collection. Otherwise, words that never appear in the collection could be added to the query and thus would not affect the matching operation.
Centroid expansion is a query expansion technique that computes the (possibly IDF re-weighted) centroid v of the query. It expands the query by the m nearest words with respect to the cosine distance to v. As a first step, we once again compute the centroid vector of the query using the EmbeddedVectorizer after fitting it on the collection (to build a vocabulary). Then, we employ the similar_by_vector method provided by the gensim.models.Word2Vec class to obtain the nearest tokens.
## inside transform method
v = vect.transform([query])[0]
exp_tuples = wv.similar_by_vector(v, topn=self.m)
words, __scores = zip(*exp_tuples)
expanded_query = query + ' ' + ' '.join(words)
The embedding-based query language models proposed by Zamani and Croft (2016) include two techniques for query expansion (EQE1 and EQE2), both relying on an underlying probability distribution based on the (weighted) sigmoid of the cosine similarity between terms. While EQE1 assumes statistical independence of the query terms, EQE2 assumes query-independent term similarities. The probabilities of the respective query language models are defined with respect to a delta function, which transforms the cosine distance between two word vectors by a parametrised sigmoid. We implement this behaviour in a dedicated delta function:
def delta(X, Y=None, n_jobs=-1, a=1, c=0):
    D = pairwise_distances(X, Y, metric="cosine", n_jobs=n_jobs)
    D -= c
    D *= a
    D = expit(D)
    return D
We compute the pairwise distances for all word vectors via D = delta(E, E), where E is the matrix of word vectors. While both variants are implemented in the framework, we only present the computation of the EQE1 variant:
prior = np.sum(D, axis=1)
conditional = D[q] / prior
posterior = prior * np.prod(conditional, axis=0)
topm = np.argpartition(posterior, -m)[-m:]
expansion = [wv.index2word[i] for i in topm]
When dealing with retrieval models, a common operation is to retrieve the indices of the top k scores from a list of scores in sorted order. Sorting the complete list and then slicing the top k entries would lead to unnecessary computation. Thus, vec4ir includes a helper function for this specific operation: argtopk. The function makes extensive use of numpy.argpartition, which performs a single partitioning step of the quicksort algorithm. Upon invoking np.argpartition(A, k), the element at position k of the returned index set is at the correct position with respect to sorted order, and all indices i = 0, …, k − 1 before it refer to smaller or equal values, A[i] <= A[k]. Since we are interested in the k highest values, we invoke argpartition with -k and slice the top k values by [-k:].
def argtopk(A, k=None):
    A = np.asarray(A)
    if k is None or k >= A.size:
        # if A contains too few elements or k is None,
        # return all indices in descending sort order
        return np.argsort(A)[::-1]
    ind = np.argpartition(A, -k)[-k:]
    ind = ind[np.argsort(A[ind])][::-1]
    return ind
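For example:

A = [0.1, 0.9, 0.4, 0.3]
argtopk(A, 2)   # -> array([1, 2]): indices of the two highest scores, in descending order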
The RetriEvalMixin (as described in Section ) allows several different data types for the gold standard Y. We provide a unifying function harvest to extract the relevance score for the desired query-document pair for all of the supported data types. The immediate benefit is that the data neither has to be transformed nor copied before calling evaluate on the implementing class. Additionally, the function can be used to obtain the whole set of values for one specific query (doc_id=None). On the one hand, the desired behaviour is to raise an exception if a query identifier is not found in the gold standard. On the other hand, missing values for the document identifier of one specific query should not raise an exception but return the default value (typically zero). The latter behaviour could also be achieved by using a defaultdict or a sparse matrix as the inner data structure; however, we want to relieve the user from those issues. The harvest function takes the following arguments:
- source: The gold standard of query-document relevance, a two-level data structure of type DataFrame, dict of dicts, or two-dimensional ndarray.
- query_id: The query identifier to use as index for the first level of source.
- doc_id: If None, return the relevance scores for the query given by query_id; else look up doc_id in the second level of source.
- default: The default value in case a document identifier is not found on the second level (typically zero).
In the implementation, we first access the query_id entry of source and raise an exception if query_id is not present in source. In a second step, we look up doc_id on the second level of the data structure and return the value. If the second step fails, the default value is returned instead.
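A hedged sketch of this behaviour, covering only the dict-of-dicts case (the actual implementation also supports DataFrame and array-like sources):

def harvest(source, query_id, doc_id=None, default=0):
    scores = source[query_id]              # raises KeyError for unknown query ids
    if doc_id is None:
        # all relevance scores for the query
        return np.asarray(list(scores.values()))
    try:
        return scores[doc_id]
    except (KeyError, IndexError):
        return default                     # missing pairs count as non-relevant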
We introduce IRDataSetBase as an abstract base class for the data sets. Subclasses should implement the following properties (a minimal subclass sketch follows the list):
- docs: Returns the documents of the corpus.
- topics: Returns the topics, i.e. the queries.
- rels: Returns the relevance judgements for the queries, either as a dict of defaultdicts or as a pandas.Series object with a hierarchical index (multi-index).
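A minimal (hypothetical) subclass sketch illustrating this interface:

class MyDataset(IRDataSetBase):
    def __init__(self, root):
        self.root = root   # data-set specific options belong in the constructor

    @property
    def docs(self):
        ...   # load and return the documents of the corpus

    @property
    def topics(self):
        ...   # return the (query_id, query_string) pairs

    @property
    def rels(self):
        ...   # return the relevance judgements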
All other options for the data set (such as the path to its root directory and caching) should be placed in the respective constructor. As an example subclass of the IRDataSetBase
, we present the Quadflor-like dataset format.
For convenience, we adopt the data set format and specification from Quadflor, a framework for multi-label classification. A Quadflor-like data set consists of:
- X: The documents, either a path to a directory containing <id>.txt documents or a TSV file with columns id and content.
- y: The gold standard, i.e. label annotations for the documents: a TSV file with the document id in the first column and label identifiers in the subsequent columns.
- thes: A thesaurus consisting of a hierarchy of concepts which are used as queries. A minimal format would be {'<query_id>': { 'prefLabel': [ "Here is the query" ] }, '<query_id>': { 'prefLabel': [ "Yet another query string." ] } }. The query ids should match the ones in y.
In our prior work on Quadflor, the gold standard y was used as target labels in a multi-label classification task. Here, we employ the data set in a different manner: we extract the preferred labels of each concept from the thesaurus thes and use them as queries. When the label annotations y of a document contain the specific concept identifier, we consider the document relevant to the query.
Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, et al. 2013. “API Design for Machine Learning Software: Experiences from the Scikit-Learn Project.” In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108–22.
Kusner, Matt J., Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. “From Word Embeddings to Document Distances.” In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, edited by Francis R. Bach and David M. Blei, 37:957–66. JMLR Workshop and Conference Proceedings. JMLR.org. http://jmlr.org/proceedings/papers/v37/kusnerb15.html.
Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 32:1188–96. JMLR Workshop and Conference Proceedings. JMLR.org. http://jmlr.org/proceedings/papers/v32/le14.html.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Řehuřek, Radim, and Petr Sojka. 2010. “Software Framework for Topic Modelling with Large Corpora.” In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
Salton, Gerard, and Chris Buckley. 1988. “Term-Weighting Approaches in Automatic Text Retrieval.” Inf. Process. Manage. 24 (5): 513–23. doi:10.1016/0306-4573(88)90021-0.
Zamani, Hamed, and W. Bruce Croft. 2016. “Embedding-Based Query Language Models.” In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016, Newark, DE, USA, September 12–16, 2016, edited by Ben Carterette, Hui Fang, Mounia Lalmas, and Jian-Yun Nie, 147–56. ACM. doi:10.1145/2970398.2970405.