The `lagoon` tool maintains some database metadata in two tables called `sources` and `sourceColumns`. To create these tables, run
# lagoon-server init-db --db-admin-pass '<password>'
Database initialized
The database administrator password can be changed later using
# lagoon-server change-db-admin-pass --old-db-admin-pass '<old>' --new-db-admin-pass '<new>'
To migrate an existing database initialized by an older version of the ingest tool, run
# lagoon-server migrate
Migration complete
If you want to reset the contents of the database, you can restore it to a clean state with
# lagoon-server reset-db --db-admin-pass '<password>'
Database reset
Note that the `reset-db` operation deletes all datasets and metadata in the database, for all users.
Although the server can be started without any command-line options at all, you will probably want to specify at least a path to a YAML file with the PostgreSQL settings:
# lagoon-server --config /path/to/config.yaml
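The exact configuration keys depend on the lagoon-server version; as a purely illustrative sketch (the key names below are assumptions, not taken from this document), such a file would typically carry the usual PostgreSQL connection settings:
```
# Hypothetical example only: write a minimal config.yaml with PostgreSQL
# connection settings. The key names are illustrative; consult your
# lagoon-server version for the actual configuration schema.
cat > /path/to/config.yaml <<'EOF'
pghost: localhost
pgport: 5432
pgdatabase: lagoon
pguser: lagoon
pgpassword: secret
EOF
```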
To enable secure connections to the server, run the server with
# lagoon-server --tls-cert /path/to/cert.pem --tls-key /path/to/key.pem ...
When started like this, the server will reject insecure (plaintext) connections
from clients; if that is not desirable, the --allow-insecure
option can be
used.
During testing you might want to use a self-signed certificate; the `tls` directory contains an example. When using this certificate you will need to configure clients not to verify the server certificate; for example, with `curl` you will need to pass the `--insecure` option.
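For example, to query a server started with the self-signed certificate from the `tls` directory (hostname and port are illustrative):
```
# Skip server certificate verification -- for testing with self-signed certificates only
curl --insecure https://localhost:22089/sources
```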
When S3 support is enabled, remotes of the form `s3://bucket_name/object_key` will be interpreted as S3 objects with key `object_key` living in the AWS S3 bucket `bucket_name` (see the notes on encoding).
To enable S3 support for remote files, run the server with
# lagoon-server --s3-env
or
# lagoon-server --s3-file --s3-credentials <file> --s3-key <key>
The former enables S3 support using AWS credentials set in the environment (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`). The latter enables S3 support using AWS credentials set in a file `<file>`, using the key specified by `<key>`. The file should have the following format:
my_key_1 aws_key_1 aws_secret_1
my_key_2 aws_key_2 aws_secret_2
...
and the correct AWS key and secret will be looked up using the key provided by `--s3-key` in the file specified by `--s3-credentials` (both have defaults). To use `aws_key_1` and `aws_secret_1` from a file as above stored in `/home/someuser/aws-credentials`, use the following:
# lagoon-server --s3-file --s3-credentials /home/someuser/aws-credentials --s3-key my_key_1
AWS S3 allows non-URL characters in the bucket names and object keys. In order to deal with this, the Lagoon REST API requires the bucket name and object key to be URL-encoded. For instance, with bucket `my_bucket` and object key `my object`, the remote URI would look like this:
s3://my_bucket/my%20object
and the whole request parameter may look something like this:
...remote=s3%3A%2F%2Fmy_bucket%2Fmy%2520object&...
(note how `my object` is double-encoded in the final request).
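As a sketch of how a client might construct such a request by hand (using `jq`'s `@uri` filter purely as a convenient URL-encoder; any encoder will do, and the bucket/key names are the ones from the example above):
```
# Encode the object key once to form the s3:// URI...
key=$(jq -rn --arg k 'my object' '$k | @uri')      # my%20object
uri="s3://my_bucket/$key"
# ...then encode the whole URI again when placing it in the query string
remote=$(jq -rn --arg u "$uri" '$u | @uri')        # s3%3A%2F%2Fmy_bucket%2Fmy%2520object
curl -v -X POST "http://localhost:22089/sources?name=example&remote=$remote"
```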
The server currently supports the following external authentication servers:
- LDAP. To use LDAP, specify the URL to the LDAP server using `--ldap`. It is strongly recommended to use LDAPS, as passwords will otherwise be sent in plaintext across the network. When using LDAPS, you can set `TLS_REQCERT allow` in `~/.ldaprc` or `/etc/ldap/ldap.conf` to configure the server not to verify the credentials of the LDAP server (for testing purposes only).
- HTTP Basic Auth. To use basic auth, pass the `--basic-auth` argument with the URL to the HTTP endpoint. It is strongly recommended to use HTTPS, as passwords will otherwise be sent in plaintext across the network. When using HTTPS, the `--auth-ignore-cert` option can be used to configure the server not to verify the certificate of the HTTP server (for testing purposes only).
NOTE: The output of the server is not pretty-printed. To improve legibility for human readers, you might want to use a JSON pretty printer such as jq:
curl http://localhost:22089/sources | jq .
To authenticate, post credentials in JSON to `/user/login`:
# curl -c cookiejar \
-d '{"user": "username", "pass": "password"}' \
-H 'Content-Type: application/json' \
http://localhost:22089/user/login
{"ok":{"username":"username"}}
When successful, this will set a `sessionId` cookie. As long as that cookie is passed on subsequent requests, those requests will have the appropriate privileges. To terminate the session, post to `/user/logout`:
# curl -b cookiejar -X POST http://localhost:22089/user/logout
(TODO: The server does not currently inform the client at this point that the cookie is expired.)
In order to persist the session, clients can request an authentication token:
# curl -b cookiejar -X POST http://localhost:22089/user/token
"9f81fb6a-48a2-445b-8e47-808795b24f22"
Note that, as with all server responses, this is in JSON format, hence the quotes. The token can be stored in a file or elsewhere and subsequently used as an alternative way to log in:
# curl -b cookiejar \
-d '"9f81fb6a-48a2-445b-8e47-808795b24f22"' \
-H 'Content-Type: application/json' \
http://localhost:22089/user/resume
{"ok":{"username":"username"}}
This will then set a new cookie. The token is good until posting to `/user/logout`.
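For example, a client could save the token to a file and later use it to start a fresh session without re-entering credentials (the file name and the use of `jq` are illustrative):
```
# Request a token while logged in and persist it
curl -s -b cookiejar -X POST http://localhost:22089/user/token | jq -r . > ~/.lagoon-token
# Later: resume a session using the stored token (this sets a fresh cookie)
curl -c cookiejar \
  -d "\"$(cat ~/.lagoon-token)\"" \
  -H 'Content-Type: application/json' \
  http://localhost:22089/user/resume
```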
Basic call:
curl http://localhost:22089/sources
Examples using filtering/sorting/searching:
curl "http://localhost:22089/sources?limit=2&offset=2"
curl "http://localhost:22089/sources?createdAfter=2016-10-19%2012:00:00"
curl "http://localhost:22089/sources?orderDesc=description&orderAsc=created&orderDesc=ix"
Of course many other combinations are also possible.
Query parameters:
Parameter | Type | Ingest equivalent |
---|---|---|
`offset` | Int | `--offset` |
`limit` | Int | `--limit` |
`search` | String | `--search` |
`ix` | Ix | `--ix` |
`tag` | TagName | `--tag` |
`description` | Description | `--description` |
`name` | SourceName | `--name` |
`user` | UserName | `--user` |
`column` | ColumnName | `--column` |
`createdAfter` | Timestamp | `--created-after` |
`createdBefore` | Timestamp | `--created-before` |
`orderAsc` / `orderDesc` | ColumnName | `--order-asc` / `--order-desc` |
`includeDeprecated` | (flag) | `--include-deprecated` |
`includePrivate` | (flag) | `--include-private` |
`fact` | String | `--fact` |
`class` | String | `--class` |
`group` | String | `--group` |
curl -v http://localhost:22089/source/5
curl -X DELETE -v http://localhost:22089/source/5
curl -v -X POST -d '["a","b"]' \
-H "Content-Type: application/json" \
http://localhost:22089/source/5/tags
curl -v -X DELETE http://localhost:22089/source/5/tag/a
To upload unzipped files:
curl -v -X POST \
--data-binary @sources/proteinatlas/subcellular_location.csv \
'http://localhost:22089/sources?name=subcellular'
If you want to upload zipped files, you can either manually specify the decompression method:
curl -v -X POST \
--data-binary @sources/proteinatlas/subcellular_location.csv.zip \
'http://localhost:22089/sources?name=subcellular&decompressMethod=unzip'
or alternatively specify the filename (presumably the name of the file the user selected on their local file system) using
curl -v -X POST \
--data-binary @sources/proteinatlas/subcellular_location.csv.zip \
'http://localhost:22089/sources?name=subcellular&input=subcellular_location.csv.zip'
Since all the functionality is shared with the ingest tool, if an input filename is specified, the input format and decompression method will be inferred from the filename (unless manually overridden).
It is possible to post a source with a unique identifier (`sourceIdentifier`).
If a source already has the same identifier, the new source won't be ingested
and the already identified source will be returned. If no such source already
exists, the new source will be tagged with the identifier.
curl -v -X POST \
--data-binary @sources/proteinatlas/subcellular_location.csv.zip \
'http://localhost:22089/sources?sourceIdentifier=1246dafebb1'
Note that remote files can also be ingested by specifying the file's URL in the `remote` parameter (see below for a list of parameters). In that case the request body will be empty and the dataset will be fetched by the lagoon-server:
curl -v -X POST \
'http://localhost:22089/sources?name=subcellular&remote=http%3A%2F%2Fsome-server.example.com%2Ffoo%2Fdataset.txt'
Note that the value of the `remote` parameter, `http://some-server.example.com/foo/dataset.txt`, had to be URL-encoded. This should be handled for you if you are using third-party libraries to craft HTTP requests. The Lagoon server can work with the `http://`, `https://` and `s3://` protocols (see the notes on S3 for more information).
Note that all the query parameters described above (like `name`, `tag`, etc.) will work with remote files as they would with locally uploaded files. For instance, using the `tag` query parameter when uploading an S3-stored file `some-object-key` from the bucket `some-bucket`:
curl -v -X POST \
'http://localhost:22089/sources?name=subcellular&remote=s3%3A%2F%2Fsome-bucket%2Fsome-object-key%2Fdataset.txt&tag=foo&tag=bar'
No matter how the file is uploaded, the result of a successful ingest is a JSON structure similar to the one returned by `/source/:SourceIx`.
The response body for uploading files to the server will consist of a stream of JSON progress values, separated by newlines, with a final JSON value with information about the created source.
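Since the progress values are newline-separated JSON, a simple way to watch them from the command line is to stream the response through `jq` (curl's `-N`/`--no-buffer` flag makes the values appear as they arrive):
```
# Stream progress values as they arrive; the final value describes the created source
curl -N -X POST \
  --data-binary @sources/proteinatlas/subcellular_location.csv \
  'http://localhost:22089/sources?name=subcellular' | jq -c .
```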
Query parameters:
Parameter | Type | Ingest equivalent |
---|---|---|
`name` | String | `--name` |
`fileType` | FileType | `--comma` / `--tab` / `--json` |
`peekAt` | Int | `--peek-at` |
`decompressMethod` | DecompressMethod | `--unzip` |
`jsonPath` | JsonPath | `--json-path` |
`encoding` | Encoding | `--latin1`, `--utf8` |
`description` | String | `--description` |
`input` | FilePath | (Name of the uploaded file) |
`remote` | FilePath | (URL of the file to ingest) |
`tag` | TagName | `--tag` (can be used multiple times) |
`sourceIdentifier` | String | `--source-identifier` |
Query flags:
Flag | Ingest equivalent |
---|---|
`noHeaders` | `--no-headers` |
`disableQuoteChar` | `--disable-quote-char` |
`noTypeInference` | `--no-type-inference` |
`noTyped` | `--no-typed` |
`noIndices` | `--no-indices` |
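As an illustration of combining several of the parameters and flags above into one upload request (note that exactly how boolean flags are passed in the query string, bare or as `flag=true`, is an assumption in this sketch):
```
# Hypothetical upload combining documented parameters and flags; the bare
# "noHeaders" flag syntax is an assumption, not confirmed by this document.
curl -v -X POST \
  --data-binary @sources/example/data.csv \
  'http://localhost:22089/sources?name=example&input=data.csv&tag=foo&tag=bar&description=Example%20dataset&noHeaders'
```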
Some information for the client-side implementation:
- Sending just the file, without any kind of FORM encoding: http://stackoverflow.com/a/28193031/742991
- How to use `XMLHttpRequest` to monitor progress: https://zinoui.com/blog/ajax-request-progress-bar
- The specification of `XMLHttpRequest`: https://xhr.spec.whatwg.org/
In earlier versions of `XMLHttpRequest` there was no way of monitoring progress,
and some kind of secondary connection to the server was necessary. If we can get
away with only supporting modern browsers then we don't need to worry about
this.
TODO: Although the browser can monitor file upload progress, there are currently no specific provisions for monitoring ingest progress itself.
curl http://localhost:22089/source/5/download -o output-file
The server supports a number of different ways to search for sources. They fall into two categories: filtering based on individual pieces of data, and full text search.
The results of a search can be sorted through the `orderAsc` or `orderDesc` query parameters. These parameters can be specified multiple times to indicate sorting on several columns. The values for these parameters are `ix`, `sourcename`, `url`, `version`, `created`, `addedby`, `schema`, `tablename`, `viewname`, `description`. The number of search results can be limited through the `limit` parameter; pagination is supported through `offset`.
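For example, to page through sources ten at a time, newest first and then by name (the column values come from the list above):
```
# Second page of ten results, sorted by creation time (descending), then by source name
curl "http://localhost:22089/sources?orderDesc=created&orderAsc=sourcename&limit=10&offset=10"
```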
The full text query language is relatively full-featured and builds on top of PostgreSQL's full text support. The grammar is given by
<expr> ::= <expr> | <expr>
| <expr> & <expr>
| <expr> <expr> -- treated as '&'
| ! <expr>
| <label> : <expr>
| LEXEME
| ( <expr> )
where `<label>` is one of `description`, `name`, `user`, `tag`, `column`, and `LEXEME` is a lexeme (a search term consisting of letters only, which will be stemmed).
This query language is parsed and then translated to PostgreSQL's internal syntax for queries, which from the point of view of Lagoon clients (software or users) is just an implementation detail that they do not need to be concerned with.
Some examples:
- If we have two sources with descriptions "CPIC Guideline for allopurinol and HLA B" and "CPIC Guideline for warfarin and CYP2C9 VKORC1" respectively, and both have a tag "dosing", then the search query `"dosing guidelines"` will find both of these sources; note that in this example, `"dosing"` gets matched against the tag whereas `"guidelines"` gets matched against `"guideline"` in the description (thanks to stemming).
- The same two sources would also be matched by the search query `"tag:dosing guidelines"`, but neither of them would match the search query `"dosing tag:guidelines"`.
- The query `"cpic !warfarin"` would match the first but not the second, while `"cpic | !warfarin"` would additionally match a whole lot of other sources (as long as they either match `"cpic"` or do not match `"warfarin"`).
- The query `"tag:protein"` would match sources tagged with `"proteinatlas"` (prefix matching). However, full text search does not support general substring matching; searching for `"tag:atlas"` will not find these sources.
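To try these from the command line, such queries can be passed through the `search` query parameter described earlier; remember to URL-encode them:
```
# Full text search for 'tag:dosing guidelines' (URL-encoded)
curl "http://localhost:22089/sources?search=tag%3Adosing%20guidelines"
```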
The parser is carefully written to be very tolerant; all syntax errors are silently corrected. For example:
- `"tag:pharm && !tag:dosing"` (double `&`) will be treated as `"tag:pharm tag:dosing"`
- `"x & (y | z"` (missing closing parenthesis) will be treated as `"x & (y | z)"`
- Invalid labels (`"foo:dosing"`) will be ignored
- Invalid characters (symbols, digits, etc.) will silently be dropped
And so on. No search query should ever lead to an SQL exception; if one does, it should be considered a bug and fixed.
Moreover, we explicitly check for empty search queries; since PostgreSQL removes
stop-words from search queries, it might not always be obvious when a search
query is empty. For instance, the search query `"a"` is empty, since "a" (the indefinite article) is a stop-word. When a search query is empty, we simply
drop that part of the search specification (i.e., we might return all sources,
depending on other search parameters).
KNOWN LIMITATION: Support for labels (`"tag:..."`, `"description:..."`, etc.) is based on PostgreSQL's support for weights of search terms. Currently we weigh terms as follows:
- Source name: weight A (most relevant)
- Tags and column names: weight B
- Description: weight C
- User: weight D (least relevant)
These weights can be used for ranking (although we don't currently do that),
but they are also (necessarily) what underlies support for labels. However,
PostgreSQL supports a maximum of 4 weight categories (A, B, C, D). Currently
both tags and column names are given weight B and consequently searching for `"tag:foo"` might match sources that have a tag matching `"foo"` or a column matching `"foo"` (and similarly when searching for `"column:foo"`).
We support filtering on the following non-textual data:
- Filter on source identifier; query parameter `ix`. This is mostly just here for completeness; if the front-end requires a specific source it can also use the `/source/:SourceIx` GET endpoint. Uses the primary key index.
- Filter on creation date; query parameters `createdAfter` and `createdBefore`. The format for these parameters is currently a timestamp of the form `YYYY-MM-DD HH:MM:SS`. Uses a btree index.
Then we support filtering on a number of text fields. These searches all rely
on PostgreSQL's `ILIKE` operator and are in all cases supported by a trigram GIST index to make that efficient.
- Filter on description; query parameter `description`.
- Filter on source name; query parameter `name`.
- Filter on tags; query parameter `tag`. If this parameter is specified multiple times, all tags are required to be present.
- Filter on column names; query parameter `column`. If this parameter is specified multiple times, only sources that have all of the specified columns are returned.
- Filter on user name; query parameter `user`. If this parameter is specified multiple times, sources created by any of the specified users are returned. (Requiring all would not make sense, as sources are only created by a single user.)
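These filters can be combined freely with each other and with the other query parameters; for instance (the tag, user, and date values are purely illustrative):
```
# Sources tagged with both "dosing" and "pharmacogenomics", created by user
# "alice" after noon on 2016-10-19
curl "http://localhost:22089/sources?tag=dosing&tag=pharmacogenomics&user=alice&createdAfter=2016-10-19%2012:00:00"
```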
PostgreSQL's support for full text search is quite sophisticated, but it is geared towards natural language search. Documents are split into tokens, stemming is applied, stop words are removed, etc. This is often what we want, but not always. Suppose we have a source with description
Result of the analysis
Then a full text search for `results` will find this source, even though we used the plural form; this is a consequence of stemming.
However, if we search for `sis`, we will not find this source using full text search; after all, from a natural language perspective, `analysis` and `sis` have nothing to do with each other. But an `ILIKE` (substring) search for `sis` will find this source.
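The difference is easy to see from the command line (assuming a source whose description is "Result of the analysis", as above):
```
# Full text search: stemming matches "results" against "Result"...
curl "http://localhost:22089/sources?search=results"
# ...but "sis" is not a lexeme of "analysis", so this finds nothing
curl "http://localhost:22089/sources?search=sis"
# A substring (ILIKE) filter on the description does match "sis"
curl "http://localhost:22089/sources?description=sis"
```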
Currently we do not support approximate matching when doing a substring search; if we can rely on PostgreSQL 9.6, then this would be very easy to do. If we want this feature but use older versions of PostgreSQL, it would be relatively expensive to implement.
There are many ways in which we can search. The above provides one approach, which is relatively flexible and hopefully provides some good defaults. But of course as requirements emerge the design can be altered.
All of the searches are supported by an appropriate index, so searching should scale relatively well as the number of sources goes up. If efficiency does become a concern, there are two further improvements we can make:
- Right now we only have indices for each individual kind of query; we have no combined indices. This means that if the user searches on multiple kinds of data, we need to consult several indices and then combine the results. If the number of sources becomes really huge, it might be beneficial to provide some combined indices for common combinations of search types.
- There are no special provisions in place for making pagination efficient. If this is a problem, we would need some kind of server state to maintain a cursor into a result set which the client can then query.