In Durham we have discussed that the current approach to searching (which, essentially, is based on translating CQL queries to complex PSQL queries with LIKE/ILIKE/regex and creating BTREE and GIN indices on particular down-cased/unaccented columns, with support from RMB and CQL2PG) may not be flexible enough to provide the level of search functionality and performance expected. (Note: we do not have concrete requirements here, we are assuming that what is expected matches functionality in modern OPAC systems).
One of the approaches discussed was to use the Postgres built-in "Full Text Search" capability, documented here:
In short, the essential pieces of full text search in PG are two functions/data structures:
tsvector takes a “document” which can be a single column, multiple columns from a single table or multiple tables, a view, or the entire JSONB document and returns a list (vector) of lexemes, which are normalised tokens (case/accent normalised, with stopwords filtered and roots extrated).
tsquery does the same but with the user input, e.g a list of words to match. tsquery supports the same booleans as CQL so the translation is rather trivial — & for AND, || for OR, ! for NOT, <-> for PROX
One can run both tsvector and tsquery at the query time, but this obviously does not scale so normally tsvector is used at the indexing time to build a GIN index on the selected data.
tsvector/tsquery are the most fundamental data structures, PG includes more advanced functionalities expected in a search engine: ranking, highlighting/snippets, etc. It does not include specialized facet data structure – this needs to be implemented through other methods.
Since PG 10, tsvector can take the whole JSONB contents in one go:
This, however, drops the keys and only indexes the “contents” of the JSONB, so we would not be able to “scope” the searches to a given field. One proposed approach here is to use triggers to do advanced indexing: