Details
-
New Feature
-
Status: Closed (View Workflow)
-
P3
-
Resolution: Done
-
None
-
customfield_11100 23199
-
Core: Platform - Sprint 61, CP: sprint 63
-
8
-
Core: Platform
Description
Current CQL generation algorithm
if field is 'id' //special case validate UUID right truncation (= range) is allowed for == and <> == <> < <= > >= are allowed if comparing to a valid UUID if field in 'fullTextIndex' section // use ts_vector (right now mod-inv-storage it is used for title, languages, instanceTypeId) else //this branch also includes handling of the JSON schema for the record if relation is <=, <, >=, > then use < .. (same OP) + unaccent/lower function if relation is = then use ~ for a regex match + unaccent/lower function 1) if relation is == or <> then use LIKE / NOT LIKE match +unaccent/lower function 1)
1) If the CQL query contains a /respectAccents or /respectCase modifier It always uses the unaccent/lower index and filters the results from the index afterwards.
New CQL generation algorithm
The entity schema should no longer be used when generating SQL. SQL should be generate only based on the index configuration from schema.json and query generation should fail when there is no index setup for a given field.
if relation is = (alias adj) or <> #= behaves the same as == for all index types but fulltext if field in 'fullTextIndex' section use ts_vector, if right truncation use :* else if field in 'ginIndex' section use LIKE + unaccent/lower, if truncation use "%" else if field in 'index' section use exact SQL match, no truncation allowed (it will result in linear scan) else fail if relation == #behaves the same as = but does not work with fulltext index if field in 'ginIndex' section use LIKE + unaccent/lower, if truncation use "%" else if field in 'index' section use exact SQL match, no truncation allowed (results in linear scan) else fail if relation < > <= >= if field in 'index' use SQL equivalents, no truncation allowed (results in linear scan) else fail
1) If the CQL query contains a /respectAccents or /respectCase modifier It always uses the unaccent/lower index and filters the results from the index afterwards.
Proposed new CQL generation algorithm
THIS IS NOT THE ALGO IMPLEMENTED IN THE END, SEE BELOW
The entity schema should no longer be used when generating SQL. SQL should be generate only based on the index configuration from schema.json and query generation should fail when there is no index setup for a given field.
if relation is = (alias adj) or <> #= behaves the same as == for all index types but fulltext if field in 'fullTextIndex' section use ts_vector, if right truncation use :* else if field in 'ginIndex' section use LIKE + unaccent/lower, if truncation use "%" else if field in 'index' section use exact SQL match, no truncation allowed (it will result in linear scan) else fail if relation == #behaves the same as = but does not work with fulltext index if field in 'ginIndex' section use LIKE + unaccent/lower, if truncation use "%" else if field in 'index' section use exact SQL match, no truncation allowed (results in linear scan) else fail if relation < > <= >= if field in 'index' use SQL equivalents, no truncation allowed (results in linear scan) else fail
2)
Implemented new CQL generation algorithm
if relation is = if (modifier format number) use exact SQL match, warn if no btree index specified if (modifiers are ignoreAccents and ignoreCase) #which is the default value of those modifiers use ts_vector (fulltext), warn if no index specified else (none of the above) use LIKE + unaccent/lower, if truncation use "%", warn if no GIN index specified if relation is adj, all OR any use ts_vector (fulltext), warn if no index specified if relation == OR <> if (modifers format string) use LIKE + unaccent/lower, if truncation use "%", warn if no GIN index specified if (none of the above) use exact SQL match, warn if no btree index specified if relation < > <= >= use exact SQL match, warn if no btree index specified
Note 1: we will no longer try to emulate fulltext matching for = (aka adj) using the ~ operator and regex word boundaries, the tokenization will be handled by the fulltext index for all fulltext fields like "title" but also for tokenization of arrays (depending on the outcome of CQLPG-83. The query time code in CQL2PG should be removed. In the case where there is no fulltext index, = (aka adj) behaves the same as ==.
Example regex case that should no longer be used:
WHERE lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"water.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))
Note 6: we could consider aliasing = with == to simplify the algorithm further, what this would mean in practice is that we would lose the ability to do "exact" matches for any fields with a fulltext index (e.g title but also array fields if we handle them with the fulltext index). We could consider adding this at a later stage e.g by introducing a modifier like string for the relation
Note 2: the "ginIndex" should be used only for "string" identifier or facet fields (e.g resourceType of instance) and always use an exact match both for = and == relations (no distinction). The only normalisation we will do during index/query time is lowercase and unaccent. We will support right truncation as well (LIKE xx% supports it).
Note 3: This mean that by default we will only support ignoreCase and ignoreAccents modifiers. All other modifiers will be failed.
Note 4: the regular btree "index" will be used for ordered fields only like numbers, and for sorting.
Note 5: only right truncation is allowed for fulltext so it will be problematic for contributor searches (if we enable fulltext) as those searches use truncation to march within a JSON array
TestRail: Results
Attachments
Issue Links
- blocks
-
CQLPG-41 Implement foreign key searches
-
- Closed
-
-
FOLIO-1934 RMB 25 release (performance, refactoring)
-
- Closed
-
-
STCOM-492 Change filter search cql syntax to use == not =
-
- Closed
-
- is blocked by
-
CQLPG-87 SPIKE: Determine the scope of the work required to implement cross-table queries
-
- Closed
-
-
CQLPG-90 Remove legacy query translation based on individual json schema files
-
- Closed
-
- is duplicated by
-
CQLPG-59 Get rid of the regexp indexes and queries
-
- Closed
-
- relates to
-
CQLPG-85 Redesign CQL to SQL generation (part 2)
-
- Closed
-
-
MODINVSTOR-182 analyze and optimize identifiers array indexes (ISBN, ISSN, LCCN, OCLC, …)
-
- Closed
-
-
RMB-380 implement new approach for searching array-of-strings and array-of-objects fields
-
- Closed
-
-
RMB-399 CQL2PG: Use "caseSensitive" "removeAccents" index options from schema.json
-
- Open
-
-
CQLPG-47 SPIKE: how to handle Numeric indexes in schema.json
-
- Open
-
-
CQLPG-91 generated queries are not aligned with lower/f_unaccent for indexes that use them
-
- Closed
-
-
FOLIO-1857 extract all unique CQL queries used from Okapi logfile on folio-snapshot
-
- Closed
-
-
MODINVSTOR-166 Analyze and optimize search indexes
-
- Closed
-