Uploaded image for project: 'cql2pgjson'
  1. cql2pgjson
  2. CQLPG-81

Redesign CQL to SQL generation

    XMLWordPrintable

Details

    • Core: Platform - Sprint 61, CP: sprint 63
    • 8
    • Core: Platform

    Description

      Current CQL generation algorithm

      if field is 'id' //special case
        validate UUID
        right truncation (= range) is allowed for == and <>
        == <> < <= > >= are allowed if comparing to a valid UUID
      if field in 'fullTextIndex' section //
        use ts_vector (right now mod-inv-storage it is used for title, languages, instanceTypeId)
      else //this branch also includes handling of the JSON schema for the record
        if relation is <=, <,  >=, > then use < .. (same OP) + unaccent/lower function
        if relation is = then use ~ for a regex match + unaccent/lower function 1)
        if relation is == or <> then use LIKE / NOT LIKE  match +unaccent/lower function 1)
      

      1) If the CQL query contains a /respectAccents or /respectCase modifier It always uses the unaccent/lower index and filters the results from the index afterwards.

      New CQL generation algorithm

      The entity schema should no longer be used when generating SQL. SQL should be generate only based on the index configuration from schema.json and query generation should fail when there is no index setup for a given field.

      if relation is = (alias adj) or <> #= behaves the same as == for all index types but fulltext
        if field in 'fullTextIndex' section
          use ts_vector, if right truncation use :*
        else if field in 'ginIndex' section
          use LIKE + unaccent/lower, if truncation use "%"
        else if field in 'index' section
          use exact SQL match, no truncation allowed (it will result in linear scan)
        else fail
      if relation == #behaves the same as = but does not work with fulltext index
        if field in 'ginIndex' section
          use LIKE + unaccent/lower, if truncation use "%"
        else if field in 'index' section
          use exact SQL match, no truncation allowed (results in linear scan)
        else fail
      if relation < > <= >=
        if field in 'index'
          use SQL equivalents, no truncation allowed (results in linear scan)
        else fail
      

      1) If the CQL query contains a /respectAccents or /respectCase modifier It always uses the unaccent/lower index and filters the results from the index afterwards.

      Proposed new CQL generation algorithm

      THIS IS NOT THE ALGO IMPLEMENTED IN THE END, SEE BELOW

      The entity schema should no longer be used when generating SQL. SQL should be generate only based on the index configuration from schema.json and query generation should fail when there is no index setup for a given field.

      if relation is = (alias adj) or <> #= behaves the same as == for all index types but fulltext
        if field in 'fullTextIndex' section
          use ts_vector, if right truncation use :*
        else if field in 'ginIndex' section
          use LIKE + unaccent/lower, if truncation use "%"
        else if field in 'index' section
          use exact SQL match, no truncation allowed (it will result in linear scan)
        else fail
      if relation == #behaves the same as = but does not work with fulltext index
        if field in 'ginIndex' section
          use LIKE + unaccent/lower, if truncation use "%"
        else if field in 'index' section
          use exact SQL match, no truncation allowed (results in linear scan)
        else fail
      if relation < > <= >=
        if field in 'index'
          use SQL equivalents, no truncation allowed (results in linear scan)
        else fail
      

      2)

      Implemented new CQL generation algorithm

      if relation is =
        if (modifier format number)
          use exact SQL match, warn if no btree index specified
        if (modifiers are ignoreAccents and ignoreCase) #which is the default value of those modifiers
          use ts_vector (fulltext), warn if no index specified
        else (none of the above)
           use LIKE + unaccent/lower, if truncation use "%", warn if no GIN index specified 
      if relation is adj, all OR any
         use ts_vector (fulltext), warn if no index specified
      if relation == OR <>
        if (modifers format string)
           use LIKE + unaccent/lower, if truncation use "%", warn if no GIN index specified 
        if (none of the above)
          use exact SQL match, warn if no btree index specified
      if relation < > <= >=
        use exact SQL match, warn if no btree index specified
      

      Note 1: we will no longer try to emulate fulltext matching for = (aka adj) using the ~ operator and regex word boundaries, the tokenization will be handled by the fulltext index for all fulltext fields like "title" but also for tokenization of arrays (depending on the outcome of CQLPG-83. The query time code in CQL2PG should be removed. In the case where there is no fulltext index, = (aka adj) behaves the same as ==.

      Example regex case that should no longer be used:

      WHERE lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"water.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))
      

      Note 6: we could consider aliasing = with == to simplify the algorithm further, what this would mean in practice is that we would lose the ability to do "exact" matches for any fields with a fulltext index (e.g title but also array fields if we handle them with the fulltext index). We could consider adding this at a later stage e.g by introducing a modifier like string for the relation

      Note 2: the "ginIndex" should be used only for "string" identifier or facet fields (e.g resourceType of instance) and always use an exact match both for = and == relations (no distinction). The only normalisation we will do during index/query time is lowercase and unaccent. We will support right truncation as well (LIKE xx% supports it).

      Note 3: This mean that by default we will only support ignoreCase and ignoreAccents modifiers. All other modifiers will be failed.

      Note 4: the regular btree "index" will be used for ordered fields only like numbers, and for sorting.

      Note 5: only right truncation is allowed for fulltext so it will be problematic for contributor searches (if we enable fulltext) as those searches use truncation to march within a JSON array

      TestRail: Results

        Attachments

          Issue Links

            Activity

              People

                hji Hongwei Ji
                jakub Jakub Skoczen
                Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                  Created:
                  Updated:
                  Resolved:

                  TestRail: Runs

                    TestRail: Cases