Uploaded image for project: 'mod-inventory-storage'
  1. mod-inventory-storage
  2. MODINVSTOR-705

Query is 3x slower with 10M records than with 7.5M records

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • P2
    • Resolution: Done
    • None
    • 20.1.0
    • None
    • Standard Bug Write-Up Format
    • CP: sprint 110
    • 3
    • Core: Platform
    • R1 2021 Bug Fix

    Description

      Searching instances by the contributor field in a 10 million records table is 3x slower than in a 7.5 million records table.

      How to reproduce:
      In Inventory search select "Contributors" search and enter

      Bach, Johann Sebastian

      CQL generated by the front-end:

      (contributors =/@name "Bach, Johann Sebastian") sortby title 

      CQL query request sent by the front-end:

      GET /inventory/instances query=%28contributors+%3D%2F%40name+%22Bach%2C+Johann+Sebastian%22%29+sortby+title&limit=100&offset=0
      

      SQL query:

      SELECT jsonb, (lower(f_unaccent(jsonb->>'title')) ) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE (get_tsvector(f_unaccent(instance.jsonb->>'contributors')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->'contributors') as c ) as t where (get_tsvector(f_unaccent(t.c->>'name')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))) AND left(lower(f_unaccent(jsonb->>'title')) ,600) < ( SELECT left(lower(f_unaccent(jsonb->>'title')) ,600) FROM fs09000000_mod_inventory_storage.instance ORDER BY left(lower(f_unaccent(jsonb->>'title')) ,600) OFFSET 10000 LIMIT 1 ) ORDER BY left(lower(f_unaccent(jsonb->>'title')) ,600) LIMIT 100 OFFSET 0 ), allrecords AS ( SELECT jsonb, lower(f_unaccent(jsonb->>'title')) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE (get_tsvector(f_unaccent(instance.jsonb->>'contributors')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->'contributors') as c ) as t where (get_tsvector(f_unaccent(t.c->>'name')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))) AND (SELECT COUNT(*) FROM headrecords) < 100 ), totalCount AS (SELECT fs09000000_mod_inventory_storage.count_estimate(' SELECT lower(f_unaccent(jsonb->>''title'')) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE get_tsvector(f_unaccent(instance.jsonb->>''contributors'')) @@ tsquery_phrase(f_unaccent(''Bach, Johann Sebastian'')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->''contributors'') as c ) as t where (get_tsvector(f_unaccent(t.c->>''name'')) @@ tsquery_phrase(f_unaccent(''Bach, Johann Sebastian''))))') AS count) SELECT jsonb, data_column, (SELECT count FROM totalCount) FROM headrecords WHERE (SELECT COUNT(*) FROM headrecords) >= 100 UNION (SELECT jsonb, data_column, (SELECT count FROM totalCount) FROM allrecords ORDER BY data_column LIMIT 100 OFFSET 0 ) ORDER BY data_column
      

      This API call took about 30s to run end-to-end, from the UI in the 10M instances table. The same call took about 10s to run end to end, from the UI, in the 7.5M instances table.

      Can this query be optimized so that it doesn't suffer when the dataset's size is 10M+?

      Attached are explain-analyzed logs for the query above, one from bugfest with 7.5M instances, the other from an environment with 10M instances. Comparing the two logs, does line 43 suggest that the index only being used to produce the CTEs but that when recursing through them, the index isn't used because it's working off of the CTEs, hence there is a exception for each of the inner queries? In which case the query execution time would become roughly linear with the number or rows. Looks like the second CTE being the killer.

      10M records database, line 43: -> Subquery Scan on t_1 (cost=0.00..52.27 rows=1 width=16) (actual time=0.058..0.058 rows=1 loops=16412)
      7.5M records database, line 43: -> Subquery Scan on t_1 (cost=0.00..52.27 rows=1 width=16) (actual time=0.031..0.031 rows=1 loops=6580)

      Also, could (SELECT COUNT FROM headrecords) < 100 results in a lot of loops in the case of the second CTE (allrecords)?

      Query Analysis

      The contributors array of JSON objects looks like this:

      "contributors" : [ {
          "contributorNameTypeId" : "35b01201-597f-4938-a0d1-110245ba9d67",
          "name" : "Bach, Johann Sebastian, 1685-1750",
          "contributorTypeId" : null,
          "contributorTypeText" : null,
          "primary" : true
        }, {
          "contributorNameTypeId" : "35b01201-597f-4938-a0d1-110245ba9d67",
          "name" : "Palmer, Willard A.",
          "contributorTypeId" : null,
          "contributorTypeText" : null,
          "primary" : false
        }, {
      

      Using "sortby title" results in a long optimized SQL query, for details see https://github.com/folio-org/raml-module-builder/blob/v32.2.0/domain-models-runtime/src/main/java/org/folio/rest/persist/PgUtil.java#L1587

      CQL query without sorting:

      contributors =/@name "Bach, Johann Sebastian"
      

      Generated SQL:

      WHERE get_tsvector(f_unaccent(instance.jsonb->>'contributors'))
                @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))
      AND id in
          (SELECT t.id from
              (SELECT id as id,
                      jsonb_array_elements(instance.jsonb->'contributors') as c
              ) as t
           WHERE (get_tsvector(f_unaccent(t.c->>'name'))
                    @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))
      

      This is a two-step search:
      First step is in the first two lines: Use the contributors full text index and find the phrase.
      This matches in all contributors fields, not only in the name field, for example in contributorTypeText.
      The remaining SQL code extracts the name field and matches the name field text. This requires to fetch the complete records of the first step because we don't have an index for the contributors name field.

      TestRail: Results

        Attachments

          Issue Links

            Activity

              People

                hji Hongwei Ji
                mtraneis Martin Tran
                Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                  Created:
                  Updated:
                  Resolved:

                  TestRail: Runs

                    TestRail: Cases