Details
-
Bug
-
Status: Closed (View Workflow)
-
P2
-
Resolution: Done
-
None
-
None
-
-
CP: sprint 110
-
3
-
Core: Platform
-
R1 2021 Bug Fix
Description
Searching instances by the contributor field in a 10 million records table is 3x slower than in a 7.5 million records table.
How to reproduce:
In Inventory search select "Contributors" search and enter
Bach, Johann Sebastian
CQL generated by the front-end:
(contributors =/@name "Bach, Johann Sebastian") sortby title
CQL query request sent by the front-end:
GET /inventory/instances query=%28contributors+%3D%2F%40name+%22Bach%2C+Johann+Sebastian%22%29+sortby+title&limit=100&offset=0
SQL query:
SELECT jsonb, (lower(f_unaccent(jsonb->>'title')) ) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE (get_tsvector(f_unaccent(instance.jsonb->>'contributors')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->'contributors') as c ) as t where (get_tsvector(f_unaccent(t.c->>'name')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))) AND left(lower(f_unaccent(jsonb->>'title')) ,600) < ( SELECT left(lower(f_unaccent(jsonb->>'title')) ,600) FROM fs09000000_mod_inventory_storage.instance ORDER BY left(lower(f_unaccent(jsonb->>'title')) ,600) OFFSET 10000 LIMIT 1 ) ORDER BY left(lower(f_unaccent(jsonb->>'title')) ,600) LIMIT 100 OFFSET 0 ), allrecords AS ( SELECT jsonb, lower(f_unaccent(jsonb->>'title')) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE (get_tsvector(f_unaccent(instance.jsonb->>'contributors')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->'contributors') as c ) as t where (get_tsvector(f_unaccent(t.c->>'name')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))) AND (SELECT COUNT(*) FROM headrecords) < 100 ), totalCount AS (SELECT fs09000000_mod_inventory_storage.count_estimate(' SELECT lower(f_unaccent(jsonb->>''title'')) AS data_column FROM fs09000000_mod_inventory_storage.instance WHERE get_tsvector(f_unaccent(instance.jsonb->>''contributors'')) @@ tsquery_phrase(f_unaccent(''Bach, Johann Sebastian'')) AND id in (select t.id from (select id as id, jsonb_array_elements(instance.jsonb->''contributors'') as c ) as t where (get_tsvector(f_unaccent(t.c->>''name'')) @@ tsquery_phrase(f_unaccent(''Bach, Johann Sebastian''))))') AS count) SELECT jsonb, data_column, (SELECT count FROM totalCount) FROM headrecords WHERE (SELECT COUNT(*) FROM headrecords) >= 100 UNION (SELECT jsonb, data_column, (SELECT count FROM totalCount) FROM allrecords ORDER BY data_column LIMIT 100 OFFSET 0 ) ORDER BY data_column
This API call took about 30s to run end-to-end, from the UI in the 10M instances table. The same call took about 10s to run end to end, from the UI, in the 7.5M instances table.
Can this query be optimized so that it doesn't suffer when the dataset's size is 10M+?
Attached are explain-analyzed logs for the query above, one from bugfest with 7.5M instances, the other from an environment with 10M instances. Comparing the two logs, does line 43 suggest that the index only being used to produce the CTEs but that when recursing through them, the index isn't used because it's working off of the CTEs, hence there is a exception for each of the inner queries? In which case the query execution time would become roughly linear with the number or rows. Looks like the second CTE being the killer.
10M records database, line 43: -> Subquery Scan on t_1 (cost=0.00..52.27 rows=1 width=16) (actual time=0.058..0.058 rows=1 loops=16412)
7.5M records database, line 43: -> Subquery Scan on t_1 (cost=0.00..52.27 rows=1 width=16) (actual time=0.031..0.031 rows=1 loops=6580)
Also, could (SELECT COUNT FROM headrecords) < 100 results in a lot of loops in the case of the second CTE (allrecords)?
Query Analysis
The contributors array of JSON objects looks like this:
"contributors" : [ { "contributorNameTypeId" : "35b01201-597f-4938-a0d1-110245ba9d67", "name" : "Bach, Johann Sebastian, 1685-1750", "contributorTypeId" : null, "contributorTypeText" : null, "primary" : true }, { "contributorNameTypeId" : "35b01201-597f-4938-a0d1-110245ba9d67", "name" : "Palmer, Willard A.", "contributorTypeId" : null, "contributorTypeText" : null, "primary" : false }, {
Using "sortby title" results in a long optimized SQL query, for details see https://github.com/folio-org/raml-module-builder/blob/v32.2.0/domain-models-runtime/src/main/java/org/folio/rest/persist/PgUtil.java#L1587
CQL query without sorting:
contributors =/@name "Bach, Johann Sebastian"
Generated SQL:
WHERE get_tsvector(f_unaccent(instance.jsonb->>'contributors')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian')) AND id in (SELECT t.id from (SELECT id as id, jsonb_array_elements(instance.jsonb->'contributors') as c ) as t WHERE (get_tsvector(f_unaccent(t.c->>'name')) @@ tsquery_phrase(f_unaccent('Bach, Johann Sebastian'))))
This is a two-step search:
First step is in the first two lines: Use the contributors full text index and find the phrase.
This matches in all contributors fields, not only in the name field, for example in contributorTypeText.
The remaining SQL code extracts the name field and matches the name field text. This requires to fetch the complete records of the first step because we don't have an index for the contributors name field.
TestRail: Results
Attachments
Issue Links
- is blocked by
-
RMB-644 implement generic sort optimization
-
- Blocked
-
- relates to
-
MODINVSTOR-707 Relese version v20.1.0
-
- Closed
-
-
UIIN-1451 use contributorsNames index for efficiency
-
- Closed
-