Uploaded image for project: 'mod-users'
  1. mod-users
  2. MODUSERS-217

Improve users bulk insert/update performance

    XMLWordPrintable

    Details

    • Template:
    • Development Team:
      None

      Description

      I did a load of our 116,521 users using the mod-users CRUD web API. It took an hour and 33 minutes. The same load using SQL took a minute and 17 seconds.

      I enabled PostgreSQL SQL logging to see what it was doing and found that it is doing the following for each INSERT of a user.

      2020-09-01 17:07:47.293 CDT [564045] diku_mod_users@folio LOG:  statement: SELECT count_estimate('SELECT * FROM diku_mod_users.custom_fields WHERE true')
      2020-09-01 17:07:47.294 CDT [564045] diku_mod_users@folio LOG:  statement: SELECT * FROM diku_mod_users.custom_fields WHERE true ORDER BY custom_fields.jsonb->'order' LIMIT 2147483647 OFFSET 0
      2020-09-01 17:07:47.295 CDT [563965] diku_mod_users@folio LOG:  statement: SELECT count_estimate('SELECT jsonb FROM diku_mod_users.addresstype WHERE id = ''2abe6fe7-519f-40fd-ba24-9c2a14fbf4cb''')
      2020-09-01 17:07:47.295 CDT [563965] diku_mod_users@folio LOG:  statement: SELECT jsonb FROM diku_mod_users.addresstype WHERE id = '2abe6fe7-519f-40fd-ba24-9c2a14fbf4cb'
      2020-09-01 17:07:47.296 CDT [564045] diku_mod_users@folio LOG:  execute <unnamed>: SELECT jsonb FROM diku_mod_users.groups WHERE id= $1
      2020-09-01 17:07:47.296 CDT [564045] diku_mod_users@folio DETAIL:  parameters: $1 = '7da4ba23-cfc5-4aec-bdb2-ee6720b0b76d'
      2020-09-01 17:07:47.296 CDT [563965] diku_mod_users@folio LOG:  execute <unnamed>: INSERT INTO diku_mod_users.users (id, jsonb) VALUES ($1, $2) RETURNING jsonb
      2020-09-01 17:07:47.296 CDT [563965] diku_mod_users@folio DETAIL:  parameters: $1 = 'cb92353e-208b-45bb-81ce-1673176455de', $2 = '{"id": "cb92353e-208b-45bb-81ce-1673176455de"... }'
      2020-09-01 17:07:47.308 CDT [564032] diku_mod_permissions@folio LOG:  statement: SELECT count_estimate('SELECT jsonb FROM diku_mod_permissions.permissions_users WHERE ((jsonb->>''id'') = ''cb92353e-208b-45bb-81ce-1673176455de'' OR (jsonb->>''userId'') = ''cb92353e-208b-45bb-81ce-1673176455de'')')
      2020-09-01 17:07:47.324 CDT [564032] diku_mod_permissions@folio LOG:  statement: BEGIN
      2020-09-01 17:07:47.324 CDT [564032] diku_mod_permissions@folio LOG:  execute <unnamed>: INSERT INTO diku_mod_permissions.permissions_users (id, jsonb) VALUES ($1, $2) RETURNING id
      2020-09-01 17:07:47.324 CDT [564032] diku_mod_permissions@folio DETAIL:  parameters: $1 = 'cb92353e-208b-45bb-81ce-1673176455de', $2 = '{"id": "cb92353e-208b-45bb-81ce-1673176455de"... }'
      2020-09-01 17:07:47.325 CDT [564032] diku_mod_permissions@folio LOG:  statement: COMMIT
      

      To improve performance I would recommend making the following changes.

      1. RMB-724, RMB-718: For web API query methods, don't include a count by default. Add a parameter that optionally includes it. It may make sense to include it for a web app grid control, but, there are plenty of other times where you don't want it and it is just unnecessarily slowing things down. That would eliminate the calls to count_estimate() above.
      2. MODUSERS-3: Add a web API that accepts an array of objects like the ones that already exist for instances, holdings, and items. As you can see above an SQL COMMIT is happening for each user inserted. For decent performance, you need to have the ability to do a commit with a larger batch size.
      3. Eliminate the custom_fields query. An idea is to store the custom field info as part of the object's JSON schema. You could do something like read the JSON schema file when the module is loaded, and dynamically add the custom field info to it. Then, only when the custom fields are modified reload the schema.
      4. Eliminate the groups query. The patronGroup JSON property gets copied to the patrongroup column which has a foreign key constraint on it. Therefore, if the groups query is happening to validate the group id, it isn't needed as the foreign key constraint will take care of that automatically.
      5. MODUSERS-204: Eliminate the INSERT into permissions_users. The only time a user should have to have a permissions_user row is if the user actually has permissions. At our site, there will likely be less than 800 users that have permissions, out of the close to 120,000 total users. It's a waste of time inserting permissions_users for users that have no permissions. It is probably better that users and permissions modules not have that tight of coupling anyway. This woudl also eliminate the query to see whether the permissions_user exists.
      6. Remove the addresses array from the users table JSON schema and store it in a separate user_addresses table. If you do this, then, you can add a foreign key constraint to the addressTypeId field and can eliminate the query to the addresstypes table.

      In summary, to improve bulk insert performance of users, you need to remove the N+1 queries that are happening on each INSERT of a user. You also need to be able to control the batch size and have the ability to do something like only do a SQL COMMIT every 1,000 rows. If you do all of the above, I think you will see a dramatic improvement in performance.

      I haven't looked at UPDATE. I'm assuming it probably works similarly.

        TestRail: Results

          Attachments

            Issue Links

              Activity

                People

                Assignee:
                Unassigned Unassigned
                Reporter:
                jemiller Jon Miller
                Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                  Dates

                  Created:
                  Updated:

                    TestRail: Runs

                      TestRail: Cases