  1. mod-inventory
  2. MODINV-444

Spike: CPU spikes related to events_cache topic



    • 0
    • Folijet Support
    • Lotus R1 2022
    • Cornell
    • Not a bug


      CPU spikes in mod-inventory appear to degrade the performance of Data Import and of mod-inventory itself. The spikes occur during the import and continue well after it completes, and they slow the import down. Observations:

      • mod-inventory's CPU spikes during the import and for several hours afterward.
        A data import ran between the blue lines, from 22:00 to 00:48. Spikes after that are abnormal.
      • Most of the spikes coincide with the events_cache topic's incoming message rate reaching 1K or more per second.
        The graph corresponds to the same import noted above. Note that in this instance there were no events_cache spikes after 00:48.
      • One of the Kafka brokers' CPU also spiked at the same time.
        The Kafka brokers' CPU graph corresponds to the import noted above. Note that the broker's CPU spikes line up with mod-inventory's spikes.
      • Prometheus's data appears to have gaps. The two graphs show gaps in data collection, and the events_cache spikes happened toward the end of those gaps. Was the broker's performance so bad that data collection had gaps? Interestingly, the spikes in the Error graph happened at the same time. According to AWS, this metric counts when the brokers "can't write" (not sure to where).
      • Note the lulls in all DI topics before the spikes (the area between the blue drawn lines). This is when mod-inventory's threads are being blocked.
      • The following messages were logged during the lull in activity before the spikes in mod-inventory and events_cache. A full message is:
        23:39:23 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-8,5,main] has been blocked for 823487 ms, time limit is 60000 ms
      • This graph shows the brokers' CPU at the time of the lull; notably, broker 2 spiked at that time.
      • Broker 2's logs are attached: 0629-2325-2340-broker2-lull.csv
      • A couple of exceptions logged after the lull, during and around the peak, are attached: ThreadsBlockedExceptions.txt
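
      To quantify the blocked-thread warnings quoted above across a full log, a small parser can pull out the thread name and the blocked duration. The following is only a sketch: the regular expression is inferred from the single sample message in this ticket, and the names `BLOCKED_RE`, `BlockedThread`, and `parse_blocked_warning` are hypothetical helpers, not part of mod-inventory or Vert.x.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one sample warning in this ticket (an assumption);
# other log layouts may need adjustments.
BLOCKED_RE = re.compile(
    r"Thread Thread\[(?P<name>[^,\]]+),(?P<priority>\d+),(?P<group>[^\]]*)\] "
    r"has been blocked for (?P<blocked_ms>\d+) ms, "
    r"time limit is (?P<limit_ms>\d+) ms"
)


class BlockedThread(NamedTuple):
    """One parsed blocked-thread warning (hypothetical helper type)."""
    name: str
    blocked_ms: int
    limit_ms: int

    @property
    def over_limit_factor(self) -> float:
        # How many times the reported time limit the thread was blocked for.
        return self.blocked_ms / self.limit_ms


def parse_blocked_warning(line: str) -> Optional[BlockedThread]:
    """Return the parsed warning, or None if the line is not a blocked-thread warning."""
    match = BLOCKED_RE.search(line)
    if match is None:
        return None
    return BlockedThread(
        name=match.group("name"),
        blocked_ms=int(match.group("blocked_ms")),
        limit_ms=int(match.group("limit_ms")),
    )


# The sample line is the message quoted in this ticket.
sample = (
    "23:39:23 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-8,5,main] "
    "has been blocked for 823487 ms, time limit is 60000 ms"
)
hit = parse_blocked_warning(sample)
if hit is not None:
    print(f"{hit.name} blocked for {hit.blocked_ms / 1000:.1f} s "
          f"({hit.over_limit_factor:.1f}x the limit)")
```

      Running this over mod-inventory's logs and bucketing the hits by timestamp would show whether the blocked periods line up with the lulls and the subsequent events_cache spikes. For the sample message, the worker thread had been blocked for roughly 13.7 minutes.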

      The goal of this JIRA is to understand why the CPU spikes occur and how they affect DI's performance, and ultimately to come up with a fix.

      Steps to Reproduce:
      This happens randomly. Sometimes there are no spikes at all during an import; sometimes there are multiple spikes. Often several spikes occur periodically after the import has completed.

      Additional Information:
      mod-inventory's logs can be provided.

      Interested parties: abreaux



          1. 0629-2325-2340-broker2-lull.csv
            327 kB
          2. brokers-CPU.png
            32 kB
          3. image-2021-06-24-05-56-17-004.png
            114 kB
          4. image-2021-06-24-05-59-15-685.png
            66 kB
          5. image-2021-06-24-06-01-38-684.png
            119 kB
          6. image-2021-06-24-06-03-49-115.png
            50 kB
          7. image-2021-06-24-06-04-32-985.png
            37 kB
          8. LullMessages.png
            123 kB
          9. lull-period.png
            85 kB
          10. mod-inventory-kafka-wrapper.txt
            108 kB
          11. Screen Shot 2021-08-12 at 8.01.28 AM.png
            155 kB
          12. Screen Shot 2021-08-12 at 8.14.37 AM.png
            265 kB
          13. ThreadsBlockedExceptions.txt
            5 kB




                Unassigned Unassigned
                mtraneis Martin Tran


