mod-inventory / MODINV-444

Spike: CPU spikes related to events_cache topic

Details

    • 0
    • Folijet Support
    • Lotus R1 2022
    • Cornell
    • Not a bug

    Description

      Overview:
      Data Import's performance, and that of mod-inventory, seems to be affected by CPU spikes in mod-inventory. The spikes occurred during the import and continued well after it completed, and the import ran slower as a result. Here are the observations:

      • mod-inventory's CPU spiked during the import and for several hours afterward.
        A data import took place between the blue lines, between 22:00 and 00:48. Spikes after that are abnormal.
      • Most of the spikes correspond to the events_cache topic's incoming message rate reaching 1K or more per second.
        The graph corresponds to the same import noted above. Note that in this instance there were no events_cache spikes after 00:48.
      • One of the Kafka brokers' CPU also spiked at the same time.
        The Kafka brokers' CPU graph corresponds to the import noted above. Note that the broker's CPU spikes line up with mod-inventory's spikes.
      • Prometheus's data appears to have gaps. The two graphs show gaps in data collection, and the events_cache spikes happened toward the end of those gaps. Was the broker's performance so bad that data collection had gaps? Interestingly, the Error graph shows spikes in errors at the same time. According to AWS, this metric counts when the brokers "can't write" (it is not clear to where).
      • Note the lulls in all DI topics before the spikes (the area between the blue drawn lines). This is when mod-inventory's threads are being blocked.
      • Messages like the following were logged during the lull in activity before the spikes in mod-inventory and events_cache. A full message is:
        23:39:23 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-8,5,main] has been blocked for 823487 ms, time limit is 60000 ms
      • This graph shows the brokers' CPU at the time of the lull; notably, broker 2 spiked at that time.
      • Broker 2's logs are attached: 0629-2325-2340-broker2-lull.csv
      • A couple of exceptions logged after the lull, during and around the peak, are attached: ThreadsBlockedExceptions.txt
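
      To put the blocked-thread warning quoted above in perspective, the durations can be pulled out of the log lines and compared with the time limit. The following is a minimal sketch, not part of mod-inventory: the regular expression is derived only from the single message shown in this ticket, and the names `BLOCKED_RE` and `parse_blocked_warning` are hypothetical. Other Vert.x versions or log layouts may format the message differently.

```python
import re

# Pattern inferred from the one warning quoted in this ticket (an assumption,
# not an official Vert.x log format specification).
BLOCKED_RE = re.compile(
    r"Thread Thread\[(?P<thread>[^,\]]+),\d+,[^\]]*\] has been blocked for "
    r"(?P<blocked_ms>\d+) ms, time limit is (?P<limit_ms>\d+) ms"
)


def parse_blocked_warning(line):
    """Return (thread_name, blocked_ms, limit_ms), or None if the line does not match."""
    match = BLOCKED_RE.search(line)
    if match is None:
        return None
    return (
        match.group("thread"),
        int(match.group("blocked_ms")),
        int(match.group("limit_ms")),
    )


# The full message from the ticket.
line = (
    "23:39:23 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-8,5,main] "
    "has been blocked for 823487 ms, time limit is 60000 ms"
)

thread, blocked_ms, limit_ms = parse_blocked_warning(line)
# Express the blocked time in minutes and as a multiple of the configured limit.
print(f"{thread}: blocked {blocked_ms / 60000:.1f} min, {blocked_ms / limit_ms:.1f}x the limit")
```

      For the message above, the worker thread had been blocked for roughly 13.7 minutes against a 60-second limit. Running the same parser over the full mod-inventory logs could show whether the blocked periods line up with the lulls in the DI topics.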

      The goal of this JIRA is to understand why the CPU spikes occur and how they affect DI's performance, and finally to come up with a fix.

      Steps to Reproduce:
      This happens randomly. Some imports show no spikes at all, while others show multiple spikes. Often, several spikes occur periodically after the import has completed.

      Additional Information:
      mod-inventory's logs can be provided.

      Interested parties: abreaux

        Attachments

          1. 0629-2325-2340-broker2-lull.csv
            327 kB
          2. brokers-CPU.png
            32 kB
          3. image-2021-06-24-05-56-17-004.png
            114 kB
          4. image-2021-06-24-05-59-15-685.png
            66 kB
          5. image-2021-06-24-06-01-38-684.png
            119 kB
          6. image-2021-06-24-06-03-49-115.png
            50 kB
          7. image-2021-06-24-06-04-32-985.png
            37 kB
          8. LullMessages.png
            123 kB
          9. lull-period.png
            85 kB
          10. mod-inventory-kafka-wrapper.txt
            108 kB
          11. Screen Shot 2021-08-12 at 8.01.28 AM.png
            155 kB
          12. Screen Shot 2021-08-12 at 8.14.37 AM.png
            265 kB
          13. ThreadsBlockedExceptions.txt
            5 kB

              People

                Unassigned
                mtraneis Martin Tran
                Votes: 0
                Watchers: 5