[Bug 1946793] Re: aodh uses deprecated gnocchi api to aggregate metrics and doesn't work properly

Thu Feb 2 14:30:09 UTC 2023

Hi @seyeongkim

These are some notes I took after applying the patch and trying to
auto-scale VMs with heat. Template:
https://paste.ubuntu.com/p/zQdtWRPMYd/

"""
Things to consider when Autoscaling:
----- 1. Granularity is tied to the metric. Check the ceilometer + gnocchi configs for the metric.
----- 2. cpu_util is deprecated. Use the cpu metric instead, taking the vCPU count of the flavor into account.

# The difference between successive measures
%CPU * vCPUs * Granularity * 10.000.000 =  Δ cpu metric
1%   * 1     * 300 s       * 10.000.000 =  3.000.000.000  [Eg. m1.nano]
1%   * 2     * 300 s       * 10.000.000 =  6.000.000.000  [Eg. m1.small]
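
The formula above can be sketched as a small helper, assuming the cpu metric is cumulative CPU time in nanoseconds (which is what the 10.000.000 factor implies: 1e9 ns per second divided by 100 percent). The function names are hypothetical and only illustrate the arithmetic.

```python
# 1e9 ns per second / 100 percent, the 10.000.000 factor from the formula
NS_PER_PERCENT_SECOND = 10_000_000


def cpu_delta_ns(percent, vcpus, granularity_s):
    """Expected increase of the cpu metric over one granularity window."""
    return percent * vcpus * granularity_s * NS_PER_PERCENT_SECOND


def cpu_percent(delta_ns, vcpus, granularity_s):
    """Inverse: average CPU load (in %) that produces a given delta."""
    return delta_ns / (vcpus * granularity_s * NS_PER_PERCENT_SECOND)


if __name__ == "__main__":
    print(cpu_delta_ns(1, 1, 300))   # m1.nano row
    print(cpu_delta_ns(1, 2, 300))   # m1.small row
    print(cpu_percent(210_000_000_000, 1, 300))
```

The inverse helper is handy when reading raw measures: divide the difference between two successive points by vCPUs * granularity * 10.000.000 to get back a percentage.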

----- 3. Granularity (openstack configuration) must be smaller than Cooldown (heat template)
Test A. granularity=300  < cooldown=600 -> Ok https://paste.ubuntu.com/p/mHvmBNq7KF/
Test B. granularity=300  > cooldown=300 -> Not ok https://paste.ubuntu.com/p/dKDtGYcdG8/
Test C. granularity=300 == cooldown=300 -> Not ok https://paste.ubuntu.com/p/pdc7rGFtMY/

What to do if you want a smaller granularity:
1. Change the ceilometer polling interval for that metric (cpu in this case).
2. Change the metric's archive policy in gnocchi. It seems the archive policy cannot be updated once it is initially set on the metric; I haven't found a way to make an update take effect.

---------- Other notes on telemetry
-- Ceilometer
https://docs.openstack.org/ceilometer/latest/admin/telemetry-measurements.html
# Enable more metrics
juju config ceilometer enable-all-pollsters=true
juju config ceilometer-agent enable-all-pollsters=true

# Metrics granularity vs Ceilometer polling freq. 
juju config ceilometer polling-interval=300
juju config ceilometer-agent polling-interval=300

# Check configs
juju ssh ceilometer/0 'sudo cat /etc/ceilometer/pipeline.yaml'
juju ssh ceilometer-agent/0 'sudo cat /etc/ceilometer/polling.yaml'

-- Gnocchi
https://gnocchi.osci.io/operating.html
# Resource / Metric 
$ openstack metric resource show -c metrics --type instance $VM_UUID

# Metric / Measures 
$ openstack metric measures show -r $VM_UUID cpu
$ openstack metric measures show $METRIC_UUID

# Archive policies
$ openstack metric archive-policy list

# Mapping Metric <-> Archive policy 
# NOTE: Archive policy of a metric cannot be changed
$ openstack metric archive-policy-rule create <rule-name> --archive-policy-name  <archive-policy-name>

-- Aodh
openstack alarm create \
  --name cpu_70_percent_1vcpu \
  --type gnocchi_resources_threshold \
  --description 'Instance CPU High' \
  --metric cpu \
  --threshold 210000000000 \
  --comparison-operator gt \
  --aggregation-method mean \
  --granularity 300 \
  --evaluation-periods 1 \
  --alarm-action 'log://' \
  --resource-type instance \
  --resource-id $INSTANCE_ID

openstack alarm-history show $ALARM_UUID
"""
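
For flavors with more vCPUs the threshold in the alarm has to be scaled. A minimal sketch, assuming the same flags as the alarm above and the threshold formula from the notes; alarm_create_command is a hypothetical helper that only builds the command line and does not call aodh.

```python
import shlex


def alarm_create_command(name, percent, vcpus, granularity_s, resource_id):
    """Build an 'openstack alarm create' command line for a CPU threshold."""
    # %CPU * vCPUs * granularity * 10.000.000, as in the notes
    threshold = percent * vcpus * granularity_s * 10_000_000
    args = [
        "openstack", "alarm", "create",
        "--name", name,
        "--type", "gnocchi_resources_threshold",
        "--metric", "cpu",
        "--threshold", str(threshold),
        "--comparison-operator", "gt",
        "--aggregation-method", "mean",
        "--granularity", str(granularity_s),
        "--evaluation-periods", "1",
        "--alarm-action", "log://",
        "--resource-type", "instance",
        "--resource-id", resource_id,
    ]
    # shlex.join quotes arguments so the result can be pasted into a shell
    return shlex.join(args)


if __name__ == "__main__":
    # "INSTANCE-UUID" is a placeholder, not a real instance id
    print(alarm_create_command("cpu_70_percent_2vcpu", 70, 2, 300, "INSTANCE-UUID"))
```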

-- 
https://bugs.launchpad.net/bugs/1946793

Title:
  aodh uses deprecated gnocchi api to aggregate metrics and doesn't work
  properly

Status in Aodh:
  Fix Released
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Committed
Status in Ubuntu Cloud Archive xena series:
  Fix Committed
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in aodh package in Ubuntu:
  Fix Released

Bug description:
  [Impact]
  aodh uses the older, deprecated gnocchi API. This causes problems that can be reproduced with the metric commands:

  openstack metric measures aggregation
  openstack metric aggregates

  [Test Case]
  1. Deploy an OpenStack environment with telemetry and heat (the heat template can be taken from the comment).
  2. Adjust the heat template for the environment from step 1:
  - any OpenStack-specific variables
  - desired number: 2
  3. openstack stack create test -t ./heat
  4. Assume stack id = 136bf93d-9dc9-4b3f-862d-6fdec1b6abf7
  5. Log in to the instance and run dd to generate CPU load.
  6. Test the openstack metric commands:

  openstack metric measures aggregation \
    --query 'server_group=136bf93d-9dc9-4b3f-862d-6fdec1b6abf7' \
    --aggregation rate:mean --metric cpu --resource-type instance --fill null

  openstack metric aggregates '(aggregate rate:mean (metric cpu mean))' \
    'server_group=136bf93d-9dc9-4b3f-862d-6fdec1b6abf7' \
    --resource-type instance --granularity 300 --fill null

  7. Then check the gnocchi log (apache log) to see whether it calls
  v1/aggregation or v1/aggregates.
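
The log check in step 7 could be scripted roughly as below. This is only a sketch: it assumes the request path appears verbatim in each log line, the sample lines are made up for illustration, and count_endpoints is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the request path of either aggregation endpoint.
ENDPOINT_RE = re.compile(r"/v1/(aggregation|aggregates)\b")


def count_endpoints(log_lines):
    """Count hits per endpoint: 'aggregation' (deprecated) or 'aggregates'."""
    counts = Counter()
    for line in log_lines:
        match = ENDPOINT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up access log lines, for illustration only.
SAMPLE_LINES = [
    '"GET /v1/aggregation/metric?metric=abc HTTP/1.1" 200',
    '"POST /v1/aggregates?granularity=300 HTTP/1.1" 200',
    '"GET /v1/metric HTTP/1.1" 200',
]

if __name__ == "__main__":
    for endpoint, hits in count_endpoints(SAMPLE_LINES).items():
        print(endpoint, hits)
```

After the fix, hits on 'aggregation' coming from aodh would be expected to disappear in favour of 'aggregates'.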

  [Where problems could occur]
  Possible service interruption while upgrading.
  Retrieving metrics could have issues.

  [Others]

  Original Description below

  The gnocchi API docs describe 2 API methods to aggregate metrics

  1. /v1/aggregation/metric?

  See: https://gnocchi.osci.io/rest.html#aggregation-across-metrics-deprecated

  This one is deprecated

  2. /v1/aggregates?

  See: https://gnocchi.osci.io/rest.html#dynamic-aggregates

  aodh uses the 1st one to aggregate metrics, for example:

  ```
          if isinstance(start, datetime.datetime):
              start = start.isoformat()
          if isinstance(stop, datetime.datetime):
              stop = stop.isoformat()

          params = dict(start=start, stop=stop, aggregation=aggregation,
                        reaggregation=reaggregation, granularity=granularity,
                        needed_overlap=needed_overlap, groupby=groupby,
                        refresh=refresh, resample=resample, fill=fill)
          if query is None:
              for metric in metrics:
                  self._ensure_metric_is_uuid(metric)
              params['metric'] = metrics
              measures = self._get("v1/aggregation/metric",
                                   params=params).json()
  ```

  aodh doesn't work properly in our production environment after
  upgrading to Ussuri.

  When there is only 1 instance, aodh works properly and alarms can be
  triggered when the load on the instance is higher than the threshold.

  However, after the stack is scaled up and the second instance is
  created, the average CPU usage that the aodh evaluator gets from
  gnocchi is not correct. The metric measures are sometimes negative.
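
One way negative values can appear, offered as an assumption rather than a confirmed diagnosis of this bug: if cumulative cpu values are averaged across instances first and the rate is taken afterwards, a freshly booted instance with a small counter pulls the average down, so the difference between successive points goes negative. The sketch below uses made-up numbers and hypothetical helpers (rate, mean_across); it does not call gnocchi.

```python
from statistics import mean


def rate(series):
    """Difference between successive points; None if either point is missing."""
    return [
        None if prev is None or cur is None else cur - prev
        for prev, cur in zip(series, series[1:])
    ]


def mean_across(*series):
    """Per-timestamp mean over the series that have a value at that timestamp."""
    out = []
    for points in zip(*series):
        present = [p for p in points if p is not None]
        out.append(mean(present) if present else None)
    return out


# Made-up cumulative cpu counters; vm2 is created at the third timestamp.
vm1 = [100, 200, 300, 400]
vm2 = [None, None, 10, 20]

aggregate_then_rate = rate(mean_across(vm1, vm2))
rate_then_aggregate = mean_across(rate(vm1), rate(vm2))

if __name__ == "__main__":
    print(aggregate_then_rate)
    print(rate_then_aggregate)
```

Taking the rate per metric before averaging keeps every value non-negative in this toy data, while averaging first produces a negative point at the moment the second instance appears.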

  I manually pulled the metrics with the gnocchi commands.

  The aggregation of metrics is correct with command

  ```
  openstack metric aggregates
  ```

  It uses the new API in the backend.

  The aggregation of metrics is not correct with command

  ```
  openstack metric measures aggregation
  ```

  It uses the deprecated API which aodh is using.
