Set up Datadog monitoring

Hello!

I am trying to get job-related metrics sent to Datadog.

With airbyte-metrics-reporter configured in my docker-compose.yaml, I am now able to get the following metrics in my Datadog Dashboard:

  • est_num_metrics_emitted_by_reporter
  • num_pending_jobs
  • num_running_jobs
  • num_active_conn_per_workspace
  • oldest_pending_job_age_secs
  • oldest_running_job_age_secs
  • overall_job_runtime_in_last_hour_by_terminal_state_secs

But I’m still trying to get these ones:

  • attempt_failed_by_failure_origin
  • job_cancelled_by_release_stage
  • job_failed_by_release_stage
  • job_succeeded_by_release_stage

etc., which, from what I can see, are related to the airbyte-worker container. So I guess I need a specific config in my docker-compose.yaml file for airbyte-worker to get these metrics published and sent to Datadog, right? If so, what should I add? If not, what should I do to get these metrics?

Thanks in advance for your help!

Rachel

Hi @rachelr,
This is great! You are testing out a feature that is quite hidden at the moment :smiley: We plan to better document it in the future.
Did you try adding the DD_DOGSTATSD_PORT, DD_AGENT_HOST, and PUBLISH_METRICS env vars to the airbyte-worker service in your docker-compose.yaml file too?
Do you mind sharing the snippet from your docker-compose.yaml that declares the airbyte-metrics-reporter service?
I’m sure @davinchia will have more insights to share than I do on this topic.

@rachelr

Great! You are right in that if you pass the same env vars to the worker deployment, you should start getting the metrics.

Hey everyone,

Thanks a lot for being so quick to help :slight_smile:

Regarding this:

Did you try adding the DD_DOGSTATSD_PORT, DD_AGENT_HOST, and PUBLISH_METRICS env vars to the airbyte-worker service in your docker-compose.yaml file too?

I did try adding the env vars to the airbyte-worker service, but I can’t see any metrics like I do for metrics-reporter. I now suspect this comes from how we implemented the Datadog agent (installed directly on the instance, not containerised). I’m retrying with the agent as a container. I’ll keep you posted! (+ I’ll post the snippets if this can help anyone trying to get these metrics.)


You are right to try this, Rachel, as DD_AGENT_HOST must be reachable by the worker. Having a dockerized Datadog agent will make it reachable by the worker without extra network configuration (like using network_mode: "host" on the worker). Keep us posted!
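To illustrate the alternative mentioned here, a minimal sketch (not tested, and assuming the host-installed agent listens for DogStatsD on port 8125) of keeping the agent on the instance and sharing the host network with the worker:

```yaml
# Sketch only: reaching a host-installed Datadog agent from the worker.
worker:
  image: airbyte/worker:${VERSION}
  network_mode: "host"          # share the host's network stack
  environment:
    - PUBLISH_METRICS=true
    - DD_AGENT_HOST=localhost   # agent is reachable on localhost in host mode
    - DD_DOGSTATSD_PORT=8125
```

Note that network_mode: "host" bypasses the compose network, so the worker would no longer resolve other services by their compose service names.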

Hi again!

Having a dockerized Datadog agent will make it reachable by the worker without extra network configuration (like using network_mode: "host" on the worker)

You were absolutely right, I had to set network_mode to host, otherwise it did not work :slight_smile:

So now I’ve dockerised the agent with this configuration:

datadog:
  image: gcr.io/datadoghq/agent:7
  container_name: dd-agent
  environment:
    - DD_API_KEY=${DD_API_KEY}
    - DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /proc/:/host/proc/:ro
    - /sys/fs/cgroup:/host/sys/fs/cgroup:ro

airbyte-metrics:
  image: airbyte/metrics-reporter:${VERSION}
  container_name: airbyte-metrics
  environment:
    - PUBLISH_METRICS=true
    - DD_AGENT_HOST=dd-agent
    - DD_DOGSTATSD_PORT=8125
    - DATABASE_USER=${DATABASE_USER}
    - DATABASE_URL=${DATABASE_URL}
    - DATABASE_PASSWORD=${DATABASE_PASSWORD}

and it works perfectly for airbyte-metrics-reporter (I’m able to see the metrics in my DD dashboards, under metrics_reporter.name_of_metric). However, adding the three env vars to the worker container:

worker:
  image: airbyte/worker:${VERSION}
  logging: *default-logging
  container_name: airbyte-worker
  restart: unless-stopped
  environment:
    - AIRBYTE_VERSION=${VERSION}
    - AUTO_DISABLE_FAILING_CONNECTIONS=${AUTO_DISABLE_FAILING_CONNECTIONS}
    - etc.
    - PUBLISH_METRICS=true
    - DD_AGENT_HOST=dd-agent
    - DD_DOGSTATSD_PORT=8125

does not seem to trigger any monitoring for the worker on my side… Am I missing something?

I’ve also tried to add the following to the datadog container, with no results either:

  datadog:
    image: gcr.io/datadoghq/agent:7
    container_name: dd-agent
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_SITE=datadoghq.com
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup:/host/sys/fs/cgroup:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

Note: I’m on version v0.38.4-alpha

Thanks again for your help! :pray:

Thank you for sharing your setup!
@davinchia do you think additional setup is required to collect metrics from the worker? @rachelr are you running jobs? I think metrics get collected by the worker on job runs.

@alafanechere yes, since it comes from airbyte-worker I assumed it needed a job to send metrics :slight_smile: I tried with two connections:

  • E2E testing (as source & destination)
  • File downloaded via HTTPS as source + Local file as destination

Then each time I waited a bit to see if a worker-related metric popped up in the available metrics (airbyte-metrics-reporter did detect the running jobs).

I also tried adding specific Datadog labels/tags to retrieve the metrics, but it did not change anything for the worker :thinking:

Update: I’ve upgraded to v0.39.17-alpha but am still not able to get the metrics.

If someone has an idea that would be awesome :pray:

Your settings look correct to me.

Do you see logs like these coming up in the worker container?

airbyte-worker-54db8c4c76-wlspw worker 2022-06-23 22:04:35 INFO i.a.m.l.DogStatsDMetricClient(count):76 - publishing count, name: ATTEMPT_CREATED_BY_RELEASE_STAGE, value: 1, tags: [release_stage:alpha]
airbyte-worker-54db8c4c76-wlspw worker 2022-06-23 22:04:35 INFO i.a.m.l.DogStatsDMetricClient(count):76 - publishing count, name: ATTEMPT_CREATED_BY_RELEASE_STAGE, value: 1, tags: [release_stage:alpha]
airbyte-worker-54db8c4c76-wlspw worker 2022-06-23 22:05:32 INFO i.a.w.t.TemporalAttemptExecution(get):110 - Cloud storage job log path: /workspace/260540/0/logs.log
airbyte-worker-54db8c4c76-z9xml worker 2022-06-24 00:12:29 INFO i.a.m.l.DogStatsDMetricClient(count):76 - publishing count, name: JOB_CREATED_BY_RELEASE_STAGE, value: 1, tags: [release_stage:alpha]
airbyte-worker-54db8c4c76-wlspw worker Using cache monitor: TimePeriodBasedBufferMonitor(periodInSeconds: 60)
airbyte-worker-54db8c4c76-z9xml worker 2022-06-24 00:12:29 INFO i.a.m.l.DogStatsDMetricClient(count):76 - publishing count, name: JOB_CREATED_BY_RELEASE_STAGE, value: 1, tags: [release_stage:alpha]

Can you try upgrading to the latest version? We’ve been making improvements to our metrics implementation. Since 0.39.19-alpha, in addition to setting PUBLISH_METRICS to true, we also need to set the METRIC_CLIENT variable to datadog.

If both are set correctly, we should see the above log lines come up in the applications emitting metrics.
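Putting that together with the snippets shared earlier in the thread, the worker environment on 0.39.19-alpha or later would look roughly like this (a sketch; the other worker variables are omitted):

```yaml
worker:
  image: airbyte/worker:${VERSION}
  environment:
    - PUBLISH_METRICS=true
    - METRIC_CLIENT=datadog     # required since 0.39.19-alpha
    - DD_AGENT_HOST=dd-agent    # the dockerized agent's container name
    - DD_DOGSTATSD_PORT=8125
```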

Thanks a lot @davinchia, upgrading + adding METRIC_CLIENT did the trick!