Tag: Monitoring

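SLA: Service Level Agreements. An SLA is the level of service that customers expect when they use a service. It is best to define an SLA with simple metrics. SLO: Service Level Objectives. An SLO is a target for the availability expected of the system. Where the SLA is the figure users expect, the SLO is the target the team actually works to meet. SLOs are often set higher than the SLA to allow for variables that may arise. To stay focused on the goal, define as few SLOs as possible. SLI: Service Level Indicator. An SLI is a quantitative measure of how users experience the system's availability; in other words, it is the actual measurement compared against the objective. SL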
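
The relationship between the measured indicator (SLI) and the targets (SLO, SLA) can be sketched in a few lines of Python. This is only an illustration: the function names and all numbers below are hypothetical, and it assumes availability is measured as the share of successful requests, which is one common choice rather than something the post specifies.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability (the SLI) as the fraction of successful requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_objective(sli: float, target: float) -> bool:
    """True when the measured indicator reaches the given target."""
    return sli >= target


# Hypothetical targets: the SLO is set stricter than the SLA to leave headroom.
SLA = 0.99
SLO = 0.995

# Hypothetical measurement window: 99,700 of 100,000 requests succeeded.
sli = availability_sli(good_requests=99_700, total_requests=100_000)
print(f"SLI={sli:.4f}, meets SLO: {meets_objective(sli, SLO)}, meets SLA: {meets_objective(sli, SLA)}")
```

Because the SLO is the stricter of the two targets, a window can miss the SLO while still satisfying the SLA, which gives the team time to react before the customer-facing commitment is at risk.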

Pyroscope Distributor and Ingester

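Distributor: The Distributor is a stateless component that receives profiling data from the Agent and processes it. The Distributor batches data and sends it to multiple Ingesters in parallel, shards series among the Ingesters, and replicates each series according to the configured replication factor. By default, the configured replication factor is three. Validation: Before passing data to an Ingester, the Distributor validates and transforms it. If only some of the samples are valid, the Distributor passes just the valid data to the Ingesters, and invalid data is not sent. If a request contains invalid data, the Distributor returns a Bad Request co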

Grafana Agent is an OpenTelemetry Collector distribution with configuration. It is designed to be flexible, performant, and compatible with multiple ecosystems such as Prometheus and OpenTelemetry. Grafana Agent is based around components. Components are wired together to form programmable observability pipelines for telemetry collection, processing, and delivery. Grafana Agent is available in thr

What is DeepFlow? DeepFlow is an observability product designed to provide in-depth observability for complex cloud infrastructure and cloud-native applications. Based on eBPF, DeepFlow implements application performance metrics, distributed tracing, continuous profiling, and other observation signals with zero-disturbance (Zero Code) collection, integrating intelligent tags (SmartEncoding) technol

DeepFlow paper summary

DeepFlow is an observability product designed to provide in-depth observability for complex cloud infrastructure and cloud-native applications. Network-Centric Tracing Plane: a narrow-waist instrumentation model with two sets of functions, ingress-egress and enter-exit. DeepFlow instruments ten system call ABIs and cl

Grok is a tool to parse crappy unstructured log data into something structured and queryable. Grok is heavily used in Logstash to provide log data as input for Elasticsearch. Grok ships with about 120 predefined patterns for syslog logs, apache and other webserver logs, mysql logs, etc. It is easy to extend Grok with custom patterns. The grok_exporter aims at porting Grok from the ELK stack to Pro
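
To illustrate the idea of turning an unstructured log line into named fields, here is a minimal Python sketch of Grok-style pattern expansion. It is not the Grok implementation: the three-entry pattern table and the function names are hypothetical stand-ins for Grok's much larger predefined set, and only the `%{NAME:field}` form is handled.

```python
import re

# Hypothetical, tiny pattern library; real Grok ships a far larger set.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+",
}


def grok_to_regex(expression: str) -> re.Pattern:
    """Expand each %{NAME:field} reference into a named regex group."""
    def expand(match: re.Match) -> str:
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))


def parse(expression: str, line: str):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = grok_to_regex(expression).search(line)
    return match.groupdict() if match else None


# Hypothetical log line with a client address, an HTTP method and a status code.
print(parse("%{IP:client} %{WORD:method} %{NUMBER:status}", "10.0.0.7 GET 200"))
```

Each named pattern becomes a named capture group, so the result is a dictionary keyed by the field names chosen in the expression.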

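💡 A cloud monitoring solution that monitors and manages APM, logs, and infrastructure in one place. It can monitor resources spread across multiple cloud environments in a unified way. By continuously watching the state of the cloud, it helps you prepare for and respond to unexpected situations and errors. Pros: Errors are spotted quickly, enabling a fast response. Application information (logs, queries, etc.) accumulates, enabling data-driven improvement. Close collaboration among developers, operations teams, and business users. Many languages and environments are supported, so it can be extended to the application you want. Custom dashboards can be created. The official documentation is friendly. Cons: It is expensive. It has so many features that adopting it in production requires prior knowledge. Main features of Datadog: Integrations: a variety of servi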

Using the datadog APM feature

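Installing the datadog agent on a server lets you monitor key performance information such as CPU utilization, memory, and disk usage. However, if you want to respond faster to errors and bottlenecks based on reports covering the application's overall lifecycle (e.g., GC, JVM, I/O), you need to connect Datadog APM. What is APM? APM stands for Application Performance Monitoring. It collects information across the application lifecycle, such as performance measurements and error detection for running applications, so that it can be monitored. For convenience, it also offers metrics visualized in various ways, as well as API tests. It can be installed on multiple applications, and all of them at once in the same UI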

datadog architecture

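1. What the Datadog Agent does (from the application to the server). Using Datadog follows the flow below. ☝🏻 Three steps of using Datadog: 1. Install the Datadog agent on the server (enter the API key). 2. The agent collects information about the server and applications and sends it to the Datadog server. 3. The user checks the dashboard on the web. Let's look at what the Datadog Agent does and how the Agent is structured. Datadog agent: The agent installed on a server collects that server's system information and sends it to the Datadog server. With additional configuration, it can collect further metrics from databases, in-memory stores, and so on. (APM) SNMP: SNMP (Simple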

Installing the Agent with helmChart

1. Install helm. On macOS you can install it with brew install helm; on Windows, get the package from Chocolatey, and on Linux, from Snap. Alternatively, you can download a binary release and install it yourself. See the official documentation for details. 2. Datadog Operator: The command for installing the Datadog Operator with Helm is as follows. $ helm repo add datadog helm install -n datadog --create-namespace --set fullnameOverride="dd-op"

ELK is an acronym for three open-source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch: a search and analytics engine. Logstash: a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and sends it to a “stash” such as Elasticsearch. Kibana: lets users visualize Elasticsearch data with charts and graphs. Adding Beats, a data shipper, to these is called the ELK Stack. With Beats, you can also pull in data from other servers. Let's build ELK on ubuntu. Installing Elasticsearch: Install wget

ElasticSearch search commands

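Elasticsearch search commands. Cluster health: a basic check of how the cluster is doing. Use the _cat API to check cluster status; it can be run with curl. Node information: GET /_cat/nodes?v. Health information: GET /_cat/health?v. In Elasticsearch, names prefixed with an underscore (_) are APIs, and v asks for verbose output. Green: everything is working normally. Yellow: all data is available, but some replicas have not yet been allocated (the cluster is fully functional). Red: some data is unavailable for whatever reason (the cluster is only partially functional). Checking the data an index (database) holds: an index plays the role of a DB in a typical RDB. All index entries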
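
As a rough illustration of the _cat health check, the Python sketch below builds the request URL and reads the status column out of a verbose response. The base URL `http://localhost:9200`, the helper names, and the sample response text are assumptions made for the example; no request is actually sent.

```python
# Meaning of each health colour, as described in the post.
STATUS_MEANING = {
    "green": "everything is working normally",
    "yellow": "all data is available, but some replicas are not yet allocated",
    "red": "some data is unavailable; the cluster is only partially functional",
}


def cat_url(base_url: str, endpoint: str, verbose: bool = True) -> str:
    """Build a _cat API URL; the `v` parameter asks for verbose output."""
    url = f"{base_url.rstrip('/')}/_cat/{endpoint}"
    return f"{url}?v" if verbose else url


def health_status(cat_health_output: str) -> str:
    """Read the `status` column from verbose _cat/health output (header row + data row)."""
    header, row = cat_health_output.strip().splitlines()[:2]
    return dict(zip(header.split(), row.split()))["status"]


# Hypothetical response, shortened to a few columns for readability.
SAMPLE = """epoch      timestamp cluster       status node.total
1475871424 16:17:04  elasticsearch green  1
"""

status = health_status(SAMPLE)
print(cat_url("http://localhost:9200", "health"))
print(f"{status}: {STATUS_MEANING[status]}")
```

The same URL can be passed to curl, as the post suggests; the parsing helper only assumes that the header and data rows have the same number of whitespace-separated columns.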

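A highly scalable open-source full-text search and analytics engine. It stores, searches, and analyzes large volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications with complex search features and requirements. Core concepts. Near Realtime (NRT): Elasticsearch is a near-real-time search platform; there is a slight latency (normally one second) from the time a document is indexed until it becomes searchable. Cluster: a collection of one or more nodes (servers) that together hold all the data and provide federated indexing and search across all nodes; think of it as a group of nodes. A cluster is identified by a unique name, which defaults to elasticsearch. When a node is set up to join a cluster by name, this is the name the node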

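Logstash is an open-source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize it into the destinations of your choice. With a wide range of input, filter, and output plugins, it can enrich and transform any type of event, and many native codecs further simplify the processing. Logstash thus lets you draw insight from a greater volume and variety of data. Logstash pipeline: The overall Logstash pipeline consists of INPUTS, FILTERS, and OUTPUTS. Of these, the two required elements are INPUTS and OUTPUTS; filters are optional, depending on whether parsing is needed. Logstash.ym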

Loki Canary is a standalone app that audits the log-capturing performance of a Grafana Loki cluster. Loki Canary generates artificial log lines. These log lines are sent to the Loki cluster. Loki Canary communicates with the Loki cluster to capture metrics about the artificial log lines, such that Loki Canary forms information about the performance of the Loki cluster. The information is avai

kiali with prometheus

Kiali requires Prometheus to generate the topology graph, show metrics, calculate health and for several other features. If Prometheus is missing or Kiali can’t reach it, Kiali won’t work properly. By default, Kiali assumes that Prometheus is available at the URL of the form http://prometheus.<istio_namespace_name>:9090, which is the usual case if you are using the Prometheus Istio add-on. If

Prometheus is an open-source systems monitoring and alerting toolkit. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring of highly dynamic service-ori

prometheus agent mode

The core design of Prometheus is inspired by Google’s Borgmon monitoring system. You can deploy a Prometheus server alongside the applications you want to monitor, tell Prometheus how to reach them, and allow it to scrape the current values of their metrics at regular intervals. Such a collection method, which is often referred to as the “pull model”, is the core principle that allows Prometheus to

prometheus glossary

Core Prometheus Prometheus usually refers to the core binary of the Prometheus system. It may also refer to the Prometheus monitoring system as a whole. Target A target is the definition of an object to scrape. For example, what labels to apply, any authentication required to connect, or other information that defines how the scrape will occur. Endpoint A source of metrics that can be

prometheus storage

Prometheus includes a local on-disk time series database, but also optionally integrates with remote storage systems. Local storage Prometheus’s local time series database stores data in a custom, highly efficient format on local storage. On-disk layout Ingested samples are grouped into blocks of two hours. Each two-hour block consists of a directory containing a chunks subdirectory containing all
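
The grouping of samples into two-hour blocks can be pictured with a short Python sketch. This is a simplified model, not Prometheus code: it assumes block boundaries fall on multiples of two hours since the Unix epoch, and the function names and sample values are hypothetical.

```python
from collections import defaultdict

BLOCK_SECONDS = 2 * 60 * 60  # two hours, the block size named in the text


def block_start(timestamp: int) -> int:
    """Start of the two-hour window containing `timestamp` (assumed epoch-aligned)."""
    return timestamp - timestamp % BLOCK_SECONDS


def group_into_blocks(samples):
    """Group (timestamp, value) samples by the start of their two-hour window."""
    blocks = defaultdict(list)
    for timestamp, value in samples:
        blocks[block_start(timestamp)].append((timestamp, value))
    return dict(blocks)


# Hypothetical samples, with timestamps in seconds.
samples = [(0, 1.0), (7199, 2.0), (7200, 3.0), (14399, 4.0), (14400, 5.0)]
for start, members in sorted(group_into_blocks(samples).items()):
    print(start, len(members))
```

A sample at second 7199 still belongs to the first window, while the one at second 7200 opens the next, which is the boundary behaviour the sketch is meant to show.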

Telemetry automatically collects, transmits and measures data from remote sources, using sensors and other devices to collect data. It uses communication systems to transmit the data back to a central location. Subsequently, the data is analyzed to monitor and control the remote system. Collecting telemetry data is essential for administering and managing various IT infrastructures. This data is us

NOTE: It is recommended to keep deploying rules inside the relevant Prometheus servers locally. Use the ruler only in specific cases; the details below explain why. The rule component should in particular not be used to circumvent solving rule deployment properly at the configuration management level. The thanos rule command evaluates Prometheus recording and alerting rules against the chosen query API via rep

Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity, which can be added seamlessly on top of existing Prometheus deployments. Thanos is a CNCF Incubating project. Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies. Addi