ELK Centralized Logging Solution

Author: xinyu_he
Published: May 13, 2026
Category: elk

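I. Introduction: Why Centralized Logging?

Two questions never go out of date in a software team:

"How long does it take to ship a change that touches a single line of code?" This measures delivery efficiency.
"How long does it take to locate a production issue, and what is the process?" This measures operational support capability.

Many early-stage teams troubleshoot production issues by logging in to servers one at a time and searching the logs by hand with commands such as cd, tail, grep, sed, and awk. That is manageable on a single machine or a handful of servers, but in a large distributed deployment it runs into three problems:

Log volume is too large, which makes archiving difficult
Plain-text search is too slow
Queries cannot span multiple dimensions or correlate entries across servers

The answer is a centralized log management system that collects, manages, and searches the logs from every node in one place.

ELK (Elasticsearch + Logstash + Kibana) is one of the most widely used open-source solutions for this.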

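II. The Three ELK Components

| Component | Role | Characteristics |
| --- | --- | --- |
| Logstash | Collects, parses, and filters logs | Supports many data sources; client/server architecture; powerful but relatively heavyweight |
| Elasticsearch | Stores logs, builds indexes, and serves searches | Open-source distributed search engine; RESTful API; automatic sharding and replication |
| Kibana | Visualizes logs | Turns log data into charts and dashboards |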

III. Four Stages of ELK Architecture Evolution

1. Entry level: direct connection

App → Log → Logstash → Elasticsearch → Kibana

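Pros: simple.

Cons:

Under high concurrency, log bursts can exceed what the Elasticsearch HTTP API can absorb, causing timeouts and lost events
Logstash runs on the application servers and competes with the business workload for system resources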

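2. Upgraded: add a message queue and split Logstash

App → Log → Logstash Shipper → Message Queue → Logstash Indexer → Elasticsearch → Kibana

Improvements:

A buffering middleware layer (Redis or Kafka) absorbs traffic peaks
Logstash is split into a Shipper (collection) and an Indexer (processing)

Choosing the message queue:

| Property | Redis | Kafka |
| --- | --- | --- |
| Message reliability | ❌ Messages may be lost | ✅ Reliable |
| Throughput | Moderate | ✅ High |
| Persistence | In memory (bounded by RAM) | ✅ On disk |
| Cluster mode | Primary/replica | ✅ Distributed |
| Best suited for | Small deployments that tolerate loss | Production environments that require reliability |

Conclusion: Kafka is the recommended choice for production.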

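3. Advanced: add Filebeat and isolate by business line

App → Log → Filebeat → Kafka → Logstash → Elasticsearch (one cluster per business line) → Kibana

Key changes:

Filebeat replaces the Logstash Shipper. It is far lighter (roughly 20-30 MB of memory) and is the shipper Elastic recommends
Elasticsearch is split into independent clusters per business line so that they cannot affect one another

4. Expert: hot/cold data separation

Core idea: use data age (for example, 7 days) as the boundary between hot and cold data

Hot cluster: recent data (for example, the last 3-7 days) on SSD storage for fast queries
Cold cluster: historical data (7-90 days) on SATA disks, cheaper, with slower but acceptable queries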

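IV. How the Core Components Work

1. How Filebeat works

Two core components work together:

| Component | Responsibility |
| --- | --- |
| Harvester | Reads a single file line by line and sends the content to the output |
| Prospector (called an input since Filebeat 6.3) | Manages harvesters and discovers and tracks every source to read |

State tracking:

File state, including the read offset, is recorded in the registry at /var/lib/filebeat/registry (a single file in older releases, a directory in Filebeat 7.x and later)
After a restart Filebeat resumes from the offsets in the registry, which gives at-least-once delivery as long as the log files have not been rotated away in the meantime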
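The offset-and-resume behaviour described above can be illustrated with a small Python sketch. It is not Filebeat's implementation: the JSON registry layout and the `harvest` function are assumptions made purely for illustration.

```python
import json
import os
import tempfile


def harvest(log_path, registry_path):
    """Return the complete lines added to log_path since the last call."""
    registry = {}
    if os.path.exists(registry_path):
        with open(registry_path, "r", encoding="utf-8") as fh:
            registry = json.load(fh)

    offset = registry.get(log_path, 0)
    lines = []
    with open(log_path, "rb") as fh:
        fh.seek(offset)  # resume where the previous run stopped
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # partial line: wait until the writer finishes it
            lines.append(raw.decode("utf-8").rstrip("\n"))
            offset += len(raw)

    # Persist the new offset so that a restart does not re-read old lines.
    registry[log_path] = offset
    with open(registry_path, "w", encoding="utf-8") as fh:
        json.dump(registry, fh)
    return lines


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    log_file = os.path.join(workdir, "app.log")
    reg_file = os.path.join(workdir, "registry.json")
    with open(log_file, "w", encoding="utf-8") as fh:
        fh.write("first\nsecond\n")
    print(harvest(log_file, reg_file))
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write("third\n")
    print(harvest(log_file, reg_file))
```

Because the offset is saved only after the lines have been collected, a crash between reading and saving leads to re-reading rather than loss, which is the at-least-once trade-off.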

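Key configuration parameters:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths: ["/var/log/app/*.log"]   # example path, adjust to your application
    close_inactive: 5m        # close the file handle after 5 minutes without new data
    scan_frequency: 10s       # check for new files every 10 seconds
filebeat.shutdown_timeout: 30s     # on shutdown, wait up to 30s for the output to acknowledge events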

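2. How Logstash works

Three stages: Inputs → Filters → Outputs

| Stage | Common plugins | Purpose |
| --- | --- | --- |
| Inputs | file, syslog, redis, beats | Receive data |
| Filters | grok, mutate, drop, geoip | Parse, transform, and filter |
| Outputs | elasticsearch, file, graphite | Emit the processed results |

Grok pattern example:

# Parse Nginx access logs (the default "combined" format matches COMBINEDAPACHELOG)
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # Only clean up events that grok parsed, so failed lines keep their raw text
  if "_grokparsefailure" not in [tags] {
    mutate {
      # Legacy (non-ECS) field name; in ECS mode the status is [http][response][status_code]
      convert => { "response" => "integer" }
      remove_field => ["message"]
    }
  }
}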
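To make the parse-and-convert step concrete, here is a rough Python equivalent of what the grok and mutate filters do to one access-log line. The regular expression is a simplified stand-in for COMBINEDAPACHELOG, not the actual grok pattern, and the sample line is invented.

```python
import re

# Simplified approximation of the "combined" access-log format.
COMBINED = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None  # comparable to an event tagged _grokparsefailure
    event = match.groupdict()
    # Same idea as mutate/convert: store numbers as integers, not strings.
    event["response"] = int(event["response"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


if __name__ == "__main__":
    sample = ('203.0.113.7 - - [13/May/2026:10:15:32 +0800] '
              '"GET /index.html HTTP/1.1" 200 612 "-" "curl/8.0.1"')
    print(parse_access_line(sample))
```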

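3. Elasticsearch core principle: the inverted index

Example scenario: find the poems that contain the character "望"

Traditional SQL:

SELECT * FROM poems WHERE content LIKE '%望%';  -- the leading wildcard forces a full table scan, which is very slow

Inverted index approach:

Extract every term from each document
Build a mapping from each term to the list of documents that contain it
Index the terms themselves so that a term can be located quickly

Data structure:

Term + posting list (the list of document IDs)
Additional statistics: the number of documents containing the term (doc_freq), the number of occurrences within a single document (term_freq), position information (positions), and so on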
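The structure above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores its index: it tokenizes one character at a time (an assumption that suits the single-character poem example), and the three sample lines are illustrative data.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # One token per letter, digit, or CJK character; punctuation is skipped.
        tokens = [ch for ch in text if ch.isalnum()]
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def search(index, term):
    """Posting list: sorted IDs of the documents that contain the term."""
    return sorted(index.get(term, {}))


def doc_freq(index, term):
    """Number of documents that contain the term."""
    return len(index.get(term, {}))


def term_freq(index, term, doc_id):
    """Number of occurrences of the term inside one document."""
    return len(index.get(term, {}).get(doc_id, []))


if __name__ == "__main__":
    poems = {1: "举头望明月", 2: "低头思故乡", 3: "望庐山瀑布"}
    idx = build_inverted_index(poems)
    print(search(idx, "望"), doc_freq(idx, "望"), idx["望"])
```

A lookup is now a dictionary access on the term followed by a read of its posting list, instead of a scan over every document.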

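V. Performance Tuning

Q1: Log queries are slow. How do I speed them up?

Root-cause analysis:

| Possible cause | How to check |
| --- | --- |
| Indices are too large (daily growth > 100 GB) | `GET _cat/indices?v` shows index sizes |
| Queries do not use the index efficiently | Check that the Kibana query targets specific indexed fields (for example `level:ERROR` or `_exists_:trace_id`) instead of free text across all fields |
| Unsuitable shard count | `GET _cat/shards?v` shows the shard distribution |
| Insufficient hardware (I/O, memory, CPU) | Monitor system resources |
| Hot and cold data not separated | Fast hot-data queries and slower cold-data queries are expected |

Optimizations:

① Optimize the query

// Poor: no field is specified, so the term is searched across all fields
{
  "query": {
    "query_string": { "query": "error" }
  }
}

// Better: target a single field
{
  "query": {
    "match": { "message": "error" }
  }
}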
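Going one step further, exact-match and time-range conditions can be placed in filter context, where Elasticsearch skips scoring and can cache the result. The helper below is a hedged sketch that only builds the request body; the function name and the `level` field are assumptions for illustration.

```python
import json


def build_log_query(text, level=None, since="now-15m"):
    """Build a search body: full-text match plus non-scoring filters."""
    filters = [{"range": {"@timestamp": {"gte": since}}}]
    if level is not None:
        # term queries suit keyword fields such as a log level
        filters.append({"term": {"level": level}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": filters,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_log_query("timeout", level="ERROR"), indent=2))
```

The printed body can be pasted into the Kibana console under a `GET logs-*/_search` request.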

② Use index lifecycle management (ILM)

// Note: the freeze action is deprecated in ES 7.x and is a no-op in 8.x
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": { "rollover": { "max_size": "50GB", "max_age": "1d" } }
      },
      "warm": {
        "min_age": "3d",
        "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } }
      },
      "cold": {
        "min_age": "7d",
        "actions": { "freeze": {} }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
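To see how the `min_age` thresholds in this policy translate into phases, the following sketch maps an index age to the phase it would be in. It is a simplification: real ILM evaluates phases on a polling interval, and when rollover is used the age is counted from the rollover time.

```python
from datetime import timedelta

# (phase, min_age) pairs copied from logs_policy above, in ascending order.
PHASES = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=3)),
    ("cold", timedelta(days=7)),
    ("delete", timedelta(days=90)),
]


def phase_for_age(age):
    """Return the last phase whose min_age the index has already reached."""
    current = PHASES[0][0]
    for name, min_age in PHASES:
        if age >= min_age:
            current = name
    return current


if __name__ == "__main__":
    for days in (1, 3, 10, 120):
        print(days, "days ->", phase_for_age(timedelta(days=days)))
```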

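③ Adjust query limits and timeouts

// Set a search timeout on the request
{
  "timeout": "10s",
  "query": { "match_all": {} }
}

// Cap aggregation buckets and open scroll contexts (changing settings requires PUT, not GET)
PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 20000,
    "search.max_open_scroll_context": 5000
  }
}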

Q2: How do I tune Elasticsearch?

2.1 Operating system level

# 1. Disable swap
swapoff -a
# To make it permanent, comment out the swap line in /etc/fstab

# 2. Allow memory locking (used together with bootstrap.memory_lock: true in elasticsearch.yml)
echo "elasticsearch soft memlock unlimited" >> /etc/security/limits.conf
echo "elasticsearch hard memlock unlimited" >> /etc/security/limits.conf

# 3. Raise the file descriptor limit
echo "elasticsearch soft nofile 65535" >> /etc/security/limits.conf
echo "elasticsearch hard nofile 65535" >> /etc/security/limits.conf

# 4. Raise vm.max_map_count
sysctl -w vm.max_map_count=262144
echo "vm.max_map_count=262144" >> /etc/sysctl.conf

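2.2 JVM level

# jvm.options (keep comments on their own lines)
# Set the heap to about 50% of physical RAM and keep it below roughly 31 GB:
# above that threshold the JVM can no longer use compressed object pointers, which hurts performance
-Xms16g
-Xmx16g

# Garbage collector (G1 is recommended for ES 7+)
-XX:+UseG1GC
-XX:G1ReservePercent=25

Golden rules for heap sizing:

Set the minimum and maximum to the same value to avoid resizing at runtime
Use no more than 50% of physical RAM and leave the rest for the operating system's filesystem cache
Stay below about 31 GB; beyond that threshold compressed object pointers are disabled and performance drops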
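These rules reduce to a small calculation. The sketch below is one way to express them; the 31 GB ceiling is an approximation of the compressed-pointer threshold, which varies slightly between JVMs, and the function names are illustrative.

```python
def recommended_heap_gb(physical_ram_gb, compressed_oops_limit_gb=31):
    """Half of physical RAM, capped below the compressed-pointer threshold."""
    half = physical_ram_gb // 2
    return max(1, min(half, compressed_oops_limit_gb))


def jvm_heap_options(physical_ram_gb):
    """Return matching -Xms/-Xmx lines so the heap never resizes at runtime."""
    heap = recommended_heap_gb(physical_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, "GB RAM ->", jvm_heap_options(ram))
```

A 32 GB machine yields the `-Xms16g` / `-Xmx16g` pair shown above, while a 128 GB machine is capped at 31 GB and the remainder goes to the filesystem cache.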

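2.3 Cluster level

// Set a sensible shard count (legacy template API; ES 7.8+ also offers PUT _index_template)
PUT /_template/logs_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 3,                 // about 2-3 shards per node works well
    "number_of_replicas": 1,               // at least 1 replica for high availability
    "refresh_interval": "30s",             // refresh less often to improve write throughput
    "index.translog.durability": "async",  // async fsync is faster, but a crash can lose up to sync_interval of writes
    "index.translog.sync_interval": "5s"
  }
}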
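The shard count in a template like this is usually derived from data volume. As a rough sketch, assuming the 20-50 GB per-shard guideline from the pitfalls table later in this article and a hypothetical target of 30 GB:

```python
import math


def primary_shards_for(index_size_gb, target_shard_gb=30):
    """Primaries needed so that each shard stays near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def shards_per_node(primary_shards, replicas, data_nodes):
    """Average number of shard copies of one index on each data node."""
    return primary_shards * (1 + replicas) / data_nodes


if __name__ == "__main__":
    primaries = primary_shards_for(90)  # e.g. a 90 GB daily index
    print(primaries, shards_per_node(primaries, replicas=1, data_nodes=3))
```

For a 90 GB daily index this gives 3 primaries, and with 1 replica on 3 data nodes each node holds 2 shard copies, in line with the comment in the template.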

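# Node roles (elasticsearch.yml). These boolean settings apply to ES 7.8 and earlier;
# from 7.9 on use node.roles instead, for example node.roles: [ master ]

# Dedicated master node (holds no data)
node.master: true
node.data: false
node.ingest: false

# Dedicated data node
node.master: false
node.data: true
node.ingest: false

# Dedicated coordinating node (load balancing): every role is disabled
node.master: false
node.data: false
node.ingest: false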

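Q3: How do I optimize ES indices?

3.1 Index design

① Choosing field types

| Scenario | Right type | Wrong type | Reason |
| --- | --- | --- | --- |
| Status codes, enums | `keyword` | `text` | keyword supports aggregation and sorting |
| Log message | `text` + `fields` | `text` only | A keyword sub-field is needed for aggregation |
| Numeric range queries | `integer` / `long` | `text` | Numeric types are more efficient |
| IP addresses | `ip` | `text` | The ip type supports range queries |

PUT /logs/_mapping
{
  "properties": {
    "status_code": { "type": "keyword" },      // status code
    "message": {                                // log message
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 256 }
      }
    },
    "duration": { "type": "integer" },         // elapsed time
    "client_ip": { "type": "ip" }              // IP address
  }
}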

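② Disable fields you do not need

// In the index template, keep _source enabled and trim it only with great care
PUT /_template/logs
{
  "index_patterns": ["logs-*"],
  "mappings": {
    "_source": { "enabled": true }   // keep it: update and reindex depend on _source
    // "_all": { "enabled": false } is only relevant before ES 7; _all is disabled by default in 6.x and removed in 7.x
  }
}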

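③ Use the index option deliberately

{
  "properties": {
    "trace_id": {
      "type": "keyword",
      "index": true,      // fields that are searched
      "doc_values": true  // fields that are sorted or aggregated on
    },
    "raw_log": {
      "type": "text",
      "index": false,     // fields that are never searched: skip indexing to save storage
      "store": false
    }
  }
}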

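3.2 Index lifecycle management (ILM) configuration

// Hot phase takes the heavy writes; warm phase compacts the index and makes it read-only; delete after 30 days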
PUT _ilm/policy/hot_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 },
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

3.3 Index lifecycle management (using aliases)

// Create the first index and attach the write alias
PUT /logs-2025.01.01-000001
{
  "aliases": { "logs-write": { "is_write_index": true } }
}

// Write through the alias
POST /logs-write/_doc
{ "message": "log content" }

// Roll over to a new index
POST /logs-write/_rollover
{
  "conditions": { "max_age": "1d", "max_size": "50GB" }
}
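The rollover decision and the naming of the next index can be sketched as follows. This only imitates the behaviour for illustration: Elasticsearch evaluates the conditions itself, rolls over when any one of them is met, and increments the numeric suffix of the index name.

```python
import re


def should_rollover(age_days, size_gb, max_age_days=1, max_size_gb=50):
    """Rollover fires as soon as any single condition is satisfied."""
    return age_days >= max_age_days or size_gb >= max_size_gb


def next_index_name(current):
    """Increment the zero-padded counter at the end of the index name."""
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError(f"index name does not end in -<number>: {current}")
    prefix, counter = match.groups()
    return f"{prefix}{int(counter) + 1:0{len(counter)}d}"


if __name__ == "__main__":
    name = "logs-2025.01.01-000001"
    if should_rollover(age_days=0.2, size_gb=50):
        name = next_index_name(name)
    print(name)
```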

Q4: How is hot/cold separation actually implemented?

4.1 Separating node roles

Hot node configuration:

# elasticsearch.yml
node.name: node-hot-1
node.attr.temperature: hot
node.master: false
node.data: true

# Hardware: SSD, large memory, high-clock CPU

Cold (or warm) node configuration:

# elasticsearch.yml
node.name: node-cold-1
node.attr.temperature: cold
node.master: false
node.data: true

# Hardware: high-capacity SATA disks, moderate memory

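4.2 Shard allocation policy

// Pin the index to hot nodes when it is created
PUT /logs-hot-2025.01.01
{
  "settings": {
    "index.routing.allocation.require.temperature": "hot",
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

// Change the attribute on existing indices at runtime (set it to "cold" to move their shards to cold nodes)
PUT logs-*/_settings
{
  "index.routing.allocation.require.temperature": "hot"
}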

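4.3 Automatic migration with Curator or ILM

Using ILM (recommended on Elasticsearch 6.6+):

// Hot → cold migration (the freeze action is deprecated in ES 7.x and is a no-op in 8.x)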
PUT _ilm/policy/hot-to-cold
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50GB" },
          "set_priority": { "priority": 100 }
        }
      },
      "cold": {
        "min_age": "3d",
        "actions": {
          "allocate": {
            "require": { "temperature": "cold" },
            "number_of_replicas": 1
          },
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

Using Curator (older versions):

# Curator action file
actions:
  1:
    action: allocation
    description: "Move indices older than 7 days to the cold nodes"
    options:
      key: temperature
      value: cold
      allocation_type: require
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 7
    - filtertype: pattern
      kind: prefix
      value: logs-

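4.4 A complete hot/warm/cold architecture

# Physical deployment layout
- Hot tier (3 SSD nodes):
  - Holds the most recent 3 days of data
  - Index state: hot
  - Hardware: 16 CPU cores, 64 GB RAM, 2 TB SSD
  
- Warm tier (3 SATA nodes):
  - Holds data that is 4-30 days old
  - Index state: warm
  - Hardware: 8 CPU cores, 32 GB RAM, 10 TB SATA
  
- Cold tier (2 SATA nodes):
  - Holds data that is 31-90 days old
  - Index state: cold/frozen
  - Hardware: 4 CPU cores, 16 GB RAM, 20 TB SATA (indices can be closed)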
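A quick capacity check shows whether a layout like this fits a given log volume. The sketch below uses the retention windows from the layout above; the daily volume of 100 GB and the single replica are assumptions, and the result excludes free-space headroom.

```python
# Days of data each tier holds, taken from the layout above:
# hot = days 1-3, warm = days 4-30, cold = days 31-90.
TIER_DAYS = {"hot": 3, "warm": 27, "cold": 60}


def tier_storage_gb(daily_gb, replicas=1):
    """Return the index storage each tier must hold, replicas included."""
    copies = 1 + replicas
    return {tier: days * daily_gb * copies for tier, days in TIER_DAYS.items()}


if __name__ == "__main__":
    for tier, size_gb in tier_storage_gb(daily_gb=100).items():
        print(f"{tier}: {size_gb} GB")
```

At 100 GB per day the hot tier needs about 600 GB, which the three 2 TB SSD nodes cover comfortably, while the cold tier needs about 12 TB across its two 20 TB nodes.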

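VI. Frequently Asked Questions (FAQ)

Q5: What if Filebeat loses logs?

Troubleshooting steps:

# 1. Inspect the registry (on Filebeat 7.x and later this path is a directory; look at the files inside it)
cat /var/lib/filebeat/registry

# 2. Check Filebeat's own logs
journalctl -u filebeat -n 100

# 3. Check the output configuration
filebeat test output        # test connectivity to the output
filebeat test config        # validate the configuration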

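Solution:

# Settings that improve reliability (filebeat.yml)
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: 'logs'
  required_acks: 1           # wait for the leader's ack; use -1 to wait for all in-sync replicas
  compression: gzip

filebeat.shutdown_timeout: 30s      # on shutdown, wait for pending acknowledgements
queue.mem.events: 8192              # in-memory queue size
queue.mem.flush.min_events: 2048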

Q6: How do I speed up slow Logstash processing?

# 1. Tune the pipeline settings (logstash.yml)
pipeline.workers: 8          # typically the number of CPU cores
pipeline.batch.size: 2000    # events per batch
pipeline.batch.delay: 50     # batch delay (ms)

# 2. Use multiple pipelines
# pipelines.yml
- pipeline.id: main
  path.config: /etc/logstash/conf.d/main
  pipeline.workers: 6
- pipeline.id: high-traffic
  path.config: /etc/logstash/conf.d/high
  pipeline.workers: 2

# 3. Skip unnecessary filters
filter {
  # Use a conditional so that grok only runs on matching events
  if [type] == "nginx-access" {
    grok { ... }
  }
}

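Q7: How do I handle a red or yellow ES cluster?

| Status | Meaning | Fix |
| --- | --- | --- |
| green | All shards are allocated | 👍 |
| yellow | All primaries are allocated; some replicas are unassigned | Add data nodes or lower the replica count |
| red | Some primary shards are unassigned | Bring back the nodes or disks that hold those shards, or restore from a snapshot |

// Check cluster health
GET _cluster/health

// Explain why a shard is unassigned
GET _cluster/allocation/explain

// Retry allocation of shards that previously failed to allocate
POST _cluster/reroute?retry_failed=true

// List shard allocation
GET _cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason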
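The colour rules in the table can be written down as a small function over rows shaped like the `_cat/shards` output. This is a simplified model for reasoning about the states, not the cluster's own health computation; the row dictionaries are illustrative.

```python
# Shard states that mean "this copy is not serving yet".
INACTIVE_STATES = {"UNASSIGNED", "INITIALIZING"}


def cluster_status(shards):
    """shards: iterable of dicts with 'prirep' ('p' or 'r') and 'state'."""
    status = "green"
    for shard in shards:
        if shard["state"] not in INACTIVE_STATES:
            continue  # STARTED and RELOCATING copies are active
        if shard["prirep"] == "p":
            return "red"  # a missing primary means data is unavailable
        status = "yellow"  # only a replica is missing
    return status


if __name__ == "__main__":
    rows = [
        {"index": "logs-1", "prirep": "p", "state": "STARTED"},
        {"index": "logs-1", "prirep": "r", "state": "UNASSIGNED"},
    ]
    print(cluster_status(rows))
```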

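Q8: How do I free disk space?

// 1. Delete expired indices (ES 8 rejects wildcard deletes unless action.destructive_requires_name is false)
DELETE logs-2024.*

// 2. Delete documents but keep the index (space is reclaimed only once segments merge)
POST logs-2025.01.01/_delete_by_query
{
  "query": { "range": { "@timestamp": { "lt": "now-30d" } } }
}

// 3. Force-merge segments to reclaim the space held by deleted documents
POST logs-2025.01.01/_forcemerge?max_num_segments=1

// 4. Clear caches (frees heap memory, not disk space)
POST _cache/clear

// 5. Shrink the index to fewer shards (the source must be read-only with a copy of every shard on one node)
POST logs-2025.01.01/_shrink/logs-2025.01.01-shrink
{
  "settings": { "index.number_of_shards": 1 }
}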

Q9: How do I monitor ES cluster performance?

Key metrics:

# 1. Search performance
GET _nodes/stats/indices/search
# Watch: query_total, query_time_in_millis, fetch_total

# 2. Indexing performance
GET _nodes/stats/indices/indexing
# Watch: index_total, index_time_in_millis, index_failed

# 3. JVM memory
GET _nodes/stats/jvm
# Watch: heap_used_percent, gc.collectors.old.collection_time_in_millis

# 4. Thread pools
GET _nodes/stats/thread_pool
# Watch: rejected counts for the search and write pools (the write pool was named bulk before ES 6.3)

# 5. Cache hit rates
GET _nodes/stats/indices/query_cache,request_cache
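The counters returned by these endpoints are cumulative, so useful numbers come from comparing two samples. The sketch below assumes the relevant fields have already been pulled out of the node stats response into plain dictionaries; the sample values are invented.

```python
def avg_query_latency_ms(before, after):
    """Average latency of the queries that ran between two stats samples."""
    queries = after["query_total"] - before["query_total"]
    if queries <= 0:
        return 0.0
    elapsed = after["query_time_in_millis"] - before["query_time_in_millis"]
    return elapsed / queries


def cache_hit_rate(cache_stats):
    """Hit ratio from the hit_count and miss_count counters of a cache."""
    total = cache_stats["hit_count"] + cache_stats["miss_count"]
    return cache_stats["hit_count"] / total if total else 0.0


if __name__ == "__main__":
    earlier = {"query_total": 1000, "query_time_in_millis": 5000}
    later = {"query_total": 1500, "query_time_in_millis": 9000}
    print(avg_query_latency_ms(earlier, later))
    print(cache_hit_rate({"hit_count": 75, "miss_count": 25}))
```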

Recommended monitoring tools:

Prometheus + Grafana: collect metrics through elasticsearch_exporter
Built-in Elastic Stack monitoring: enable xpack.monitoring
Cerebro: an open-source ES administration tool

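Q10: What if the logs contain many duplicate or error entries?

# Drop unwanted events with the Logstash drop filter
filter {
  if [level] == "DEBUG" {
    drop { }    # discard DEBUG logs
  }
  if [message] =~ /Heartbeat/ {
    drop { }    # discard heartbeat logs
  }
}

# Deduplicate with the fingerprint filter
filter {
  fingerprint {
    source => ["message", "level"]
    concatenate_sources => true   # hash both fields together rather than only the last one
    method => "SHA256"
    key => "log-dedup"
  }
  # Store the fingerprints already seen in Redis or Memcached to detect repeats
}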
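The fingerprint-then-check idea can be sketched in Python as follows. An in-memory set stands in for Redis or Memcached, and the digest is a keyed SHA-256 over the same two fields; it is not guaranteed to be byte-identical to what the Logstash plugin produces.

```python
import hashlib
import hmac


def fingerprint(event, key=b"log-dedup", fields=("message", "level")):
    """Keyed SHA-256 digest over the selected fields of an event."""
    payload = "|".join(f"{name}={event.get(name, '')}" for name in fields)
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def deduplicate(events):
    """Keep the first event for each fingerprint and drop later repeats."""
    seen = set()  # stand-in for an external store shared by all workers
    unique = []
    for event in events:
        digest = fingerprint(event)
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(event)
    return unique


if __name__ == "__main__":
    batch = [
        {"message": "disk full", "level": "ERROR"},
        {"message": "disk full", "level": "ERROR"},
        {"message": "disk full", "level": "WARN"},
    ]
    print(deduplicate(batch))
```

Another common option is to use the fingerprint as the Elasticsearch document ID, so that a repeated event overwrites the earlier copy instead of creating a new document.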

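VII. Summary and Best Practices

Architecture selection guide

| Daily log volume | Recommended architecture | Node configuration |
| --- | --- | --- |
| < 100 GB | Entry level (Filebeat → ES → Kibana) | 3 nodes × 32 GB RAM, 1 TB SSD |
| 100 GB - 1 TB | Upgraded (Filebeat → Kafka → Logstash → ES) | 5 nodes × 64 GB RAM, 3 TB SSD |
| 1 TB - 10 TB | Advanced (one cluster per business line) | 3 nodes per business line, scaled as needed |
| > 10 TB | Expert (hot/cold separation + multiple clusters) | Hot cluster on SSD, cold cluster on SATA |

Core tuning checklist

[ ] Operating system: disable swap, raise vm.max_map_count, raise the file descriptor limit
[ ] JVM: keep the heap below about 31 GB and use G1GC
[ ] Indices: choose a sensible shard count (1.5-3 × the number of nodes) and configure ILM
[ ] Hot/cold separation: assign node attributes along the time dimension
[ ] Queries: target specific fields, use filter context, avoid leading wildcards
[ ] Monitoring: feed Prometheus + Grafana and alert on key metrics

Common pitfalls

| Pitfall | Consequence | What to do instead |
| --- | --- | --- |
| Too many shards | Unstable cluster, slow queries | Keep each shard at 20-50 GB |
| Relying on dynamic mapping | Wrong field types, bloated storage | Define explicit mappings and set dynamic to false or strict |
| No slow log configured | Problems are hard to trace | Set slowlog thresholds |
| Node roles not separated | Overloaded master nodes | Use dedicated master nodes |

One-sentence summary

The core of ELK tuning: plan shards and replicas sensibly, separate hot and cold data, optimize queries, monitor key metrics, and clean up or archive historical data regularly.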


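Author: xinyu.he
Title: ELK Centralized Logging Solution
URL: https://www.hxy.bj.cn/archives/817/
Copyright: Unless otherwise noted, all articles are original to the Xinyu.he blog. Please credit the source when reposting.

Last modified: May 13, 2026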
