- Why OpenSearch Operations Expertise Matters
- 1. Production Cluster Architecture
- 2. Index Design and Shard Strategy
- 3. ISM (Index State Management) Lifecycle Management
- 4. Monitoring and Alert Configuration
- 5. Incident Scenarios and Recovery Strategies
- 6. AWS OpenSearch Service Operations Checklist
- 7. OpenSearch vs Elasticsearch Operations Perspective Comparison
- 8. Elasticsearch to OpenSearch Migration Checklist
- 9. Performance Tuning Quick Reference
- 10. Security Hardening Checklist
- Conclusion: 7 Things Every Operator Must Remember
- References
Why OpenSearch Operations Expertise Matters
OpenSearch is a core engine for log analytics, full-text search, and observability pipelines. Spinning up a cluster is easy, but operating it stably while controlling costs is an entirely different challenge. If the shard count for a single index is set incorrectly, the JVM heap of the entire cluster can become unstable. If a single ISM policy is missing, the disk fills up and writes stop.
This article covers the decision points that operators encounter in actual production, in the following order: cluster architecture, index design, ISM lifecycle, monitoring, incident response, security, and migration. All API examples are based on OpenSearch 2.x and can also be used on AWS OpenSearch Service.
1. Production Cluster Architecture
1.1 Node Role Separation
In production clusters, clearly separating each node type's role is essential for achieving both stability and performance.
+------------------------------------------------------------------+
| Client / Data Prepper |
| (Ingest Pipeline) |
+------------------------------+-----------------------------------+
|
+--------------------+--------------------+
v v v
+------------+ +------------+ +------------+
| Master-1 | | Master-2 | | Master-3 |
| (AZ-a) | | (AZ-b) | | (AZ-c) |
| Dedicated | | Dedicated | | Dedicated |
+------------+ +------------+ +------------+
| | |
v v v
+------------+ +------------+ +------------+
| Hot-1 | | Hot-2 | | Hot-3 |
| (AZ-a) | | (AZ-b) | | (AZ-c) |
| EBS gp3 | | EBS gp3 | | EBS gp3 |
+------------+ +------------+ +------------+
| | |
v v v
+------------+ +------------+ +------------+
| Warm-1 | | Warm-2 | | Warm-3 |
| UltraWarm | | UltraWarm | | UltraWarm |
| (S3+cache) | | (S3+cache) | | (S3+cache) |
+------------+ +------------+ +------------+
| Node Type | Role | Recommended Specs | Placement Rule |
|---|---|---|---|
| Dedicated Master | Cluster metadata management, shard allocation | Minimum 3 (odd number), latest gen instances | Must be distributed across 3 AZs |
| Hot Data Node | Active read/write processing | High CPU + fast storage (gp3) | Even distribution across AZs |
| Warm Data Node | Read-heavy, infrequent queries | High storage density, low CPU | UltraWarm (S3-based) |
| Coordinating Node | Request routing, result aggregation | High CPU + memory, minimal storage | Only for search-intensive workloads |
1.2 Capacity Sizing Formula
The core formula for production cluster sizing.
Required Storage = Daily Raw Data
x (1 + Number of Replicas)
x 1.1 (Indexing Overhead)
x 1.15 (OS Reserved + Internal Operations Margin)
x Retention Period (days)
Example: 100 GiB daily logs, 1 replica, 30-day retention: 100 x 2 x 1.1 x 1.15 x 30 = 7,590 GiB
Plan for disk usage to stay below 75%. By default, OpenSearch stops allocating new shards to a node at the 85% low watermark and forces indexes read-only at the 95% flood-stage watermark, so staying under 75% leaves headroom for merges, snapshots, and recovery.
Actual Provisioning = Required Storage / 0.75 = 10,120 GiB
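The sizing arithmetic above can be expressed as a small helper for plugging in your own numbers (the function names here are illustrative, not part of any OpenSearch API):

```python
def required_storage_gib(daily_gib: float, replicas: int, retention_days: int,
                         index_overhead: float = 1.1, os_margin: float = 1.15) -> float:
    """Apply the formula above: raw x (1 + replicas) x overhead x margin x days."""
    return daily_gib * (1 + replicas) * index_overhead * os_margin * retention_days

def provisioned_storage_gib(required_gib: float, max_disk_usage: float = 0.75) -> float:
    """Pad the requirement so steady-state disk usage stays below the target ceiling."""
    return required_gib / max_disk_usage

required = required_storage_gib(100, replicas=1, retention_days=30)
print(round(required))                           # 7590
print(round(provisioned_storage_gib(required)))  # 10120
```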
2. Index Design and Shard Strategy
2.1 Shard Sizing Standards
Shard size varies by workload type. The recommended ranges from OpenSearch official documentation and AWS guides are as follows.
| Workload | Recommended Shard Size | Rationale |
|---|---|---|
| Latency-sensitive search (e-commerce, autocomplete) | 10-30 GiB | Minimize latency, fast recovery |
| Write-heavy / log analytics | 30-50 GiB | Maximize write throughput |
| Maximum recommended limit | 50 GiB | Recovery/reallocation time spikes above this |
Primary shard count formula:
Primary Shard Count = (Raw Data Size + Growth Margin) x 1.1 / Target Shard Size
Example: 66 GiB daily logs, 4x growth expected, target shard size 30 GiB
(66 + 198) x 1.1 / 30 = approx. 10 primary shards
2.2 Per-Node Shard Limits
| Constraint | Limit |
|---|---|
| Per JVM heap | 25 shards per 1 GiB of heap |
| Per CPU | 1.5 vCPUs per shard (initial sizing) |
| OpenSearch 2.15 or earlier | Maximum 1,000 shards per node |
| OpenSearch 2.17+ | 1,000 shards per 16 GiB JVM heap, max 4,000 per node |
Key principle: Set the shard count as a multiple of data node count for even distribution. For example, with 3 data nodes, set primary shards to 3, 6, 9, 12, etc.
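The shard-count formula from 2.1 and the multiple-of-node-count principle above can be combined in one helper (illustrative code, not an OpenSearch API). Note that rounding the 10-shard example up to a multiple of 3 data nodes yields 12:

```python
import math

def primary_shard_count(raw_gib: float, growth_margin_gib: float,
                        target_shard_gib: float, data_nodes: int) -> int:
    """Shard count per the formula in 2.1, rounded up to a multiple of data_nodes."""
    shards = math.ceil((raw_gib + growth_margin_gib) * 1.1 / target_shard_gib)
    # Round up to the nearest multiple of the data node count for even distribution.
    return math.ceil(shards / data_nodes) * data_nodes

print(primary_shard_count(66, 198, 30, data_nodes=3))  # 12
```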
2.3 Index Templates and Mapping Design
Example 1: Composable Index Template
Index templates ensure consistent settings and mappings for all new indexes. Using dynamic: strict to prevent unexpected field additions (mapping explosion) is critical.
PUT _index_template/app-logs-template
{
"index_patterns": ["app-logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"refresh_interval": "30s",
"index.translog.flush_threshold_size": "1024mb",
"index.codec": "zstd_no_dict",
"plugins.index_state_management.rollover_alias": "app-logs-write"
},
"mappings": {
"dynamic": "strict",
"properties": {
"@timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text", "analyzer": "standard" },
"service": { "type": "keyword" },
"trace_id": { "type": "keyword" },
"span_id": { "type": "keyword" },
"host": { "type": "keyword" },
"duration_ms": { "type": "float" },
"http_status": { "type": "short" },
"request_path":{ "type": "keyword" },
"user_id": { "type": "keyword" },
"metadata": {
"type": "object",
"enabled": false
}
}
}
},
"composed_of": ["common-settings"],
"priority": 200
}
Setting points explained:
- refresh_interval: 30s -- Increased from the default 1 second to reduce CPU/IO load. Suitable for log workloads where real-time search is not required.
- index.translog.flush_threshold_size: 1024mb -- Set to roughly 25% of the JVM heap to reduce flush frequency.
- dynamic: strict -- Rejects indexing of undefined fields. This is the key to preventing mapping explosion.
- index.codec: zstd_no_dict -- zstd compression available in OpenSearch 2.9+. Achieves 25-30% storage savings compared to the default LZ4.
- metadata.enabled: false -- Disables indexing for nested objects that are never searched, saving resources.
2.4 Replica Strategy
| Configuration | Recommended Replicas | Reason |
|---|---|---|
| Single AZ | 1 | Data protection against node failure |
| Multi-AZ (2 AZ) | 1 | Service from other AZ during AZ failure |
| Multi-AZ with Standby (3 AZ) | 2 | 100% data availability during AZ failure |
| During bulk indexing | 0 (temporary) | Maximize indexing speed, then restore |
Replica 0 operation pattern: Setting replicas to 0 during initial bulk loading and restoring afterward can improve indexing speed by up to 2x.
// Before indexing: disable replicas
PUT my-bulk-index/_settings
{ "index.number_of_replicas": 0 }
// After indexing complete: restore replicas
PUT my-bulk-index/_settings
{ "index.number_of_replicas": 1 }
2.5 Data Stream vs Traditional Index Pattern
OpenSearch 2.6+ supports Data Streams. Compared to the traditional Rollover + Alias pattern:
| Item | Rollover + Alias | Data Stream |
|---|---|---|
| Initial Setup | Manual initial index + alias creation | Only register index template |
| Write Target | Write alias | Use data stream name directly |
| Backing Index Name | app-logs-000001 | .ds-app-logs-000001 |
| Deletion | Per index | DELETE _data_stream/app-logs |
| Recommended Workload | General purpose | Time-series (append-only) only |
3. ISM (Index State Management) Lifecycle Management
ISM is OpenSearch's index lifecycle automation engine. It corresponds to Elasticsearch's ILM, automatically handling the Hot, Warm, Cold, Delete flow.
3.1 Hot-Warm-Cold-Delete Full Policy
Example 2: ISM Full Lifecycle Policy
PUT _plugins/_ism/policies/app-log-lifecycle
{
"policy": {
"description": "App log index lifecycle: hot -> warm -> cold -> delete",
"default_state": "hot",
"states": [
{
"name": "hot",
"actions": [
{
"rollover": {
"min_size": "30gb",
"min_index_age": "1d",
"min_doc_count": 10000000
}
}
],
"transitions": [
{ "state_name": "warm", "conditions": { "min_index_age": "3d" } }
]
},
{
"name": "warm",
"actions": [
{ "replica_count": { "number_of_replicas": 1 } },
{ "force_merge": { "max_num_segments": 1 } },
{ "allocation": {
"require": { "temp": "warm" },
"wait_for": true
}
}
],
"transitions": [
{ "state_name": "cold", "conditions": { "min_index_age": "30d" } }
]
},
{
"name": "cold",
"actions": [
{ "replica_count": { "number_of_replicas": 0 } },
{ "read_only": {} }
],
"transitions": [
{ "state_name": "delete", "conditions": { "min_index_age": "90d" } }
]
},
{
"name": "delete",
"actions": [
{
"notification": {
"destination": {
"slack": { "url": "https://hooks.slack.com/services/T.../B.../xxx" }
},
"message_template": {
"source": "Index {{ctx.index}} will be deleted according to retention policy (90 days)."
}
}
},
{ "delete": {} }
],
"transitions": []
}
],
"ism_template": [
{ "index_patterns": ["app-logs-*"], "priority": 100 }
]
}
}
Operational notes:
- The default ISM check interval is 5 minutes. It can be adjusted via plugins.index_state_management.job_interval.
- Setting the ism_template field automatically attaches the policy to new indexes matching the pattern.
- Rollover conditions (min_size, min_index_age, min_doc_count) are OR conditions -- rollover executes when any one is met.
- allocation.require.temp: warm -- Moves shards to nodes carrying the node.attr.temp=warm attribute.
- After force_merge, no further writes should be made to that index.
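The OR semantics of the rollover conditions can be sketched locally (a simplified model for reasoning about when rollover fires, not the ISM plugin's own code):

```python
def rollover_due(index_size_gb: float, index_age_days: float, doc_count: int,
                 min_size_gb: float = 30, min_age_days: float = 1,
                 min_docs: int = 10_000_000) -> bool:
    """ISM rollover conditions are OR-ed: any single condition triggers rollover."""
    return (index_size_gb >= min_size_gb
            or index_age_days >= min_age_days
            or doc_count >= min_docs)

print(rollover_due(5.0, 0.2, 12_000_000))  # True: doc count alone is enough
print(rollover_due(5.0, 0.2, 1_000))       # False: no condition met
```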
3.2 ISM Operation Commands
# Check policy execution status (specific index)
GET _plugins/_ism/explain/app-logs-000001
# Attach policy to existing indexes
POST _plugins/_ism/add/old-logs-2025-*
{ "policy_id": "app-log-lifecycle" }
# Update policy version (apply new policy to managed indexes)
POST _plugins/_ism/change_policy/app-logs-*
{
"policy_id": "app-log-lifecycle-v2",
"state": "warm"
}
# Retry failed ISM operations
POST _plugins/_ism/retry/app-logs-000003
# Remove ISM policy from a specific index
POST _plugins/_ism/remove/app-logs-000005
3.3 Rollover + Alias Pattern Practical Setup
Example 3: Rollover Alias Initial Setup and Operation Flow
This is the standard pattern for time-series index operations. The write alias always points to the latest index, and ISM rollover creates and switches to new indexes.
// Step 1: Create initial index + alias setup
PUT app-logs-000001
{
"aliases": {
"app-logs-write": { "is_write_index": true },
"app-logs-read": {}
}
}
// Step 2: ISM policy's rollover action handles automatically
// -> Creates app-logs-000002
// -> Moves app-logs-write alias to 000002
// -> Sets 000001's is_write_index to false
// Reads always use the read alias (targets all history)
GET app-logs-read/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "ERROR" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
},
"sort": [{ "@timestamp": "desc" }],
"size": 100
}
// Writes always use the write alias
POST app-logs-write/_doc
{
"@timestamp": "2026-03-04T10:00:00Z",
"level": "ERROR",
"service": "payment-api",
"message": "Payment gateway timeout occurred",
"trace_id": "abc123",
"duration_ms": 30500,
"http_status": 504
}
Alias status check:
# Check current write index
GET _alias/app-logs-write
# Check full alias structure
GET _cat/aliases/app-logs-*?v&h=alias,index,is_write_index
4. Monitoring and Alert Configuration
4.1 Essential Monitoring Metrics and Thresholds
| Category | Metric | Threshold | Severity |
|---|---|---|---|
| Cluster Status | ClusterStatus.red | 1 or more (1 min, 1 occurrence) | Critical |
| Cluster Status | ClusterStatus.yellow | 1 or more (1 min, 5 consecutive) | Warning |
| Write Blocking | ClusterIndexWritesBlocked | 1 or more (5 min, 1 occurrence) | Critical |
| Node Count | Nodes | under expected count (1 day, 1 occurrence) | Critical |
| Free Disk | FreeStorageSpace | 25% or less of node storage | Warning |
| JVM Heap | JVMMemoryPressure | 95% or more (1 min, 3 consecutive) | Critical |
| Old Gen JVM | OldGenJVMMemoryPressure | 80% or more (1 min, 3 consecutive) | Warning |
| CPU Utilization | CPUUtilization | 80% or more (15 min, 3 consecutive) | Warning |
| Master CPU | MasterCPUUtilization | 50% or more (15 min, 3 consecutive) | Warning |
| Write Threadpool | ThreadpoolWriteRejected | 1 or more (SUM, delta) | Warning |
| Search Threadpool | ThreadpoolSearchRejected | 1 or more (SUM, delta) | Warning |
| Snapshot Failure | AutomatedSnapshotFailure | 1 or more (1 min, 1 occurrence) | Critical |
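The "N consecutive" rules in the table reduce to a run-length check over recent datapoints. This sketch shows the evaluation logic (function and variable names are mine, not a CloudWatch or OpenSearch API):

```python
def breaches_threshold(samples: list[float], threshold: float, consecutive: int) -> bool:
    """True if `consecutive` datapoints in a row are at or above `threshold`,
    mirroring rules like 'JVMMemoryPressure >= 95% for 3 consecutive periods'."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= consecutive:
            return True
    return False

jvm_pressure = [90, 96, 97, 95, 80]
print(breaches_threshold(jvm_pressure, 95, 3))  # True: 96, 97, 95 in a row
```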
4.2 Cluster Status Check APIs
Example 4: Comprehensive Cluster Health Check Command Collection
# Overall cluster status (green/yellow/red)
GET _cluster/health
# Per-index status check
GET _cluster/health?level=indices
# Per-node JVM, CPU, disk details
GET _nodes/stats/jvm,os,fs
# Hot threads check (for high CPU root cause analysis)
GET _nodes/hot_threads
# Thread pool status (check pending and rejected counts)
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected
# Shard allocation status
GET _cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=state
# Unassigned shard root cause analysis
GET _cluster/allocation/explain
{
"index": "app-logs-000003",
"shard": 0,
"primary": true
}
# Per-index size and document count
GET _cat/indices?v&h=index,health,pri,rep,docs.count,store.size&s=store.size:desc
# Per-node disk usage
GET _cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent
# Check pending tasks
GET _cat/pending_tasks?v
4.3 Catching Bottleneck Queries with Slow Log
PUT app-logs-*/_settings
{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s",
"index.search.slowlog.threshold.fetch.warn": "5s",
"index.indexing.slowlog.threshold.index.warn": "10s",
"index.indexing.slowlog.threshold.index.info": "5s"
}
Setting the warn level to 10 seconds and info level to 5 seconds makes queries taking over 10 seconds immediate alerting targets. Analyze these logs in OpenSearch Dashboards' Discover or collect them into a separate index to track trends.
4.4 OpenSearch Alerting Configuration
Example 5: Cluster Status Monitor + Slack Alert
POST _plugins/_alerting/monitors
{
  "type": "monitor",
  "name": "Cluster Red Status Alert",
  "monitor_type": "cluster_metrics_monitor",
  "enabled": true,
  "schedule": {
    "period": { "interval": 1, "unit": "MINUTES" }
  },
  "inputs": [
    {
      "uri": {
        "api_type": "CLUSTER_HEALTH",
        "path": "_cluster/health/",
        "path_params": ""
      }
    }
  ],
  "triggers": [
    {
      "query_level_trigger": {
        "name": "Cluster Status Red",
        "severity": "1",
        "condition": {
          "script": {
            "source": "ctx.results[0].status == \"red\"",
            "lang": "painless"
          }
        },
        "actions": [
          {
            "name": "Notify Slack",
            "destination_id": "slack-destination-id",
            "message_template": {
              "source": "Cluster status is RED.\nCluster: {{ctx.monitor.name}}\nTime: {{ctx.periodEnd}}\nImmediate investigation required."
            },
            "throttle_enabled": true,
            "throttle": { "value": 10, "unit": "MINUTES" }
          }
        ]
      }
    }
  ]
}
Practical tip: In actual operations, external monitoring that periodically calls the _cluster/health API (Prometheus + Grafana, Datadog, etc.) is more reliable. OpenSearch's built-in alerting may not function when the cluster itself is unstable.
5. Incident Scenarios and Recovery Strategies
5.1 Response by Major Incident Type
| Incident | Cause | Impact | Immediate Response |
|---|---|---|---|
| Red Cluster | Unassigned primary shards | Data loss risk, write failure on affected index | Analyze cause with _cluster/allocation/explain |
| Yellow Cluster | Unassigned replicas | Redundancy broken, data loss risk on node failure | Add nodes or adjust replica count |
| Disk Full | Insufficient storage | ClusterBlockException, all writes blocked | Delete old indexes, review ISM policies |
| JVM OOM | Heap pressure over 95% | Node crash, cascading failures | Kill heavy queries, clear caches |
| Split Brain | Fewer than 3 master nodes + network partition | Data inconsistency | Dedicated master nodes (3 across 3 AZs) is mandatory |
| AZ Failure | AWS infrastructure issue | Data loss if replicas insufficient | Multi-AZ + 2 replicas configuration |
| Mapping Explosion | dynamic mapping + unstructured fields | Heap spike, indexing delays | Switch to dynamic: strict or dynamic: false |
5.2 Disk Full Emergency Recovery
When the cluster has switched to read-only, recover in the following order.
# 1. Identify the largest indexes
GET _cat/indices?v&h=index,store.size&s=store.size:desc&format=json
# 2. Delete old unnecessary indexes
DELETE old-logs-2024-*
# 3. Remove read-only block (all indexes)
PUT _all/_settings
{
"index.blocks.read_only_allow_delete": null
}
# 4. Remove cluster-level block
PUT _cluster/settings
{
"persistent": {
"cluster.blocks.read_only": false
}
}
# 5. Check cluster status
GET _cluster/health
When disk usage exceeds the 95% flood-stage watermark, OpenSearch automatically switches indexes to read-only. The threshold can be adjusted via the cluster.routing.allocation.disk.watermark.flood_stage setting, but fundamentally freeing up disk space is the answer.
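The watermark ladder can be summarized in a small helper (thresholds shown are the OpenSearch defaults; the function name and labels are illustrative):

```python
def disk_watermark_state(used_pct: float) -> str:
    """Default disk watermarks: 85% low (no new shards allocated to the node),
    90% high (shards relocated away), 95% flood stage (indexes forced read-only)."""
    if used_pct >= 95:
        return "flood_stage: read-only"
    if used_pct >= 90:
        return "high: relocating shards"
    if used_pct >= 85:
        return "low: no new shard allocation"
    return "ok"

print(disk_watermark_state(96))  # flood_stage: read-only
print(disk_watermark_state(70))  # ok
```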
5.3 Snapshots and Recovery
Example 6: S3 Snapshot Repository Registration and Restoration
// Register S3 repository
PUT _snapshot/s3-backup
{
"type": "s3",
"settings": {
"bucket": "my-opensearch-snapshots",
"base_path": "daily",
"region": "ap-northeast-2",
"server_side_encryption": true,
"role_arn": "arn:aws:iam::123456789012:role/OpenSearchSnapshotRole"
}
}
// Create manual snapshot
PUT _snapshot/s3-backup/snapshot-2026-03-04
{
"indices": "app-logs-*",
"ignore_unavailable": true,
"include_global_state": false
}
// Check snapshot status
GET _snapshot/s3-backup/snapshot-2026-03-04/_status
// Restore specific indexes only (rename to avoid conflicts with existing indexes)
POST _snapshot/s3-backup/snapshot-2026-03-04/_restore
{
"indices": "app-logs-000001,app-logs-000002",
"ignore_unavailable": true,
"include_global_state": false,
"rename_pattern": "(.+)",
"rename_replacement": "$1_restored",
"index_settings": {
"index.number_of_replicas": 0
}
}
// Restore replicas after restoration is complete
PUT app-logs-000001_restored/_settings
{ "index.number_of_replicas": 1 }
Set include_global_state: false to avoid conflicts with security indexes (.opendistro_security). Always use this option.
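The rename_pattern / rename_replacement pair follows regex-rewrite semantics, so restored index names can be previewed locally before running the restore (Python's re uses \1 where the REST API uses $1):

```python
import re

rename_pattern = r"(.+)"             # same pattern passed to _restore
rename_replacement = r"\1_restored"  # "$1_restored" in the REST API

for index in ["app-logs-000001", "app-logs-000002"]:
    print(re.sub(rename_pattern, rename_replacement, index))
# app-logs-000001_restored
# app-logs-000002_restored
```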
5.4 DR Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Snapshot/Restore | Hours | Data since last snapshot | Low | Low |
| Cross-Cluster Replication | Minutes | Under 1 min lag | High (2x cluster) | Medium |
| Dual Ingestion (Active-Active) | Near zero | Near zero | High (2x cluster + routing) | High |
CCR caveat: If replication is paused for more than 12 hours, it cannot be resumed. It must be restarted from scratch.
6. AWS OpenSearch Service Operations Checklist
6.1 Instance Type Selection
| Use Case | Recommended Instance | Notes |
|---|---|---|
| Small production | r6g.large | Can serve as combined data + master |
| General production | r6g.xlarge to r6g.2xlarge | Price-performance balance |
| Log analytics (large) | im4gn.* | High storage density per vCPU |
| Indexing-intensive | OR1 instances | 30% better price-performance (2024+) |
| Dedicated Master | 3 instances across 3 AZs | Latest generation, always odd count |
| Not for production | t2.*, t3.small | CPU throttling when burst credits deplete |
6.2 Storage Tier Utilization
| Tier | Use Case | Backend | Cost (Relative) |
|---|---|---|---|
| Hot | Active read/write | EBS (gp3 recommended) | 100% |
| UltraWarm | Read-only, infrequent queries | S3 + cache | ~40% |
| Cold | Archive, on-demand analytics | S3 | $0.024/GiB/month |
- UltraWarm becomes cost-effective above approximately 2.5 TiB.
- Use gp3 for EBS -- 9.6% cheaper than gp2 with no burst credit concerns.
6.3 Cost Optimization Strategies
- Automatic deletion with ISM -- Automatically clean up indexes past their retention period.
- Index rollups -- Aggregate granular data then move to UltraWarm/Cold.
- Reserved Instances -- Purchase after 14+ days of stable operation. 1-year no upfront saves ~30%, 3-year all upfront saves ~50%.
- Delete unused indexes -- They consume resources even during maintenance tasks (snapshots, etc.).
- Enable Auto-Tune -- Automatically optimizes JVM, queue sizes, and caches.
7. OpenSearch vs Elasticsearch Operations Perspective Comparison
| Category | OpenSearch | Elasticsearch |
|---|---|---|
| License | Apache 2.0 (fully open source) | SSPL + Elastic License (from 7.11) |
| Security | Free built-in (RBAC, FGAC, audit log) | Paid (Platinum+) |
| Alerting | Free plugin (Alerting) | Watcher (paid) |
| Anomaly Detection | Free plugin | ML (paid Platinum+) |
| Lifecycle Mgmt | ISM (Index State Management) | ILM (Index Lifecycle Management) |
| SQL Support | Free plugin + PPL | Paid feature |
| Managed Service | AWS OpenSearch Service | Elastic Cloud (multi-cloud) |
| Vector Search | k-NN plugin (FAISS, nmslib, Lucene) | Native integration (8.x HNSW) |
| Cloud | AWS-centric | AWS, GCP, Azure support |
| Release Cycle | ~3 month minor releases | ~2 week patches, ~4 month minor |
Selection criteria summary:
- OpenSearch: AWS infrastructure-centric, cost-sensitive, open-source license required, log/observability workloads
- Elasticsearch: Elastic APM/SIEM needed, multi-cloud deployment, vector search/RAG focus, Elastic ecosystem (Kibana, Beats, Logstash) utilization
8. Elasticsearch to OpenSearch Migration Checklist
8.1 Migration Method Comparison
| Method | Downtime | Best For | Complexity |
|---|---|---|---|
| Snapshot/Restore | Proportional to data | Simple migration, small scale | Low |
| Rolling Upgrade | Minimal | ES 6.8-7.10.2 to OS 1.x | Medium |
| Remote Reindex | None (live) | Cross-version moves, mapping changes | Medium |
| Migration Assistant | Near zero | Large clusters, multi-step version moves | Low (automated) |
8.2 Pre-Migration Checklist
- Version compatibility check -- OpenSearch 1.x is based on ES 7.10.2. ES 7.11+ (SSPL) requires snapshot/reindex
- Plugin compatibility audit -- Verify third-party plugins are supported in OpenSearch
- API path changes -- _opendistro/ to _plugins/ (security, alerting, ISM, and all plugin APIs)
- Client library replacement -- elasticsearch-py to opensearch-py, elasticsearch-js to opensearch-js
- Kibana to OpenSearch Dashboards -- Complete all reference name changes
- Full snapshot backup -- Required before starting migration
- Feature parity verification -- Confirm features in use exist in OpenSearch (refer to official compatibility matrix)
- Performance benchmark -- Measure with identical query sets on old and new environments
- Rollback plan -- Document the procedure for reverting to the original cluster if migration fails
- Security configuration migration -- Recreate roles/users/tenants via securityadmin.sh or the REST API
8.3 Remote Reindex Practical Example
Example 7: Live Reindex from Remote Cluster
POST _reindex
{
"source": {
"remote": {
"host": "https://old-es-cluster:9200",
"username": "admin",
"password": "********",
"socket_timeout": "60m",
"connect_timeout": "30s"
},
"index": "app-logs-2025-*",
"size": 5000,
"query": {
"range": {
"@timestamp": {
"gte": "2025-01-01",
"lt": "2026-01-01"
}
}
}
},
"dest": {
"index": "app-logs-migrated"
},
"conflicts": "proceed"
}
Caution:
- The source host must be registered in the reindex.remote.allowlist setting.
- For large-scale migrations, run asynchronously with wait_for_completion=false and monitor progress with GET _tasks/{task_id}.
- size: 5000 is the scroll size per fetch. Adjust based on network bandwidth.
8.4 API Path Change Reference
| Elasticsearch / Open Distro | OpenSearch |
|---|---|
| _opendistro/_security/ | _plugins/_security/ |
| _opendistro/_alerting/ | _plugins/_alerting/ |
| _opendistro/_ism/ | _plugins/_ism/ |
| _opendistro/_anomaly_detection/ | _plugins/_anomaly_detection/ |
| _opendistro/_sql/ | _plugins/_sql/ |
| _opendistro/_ppl/ | _plugins/_ppl/ |
| _opendistro/_knn/ | _plugins/_knn/ |
9. Performance Tuning Quick Reference
| Setting | Default | Recommended | Effect |
|---|---|---|---|
| index.refresh_interval | 1s | 30s or more (for logs) | CPU/IO savings, delayed search visibility |
| index.translog.flush_threshold_size | 512 MiB | 25% of JVM heap | Reduced flush frequency |
| index.merge.policy | tiered | log_byte_size (time-series) | Improved @timestamp range query performance |
| index.codec | LZ4 | zstd_no_dict | 25-30% disk savings |
| Bulk request size | -- | 5-15 MiB | Start at 5 MiB, increase until performance plateaus |
| Bulk batch count | -- | 5,000-10,000 docs per _bulk request | Dramatic throughput improvement over single calls |
| indices.memory.index_buffer_size | 10% | 10-15% | Expanded buffer for indexing-intensive workloads |
| search.max_buckets | 65,536 | Depends on workload | Safety limit for aggregation queries |
Bulk indexing optimization script:
from opensearchpy import OpenSearch, helpers
import json

client = OpenSearch(
    hosts=[{"host": "opensearch-node", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=True,
    verify_certs=False,  # enable certificate verification in production
)

def generate_actions(file_path):
    """Reads a JSON-lines log file and yields _bulk actions."""
    with open(file_path) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": "app-logs-write",
                "_source": doc,
            }

success, errors = helpers.bulk(
    client,
    generate_actions("/var/log/app/events.jsonl"),
    chunk_size=5000,
    max_retries=3,
    request_timeout=60,
    raise_on_error=False,  # collect per-document errors instead of raising
)
print(f"Indexed: {success}, Errors: {len(errors)}")
10. Security Hardening Checklist
| Item | Setting | Notes |
|---|---|---|
| Transport Encryption | TLS 1.2+ required | Both inter-node and client |
| Authentication | SAML / OIDC / Internal DB | AWS uses Cognito or SAML |
| FGAC | Index/field/document level access | _plugins/_security/api/roles |
| Audit Log | Read/write access recording | Required for compliance |
| IP-Based Access | VPC + Security Group | Never use public endpoints |
| API Key Management | Periodic rotation | 90-day cycle recommended |
Conclusion: 7 Things Every Operator Must Remember
- Target 30-50 GiB shard size -- Too small means metadata overhead, too large means recovery delays.
- Set up ISM proactively -- Apply it when indexes are created, not after the disk fills up.
- Monitor in layers -- Cluster status, then JVM/CPU, then thread pool rejections, then slow logs.
- 3 dedicated master nodes are non-negotiable -- The only way to prevent split brain at its source.
- Snapshot daily, test restoration quarterly -- A backup that has never been restored is not a backup.
- dynamic: strict should be the default -- Prevention is the only answer for mapping explosion. The only remediation is reindex.
- Always watch disk watermarks -- Keep usage below 75%; by default shard allocation stops at 85% and indexes go read-only at the 95% flood stage. ISM auto-deletion is the safety net that keeps you away from them.
References
- OpenSearch - Choosing the number of shards (AWS)
- OpenSearch - Operational best practices (AWS)
- OpenSearch - Index State Management
- OpenSearch - ISM Policies
- OpenSearch - Index templates
- OpenSearch - Data streams
- OpenSearch - Tuning for indexing speed
- OpenSearch - Alerting
- OpenSearch - Security configuration
- AWS - Recommended CloudWatch alarms for OpenSearch
- AWS - Sizing OpenSearch Service domains
- AWS - Multi-tier storage for OpenSearch
- AWS - Cross-cluster replication
- AWS - Migrating to Amazon OpenSearch Service
- OpenSearch - Migrate or upgrade
- OpenSearch - Remote reindex
- Benchmarking OpenSearch and Elasticsearch - Trail of Bits (2025)