Chaos and Order

Introduction — A Bank Day Does Not End at Midnight

In the small hours, after branches close and mobile app users fall asleep, a bank data center goes through its busiest time of the day. Thousands of batch jobs flow along a dependency graph, computing interest, rolling loans into delinquency, producing reports, and closing the ledger. This whole process is called end-of-day (EOD) processing.

A question I hear often: in this real-time era, why still batch? This article starts with that answer and then draws the full picture of banking batch architecture: the scope of EOD processing, job schedulers and dependency management, high-volume techniques, the coexistence of online and batch, failure recovery, and modernization trends.

This article is a technical resource from a system architecture perspective, not advice on any institution's internal implementation or regulatory interpretation.

Why Batch Remains Central in Banking

Batch does not survive merely because of technical debt. Some workloads are inherently batch.

1. **Work that needs a reference point in time**: Interest accrual, asset classification, and reports must apply uniformly to all accounts as of a specific moment. With an ever-flowing stream and no reference point, you cannot even define "interest based on today's closing balance."

2. **Regulatory reporting is daily-grained**: Most supervisory reports, central bank submissions, and deposit insurance outputs require business-day snapshots.

3. **The structure of reconciliation and settlement**: Settlement with external institutions (clearing houses, card networks, securities depositories) is a protocol of file exchange and daily finalization — batch by nature.

4. **Processing efficiency**: Accruing interest for ten million accounts can be tens of times more efficient as set-based bulk processing than as individual online transactions, because commit and round-trip overhead is paid once per chunk rather than once per account.

In other words, batch versus real-time is not an either-or choice. It is a division of roles: **work that needs a reference point goes to batch; work that needs immediacy goes online.**

The Scope of EOD — What Happens at Night

The representative workloads processed in the EOD window are as follows.

| Workload | Description | Characteristics |
| --- | --- | --- |
| Interest accrual | Daily interest calculation and accumulation per deposit/loan account | All accounts, compute-intensive |
| Value date processing | Final attribution of after-cutoff transactions to business days | Tied to date cut-over |
| Delinquency rollover | Transition of overdue loans to delinquent status, penalty interest start | State machine, regulation-dependent |
| Asset quality classification | Reclassification into normal/precautionary/substandard etc. | Full loan book re-evaluation |
| Maturity processing | Term deposit maturity payout/rollover, loan due-date handling | Diverse product rules |
| Fees and settlement | Daily fee aggregation, settlement file generation for external bodies | External interfaces |
| GL posting | Aggregation of the transaction ledger into the General Ledger | Balanced-entry gate |
| Reports and DW load | Regulatory report generation, data warehouse ETL | Runs after close is finalized |

The key point is that these workloads have **ordering dependencies**. Today's transactions must be finalized before interest accrual; asset classification follows delinquency rollover; GL posting is possible only after all ledger movements are done.

Batch Architecture — Schedulers and the Dependency DAG

No human can run thousands of jobs in order, so a job scheduler launches jobs along a dependency graph (DAG).

[EOD dependency DAG (simplified)]

    Online cutoff declared
            │
            ▼
    Finalize today's transactions ──┬───────────────────┐
            │                       │                   │
            ▼                       ▼                   ▼
    Interest accrual         Fee aggregation   Maturity processing
            │                       │                   │
            ▼                       │                   │
    Delinquency rollover ◀──────────┘                   │
            │                                           │
            ▼                                           │
    Asset quality classification ◀──────────────────────┘
            │
            ▼
    Ledger verification (balanced entries) ── gate: halt everything on failure
            │
            ▼
    GL posting ──▶ Report generation ──▶ DW load
            │
            ▼
    Date cut-over ──▶ Online resumes
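
To make "launch jobs along a dependency graph" concrete, here is a minimal sketch of how a scheduler could derive a safe execution order from declared precedence relations, using a standard topological sort (Kahn's algorithm). The class name `EodDag`, the method `executionOrder`, and the job names are hypothetical and used only for illustration; real schedulers add calendars, restart state, and authorization on top of this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EodDag {
    /**
     * Orders jobs so that each one appears only after all of its upstream jobs.
     * The map key is a job name; the value lists the jobs it waits for.
     */
    static List<String> executionOrder(Map<String, List<String>> upstreamOf) {
        Map<String, Integer> unmet = new LinkedHashMap<>();
        Map<String, List<String>> downstreamOf = new HashMap<>();
        for (Map.Entry<String, List<String>> e : upstreamOf.entrySet()) {
            unmet.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                downstreamOf.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        unmet.forEach((job, count) -> { if (count == 0) ready.add(job); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job);
            for (String next : downstreamOf.getOrDefault(job, List.of())) {
                // A job becomes ready only when its last upstream dependency has finished.
                if (unmet.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != unmet.size()) {
            throw new IllegalStateException("Cycle or undeclared upstream job in DAG");
        }
        return order;
    }
}
```

Because a job is released only when its count of unmet dependencies reaches zero, a job whose upstream never completes is never started, which is the behavior the ledger verification gate relies on.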

Traditionally, banks have used commercial enterprise schedulers (Control-M, TWS/IWS, AutoSys, JobScheduler families, and so on), and more shops now adopt open source such as Airflow. A comparison of perspectives:

| Perspective | Commercial schedulers | Airflow and other open source |
| --- | --- | --- |
| Dependency model | Job/calendar centric, operator-friendly GUI | Code-based DAGs, developer-friendly |
| Business-day calendar | Strong built-in holiday/business-day calendars | Build yourself or use plugins |
| Restart and exception handling | Easy immediate intervention from ops console | Task-level retry, UI intervention possible |
| Audit and authorization | Mature controls at the level finance demands | RBAC possible but needs configuration |
| Cost and scaling | High license cost | Free license, requires ops capability |

Either way, the non-negotiable requirements are the same.

- **Business-day calendar**: Holidays, ad hoc closures, and month-end/quarter-end branching must be expressible in schedule definitions.

- **Checkpoints and restart**: When a job fails midway, it must restart from the failure point, not from the beginning.

- **Enforced precedence**: Downstream jobs must not run without upstream success, and forced starts must only be possible under authorization control with an audit record.
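
As an illustration of the business-day calendar requirement, the following sketch shows the kind of questions a schedule definition has to answer: is this a business day, what is the next one, and is today the last business day of the month. `BusinessCalendar` and its methods are hypothetical names, the weekend is assumed to be Saturday and Sunday, and the holiday set stands in for whatever calendar source an institution actually uses.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Set;

final class BusinessCalendar {
    private final Set<LocalDate> holidays;

    BusinessCalendar(Set<LocalDate> holidays) {
        this.holidays = Set.copyOf(holidays);
    }

    /** Weekdays that are not in the holiday set count as business days. */
    boolean isBusinessDay(LocalDate date) {
        DayOfWeek day = date.getDayOfWeek();
        return day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY && !holidays.contains(date);
    }

    /** First business day strictly after the given date. */
    LocalDate nextBusinessDay(LocalDate date) {
        LocalDate next = date.plusDays(1);
        while (!isBusinessDay(next)) {
            next = next.plusDays(1);
        }
        return next;
    }

    /** True when the date is the last business day of its month (a trigger for month-end branching). */
    boolean isLastBusinessDayOfMonth(LocalDate date) {
        return isBusinessDay(date) && nextBusinessDay(date).getMonth() != date.getMonth();
    }
}
```

With an ad hoc closure registered for Monday 2026-06-15, `nextBusinessDay` for Friday 2026-06-12 skips the weekend and the closure and lands on Tuesday 2026-06-16.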

High-Volume Techniques — Partitioning, Parallelism, Chunks

Three fundamentals apply to any batch that processes tens of millions of rows.

1. **Partitioning**: Split the workload into non-overlapping ranges (account number ranges, hashes, branches) and run them in parallel.

2. **Chunk processing**: Instead of committing per record, read-process-write in chunks of N records before each commit, reducing transaction overhead.

3. **Prefer set-based operations**: Use SQL bulk operations (INSERT SELECT, bulk UPDATE) wherever possible instead of row-by-row loops.

The skeleton of an interest accrual job in Spring Batch looks like this.

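    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.job.builder.JobBuilder;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.JdbcPagingItemReader;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.stereotype.Component;
    import org.springframework.transaction.PlatformTransactionManager;

    // Account, AccrualResult, and InterestCalculator are application types (not shown).
    @Configuration
    public class InterestAccrualJobConfig {

        @Bean
        public Job interestAccrualJob(JobRepository jobRepository, Step partitionedStep) {
            return new JobBuilder("interestAccrualJob", jobRepository)
                    .start(partitionedStep)
                    .build();
        }

        // Partition by account number range and run gridSize partitions in parallel
        @Bean
        public Step partitionedStep(JobRepository jobRepository,
                                    Step accrualStep,
                                    AccountRangePartitioner partitioner,
                                    TaskExecutor taskExecutor) {
            return new StepBuilder("partitionedAccrualStep", jobRepository)
                    .partitioner("accrualStep", partitioner)
                    .step(accrualStep)
                    .gridSize(8)
                    .taskExecutor(taskExecutor)
                    .build();
        }

        // Chunk-oriented processing: commit after reading/calculating/writing 1000 records.
        // The reader bean (not shown) is assumed to be @StepScope so that it can bind
        // minAccountNo/maxAccountNo from the partition's step execution context.
        @Bean
        public Step accrualStep(JobRepository jobRepository,
                                PlatformTransactionManager txManager,
                                JdbcPagingItemReader<Account> reader,
                                InterestCalculator processor,
                                JdbcBatchItemWriter<AccrualResult> writer) {
            return new StepBuilder("accrualStep", jobRepository)
                    .<Account, AccrualResult>chunk(1000, txManager)
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    .faultTolerant()
                    .skipLimit(0) // financial batch: fail and investigate, do not skip
                    .build();
        }
    }

    // AccountRangePartitioner.java: splits the account number space into gridSize even ranges
    @Component
    public class AccountRangePartitioner implements Partitioner {

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            Map<String, ExecutionContext> result = new HashMap<>();
            // Hard-coded bounds for illustration; in practice read them from the data
            long min = 100_000_000L, max = 999_999_999L;
            long size = (max - min) / gridSize + 1;
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putLong("minAccountNo", min + size * i);
                // Clamp the last range so it never runs past max
                ctx.putLong("maxAccountNo", Math.min(min + size * (i + 1) - 1, max));
                result.put("partition" + i, ctx);
            }
            return result;
        }
    }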

Design points:

- **skipLimit(0)**: In general data pipelines you skip bad records, but skipping an account in interest accrual means that account's interest disappears. The principle for core financial jobs is: stop on failure and investigate.

- **JdbcPagingItemReader**: Cursor-based readers can be hard to reposition on restart in some environments. A paging reader ordered by a unique sort key issues an independent query per page, so it resumes from a well-defined position after a restart.

- **Partition boundaries should reflect data distribution**: If account numbers cluster by branch, even-range splitting becomes unbalanced. Consider hash splitting or statistics-driven splitting.
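
The last point can be sketched as follows: instead of cutting the account number space into equal-width ranges, derive the range boundaries from a sorted sample of actual account numbers so that each partition holds roughly the same number of accounts. `BalancedPartitions` and `upperBounds` are hypothetical names, and the sketch assumes the sample is sorted ascending, non-empty, and free of duplicates.

```java
final class BalancedPartitions {
    /**
     * Returns one inclusive upper bound per partition, chosen so that each
     * partition covers about the same number of sampled accounts.
     */
    static long[] upperBounds(long[] sortedAccountNos, int gridSize) {
        int n = sortedAccountNos.length;
        if (n == 0 || gridSize < 1) {
            throw new IllegalArgumentException("need a non-empty sample and gridSize >= 1");
        }
        long[] bounds = new long[gridSize];
        for (int i = 0; i < gridSize; i++) {
            // Index of the last sampled account that belongs to partition i.
            int lastIndex = (int) ((long) n * (i + 1) / gridSize) - 1;
            bounds[i] = sortedAccountNos[Math.max(lastIndex, 0)];
        }
        return bounds;
    }
}
```

For the clustered sample `{100, 101, 102, 103, 104, 105, 900, 901}` and two partitions, an even-range split at the midpoint would put six accounts in one partition and two in the other, whereas the boundaries computed here (103 and 901) give four accounts each.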

Coexistence of EOD and Online — Center-Cut and Date Cut-Over

The central challenge of the 24-hour banking era is: transfers keep arriving during close. The mechanisms that handle this are center-cut processing and the date cut-over sequence.

Center-cut is a batch-driven approach that pushes high volumes through the online transaction path. For example, a payroll file of 100,000 credits is received as a batch, and internally the system fires online transfer transactions at high speed. Because the online path is reused, limit checks and balance checks remain in one place.

Date cut-over proceeds roughly in the following sequence.

[Date cut-over sequence (simplified)]

    23:50  Pre-cutoff notice        : announce restrictions on some channels
    23:55  Start queueing new txns  : arrivals from now go to next-day value queue
    00:00  Logical date switch      : system business day D → D+1
      │
      ├─ Queued transactions: resume online processing with value date D+1
      │  (the account ledger stays live: zero-downtime switch)
      │
    00:05  Snapshot of day-D finalized txns → EOD batch starts
    02:00  Ledger verification and GL posting complete
    04:00  Reports and DW load complete, EOD declared finished

Important design decisions:

- **Zero-downtime or not**: In the past, online was taken down during close; today the standard is a non-stop close that switches only the value date while online stays up. This requires ledger reads and writes to be separated by business day (the business_date design from the previous article pays off here).

- **A consistent snapshot at cutoff**: Interest must be computed on balances as of D 23:59:59 while online keeps running; this is solved with a logical snapshot that sums only postings with value date D.

- **SLA for queued transactions**: Transactions queued during the switch must be processed within minutes, and queue depth must be monitored.
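
The logical snapshot idea can be illustrated with a small in-memory sketch: closing balances for business day D are the opening balances plus only those postings whose value date is D, so postings that online keeps writing with value date D+1 never affect the result. The `LogicalSnapshot` class and the `Posting` record are hypothetical; a real ledger would do the equivalent filtering in SQL against the posting table.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LogicalSnapshot {
    /** One ledger movement; valueDate is the business day it belongs to. */
    record Posting(String accountId, LocalDate valueDate, BigDecimal amount) {}

    /**
     * Closing balances for businessDate: opening balances plus only the postings
     * whose value date equals businessDate. Postings for other days are ignored.
     */
    static Map<String, BigDecimal> closingBalances(Map<String, BigDecimal> opening,
                                                   List<Posting> postings,
                                                   LocalDate businessDate) {
        Map<String, BigDecimal> closing = new HashMap<>(opening);
        for (Posting posting : postings) {
            if (posting.valueDate().equals(businessDate)) {
                closing.merge(posting.accountId(), posting.amount(), BigDecimal::add);
            }
        }
        return closing;
    }
}
```

The design choice worth noting is that the filter is on the value date, not on the wall-clock arrival time, which is what lets the batch read a stable picture without pausing online writes.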

Batch Performance Optimization — Indexes and Bulk Operations

Most batch performance problems come from a batch full-scanning a schema designed for online.

- **Reads**: Use batch-specific covering indexes or partition pruning. With date-partitioned posting tables, read only today's partition.

- **Writes**: Use batch INSERT (JDBC batch, COPY) instead of row-by-row INSERT. For massive UPDATEs, building results in a temporary table and merging is often faster.

- **Index maintenance cost**: Secondary indexes on tables loaded with millions of rows slow loading dramatically. For nightly bulk-load tables, minimize indexes or rebuild them after loading.

    -- Set-based bulk interest accrual instead of a row-by-row loop (example)
    INSERT INTO interest_accruals (account_id, business_date, accrual_amount)
    SELECT b.account_id,
           DATE '2026-06-13',
           ROUND(b.balance * r.daily_rate, 0)
    FROM eod_balance_snapshot b
    JOIN product_rates r ON r.product_code = b.product_code
    WHERE b.snapshot_date = DATE '2026-06-13'
      AND b.balance > 0;

However, the more set-based you go, the weaker the traceability of "why did this account get this amount." Storing the calculation basis (applied rate, reference balance) alongside the result helps greatly with audits.

Failure Handling — Partial Reprocessing and Idempotent Design

Half of batch design is building a restart button the operator can press with confidence when a job dies at 2 a.m.

Three principles:

1. **Idempotent re-execution**: Running the same job with the same parameters twice must yield the same result. For interest accrual, explicitly choose either "delete today's accruals, then recompute" or "block duplicates with an account-plus-date unique constraint."

2. **Checkpoint-based partial reprocessing**: Chunk commit points are checkpoints, so completed chunks are skipped on restart. In Spring Batch, the JobRepository persists this state so that a restarted job execution resumes after the last committed chunk.

3. **Corrections go through correction jobs**: The moment an operator fixes data with ad hoc SQL, idempotency and audit trails break. Corrections must also run as jobs that leave records.

    -- Idempotency pattern: account+date unique constraint plus cleanup on re-run
    ALTER TABLE interest_accruals
      ADD CONSTRAINT uq_accrual UNIQUE (account_id, business_date);

    -- Restart pre-step: delete today's unfinalized accruals (finalized flag protected)
    DELETE FROM interest_accruals
    WHERE business_date = DATE '2026-06-13'
      AND finalized = false;

Jobs interfacing with external institutions are trickier still. If a settlement file was already sent but the job was recorded as failed, a re-run could send the file twice. Split transmission into create-verify-send-confirm stages, with an idempotency key (file name plus date) specifically for the send stage.
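
A minimal sketch of the send-stage idempotency key, assuming the key is simply the file name combined with the business date. `SettlementSender` and `sendOnce` are hypothetical names, and the in-memory set stands in for a durable table of confirmed transmissions that a real job would consult.

```java
import java.time.LocalDate;
import java.util.HashSet;
import java.util.Set;

final class SettlementSender {
    // Stand-in for a durable table of confirmed transmissions (assumption: in-memory here).
    private final Set<String> confirmed = new HashSet<>();

    /** Runs transmit at most once per (fileName, businessDate); returns false on a repeat. */
    boolean sendOnce(String fileName, LocalDate businessDate, Runnable transmit) {
        String key = fileName + "|" + businessDate;
        if (confirmed.contains(key)) {
            return false; // already sent and confirmed: a re-run must not resend
        }
        transmit.run();      // send stage
        confirmed.add(key);  // confirm stage: recorded only after the send returned normally
        return true;
    }
}
```

This guard alone does not cover a crash between the send and the confirmation record, so it is usually paired with passing the same key to the counterparty, or with a status inquiry before resending.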

Monitoring — Batch SLAs and Delay Alerts

EOD has an absolute deadline: complete before the morning opening of business. For monitoring, **delay prediction** matters more than failure detection.

- **Critical path management**: Identify the longest dependency path in the DAG and compare start/end times of jobs on the path against baselines.

- **Stage SLAs**: Set window SLAs such as "interest accrual starts by 01:30, ends by 02:30," and page on breach.

- **Throughput trends**: If the same job is twice as slow as yesterday, suspect data growth, stale optimizer statistics, or plan changes. Accumulate per-job record counts and durations as time series.

- **Upstream data validation**: External files arriving late or with zero records are a common failure cause. Make "file arrived plus record count sanity" a start condition of the job.
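
Delay prediction can start from something as simple as a linear projection: if a job has processed a known fraction of its input in a known time, estimate the total duration and compare the projected finish with the stage deadline. The sketch below assumes throughput stays constant for the rest of the run, which is a simplification; `DelayPredictor` and its methods are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalDateTime;

final class DelayPredictor {
    /** Projects total run time, assuming remaining records take as long per record as those done so far. */
    static Duration projectedTotal(Duration elapsed, long processed, long total) {
        if (processed <= 0 || total < processed) {
            throw new IllegalArgumentException("need 0 < processed <= total");
        }
        return Duration.ofMillis(elapsed.toMillis() * total / processed);
    }

    /** True when the projected finish time falls after the stage deadline. */
    static boolean willBreach(LocalDateTime started, Duration elapsed,
                              long processed, long total, LocalDateTime deadline) {
        return started.plus(projectedTotal(elapsed, processed, total)).isAfter(deadline);
    }
}
```

For a job that is 55% done after 60 minutes, the projection is about 109 minutes in total, so an alert can fire well before the job actually overruns its window.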

Batch Modernization — Near-Real-Time and Cloud

Modernization does not mean turning all batch into real-time. Realistic directions are:

1. **Separate calculation from finalization**: Pre-compute interest incrementally during online hours (near-real-time aggregation) and perform only the finalization at close, shrinking the EOD window.

2. **Event-driven transitions**: Work with clear triggers, like delinquency rollover (due date passed), can move from a once-a-day batch to events. Consistency with regulatory reporting reference points still requires daily finalization, though.

3. **Batch on the cloud**: Moving the execution platform to Kubernetes Job/CronJob, AWS Batch, or managed Airflow buys elastic parallelism (4 nodes normally, 16 at month-end). In exchange, business-day calendars, restart controls, and data adjacency (network latency to the database) become your design problems.

4. **There is no accounting without a close**: In any architecture, the accounting requirement to finalize numbers as of a point in time does not go away. The goal of modernization is not eliminating the close but shortening and stabilizing the close window.

Testing — You Cannot Trust a Batch Without Volume

Batches that behave at ten thousand records die at ten million. The core of test strategy is production-scale data.

- **Synthetic data generators**: Build generators that mimic production distributions (account counts, product mix, transaction density) and fill performance environments. Copying production data requires masking and de-identification controls, so synthetic generation is cleaner from the start.

- **Boundary-day testing**: Month-end, quarter-end, year-end, leap day February 29, and long-holiday sequences (the first business day after a holiday cluster) deserve dedicated scenarios. Interest accrual on the day after holidays accumulates for the holiday days, so results differ from a normal weekday.

- **Restart rehearsals**: Periodically kill a job on purpose mid-run and verify the restart result is sound — no double accrual, no gaps.
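
The boundary-day point about holidays can be made concrete with a small worked example. The sketch assumes simple (non-compounding) daily interest, one accrual run per business day that covers every calendar day since the previous run, and a single rounding to whole currency units at the end; actual day-count and rounding rules are product-specific. `BoundaryDayAccrual` is a hypothetical name.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class BoundaryDayAccrual {
    /** Calendar days of interest owed when the job runs for businessDate after previousBusinessDate. */
    static long daysToAccrue(LocalDate previousBusinessDate, LocalDate businessDate) {
        long days = ChronoUnit.DAYS.between(previousBusinessDate, businessDate);
        if (days <= 0) {
            throw new IllegalArgumentException("businessDate must follow previousBusinessDate");
        }
        return days;
    }

    /** Simple interest for the given number of days, rounded half-up to whole units. */
    static BigDecimal interest(BigDecimal balance, BigDecimal dailyRate, long days) {
        return balance.multiply(dailyRate)
                .multiply(BigDecimal.valueOf(days))
                .setScale(0, RoundingMode.HALF_UP);
    }
}
```

Running for Monday 2026-03-02 after Friday 2026-02-27 accrues three days, while the same Feb 28 to Mar 1 step accrues two days in leap year 2028 and one day in 2027, which is exactly the kind of difference a dedicated boundary-day scenario should assert.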

An Operations Scenario — Responding to a Delayed Close

Let me close the main body with a fictional but typical incident scenario.

[Scenario: interest accrual job delayed]

    01:40  Accrual job normally takes 40 min → at 60 min, only 55% done
    01:45  Delay alert fires. On-call assesses critical path impact
           → at this rate GL posting lands at 04:30; will miss opening
    01:50  Root cause: new product launched yesterday grew target
           accounts by 30% + stale statistics changed the plan
           (index scan → full scan)
    02:00  Decision: stop job → refresh statistics → checkpoint restart
           (parallelism 8 → 12, within a pre-validated ceiling)
    02:10  Restarted. Completed partitions skipped; only remainder runs
    03:05  Accrual done, downstream jobs proceed automatically
    04:10  EOD completes normally. Follow-up: add batch impact review
           to the product launch checklist; promote the statistics
           refresh job into the upstream sequence

What this scenario shows is that incident response capability ultimately comes from everyday design — checkpoints, idempotent restarts, adjustable parallelism, and critical path visibility.

Resource Contention Between Batch and Online — Isolation Strategies

In a non-stop close, batch and online compete for the same database. Without isolation, a heavy batch makes online response times spike, and in the worst case lock waits time out channel transactions.

Practical isolation tools, in escalating order:

| Tool | Description | Where it applies |

| --- | --- | --- |

| Temporal isolation | Schedule windows away from online peaks | The baseline, but limited with 24-hour channels |

| Resource isolation | Dedicated batch connection pools, session caps | Prevents DB connection exhaustion |

| Read separation | Run read-only batches on read replicas | Reports, DW loads |

| Lock minimization | Short transactions, chunk commits, optimistic processing | Ledger-updating batches |

| Priority control | Demote batch sessions via DB resource managers | Protection when batch spills into peak hours |

Two patterns deserve special caution.

1. **Long-running batch transactions**: Processing a million rows in one transaction explodes undo pressure and lock-hold time. Chunk commits are an online-protection device before they are a performance technique.

2. **Batches touching the same rows as online**: If interest accrual updates account rows that online transfers also update, design row lock ordering and lock timeouts explicitly. Give the batch side short timeouts plus retries, so the batch never becomes the long lock holder.
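
The "short timeouts plus retries" advice implies a retry loop with a hard ceiling, so that the batch gives up and surfaces the failure rather than waiting indefinitely. Below is a minimal sketch; `BoundedRetry` is a hypothetical helper, the `retryable` predicate is where a real job would recognize its database's lock-timeout exception, and the linear backoff is just one possible policy.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class BoundedRetry {
    /** Runs action, retrying retryable failures up to maxAttempts with linear backoff. */
    static <T> T run(Supplier<T> action, Predicate<RuntimeException> retryable,
                     int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Give up at the ceiling or on failures that retrying cannot fix.
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                sleep(backoffMillis * attempt);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

Each attempt is expected to run in its own short transaction, so a failed attempt releases its locks before the backoff rather than holding them across retries.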

Job Design Standards — What Makes a Batch Operable

To operate thousands of jobs, **consistency of standards** matters more than the quality of any single job. The skeleton of a job design standard every organization should have:

- **Naming convention**: A system-domain-frequency-sequence shape (for example DEP-INTACC-D-010) so the job name alone reveals ownership and frequency.

- **Parameter standard**: Every job takes the business date as an explicit parameter. A job that reads "today" from the system clock is a job that cannot be reprocessed.

- **Exit code convention**: Unify a code scheme the scheduler can branch on automatically — success (0), completed with warnings (4), restartable failure (8), failure needing intervention (16).

- **Log standard**: Emit structured logs with start/end times, input counts, processed counts, skip counts, and the business date. A mismatch between processed count and input count is itself alert-worthy.

[Job execution summary log example (structured)]

    job=DEP-INTACC-D-010 business_date=2026-06-13
    status=COMPLETED exit_code=0
    read_count=10482917 write_count=10482917 skip_count=0
    started=01:12:03 ended=01:54:41 duration=42m38s
    partition_count=8 restart=false

When these five lines accumulate daily, they become the source data for throughput trend analysis and delay prediction. Without this standard, every incident burns time decoding a different log format per job.
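
The parameter, exit code, and log standards can be sketched together in a few lines. Everything named here is hypothetical: the `--business-date=` argument format, the `JobStandards` helper, and in particular the mapping from record counts to exit codes, which is one plausible policy rather than a prescribed one. The exit code values themselves follow the convention listed above.

```java
import java.time.LocalDate;

enum ExitCode {
    SUCCESS(0), WARNING(4), RESTARTABLE_FAILURE(8), NEEDS_INTERVENTION(16);

    final int code;

    ExitCode(int code) {
        this.code = code;
    }
}

final class JobStandards {
    /** The business date must be passed in; there is deliberately no fallback to LocalDate.now(). */
    static LocalDate requireBusinessDate(String[] args) {
        String prefix = "--business-date=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                return LocalDate.parse(arg.substring(prefix.length()));
            }
        }
        throw new IllegalArgumentException("--business-date=YYYY-MM-DD is required");
    }

    /** Assumed policy: unexplained count mismatch needs a human; skips downgrade to a warning. */
    static ExitCode fromCounts(long readCount, long writeCount, long skipCount) {
        if (readCount != writeCount + skipCount) {
            return ExitCode.NEEDS_INTERVENTION;
        }
        return skipCount > 0 ? ExitCode.WARNING : ExitCode.SUCCESS;
    }

    /** One key=value summary line that a log pipeline can parse without per-job rules. */
    static String summaryLine(String job, LocalDate businessDate, ExitCode exit,
                              long readCount, long writeCount, long skipCount) {
        return String.format(
                "job=%s business_date=%s exit_code=%d read_count=%d write_count=%d skip_count=%d",
                job, businessDate, exit.code, readCount, writeCount, skipCount);
    }
}
```

A job entry point would call `requireBusinessDate` first and terminate with the resulting `ExitCode.code`, so the scheduler can branch on the number alone.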

Beyond EOD — Month-End, Quarter-End, Year-End

Larger closing cycles stack on top of EOD. Month-end (EOM) runs monthly interest settlement, fee billing, and monthly reports; quarter-end finalizes asset quality and provisioning; year-end (EOY) performs annual closing and carry-forward.

Days where cycles overlap — December 31 stacks EOD, EOM, quarter-end, and EOY all at once — are the tricky ones to design.

- **Window sizing**: Total duration on overlap days can be two to three times a normal weekday. Size the window for the worst-case combination day; never promise SLAs based on normal weekdays.

- **Inter-cycle dependencies**: Month-end presumes the successful completion of the last EOD of the month. You need a gate that automatically holds EOM when EOD fails.

- **Year-end freeze**: Many institutions run deployment freezes around year-end closing. Jobs that execute only at year-end get one verification opportunity per year, so running mock year-end closes in a rehearsal environment during the year is the only way to prevent accidents.

[Stacked closing cycles (example: December 31)]

    EOD (daily close)
     └─ EOM (month-end)      : December interest settlement, monthly reports
         └─ EOQ (quarter-end) : asset classification final, provisions
             └─ EOY (year-end) : annual closing, carry-forward, annual reports

    Execution order: EOD → EOM → EOQ → EOY (each gates the next)
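
The stacking and gating rules can be expressed in a short sketch. It makes several simplifying assumptions that are not from the article: the fiscal year equals the calendar year, the closing date is the calendar month-end (holidays are ignored), and each cycle reports success or failure as a boolean. `ClosingCycles` and its methods are hypothetical names.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ClosingCycles {
    /** Cycles due on the given date, in execution order. */
    static List<String> cyclesFor(LocalDate businessDate) {
        List<String> cycles = new ArrayList<>();
        cycles.add("EOD");
        boolean monthEnd = YearMonth.from(businessDate).atEndOfMonth().equals(businessDate);
        if (monthEnd) {
            cycles.add("EOM");
            int month = businessDate.getMonthValue();
            if (month % 3 == 0) cycles.add("EOQ");
            if (month == 12) cycles.add("EOY");
        }
        return cycles;
    }

    /** Runs cycles in order and stops at the first failure; returns the cycles that completed. */
    static List<String> runGated(List<String> cycles, Predicate<String> runCycle) {
        List<String> completed = new ArrayList<>();
        for (String cycle : cycles) {
            if (!runCycle.test(cycle)) {
                break; // a failed cycle holds everything stacked on top of it
            }
            completed.add(cycle);
        }
        return completed;
    }
}
```

On 2026-12-31 all four cycles are due, and if EOM fails, `runGated` reports only EOD as completed, leaving EOQ and EOY held until EOM is resolved.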

Pitfalls and Anti-Patterns

Finally, a collection of anti-patterns repeatedly observed in banking batch.

1. **System clock dependence**: Jobs deriving the business date from the current time internally. They can never be re-run for yesterday. The business date is always a parameter.

2. **Implicit dependencies**: Time-based dependencies like "job A usually ends at 1:00, so start job B at 1:30." On the day A is late, B processes empty data. Dependencies must be declared as precedence relations.

3. **Non-restartable intermediate state**: A job that TRUNCATEs then loads a staging table dies midway, and a restart begins from a half-empty table. Check that state is self-contained at every step boundary.

4. **Infinite retries**: Infinite retry on external institution failures drains your own queues and threads when the counterparty is down. Set retry ceilings and a handover point to manual intervention.

5. **Alert floods**: Sending every job failure to the same channel buries the truly critical alerts. Separate alert severities for critical path jobs versus ordinary jobs.

6. **The "run once" correction script**: Unrecorded one-off SQL fixes come back as reconciliation breaks at the next close. Corrections are jobs too, with records.

7. **Small data in test environments**: A performance test passed with ten thousand rows guarantees nothing. Execution plans change with data volume.

Design Checklist

- [ ] Is the full EOD dependency set documented and codified as a DAG?

- [ ] Does the schedule definition reflect the business-day calendar (holidays, month-end, quarter-end)?

- [ ] Is every job safe (idempotent) when re-run with the same parameters?

- [ ] Is completed-work skipping on checkpoint restart verified?

- [ ] Are skip policies set per workload nature (no skipping for core financial jobs)?

- [ ] Do partition boundaries reflect data distribution?

- [ ] Do transaction queueing and value-date separation work during date cut-over?

- [ ] Are close verifications (balanced entries, count reconciliation) gates for downstream jobs?

- [ ] Are external file arrival and count sanity checks start conditions?

- [ ] Are critical path tracking, stage SLAs, and delay alerts in operation?

- [ ] Are per-job throughput and duration accumulated as time series?

- [ ] Are performance and restart rehearsals run with production-scale synthetic data?

- [ ] Are boundary days (month-end, year-end, leap day, holiday clusters) included in tests?

- [ ] Are recorded correction jobs used instead of manual data fixes?

Closing Thoughts

Banking batch is not legacy technology; it is the most efficient form of implementing the accounting requirement to finalize numbers as of a reference point. The essence of good batch architecture is not a fancy framework — it is explicit dependencies, idempotent restarts that assume failure, and a restart button the 2 a.m. operator can press with confidence. If you want to shrink the EOD window, do not try to remove batch; first consider moving calculation earlier and leaving only finalization in the close.

References

- Spring Batch documentation: https://docs.spring.io/spring-batch/reference/

- Spring Batch project: https://spring.io/projects/spring-batch

- Apache Airflow documentation: https://airflow.apache.org/docs/

- Kubernetes Job documentation: https://kubernetes.io/docs/concepts/workloads/controllers/job/

- Kubernetes CronJob documentation: https://kubernetes.io/docs/concepts/workloads/controllers/cron-job/

- AWS Batch documentation: https://docs.aws.amazon.com/batch/

- PostgreSQL documentation (Partitioning): https://www.postgresql.org/docs/current/ddl-partitioning.html

- Korea Financial Telecommunications and Clearings Institute: https://www.kftc.or.kr/

- Financial Supervisory Service (Korea): https://www.fss.or.kr/

- BIS (Bank for International Settlements): https://www.bis.org/

- JCP — Batch Applications for the Java Platform (JSR 352): https://jcp.org/en/jsr/detail?id=352