Forem: TimechoDB

Apache IoTDB in Connected Vehicle Management: Scaling Telemetry for Millions

TimechoDB — Thu, 16 Apr 2026 06:32:00 +0000

Modern connected vehicle platforms generate massive, high-frequency telemetry under variable connectivity, creating unique challenges for real-time ingestion, millisecond-level per-vehicle queries, and fleet-wide analytics.

Building on our previous articles, "Powering Intelligent Transportation with Apache IoTDB: Managing Time-Series Data at Scale" and "Apache IoTDB in Urban Rail Operations and Maintenance—Use Cases and Technical Deep Dive", this article presents use cases from connected vehicle management and shows how IoTDB's time-series–native architecture and TsFile compression support millions of vehicles while reducing infrastructure footprint and operational complexity.

The Scale Challenge in Connected Vehicles

Managing 1.6 million vehicles, 800,000 concurrently active, producing 20 TB/day, introduces unique technical challenges:

High concurrency writes: Millions of vehicles transmit telemetry simultaneously, often unpredictably.
Per-vehicle queries: Remote diagnostics require millisecond-level access to individual vehicle data.
Fleet-wide analytics: Aggregations across millions of vehicles must complete in seconds, not hours.
Variable connectivity: Data arrives out-of-order due to network gaps, tunnels, and parking garages.
Surge capacity: Holidays and events trigger acute traffic spikes that must be absorbed without service degradation.

Safety, compliance, and business-critical decisions all depend on reliable, real-time telemetry ingestion and querying.

Case 1: Changan Automobile—570,000 Vehicles, 1 IoTDB Instance

Background

Changan's connected vehicle platform supports real-time driver assistance, remote diagnostics, and predictive maintenance. Previously, HBase required 25 nodes to handle ingestion and queries for 80 million measurement points across 150 million time-series.

Migration to IoTDB

Single-node deployment replaced 25 HBase nodes.
Real-time write: Tens of millions of data points per second sustained.
Query latency: Reduced from minutes to milliseconds for per-vehicle time-range scans.
Latest-value retrieval: Millisecond responses from in-memory buffers.
Compression efficiency: TsFile reduces storage and I/O by 10–30×, lowering infrastructure needs.

Why it works

IoTDB's columnar format stores each measurement channel independently, enabling high-throughput writes and per-vehicle queries without scanning irrelevant data. Combined with TsFile compression, it significantly reduces hardware and operational overhead.
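The idea of keeping each measurement channel in its own time-ordered column can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the `ColumnStore` class and the series names are hypothetical, and nothing here reflects IoTDB's actual storage engine.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class ColumnStore:
    """Toy per-series store: each measurement channel keeps its own sorted list."""

    def __init__(self):
        # series path -> list of (timestamp, value), kept sorted by timestamp
        self._series = defaultdict(list)

    def write(self, path, timestamp, value):
        # insort keeps the list ordered even when a point arrives late
        insort(self._series[path], (timestamp, value))

    def range_query(self, path, start, end):
        """Return points with start <= timestamp < end, reading one series only."""
        points = self._series[path]
        lo = bisect_left(points, (start,))
        hi = bisect_left(points, (end,))
        return points[lo:hi]


store = ColumnStore()
store.write("vehicle_001.speed", 1000, 52.0)
store.write("vehicle_001.speed", 3000, 55.5)
store.write("vehicle_001.speed", 2000, 54.0)  # late arrival, still lands in order
store.write("vehicle_002.speed", 1500, 80.0)  # never read by the query below

result = store.range_query("vehicle_001.speed", 1000, 3000)
print(result)
```

The query for one vehicle binary-searches a single sorted column and never touches data belonging to other vehicles, which is the property the paragraph above describes.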

Case 2: AutoAI—1.6 Million Vehicles, 20 TB/day

Background

The platform supports Toyota driving behavior analytics. Beyond raw telemetry, it performs fleet-wide pattern analysis, computes driving safety scores, and produces regulatory reports. The previous deployment used HBase with heavy application-layer logic.

Key Results after IoTDB Migration

Infrastructure: Cost reduced to 25–33% of the previous HBase deployment.
Storage: Cut to 1/10 of prior footprint.
Peak throughput: 2M points/sec sustained during commute and holiday peaks.
Fleet-wide analytics: Trailing 15–30 minute queries over 1.6M vehicles now complete in seconds.
Operational simplicity: Ingestion and query separation prevents write spikes from slowing analytics; cluster scales dynamically without downtime.

Architectural highlights

Path-based schema: supports per-vehicle, regional, system-level, and cross-fleet queries efficiently.
Out-of-order writes: IoTDB inserts late-arriving data correctly without application-layer buffering.
Sensor-aware compression: Delta encoding for monotonic values, run-length for binary signals.
Analytics integration: Direct access via JDBC/SQL, Spark/Flink, REST, and Kafka pipelines.
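The two encodings named under sensor-aware compression can be sketched in a few lines. This is a generic illustration of delta and run-length encoding, not TsFile's actual encoder implementations; the sample odometer and door signals are made up.

```python
def delta_encode(values):
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]


odometer = [120000, 120003, 120007, 120007, 120012]  # monotonic counter
door_open = [0, 0, 0, 1, 1, 0, 0, 0, 0]              # binary signal

odometer_deltas = delta_encode(odometer)
door_runs = rle_encode(door_open)
print(odometer_deltas, door_runs)
```

Monotonic counters turn into small differences that need few bits, and binary signals collapse into a handful of runs, which is why matching the encoding to the sensor type pays off.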

Comparative Insights: Connected Vehicles vs. Urban Rail

Aspect	Connected Vehicles	Urban Rail
Write patterns	Variable-frequency, unpredictable telemetry from millions of moving vehicles	Fixed-route, predictable telemetry from trains on known schedules
Query patterns	Recent-window fleet analytics and millisecond per-vehicle lookups	Deep historical maintenance queries over months/years
Infrastructure impact	More dramatic server consolidation due to HBase inefficiency with high-cardinality per-sensor queries	Moderate consolidation; edge + central clusters suffice
Connectivity handling	Must tolerate intermittent connectivity, out-of-order data	Edge synchronization handles occasional connectivity gaps
Platform flexibility	Same IoTDB platform supports both workloads without architectural compromise	Same IoTDB platform supports both workloads without architectural compromise

The same IoTDB platform accommodates both domains without architectural compromise.

Summary

Apache IoTDB enables real-time ingestion, efficient columnar storage, and low-latency per-vehicle and fleet-wide queries at production scale. Infrastructure is reduced, operational complexity lowered, and analytics accelerated.

Its architecture scales with increasing vehicle count, sensor density, and analytics demands without requiring fundamental re-engineering of the data layer.

Apache IoTDB in Urban Rail Operations and Maintenance—Use Cases and Technical Deep Dive

TimechoDB — Tue, 14 Apr 2026 03:31:00 +0000

The Data Intensity of Modern Rail Systems

Urban rail operations produce telemetry at extreme scale. A single metro train typically carries hundreds of sensors monitoring traction, braking, doors, HVAC, wheel wear, pantograph force, and other subsystems. At fleet scale—hundreds of trains plus trackside and station systems—the data volume becomes operationally significant.

In one representative deployment, the platform ingests 414 billion data points per day from a single metro management system. At this scale, data infrastructure directly affects reliability. Delayed ingestion can hide early fault signals; slow queries reduce operational visibility during incidents; inefficient storage drives unsustainable infrastructure costs. These constraints define the database requirements for rail O&M platforms.

This article analyzes how Apache IoTDB addresses these challenges across three production deployments.

We previously discussed the differences between Apache IoTDB as a time-series database and traditional databases. This article focuses on real-world application scenarios. If you need background context, please refer to: "Apache IoTDB for Intelligent Transportation — Architecture, Core Capabilities, and Industry Fit".

The Urban Rail Data Problem

High Measurement Density

A typical train exposes 1,000–5,000 measurement channels. Across a 300-train fleet, that expands to 300,000–1.5 million active time series, quickly reaching petabyte-scale raw volumes without compression.

Mixed Real-Time and Historical Workloads

Rail O&M systems must handle:

Continuous high-frequency ingestion
Real-time latest-value queries
Long-range historical analysis
Sliding-window anomaly detection

Many general-purpose databases optimize for only one of these patterns, leading to performance trade-offs at scale.
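Of the workload patterns listed above, sliding-window anomaly detection is the easiest to make concrete. The following is a minimal sketch using a trailing-window z-score; the window size, threshold, and sample readings are assumptions for illustration, and production detectors are typically more elaborate.

```python
from collections import deque
from statistics import mean, pstdev


def sliding_window_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations of that window."""
    recent = deque(maxlen=window)
    flagged = []
    for i, v in enumerate(values):
        if len(recent) == window:  # only score once the window is full
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                flagged.append(i)
        recent.append(v)
    return flagged


readings = [20.0, 20.1, 19.9, 20.2, 20.0, 20.1, 35.0, 20.0, 19.9]
anomalies = sliding_window_anomalies(readings)
print(anomalies)
```

Each point is compared only against the points that preceded it, so the same logic can run on a live stream or be replayed over historical data.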

Complex Metadata Hierarchies

Rail telemetry carries structured context: train ID, car position, subsystem, sensor type, installation location, and maintenance lineage. Maintaining consistency across millions of series becomes operationally expensive in loosely coupled architectures.

Long Retention with Tiered Access

Operational and regulatory requirements typically mandate multi-year retention:

Hot data (≤30 days): frequently queried
Warm data (30 days–2 years): periodic analysis
Cold data (>2 years): compliance access

Efficiently serving all tiers without manual migration is a core requirement.
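The tier boundaries above translate directly into an age-based rule. This sketch only classifies a data point by age; the function name is hypothetical, "2 years" is approximated as 730 days, and how a database actually places data on storage tiers is outside its scope.

```python
from datetime import datetime, timedelta

HOT_MAX_AGE = timedelta(days=30)
WARM_MAX_AGE = timedelta(days=2 * 365)  # "2 years" approximated as 730 days


def storage_tier(point_time, now):
    """Classify a data point as hot, warm, or cold by its age."""
    age = now - point_time
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"


now = datetime(2026, 4, 14)
tiers = [
    storage_tier(datetime(2026, 4, 1), now),
    storage_tier(datetime(2025, 6, 1), now),
    storage_tier(datetime(2023, 1, 1), now),
]
print(tiers)
```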

Case 1: CRRC Sifang—Fleet-Scale Intelligent O&M

Background

CRRC Sifang operates intelligent maintenance platforms for metro fleets, enabling condition-based maintenance and fault diagnostics. The previous stack—KairosDB—began to show limits in storage efficiency, metadata management, and write/query latency as scale increased.

Deployment Scale

300 metro trains under management
Nearly 1 million active measurement points
414 billion data points per day
Multi-year retention requirement

Average sustained ingestion reaches approximately 4.8 million points per second, with higher bursts during operational peaks.
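The per-second figure follows from the daily volume by simple division. The short calculation below reproduces it; the per-series rate additionally assumes that "nearly 1 million" measurement points is rounded to exactly 1 million.

```python
POINTS_PER_DAY = 414_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingestion rate implied by the daily volume
avg_points_per_second = POINTS_PER_DAY / SECONDS_PER_DAY

# Assumption: "nearly 1 million" active measurement points, rounded up
ACTIVE_SERIES = 1_000_000
avg_rate_per_series = avg_points_per_second / ACTIVE_SERIES

print(f"{avg_points_per_second:,.0f} points/s overall")
print(f"{avg_rate_per_series:.2f} points/s per series")
```

The result is roughly 4.8 million points per second overall, or a little under 5 samples per second for each measurement point on average.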

Why the Migration Happened

The team faced three growing pressures:

Storage costs rising faster than fleet growth
Query and write latency increasing with data volume
Metadata management requiring manual intervention

Results After Migrating to IoTDB

Schema-aligned metadata: IoTDB's hierarchical model (network → line → train → car → subsystem → sensor) matches rail topology directly. Metadata becomes schema-native, removing external synchronization overhead.
Write efficiency and infrastructure reduction: IoTDB sustained full ingestion volume while reducing the deployment from 9 servers to 1, significantly lowering operational complexity.
Storage compression: Three-year storage footprint dropped from 200 TB to 16 TB (a 92% reduction), driven by time-series–optimized TsFile compression.
Query responsiveness: Sampling latency improved by 60%, managed train capacity doubled on the same infrastructure, and monthly incremental data volume fell by 95%.
Operational impact: The platform can now expand monitoring coverage without proportional infrastructure growth, improving the economics of large-scale fleet observability.
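The hierarchy-aligned metadata model mentioned in the results can be pictured as dot-separated paths that are selected with wildcards. The sketch below is illustrative only: the path names are hypothetical, and the wildcard semantics (`*` for exactly one level, `**` for one or more levels) are assumptions of this example rather than a statement of IoTDB's query syntax.

```python
def matches(pattern, path):
    """Match dot-separated paths: '*' is one level, '**' is one or more levels."""
    return _match(pattern.split("."), path.split("."))


def _match(pat, segs):
    if not pat:
        return not segs
    head, rest = pat[0], pat[1:]
    if head == "**":
        # consume at least one level, then try every possible split point
        return any(_match(rest, segs[i:]) for i in range(1, len(segs) + 1))
    if not segs:
        return False
    if head == "*" or head == segs[0]:
        return _match(rest, segs[1:])
    return False


# network.line.train.car.subsystem.sensor under a common root (names are made up)
series = [
    "root.metro.line1.train07.car2.brake.pressure",
    "root.metro.line1.train07.car3.brake.pressure",
    "root.metro.line1.train07.car2.door.state",
    "root.metro.line2.train11.car1.brake.pressure",
]

train07_brakes = [s for s in series
                  if matches("root.metro.line1.train07.*.brake.pressure", s)]
all_brakes = [s for s in series if matches("root.metro.**.brake.pressure", s)]
print(len(train07_brakes), len(all_brakes))
```

Because the topology is encoded in the path itself, a per-train query and a network-wide query differ only in the pattern, with no external metadata lookup.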

Case 2: Metro Automation Platform—Replacing Cassandra in Cloud Signaling

Background

This deployment supports a cloud-based metro automation and signaling system spanning multiple stations with dual data centers. The workload combines sustained high-throughput ingestion with strict query latency requirements.

The previous architecture used Apache Cassandra. While write throughput was acceptable, time-range aggregation queries and resource efficiency became bottlenecks.

Deployment Characteristics

Dozens of fully instrumented stations
Active-active dual data centers
Sustained million-level read/write throughput
Mixed real-time and historical queries

Why the Traditional Database No Longer Fit

Cassandra's denormalization model increases storage overhead and operational complexity for time-series workloads that require flexible temporal aggregation. In addition, the lack of native time-series compression causes storage costs to scale roughly linearly with data volume.

IoTDB Results

After migration:

Query performance improved by 120%
Resource consumption reduced by 60% (CPU, memory, I/O)
Million-level throughput sustained without additional horizontal expansion

For signaling systems, reduced query latency directly improves control-loop responsiveness.

Toward Cloud-Based Train Control

The platform is extending IoTDB into cloud signaling workloads with stricter latency and availability requirements. IoTDB's distributed cluster architecture and automatic failover align well with the platform's dual–data center topology, enabling high availability without manual intervention.

Case 3: Deutsche Bahn—Fuel Cell Monitoring for Rail Infrastructure

Background

The Deutsche Bahn BZ-NEA project modernizes backup power systems at railway facilities using hydrogen fuel cells. These electrochemical systems require continuous, high-resolution monitoring across multiple interacting parameters.

Operational Requirements

The platform must support:

Compliance with safety regulations
Safe operation of battery systems
Real-time query performance
Real-time anomaly detection

Fault conditions can escalate rapidly, making second-level telemetry and low-latency queries essential.

Why IoTDB Was Selected

Safety and compliance readiness: The monitoring platform required strict data integrity and availability guarantees. IoTDB's open-source transparency and configurable replication model supported compliance validation.
Real-time visibility: Second-level ingestion combined with millisecond query response enables early fault detection.
Built-in support for anomaly detection workloads: The system runs anomaly detection directly against IoTDB, using both real-time streams and historical baselines through a unified query path.

Industry Implication

This deployment demonstrates that IoTDB's applicability extends beyond rolling stock telemetry into broader rail infrastructure monitoring scenarios with similar data characteristics.

Key Architectural Takeaways

Across these deployments, several consistent design patterns emerge:

Edge-to-central ingestion enables reliable data collection despite intermittent connectivity.
Hierarchy-aligned schema design simplifies fleet-scale queries without denormalization.
Native tiered storage supports multi-year retention with minimal operational overhead.
Ecosystem integration allows the same data platform to serve both real-time and batch analytics.

Summary

Apache IoTDB proves highly effective in urban rail operations, supporting real-time writes, efficient storage, and low-latency queries. Its time-series–native design scales operationally without extra infrastructure, making it ideal for modern rail O&M systems.

The next article explores connected vehicle applications, applying the same principles to a different domain.

Stay tuned!

Timer-S1 Released: The First Billion-Scale Time Series Foundation Model Achieving SOTA Performance

TimechoDB — Mon, 13 Apr 2026 02:28:00 +0000

Introduction of Timer

As AI continues to permeate industrial systems, the role of time series data has evolved beyond basic querying and analytics toward more advanced tasks such as equipment state forecasting and intelligent imputation of missing data. Achieving high-precision forecasting in these scenarios increasingly depends on foundation models that are purpose-built for time series characteristics.

However, unlike text, images, or video, time series data presents unique challenges: high variability, stochasticity, and complex temporal dependencies. These factors significantly limit the generalization and scalability of traditional models. As a result, developing domain-specific foundation models for time series has become a central focus in both academia and industry.

To address these challenges, the research team from Tsinghua University, in collaboration with ByteDance, introduces Timer-S1, the latest advancement in the Timer model series (Timer 1.0–3.0). Timer-S1 is the first time series foundation model scaled to billions of parameters, with a context length of up to 11.5K time steps. It achieves state-of-the-art (SOTA) forecasting performance on the large-scale benchmark GIFT-Eval.

The accompanying paper, "Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling," presents the technical details. In this article, we break down the key innovations behind Timer-S1.

Core Challenges in Time Series Foundation Models

Building foundation models for time series involves several fundamental challenges:

Strong Data Heterogeneity

Time series data varies significantly across domains in terms of frequency, distribution, and structure. Capturing multi-scale dependencies in high-dimensional, often unstructured signals remains difficult.

Intrinsic Uncertainty

Real-world time series are typically non-stationary and stochastic. External factors and system dynamics can introduce abrupt distribution shifts, increasing prediction uncertainty.

Scalability Constraints

Scaling techniques widely used in large language models—such as Mixture-of-Experts (MoE)—do not directly translate well to time series, often leading to degraded performance. Balancing sequential dependency modeling with computational efficiency remains a bottleneck.

Training–Inference Gap

Autoregressive models align well with the sequential nature of time series, but suffer from high computational cost and error accumulation during iterative inference. On the other hand, parallel multi-step prediction improves efficiency but fails to capture long-term dependencies.

Timer-S1 is designed as a systematic response to these challenges.

Core Innovation of Timer-S1: The Serial Scaling Paradigm

The key contribution of Timer-S1 is the introduction of a serial scaling paradigm, which integrates the sequential nature of time series forecasting into three tightly coupled dimensions: model architecture, dataset construction, and training pipeline.

Architecture Design: Efficient Serial Forecasting

Timer-S1 is built on a decoder-only Transformer backbone, enhanced with two specialized modules:

TimeMoE Block

A sparse Mixture-of-Experts module tailored for global heterogeneity in time series. It dynamically routes different temporal patterns to specialized experts, enabling:

Scalable training up to 8.3 billion parameters
Improved training stability
Efficient inference despite large model size

TimeSTP Block (Serial Token Prediction)

The core innovation for sequential forecasting. TimeSTP captures temporal dependencies in sequential data and introduces progressive, step-wise prediction within a single forward pass:

Iteratively generates multi-step forecasts using historical inputs and intermediate representations
Eliminates the need for autoregressive rolling inference
Reduces error accumulation while significantly improving inference efficiency

Unified Forecasting Head

A shared quantile prediction head that supports multiple output formulations (e.g., linear projection, diffusion-based heads), ensuring architectural flexibility.
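Quantile outputs are commonly trained and scored with the pinball (quantile) loss. The article does not state which objective Timer-S1 uses, so the following is only a sketch of that standard loss, not the model's training code.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a single prediction at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q)
    return q * diff if diff >= 0 else (q - 1) * diff


# The same 2-unit error costs more or less depending on the quantile level
under_at_p90 = pinball_loss(10.0, 8.0, q=0.9)   # predicted too low
over_at_p90 = pinball_loss(10.0, 12.0, q=0.9)   # predicted too high
print(under_at_p90, over_at_p90)
```

At the 0.9 quantile, predicting too low is penalized nine times as heavily as predicting too high, which pushes the learned output toward the upper part of the distribution.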

Additional techniques used in the Timer-S1 model include:

Instance-wise re-normalization to handle scale variations across datasets
Patch-level tokenization, converting continuous time points into model-friendly tokens
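Both techniques can be sketched without any model code. The example below assumes per-series standardization (mean and standard deviation) and non-overlapping fixed-length patches; the patch length of 4 and the function names are hypothetical, and the actual Timer-S1 preprocessing may differ.

```python
from statistics import mean, pstdev


def instance_normalize(series, eps=1e-8):
    """Standardize one series by its own mean and standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    return [(x - mu) / (sigma + eps) for x in series], mu, sigma


def denormalize(values, mu, sigma, eps=1e-8):
    """Map model-space values back to the original scale."""
    return [v * (sigma + eps) + mu for v in values]


def to_patches(series, patch_len):
    """Split into non-overlapping patches, dropping any trailing remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]


raw = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0, 112.0, 114.0]
normalized, mu, sigma = instance_normalize(raw)
patches = to_patches(normalized, patch_len=4)  # each patch acts as one token
restored = denormalize(normalized, mu, sigma)
print(len(patches), mu)
```

Normalizing per instance removes scale differences between datasets, and the stored mean and deviation let forecasts be mapped back to the original units afterwards.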

Data Construction: A Trillion-Point Time Series Corpus

To support large-scale training, the team constructed TimeBench, a dataset containing 1 trillion time points.

Key features include:

Multi-Source Data Integration

Combines real-world datasets from finance, IoT, meteorology, and healthcare, along with public benchmarks (e.g., Chronos, LOTSA), and synthetic signals (linear, sinusoidal, exponential).

Strict Data Quality Control

Missing values handled via causal mean imputation
Outliers removed using sliding-window detection
Data leakage prevention through rigorous filtering
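The article does not define causal mean imputation precisely. One plausible reading, sketched below, fills each gap with the mean of the values observed before it, so no future information leaks into the filled point. Treat this as an assumption, not the team's exact procedure.

```python
def causal_mean_impute(series):
    """Replace None with the mean of all values observed earlier in the series.

    A gap before the first observation stays None, because there is no
    past information to draw on.
    """
    filled = []
    running_sum = 0.0
    count = 0
    for x in series:
        if x is None:
            filled.append(running_sum / count if count else None)
        else:
            running_sum += x
            count += 1
            filled.append(x)
    return filled


imputed = causal_mean_impute([1.0, 3.0, None, 5.0, None])
print(imputed)
```

Imputed values are not fed back into the running mean here, so only genuine observations influence later fills.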

Targeted Data Augmentation

Resampling
Value flipping

These techniques reduce prediction bias and improve generalization.

Data Complexity Evaluation

Each dataset is profiled using:

ADF stationarity tests
Spectral entropy-based predictability metrics

This results in a structured "complexity plane" for fine-grained dataset selection.
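Spectral entropy has a compact textbook definition: the Shannon entropy of the normalized power spectrum. The sketch below uses a naive discrete Fourier transform from the standard library and normalizes the result to the range 0 to 1. The article gives no formula, so this definition and the test signals are assumptions for illustration.

```python
import cmath
import math
import random


def spectral_entropy(series):
    """Normalized Shannon entropy of the power spectrum.

    Near 0 for a single pure tone (highly predictable), closer to 1 for a
    flat spectrum (noise-like). Assumes at least 4 samples.
    """
    n = len(series)
    mu = sum(series) / n
    centered = [x - mu for x in series]
    power = []
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0  # constant series: nothing to distribute
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


sine = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(64)]

sine_entropy = spectral_entropy(sine)
noise_entropy = spectral_entropy(noise)
print(sine_entropy < noise_entropy)
```

A low score indicates energy concentrated in a few frequencies, which is one way to place a dataset along the predictability axis of a complexity plane.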

Training Pipeline: Optimizing Short and Long-Term Forecasting

Timer-S1 adopts a multi-stage training strategy instead of a single-phase approach:

Pretraining

Dense supervision across varying input-output lengths using STP as the core objective, enabling strong representation learning and multi-step forecasting ability.

Continued Pretraining

Introduces a weighted STP objective to prioritize short-term accuracy (critical for long-horizon forecasting), combined with replay-based sampling to prevent overfitting.

Long-Context Extension

Using RoPE (Rotary Positional Embedding), the context length is extended from 2,880 to 11,520 time steps, significantly improving long-range dependency modeling.
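The extension from 2,880 to 11,520 steps is a fourfold increase in context. The article does not describe the extension procedure itself, so the sketch below only shows the basic rotary embedding and the property that makes it attractive for longer contexts: attention scores depend on the relative offset between positions, not on their absolute values. The base of 10000 is the commonly used default and is an assumption here.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset of 3 steps, early in the context and far beyond step 2,880
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 2883), rope(k, 2880))
print(abs(score_near - score_far) < 1e-9)
```

Both scores agree because each pair is rotated by an angle proportional to position, so the rotations cancel down to the difference between the two positions.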

The training system supports:

Billion-scale distributed training
Hybrid memory–disk data loading for efficient trillion-scale data access

Benchmark Results: SOTA on GIFT-Eval

Timer-S1 is evaluated on GIFT-Eval, a comprehensive benchmark with:

24 datasets
144,000 time series
177 million data points

Key Results:

Overall SOTA Performance

MASE ↓ 7.6% (Mean Absolute Scaled Error)
CRPS ↓ 13.2% (Continuous Ranked Probability Score)

Compared to Timer-3 (trained on the same TimeBench dataset), these results demonstrate the effectiveness of the serial scaling paradigm.
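For readers unfamiliar with MASE, the sketch below computes it for a single series: the forecast's mean absolute error divided by the in-sample error of a naive forecast that repeats the previous value. The numbers are invented, and GIFT-Eval's aggregation across datasets is not reproduced here.

```python
def mase(actual, forecast, history, season=1):
    """Mean Absolute Scaled Error for one series.

    The scale is the in-sample mean absolute error of a naive forecast that
    repeats the value from `season` steps earlier.
    """
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(history[t] - history[t - season])
                    for t in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


history = [10.0, 12.0, 11.0, 13.0, 12.0]
actual = [14.0, 13.0]
forecast = [13.0, 13.5]

score = mase(actual, forecast, history)
print(score)
```

A value below 1 means the forecast beats the naive baseline on average, and a relative drop in MASE between two models compares their errors on that same scale.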

Strong Mid- and Long-Horizon Forecasting

Performance gains are especially pronounced for longer prediction horizons, validating the effectiveness of serial modeling in capturing long-term dependencies.

Significant Gains from Post-Training

Multi-stage training (including pretraining, continued pretraining, and long-context extension) significantly outperforms single-stage pretraining, confirming the importance of decoupled optimization objectives.

STP vs. NTP/MTP

Under the same compute budget:

STP outperforms both Next Token Prediction (NTP) and Multi-Token Prediction (MTP)
Achieves better accuracy with lower inference latency

Ablation Studies: Core Components Are Essential

Extensive ablation experiments highlight the importance of each component:

TimeSTP Design Matters: Removing TimeSTP during inference or reverting to autoregressive rolling prediction leads to substantial performance degradation. The current design effectively narrows the gap between training and inference, adapting to the distribution characteristics of time series.

Data Augmentation is Critical: Eliminating augmentation strategies (resampling and value flipping) increases prediction bias and reduces generalization, which validates the necessity of data augmentation in alleviating time series distribution imbalance.

Pretraining Enables Transferability: Models trained from scratch perform significantly worse, demonstrating strong cross-task knowledge transfer from TimeBench.

Scaling Still Works: Optimal performance is achieved with 24 TimeMoE blocks + 16 TimeSTP blocks, confirming that billion-scale expansion continues to yield gains.

Conclusion

Timer-S1 represents a major step forward in scaling time series foundation models to the billion-parameter regime. Its serial scaling paradigm systematically integrates the sequential nature of time series into architecture, data, and training, offering a generalizable solution to long-standing scalability challenges.

With innovations such as TimeBench, multi-stage training, and the TimeSTP module, Timer-S1 provides a reusable technical framework for future research and industrial deployment.

The release of Timer-S1 is not an endpoint, but a new milestone. Continued advancements in generalization and real-world applicability will further unlock the potential of time series intelligence across industries.

Bringing Time-Series Forecasting into Apache IoTDB Database

TimechoDB — Wed, 08 Apr 2026 02:26:00 +0000

A Covariate Forecasting Framework with TimechoDB AINode

In industrial time-series forecasting scenarios, accurate trend prediction often serves as a critical foundation for operational decision-making. However, traditional univariate forecasting approaches struggle to fully capture the complexity of real-world systems.

Take electricity pricing as an example. Power prices are not determined solely by historical price patterns. They are also influenced by a variety of external factors, including temperature, wind speed, holidays, and energy supply structure. Similar multivariate dependencies exist across many industries, such as manufacturing, transportation, and energy systems.

As the scale and complexity of time-series data continue to grow, forecasting is no longer purely an algorithmic problem. Increasingly, it requires tight integration between data infrastructure and model capabilities.

In TimechoDB, we significantly enhanced the capabilities of AINode, the database's intelligent analysis node. The upgrade enables native deployment and inference of Transformer-based time-series models, while introducing a framework for covariate-aware forecasting tasks.

With this capability, users can integrate different types of time-series foundation models directly into the database, enabling a unified workflow that spans data management, model execution, and predictive analytics.

What Are Covariates and Covariate Forecasting?

To understand covariate forecasting, it is important to first clarify two core concepts.

Covariates

Covariates are variables that are strongly correlated with the target variable and can provide additional information useful for prediction.

For example, in electricity price forecasting:

Temperature
Wind speed
Holiday indicators
...

influence fluctuations in power prices. These variables can therefore be used as covariates during model training and inference.

Covariate Forecasting

Unlike univariate forecasting methods, covariate forecasting combines:

Historical data of the target variable
Historical data of covariates
Partially known future covariate values

to jointly model future trends.

Instead of relying on a single time series, the model learns from the dynamic relationships across multiple data dimensions, allowing it to better reflect the underlying behavior of real-world systems.

By incorporating covariate information, forecasting models can move beyond the limitations of single-signal prediction. In many industrial applications, this leads to significantly improved prediction accuracy and stability.

AINode: A Database-Native Intelligent Analysis Node

To better integrate forecasting capabilities with data infrastructure, TimechoDB introduced a major upgrade to AINode in version 2.0.8.

AINode is designed to transform the database from a pure data management system into a platform that can also host model deployment and inference workloads. This enables predictive analytics to run directly within the database environment.

In this release, AINode provides a unified model integration mechanism that allows the database to deploy a variety of Transformer-based time-series models, including:

Timer
Chronos
Moirai
...

Model training can still be performed outside the database for maximum flexibility. However, model deployment, inference execution, and task scheduling are centrally managed within the database.

With this architecture, forecasting tasks can directly:

Read data from the database
Invoke the prediction model
Generate forecast results

This eliminates the frequent data exports and system switching common in traditional forecasting pipelines.

As a result, the database evolves from a standalone data store into an infrastructure layer that enables collaboration between data and AI workloads.

Native Forecasting with SQL

In many forecasting tools, covariates must be manually passed as parameters. Users often need to input covariate values one by one in SQL queries or construct them via string concatenation.

This approach introduces several issues:

Complex operational workflows
Higher risk of parameter input errors
Limited integration with existing data query processes

TimechoDB optimizes this process by allowing covariate inputs to be directly queried from the database using SQL. This makes forecasting tasks a natural extension of standard data query workflows.

A typical covariate forecasting call looks like the following:

SELECT * FROM FORECAST (
    MODEL_ID => 'chronos2',
    TARGETS => (
        SELECT TIME, target1, target2
        FROM etth.tab_real
        WHERE TIME < 7
        ORDER BY TIME DESC
        LIMIT 6
    ),
    HISTORY_COVS => (
        SELECT TIME, cov1, cov2, cov3
        FROM etth.tab_real
        WHERE TIME < 7
        ORDER BY TIME DESC
        LIMIT 6
    ),
    FUTURE_COVS => (
        SELECT TIME, cov1
        FROM etth.tab_real
        WHERE TIME >= 7
        LIMIT 2
    ),
    OUTPUT_LENGTH => 2
);

With this approach:

Target variables and covariates both come directly from database queries
No manual parameter concatenation is required
Forecasting workflows become significantly easier to implement

This greatly lowers the operational barrier for deploying forecasting tasks.

Industrial Case Study: Covariate Forecasting for Electricity Prices

Covariate forecasting is not only a theoretical capability — it also delivers measurable benefits in real industrial scenarios.

In a real-world electricity price forecasting task, we validated the covariate forecasting framework under production conditions.

The goal was to predict electricity price trends. However, electricity prices are influenced by many interacting factors, including:

Meteorological conditions
Time-related patterns
Energy supply structures

In addition, extreme price spikes are notoriously difficult to predict.

During the modeling process, the business team initially identified more than 100 potential covariates. After multiple rounds of data cleaning and feature selection, over 20 key variables were retained for the final model.

These variables can be broadly grouped into three categories:

Time-related variables, such as date, weekday indicators, and holiday indicators, which capture periodic patterns in electricity demand.

Weather-related variables, including temperature, wind speed, precipitation, and cloud coverage, all of which can significantly influence energy consumption and renewable generation.

Energy-related variables, such as solar power generation, wind power output, and energy conversion efficiency, reflecting the supply-side dynamics of the energy system.

During forecasting, the model simultaneously consumes:

2,880 historical timestamps (~30 days) of target variables and covariates
160 future timestamps (~40 hours) of known covariate information
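The two window sizes imply a sampling interval that the text does not state explicitly. Treating "~30 days" as exactly 30 days, the arithmetic below infers a 15-minute step and checks that it is consistent with the 40-hour future window. The interval is an inference from these figures, not a documented setting.

```python
from datetime import timedelta

HISTORY_STEPS = 2880
FUTURE_STEPS = 160
HISTORY_SPAN = timedelta(days=30)  # "~30 days" taken as exactly 30 days

# Implied sampling interval, then the span covered by the future covariates
step = HISTORY_SPAN / HISTORY_STEPS
future_span = FUTURE_STEPS * step

print(step, future_span)
```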

We implemented a covariate-enhanced forecasting approach built on top of open-source time-series foundation models and compared it against several baseline models.

The experimental results show that in complex multivariate environments, incorporating covariate modeling allows prediction curves to capture real trend changes more accurately. The covariate-enhanced model outperformed baseline models in:

Peak prediction accuracy
Trend tracking
Overall stability

In particular, the FutureBoosting covariate-enhanced model was able to better align with actual series behavior during key trend transitions.

Multi-model forecasting comparison: the covariate-enhanced approach (FutureBoosting) aligns more closely with the ground-truth series, particularly during major trend shifts.

Conclusion

In TimechoDB v2.0.8, we introduced a major upgrade to AINode, enabling the database to deploy and run Transformer-based time-series models while providing a framework for covariate forecasting tasks.

With this capability, organizations can centrally manage:

Model deployment
Forecasting task scheduling
Data access

all within the database environment.

This architecture enables an integrated workflow that spans data management and intelligent analytics.

As time-series foundation models continue to evolve, the collaboration between databases and AI models will increasingly become a key direction for next-generation time-series data systems. Databases are gradually evolving into critical infrastructure for time-series AI applications.

Apache IoTDB for Intelligent Transportation — Architecture, Core Capabilities, and Industry Fit

TimechoDB — Tue, 07 Apr 2026 03:17:00 +0000

The Often-Overlooked Data Infrastructure Layer

When intelligent transportation is discussed, the focus typically falls on autonomous vehicles, smart signaling, and real-time routing. Rarely does attention turn to the data infrastructure layer that quietly sustains these systems—continuously ingesting millions of sensor readings per second, compacting years of telemetry into manageable storage, and serving operational queries in milliseconds while transportation systems operate at full speed.

Yet in production environments, this invisible layer often determines whether an intelligent transportation platform scales successfully.

Consider the data reality:

A modern metro system operating 300 trains can generate ~414 billion data points per day
A connected vehicle platform managing 1.6 million vehicles can produce ~20 TB of new telemetry every 24 hours

These are not traditional data warehousing workloads. They are high-cardinality, high-velocity time-series problems that require purpose-built infrastructure.

Apache IoTDB is designed for exactly this class of workload. This article examines what it is, why it fits transportation systems particularly well, and where it delivers the most operational value.

What Is Apache IoTDB?

Apache IoTDB (Internet of Things Database) is an open-source, high-performance and AI-ready time-series database under the Apache Software Foundation. It was originally engineered for industrial IoT environments characterized by extreme write throughput and long-term telemetry retention—conditions that closely mirror modern transportation systems.

At a systems level, IoTDB differentiates itself through three architectural principles.

Purpose-Built for Time-Series at Scale

IoTDB is not a general database retrofitted with time-series features. Its:

Data model
Indexing strategy
Query engine
Storage format

are all optimized for canonical time-series access patterns, including:

High-frequency sequential writes
Time-range scans
Long-window aggregations
...

This specialization eliminates much of the structural overhead seen in row-oriented or general distributed databases under fleet-scale workloads.

Native Columnar Storage: Apache TsFile

IoTDB uses Apache TsFile as its on-disk format, organizing data by measurement and time to maximize compression efficiency.

For transportation telemetry—where sensor values typically exhibit strong temporal locality—TsFile commonly achieves 10×–30× lossless compression in production environments. In real deployments, three-year storage footprints have been reduced from 200 TB to ~16 TB.

Edge-to-Cloud Native Architecture

Unlike databases designed primarily for centralized deployment, IoTDB was built with edge scenarios as a first-class requirement.

Edge nodes (vehicles, substations, vessels) accumulate data locally and synchronize compressed TsFile segments upstream. Compared with record-level replication, this approach can reduce bandwidth consumption by up to 90%—a material advantage in environments with intermittent or constrained connectivity.

Why Traditional Databases Struggle at Transportation Scale

Transportation telemetry exposes several structural weaknesses in non-specialized databases.

Row-oriented storage: Time-range queries against row stores incur significant I/O amplification because sensor histories are interleaved across rows. At fleet scale, this frequently translates into minute-level latency.
Generic distributed schemas: Many systems require substantial application-side modeling to represent hierarchical assets (fleet → vehicle → subsystem → sensor). Metadata management often becomes a bottleneck at million-series scale.
Inefficient time-series compression: Storage engines without time-series-aware encoding typically scale storage cost roughly linearly with data volume, which is economically unsustainable for multi-year telemetry retention.
Licensing and deployment flexibility: Some database licensing models impose constraints on self-hosted deployment, long-term cost predictability, or deep system customization. For transportation platforms that operate large, long-lived infrastructure systems, these limitations can introduce operational and architectural friction at scale.

These limitations consistently surface in production migrations toward purpose-built time-series infrastructure.

Core Technical Capabilities

High-Throughput Ingestion

IoTDB's write path is optimized for concurrent, high-frequency sensor ingestion. In production conditions, a single node can sustain tens of millions of data points per second, enabled by:

Memory-buffered ingestion
Batch-optimized flushing
Time-partitioned storage

For transportation platforms, this means hundreds of trains or hundreds of thousands of vehicles can be absorbed without write-side bottlenecks.

Millisecond-Level Query Performance

Transportation workloads typically fall into three query classes:

Latest-value queries: for example, the current speed or battery level of a vehicle. → Served from in-memory structures with sub-millisecond latency.
Time-range queries: for example, brake pressure between 08:00–09:30. → Executed efficiently via time-partitioned TsFile scans.
Aggregation queries: for example, fleet fuel consumption over 30 days. → Accelerated by columnar scan-and-aggregate execution.
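
The three query classes above can be illustrated with a toy in-memory model. This is a sketch only: the series names and values are invented, and it says nothing about how IoTDB executes these queries internally. It simply shows the access pattern each class implies.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical in-memory series: time-sorted (timestamp_s, value) pairs per sensor.
series = {
    "vehicle1.speed": [(0, 40.0), (60, 42.5), (120, 38.0), (180, 45.0)],
    "vehicle2.speed": [(0, 55.0), (60, 50.0), (120, 52.0), (180, 51.0)],
}

def latest_value(name):
    """Latest-value query: the most recent point of one series."""
    return series[name][-1]

def time_range(name, start, end):
    """Time-range query: points with start <= t < end, located by binary search."""
    points = series[name]
    lo = bisect_left(points, (start,))
    hi = bisect_left(points, (end,))
    return points[lo:hi]

def fleet_average(start, end):
    """Aggregation query: mean value over every series within the window."""
    return mean(v for name in series for _, v in time_range(name, start, end))

print(latest_value("vehicle1.speed"))
print(time_range("vehicle1.speed", 60, 180))
print(fleet_average(60, 180))
```

The latest-value lookup touches one point, the range query touches a contiguous slice, and the aggregation scans one slice per series, which is why the three classes benefit from different optimizations.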

Across multiple production deployments, workloads migrated from HBase or Cassandra have observed latency reductions from minutes to milliseconds.

Compression and Storage Efficiency

Transportation telemetry is highly compressible due to:

Temporal correlation within sensor streams
Bounded numeric ranges
Repetitive measurement patterns

TsFile leverages differential encoding, run-length encoding, and dictionary compression at the column level.

In practice, this yields 10×–30× smaller storage footprints, directly lowering infrastructure cost at fleet scale.
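
To see why these encodings suit telemetry, the sketch below chains a simple delta encoder with a run-length encoder. It is a simplified illustration of the two ideas, not TsFile's actual encoders; the function names and the sample timestamps are assumptions made for the example.

```python
def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into a flat list."""
    return [v for v, n in runs for _ in range(n)]

# Timestamps sampled every 1000 ms: the deltas are constant, so RLE collapses them.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(8)]
encoded = rle_encode(delta_encode(timestamps))
print(encoded)  # [[1700000000000, 1], [1000, 7]]
```

Eight timestamps shrink to two pairs, and the round trip through `rle_decode` and `delta_decode` recovers the input exactly, which is what makes the scheme lossless.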

Edge-to-Cloud Synchronization

IoTDB enables configurable synchronization strategies — from real-time record-level streaming to compressed TsFile-based batch replication — allowing transportation operators to balance latency, bandwidth efficiency, and network resilience.

This design delivers two operational advantages:

Bandwidth efficiency: Compressed TsFile transfer can reduce network usage by up to 90%.
Offline tolerance: If connectivity drops (tunnels, offshore zones), edge nodes continue buffering locally and resume sync automatically when the network returns—without application-side reconciliation logic.

High Availability Architecture

Distributed IoTDB clusters support:

Automatic failover
Load balancing
Rapid node recovery

For transportation systems where telemetry gaps can impact safety and compliance, these are baseline requirements rather than optional features.

Open Governance and Deployment Flexibility

IoTDB is governed under the Apache License 2.0, providing full source transparency and flexible self-hosted deployment. For large-scale transportation platforms that operate long-lived infrastructure systems, this model supports greater operational control and long-term maintainability.

Where IoTDB Fits in the Transportation Stack

IoTDB is data infrastructure, not an end-user application platform.

What IoTDB Handles

Telemetry ingestion from vehicles and infrastructure
Long-term compressed storage
High-performance time-series querying

What Typically Sits Above

Predictive maintenance systems
Anomaly detection pipelines
Visualization platforms (e.g., Grafana)
Big data processing (Spark, Flink, Hadoop)

What Sits Below

Onboard data collectors
IoT gateways
Network transport layers (5G, DSRC, satellite)

This positioning is important for correctly scoping IoTDB within complex transportation architectures.

Two Primary Transportation Use Domains

Urban Rail Operations and Maintenance

This domain emphasizes:

Equipment health monitoring
Predictive maintenance
Signal integration
Real-time operations

Production deployments include:

CRRC Sifang intelligent rail O&M platform
CityX Urban Construction Intelligent Control metro automation system
Deutsche Bahn fuel cell monitoring project

These environments commonly involve millions of measurement points and multi-year retention requirements.

Connected Vehicle Management

This domain features:

Geographically distributed fleets
Heterogeneous telemetry
Bursty peak loads
Mixed real-time and analytical queries

Representative deployments include:

Changan Automobile connected vehicle platform
AutoAI Toyota driving behavior system

Measurement cardinality typically reaches tens to hundreds of millions of time series.

Summary

Intelligent transportation systems run on time-series data infrastructure that is often overlooked but operationally decisive.

Apache IoTDB addresses the sector's most persistent data challenges through:

High-ratio TsFile compression
Edge-native synchronization
Millisecond query latency
Open and sovereign deployment model

The next two articles in this series examine how these capabilities translate into real-world outcomes in urban rail and connected vehicle platforms. Stay tuned!

Build smarter systems on a foundation that scales. Start exploring IoTDB today.

Time-Series Databases vs. Relational Databases: What Is the Difference?

TimechoDB — Mon, 06 Apr 2026 03:08:00 +0000

Introduction

Many teams default to relational databases because they are familiar and versatile. For business systems, that choice is often correct.

But when the workload shifts — from mutable business records to high-frequency telemetry streams — the database architecture begins to matter in very different ways.

Not all data problems are relational. And not all databases are designed for time.

Relational databases (RDBMS) power enterprise systems such as e-commerce platforms, logistics platforms, and ERP systems, thanks to their general-purpose modeling capabilities and strong transactional guarantees.

Time-series databases (TSDB), by contrast, are purpose-built for time-indexed data. They are widely used in industrial IoT, energy systems, observability platforms, monitoring infrastructures, and financial time-series analysis.

To understand when each is appropriate, we compare them across five architectural dimensions.

Transaction Mechanism: Essential vs. Often Secondary

Relational Databases: ACID Is Fundamental

Relational databases support ACID transactions, ensuring atomicity, consistency, isolation, and durability.

Consider a bank transfer:

Account A deducts $10
Account B credits $10

Both operations must either succeed together or fail together. If a system crash or network failure occurs mid-operation, the database must roll back to preserve consistency.

To achieve this in distributed systems, RDBMS engines maintain:

Write-ahead logs (WAL)
State tracking
Concurrency control mechanisms
Rollback and recovery protocols

Transactional integrity is a core requirement because business data is frequently modified and subject to concurrent updates.
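
The bank-transfer example above can be sketched with Python's built-in `sqlite3` module. The table layout, the `CHECK` constraint, and the `transfer` helper are assumptions made for illustration. The point is only that a failed step rolls back the whole transfer.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 15), ("B", 0)])
conn.commit()

ok_first = transfer(conn, "A", "B", 10)   # succeeds: A=5, B=10
ok_second = transfer(conn, "A", "B", 10)  # debit would make A negative: rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok_first, ok_second, balances)
```

In the second call the credit to B is applied first and then undone when the debit violates the constraint, so neither account changes.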

Time-Series Databases: Transactions Are Often Less Critical for Ingestion

In many industrial IoT ingestion workloads, data originates from sensors. Each record represents a real-world measurement at a specific timestamp (e.g., temperature, wind speed, voltage).

Typical characteristics:

Data is append-only
Each record is independent
No multi-row atomic updates
No write-write conflicts

In these workloads, heavy transactional coordination adds overhead without delivering proportional value.

TSDB systems therefore trade transactional complexity for ingestion scale — prioritizing high-throughput, stable streaming writes.

Write Patterns: Consistency-Centric vs. Throughput-Centric

Relational Databases: Strong Schema & Consistency

RDBMS typically stores:

Configuration data
Personnel records
Business entities
Financial transactions

Data is often entered via structured forms and must conform strictly to predefined schemas and constraints.

Because of transaction semantics:

Writes are grouped
Entire batches commit or roll back
Consistency is prioritized over raw throughput

This design is ideal for systems where correctness across related entities is critical.

Time-Series Databases: Extreme Write Throughput

Time-series workloads differ dramatically:

Data originates from sensors or devices
Device counts can range from thousands to millions
Sampling intervals may be seconds or milliseconds
Write rates can reach tens of millions of points per second

TSDB systems are engineered for:

High concurrency ingestion
High throughput
Out-of-order data handling

For example, Apache IoTDB leverages its underlying storage format Apache TsFile, enabling:

Columnar data ingestion
Millisecond-level data access
Out-of-order separation storage mechanisms for unstable network environments
Stable high-throughput ingestion in benchmark scenarios
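
The idea behind separating out-of-order data can be sketched as a writer that routes each point by comparing its timestamp with the newest timestamp already flushed. This is a simplified model under assumed names (`SeriesWriter`, `last_flushed_time`), not IoTDB's storage engine.

```python
class SeriesWriter:
    """Toy writer that keeps in-order and late-arriving points apart."""

    def __init__(self):
        self.last_flushed_time = -1   # newest timestamp already persisted in order
        self.sequence_buffer = []     # points newer than anything flushed
        self.unsequence_buffer = []   # late points that arrived after a flush passed them

    def write(self, timestamp, value):
        """Route a point to the in-order or the late buffer."""
        if timestamp > self.last_flushed_time:
            self.sequence_buffer.append((timestamp, value))
        else:
            self.unsequence_buffer.append((timestamp, value))

    def flush(self):
        """Persist the in-order buffer (sorted by time) and advance the watermark."""
        flushed = sorted(self.sequence_buffer)
        if flushed:
            self.last_flushed_time = flushed[-1][0]
        self.sequence_buffer = []
        return flushed

writer = SeriesWriter()
for t in (1, 2, 3):
    writer.write(t, 20.0 + t)
writer.flush()
writer.write(5, 25.0)   # newer than the watermark: sequence buffer
writer.write(2, 99.0)   # older than the watermark: unsequence buffer
```

Keeping late points in their own buffer means the in-order data stays append-only, so a delayed upload from a vehicle leaving a tunnel does not force already-written data to be rewritten.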

When data volume grows from gigabytes to multi-year telemetry archives, ingestion architecture becomes a scaling boundary — not just a performance metric.

Storage & Compression: General-Purpose vs. Time-Series-Optimized

Relational Databases: B+ Trees & Generic Compression

RDBMS storage engines typically use:

B+ tree indexing
Row-based or hybrid storage
Generic compression algorithms (LZ77, DEFLATE, etc.)

Compression is optional and tuned based on workload requirements. The storage format is optimized for multi-dimensional querying and transactional consistency.

Time-Series Databases: Time-Series-Optimized Storage

Time-series data exhibits structural properties that storage engines can exploit:

Strong temporal locality
Sequential append patterns
Small deltas between consecutive data points

These characteristics enable:

Columnar storage
Run-Length Encoding (RLE)
Delta encoding
Specialized compression algorithms
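
A small experiment shows why time-series-aware encoding matters even when a generic compressor is available. The sketch below compresses a run of regularly spaced timestamps with `zlib` (a DEFLATE implementation), once as raw 64-bit integers and once after delta encoding. The data is synthetic and the sizes depend on the input, so treat the result as indicative only.

```python
import struct
import zlib

# Hypothetical sensor timestamps: one reading per second for 10,000 seconds.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(10_000)]

# Delta encoding: first value, then the difference between consecutive values.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def pack(values):
    """Serialize integers as fixed-width little-endian 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)

raw_size = len(pack(timestamps))
generic_size = len(zlib.compress(pack(timestamps)))
delta_then_generic_size = len(zlib.compress(pack(deltas)))

print(raw_size, generic_size, delta_then_generic_size)
```

Because the deltas are all identical, the delta-encoded stream is far more repetitive than the raw one, and the generic compressor benefits from that structure.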

In IoTDB, the underlying format Apache TsFile provides:

Multi-dimensional indexing (device, sensor, timestamp)
Fast time-range filtering
5-10× query throughput improvement compared to generic formats
Up to 15× higher compression ratios

This significantly reduces storage footprint while improving I/O efficiency.

Query Patterns: Precise Retrieval vs. Time-Dimension Analytics

Relational Databases: Entity-Based Querying

Using SQL constructs such as:

SELECT: select target columns.
FROM: define the source table.
WHERE: set filtering conditions.

RDBMS excel at:

Precise filtering
Multi-table joins
Complex business logic queries
Foreign key relationships

The goal is accurate entity retrieval and relational consistency across structured datasets.

Time-Series Databases: Temporal Analysis at Scale

TSDB queries commonly involve:

Trend analysis over weeks, months, or years
Large-scale aggregation across hundreds of thousands of points
High-frequency dashboard refreshes (e.g., hundreds of metrics per second)

Users expect:

High query throughput
Efficient time-window filtering
Native time-series processing capabilities

IoTDB supports:

High-throughput time-range queries via TsFile
Downsampling for visualization efficiency
Nearly 100 built-in time-series processing functions, including segmentation, gap filling, and data repair.

Data Circulation: Centralized Management vs. Edge-Cloud Collaboration

Relational Databases: Platform-Centric Storage

RDBMS systems typically:

Store internal business data
Use proprietary storage formats
Serve centralized application workloads

Data migration often requires format conversion when systems evolve.

Time-Series Databases: Edge-Cloud Synchronization

Industrial IoT architectures frequently involve:

Devices → Edge nodes → Regional data centers → Central cloud platforms

Additional constraints may include:

Production network isolation
One-way data gateways
Bandwidth limitations

TSDB systems must optimize:

Efficient cross-terminal synchronization
Low-bandwidth replication
Minimal CPU overhead
Efficient file-based transfer

IoTDB addresses this through its unified TsFile format, which enables file-based data exchange and subscription-based synchronization. Compared with re-ingestion-based approaches, this reduces re-processing overhead, achieving up to 90% network bandwidth savings and 95% CPU savings on receiving nodes.

In distributed industrial systems, data mobility can be as important as raw database performance.

Conclusion

The differences between time-series databases and relational databases stem fundamentally from the nature of the data they serve:

Dimension	Relational Database	Time-Series Database
Data Model	Entity relationships	Time-indexed metrics
Transactions	Essential	Less central for ingestion-heavy workloads
Write Focus	Consistency	High throughput
Storage Compression	B+ Tree, generic compression	Time-optimized, columnar-oriented formats
Query Style	Multi-table precision	Large-scale temporal analytics
Data Flow	Application-centric	Edge-cloud collaboration

When choosing between a TSDB and an RDBMS, organizations should evaluate:

Data generation patterns (Is it time-correlated?)
Write throughput requirements
Query complexity
Edge-to-cloud synchronization needs
Infrastructure constraints

Selecting the correct database architecture is not merely a technical preference—it directly impacts scalability ceilings, infrastructure cost, and long-term operational efficiency.

This is not a feature comparison.

It is a workload architecture decision.

Scaling Time-Series Data: Partitioning, Replication and Backup in Apache IoTDB

TimechoDB — Fri, 03 Apr 2026 11:17:24 +0000

Understanding Partitioning, Replication and Backup in Apache IoTDB

With the rapid evolution of IT and OT technologies, time-series data has become a critical asset across industries such as manufacturing, energy and transportation. Applications including AI analytics, predictive maintenance, and anomaly detection rely heavily on the efficient storage and processing of time-series data.

However, managing massive time-series datasets introduces significant challenges in terms of storage scalability, query performance, and system reliability. To address these challenges, Apache IoTDB provides robust mechanisms for data partitioning, replication, and backup.

This article introduces how IoTDB implements these mechanisms and how they support large-scale industrial scenarios.

Characteristics of Time-Series Data

Time-series data has several unique characteristics compared with traditional transactional data.

Massive Number of Data Points

Industrial systems often contain an extremely large number of measurement points.

For example:

A large energy storage facility may deploy millions of sensors
Nationwide monitoring systems may contain tens of billions of measurement points
Connected vehicle platforms may collect billions of telemetry signals from vehicles on the road

These measurement points continuously generate data streams.

High Storage Cost

Industrial environments typically produce data at high frequency and high volume.

Examples include:

Ultra-large steel manufacturing equipment
Wind turbines in renewable energy plants

In these scenarios, data collection frequencies can be extremely high, and the total storage demand can easily reach petabyte scale.

Without efficient data organization mechanisms, managing such datasets becomes extremely difficult.

Data Partitioning in Apache IoTDB

What Is Data Partitioning

Data partitioning refers to dividing data into multiple segments according to defined rules so that each segment can be managed independently.

A simple analogy is a library:

Without partitioning, all books are stored randomly.
With partitioning, books are categorized and placed on different shelves.

This organization significantly improves data management efficiency and query performance. For time-series databases handling massive datasets, partitioning becomes a core architectural component.

Data Partitioning Mechanism in IoTDB

Apache IoTDB implements a two-dimensional partitioning strategy based on:

Series dimension
Time dimension

These correspond to Series Partition Slots and Time Partition Slots.

Series Partition Slot

Series partitioning is used to manage time series vertically.

By default:

Partitioning occurs at the database level
Each database contains 1,000 series partition slots

IoTDB uses a hash algorithm to map each time series to a specific partition slot.

This approach provides several benefits:

Efficient metadata management
Reduced memory mapping overhead
Better load distribution across nodes

This design is particularly important for scenarios involving hundreds of millions or billions of devices.
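
Hash-based slot mapping can be sketched in a few lines. The example below uses CRC32 purely as a deterministic stand-in, since the actual hash function is not specified here; the device paths and the `series_slot` helper are likewise hypothetical.

```python
import zlib

SERIES_PARTITION_SLOTS = 1000  # default slot count per database, as stated above

def series_slot(device_path: str, slot_count: int = SERIES_PARTITION_SLOTS) -> int:
    """Map a device path to a slot index in [0, slot_count).

    CRC32 is a stand-in chosen for determinism, not IoTDB's real hash.
    """
    return zlib.crc32(device_path.encode("utf-8")) % slot_count

devices = [f"root.fleet.vehicle{i}" for i in range(10_000)]
slots_used = {series_slot(d) for d in devices}
print(f"{len(devices)} devices mapped onto {len(slots_used)} slots")
```

The cluster then only needs to track where each of the 1,000 slots lives instead of tracking every individual series, which is where the metadata and memory savings come from.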

Time Partition Slot

Time partitioning manages time series horizontally. Data is divided into segments based on fixed time intervals. By default, each time partition covers 7 days of data.

This design improves query efficiency because:

Queries typically target specific time ranges
Only relevant partitions need to be scanned

As a result, IoTDB avoids unnecessary full-dataset scans.
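
Partition pruning follows directly from fixed-width intervals. The sketch below assumes millisecond timestamps and partitions aligned to timestamp 0; both are assumptions for illustration, and the helper names are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
TIME_PARTITION_INTERVAL_MS = 7 * MS_PER_DAY  # default interval stated above

def time_partition(timestamp_ms: int) -> int:
    """Index of the fixed-width time partition containing the timestamp."""
    return timestamp_ms // TIME_PARTITION_INTERVAL_MS

def partitions_for_range(start_ms: int, end_ms: int) -> range:
    """Partitions a query over [start_ms, end_ms] must scan; all others are skipped."""
    return range(time_partition(start_ms), time_partition(end_ms) + 1)

# A 10-day query window starting on day 3 touches only two 7-day partitions.
start = 3 * MS_PER_DAY
end = 13 * MS_PER_DAY - 1
needed = list(partitions_for_range(start, end))
print(needed)  # [0, 1]
```

However many partitions the database has accumulated over the years, this query reads only the two that overlap its window.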

Partition Distribution in an IoTDB Cluster

An IoTDB cluster contains two types of nodes:

ConfigNode

The ConfigNode is responsible for cluster management and coordination, including:

Metadata management
Partition allocation
Cluster configuration

DataNode

The DataNode handles actual data operations, including:

Data ingestion
Query processing
Storage management

Within each DataNode, data is organized into:

SchemaRegion
DataRegion

IoTDB distributes partitions across nodes using load balancing algorithms, ensuring that data and write workloads are evenly distributed across the cluster.

This architecture improves:

Storage scalability
Write throughput
Cluster stability

Partition Execution from Read and Write Perspectives

Write Workflow

When a client sends a write request:

The request can be sent to any node in the IoTDB cluster.
The node applies a load-balancing algorithm based on device_id.
The system determines the target DataNode.
The timestamp determines which time partition the data belongs to.

The data is then written to the corresponding DataRegion.

Query Workflow

When a query is executed:

The query request is sent to the cluster.
The query engine determines the target node using device_id.
The request is forwarded to the corresponding node.
The query engine scans only the relevant time partitions.

Because unrelated partitions are skipped, query performance is significantly improved.

Data Synchronization Mechanisms in IoTDB

IoTDB supports two types of synchronization:

Intra-cluster synchronization
Cross-cluster synchronization

Each serves different purposes.

Intra-Cluster Synchronization

Intra-cluster synchronization refers to data replication between nodes within the same cluster.

Its primary goal is to ensure:

High availability
Replica consistency

IoTDB supports two types of consensus protocols.

Strong Consistency: Ratis Protocol

IoTDB uses the Apache Ratis protocol to achieve strong consistency for:

ConfigNode metadata
Some partition operations

With strong consistency, a request is considered successful only after a majority of replicas confirm the update. This ensures strong data consistency but may introduce higher latency.

High-Performance Replication: IoTConsensus

For DataNode operations, IoTDB uses its own protocol called IoTConsensus. This protocol prioritizes write performance.

Workflow:

Data is first written to the local node
Replication to other nodes occurs asynchronously

This design significantly improves ingestion throughput, which is critical for industrial time-series workloads.

Replication Workflow

The replication process follows these steps:

The server receives a write request
The consensus layer processes the request
The request is delivered to the state machine
The state machine forwards the request to the DataRegion
The storage engine writes the data into the MemTable and the Write-Ahead Log (WAL)

A log distribution thread asynchronously replicates the write request to replica nodes.

If a replica node goes offline:

The leader records the synchronization progress
When the node recovers, synchronization resumes automatically

This ensures eventual consistency across replicas.
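
The workflow above (acknowledge after the local write, replicate in the background, resume from recorded progress) can be modeled with a toy leader and two followers. This is a conceptual sketch with invented class names, not the IoTConsensus implementation. It leaves out networking, threads, and failure detection.

```python
class Replica:
    """Toy follower holding its own copy of the log."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.online = True

class Leader:
    """Toy leader: acknowledge after the local write, replicate later."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas
        # Per-replica progress: index of the next log entry to send.
        self.next_index = {r.name: 0 for r in replicas}

    def write(self, entry):
        self.log.append(entry)   # local write (stands in for MemTable + WAL)
        return "ack"             # the client is acknowledged before replication

    def replicate(self):
        """One pass of the background log-distribution step."""
        for r in self.replicas:
            if not r.online:
                continue         # progress stays recorded; resume when it returns
            start = self.next_index[r.name]
            r.log.extend(self.log[start:])
            self.next_index[r.name] = len(self.log)

follower_a, follower_b = Replica("a"), Replica("b")
leader = Leader([follower_a, follower_b])

leader.write("p1")
follower_b.online = False
leader.write("p2")
leader.replicate()            # a receives p1 and p2; b receives nothing yet
lag_while_offline = len(leader.log) - len(follower_b.log)

follower_b.online = True
leader.replicate()            # b catches up from its recorded progress
```

The gap measured in `lag_while_offline` is the window in which replicas diverge; once the follower returns, it converges to the leader's log, which is the eventual-consistency behavior described above.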

Failover and High Availability

The intra-cluster consensus protocol enables automatic failover.

If the leader node fails:

A replica node is automatically promoted to leader
Read and write services continue without interruption

This mechanism ensures high service availability in production environments.

Cross-Cluster Synchronization

IoTDB also supports synchronization between different clusters. This capability is useful for scenarios such as:

Disaster recovery
Geo-redundant backup
Edge-cloud collaboration

IoTDB Streaming Framework

IoTDB provides a stream processing framework consisting of three stages:

Data Extraction
Data Processing
Data Delivery

Data Extraction

Defines which data should be extracted from IoTDB, including:

Measurement scope
Time range

Data Processing

Users can apply programmable processing logic, such as:

Removing outliers
Transforming data types
Filtering values

Data Delivery

Processed data can be sent to different destinations.

Users can implement custom logic using IoTDB’s standardized plugin framework, and the platform also provides built-in plugins.
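
The three stages compose naturally as a chain of generators. The sketch below is a conceptual model in plain Python with invented function names and sample data; it does not use IoTDB's plugin interfaces, which the article does not detail.

```python
def extract(points, prefix, start, end):
    """Extraction: select by measurement scope (path prefix) and time range."""
    for path, ts, value in points:
        if path.startswith(prefix) and start <= ts < end:
            yield path, ts, value

def process(points, low, high):
    """Processing: drop outliers outside [low, high] and cast values to float."""
    for path, ts, value in points:
        if low <= value <= high:
            yield path, ts, float(value)

def deliver(points, sink):
    """Delivery: hand each point to a destination; here, a plain list."""
    for point in points:
        sink.append(point)
    return sink

source = [
    ("root.plant.t1.temp", 1, 21),
    ("root.plant.t1.temp", 2, 950),   # outlier, removed in processing
    ("root.other.t9.temp", 2, 22),    # outside the measurement scope
    ("root.plant.t1.temp", 9, 23),    # outside the time range
]
delivered = deliver(process(extract(source, "root.plant.", 0, 5), 0, 100), [])
print(delivered)  # [('root.plant.t1.temp', 1, 21.0)]
```

Because each stage only consumes an iterator and yields points, any stage can be swapped for custom logic without touching the others, which mirrors the plugin-based design described above.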

Typical Use Cases

The IoTDB streaming framework enables many real-world scenarios.

Disaster Recovery

Data synchronization tasks can be created using simple SQL commands, enabling:

Cross-region disaster recovery
Real-time backup

Replication latency can be as low as milliseconds.

Real-Time Data Processing

The framework can also support:

Real-time alerts
Stream computing
Real-time aggregation
Data write-back

Cross-System Data Integration

IoTDB can integrate with external systems including:

Message queues
Apache Flink
Offline analytics pipelines

Common enterprise architectures frequently integrate IoTDB with data lakes or streaming platforms.

Frequently Asked Questions (FAQ)

When Should You Use the Ratis Protocol?

If your workload requires strict consistency and write throughput is not the primary concern, Ratis may be appropriate. However, IoTConsensus typically provides better write performance for large-scale ingestion.

Why Does IoTDB Use Series Partition Slots?

In scenarios such as energy storage systems or meteorological monitoring, the number of time series can be extremely large. Series partition slots reduce memory overhead by managing series through hash-based slot mapping.

Can IoTDB Support Cross-Network Gateway Transmission?

Yes. The streaming framework has already been adapted for common industrial gateways. Other gateways typically require only minimal integration work.

Cost-Efficient Storage Options

IoTDB supports tiered storage, allowing users to store:

Hot data on SSD
Warm data on HDD
Cold data on object storage such as Amazon S3

During queries, data stored in S3 can be retrieved transparently.
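
A tiering policy is often expressed as age thresholds. The sketch below shows the shape of such a rule; the thresholds and tier names are hypothetical and are not IoTDB configuration values.

```python
# Hypothetical age thresholds; real deployments would tune these.
HOT_DAYS = 7
WARM_DAYS = 90

def storage_tier(age_days: float) -> str:
    """Pick a storage tier from the age of a data file."""
    if age_days < HOT_DAYS:
        return "ssd"
    if age_days < WARM_DAYS:
        return "hdd"
    return "object_storage"

tiers = [storage_tier(d) for d in (1, 30, 400)]
print(tiers)  # ['ssd', 'hdd', 'object_storage']
```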

Is Data Loss Possible?

Under extreme conditions, a small amount of data loss may occur when using eventual consistency replication, because asynchronous replication introduces a short delay. However, this delay is typically within 1 millisecond.

Impact of Multiple Replicas

Multiple replicas improve availability and fault tolerance, but they also increase storage consumption. Replication is asynchronous and usually does not affect the primary write thread, unless system resources become constrained.

Query Optimization

The client can cache the leader node of each device, allowing queries to be sent directly to the leader and reducing request forwarding. This feature can be enabled or disabled depending on client resource constraints.

Can Replicas Be Placed on Specific Nodes?

Currently, IoTDB does not support explicitly assigning replicas to specific nodes. However:

Manual migration is supported
Cross-cluster synchronization can be used to build geo-distributed active-active architectures

Covariate Forecasting: The Next Leap in Time-Series Database Capabilities

TimechoDB — Thu, 02 Apr 2026 02:00:00 +0000

Beyond the Myth of "Simple" Time-Series Forecasting

Many practitioners still view time-series forecasting as a straightforward exercise: use historical data to predict future trends. In real industrial systems, however, the problem is far more complex.

Load forecasting is tightly coupled with temperature variation.
Equipment health prediction depends heavily on operating conditions.
Wind power forecasting is driven by meteorological factors.
Production energy consumption forecasting relies on scheduling plans.

In practice, real-world time series exist within strongly coupled multivariate systems. Relying solely on the historical values of a target variable imposes a natural ceiling on predictive performance. The true technical frontier of time-series forecasting lies in the accurate modeling and utilization of covariates.

From Univariate Forecasting to Covariate-Aware Modeling

Early time-series models primarily focused on the intrinsic dynamics of a single curve. The typical question was:

How will this series evolve in the future?

In industrial environments, the more meaningful question is:

How will this series evolve under the current environmental and operational conditions?

External factors that influence the target variable—such as temperature, humidity, load, control parameters, and operating state—are referred to as covariates.

Importantly, covariate forecasting is not merely about increasing the number of input variables. Its core objective is to enable models to understand the dynamic dependencies and coupling relationships among variables. It addresses a system-level problem:

How does the target variable evolve under multi-variable interaction?

In strongly coupled industrial systems, the ability to robustly handle covariates represents a key breakthrough toward higher-complexity forecasting scenarios.
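
A deliberately simple example shows the ceiling that a purely univariate forecast runs into. The data below is synthetic and exactly linear (load = 50 + 2 × temperature), an assumption chosen so the effect is unambiguous. Real systems are noisy and non-linear, and this is not how foundation models handle covariates.

```python
def fit_line(x, y):
    """Ordinary least squares for y ≈ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope

def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Synthetic data: load depends linearly on temperature.
temperature = [18, 25, 31, 22, 35, 19, 28, 33, 21, 30]
load = [50 + 2 * t for t in temperature]

train_t, test_t = temperature[:7], temperature[7:]
train_y, test_y = load[:7], load[7:]

# Univariate baseline: repeat the last observed load.
persistence = [train_y[-1]] * len(test_y)

# Covariate-aware forecast: use the known future temperature.
intercept, slope = fit_line(train_t, train_y)
with_covariate = [intercept + slope * t for t in test_t]

baseline_error = mean_abs_error(test_y, persistence)
covariate_error = mean_abs_error(test_y, with_covariate)
print(f"persistence MAE: {baseline_error:.2f}, covariate MAE: {covariate_error:.2f}")
```

The baseline cannot anticipate temperature-driven swings because that information is simply absent from the load history, while the covariate-aware forecast tracks them.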

The Timer Roadmap: Structural Evolution in Time-Series Modeling

Time-series foundation models are emerging as a new modeling paradigm in large-scale forecasting research. Through large-scale pretraining, these models learn general time-series representations and achieve cross-domain transferability.

The Timer model family illustrates a clear technical evolution toward more general and powerful time-series intelligence:

Timer 1.0: Feasibility of General Representation Learning. The initial phase focused on validating the viability of general time-series pretraining. With large-scale pretraining, the model began to demonstrate cross-dataset transfer capability, moving time-series modeling toward a more generalized paradigm.
Timer-XL (Timer 2.0): Long-Context Modeling. Timer-XL strengthened long-sequence modeling and established a unified forecasting framework. Industrial systems typically exhibit both long-term trends and short-term fluctuations; improving context length and modeling stability was a critical step toward real-world applicability.
Timer-Sundial (Timer 3.0): Generative Forecasting and Uncertainty Modeling. Timer 3.0 introduced native continuous time-series tokenization and leveraged trillion-scale pretraining tokens for broader distribution learning. While maintaining strong generalization, the model achieved significantly improved zero-shot forecasting capability.

Learn more on the website: https://thuml.github.io/timer/

Compared with version 2.0, Timer 3.0 delivers notable gains in both inference quality and efficiency. It also supports quantile forecasting, enabling prediction outputs to move beyond single-point estimates and explicitly characterize uncertainty intervals.
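
To make the idea of quantile output concrete, the sketch below summarizes a handful of sampled forecast trajectories into per-step 10%, 50%, and 90% quantiles. The sample values are invented, and this only illustrates what an uncertainty band is, not how Timer computes its quantiles.

```python
from statistics import quantiles

# Hypothetical sampled forecasts: each row is one plausible future trajectory
# over a 3-step horizon (values invented for illustration).
samples = [
    [10.0, 11.0, 12.5],
    [10.4, 11.8, 13.9],
    [9.7, 10.6, 11.8],
    [10.2, 11.3, 13.0],
    [10.1, 11.5, 13.4],
    [9.9, 10.9, 12.1],
]

def quantile_forecast(samples):
    """Per-step (10%, 50%, 90%) quantiles across the sampled trajectories."""
    bands = []
    for step_values in zip(*samples):
        cuts = quantiles(step_values, n=10)  # nine cut points: 10%, 20%, ..., 90%
        bands.append((cuts[0], cuts[4], cuts[8]))
    return bands

bands = quantile_forecast(samples)
for step, (low, median, high) in enumerate(bands, start=1):
    print(f"step {step}: {low:.2f} / {median:.2f} / {high:.2f}")
```

A downstream consumer can then act on the band rather than a single number, for example alerting only when even the lower quantile exceeds a threshold.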

The Timer roadmap is not a simple feature accumulation. It represents a structural evolution that continuously strengthens modeling depth, generalization, and engineering readiness on top of a general pretraining foundation.

The Changing Role of the Database: Data–Model Co-Execution

As general time-series models become more capable, a practical question emerges:

How can these models be integrated into production systems in a controllable, engineering-friendly way?

Today, capabilities such as zero-shot forecasting, quantile prediction, and covariate modeling are increasingly available at the algorithmic level. However, if forecasting pipelines still depend on:

data export
external inference
and result write-back

then system complexity and data movement costs rise significantly.

In the evolution of Apache IoTDB, the preferred direction is data–model co-execution. By introducing the native intelligent analytics node (AINode), covariate forecasting can be scheduled and executed directly within the database system.

Once forecasting becomes a native component of the data infrastructure, the role of the time-series database fundamentally shifts—from a pure data management system to an integrated data-and-intelligence platform.

This transition implies that productionizing covariate forecasting is not only an algorithmic upgrade; it also requires a new round of architectural evolution in time-series databases.

Covariate Modeling: A System-Level Capability Upgrade

From univariate prediction to covariate modeling…

From scenario-specific models to general pretrained foundations…

From offline analytics to in-database native inference…

Time-series analytics is undergoing a fundamental shift in modeling paradigms and system architecture.

Within the ongoing evolution of Apache IoTDB, covariate forecasting is viewed as a key strategic direction. The surrounding technologies are being continuously refined and hardened for real-world deployment.

Further practical insights on engineering covariate forecasting inside the database stack will be shared in upcoming work.

From OpenClaw to IoTDB Skills: How Databases Evolve for the AI Agent Era

TimechoDB — Wed, 01 Apr 2026 10:06:02 +0000

Recently, OpenClaw has been gaining rapid traction in the developer community. Its rise highlights a broader shift: AI is evolving from "able to chat" to "able to act."

Agents are beginning to operate systems, invoke tools, and access databases. They are no longer limited to answering questions—they are executing tasks on behalf of users.

However, as Agents start interacting with database interfaces, a fundamental question emerges:

Do Agents truly understand databases?

Invocation ≠ Understanding: The Cognitive Gap Agents Face

As Agents become a new software interaction paradigm, the question is no longer whether they can invoke a database. The real challenge is whether they possess domain cognition.

Take Apache IoTDB as an example. For an Agent to effectively assist users, it must understand far more than API syntax. At minimum, it needs knowledge of:

The differences between the tree model and the table model
Common pitfalls in time-series data modeling
Optimization strategies for high-throughput writes and queries
The design boundaries and applicable scenarios of Apache TsFile
Trade-offs between consistency and performance in industrial workloads

This type of domain expertise is not inherently embedded in general-purpose LLMs (large language models). Without it, even an Agent that successfully calls IoTDB APIs may:

Misinterpret data modeling and generate logically incorrect code
Provide generic, non-actionable optimization advice
Confuse data models and trigger runtime errors
Produce "technical hallucinations" that sound plausible but are fundamentally wrong

IoTDB Skills: Giving Agents a Domain Knowledge Foundation

To address this gap, we recently open-sourced two core skill sets: IoTDB Skill and TsFile Skill.

Project website: https://github.com/timecholab/timecho-skills

Skills (Timecho): AI assistant capabilities for working with Apache IoTDB and Apache TsFile.

Here, Skills are not traditional feature modules. Instead, they represent a structured domain knowledge packaging approach for AI systems.

These Skills distill real-world engineering experience with IoTDB and TsFile into reusable, machine-interpretable capability modules, including:

Core conceptual boundaries of time-series databases
Common usage scenarios and anti-patterns
Recommended analytical approaches for specific problems
Guardrails designed to reduce technical hallucinations
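
As a rough illustration of what a guardrail against model confusion could look like, the sketch below flags a query whose FROM clause does not match the data model the session targets. It does not reflect how the published Skills are actually structured; the `check_model_mismatch` function and its regular expression are hypothetical, and the heuristic only relies on the fact that tree-model paths begin with `root`.

```python
import re

# Capture the first identifier after FROM, e.g. "root.db.d1" or "table1".
FROM_CLAUSE = re.compile(r"\bfrom\s+([\w.*]+)", re.IGNORECASE)


def check_model_mismatch(sql, target_model):
    """Return a warning string if the SQL looks written for the other model, else None.

    target_model must be "tree" or "table". This is a heuristic, not a parser.
    """
    if target_model not in ("tree", "table"):
        raise ValueError("target_model must be 'tree' or 'table'")
    match = FROM_CLAUSE.search(sql)
    if match is None:
        return None  # nothing to check, e.g. a SHOW statement
    source = match.group(1).lower()
    uses_tree_path = source == "root" or source.startswith("root.")
    if target_model == "table" and uses_tree_path:
        return "query reads from a root.* path but the session targets the table model"
    if target_model == "tree" and not uses_tree_path:
        return "query reads from a table name but the session targets the tree model"
    return None
```

An Agent runtime could run a check like this before sending generated SQL to the database and ask the model to regenerate the query when a warning comes back.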

In essence, IoTDB Skills attempt to answer a key question:

If an Agent is expected to help users succeed with IoTDB, what foundational knowledge must it possess?

This is not merely a product feature—it is a community-level exploration into how AI can move beyond "API invocation without understanding" toward accurate domain reasoning in time-series systems.

Beyond Understanding: Native Database Intelligence

If IoTDB Skills address how Agents understand databases, another question follows: How do Agents connect to databases in the first place?

We previously introduced MCP capabilities:

MCP solves how Agents securely and properly connect to databases
Skills address whether Agents truly understand domain logic

They operate at different layers and are complementary:

MCP = Connectivity layer → enables safe database access
Skills = Cognition layer → enables correct domain reasoning

On top of these, IoTDB's ongoing evolution is exploring a third dimension:

Intelligence layer—represented by capabilities such as AINode, enabling built-in reasoning, analytics, and forecasting within the database itself

From connectivity, to cognition, to built-in intelligence—these form the three critical upgrade paths for databases in the Agent era.

Within IoTDB's roadmap, this direction is already taking shape through:

Covariate forecasting to improve time-series trend prediction
Built-in time-series foundation models (Timer) to lower the barrier to intelligent analytics
The extensible AINode architecture providing infrastructure for native intelligence

These are not simply "AI add-ons." They embed analytical and predictive intelligence directly into the database engine, unifying storage, computation, and intelligence to support the next-generation Agent interaction model.

Figure: Overview of IoTDB AI capabilities

The Database Role Is Being Redefined in the Agent Era

Not every system will become an Agent. But every system will need to be understood correctly by Agents.

OpenClaw's popularity is just one signal of the broader Agent wave. As Agents become a core component of the software ecosystem, the role of databases is being fundamentally reshaped.

In the future, every database must adapt to the requirements of an Agent-driven ecosystem:

Be correctly understood by Agents, not just mechanically invoked
Provide structured domain memory to support Agent decision-making
Possess native intelligent analytics, evolving from data storage to an intelligent data foundation

IoTDB and TsFile Skills represent an early exploration toward machine-understandable databases, while covariate forecasting and AINode point toward native intelligence within the database.

These efforts are still in early stages—but they converge on a clear direction:

In the Agent era, domain knowledge crystallization and intelligent data infrastructure will become core competitive advantages for databases.

The Agent era is just beginning—and the evolution of databases is already underway.

If you are interested in AI, Agents, IoTDB, or TsFile, you are welcome to join the community discussion and contribute.

Looking for One Answer, Ending Up with Ten Tabs?

TimechoDB — Wed, 01 Apr 2026 09:44:19 +0000

With so many AI tools around, why does finding answers on a website still feel so hard?

One search turns into ten tabs—documentation, GitHub issues, pull requests—yet none of them quite match what you're looking for. That’s why Ask AI is now available on the Apache IoTDB website.

Faster search. More precise answers.

It Usually Starts with a Simple Question

You just want to check one thing.

A parameter name.
A feature detail.
Something you know you've seen before.

So you:

search the website
open the first tab — a blog
open the second tab — a GitHub issue
then another one
maybe a PR, just in case

Now you have ten tabs open—and still no clear answer. Not because the answer doesn’t exist—but because traditional search can’t surface it.

Sound familiar?

Ask AI Knows Where the Answer Is

Apache IoTDB now provides Ask AI, an AI assistant built directly into the official website. It's powered by a custom LLM with access to trusted IoTDB sources, including the documentation and GitHub.

Instead of manually searching through documents, issues, or examples, Ask AI helps you locate the most relevant existing answers directly from trusted IoTDB sources.

What Can Ask AI Help With

Ask AI is best suited for users who want to quickly locate existing, authoritative information about Apache IoTDB, including:

understanding system behavior and data models
tuning performance and configurations
checking whether an issue is known or previously discussed

It helps you find the right part of that information, especially when you don’t know where to start.

More Than a Chat Bot

Ask AI goes beyond one-off questions. With support for multi-turn conversations and a “Deep Thinking” mode, it helps users examine a topic more thoroughly by bringing together relevant references from IoTDB documentation and GitHub.

Rather than casual chat, it is designed for focused technical exploration within the IoTDB ecosystem.

Try It Next Time You're Stuck

The next time you’re about to open over ten tabs, try Ask AI instead. You’ll find it on the official Apache IoTDB website.

Try Now: https://iotdb.apache.org/

Discussion and questions are welcome: https://join.slack.com/t/apacheiotdb/shared_invite/zt-18jpjuo0m-VADRsGGbsQ6XsfkXxHR3uA

Key Apache IoTDB Distributed Tuning Details You Must Understand

TimechoDB — Wed, 01 Apr 2026 09:26:17 +0000

How many databases should you create? How should you model your data to fully utilize hardware resources?

When deploying Apache IoTDB in distributed mode, teams often face the same challenge: how to scale throughput without over-fragmenting the system. This article answers the most frequently asked questions about IoTDB distributed deployment and data modeling.

Recently, during a distributed deployment discussion, a user asked:

Most examples on the IoTDB website focus on smart factory scenarios. Is there a more general data modeling approach? Would creating one database per state improve performance? How should hierarchical paths be structured, like root.<state>.<license_plate>.<device_type>.<device_id>.<measurement>?

These questions touch several critical architectural concepts in IoTDB. Let’s address them step by step.

P.S. Applicable to IoTDB 1.x and 2.x.

Do You Need Multiple Databases for Performance?

The short answer is:

No.

IoTDB is a distributed database. It does not require manual database sharding to achieve high throughput. Even a single database can fully utilize machine resources when properly configured.

That said, multiple databases may still be appropriate for semantic or operational reasons:

Different time partition intervals
Different Region counts
Independent permission control
Strong data isolation between business domains

It is important to note that:

Data across databases is isolated.
Cross-database queries are not supported.

Therefore, multiple databases are suitable when strict business isolation is required — not for performance tuning.

The key to distributed performance in IoTDB lies elsewhere — in a core abstraction called Region.

What Is Region? How Should You Tune Region Count?

Fundamentals

Region is one of the most important internal abstractions in IoTDB. Depending on perspective, Region has different roles:

From a distributed systems perspective → a data shard instance
From a storage engine perspective → a serial-write IoT-LSM engine instance
From a replication perspective → the unit of high availability

In practice, Region defines the true concurrency boundary of IoTDB.

The relationship between Database and Region is one-to-many:

One database owns multiple Regions
One Region belongs to exactly one database
Regions with the same ID are replicated across nodes for high availability

On a single DataNode:

More Regions → higher concurrency → better CPU utilization
But each Region consumes memory and runtime resources
Therefore, each DataNode has a soft upper limit on the number of Regions

As data volume increases, Regions expand dynamically until reaching this soft limit.

Understanding this mechanism is critical: Performance scaling in IoTDB is Region-driven, not database-driven.

Recommended Region Configuration

Recommended configuration: Region soft limit per DataNode = CPU logical cores ÷ 2

This configuration achieves:

Strong write concurrency
Controlled memory consumption
Stable garbage collection behavior
Predictable performance under load

The parameter is configured in iotdb-system.properties:

data_region_per_data_node

The value must be consistent across all nodes in the cluster.

Version-specific defaults:

Version ≤ 1.3.3: default = 5. Recommended: manually set the value to CPU logical cores ÷ 2.
Version ≥ 1.3.4: default = 0, which means auto-detect (CPU logical cores ÷ 2).

You may still set a fixed positive value if your workload requires it.
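
The rule of thumb and the version-specific defaults above can be sketched in a few lines. The helper names are hypothetical and the version parsing assumes plain numeric versions such as `1.3.4`; only the parameter name `data_region_per_data_node`, the defaults, and the cores ÷ 2 rule come from the text.

```python
import os


def recommended_region_limit(logical_cores):
    """Rule of thumb from the text: CPU logical cores divided by 2 (at least 1)."""
    return max(1, logical_cores // 2)


def parse_version(version):
    """Turn '1.3.4' into (1, 3, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:3])


def effective_region_limit(version, logical_cores, configured=None):
    """Resolve the Region soft limit for one DataNode.

    configured is the value of data_region_per_data_node, or None if unset.
    """
    if configured is not None and configured > 0:
        return configured  # a fixed positive value always wins
    if parse_version(version) <= (1, 3, 3):
        # Default quoted for 1.3.3 and earlier. How these versions treat an
        # explicit 0 is not covered by the text; this sketch falls back to 5.
        return 5
    return recommended_region_limit(logical_cores)  # 0 means auto-detect from 1.3.4


def properties_line(logical_cores):
    """Render the line to place in iotdb-system.properties."""
    return f"data_region_per_data_node={recommended_region_limit(logical_cores)}"


# Example: derive the line for the machine this script runs on.
line_for_this_host = properties_line(os.cpu_count() or 1)
```

On a 16-core DataNode running 1.3.3 or earlier, `properties_line(16)` yields `data_region_per_data_node=8`, which replaces the default of 5.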

When Should You Increase Region Count?

Suppose:

data_region_per_data_node = CPU cores ÷ 2
You still want higher read/write throughput
Monitoring shows:
Disk I/O is not saturated
Network bandwidth is sufficient
Memory GC is stable
CPU is not fully utilized

In this case, the bottleneck may be insufficient concurrency rather than hardware limits.

You may:

Increase data_region_per_data_node to approximately CPU logical cores
Restart the cluster
Wait for new time partitions to trigger new Data Region creation

This increases the number of parallel write engines and allows the system to absorb higher write pressure.
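
The checklist above amounts to a simple decision rule, sketched here under stated assumptions: the `NodeMetrics` fields are hypothetical stand-ins for whatever your monitoring system reports, and "approximately CPU logical cores" is taken literally as the core count.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical monitoring snapshot; field names are illustrative only."""
    disk_io_saturated: bool
    network_sufficient: bool
    gc_stable: bool
    cpu_fully_utilized: bool


def suggest_region_limit(current_limit, logical_cores, metrics):
    """Return a new soft limit, or the current one if raising it is not indicated."""
    has_headroom = (
        not metrics.disk_io_saturated
        and metrics.network_sufficient
        and metrics.gc_stable
        and not metrics.cpu_fully_utilized
    )
    if has_headroom and current_limit < logical_cores:
        return logical_cores  # raise toward the number of logical cores
    return current_limit


idle_node = NodeMetrics(False, True, True, False)
disk_bound_node = NodeMetrics(True, True, True, False)
```

For a 16-core node currently at 8, `suggest_region_limit(8, 16, idle_node)` returns 16, while the disk-bound node stays at 8. The new value only takes effect after the restart and the creation of new time partitions described above.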

Important Note About Multi-Database Deployments

The data_region_per_data_node parameter is a soft upper limit per DataNode.

With a single database → it effectively occupies the entire soft limit.
With multiple databases → they share the Region quota according to internal balancing policies.

In large-scale scenarios with many databases, the actual Region count may gradually exceed the soft limit as the system scales.

Again, this reinforces a central idea: IoTDB scaling is fundamentally Region-based.

Recommended Data Modeling Strategy

Now let’s return to the original modeling question.

Prefer a Single Database

For most deployments, a single database such as root.db is sufficient.

This:

Does not negatively affect performance
Simplifies queries that span business domains, since cross-database queries are not supported
Avoids unnecessary data isolation

Configure Region Properly

Set data_region_per_data_node = CPU logical cores ÷ 2

This ensures hardware resources are effectively utilized while maintaining stability.

Hierarchical Path Design Principles

A recommended structure is:

root.db.<province>.<device_type>.<license_plate>.<measurement>

Core principle: Place lower-cardinality attributes at higher hierarchy levels.

Why?

IoTDB’s tree structure benefits from hierarchical compression
Fewer distinct nodes at upper levels improve metadata compression efficiency
Balanced tree structures improve memory usage and traversal efficiency

In practice:

Use semantic hierarchy
Place attributes with fewer unique values higher
Avoid excessive fragmentation
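
One way to apply the cardinality principle is to count distinct values per attribute in a sample of your data and order the path levels accordingly. The sketch below does that with made-up records; the helper names are hypothetical, and a pure cardinality sort is only a starting point that should still be checked against the semantic hierarchy you want.

```python
def order_levels_by_cardinality(records, attributes):
    """Sort attribute names by number of distinct values, fewest first."""
    distinct = {
        name: len({record[name] for record in records})
        for name in attributes
    }
    # sorted() is stable, so ties keep the order given in `attributes`.
    return sorted(attributes, key=lambda name: distinct[name])


def path_template(database, levels, measurement="<measurement>"):
    """Build a dotted path template such as root.db.<a>.<b>.<measurement>."""
    placeholders = [f"<{name}>" for name in levels]
    return ".".join([database, *placeholders, measurement])


# Made-up sample: 2 provinces, 3 device types, 4 license plates.
records = [
    {"license_plate": "A001", "province": "p1", "device_type": "truck"},
    {"license_plate": "A002", "province": "p1", "device_type": "bus"},
    {"license_plate": "A003", "province": "p2", "device_type": "car"},
    {"license_plate": "A004", "province": "p2", "device_type": "truck"},
]
levels = order_levels_by_cardinality(
    records, ["license_plate", "device_type", "province"]
)
template = path_template("root.db", levels)
```

With this sample, `template` comes out as `root.db.<province>.<device_type>.<license_plate>.<measurement>`, matching the recommended structure.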

Tree Model and Table Model (IoTDB 2.x)

In IoTDB 2.x, both Tree Model and Table Model are supported.

While their access semantics differ, the underlying distributed architecture remains the same.

Region still defines:

Physical storage boundaries
Concurrency units
Replication units

Table Model introduces relational-style access semantics, but the Region-based scaling mechanism and storage engine remain consistent with Tree Model.

Therefore, understanding Region is essential regardless of which model you choose.

Final Takeaway

In distributed IoTDB:

Performance is not improved by manually splitting databases
Concurrency is controlled by Region configuration
Efficiency depends on balanced hierarchical modeling

Once Region is understood as the fundamental concurrency unit, distributed deployment decisions become clear engineering trade-offs rather than trial-and-error experimentation.

Why Apache IoTDB Is Written in Java: A Decade of Engineering Trade-offs

TimechoDB — Wed, 01 Apr 2026 09:02:23 +0000

Since I started working on the development of the time-series database Apache IoTDB in 2016, I've been asked the same question again and again:

Why did you choose Java to build a database? Can Java really be used to write a database system?

In the early days, my standard answer was usually this:

When IoTDB was initiated in 2011, almost all influential distributed systems and databases were built in Java or on the JVM—such as Hadoop, HBase, Spark (Scala on JVM), Cassandra, Kafka, and Flink. To integrate deeply with the big data ecosystem, choosing Java was a natural decision.

That explanation is valid—but clearly insufficient.

What people really want to know is:

If you learn Java, do you actually have a chance to build a database?
Can Java be used to build a good database?
What does choosing Java really mean for a system like IoTDB?
...

These questions cannot be answered by theory alone. The relationship between programming languages and databases is not a matter of ideology—it is a practical trade-off among language characteristics, system complexity, engineering investment, and long-term returns.

After nearly ten years of real-world exploration, we believe we can now give a more grounded answer. Below are the eight key considerations behind IoTDB's choice of Java.

A Mature and Comprehensive Java Ecosystem

Queues, maps, heaps, locks, thread scheduling—nearly every common data structure and concurrency primitive has mature, well-tested implementations in the Java ecosystem. This allows database developers to focus their energy on core database logic and performance optimizations, rather than repeatedly reinventing low-level infrastructure.

More importantly, Java is widely used across enterprise platforms and applications. Middleware components in the Java ecosystem integrate smoothly with each other, which significantly lowers the learning curve for developers adopting Java-based databases. As a result, Java developers can more easily understand, operate, and extend a Java-written database system.

Code Readability and Long-Term Maintainability

This factor is often overlooked, but for someone who has spent years working on database internals, it is critical.

Databases are inherently complex systems. That complexity brings enormous optimization potential—but also substantial risk. A single subtle mistake can introduce severe bugs, which is why newer versions of some databases occasionally perform worse or become less stable than older ones.

Java's object-oriented design provides a natural advantage in code readability and conceptual clarity. In practice, we have found that many community contributors are able to ramp up quickly by understanding IoTDB's design principles and abstractions.

Readable code is not just a matter of elegance—it is a system's lifeline. Only readable and understandable codebases can sustain long-term evolution without collapsing under their own complexity.

Operability and Debugging Efficiency

Most Java developers are familiar with exception handling and detailed stack traces in logs—and those stack traces are invaluable.

In our experience, when users report bugs in IoTDB, engineers can often locate the root cause within the same day, and debugging rarely takes longer. The stack information alone usually provides enough context to pinpoint the issue.

By contrast, in discussions with developers of C-based databases, diagnosing production issues such as memory leaks can sometimes take weeks or even months.

No language-level advantage matters more than system stability and recoverability. There is nothing more painful than a production database failure that cannot be quickly diagnosed or fixed.

JVM tooling such as JProfiler and Arthas gives Java developers powerful observability into runtime behavior, enabling fast root-cause analysis and remediation.

Cross-Platform Portability

Today, this is often referred to as localization or hardware adaptation.

Java's promise of "write once, run anywhere" has proven extremely valuable as domestic and heterogeneous hardware platforms have become more common. For IoTDB, we have rarely needed special platform-specific adaptations—if Java can run, IoTDB can run.

This allows us to concentrate on core database logic and optimization, instead of spending engineering effort on platform compatibility.

Efficient Project and Dependency Management

For anyone joining the IoTDB project, the first essential skill is understanding Maven.

Nearly all Java projects—large or small—use Maven for project structure, dependency management, compilation, packaging, and release workflows. Advanced tasks such as code formatting and static analysis can be standardized through Maven profiles.

This consistency significantly reduces onboarding costs. In fact, my earliest blog posts about IoTDB were introductions to Maven-based release pipelines.

Performance: The Question Everyone Cares About

The most common concern is simple:

Can a database written in Java actually perform well?

Let's start with facts.

In major public time-series database benchmarks—such as TPCx-IoT and benchANT—IoTDB ranks first in both read/write performance and cost efficiency. These benchmarks include databases written in:

Go (InfluxDB, VictoriaMetrics)
C (TimescaleDB)
C++ (ClickHouse)

IoTDB, written in Java, is not merely competitive—it leads.

Why?

Because databases are often described as the crown jewel of foundational software. Their difficulty does not come from language syntax or runtime mechanics, but from internal system complexity.

As database functionality grows, system complexity increases exponentially—much like governing a large city with countless departments, workflows, and dependencies. This complexity creates vast optimization opportunities: columnar storage, batching, pipelining, indexing, and more. Optimizing even a single execution path can yield order-of-magnitude performance gains.

Java's garbage collection is frequently criticized, but in practice, it is a net positive feature—analogous to memory defragmentation at the OS level. Modern JVM GC algorithms are the result of decades of global engineering effort and perform remarkably well.

For special scenarios, databases can:

design smarter caching strategies
use off-heap memory
isolate memory-sensitive components

and do so transparently at the database layer.

In our production environments, we have never encountered a case where Java GC itself was the performance bottleneck. When serious GC pauses occur, they usually indicate either misconfiguration or memory leaks—issues typically identified and resolved during testing, often within the same day.

A database is a holistic system. No single technical advantage or disadvantage defines its success.

Lightweight Deployment Scenarios

Another frequent concern is whether Java databases can be deployed in edge or constrained environments.

There are two distinct scenarios:

Intelligent terminals

These devices may have limited resources (single-core CPU, 1–2 GB memory, tens of GB storage) but still support a full software stack. In such cases, Java poses no issue.

IoTDB can operate with memory footprints of just a few hundred megabytes, easily meeting edge read/write workloads. It is already running stably in satellite systems, airborne platforms, and power data collection terminals.

Embedded environments

Some embedded systems only support C/C++ runtimes, with tens of megabytes of memory and strict real-time constraints.

In many such cases, a full database is unnecessary; a lightweight file-based approach is often more appropriate. For this reason, we typically deploy the C++ implementation of TsFile, IoTDB's time-series file format, on the device side and upload files upstream.

P.S. The C++ version of TsFile will be open-sourced soon.

Industrial control algorithms rarely require long-term historical data stored in databases. Real-time control logic prioritizes low time complexity and often keeps required historical data fully cached in memory.

As hardware capabilities improve, the focus should shift toward better data processing models, not merely raw resource constraints.

A Strong Java Talent Pool

A database company is not just about code—it depends on a reliable development and operations team.

Although database systems attracted significant attention during recent waves of innovation, participation remains relatively small compared to application-layer projects.

Our experience shows that excellent Java developers can successfully transition into database kernel development. They ramp up quickly, take ownership of modules, and begin contributing meaningful code in a short time.

So why does the perception persist that Java cannot be used to build databases?

Largely due to historical reasons.

Relational databases originated in the 1970s, while Java was introduced in 1995. By the time Java emerged, major databases had already been written in C for decades:

Oracle (1977)
PostgreSQL (1986)
MySQL (1995)

Early skepticism also surrounded the commercial viability of databases themselves—until Oracle proved otherwise.

Java followed a similar trajectory. Today, many high-performance middleware systems and databases—including IoTDB, Cassandra, and H2—demonstrate that Java performance is more than sufficient for database development.

Looking back over IoTDB's decade-long journey, our ability to rapidly iterate on user demands while maintaining high stability and performance owes a great deal to Java.

Java is not only capable of building databases—it is well-suited for the task. This is not a theoretical claim, but a conclusion drawn from practice.

If you are a Java developer, you absolutely have the opportunity to build an excellent database.

If you’re curious about IoTDB, feel free to explore the project on GitHub—and join the community discussions and contributions.