<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tonybarber2</title>
    <description>The latest articles on Forem by tonybarber2 (@tonybarber2).</description>
    <link>https://forem.com/tonybarber2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F976312%2Fdc012ac3-0ddf-4a32-a6ca-69bb62180327.png</url>
      <title>Forem: tonybarber2</title>
      <link>https://forem.com/tonybarber2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tonybarber2"/>
    <language>en</language>
    <item>
      <title>How JuiceFS Boosts Foundation Model Inference in Multi-Cloud Architectures</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Fri, 30 Aug 2024 07:34:58 +0000</pubDate>
      <link>https://forem.com/tonybarber2/how-juicefs-boosts-foundation-model-inference-in-multi-cloud-architectures-2043</link>
      <guid>https://forem.com/tonybarber2/how-juicefs-boosts-foundation-model-inference-in-multi-cloud-architectures-2043</guid>
      <description>&lt;p&gt;In the development and application of foundation models, data preprocessing, model development, training, and inference are the four critical stages. This article focuses on the inference stage. Our previous blog posts mentioned using &lt;a href="https://juicefs.com/docs/community/introduction/" rel="noopener noreferrer"&gt;JuiceFS Community Edition&lt;/a&gt; to improve model loading efficiency, with examples from &lt;a href="https://juicefs.com/en/blog/user-stories/accelerate-large-language-model-loading" rel="noopener noreferrer"&gt;BentoML&lt;/a&gt; and &lt;a href="https://juicefs.com/en/blog/user-stories/ai-model-accelerate" rel="noopener noreferrer"&gt;Beike&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, we’ll dive into how &lt;a href="https://juicefs.com/docs/cloud/" rel="noopener noreferrer"&gt;JuiceFS Enterprise Edition&lt;/a&gt; accelerates foundation model inference in multi-cloud architectures&lt;/strong&gt;. You can learn how it tackles challenges like speeding up data access, efficiently distributing model data in multi-cloud environments, cost-effectively managing large volumes of existing data, and optimizing hardware resource utilization in heterogeneous environments.&lt;/p&gt;

&lt;h2&gt;Challenges for foundation model inference and storage&lt;/h2&gt;

&lt;p&gt;The diagram below illustrates a typical architecture for foundation model inference services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuufo35ls9wofxema5xjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuufo35ls9wofxema5xjc.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe several characteristics from this diagram:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The architecture spans multiple cloud services or data centers&lt;/strong&gt;. Given the current GPU resource shortages in the foundation model domain, most vendors or companies are adopting multi-cloud, multi-data center, or hybrid cloud strategies for their inference services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public cloud object storage serves as the single store for all model data, ensuring data consistency and ease of management&lt;/strong&gt;. When inference tasks are scheduled to a specific cloud service, retrieving model data requires manual intervention, such as pre-copying, because the scheduling system cannot know in advance which data each data center needs, and that data keeps changing. This incurs additional costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large inference clusters with hundreds to thousands of GPUs generate high concurrent data retrieval demands during server initialization&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, the storage challenges for foundation model inference center on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient data access&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rapid cross-region data distribution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading existing data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource optimization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, we’ll dive into our practices for addressing these challenges.&lt;/p&gt;

&lt;h2&gt;Challenge 1: Ensuring high throughput and concurrent data access&lt;/h2&gt;

&lt;p&gt;Inference tasks often involve handling model files that are hundreds of gigabytes in size, requiring high-concurrency sequential reads. Load speed is a critical concern for users. To meet the performance demands of such scenarios, JuiceFS Enterprise Edition’s distributed cache can create a large-scale caching space. By storing frequently used model data in a cache cluster, data access speed can be significantly improved, especially when launching thousands of inference instances simultaneously. Moreover, for AI applications that frequently switch models, such as Stable Diffusion’s text-to-image service, the cache cluster can greatly reduce model loading times. This directly enhances user experience.&lt;/p&gt;

&lt;p&gt;For example, when loading a Safetensors-format Stable Diffusion model on a standalone machine with a single GPU, &lt;strong&gt;data retrieval from the cache cluster can have a latency as low as 0.5 ms, compared with about 20 ms from object storage. This is a 40-fold performance improvement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The figure below shows JuiceFS’ distributed cache architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The inference cluster&lt;/strong&gt; is at the top.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cache cluster&lt;/strong&gt; is in the middle layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object storage&lt;/strong&gt; is at the bottom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The metadata service&lt;/strong&gt; is at the top right corner. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the inference service is deployed, it first accesses the required model data through the mounted JuiceFS on the inference cluster. If the data is found in the local memory cache of the inference cluster, it’s used. If not, it queries the cache cluster in the middle. If the cache cluster also fails to find the data, it’s retrieved from the object storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl3u5nyy71fnb4mb4ih9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl3u5nyy71fnb4mb4ih9.png" alt="Image description" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the inference cluster and cache layer may appear as separate layers in the diagram, they can be combined in practical applications or deployments if GPU servers have NVMe SSDs.&lt;/p&gt;

&lt;p&gt;For example, suppose each GPU server is equipped with three SSDs: one SSD is used for local caching, and the other two serve as storage disks for distributed caching. In this case, we recommend the following deployment approach: &lt;strong&gt;deploying two clients on each GPU server—one for the FUSE daemon and one for the cache cluster client&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When an inference task needs to read data, it first attempts to read from the local FUSE mount point. If the local cache does not contain the required model data, the inference task accesses the distributed cache through another JuiceFS client on the same server. Once the data is read, it’s returned to the inference task and cached on the two SSDs managed by the cache cluster as well as the local FUSE mount point for faster future access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ai0b8p60ql1fx0eyst7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ai0b8p60ql1fx0eyst7.png" alt="Image description" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploying two clients on a GPU server has two benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced network overhead&lt;/strong&gt;: Local caching minimizes network communication costs. Although GPU servers communicate over high-speed network cards, network communication still incurs significant overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed cache cluster effect&lt;/strong&gt;: The cache cluster client allows inference tasks to access data on other GPU servers. This achieves the effect of a distributed cache cluster.&lt;/li&gt;
&lt;/ul&gt;
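&lt;p&gt;The tiered read path described above can be sketched in a few lines of Python. This is a purely illustrative model, not the JuiceFS implementation; the class name, tier labels, and in-memory stores are all hypothetical:&lt;/p&gt;

```python
# Illustrative model of the three-tier read path: local FUSE cache,
# then the distributed cache cluster, then object storage.
class TieredReader:
    def __init__(self, local_cache, cluster_cache, object_store):
        self.local = local_cache      # cache on the GPU server's own SSD
        self.cluster = cluster_cache  # distributed cache across GPU servers
        self.store = object_store     # object storage, the source of truth

    def read(self, key):
        # 1. Try the local FUSE mount point's cache first.
        if key in self.local:
            return self.local[key], "local"
        # 2. Miss: go through the cache cluster client on the same server.
        if key in self.cluster:
            data = self.cluster[key]
        else:
            # 3. Miss again: fetch from object storage and fill the cluster cache.
            data = self.store[key]
            self.cluster[key] = data
        self.local[key] = data  # fill the local cache for future reads
        return data, "remote"

reader = TieredReader({}, {}, {"model.safetensors": b"weights"})
data, tier = reader.read("model.safetensors")    # cold read, fetched remotely
data2, tier2 = reader.read("model.safetensors")  # warm read, served locally
```

&lt;p&gt;Note how the cold read populates both cache tiers, so every subsequent inference instance on the same server is served from the local SSD.&lt;/p&gt;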

&lt;h2&gt;Challenge 2: Quickly distributing model data to compute nodes in multi-cloud and hybrid cloud architectures&lt;/h2&gt;

&lt;p&gt;In multi-cloud and hybrid cloud architectures, data is spread across different cloud platforms and data centers. Traditional methods of manual intervention, copying, and migration are not only costly but also complex to manage and maintain, raising issues such as permission control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JuiceFS Enterprise Edition's mirror file system feature allows users to replicate data from one region to multiple regions&lt;/strong&gt;, creating a one-to-many replication relationship. The entire replication process is transparent to users and applications. Data is written to a specified region, and the system automatically plans and replicates it to other regions.&lt;/p&gt;

&lt;p&gt;The diagram below shows the data writing and reading process in a mirror file system. It shows two regions: the source region and the mirror region. When data is written to the source region, JuiceFS automatically replicates the data from the source region to the mirror region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgiv54qo39hk4z2jwr2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgiv54qo39hk4z2jwr2l.png" alt="Image description" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When reading data, the client in the mirror region first attempts to pull data from the object storage in its own region. If the data is missing or has not yet arrived due to synchronization delays, it automatically falls back to the source region storage and pulls data via a secondary data source link. Thus, all clients in the mirror region can ultimately access the data, although some data may come from the backup data source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftsmegyilcvsn49rbyb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftsmegyilcvsn49rbyb4.png" alt="Image description" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;
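&lt;p&gt;The mirror-region read fallback can be summarized with a small sketch. This is illustrative only; the bucket contents, key names, and the replication-delay scenario are assumptions:&lt;/p&gt;

```python
# Minimal sketch of the mirror-region read fallback (illustrative only;
# bucket contents, key names, and the delay scenario are assumptions).
source_bucket = {"ckpt-001": b"model-v1"}  # object storage in the source region
mirror_bucket = {}  # mirror-region bucket; replication has not caught up yet

def mirror_read(key):
    """Read from the mirror region, falling back to the source region."""
    if key in mirror_bucket:
        return mirror_bucket[key], "mirror"
    # Missing or delayed by synchronization: pull via the secondary link.
    return source_bucket[key], "source-fallback"

data, origin = mirror_read("ckpt-001")    # served by the fallback path

# Once replication completes, the same read is served locally.
mirror_bucket["ckpt-001"] = source_bucket["ckpt-001"]
data2, origin2 = mirror_read("ckpt-001")  # now served from the mirror region
```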

&lt;h3&gt;An example of the data write process&lt;/h3&gt;

&lt;p&gt;This example shows a foundation model enterprise's deployment of a mirror file system, similar to the typical architecture diagram shown at the beginning of the article. At the top of the diagram is a central cluster, which serves as the source of data production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplonexnusm8444zqxq67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplonexnusm8444zqxq67.png" alt="Image description" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data write process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing data&lt;/strong&gt;: Data is initially created and written in the central cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full mirror metadata&lt;/strong&gt;: After data production is complete, it’s written to JuiceFS, triggering a full metadata mirror process. As shown, data is mirrored from the central JuiceFS metadata service to one or more edge clusters (three in this case), allowing edge clusters to access metadata locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache warming up (optional)&lt;/strong&gt;: This step optimizes data access speed. When new data is added, in addition to replicating the metadata, it’s also desirable to access this data locally. In environments without object storage, distributed cache functionality can be employed by deploying a distributed cache cluster in each data center. By warming up the cache, the new data is copied to the cache clusters in each edge cluster, thereby accelerating data access.&lt;/li&gt;
&lt;/ol&gt;
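&lt;p&gt;The three steps above can be sketched as follows. This is an illustrative model; the region names and in-memory structures are hypothetical, and in reality JuiceFS performs the metadata mirroring and cache warm-up itself:&lt;/p&gt;

```python
# Illustrative model of the write path: central write, metadata mirroring,
# then optional cache warm-up.
central_metadata = {}
edge_metadata = {"dc-a": {}, "dc-b": {}, "dc-c": {}}  # mirrored metadata
edge_caches = {"dc-a": {}, "dc-b": {}, "dc-c": {}}    # distributed caches

def write(path, size):
    # Step 1: data is produced and written in the central cluster.
    central_metadata[path] = {"size": size}
    # Step 2: metadata is mirrored to every edge cluster automatically.
    for dc in edge_metadata:
        edge_metadata[dc][path] = central_metadata[path]

def warm_up(path):
    # Step 3 (optional): copy the data blocks into each edge cache cluster
    # so inference tasks can read the new model locally.
    for dc in edge_caches:
        edge_caches[dc][path] = f"blocks-of:{path}"

write("/models/llama-70b", size=140 * 2**30)  # a hypothetical 140 GiB model
warm_up("/models/llama-70b")
```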

&lt;h3&gt;An example of the data read process&lt;/h3&gt;

&lt;p&gt;The data read process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accessing the mirrored metadata service&lt;/strong&gt;: As shown in the green numbers in the diagram, when the GPU cluster needs to retrieve model data, it first accesses the mirrored metadata service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading metadata and retrieving data&lt;/strong&gt;: After retrieving metadata, the client first attempts to obtain the required data from the cache cluster in the data center. If cache warming up has been done, the required model data can usually be found in the data center’s cache cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falling back to source data&lt;/strong&gt;: If data is not found in the cache cluster for any reason, there is no need to worry. This is because all cache cluster nodes will automatically fall back to the central object storage bucket to retrieve the original data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thus, the entire data reading process is smooth. Even if some data has not been warmed up or new data is not yet successfully warmed up, it can still be fetched from the central JuiceFS bucket through automatic fallback.&lt;/p&gt;

&lt;h2&gt;Challenge 3: Cost-effective and efficient access to large volumes of existing data&lt;/h2&gt;

&lt;p&gt;In addition to the challenges of data distribution in multi-cloud and hybrid cloud architectures, a common need is to migrate large amounts of accumulated raw data (for example, several petabytes) directly into JuiceFS. This increases the complexity of large-scale data management and may require adjustments such as dual-writing, which can disrupt normal application operations.&lt;/p&gt;

&lt;p&gt;The importing object storage metadata feature in JuiceFS Enterprise Edition allows for more efficient data import with minimal impact on operations. Users only need to continuously import metadata without copying data. The imported data can be accelerated through JuiceFS' distributed cache, improving data access speed. The following diagram shows the workflow of this feature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13hjkd5tatwzbd6g4e21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13hjkd5tatwzbd6g4e21.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow of importing object storage metadata:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Import metadata&lt;/strong&gt;: Using JuiceFS command-line tools, users can selectively import metadata from specific parts of the original data bucket without importing the entire bucket. This process uses prefix matching, involves only metadata import, and completes quickly as it does not copy data in the object storage.
Metadata import is not a one-time operation. As the original data grows or changes, users can perform incremental imports without worrying about additional costs from duplicate imports. During each incremental import, the system only imports metadata for new or modified data, avoiding the re-import of already processed files. This prevents extra overhead.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read metadata&lt;/strong&gt;: After importing metadata into JuiceFS, applications (for example, inference tasks) can access the imported data through the JuiceFS client. Applications can start immediately without waiting for data to be copied from the original bucket to JuiceFS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read data&lt;/strong&gt;: In scenarios like inference, distributed caching is often configured to optimize data access. Since only metadata was imported in the first step and not the actual data, the initial read from the distributed cache will not retrieve the data directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fall back to original bucket and cache data&lt;/strong&gt;: This step involves retrieving data from the original bucket through the distributed cache system. Once data is read, it’s automatically cached in JuiceFS' distributed cache. This avoids the need to access the original bucket for subsequent data reading, thus enhancing data access efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
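&lt;p&gt;Steps 1 and 2 can be modeled with a short sketch. This is illustrative; the real feature is a JuiceFS Enterprise Edition command-line operation, and the bucket layout here is made up:&lt;/p&gt;

```python
# Sketch of prefix-scoped, incremental metadata import.
bucket = {  # object key -> (size, etag); the object data itself is never copied
    "datasets/2024/a.parquet": (1024, "v1"),
    "datasets/2024/b.parquet": (2048, "v1"),
    "logs/app.log": (512, "v1"),
}
imported = {}  # JuiceFS metadata: key -> etag recorded at import time

def import_metadata(prefix):
    """Import metadata for keys under `prefix`, skipping unchanged entries."""
    count = 0
    for key, (size, etag) in bucket.items():
        if key.startswith(prefix) and imported.get(key) != etag:
            imported[key] = etag
            count += 1
    return count  # number of new or modified entries imported

first = import_metadata("datasets/")   # imports both parquet files
bucket["datasets/2024/c.parquet"] = (4096, "v2")
second = import_metadata("datasets/")  # incremental: only the new file
```

&lt;p&gt;The second call touches only the new object, which is why repeated incremental imports incur no duplicate-import overhead.&lt;/p&gt;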

&lt;p&gt;Following these steps, inference tasks can quickly access existing data and benefit from the performance boost of distributed caching.&lt;/p&gt;

&lt;h2&gt;Challenge 4: Optimizing hardware resource utilization in heterogeneous environments to enhance storage and compute performance&lt;/h2&gt;

&lt;p&gt;In heterogeneous environments, a system integrates various types or configurations of hardware devices. To maximize the value for enterprises, it’s crucial to fully utilize these heterogeneous hardware resources. In the following example, we have three servers, each equipped with different numbers and capacities of SSDs. Based on the total storage capacity of each server, the cache capacity ratios for these servers are set to 1:2:3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwagiuo59ml9790679ahc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwagiuo59ml9790679ahc.png" alt="Image description" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, JuiceFS' distributed cache assumes that all servers have the same hardware configuration, so all cache nodes have equal weight. Under this configuration, the system's overall performance is limited by the server with the smallest capacity (8 TB in this case), leaving the other servers' storage underutilized. Up to two-thirds of server 1's cache capacity may go unused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To address this issue, we introduce the cache node weight concept, allowing users to dynamically or statically adjust the weight of each GPU node based on actual hardware configurations&lt;/strong&gt;. For example, you can set the cache weight of server 1 to a default value of 100, server 2 to 200, and server 3 to 300. This corresponds to the 1:2:3 SSD capacity ratio. By setting different weights, you can more effectively use the storage resources of each cache server, optimizing overall system performance. This approach provides a typical solution for handling servers with different hardware configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtasm4hfkn9t5fo3eswu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtasm4hfkn9t5fo3eswu.png" alt="Image description" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;
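&lt;p&gt;One way to see how weights shape data placement is weighted rendezvous hashing, where each block lands on one node with probability proportional to that node's weight. This is a sketch under our own assumptions; the source does not state that JuiceFS uses this exact scheme:&lt;/p&gt;

```python
import hashlib
import math

# Cache weights in the 1:2:3 ratio from the example above.
weights = {"server1": 100, "server2": 200, "server3": 300}

def pick_node(block_id):
    """Weighted rendezvous hashing: the highest-scoring node wins."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{block_id}".encode()).digest()
        # Map the hash to a uniform value strictly inside (0, 1).
        u = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)
        return -weights[node] / math.log(u)
    return max(weights, key=score)

# Place 6,000 blocks and count how many each server receives; the counts
# approximate the 1:2:3 weight ratio.
placement = {node: 0 for node in weights}
for i in range(6000):
    placement[pick_node(f"block-{i}")] += 1
```

&lt;p&gt;A useful property of this family of schemes is that changing one node's weight only reassigns blocks touching that node, which fits the maintenance scenario described below.&lt;/p&gt;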

&lt;p&gt;Beyond this scenario, cache node weights can be applied in other situations as well. For example, GPU servers are prone to failures, and users may need to take one or two servers offline each week for hardware maintenance. Because a server shutdown makes its cached data lost or temporarily inaccessible, this can lower the cache cluster's hit rate. The cache node weighting feature can also be used to minimize the impact of server failures or maintenance on the cache cluster’s utilization.&lt;/p&gt;

&lt;h2&gt;Future plans&lt;/h2&gt;

&lt;p&gt;Finally, let’s explore the improvements we plan to make in inference scenarios and other potential applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;We plan to introduce a multi-replica feature in the distributed cache&lt;/strong&gt;. Currently, data in the distributed cache typically exists as a single replica. This means that if a server (such as a GPU server) fails unexpectedly, the cached data on that server is lost for lack of a backup, directly impacting cache hit rates. Because such failures are sudden, we cannot gradually migrate data to other nodes through manual intervention.&lt;/p&gt;

&lt;p&gt;In this context, single-replica caching inevitably affects the efficiency of the entire cache cluster. Therefore, we consider upgrading from single-replica to multi-replica caching. The benefits of this upgrade are clear: although it consumes more storage space, it can significantly improve cache hit rates and availability in scenarios where servers frequently fail.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;We’re exploring the implementation of a user-space client&lt;/strong&gt;. Currently, the FUSE-mounted file system effectively implements file system functionality. However, because it relies on the Linux kernel and involves multiple switches and data copies between user space and kernel space, it carries some performance overhead. This limitation is particularly noticeable in serverless and Kubernetes environments in the cloud, where FUSE mounting may not be permitted, which restricts the scenarios in which JuiceFS can be used.&lt;/p&gt;

&lt;p&gt;Therefore, we consider developing a pure user-space client. This would be a component that does not rely on kernel space. It can significantly lower the usage threshold and provide services in environments where FUSE is unsupported. Moreover, by avoiding frequent switches and memory copying between kernel space and user space, this client could potentially deliver significant performance improvements, especially in GPU-intensive environments requiring high throughput.&lt;/p&gt;

&lt;p&gt;However, a potential drawback of this client is that it may not be as transparent as a POSIX interface. It requires users to implement functionalities by incorporating specific libraries (for example, the JuiceFS library). This approach might introduce intrusion into the application.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We aim to enhance observability&lt;/strong&gt;. The JuiceFS architecture includes multiple complex components, such as GPU servers, the cache cluster, object storage through dedicated lines, and cache warm-up. Given this, we plan to introduce more convenient tools and methods to improve the overall observability of the architecture. This will help JuiceFS users quickly and easily locate and analyze issues. In the future, we’ll further optimize the observability of various components, including distributed caching, to assist users in fast problem diagnosis and resolution when issues arise.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have any questions or would like to share your thoughts, feel free to join the &lt;a href="https://github.com/juicedata/juicefs/discussions" rel="noopener noreferrer"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and our &lt;a href="https://juicefs.slack.com/ssb/redirect" rel="noopener noreferrer"&gt;community on Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Metabit Trading Built a Cloud-Based Quantitative Research Platform with JuiceFS</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Fri, 16 Aug 2024 07:34:31 +0000</pubDate>
      <link>https://forem.com/tonybarber2/metabit-trading-built-a-cloud-based-quantitative-research-platform-with-juicefs-21c5</link>
      <guid>https://forem.com/tonybarber2/metabit-trading-built-a-cloud-based-quantitative-research-platform-with-juicefs-21c5</guid>
      <description>&lt;p&gt;Founded in 2018, &lt;a href="https://www.metabit-trading.com/about" rel="noopener noreferrer"&gt;Metabit Trading&lt;/a&gt; is a technology-driven quantitative investment firm harnessing the power of AI and machine learning. In April 2023, our assets under management reached about USD 1.4 billion. We place a high priority on building foundational infrastructure and have a strong research infrastructure team. The team uses cloud computing to overcome the limitations of standalone development, aiming to develop a more efficient and secure toolchain.&lt;/p&gt;

&lt;p&gt;Driven by the unique demands of quantitative research, we switched to cloud-native storage. After our evaluation, &lt;strong&gt;we chose &lt;a href="https://juicefs.com/en/" rel="noopener noreferrer"&gt;JuiceFS&lt;/a&gt;, a distributed file system, for its POSIX compatibility, cost-effectiveness, and high performance. It’s an ideal fit for our diverse and bursty compute tasks&lt;/strong&gt;. The deployment of JuiceFS has enhanced our ability to handle large-scale data processing, optimize storage costs, and protect intellectual property (IP) in a highly dynamic research environment.&lt;/p&gt;

&lt;p&gt;In this article, we’ll take a deep dive into the characteristics of quantitative research, the storage requirements of our quantitative platform, and how we selected our storage solution.&lt;/p&gt;

&lt;h2&gt;What quantitative trading research does&lt;/h2&gt;

&lt;p&gt;As a newly established quantitative trading firm, our infrastructure storage platform selection was influenced by two factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our preference for modern technology stacks, given our lack of historical technical baggage&lt;/li&gt;
&lt;li&gt;The specific characteristics of machine learning scenarios in quantitative trading analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The diagram below shows the research model most closely associated with machine learning in our scenario:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffovoujiij6gmlk5warn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffovoujiij6gmlk5warn9.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initially, feature extraction must be performed on the raw data before model training. Financial data has a particularly low signal-to-noise ratio, and using raw data directly for training would result in a very noisy model.&lt;/p&gt;

&lt;p&gt;Raw data includes market data such as stock prices and trading volumes, as well as non-quantitative data like research reports, financial reports, news, social media, and other unstructured data. Researchers extract features from this data through various transformations and then train AI models.&lt;/p&gt;

&lt;p&gt;Model training yields models and signals. Signals predict future price trends, and their strength indicates the strategic orientation. Quantitative researchers use this information to optimize portfolios, forming real-time trading positions. This process also involves horizontal dimension information (stocks) for risk control, such as avoiding excessive positions in a particular sector. After forming a position strategy, researchers simulate orders to understand the strategy’s performance through the resulting profit and loss information.&lt;/p&gt;
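&lt;p&gt;As a toy illustration of this pipeline, the following sketch runs feature extraction, a stand-in "model" that emits a signal, and a trivial backtest. All numbers and functions here are hypothetical; real feature extraction and training are far more sophisticated:&lt;/p&gt;

```python
# Toy end-to-end sketch: raw prices -> features -> signal -> backtest PnL.
def extract_features(raw_prices):
    # Simple momentum feature: difference between consecutive prices.
    return [b - a for a, b in zip(raw_prices, raw_prices[1:])]

def train_signal(features):
    # Stand-in "model": the sign of the average feature is the signal.
    avg = sum(features) / len(features)
    return 1 if avg > 0 else -1  # +1 = long bias, -1 = short bias

def backtest(signal, future_returns):
    # Profit and loss from holding the signal's position through each return.
    return sum(signal * r for r in future_returns)

prices = [100.0, 101.0, 103.0, 102.0, 105.0]
features = extract_features(prices)
signal = train_signal(features)
pnl = backtest(signal, [0.5, -0.2, 1.0])
```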

&lt;h2&gt;Characteristics of quantitative research&lt;/h2&gt;

&lt;h3&gt;Bursty tasks: high elasticity&lt;/h3&gt;

&lt;p&gt;Quantitative research generates many bursty tasks because researchers validate their ideas through experiments. As new ideas emerge, the computing platform must absorb sudden surges of work. This demands highly elastic compute.&lt;/p&gt;

&lt;h3&gt;Diverse research tasks: flexibility&lt;/h3&gt;

&lt;p&gt;The entire process includes a wide variety of compute tasks. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feature extraction&lt;/strong&gt;: Computations on time-series data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model training&lt;/strong&gt;: Classical machine learning scenario&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio optimization&lt;/strong&gt;: Tasks involving optimization problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trading strategy backtesting&lt;/strong&gt;: Simulating strategy performance with historical data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The diversity of tasks demands varied compute requirements.&lt;/p&gt;

&lt;h3&gt;Protecting research content: modularity and isolation&lt;/h3&gt;

&lt;p&gt;Quantitative research content is critical IP. To protect it, the research platform abstracts each strategy segment into modules with standardized inputs, outputs, and evaluation methods. For example, model research involves standardized feature values as inputs and signals and models as outputs. This modular isolation effectively safeguards IP. The storage platform must be designed to accommodate this modularity.&lt;/p&gt;

&lt;h2&gt;Data characteristics of quantitative research&lt;/h2&gt;

&lt;p&gt;The input for many tasks comes from the same data source. For example, as mentioned above, quantitative researchers need to perform extensive backtesting on historical strategies. They test the same positions with different parameters to observe their performance. Feature extraction often involves combining basic and new features, with much of the data coming from the same source.&lt;/p&gt;

&lt;p&gt;Take raw data of 100 TB as an example. In the era of big data, this is not a particularly large amount of data. However, &lt;strong&gt;when a large number of compute tasks simultaneously access this data, it imposes specific requirements on data storage&lt;/strong&gt;.&lt;/p&gt;
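&lt;p&gt;As a back-of-envelope sketch (all numbers below are hypothetical, not measurements from our platform), the aggregate bandwidth that such shared hot data must sustain can be estimated as follows:&lt;/p&gt;

```python
# Back-of-envelope estimate (hypothetical numbers): aggregate read
# bandwidth needed when many tasks scan the same shared hot data.
def aggregate_bandwidth_gbps(num_tasks: int, per_task_mbps: float) -> float:
    """Total read bandwidth demanded by concurrent tasks, in Gbps."""
    return num_tasks * per_task_mbps / 1000

# 500 concurrent backtests, each streaming at 200 Mbps, together demand
# 100 Gbps of aggregate throughput from the shared data source.
print(aggregate_bandwidth_gbps(500, 200))  # 100.0
```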

&lt;p&gt;In addition, the quantitative research process produces many bursty tasks whose results the research team wants to keep. This generates a large amount of archive data, although that data is accessed infrequently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Characteristics of quantitative research compute tasks
&lt;/h2&gt;

&lt;p&gt;Based on the above characteristics, it’s difficult to meet our compute needs using traditional data centers. Therefore, moving compute tasks to the cloud is a suitable technical choice for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  High frequency of burst tasks and high elasticity
&lt;/h3&gt;

&lt;p&gt;The figure below shows the running instances of one of our clusters. We can see there were multiple periods where the entire cluster's instances were fully utilized. At the same time, there were moments when the entire cluster scaled down to zero. The compute tasks of a quantitative institution are closely tied to the research progress of its researchers. The peaks and valleys in demand are significant, which is a hallmark of offline research tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ckuv2grqz22os0kjj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ckuv2grqz22os0kjj5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology development and the challenge of predicting compute needs
&lt;/h3&gt;

&lt;p&gt;Our research methods and compute needs can experience explosive growth. Accurately predicting these changes in compute demand is challenging. In early 2020, both our actual and forecasted usage were quite low. However, when the research team introduced new methods and ideas, there was a sudden, substantial increase in the demand for compute resources. Capacity planning is crucial when setting up traditional data centers, but it becomes particularly tricky under these circumstances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modern AI ecosystem built on cloud-native platforms
&lt;/h3&gt;

&lt;p&gt;The modern AI ecosystem is built almost entirely on cloud-native platforms. We’ve tried many innovative technologies, including popular machine learning operations (MLOps) practices, to streamline the entire workflow and build machine learning training pipelines. Many distributed training frameworks now support cloud-native deployment, making it a natural choice for us to move our compute tasks to the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative platform storage requirements
&lt;/h2&gt;

&lt;p&gt;Based on the above application and compute needs, our storage platform requirements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imbalance between compute and storage&lt;/strong&gt;: As mentioned earlier, compute tasks can surge to very high levels, while hot data grows far more slowly. This imbalance requires separating compute from storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput access for hot data&lt;/strong&gt;: For example, market data requires high throughput access when hundreds of tasks simultaneously access the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-cost storage for cold data&lt;/strong&gt;: Quantitative research generates a large amount of archive data that needs to be stored at a low cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diverse file types/requirements and POSIX compatibility&lt;/strong&gt;: We have various compute tasks with diverse file type requirements, such as CSV and Parquet. Some research scenarios also require flexible custom development, making POSIX compatibility a critical consideration when selecting a storage platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP protection: data sharing and isolation&lt;/strong&gt;: Protecting IP requires isolation of compute tasks and data, while researchers still need easy access to public data such as market data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI ecosystem and task scheduling on cloud platforms&lt;/strong&gt;: This is a basic requirement, so storage also needs good support for Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modularity, namely intermediate result storage/transmission&lt;/strong&gt;: The modularity of compute tasks requires storage and transmission of intermediate results. For example, during feature calculation, large amounts of feature data are generated, which are immediately used by training nodes. We need an intermediate storage medium for caching.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Storage solution selection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-POSIX compatible solutions
&lt;/h3&gt;

&lt;p&gt;Initially, we tried many object storage solutions, which were non-POSIX solutions. Object storage offers strong scalability and low cost but has obvious drawbacks. The biggest issue is the lack of POSIX compatibility. The use of object storage differs significantly from file systems, making it difficult and inconvenient for researchers to use directly.&lt;/p&gt;

&lt;p&gt;In addition, many cloud providers have request limits on object storage. For example, Alibaba Cloud limits the bandwidth of the entire OSS account. While this is acceptable for typical application scenarios, burst tasks generate significant bandwidth demand in an instant. Therefore, it’s challenging to support these scenarios with just object storage.&lt;/p&gt;

&lt;p&gt;Another option was HDFS, but we didn't test it extensively: our tech stack does not rely heavily on Hadoop, and HDFS offers poor support for AI training workloads. Moreover, HDFS lacks full POSIX compatibility, which limits its use in our scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  POSIX-compatible solutions on the cloud
&lt;/h3&gt;

&lt;p&gt;Based on the application characteristics mentioned above, we required strong POSIX compatibility. Since our technology platform is based on the public cloud, &lt;strong&gt;we focused on POSIX-compatible cloud solutions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cloud providers offer solutions like Alibaba Cloud NAS and AWS EFS. Another category includes Alibaba Cloud's Cloud Parallel File Storage (CPFS) and Amazon FSx. The throughput of these file systems is strongly tied to capacity: greater capacity means greater throughput, a property inherent to NAS-style storage.&lt;/p&gt;

&lt;p&gt;Such solutions handle small amounts of hot data poorly and require extra tuning for good performance. In addition, CPFS and Alibaba Cloud's high-speed NAS offer low-latency reads but at a high cost.&lt;/p&gt;

&lt;p&gt;Compared with high-performance NAS products, &lt;strong&gt;JuiceFS' overall cost is much lower, because its underlying storage is object storage&lt;/strong&gt;. JuiceFS costs comprise object storage fees, JuiceFS Cloud Service fees, and SSD cache costs; even combined, they remain far below those of NAS and similar solutions.&lt;/p&gt;

&lt;p&gt;Regarding throughput, early tests showed no significant performance difference between CPFS and JuiceFS, when there were not many nodes. However, &lt;strong&gt;as the number of nodes increased, NAS file systems faced bandwidth limitations. This lengthened reading time. In contrast, JuiceFS, with a well-deployed cache cluster, handled this effortlessly with minimal overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4n5a0wy45rfbceuyoms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4n5a0wy45rfbceuyoms.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides cost and throughput, JuiceFS supports the mentioned functionalities well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full POSIX compatibility&lt;/li&gt;
&lt;li&gt;Permission control&lt;/li&gt;
&lt;li&gt;Quality of service&lt;/li&gt;
&lt;li&gt;Kubernetes support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notably, JuiceFS' cache cluster capability allows flexible cache acceleration. Initially, we used compute nodes for local caching, a common approach. With compute-storage separation, we wanted some data localized to compute nodes. JuiceFS supports this well, with configurable limits on cache disk space and usage percentage.&lt;/p&gt;

&lt;p&gt;We deployed an independent cache cluster to serve hot data, warming up the cache before use. During use, we noticed significant differences in resource utilization across compute clusters. Some high-bandwidth servers were primarily used for single-node computations. This meant their network resources were largely unused. We deployed cache nodes on these servers, using idle network bandwidth. This resulted in a high-bandwidth cache cluster within the same cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  JuiceFS in production scenarios
&lt;/h2&gt;

&lt;p&gt;Currently, we use JuiceFS in the following production scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File systems for compute tasks, applied to hot data input.&lt;/li&gt;
&lt;li&gt;Log/artifact output.&lt;/li&gt;
&lt;li&gt;Pipeline data transfer: After feature generation, data needs to be transferred to model training. During training, data transfer needs arise, with Fluid + JuiceFS Runtime serving as an intermediate cache cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the future, we’ll continue exploring cloud-native and AI technologies to develop more efficient, secure toolchains and foundational technology platforms.&lt;/p&gt;

&lt;p&gt;If you have any questions for this article, feel free to join &lt;a href="https://github.com/juicedata/juicefs/discussions" rel="noopener noreferrer"&gt;JuiceFS discussions on GitHub&lt;/a&gt; and &lt;a href="https://juicefs.slack.com/ssb/redirect" rel="noopener noreferrer"&gt;their community on Slack&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Comprehensive Comparison of JuiceFS and HDFS for Cloud-Based Big Data Storage</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Fri, 07 Apr 2023 06:26:53 +0000</pubDate>
      <link>https://forem.com/tonybarber2/a-comprehensive-comparison-of-juicefs-and-hdfs-for-cloud-based-big-data-storage-66a</link>
      <guid>https://forem.com/tonybarber2/a-comprehensive-comparison-of-juicefs-and-hdfs-for-cloud-based-big-data-storage-66a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;About Author&lt;br&gt;
At Juicedata, Youpeng Tang works as a full-stack engineer and is responsible for integrating JuiceFS with the Hadoop platform.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;HDFS has become the go-to choice for big data storage and is typically deployed in data center environments.&lt;/p&gt;

&lt;p&gt;JuiceFS, on the other hand, is a distributed file system based on object storage that allows users to quickly deploy elastic file systems that can be scaled on demand in the cloud.&lt;/p&gt;

&lt;p&gt;For enterprises considering building a big data platform in the cloud, understanding the differences and pros and cons between these two products can provide valuable insights for migrating or switching storage solutions. In this article, we will analyze the similarities and differences between HDFS and JuiceFS from multiple aspects, including technical architecture, features, and use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 HDFS Architecture
&lt;/h3&gt;

&lt;p&gt;HDFS (Hadoop Distributed File System) is a distributed file system within the Hadoop ecosystem. Its architecture consists of NameNode and DataNode. The most basic HDFS cluster is made up of one NameNode and a group of DataNodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RKtqqIMk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hvoenkc96t1l72pjjfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RKtqqIMk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7hvoenkc96t1l72pjjfp.png" alt="Image description" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NameNode is responsible for storing file metadata and serving client requests such as create, open, rename, and delete. It manages the mapping between DataNodes and data blocks, maintaining a hierarchical structure of all files and directories, storing information such as file name, size, and the location of file blocks. In production environments, multiple NameNodes are required along with ZooKeeper and JournalNode for high availability.&lt;/p&gt;

&lt;p&gt;DataNode is responsible for storing actual data. Files are divided into one or more blocks, each of which is stored on a different DataNode. The DataNode reports the list and status of stored blocks to the NameNode. DataNode nodes handle the actual file read/write requests and provide the client with data from the file blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 JuiceFS Architecture
&lt;/h3&gt;

&lt;p&gt;JuiceFS community edition also splits file data into blocks, but unlike HDFS, JuiceFS stores data in the object storage (such as Amazon S3 or MinIO), while metadata is stored in a user-selected database such as Redis, TiKV, or MySQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IjWjDeKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lh2c638yxv0eqyn30xac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IjWjDeKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lh2c638yxv0eqyn30xac.png" alt="Image description" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;JuiceFS's metadata management is completely independent of its data storage, which means it can support large-scale data storage and fast file system operations while maintaining high availability and data consistency. JuiceFS provides Hadoop Java SDK which supports seamless switching from HDFS to JuiceFS. Additionally, JuiceFS provides multiple APIs and tools such as POSIX, S3 Gateway, and Kubernetes CSI Driver, making it easy to integrate into existing applications.&lt;/p&gt;

&lt;p&gt;The JuiceFS Enterprise Edition shares the same technical architecture as the Community Edition but is tailored for enterprise users with higher demands for performance, reliability, and availability, providing a proprietary distributed metadata storage engine. It also includes several advanced features beyond the Community Edition, such as a web console, snapshots, cross-region data replication, and file system mirroring.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Basic Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Metadata
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.1.1 Metadata Storage
&lt;/h4&gt;

&lt;p&gt;HDFS keeps metadata in the NameNode's memory and persists it through FsImage and EditLog, ensuring that metadata is not lost and can be quickly recovered after a restart.&lt;/p&gt;

&lt;p&gt;JuiceFS Community Edition employs independent third-party databases to store metadata, such as Redis, MySQL, TiKV, and others. As of the time of this writing, JuiceFS Community Edition supports a total of 10 transactional databases across three categories.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition uses a self-developed high-performance metadata storage engine that stores metadata in memory. It ensures data integrity and fast recovery after restart through changelog and checkpoint mechanisms, similar to HDFS's EditLog and FsImage.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1.2 Memory consumption for metadata
&lt;/h4&gt;

&lt;p&gt;The metadata of each file in HDFS takes up approximately 300 bytes of memory space.&lt;/p&gt;

&lt;p&gt;When Redis is used as the metadata engine in JuiceFS Community Edition, the metadata of each file takes up approximately 300 bytes of memory space.&lt;/p&gt;

&lt;p&gt;For JuiceFS Enterprise Edition, the metadata of each file for hot data takes up approximately 300 bytes of memory space. JuiceFS Enterprise Edition supports memory compression, and for infrequently used cold data, memory usage can be reduced to around 80 bytes per file.&lt;/p&gt;
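&lt;p&gt;These per-file figures make metadata memory easy to estimate. A minimal sizing sketch, assuming the ~300-byte and ~80-byte figures above:&lt;/p&gt;

```python
# Metadata memory sizing sketch based on the per-file figures above:
# ~300 bytes for hot files, ~80 bytes for compressed cold files (Enterprise).
HOT_BYTES_PER_FILE = 300
COLD_BYTES_PER_FILE = 80

def metadata_memory_gib(num_files: int, cold_ratio: float = 0.0) -> float:
    """Estimated metadata memory in GiB for a mix of hot and cold files."""
    hot = num_files * (1 - cold_ratio) * HOT_BYTES_PER_FILE
    cold = num_files * cold_ratio * COLD_BYTES_PER_FILE
    return (hot + cold) / 2**30

# 100 million files, all hot: about 27.9 GiB of metadata memory.
print(round(metadata_memory_gib(100_000_000), 1))  # 27.9
# The same namespace with 80% cold, compressed metadata: about 11.5 GiB.
print(round(metadata_memory_gib(100_000_000, cold_ratio=0.8), 1))  # 11.5
```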

&lt;h4&gt;
  
  
  2.1.3 Scaling metadata
&lt;/h4&gt;

&lt;p&gt;If metadata service cannot scale well, performance and reliability will degrade at a large scale.&lt;/p&gt;

&lt;p&gt;HDFS addresses the capacity limitation of a single NameNode by using federation. Each NameNode in the federation has an independent namespace, and they can share the same DataNode cluster to simplify management. Applications need to access a specific NameNode or use statically configured ViewFS to create a unified namespace, but cross-NameNode operations are not supported.&lt;/p&gt;

&lt;p&gt;JuiceFS Community Edition uses third-party databases as metadata storage, and these database systems usually come with mature scaling solutions. Typically, storage scaling can be achieved simply by increasing the database capacity or adding database instances.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition also supports horizontal scaling of metadata clusters, where multiple metadata service nodes jointly form a single namespace to support larger-scale data and handle more access requests. No client modifications are required when scaling horizontally.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1.4 Metadata operations
&lt;/h4&gt;

&lt;p&gt;HDFS resolves file paths server-side: the client sends the full path directly to the NameNode, so a request for a file at any depth requires only one RPC call.&lt;/p&gt;

&lt;p&gt;JuiceFS Community Edition accesses files through inode-based path lookup, which involves searching for files layer by layer starting from the root directory until the final file is found. Therefore, depending on the file's depth, multiple RPC calls may be required, which can slow down the process. To speed up path resolution, some metadata engines that support fast path resolution, such as Redis, can directly resolve to the final file on the server side. If the metadata database used does not support fast path resolution, enabling metadata caching can speed up the process and reduce the load on metadata services. However, enabling metadata caching changes the consistency semantics and requires further adjustments based on the actual scenario.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition supports server-side path resolution, requiring only one RPC call for any file at any depth.&lt;/p&gt;
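&lt;p&gt;The difference can be counted directly. A small sketch of the two lookup patterns (an assumed model, not actual client code):&lt;/p&gt;

```python
# RPC-count sketch for the two path-resolution strategies described above.
def rpcs_full_path(path: str) -> int:
    """Server-side resolution: the whole path goes in a single call."""
    return 1

def rpcs_per_component(path: str) -> int:
    """Inode-based lookup: one call per path component from the root."""
    return len([part for part in path.strip("/").split("/") if part])

path = "/warehouse/db/table/partition/part-00000.parquet"
print(rpcs_full_path(path))      # 1
print(rpcs_per_component(path))  # 5
```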

&lt;h4&gt;
  
  
  2.1.5 Metadata cache
&lt;/h4&gt;

&lt;p&gt;Metadata caching can significantly improve the performance and throughput of a file system. By storing the most frequently used file metadata in the cache, the number of RPC calls will significantly decrease, hence improving performance.&lt;/p&gt;

&lt;p&gt;HDFS does not support client-side metadata caching.&lt;/p&gt;

&lt;p&gt;Both JuiceFS Community and Enterprise Editions support client-side metadata caching to accelerate metadata operations such as open, list, and getattr, and reduce the workload on the metadata server.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Data management
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.2.1 Data storage
&lt;/h4&gt;

&lt;p&gt;In HDFS, files are stored in 128MB blocks with three replicas on three different DataNodes for fault tolerance. The number of replicas can be adjusted by modifying the configuration. In addition, HDFS supports Erasure Coding, an efficient data encoding technique that provides fault tolerance by splitting data blocks into smaller chunks and encoding them for storage. Compared to replication, erasure coding can save storage space but will increase computational load.&lt;/p&gt;

&lt;p&gt;On the other hand, JuiceFS splits data into 4MB blocks and stores them in object storage. For big data scenarios, it introduces a logical block of 128MB for computational task scheduling. Since JuiceFS uses object storage as the data storage layer, data reliability depends on the object storage service used, which generally provides reliability assurance through technologies such as replication and Erasure Coding.&lt;/p&gt;
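&lt;p&gt;To make the block sizes concrete, here is a small sketch of how a 1 GiB file divides into blocks under each scheme:&lt;/p&gt;

```python
# Block-count sketch using the sizes above: 128 MiB HDFS blocks (and
# JuiceFS logical blocks for scheduling) vs. 4 MiB JuiceFS object blocks.
import math

MIB = 1024 * 1024

def block_count(file_size: int, block_size: int) -> int:
    return math.ceil(file_size / block_size)

file_size = 1024 * MIB  # a 1 GiB file
print(block_count(file_size, 128 * MIB))  # 8 physical/logical 128 MiB blocks
print(block_count(file_size, 4 * MIB))    # 256 objects in object storage
```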

&lt;h4&gt;
  
  
  2.2.2 Data cache
&lt;/h4&gt;

&lt;p&gt;In HDFS, specified data can be cached in the off-heap memory of DataNodes on the server-side to improve the speed and efficiency of data access. For example, in Hive, small tables can be cached in the memory of DataNodes to improve join speed.&lt;/p&gt;

&lt;p&gt;On the other hand, JuiceFS's data persistence layer resides on object storage, which usually has higher latency. To address this issue, JuiceFS provides client-side data caching, which caches data blocks from object storage on local disks to improve the speed and efficiency of data access.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition not only provides basic client-side caching capabilities but also offers cache sharing functionality, allowing multiple clients to form a cache cluster and share local caches. In addition, JuiceFS Enterprise Edition can also set up a dedicated cache cluster to provide stable caching capabilities for unstable compute nodes (like Kubernetes).&lt;/p&gt;

&lt;h4&gt;
  
  
  2.2.3 Data Locality
&lt;/h4&gt;

&lt;p&gt;HDFS stores the storage location information for each data block, which can be used by resource schedulers such as YARN to achieve affinity scheduling.&lt;/p&gt;

&lt;p&gt;JuiceFS supports using local disk cache to accelerate data access. It calculates and generates preferred location information for each data block based on a pre-configured list of compute nodes, allowing resource schedulers like YARN to allocate compute tasks for the same data to the same fixed node, thereby increasing the cache hit rate.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition also supports sharing data cache among multiple compute nodes, known as distributed cache. Even if a compute task is not scheduled to the preferred node, it can still access cached data through the distributed cache.&lt;/p&gt;
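&lt;p&gt;One way to picture the preferred-location mechanism (an illustrative sketch only, not JuiceFS's actual algorithm) is hashing each block ID onto the configured node list, so the scheduler consistently targets the same node for the same block:&lt;/p&gt;

```python
# Illustrative sketch: derive a stable preferred node per block from a
# pre-configured compute-node list, so repeated tasks hit the same cache.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical compute-node list

def preferred_node(block_id: str, nodes=NODES) -> str:
    digest = hashlib.md5(block_id.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same block always maps to the same node, maximizing cache hits.
print(preferred_node("fileX-block-0") == preferred_node("fileX-block-0"))  # True
```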

&lt;h2&gt;
  
  
  3. Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Data consistency
&lt;/h3&gt;

&lt;p&gt;Both HDFS and JuiceFS ensure strong consistency for their metadata (CP systems), providing strong consistency guarantees for the file system data.&lt;/p&gt;

&lt;p&gt;JuiceFS supports client-side caching of metadata. When client caching is enabled, it may impact data consistency and should be carefully configured for different scenarios.&lt;/p&gt;

&lt;p&gt;For read-write scenarios, both HDFS and JuiceFS provide close-to-open consistency: a newly opened file is guaranteed to see data previously written to files that have already been closed. However, while a file is kept open, it may not see data written by other clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Data Reliability
&lt;/h3&gt;

&lt;p&gt;When writing data to HDFS, applications typically rely on successfully closing files to ensure data persistence; the same applies to JuiceFS.&lt;/p&gt;

&lt;p&gt;To provide low-latency writes, typically for HBase's WAL files, HDFS offers the hflush method, which trades durability (data is written only to the memory of multiple DataNodes) for low latency.&lt;/p&gt;

&lt;p&gt;JuiceFS's hflush method persists data to the client's cache disk (the JuiceFS client write cache, also called writeback mode), so durability depends on the performance and reliability of that disk. In addition, if the upload speed to the object storage is insufficient or the client exits prematurely, data may be lost, affecting HBase's fault recovery ability.&lt;/p&gt;

&lt;p&gt;To ensure higher data reliability, HBase can use HDFS's hsync interface to ensure data persistence. JuiceFS also supports hsync, in which JuiceFS will persist data to the object storage.&lt;/p&gt;
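&lt;p&gt;The hflush/hsync distinction mirrors the familiar POSIX split between flushing application buffers and forcing data to stable storage. A minimal POSIX-level analogy (not JuiceFS or HDFS client code):&lt;/p&gt;

```python
# POSIX analogy for the durability trade-off above: flush() pushes data out
# of the application buffer (like hflush, not necessarily durable), while
# fsync() asks the OS to persist it to stable storage (like hsync).
import os
import tempfile

with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"wal-entry-1\n")
    f.flush()             # hflush-like: data leaves the app buffer
    os.fsync(f.fileno())  # hsync-like: request persistence to stable storage
    path = f.name

with open(path, "rb") as f:
    print(f.read())  # b'wal-entry-1\n'
os.remove(path)
```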

&lt;h3&gt;
  
  
  3.3 Concurrent Read and Write
&lt;/h3&gt;

&lt;p&gt;Both HDFS and JuiceFS support concurrent reading of a single file from multiple machines, which can provide relatively high read performance.&lt;/p&gt;

&lt;p&gt;HDFS does not support concurrent writes to the same file. JuiceFS, on the other hand, does, but the application must manage file offsets itself: if multiple clients write to the same offset simultaneously, the writes will overwrite each other.&lt;/p&gt;
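&lt;p&gt;The offset-management requirement can be illustrated with POSIX positional writes (a single-process sketch standing in for multiple clients):&lt;/p&gt;

```python
# Sketch of offset-managed concurrent writes: disjoint offsets coexist,
# while writes to the same offset overwrite each other.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"AAAA", 0)  # writer 1 at offset 0
    os.pwrite(fd, b"BBBB", 4)  # writer 2 at a disjoint offset: both survive
    os.pwrite(fd, b"CCCC", 0)  # writer 3 reuses offset 0: overwrites writer 1
    print(os.pread(fd, 8, 0))  # b'CCCCBBBB'
finally:
    os.close(fd)
    os.remove(path)
```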

&lt;p&gt;For both HDFS and JuiceFS, if multiple clients have the same file open and one of them modifies it, the others may not be able to read the latest modification.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Security
&lt;/h3&gt;

&lt;p&gt;Kerberos is used for identity authentication, which both HDFS and JuiceFS Enterprise Edition support. JuiceFS Community Edition only accepts unauthenticated usernames and cannot verify user identities.&lt;/p&gt;

&lt;p&gt;Apache Ranger is used for authorization. Both HDFS and JuiceFS Enterprise Edition support it, but the JuiceFS community version does not.&lt;/p&gt;

&lt;p&gt;Both HDFS and JuiceFS Enterprise Edition support setting additional access rules (ACL) for directories and files, but the JuiceFS community version does not.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Data Encryption
&lt;/h3&gt;

&lt;p&gt;HDFS has implemented transparent end-to-end encryption. Once configured, data read and written by users from special HDFS directories are automatically encrypted and decrypted without requiring any modifications to the application code, making the process transparent to the user. For more information, see "Apache Hadoop 3.3.4 - Transparent Encryption in HDFS".&lt;/p&gt;

&lt;p&gt;JuiceFS also supports transparent end-to-end encryption, including encryption in transit and encryption at rest. When encryption at rest is enabled, users manage their own key, and all written data is encrypted with it. For more information, see "Data Encryption". These encryption features are transparent and require no modifications to application code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.6 Snapshot
&lt;/h3&gt;

&lt;p&gt;Snapshots in HDFS refer to read-only mirrors of a directory that allow users to easily access the state of a directory at a particular point in time. The data in a snapshot is read-only and any modifications made to the original directory will not affect the snapshot. In HDFS, snapshots are implemented by recording metadata information on the filesystem directory tree and have the following features:&lt;/p&gt;

&lt;p&gt;Snapshot creation is instantaneous: the cost is O(1) and does not include node lookup time.&lt;br&gt;
Additional memory is used only when modifications are made relative to the snapshot: memory usage is O(M), where M is the number of files/directories modified.&lt;br&gt;
Blocks in the datanodes are not copied: snapshot files record the list of blocks and file size. There is no data replication.&lt;br&gt;
Snapshots do not have any adverse impact on regular HDFS operations: modifications are recorded in reverse chronological order, so current data can be accessed directly. Snapshot data is calculated by subtracting the modified content from the current data.&lt;br&gt;
Furthermore, using snapshot diffs allows for quick copying of incremental data.&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition implements snapshot functionality in a similar way to cloning, quickly copying metadata without copying underlying data. Snapshot creation is O(N) and memory usage is also O(N). Unlike HDFS snapshots, JuiceFS snapshots can be modified.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.7 Storage Quota
&lt;/h3&gt;

&lt;p&gt;Both HDFS and JuiceFS support file count and storage space quotas. For JuiceFS Community Edition, this requires the upcoming 1.1 release.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.8 Symbolic Links
&lt;/h3&gt;

&lt;p&gt;HDFS does not support symbolic links.&lt;/p&gt;

&lt;p&gt;JuiceFS Community Edition supports symbolic links, and the Java SDK can access symbolic links created through the POSIX interface (relative paths).&lt;/p&gt;

&lt;p&gt;JuiceFS Enterprise Edition supports not only symbolic links with relative paths but also links to external storage systems (HDFS compatible), achieving similar effects as Hadoop's ViewFS.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.9 Directory Usage Statistics
&lt;/h3&gt;

&lt;p&gt;HDFS provides the du command to obtain real-time usage statistics for a directory, and JuiceFS also supports the du command. The enterprise edition of JuiceFS can provide real-time results similar to HDFS, while the community edition needs to traverse subdirectories on the client side to gather statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.10 Elastic Scaling
&lt;/h3&gt;

&lt;p&gt;HDFS supports dynamic node adjustment for storage capacity scaling, but this may involve data migration and load balancing issues. In contrast, JuiceFS typically uses cloud object storage, whose natural elasticity enables storage space to be used on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.11 Operations and Maintenance
&lt;/h3&gt;

&lt;p&gt;In terms of operations and management, different major versions of HDFS are incompatible with each other, and upgrades require matching the versions of other ecosystem components, making the process complex.&lt;/p&gt;

&lt;p&gt;In contrast, both the JuiceFS community and enterprise versions support Hadoop 2 and Hadoop 3, and upgrading is simple, only requiring the replacement of jar files. Additionally, JuiceFS provides tools for exporting and importing metadata, facilitating data migration between different clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Applicable Scenarios
&lt;/h2&gt;

&lt;p&gt;When choosing between HDFS and JuiceFS, it is important to consider the different scenarios and requirements.&lt;/p&gt;

&lt;p&gt;HDFS is more suitable for data center environments with relatively fixed hardware, using bare-metal disks as storage media, where elasticity and storage-compute separation are not required.&lt;/p&gt;

&lt;p&gt;However, in public cloud environments, the available types of bare metal disk nodes are often limited, and object storage has become a better storage option. By using JuiceFS, users can achieve storage-compute separation to obtain better elasticity and at the same time support most of the applications in the Hadoop big data ecosystem, making it a more efficient choice.&lt;/p&gt;

&lt;p&gt;In addition, when big data scenarios need to share data with other applications, such as AI, JuiceFS provides richer interface protocols, making it very convenient to share data between different applications and eliminating the need for data copying, which can also be a more convenient choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Feature
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;HDFS&lt;/th&gt;
&lt;th&gt;JuiceFS Community Edition&lt;/th&gt;
&lt;th&gt;JuiceFS Enterprise Edition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Release Date&lt;/td&gt;
&lt;td&gt;2005&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Source&lt;/td&gt;
&lt;td&gt;Apache V2&lt;/td&gt;
&lt;td&gt;Apache V2&lt;/td&gt;
&lt;td&gt;Closed source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High availability&lt;/td&gt;
&lt;td&gt;Support&lt;br&gt;(depends on ZK)&lt;/td&gt;
&lt;td&gt;Depends on metadata engine&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata scaling&lt;/td&gt;
&lt;td&gt;Independent namespace&lt;/td&gt;
&lt;td&gt;Depends on metadata engine&lt;/td&gt;
&lt;td&gt;Horizontal scaling, single namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata storage&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata Caching&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data storage&lt;/td&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data caching&lt;/td&gt;
&lt;td&gt;Datanode memory cache&lt;/td&gt;
&lt;td&gt;Client cache&lt;/td&gt;
&lt;td&gt;Client cache/ distributed cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data locality&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data consistency&lt;/td&gt;
&lt;td&gt;Strong consistency&lt;/td&gt;
&lt;td&gt;Strong consistency&lt;/td&gt;
&lt;td&gt;Strong consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomicity of rename&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Append writes&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File truncation (truncate)&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrent writing&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hflush(HBase WAL)&lt;/td&gt;
&lt;td&gt;Multiple DataNodes' memory&lt;/td&gt;
&lt;td&gt;Write cache disk&lt;/td&gt;
&lt;td&gt;Write cache disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Ranger&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kerberos authentication&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data encryption&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Support (clone)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage quota&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Symbolic link&lt;/td&gt;
&lt;td&gt;Not support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Directory statistics&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Not supported (requires traversal)&lt;/td&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elastic scaling&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations &amp;amp; Maintenance&lt;/td&gt;
&lt;td&gt;Relatively complicated&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Access protocol
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;HDFS&lt;/th&gt;
&lt;th&gt;JuiceFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FileSystem Java API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;libhdfs (C API)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebHDFS (REST API)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POSIX (FUSE or NFS)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes CSI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Although HDFS also provides an NFS Gateway and a FUSE-based client, they are rarely used due to poor POSIX compatibility and insufficient performance.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://juicefs.com/"&gt;Juicedata/JuiceFS&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>SeaweedFS vs JuiceFS</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Fri, 17 Feb 2023 06:04:27 +0000</pubDate>
      <link>https://forem.com/tonybarber2/seaweedfs-vs-juicefs-3dgp</link>
      <guid>https://forem.com/tonybarber2/seaweedfs-vs-juicefs-3dgp</guid>
<description>&lt;p&gt;SeaweedFS is an efficient distributed file system that can read and write small data blocks quickly; its design is modeled on Facebook's Haystack. In this article, we compare the design and features of SeaweedFS and JuiceFS to help readers make an informed choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  SeaweedFS System Architecture
&lt;/h2&gt;

&lt;p&gt;SeaweedFS consists of three parts: &lt;strong&gt;a Volume Server&lt;/strong&gt; that stores files at the bottom layer, a &lt;strong&gt;Master Server&lt;/strong&gt; that manages the cluster, and an optional &lt;strong&gt;Filer&lt;/strong&gt; component that provides richer functionality on top.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qzz767cgvnj1lslqmlx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qzz767cgvnj1lslqmlx.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Volume Server and Master Server
&lt;/h3&gt;

&lt;p&gt;In terms of system operation, Volume Server and Master Server work together as file storage. Volume Server focuses on writing and reading data, and Master Server provides management service for clusters and Volumes.&lt;/p&gt;

&lt;p&gt;When reading and writing data, the implementation of SeaweedFS is similar to Haystack in that a user-created Volume is a large disk file (Superblock in the figure below). In this Volume, all files written by the user (Needle in the figure below) are merged into the large disk file.&lt;/p&gt;

&lt;p&gt;Before writing data, the caller needs to send a write request to SeaweedFS (the Master Server), which returns a File ID (consisting of a Volume ID and an offset) based on the current amount of data, along with basic metadata such as the file length and its chunks. When the write completes, the caller must associate the file with the returned File ID in an external system (e.g., MySQL). When reading, the contents of the file can be fetched very efficiently because the data can be located directly by its File ID.&lt;/p&gt;
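The write flow above can be sketched against SeaweedFS's HTTP API (a hypothetical sketch: the hosts, ports, and the File ID value are illustrative, though `/dir/assign` is SeaweedFS's documented assignment endpoint). The returned File ID has the form `volume_id,rest`, so a client can tell which Volume Server to talk to by splitting on the comma:

```shell
# 1) Ask the Master Server to assign a File ID, e.g.:
#      curl http://master:9333/dir/assign
#    → {"fid":"3,01637037d6","url":"volume-server:8080",...}
# 2) Upload the file to the returned Volume Server under that File ID:
#      curl -F file=@myphoto.jpg http://volume-server:8080/3,01637037d6
# 3) Later, read it back directly by File ID:
#      curl http://volume-server:8080/3,01637037d6

# The Volume ID is the part of the File ID before the comma:
fid="3,01637037d6"
volume_id=${fid%%,*}
echo "volume=$volume_id"
```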

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc8gawcw1bf2olsj1bm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc8gawcw1bf2olsj1bm2.png" alt=" " width="662" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filer
&lt;/h3&gt;

&lt;p&gt;On top of the underlying storage unit described above, SeaweedFS provides a component called Filer, which connects to the Volume Server and the Master Server and provides rich functionality upwards (such as POSIX support, WebDAV, and an S3 interface). Like JuiceFS, the Filer needs an external database to store metadata.&lt;/p&gt;

&lt;p&gt;For ease of illustration, all references to SeaweedFS below include the Filer component.&lt;/p&gt;

&lt;h2&gt;
  
  
  JuiceFS Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqgmohz1ic92125kcbkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqgmohz1ic92125kcbkk.png" alt=" " width="762" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;JuiceFS stores data and metadata separately: the file data itself is split and stored in an object store (e.g., Amazon S3), while the metadata is stored in a database of the user's choice (e.g., Redis, MySQL). By sharing the same metadata engine and object storage, JuiceFS achieves a distributed file system with strong consistency guarantees, as well as full POSIX compatibility and high performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison in Metadata
&lt;/h2&gt;

&lt;p&gt;Both SeaweedFS and JuiceFS support external databases for storing metadata. SeaweedFS supports up to &lt;a href="https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores" rel="noopener noreferrer"&gt;24 databases&lt;/a&gt;. Because of its high requirements for database transactions (see below), JuiceFS supports 3 types of transactional databases, 10 in total.&lt;/p&gt;

&lt;h3&gt;
  
  
  Atomic Operations
&lt;/h3&gt;

&lt;p&gt;To ensure the atomicity of all metadata operations, JuiceFS uses a transactional database at the implementation level. SeaweedFS, on the other hand, has weaker requirements for database transaction capabilities and enables transactions only for some databases (SQL, ArangoDB, and TiKV) and only when performing rename operations. Also, since the original directory or file is not locked while its metadata is copied during a rename, data corruption may occur.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Log (changelog)
&lt;/h3&gt;

&lt;p&gt;SeaweedFS generates a changelog for all metadata operations, which can be used for data replication (see below), operation auditing, etc., while JuiceFS does not implement this feature yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison in Storage
&lt;/h2&gt;

&lt;p&gt;As mentioned above, the data storage of SeaweedFS is implemented by the Volume Server and Master Server, which support features such as merged storage of small data blocks and erasure coding. The data storage of JuiceFS is based on object storage, and the corresponding features are provided by the object storage service the user selects.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Splitting
&lt;/h3&gt;

&lt;p&gt;When storing data, SeaweedFS and JuiceFS both split files into smaller chunks and then persist them in the underlying data system. SeaweedFS splits files into 8MB chunks, and for very large files (over 8GB) it also saves the chunk indexes to the underlying data system. JuiceFS, on the other hand, splits files into 64MB chunks and then into 4MB objects, and uses the internal concept of a Slice to optimize the performance of random writes, sequential reads, repeated writes, etc. (See &lt;a href="https://juicefs.com/docs/community/internals/io_processing/" rel="noopener noreferrer"&gt;An introduction to the workflow of processing read and write&lt;/a&gt; for details.)&lt;/p&gt;
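To make the split sizes concrete, the following plain-shell arithmetic (an illustration, not JuiceFS code) maps a byte offset to the 64MB chunk and the 4MB object within that chunk under JuiceFS's splitting scheme:

```shell
CHUNK_SIZE=$((64 * 1024 * 1024))   # JuiceFS chunk size: 64 MiB
BLOCK_SIZE=$((4 * 1024 * 1024))    # JuiceFS object/block size: 4 MiB

offset=$((200 * 1024 * 1024))      # a byte offset 200 MiB into a file

chunk_index=$((offset / CHUNK_SIZE))           # which 64 MiB chunk
offset_in_chunk=$((offset % CHUNK_SIZE))       # position inside it
block_index=$((offset_in_chunk / BLOCK_SIZE))  # which 4 MiB object

echo "chunk=$chunk_index block=$block_index"   # chunk=3 block=2
```

Fixed-size objects like these map cleanly onto object storage keys, which is why a random write only needs to touch the affected 4 MB objects rather than rewrite the whole file.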

&lt;h3&gt;
  
  
  Tiered Storage
&lt;/h3&gt;

&lt;p&gt;For newly created Volumes, SeaweedFS stores the data locally, while for older Volumes, SeaweedFS supports uploading them to the cloud to achieve &lt;a href="https://github.com/seaweedfs/seaweedfs/wiki/Tiered-Storage" rel="noopener noreferrer"&gt;tiered storage&lt;/a&gt;. In this regard, JuiceFS relies on external services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Compression
&lt;/h3&gt;

&lt;p&gt;JuiceFS supports data compression with LZ4 or Zstandard, while SeaweedFS chooses a compression algorithm based on information such as the extension and type of the written file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Encryption
&lt;/h3&gt;

&lt;p&gt;JuiceFS supports encryption in transit and encryption at rest. When encryption at rest is enabled, the user needs to provide a self-managed key, and all written data will be encrypted based on this key. (See "&lt;a href="https://juicefs.com/docs/community/security/encrypt/" rel="noopener noreferrer"&gt;Data Encryption&lt;/a&gt;" for details.)&lt;/p&gt;

&lt;p&gt;SeaweedFS also supports in-transit encryption and encryption at rest. When data encryption is enabled, all data written to Volume Server is encrypted with a random key, and the information about the corresponding random key is managed by the Filer, which maintains the metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Protocols
&lt;/h2&gt;

&lt;h3&gt;
  
  
  POSIX Compatibility
&lt;/h3&gt;

&lt;p&gt;JuiceFS is &lt;a href="https://juicefs.com/docs/community/posix_compatibility/" rel="noopener noreferrer"&gt;fully POSIX compatible&lt;/a&gt;, while SeaweedFS is only partially POSIX compatible (see &lt;a href="https://github.com/seaweedfs/seaweedfs/issues/1588" rel="noopener noreferrer"&gt;Issue 1588&lt;/a&gt; and the &lt;a href="https://github.com/seaweedfs/seaweedfs/wiki/FUSE-Mount" rel="noopener noreferrer"&gt;Wiki&lt;/a&gt;), with compatibility still being improved.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 Protocol
&lt;/h3&gt;

&lt;p&gt;JuiceFS implements an &lt;a href="https://juicefs.com/docs/community/s3_gateway/" rel="noopener noreferrer"&gt;S3 gateway&lt;/a&gt; through the MinIO S3 gateway. It provides an S3-compatible RESTful API for files in JuiceFS, enabling management with tools such as s3cmd, AWS CLI, and MinIO Client (mc) when mounting is not convenient.&lt;/p&gt;
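For illustration, launching the gateway typically looks like the following sketch (the metadata URL, listen address, and credentials are placeholders; the credential environment variable names may differ between JuiceFS versions):

```shell
# Credentials that the embedded MinIO gateway will accept
export MINIO_ROOT_USER=admin
export MINIO_ROOT_PASSWORD=12345678

# Expose an existing JuiceFS volume over the S3 API on port 9000
juicefs gateway redis://localhost:6379/1 localhost:9000

# Any S3 client can then address the volume, e.g. the AWS CLI:
#   aws s3 ls s3://myjfs/ --endpoint-url http://localhost:9000
```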

&lt;p&gt;SeaweedFS currently supports about 20 S3 APIs, covering common read, write, query, and delete requests, with extended functionality for some specific requests (e.g., Read), as detailed in Amazon-S3-API.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebDAV Protocol
&lt;/h3&gt;

&lt;p&gt;Both JuiceFS and SeaweedFS support the WebDAV protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  HDFS Compatibility
&lt;/h3&gt;

&lt;p&gt;JuiceFS is fully compatible with the HDFS API, not only with Hadoop 2.x and Hadoop 3.x, but also with various components of the Hadoop ecosystem. SeaweedFS provides basic compatibility with the HDFS API, but some operations (such as truncate, concat, checksum, and extended attributes) are not yet supported.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSI Drivers
&lt;/h3&gt;

&lt;p&gt;Both JuiceFS and SeaweedFS provide a Kubernetes CSI Driver to help users use the corresponding file system in the Kubernetes ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extended Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Client-side caching
&lt;/h3&gt;

&lt;p&gt;JuiceFS has a variety of client-side caching strategies, covering everything from metadata to data caching, and allows users to tune them for their own scenarios, while SeaweedFS does not have client-side caching capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Data Replication
&lt;/h3&gt;

&lt;p&gt;For data replication between multiple clusters, SeaweedFS supports two asynchronous replication modes, Active-Active and Active-Passive, both of which achieve consistency between clusters by replaying logs. Each changelog carries a signature to ensure that the same modification is not applied more than once. In Active-Active mode, when the number of cluster nodes exceeds 2, some SeaweedFS operations, such as renaming directories, are limited.&lt;/p&gt;

&lt;p&gt;JuiceFS does not natively support data synchronization between clusters and relies on the data replication capabilities of the metadata engine and the object store.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-cloud Data Caching
&lt;/h3&gt;

&lt;p&gt;SeaweedFS can act as a cache for data already stored in cloud object storage and supports manual cache warm-up. Modifications to cached data are synchronized asynchronously to the object store. JuiceFS, which stores files in the object store in chunks of its own format, does not yet support cache acceleration for data that already exists in the object store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trash
&lt;/h3&gt;

&lt;p&gt;By default, JuiceFS enables the Trash feature, which automatically moves deleted files to the &lt;code&gt;.trash&lt;/code&gt; directory under the JuiceFS root and retains them for a configurable period before the data is actually cleaned up. SeaweedFS does not currently support this feature.&lt;/p&gt;
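As a sketch (the metadata URL and mount point are placeholders), the retention period is controlled by a `--trash-days` option, and deleted files remain browsable under the volume root:

```shell
# Keep deleted files for 7 days before actually reclaiming the space
juicefs config redis://localhost:6379/1 --trash-days 7

# Deleted files appear under .trash at the mount point's root and can
# be restored by moving them back out before they expire:
ls /mnt/jfs/.trash
```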

&lt;h3&gt;
  
  
  Operations and Maintenance Tools
&lt;/h3&gt;

&lt;p&gt;JuiceFS provides two subcommands, &lt;code&gt;juicefs stats&lt;/code&gt; and &lt;code&gt;juicefs profile&lt;/code&gt;, allowing users to view real-time performance metrics or play back metrics from a given period. JuiceFS also exposes a metrics API, so users can easily monitor it with Prometheus and Grafana.&lt;/p&gt;
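For example, both subcommands take the mount point as an argument (the mount point path here is illustrative):

```shell
# Show real-time performance metrics for a mounted volume
juicefs stats /mnt/jfs

# Summarize recent access patterns from the volume's access log
juicefs profile /mnt/jfs
```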

&lt;p&gt;SeaweedFS supports both Push and Pull approaches to work with Prometheus and Grafana, and provides an interactive weed shell tool for users to perform a range of operations and maintenance tasks (e.g., checking the current cluster status, listing files, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Comparisons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In terms of release date, SeaweedFS was released in April 2015 and has accumulated 16.4K stars so far, while JuiceFS was released in January 2021 and has accumulated 7.3K stars.&lt;/li&gt;
&lt;li&gt;In terms of the project, both JuiceFS and SeaweedFS adopt the commercial-friendly Apache License 2.0. SeaweedFS is mainly maintained by Chris Lu personally, while JuiceFS is mainly maintained by Juicedata, inc.&lt;/li&gt;
&lt;li&gt;Both JuiceFS and SeaweedFS are written in Go language.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;SeaweedFS&lt;/th&gt;
&lt;th&gt;JuiceFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metadata&lt;/td&gt;
&lt;td&gt;Multi-engine&lt;/td&gt;
&lt;td&gt;Multi-engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomicity of Metadata Operations&lt;/td&gt;
&lt;td&gt;Not guaranteed&lt;/td&gt;
&lt;td&gt;Guaranteed through database transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Changelogs&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data storage&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;Reliance on external services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Erasure coding&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Reliance on external services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data merge&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Reliance on external services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File splitting&lt;/td&gt;
&lt;td&gt;8MB&lt;/td&gt;
&lt;td&gt;64MB + 4MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tiered storage&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Reliance on external services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data compression&lt;/td&gt;
&lt;td&gt;Supported (extension-based)&lt;/td&gt;
&lt;td&gt;Supported (global settings)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From &lt;a href="https://juicefs.com/en" rel="noopener noreferrer"&gt;Juicedata/JuiceFS&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>productivity</category>
      <category>learning</category>
    </item>
    <item>
      <title>How to implement a distributed /etc directory using etcd and JuiceFS</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Fri, 10 Feb 2023 06:57:37 +0000</pubDate>
      <link>https://forem.com/tonybarber2/how-to-implement-a-distributed-etc-directory-using-etcd-and-juicefs-2edp</link>
      <guid>https://forem.com/tonybarber2/how-to-implement-a-distributed-etc-directory-using-etcd-and-juicefs-2edp</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;etcd is a key-value database with consistency and high availability. It is simple, safe, fast, and reliable, and it is the primary data store for Kubernetes. Let's start with the official explanation of etcd's name.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The name “etcd” originated from two ideas, the unix “/etc” folder and “d”istributed systems. The “/etc” folder is a place to store configuration data for a single system whereas etcd stores configuration information for large scale distributed systems. Hence, a “d”istributed “/etc” is “etcd”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The above quote comes from the &lt;a href="https://etcd.io/docs/v3.1/learning/why/"&gt;etcd official website&lt;/a&gt;. etcd imaginatively combines the concepts of etc (where Linux systems typically store their configuration files) and distributed. However, since etcd is served via an HTTP API, a truly distributed /etc directory was "unfortunately" not implemented. Below we introduce how etcd can be used with JuiceFS to implement a truly distributed /etc directory.&lt;/p&gt;

&lt;p&gt;We use JuiceFS, an open source distributed file system, to provide a POSIX file interface for &lt;code&gt;/etc&lt;/code&gt;; JuiceFS can use etcd as a metadata engine to store file system metadata such as the directory tree and filenames. With the JuiceFS CSI Driver, it can be used as a Persistent Volume so that multiple application instances share configuration files on the Kubernetes platform, which is a genuinely distributed &lt;code&gt;/etc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The following sections introduce what JuiceFS is, why and how JuiceFS can implement a distributed /etc, and how etcd and JuiceFS together let multiple application instances share configuration files.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is JuiceFS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O0UxoVFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyvgaxjpb1klibrwd1op.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O0UxoVFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyvgaxjpb1klibrwd1op.png" alt="Image description" width="762" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JuiceFS is an open source distributed file system designed for the cloud, providing full POSIX, HDFS and S3 API compatibility.&lt;/strong&gt; JuiceFS separates "data" and "metadata" storage: files are split into chunks and stored in object storage, while the corresponding metadata can be stored in various databases such as Redis, MySQL, TiKV, and SQLite. The v1.0-beta3 release officially supports etcd as a metadata engine. The data storage layer can connect to any object storage, and since v1.0-rc1 etcd is also supported as a data storage engine, which is suitable for storing small configuration files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why JuiceFS can achieve distributed /etc
&lt;/h2&gt;

&lt;p&gt;From the layered architecture described above, we can see that JuiceFS stores metadata in a database and file data in object storage, allowing users to access the same tree-structured file system on different nodes. With JuiceFS, we can put configuration files in the file system and then mount JuiceFS at the /etc directory of each application, realizing a truly distributed "/etc". The whole process is shown in the figure below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bxILE7Q9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9weja5iau6iw4vvzkitv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bxILE7Q9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9weja5iau6iw4vvzkitv.png" alt="Image description" width="691" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to implement distributed /etc
&lt;/h3&gt;

&lt;p&gt;The next section describes how etcd and JuiceFS make it possible for multiple Nginx instances to share the same configuration, implementing a distributed /etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy etcd
&lt;/h3&gt;

&lt;p&gt;In a Kubernetes cluster, it is recommended to build an independent etcd cluster for JuiceFS instead of using the cluster's default etcd, to prevent the file system's high access pressure from affecting the stability of the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;To install etcd, you can follow the official documentation to build a multi-node &lt;a href="https://etcd.io/docs/v3.5/op-guide/clustering/"&gt;etcd cluster&lt;/a&gt;, or use the &lt;a href="https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md"&gt;etcd chart&lt;/a&gt; provided by Bitnami.&lt;/p&gt;
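With the Bitnami chart, a minimal dedicated cluster can be installed roughly as follows (a sketch; the release name and values, such as disabling RBAC auth, are illustrative and depend on the chart version):

```shell
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install juicefs-etcd bitnami/etcd \
  --set replicaCount=3 \
  --set auth.rbac.create=false
```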

&lt;p&gt;If the data is sensitive, you can enable etcd's encrypted communication for encrypted data transmission. Refer to the &lt;a href="https://github.com/etcd-io/etcd/tree/main/hack/tls-setup/"&gt;sample script&lt;/a&gt; provided by the etcd project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing configuration files in JuiceFS
&lt;/h3&gt;

&lt;p&gt;After the etcd cluster is installed, we can initialize and mount a JuiceFS file system with two commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ juicefs format etcd://$IP1:2379,$IP2:2379,$IP3:2379/jfs --storage etcd --bucket etcd://$IP1:2379,$IP2:2379,$IP3:2379/data pics
$ juicefs mount etcd://$IP1:2379,$IP2:2379,$IP3:2379/jfs /mnt/jfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After mounting the JuiceFS volume to &lt;code&gt;/mnt/jfs&lt;/code&gt;, you can directly place the &lt;code&gt;nginx.conf&lt;/code&gt; file in that directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using JuiceFS in Kubernetes
&lt;/h3&gt;

&lt;p&gt;First we should create a Secret and fill it with etcd connection information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: juicefs-secret
  namespace: kube-system
type: Opaque
stringData:
  name: test
  metaurl: etcd://$IP1:2379,$IP2:2379,$IP3:2379/jfs
  storage: etcd
  bucket: etcd://$IP1:2379,$IP2:2379,$IP3:2379/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the metadata engine and the object store use etcd, where $IP1, $IP2, and $IP3 are the client endpoints of etcd. Then we create the PV and PVC (see &lt;a href="https://juicefs.com/docs/csi/examples/static-provisioning"&gt;this document&lt;/a&gt;). You can then mount the PVC to &lt;code&gt;/etc&lt;/code&gt; in the Nginx application, as follows.&lt;br&gt;
&lt;/p&gt;
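As a hedged sketch of that static provisioning (based on the JuiceFS CSI static-provisioning pattern; the PV name and capacity, and the PVC name `etcd` matching the Deployment's claimName, are illustrative assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: juicefs-etcd-pv          # illustrative name
spec:
  capacity:
    storage: 10Gi                # placeholder; JuiceFS does not enforce it
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.juicefs.com
    volumeHandle: juicefs-etcd-pv
    fsType: juicefs
    nodePublishSecretRef:
      name: juicefs-secret       # the Secret created above
      namespace: kube-system
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd                     # referenced by the Deployment's claimName
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ""           # bind statically to the PV above
  volumeName: juicefs-etcd-pv
```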

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
    …
    volumes:
      - name: config
        persistentVolumeClaim:
          claimName: etcd
    containers:
      - image: nginx
        volumeMounts:
        - mountPath: /etc/nginx
          name: config
         …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the Nginx application starts, it reads the &lt;code&gt;nginx.conf&lt;/code&gt; configuration file under &lt;code&gt;/etc/nginx&lt;/code&gt;, that is, the &lt;code&gt;nginx.conf&lt;/code&gt; file we placed in JuiceFS, shared through the JuiceFS volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;etcd has become the de facto standard cloud-native key-value database for storing service configuration information on the Kubernetes platform. However, managing the configuration files of many upper-layer applications is still inconvenient. This article shares how JuiceFS turns etcd into a distributed "/etc", helping etcd fulfill its original dream ✌🏻.&lt;/p&gt;

&lt;p&gt;About the author:&lt;/p&gt;

&lt;p&gt;Weiwei Zhu, Juicedata engineer, is responsible for developing and maintaining the JuiceFS CSI Driver and for JuiceFS development in the cloud-native area.&lt;/p&gt;

&lt;p&gt;from &lt;a href="https://juicefs.com/en/"&gt;JuiceFS/Juicedata&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>database</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Migrating Elasticsearch’s warm &amp; cold data to object storage with JuiceFS</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Tue, 17 Jan 2023 06:18:14 +0000</pubDate>
      <link>https://forem.com/tonybarber2/migrating-elasticsearchs-warm-cold-data-to-object-storage-with-juicefs-3fjb</link>
      <guid>https://forem.com/tonybarber2/migrating-elasticsearchs-warm-cold-data-to-object-storage-with-juicefs-3fjb</guid>
<description>&lt;p&gt;With the development of cloud computing, object storage has gained favor for its low price and elastically scalable space, and more and more enterprises are migrating warm and cold data to it. However, migrating indexing and analysis components directly to object storage hinders query performance and causes compatibility issues. &lt;/p&gt;

&lt;p&gt;This article explains the fundamentals of hot and cold data tiering in Elasticsearch and introduces how to use JuiceFS to cope with the problems that occur on object storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 Elasticsearch’s data tier architecture
&lt;/h2&gt;

&lt;p&gt;There are three concepts to know before diving into how ES implements a hot and cold data tiering strategy: &lt;strong&gt;Data Stream, Index Lifecycle Management (ILM), and Node Role&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Data Stream
&lt;/h3&gt;

&lt;p&gt;Data Stream is an important concept in ES, which has the following characteristics: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming writes. A data stream is a data set written as a stream rather than in blocks.&lt;/li&gt;
&lt;li&gt;Append-only writes. Data Stream updates data via append writes and does not require modifying existing data.&lt;/li&gt;
&lt;li&gt;Timestamp. A timestamp is given to each new piece of data to record when it was created.&lt;/li&gt;
&lt;li&gt;Multiple indexes. In ES, every piece of data resides in an index. A data stream is a higher-level concept: one data stream may be composed of many indexes, which are generated according to different rules. However, only the latest index is writable, while the historical indexes are read-only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Log data is a typical type of data stream. It is append-only and timestamped, and users generate new indexes along different dimensions, such as by day.&lt;/p&gt;

&lt;p&gt;The diagram below shows a simple example of index creation for a data stream. While using the data stream, ES writes directly to the latest index. As more data is generated, this index eventually becomes an old, read-only index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiyq196xd3sxmxseqfv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiyq196xd3sxmxseqfv6.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following diagram illustrates writing data to ES, which involves two stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1: Data is first written to an in-memory buffer.&lt;/li&gt;
&lt;li&gt;Stage 2: The buffer is flushed to the local disk according to certain rules and timing. The persisted data, shown in green in the diagram, is known as a segment in ES.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There may be some time lag in this process: a newly created segment cannot be searched if a query arrives while it is being persisted. Once the segment is persisted, the upper-level query engine can search it immediately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcsrzyugl1yrz6ab24xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcsrzyugl1yrz6ab24xk.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Lifecycle Management
&lt;/h3&gt;

&lt;p&gt;Index Lifecycle Management (ILM) manages the lifecycle of an index, which it divides into five phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot data: needs to be updated or queried frequently.&lt;/li&gt;
&lt;li&gt;Warm data: no longer updated, but still queried frequently.&lt;/li&gt;
&lt;li&gt;Cold data: no longer updated and queried less frequently. &lt;/li&gt;
&lt;li&gt;Frozen data: no longer updated and hardly ever queried. It is safe to put this kind of data on a relatively slow and cheap storage medium.&lt;/li&gt;
&lt;li&gt;Deleted data: no longer needed and can be safely deleted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All ES index data eventually goes through these stages, and users manage their data by setting different ILM policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Role
&lt;/h3&gt;

&lt;p&gt;In ES, each deployed node has a node role. Different roles, such as master, data, and ingest, are assigned to each ES node. Users can combine node roles with the lifecycle phases mentioned above for data management.&lt;/p&gt;

&lt;p&gt;A data node can serve different lifecycle stages: it may store hot, warm, cold, or even frozen data. Nodes are assigned roles based on their tasks, and nodes with different roles are sometimes configured with different hardware.&lt;/p&gt;

&lt;p&gt;For example, hot data nodes need high-performance CPUs and disks, while warm and cold data nodes, whose data is queried less frequently, have lower requirements for computing resources.&lt;/p&gt;

&lt;p&gt;Node roles are defined based on different stages of the data lifecycle. &lt;strong&gt;Note that each ES node can have multiple roles, and roles do not need to map one-to-one to lifecycle phases.&lt;/strong&gt; Here's an example where node.roles is configured in the ES YAML file; you can configure multiple roles for a node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node.roles: ["data_hot", "data_content"] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lifecycle Policy
&lt;/h3&gt;

&lt;p&gt;After understanding the concepts of Data Stream, Index Lifecycle Management, and Node Role, you can customize lifecycle policies for your data. &lt;/p&gt;

&lt;p&gt;Based on index characteristics defined in the policy, such as the size of the index, the number of documents in it, and the time when it was created, ES automatically rolls data over from one lifecycle stage to the next; this is known as rollover in ES.&lt;/p&gt;

&lt;p&gt;For example, a user can define a rule based on index size to roll hot data over to the warm phase, or roll warm data over to the cold phase according to other rules. ES performs the rollover automatically, but the lifecycle policy must be defined by the user.&lt;/p&gt;

&lt;p&gt;The screenshot below shows Kibana's administration interface, which lets users configure lifecycle policies graphically. You can see three phases: hot data, warm data, and cold data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj9c57hfnjmj60iwk1ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj9c57hfnjmj60iwk1ey.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expanding the advanced settings reveals more details about configuring policies based on different characteristics, which are listed on the right side of the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3bvodzkyoatzpe22riv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3bvodzkyoatzpe22riv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maximum index size. Taking the 50 GB in the screenshot above as an example: data rolls over from the hot phase to the warm phase once the index exceeds 50 GB.&lt;/p&gt;

&lt;p&gt;Maximum documents. The basic storage unit of an ES index is the document, and user data is written to ES in the form of documents, so the number of documents is a measurable indicator.&lt;/p&gt;

&lt;p&gt;Maximum age. With 30 days as an example: once an index has existed for 30 days, it triggers the rollover from the hot phase to the warm phase mentioned previously.&lt;/p&gt;
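&lt;p&gt;These three criteria correspond to the rollover action in an ILM policy. A minimal hot-phase policy matching the screenshot might look like this (the policy name and values are illustrative):&lt;/p&gt;

```json
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_docs": 100000000,
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```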

&lt;p&gt;However, using Elasticsearch directly on object storage leads to poor write performance, compatibility problems, and other issues. Companies that also want to preserve query performance have therefore started looking for solutions on the cloud, and in this context JuiceFS is increasingly used in data tiering architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Practice of ES + JuiceFS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Prepare multiple types of nodes and assign different roles&lt;/strong&gt;. Each ES node can be assigned different roles, such as storing hot data, warm data, cold data, etc. Users need to prepare different types of nodes to match the needs of different roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Mount the JuiceFS file system&lt;/strong&gt;. JuiceFS is generally used for warm and cold data storage, so mount the JuiceFS file system locally on the ES warm or cold data nodes. The mount point can be configured into ES through symbolic links or other means, so that ES treats its data directory as local while the directory actually lives on JuiceFS.&lt;/p&gt;
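&lt;p&gt;A minimal sketch of this step, assuming the JuiceFS Community Edition CLI; the metadata URL and all paths are placeholders to adapt to your own deployment:&lt;/p&gt;

```shell
# Mount the JuiceFS file system on the warm/cold data node (metadata URL is a placeholder)
juicefs mount redis://meta-host:6379/1 /mnt/jfs -d

# Make ES see the mount as a local data directory, e.g. via a symbolic link
mkdir -p /mnt/jfs/es-warm
ln -s /mnt/jfs/es-warm /var/lib/elasticsearch/warm-data
```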

&lt;p&gt;&lt;strong&gt;Step 3: Create a lifecycle policy&lt;/strong&gt;. This needs to be customized by each user, either through the ES API or through Kibana, which provides some relatively easy ways to create and manage lifecycle policies. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Set a lifecycle policy for the index&lt;/strong&gt;. After creating a lifecycle policy, apply it to the index, that is, set the policy you just created on the index. You can do this with index templates, which can be created in Kibana, or explicitly through the API via index.lifecycle.name.&lt;/p&gt;
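&lt;p&gt;For example, an index template that applies a policy via index.lifecycle.name might look like the following (template, pattern, policy, and alias names are illustrative):&lt;/p&gt;

```json
PUT _index_template/my_template
{
  "index_patterns": ["my-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "my_policy",
      "index.lifecycle.rollover_alias": "my-logs"
    }
  }
}
```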

&lt;p&gt;Here are a few tips. &lt;br&gt;
&lt;strong&gt;Tip 1: The number of copies (replicas) for warm or cold indexes can be set to 1&lt;/strong&gt;. All data is placed on JuiceFS and eventually uploaded to the underlying object storage, so data reliability is high enough. Accordingly, the number of replicas can be reduced on the ES side to save storage space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhpydpwesnckuwkpvg4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhpydpwesnckuwkpvg4x.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
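&lt;p&gt;In an ILM policy this is expressed with the allocate action of the warm (or cold) phase; a policy fragment might look like this (the value is illustrative):&lt;/p&gt;

```json
"warm": {
  "actions": {
    "allocate": {
      "number_of_replicas": 1
    }
  }
}
```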

&lt;p&gt;&lt;strong&gt;Tip 2: Force merge can cause sustained CPU usage on nodes, so turn it off when appropriate&lt;/strong&gt;. When moving data from hot to warm, ES merges the segments underlying the hot index. If force merge is enabled, ES first merges these segments and then stores them in the warm tier's underlying storage. However, merging segments is a very CPU-intensive process. If the warm data nodes also need to serve query requests, you can turn this feature off, keeping the segments intact and writing them directly to the underlying storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 3: Indexes in the warm or cold phase can be set to read-only&lt;/strong&gt;. Once data reaches the warm or cold phase, it can basically be assumed to be read-only: the indexes will not be modified. Setting the index to read-only reduces resource usage on the warm and cold data nodes, so you can scale these nodes down and save hardware resources.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://juicefs.com/" rel="noopener noreferrer"&gt;Juicedata/JuiceFS&lt;/a&gt; ！ (0ᴗ0✿)&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>database</category>
    </item>
    <item>
      <title>POSIX Compatibility Comparison Among Four File Systems on the Cloud</title>
      <dc:creator>tonybarber2</dc:creator>
      <pubDate>Mon, 21 Nov 2022 09:16:57 +0000</pubDate>
      <link>https://forem.com/tonybarber2/posix-compatibility-comparison-among-four-file-system-on-the-cloud-1mhj</link>
      <guid>https://forem.com/tonybarber2/posix-compatibility-comparison-among-four-file-system-on-the-cloud-1mhj</guid>
      <description>&lt;p&gt;POSIX compatibility is an indispensable criterion when choosing file system. Recently, we conducted a test on POSIX compatibility among GCP Filestore, Amazon EFS, Azure Files and JuiceFS.&lt;/p&gt;

&lt;h2&gt;
  
  
  About POSIX
&lt;/h2&gt;

&lt;p&gt;POSIX (Portable Operating System Interface) is the most widely used interface standard for operating systems, including file systems. If you want to learn more about POSIX, refer to the Quora question "&lt;a href="https://www.quora.com/What-does-POSIX-conformance-compliance-mean-in-the-distributed-systems-world" rel="noopener noreferrer"&gt;What does POSIX conformance/compliance mean in the distributed systems world?&lt;/a&gt;"&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Method
&lt;/h2&gt;

&lt;p&gt;One popular POSIX compatibility test suite is &lt;a href="https://github.com/pjd/pjdfstest" rel="noopener noreferrer"&gt;pjdfstest&lt;/a&gt;, which is derived from FreeBSD and also applicable to systems such as Linux.&lt;/p&gt;
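&lt;p&gt;For reference, pjdfstest is built and run roughly as follows on Linux (per the project's README; the suite should be run as root from a directory on the file system under test, and the paths here are placeholders):&lt;/p&gt;

```shell
git clone https://github.com/pjd/pjdfstest.git
cd pjdfstest
autoreconf -ifs
./configure
make pjdfstest

# Run the whole suite from a directory on the target file system
cd /path/to/filesystem/under/test
prove -rv /path/to/pjdfstest/tests
```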

&lt;h2&gt;
  
  
  Test Results
&lt;/h2&gt;

&lt;p&gt;The test results are shown below. JuiceFS failed 0 cases, showing the best compatibility. GCP Filestore came second, with two failures. Amazon EFS failed orders of magnitude more test cases than the other products.&lt;/p&gt;

&lt;p&gt;Note that, for ease of comparison, a logarithmic scale is used for the horizontal axis of the results figure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feobflo1fe5d7eh2ku42b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feobflo1fe5d7eh2ku42b.png" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Use Case Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GCP Filestore
&lt;/h3&gt;

&lt;p&gt;GCP Filestore failed 2 tests in total, one in each of the unlink and utimensat categories.&lt;/p&gt;

&lt;p&gt;The first one is in the unlink test set &lt;a href="https://github.com/pjd/pjdfstest/blob/master/tests/unlink/14.t" rel="noopener noreferrer"&gt;unlink/14.t&lt;/a&gt;, and the corresponding log is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/root/pjdfstest/tests/unlink/14.t ...........
not ok 4 - tried 'open pjdfstest_b03f52249a0c653a3f382dfe1237caa1 O_RDONLY : unlink pjdfstest_b03f52249a0c653a3f382dfe1237caa1 : fstat 0 nlink', expected 0, got 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test set (&lt;a href="https://github.com/pjd/pjdfstest/blob/master/tests/unlink/14.t" rel="noopener noreferrer"&gt;unlink/14.t&lt;/a&gt;) verifies the behavior of a file that is deleted while it is open.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desc="An open file will not be immediately freed by unlink"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deleting a file actually corresponds to unlink at the system level, which removes the link from the filename to the inode and then decrements the corresponding nlink by 1. This test verifies exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A deleted file's link count should be 0
expect 0 open ${n0} O_RDONLY : unlink ${n0} : fstat 0 nlink
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The contents of a file are only truly deleted when the link count (nlink) drops to 0 and no open file descriptors (fds) point to the file. If nlink is not updated correctly, files that should have been deleted may remain on the system.&lt;/p&gt;
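&lt;p&gt;The behavior this test checks can be reproduced with a few lines of Python on a local Linux file system (a sketch, not part of the original test suite):&lt;/p&gt;

```python
import os
import tempfile

# Create a file and keep an open descriptor on it.
fd, path = tempfile.mkstemp()

# unlink removes the name-to-inode link; the open fd keeps the inode alive.
os.unlink(path)

# POSIX says the link count visible through the open descriptor is now 0.
nlink = os.fstat(fd).st_nlink
print(nlink)  # 0 on a compliant file system

# Closing the last descriptor is what finally frees the file's contents.
os.close(fd)
```

&lt;p&gt;A file system that reports 1 here, as GCP Filestore did, leaves nlink stale after unlink.&lt;/p&gt;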

&lt;p&gt;The other one is in the utimensat test set &lt;a href="https://github.com/pjd/pjdfstest/blob/master/tests/utimensat/09.t" rel="noopener noreferrer"&gt;utimensat/09.t&lt;/a&gt;, which corresponds to the following log.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/root/pjdfstest/tests/utimensat/09.t ........
not ok 5 - tried 'lstat pjdfstest_909f188e5492d41a12208f02402f8df6 mtime', expected 4294967296, got 4294967295
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test case requires 64-bit timestamp support. GCP Filestore does support 64-bit timestamps, but the stored value is reduced by 1 (expected 4294967296, i.e. 2^32, but got 4294967295). So this failure should not affect actual use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon EFS
&lt;/h3&gt;

&lt;p&gt;Amazon Elastic File System (EFS) failed 21.49% of the pjdfstest tests, with failure use cases covering almost all categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfnq1k019acway2fdnz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfnq1k019acway2fdnz0.png" alt="Image description" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon EFS supports mounting via NFS, but its support for NFS features is incomplete. For example, EFS does not support block and character devices, which directly caused a large number of pjdfstest cases to fail. Even after excluding these two file types, hundreds of failures remain across different categories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Files
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4qg6kfzowgszdgg6e4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4qg6kfzowgszdgg6e4h.png" alt="Image description" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Files has a failure rate of 62%, indicating that even basic POSIX scenarios may hit compatibility issues. For example, files and folders on Azure Files have default permissions of 0777 with root as the owner, and this cannot be modified, i.e., there are effectively no permission restrictions. Azure Files also does not support hard links or symbolic links.&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JuiceFS passed all test items, performing the best in terms of compatibility.&lt;/li&gt;
&lt;li&gt;Google Filestore came next, failing in only two categories, one of which does not affect actual usage.&lt;/li&gt;
&lt;li&gt;Amazon EFS and Azure Files showed the worst compatibility, with a large number of tests failing, including several test cases with serious security risks, so a security assessment is recommended before use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From &lt;a href="https://juicefs.com/" rel="noopener noreferrer"&gt;Juicedata/JuiceFS&lt;/a&gt; ！ (0ᴗ0✿)&lt;/p&gt;

</description>
      <category>learning</category>
    </item>
  </channel>
</rss>
