<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ken Tune</title>
    <description>The latest articles on Forem by Ken Tune (@kentune).</description>
    <link>https://forem.com/kentune</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F421310%2Fd824a4f1-bfeb-4fae-aefa-30888156ffc1.png</url>
      <title>Forem: Ken Tune</title>
      <link>https://forem.com/kentune</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kentune"/>
    <language>en</language>
    <item>
      <title>Aerospike &amp; IoT using MQTT</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Fri, 11 Nov 2022 16:40:31 +0000</pubDate>
      <link>https://forem.com/aerospike/aerospike-iot-using-mqtt-29on</link>
      <guid>https://forem.com/aerospike/aerospike-iot-using-mqtt-29on</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mqtt.org/"&gt;MQTT&lt;/a&gt; (Message Queuing Telemetry Transport) is a widely used messaging protocol for the Internet of Things (IoT). It is ideal for communicating with small remote devices with limited power and network bandwidth. MQTT is used in a wide variety of industries, such as automotive, manufacturing, telecommunications, oil and gas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aerospike.com/"&gt;Aerospike&lt;/a&gt; is a high performance distributed database, particularly well suited for real time transactional processing. It is aimed at institutions and use-cases that need high throughput (100k tps+), with low latency (95% completion in &amp;lt;1ms), while managing large amounts of data (Tb+) with 100% uptime, scalability and low cost.&lt;/p&gt;

&lt;p&gt;This article, based on example code in the &lt;a href="https://github.com/aerospike-examples/mqtt-aerospike-example"&gt;aerospike-examples/mqtt-aerospike-example&lt;/a&gt; GitHub repository, describes how to achieve end-to-end data flow between a small device and Aerospike, with the data being stored in Aerospike as queryable time series. Although the example is small in scope, the decoupled MQTT architecture and high performance Aerospike database allows the approach to be scaled to accommodate thousands of devices, storing data over a period of years if necessary.&lt;/p&gt;

&lt;p&gt;More specifically, the example simulates the generation of data from an IoT sensor and tracks how that can be sent to a specific topic on an MQTT Broker. The data simulator could &lt;a href="http://www.steves-internet-guide.com/using-arduino-pubsub-mqtt-client"&gt;quite easily be replaced with an actual sensor&lt;/a&gt;, communicating with an MQTT Broker.&lt;/p&gt;

&lt;p&gt;On the receiving side we describe how to subscribe to the above topic and how the data can be serialized to the Aerospike database using our Community &lt;a href="https://github.com/aerospike-community/aerospike-time-series-client"&gt;Time Series Client&lt;/a&gt;, which can also be used to query the data.&lt;/p&gt;

&lt;p&gt;The net result of this is the ability to source data in a scalable fashion from IoT devices and store it as queryable time series data within Aerospike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating the data
&lt;/h2&gt;

&lt;p&gt;The data simulation in the example works as follows. Successive calls to the simulator produce data points, which are &lt;em&gt;(timestamp,value)&lt;/em&gt; pairs. The average time between &lt;em&gt;timestamps&lt;/em&gt; is specified at the outset, as is a &lt;em&gt;percentage variability&lt;/em&gt; in the timestamps, to make the simulation more realistic. The &lt;em&gt;ratio&lt;/em&gt; between successive &lt;em&gt;values&lt;/em&gt; is normally distributed; the mean and variance of this distribution are also specified before the simulation is started. So we have four parameters governing our simulation. In addition, an initial timestamp and value must be specified, and the simulation must be given a name. The simulator constructor reflects this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;TimeSeriesSimulator&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;simulatorName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;startTime&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;initialValue&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;observationIntervalMilliSeconds&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;observationIntervalVariabilityPct&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;dailyDriftPct&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;dailyVolatilityPct&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We obtain successive data points by calling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="nf"&gt;getNextDataPoint&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
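&lt;p&gt;Putting the two calls together, the algorithm described above can be sketched in a self-contained way as follows. This is a simplified illustration - class and method names mirror the repository, but the real &lt;code&gt;TimeSeriesSimulator&lt;/code&gt; differs in detail.&lt;/p&gt;

```java
import java.util.Random;

// Simplified, self-contained sketch of the simulator described above.
// The real TimeSeriesSimulator in the example repository differs in detail.
public class SimulatorSketch {
    // A data point is a (timestamp, value) pair
    public record DataPoint(long timestamp, double value) {}

    public static class SimpleSimulator {
        private final Random random = new Random();
        private final long intervalMs;               // average time between observations
        private final double intervalVariabilityPct; // +/- percentage jitter on the interval
        private final double meanRatio;              // mean of the ratio between successive values
        private final double ratioStdDev;            // standard deviation of that ratio
        private long timestamp;
        private double value;

        public SimpleSimulator(long startTimestamp, double initialValue, long intervalMs,
                               double intervalVariabilityPct, double meanRatio, double ratioStdDev) {
            this.timestamp = startTimestamp;
            this.value = initialValue;
            this.intervalMs = intervalMs;
            this.intervalVariabilityPct = intervalVariabilityPct;
            this.meanRatio = meanRatio;
            this.ratioStdDev = ratioStdDev;
        }

        public DataPoint getNextDataPoint() {
            // Advance the clock by the average interval, jittered by the variability percentage
            double jitter = 1 + (random.nextDouble() * 2 - 1) * intervalVariabilityPct / 100;
            timestamp += (long) (intervalMs * jitter);
            // The ratio between successive values is normally distributed
            value *= meanRatio + random.nextGaussian() * ratioStdDev;
            return new DataPoint(timestamp, value);
        }
    }

    public static void main(String[] args) {
        // Roughly hourly observations, 5% interval jitter, ~2% volatility per step
        SimpleSimulator simulator = new SimpleSimulator(0L, 10000.0, 3_600_000L, 5, 1.0, 0.02);
        for (int i = 0; i < 10; i++) {
            DataPoint dataPoint = simulator.getNextDataPoint();
            System.out.printf("t=%dms value=%.6f%n", dataPoint.timestamp(), dataPoint.value());
        }
    }
}
```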



&lt;p&gt;The following output shows the kind of content we expect to see when simulating a sensor polling approximately hourly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sampling Engine-001-RPM-Sensor at time 2022-10-14 01:00:00.000. Found value 10000.000000. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 02:00:25.920. Found value 10470.777590. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 02:57:30.240. Found value 11123.240496. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 03:57:35.280. Found value 11066.086321. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 04:55:18.840. Found value 10599.837433. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 05:57:19.800. Found value 10268.800822. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 06:56:12.120. Found value 10256.870171. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 07:55:04.800. Found value 10329.697112. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 08:57:12.600. Found value 10307.305881. 
Sampling Engine-001-RPM-Sensor at time 2022-10-14 09:57:15.840. Found value 10436.093769. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sending the data to an MQTT Broker
&lt;/h2&gt;

&lt;p&gt;The MQTT paradigm assumes we have many disparate small devices. In order to collect information, these devices publish to a &lt;em&gt;topic&lt;/em&gt; on an MQTT Broker. You can think of a broker as a centralized depot for the receipt and distribution of messages, which provides for scalability. &lt;em&gt;Topics&lt;/em&gt; allow the messages to be separated into distinct collections. &lt;em&gt;Subscribers&lt;/em&gt; can independently subscribe to a topic and receive updates to it via push notifications.&lt;/p&gt;

&lt;p&gt;The following code shows the signature of a &lt;em&gt;Sensor Observer&lt;/em&gt; object. We provide a simulator to watch, a topic to publish to, and parameters governing the frequency and number of observations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;RunnableMQTTSensorObserver&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
 &lt;span class="nc"&gt;ITimeSeriesSimulator&lt;/span&gt; &lt;span class="n"&gt;timeSeriesSimulator&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
 &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;millisecondsBetweenObservations&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;observationCount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
 &lt;span class="nc"&gt;MqttTopic&lt;/span&gt; &lt;span class="n"&gt;publicationTopic&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MQTT publication topic is obtained by connecting to a networked resource, &lt;code&gt;MQTT_BROKER_URL&lt;/code&gt;, using a publisher id &lt;code&gt;MQTT_PUBLISHER_ID&lt;/code&gt;. To keep things simple, in this example we use the &lt;em&gt;public&lt;/em&gt; MQTT server &lt;code&gt;tcp://test.mosquitto.org:1883&lt;/code&gt;. This is an open resource and can be used by anybody. No special setup is required, but your data is potentially public. For this example that is not an issue, but you will ultimately need your own broker to take things further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;IMqttClient&lt;/span&gt; &lt;span class="n"&gt;mqttPublisher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MqttClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;MQTT_BROKER_URL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;MQTT_PUBLISHER_ID&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;mqttPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;connect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;standardMqttConnectOptions&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;MqttTopic&lt;/span&gt; &lt;span class="n"&gt;mqttTopic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mqttPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;MQTT_TOPIC_NAME&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are using the &lt;a href="https://www.eclipse.org/paho/"&gt;Eclipse Paho&lt;/a&gt; implementation of the MQTT API.&lt;/p&gt;
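&lt;p&gt;The &lt;code&gt;standardMqttConnectOptions()&lt;/code&gt; helper is not shown above. A typical Paho configuration might look like the sketch below - the specific option values are assumptions for illustration; see the repository for the settings actually used.&lt;/p&gt;

```java
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

// Illustrative connect options only - the values below are assumptions,
// not necessarily those used in the example repository.
public class MqttOptionsSketch {
    public static MqttConnectOptions standardMqttConnectOptions() {
        MqttConnectOptions options = new MqttConnectOptions();
        options.setAutomaticReconnect(true); // reconnect if the broker connection drops
        options.setCleanSession(true);       // do not resume state from previous sessions
        options.setConnectionTimeout(10);    // seconds to wait for the initial connection
        return options;
    }
}
```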

&lt;p&gt;When the observer is run, the following code is executed &lt;code&gt;observationCount&lt;/code&gt; times, each time resulting in the data point being sent to the publication topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeSeriesSimulator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCurrentDataPoint&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MQTTUtilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;encodeForMQTT&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;timeSeriesSimulator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSimulatorName&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;MqttMessage&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MqttMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;publicationTopic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first line, we obtain a data point from the simulator.&lt;/p&gt;

&lt;p&gt;In the second line, we encode the data point so it can be sent as a message. The encoding function has the following signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;encodeForMQTT&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses a very simple colon-separated serialization - &lt;code&gt;timeSeriesName:dataPoint.getTimestamp():dataPoint.getValue()&lt;/code&gt;. See the function &lt;code&gt;MQTTUtilities.encodeForMQTT&lt;/code&gt; in the &lt;a href="https://github.com/aerospike-examples/mqtt-aerospike-example"&gt;aerospike-examples/mqtt-aerospike-example&lt;/a&gt; repository for full details.&lt;/p&gt;

&lt;p&gt;In the third line we construct an &lt;code&gt;MqttMessage&lt;/code&gt; and finally, in the fourth line, publish it to the publication topic.&lt;/p&gt;
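&lt;p&gt;The colon-separated encoding and its inverse can be sketched as follows. This is a simplified stand-in - see &lt;code&gt;MQTTUtilities&lt;/code&gt; in the repository for the actual implementation.&lt;/p&gt;

```java
// Simplified sketch of the colon-separated encoding described above.
// The real MQTTUtilities implementation may differ in detail.
public class EncodingSketch {
    public static String encodeForMQTT(String timeSeriesName, long timestamp, double value) {
        return timeSeriesName + ":" + timestamp + ":" + value;
    }

    public static String timeSeriesNameFromMessage(String message) {
        return message.split(":")[0];
    }

    public static long timestampFromMessage(String message) {
        return Long.parseLong(message.split(":")[1]);
    }

    public static double valueFromMessage(String message) {
        return Double.parseDouble(message.split(":")[2]);
    }

    public static void main(String[] args) {
        String encoded = encodeForMQTT("Engine-001-RPM-Sensor", 1668184831000L, 10000.0);
        System.out.println(encoded); // Engine-001-RPM-Sensor:1668184831000:10000.0
    }
}
```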

&lt;h2&gt;
  
  
  Subscribing to an MQTT Broker
&lt;/h2&gt;

&lt;p&gt;As in the section above, we connect to the MQTT Broker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;IMqttClient&lt;/span&gt; &lt;span class="n"&gt;mqttSubscriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MqttClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;MQTT_BROKER_URL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;MQTT_SUBSCRIBER_ID&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;mqttSubscriber&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;connect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;standardMqttConnectOptions&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also create a listener object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;IMqttMessageListener&lt;/span&gt; &lt;span class="n"&gt;mqttDataListener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MQTTAerospikeDataPersister&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asTimeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The listener class implements the &lt;code&gt;IMqttMessageListener&lt;/code&gt; interface, which consists of a single method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;messageArrived&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MqttMessage&lt;/span&gt; &lt;span class="n"&gt;mqttMessage&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that our implementation of &lt;code&gt;IMqttMessageListener&lt;/code&gt;, &lt;code&gt;MQTTAerospikeDataPersister&lt;/code&gt;, requires an Aerospike Time Series Client when constructed. Now, we subscribe to the topic using the listener object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;mqttSubscriber&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;subscribe&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;MQTT_TOPIC_NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mqttDataListener&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Inside the messageArrived function
&lt;/h2&gt;

&lt;p&gt;Whenever a message is received, the &lt;code&gt;messageArrived&lt;/code&gt; function of the listener is invoked. The following is our implementation of that function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;mqttMessageAsString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mqttMessage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPayload&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;Constants&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MQTT_DEFAULT_CHARSET&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MQTTUtilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeSeriesNameFromMQTTMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mqttMessageAsString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MQTTUtilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dataPointFromMQTTMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mqttMessageAsString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First we obtain the message as a string. In lines 2 and 3 we extract the time series name and data point (i.e. the timestamp and value). Finally we add the value to the Aerospike database using the &lt;code&gt;put&lt;/code&gt; call of the &lt;code&gt;timeSeriesClient&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the demonstration
&lt;/h2&gt;

&lt;p&gt;Get the source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/aerospike-examples/mqtt-aerospike-example.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example requires an Aerospike database accessible via the &lt;em&gt;localhost&lt;/em&gt; address, listening on port 3000. These values can be altered in the code using the &lt;code&gt;MQTTPersistenceDemo.AEROSPIKE_SEED_HOST&lt;/code&gt; and &lt;code&gt;MQTTPersistenceDemo.AEROSPIKE_SERVICE_PORT&lt;/code&gt; parameters. The easiest way to obtain Aerospike is to install Docker Desktop and run an Aerospike Community container, e.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; aerospike aerospike/aerospike-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run &lt;code&gt;MQTTPersistenceDemo.main()&lt;/code&gt; in your favorite IDE or build at the command line from the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mvn clean compile assembly:single
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the demonstration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-jar&lt;/span&gt; target/aerospike-mqtt-example-1.0-SNAPSHOT-jar-with-dependencies.jar 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your output should be similar to &lt;a href="https://github.com/aerospike-examples/mqtt-aerospike-example/tree/main/resources/sample-output.txt"&gt;this sample output&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying the data from Aerospike using the Community Time Series Client
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;MQTTPersistenceDemo.main&lt;/code&gt; validates the end-to-end pipeline by requesting the data for our time series - &lt;em&gt;Engine-001-RPM-Sensor&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;MQTTUtilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printTimeSeries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asTimeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="no"&gt;SENSOR_NAME&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The body of the above function is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get the basic time series details&lt;/span&gt;
&lt;span class="nc"&gt;TimeSeriesInfo&lt;/span&gt; &lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TimeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTimeSeriesDetails&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// and output them&lt;/span&gt;
&lt;span class="n"&gt;outputMessageWithPara&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="c1"&gt;// use the time series client to get all the available points for our series with name timeSeriesName&lt;/span&gt;
&lt;span class="c1"&gt;// We use the timeSeriesInfo object to get the start and end date times for the series &lt;/span&gt;
&lt;span class="c1"&gt;// so we can request all the points available&lt;/span&gt;
&lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;dataPoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getPoints&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStartDateTimestamp&lt;/span&gt;&lt;span class="o"&gt;()),&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEndDateTimestamp&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
&lt;span class="c1"&gt;// Header for the output&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Timestamp,Value"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// For each point print out t formatted version of the point&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dataPoints&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;outputMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s,%.6f"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataPointDateToString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getValue&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typical output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retrieving data for time series Engine-001-RPM-Sensor from Aerospike database:

Name : Engine-001-RPM-Sensor Start Date : 2022-10-15 01:00:00 End Date 2022-10-15 09:52:44 Data point count : 10

Timestamp,Value
2022-10-15 01:00:00.000,10000.000000
2022-10-15 01:58:54.480,10197.212074
2022-10-15 02:57:50.040,10579.313417
2022-10-15 03:59:18.240,10025.330483
2022-10-15 04:56:36.600,10013.730374
2022-10-15 05:56:40.920,10188.447442
2022-10-15 06:58:32.880,10145.885126
2022-10-15 07:55:53.400,10350.374583
2022-10-15 08:54:05.400,10533.135383
2022-10-15 09:52:44.040,10326.813161
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you scroll back to the beginning of the article, you will see this is exactly the data initially emitted by our mock sensor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This example shows how the &lt;a href="https://aerospike.com/"&gt;Aerospike&lt;/a&gt; database can be used, simply and at scale, to store industrial time series data made available by the &lt;a href="https://mqtt.org/"&gt;MQTT&lt;/a&gt; ecosystem. Aerospike plus its Community &lt;a href="https://github.com/aerospike-community/aerospike-time-series-client"&gt;Time Series Client&lt;/a&gt; streamlines the storage and retrieval of the data, supporting both writes and reads of millions of data points per second if required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Directions
&lt;/h2&gt;

&lt;p&gt;This demonstration could easily be scaled to show data being harvested from multiple sensors in parallel and saved to Aerospike. It would also be interesting to replace the simulation with an actual device - something &lt;a href="http://www.steves-internet-guide.com/using-arduino-pubsub-mqtt-client"&gt;Arduino-based&lt;/a&gt;, for example.&lt;/p&gt;

</description>
      <category>iot</category>
      <category>mqtt</category>
    </item>
    <item>
      <title>Aerospike Time Series API</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Wed, 20 Apr 2022 09:22:06 +0000</pubDate>
      <link>https://forem.com/aerospike/aerospike-time-series-api-2ii2</link>
      <guid>https://forem.com/aerospike/aerospike-time-series-api-2ii2</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pam8nla4sa8cdf9kxej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pam8nla4sa8cdf9kxej.png" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Aerospike is a high performance distributed database, particularly well suited for real time transactional processing. It is aimed at institutions and use-cases that need high throughput (100k tps+), with low latency (95% completion in &amp;lt;1ms), while managing large amounts of data (TB+) with 100% uptime, scalability and low cost.&lt;/p&gt;

&lt;p&gt;Conceptually, Aerospike is most readily categorised as a key value database. In reality, however, it has a number of bespoke features that make it capable of supporting a much wider set of use cases. A good example is our &lt;a href="https://aerospike.com/blog/aerospike-document-api/"&gt;document API&lt;/a&gt; which builds on our &lt;a href="https://docs.aerospike.com/guide/data-types/cdt"&gt;collection data types&lt;/a&gt; in order to provide &lt;a href="https://goessner.net/articles/JsonPath/"&gt;JsonPath&lt;/a&gt; support for documents.&lt;/p&gt;

&lt;p&gt;Another general use case we can consider is support for time series. The combination of &lt;a href="https://docs.aerospike.com/architecture/storage#ssdflash"&gt;buffered writes&lt;/a&gt; and efficient &lt;a href="https://docs.aerospike.com/guide/data-types/cdt-map"&gt;map operations&lt;/a&gt; allows us to optimise for both read and write of time series data. The &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client"&gt;Aerospike Time Series API&lt;/a&gt; leverages these features to provide a general purpose interface for efficient reading and writing of time series data at scale. Also included is a &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#benchmarking"&gt;benchmarking&lt;/a&gt; tool allowing performance to be measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Series Data
&lt;/h2&gt;

&lt;p&gt;Time series data can be thought of as a sequence of observations associated with a given property of a single subject. An observation is a quantity comprising two elements - a timestamp and a value. A property is a measurable attribute such as speed, temperature, pressure or price. We can see then that examples of time series might be the speed of a given vehicle; temperature readings at a fixed location; pressures recorded by an industrial sensor or the price of a stock on a given exchange. In each case the series consists of the evolution of these properties over time.&lt;/p&gt;
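&lt;p&gt;An observation can therefore be modelled directly as a (timestamp, value) pair. The following is a minimal sketch of such a model - the actual &lt;code&gt;DataPoint&lt;/code&gt; class in the Time Series API may carry additional detail.&lt;/p&gt;

```java
// Minimal model of a time series observation - a (timestamp, value) pair.
// The DataPoint class in the Aerospike Time Series API may differ in detail.
public record Observation(long timestampMillis, double value) {
    public static void main(String[] args) {
        // e.g. an RPM reading taken at a given epoch millisecond
        Observation rpmReading = new Observation(1668184831000L, 10000.0);
        System.out.println(rpmReading);
    }
}
```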

&lt;p&gt;A time series API in its most basic form needs to consist of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A function allowing the writing of time series observations&lt;/li&gt;
&lt;li&gt;A function allowing the retrieval of time series observations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additional conveniences might include&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The ability to write data in bulk (batch writes)&lt;/li&gt;
&lt;li&gt;The ability to query the data, e.g. calculating the average, maximum or minimum&lt;/li&gt;
&lt;/ol&gt;
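&lt;p&gt;Aggregates of this kind can, in the simplest case, be computed client-side from a series of retrieved observation values, as the sketch below shows. This is illustrative only, not the Time Series API's own query mechanism.&lt;/p&gt;

```java
import java.util.stream.DoubleStream;

// Client-side aggregate sketch over a series of observed values.
// Illustrative only - not the Time Series API's own query mechanism.
public class AggregateSketch {
    public static double average(double[] values) {
        return DoubleStream.of(values).average().orElse(Double.NaN);
    }

    public static double max(double[] values) {
        return DoubleStream.of(values).max().orElse(Double.NaN);
    }

    public static double min(double[] values) {
        return DoubleStream.of(values).min().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] values = {10000.0, 10470.8, 11123.2};
        System.out.printf("avg=%.1f max=%.1f min=%.1f%n",
            average(values), max(values), min(values));
    }
}
```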
&lt;h2&gt;
  
  
  Aerospike Time Series API
&lt;/h2&gt;

&lt;p&gt;The Aerospike Time Series API provides the above via the TimeSeriesClient object. The API is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Store a single data point for a named time series&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Store a batch of data points for a named time series&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;dataPoints&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Retrieve all data points observed between startDateTime and endDateTime for a named time series&lt;/span&gt;
&lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="nf"&gt;getPoints&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;startDateTime&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;endDateTime&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Retrieve the observation made at time dateTime for a named time series&lt;/span&gt;
&lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="nf"&gt;getPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;dateTime&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Execute TimeSeriesClient.QueryOperation versus the observations recorded for a named time series&lt;/span&gt;
&lt;span class="c1"&gt;// recorded between startDateTime and endDateTime&lt;/span&gt;
&lt;span class="c1"&gt;// The operations may be any of COUNT, AVG, MAX, MIN or VOL (volatility)&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;runQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;QueryOperation&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;fromDateTime&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;toDateTime&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A DataPoint is a simple object representing an observation and the time at which it was made, constructed as follows. The Java Date timestamp allows times to be specified to millisecond accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;dateTime&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Simple Example
&lt;/h2&gt;

&lt;p&gt;The code example below shows us inserting a series of 24 temperature readings, taken in Trafalgar Square, London, on the 14th February 2022. We give the time series a meaningful and precise name by concatenating subject, property and units.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Let's store some temperature readings taken in Trafalgar Square, London. Readings are Centigrade.&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;timeSeriesName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"TrafalgarSquare-Temperature-Centigrade"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// The readings were taken on the 14th Feb, 2022&lt;/span&gt;
&lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;observationDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleDateFormat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"yyyy-MM-dd"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2022-02-14"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// ... and here they are&lt;/span&gt;
&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;[]{&lt;/span&gt;&lt;span class="mf"&gt;2.7&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.9&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.8&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.8&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.7&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.7&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;5.4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;6.3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.7&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.9&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;9.9&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;9.3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
               &lt;span class="mf"&gt;9.6&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;9.7&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;6.8&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;5.5&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;5.4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.2&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// To store, create a time series client object. Requires AerospikeClient object and Aerospike namespace name&lt;/span&gt;
&lt;span class="c1"&gt;// new TimeSeriesClient(AerospikeClient asClient, String asNamespaceName)&lt;/span&gt;
&lt;span class="nc"&gt;TimeSeriesClient&lt;/span&gt; &lt;span class="n"&gt;timeSeriesClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TimeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asClient&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;asNamespaceName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Insert our hourly temperature readings&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
  &lt;span class="c1"&gt;// The datapoint consists of the base date + the required number of hours&lt;/span&gt;
  &lt;span class="nc"&gt;DataPoint&lt;/span&gt; &lt;span class="n"&gt;dataPoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;Utilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;incrementDateUsingSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observationDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt;
  &lt;span class="c1"&gt;// Which we then 'put'&lt;/span&gt;
  &lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dataPoint&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a diagnostic, we can get some basic information about the time series&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;TimeSeriesInfo&lt;/span&gt; &lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TimeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTimeSeriesDetails&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will give&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Name : TrafalgarSquare-Temperature-Centigrade Start Date : 2022-02-14 00:00:00.000 End Date 2022-02-14 23:00:00.000 Data point count : 24

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another diagnostic allows the time series to be printed to the command line&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printTimeSeries&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gives&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timestamp,Value
2022-02-14 00:00:00.000,2.70000
2022-02-14 01:00:00.000,2.30000
2022-02-14 02:00:00.000,1.90000
...
2022-02-14 22:00:00.000,4.30000
2022-02-14 23:00:00.000,4.20000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we can run a basic query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Maximum temperature is %.3f"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;runQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="nc"&gt;TimeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;QueryOperation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MAX&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStartDateTime&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;timeSeriesInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEndDateTime&lt;/span&gt;&lt;span class="o"&gt;())));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This gives&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Maximum temperature is 9.900
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we could alternatively have used the batch put operation, which 'puts' all the points in a single call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create an array of DataPoints&lt;/span&gt;
&lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;dataPoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
&lt;span class="c1"&gt;// Add our observations to the array&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// The datapoint consists of the base date + the required number of hours&lt;/span&gt;
  &lt;span class="n"&gt;dataPoints&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;Utilities&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;incrementDateUsingSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observationDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;hourlyTemperatureObservations&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Put the points in a single call&lt;/span&gt;
&lt;span class="n"&gt;timeSeriesClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeSeriesName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dataPoints&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;There are two key implementation concepts to grasp. Firstly, rather than storing each data point as a separate object, data points are inserted into Aerospike maps. This minimises network traffic at write time (we only 'send' the new point) and means large numbers of points can be retrieved as a single object at read time. It also helps minimise memory usage, as Aerospike has a fixed (64 byte) cost for each object. Schematically, each time series object looks something like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    timestamp001 : value001,
    timestamp002 : value002,
    ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The maps must not be allowed to grow indefinitely, so the API ensures that each map will not grow beyond a specified maximum size. By default this limit is 1000 points, although it can be altered (see &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#additional-control"&gt;additional control&lt;/a&gt;). The README also discusses the &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#sizing"&gt;sizing&lt;/a&gt; and &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#performance"&gt;performance&lt;/a&gt; considerations associated with this setting.&lt;/p&gt;
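&lt;p&gt;The capped map behaviour can be sketched in plain Java. This is an illustration of the idea only, not the API's actual internals - the class and field names are my own:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the capped-block idea: points accumulate in a sorted map
// until the current block reaches its maximum size, at which point a
// fresh block is started. Hypothetical names, not the API's internals.
public class CappedBlockSketch {
    static final int MAX_ENTRIES_PER_BLOCK = 1000; // default block size limit

    final List<TreeMap<Long, Double>> blocks = new ArrayList<>();

    void put(long timestamp, double value) {
        if (blocks.isEmpty()
                || blocks.get(blocks.size() - 1).size() >= MAX_ENTRIES_PER_BLOCK) {
            blocks.add(new TreeMap<>()); // current block full - start a new one
        }
        blocks.get(blocks.size() - 1).put(timestamp, value);
    }

    public static void main(String[] args) {
        CappedBlockSketch series = new CappedBlockSketch();
        // 2500 points => two full blocks of 1000 plus one block of 500
        for (int i = 0; i < 2500; i++) series.put(i * 3600L, 20.0 + i);
        System.out.println(series.blocks.size()); // 3
    }
}
```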

&lt;p&gt;The second implementation point follows on from the first. As there is a limit to the number of points that can be stored in a block, we need to have some mechanism for creating new blocks and keeping track of existing blocks for each time series. This is done, on a per time series basis, by maintaining an index of all blocks created. Conceptually this looks something like the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    TimeSeriesName : "MyTimeSeries",
  ListOfDataBlocks : {
        StartTimeForBlock1 : {EndTime: &amp;lt;lastTimeStampForBlock1&amp;gt;, EntryCount: &amp;lt;entriesInBlock1&amp;gt;},
        StartTimeForBlock1 : {EndTime: &amp;lt;lastTimeStampForBlock1&amp;gt;, EntryCount: &amp;lt;entriesInBlock1&amp;gt;},
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
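&lt;p&gt;The index makes locating the block for a given timestamp cheap - the candidate block is the one with the greatest start time not exceeding that timestamp. A sketch of the lookup idea (again hypothetical, not the API's internals):&lt;/p&gt;

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of per-series block lookup. Blocks are keyed by start timestamp,
// so the block that may contain a given timestamp is the one with the
// greatest start time <= the timestamp. Hypothetical names only.
public class BlockIndexSketch {
    // blockStartTime -> blockEndTime
    final TreeMap<Long, Long> index = new TreeMap<>();

    void registerBlock(long startTime, long endTime) {
        index.put(startTime, endTime);
    }

    // Return the start time of the block containing the timestamp, or -1 if none.
    long findBlock(long timestamp) {
        Map.Entry<Long, Long> candidate = index.floorEntry(timestamp);
        if (candidate == null || timestamp > candidate.getValue()) return -1;
        return candidate.getKey();
    }

    public static void main(String[] args) {
        BlockIndexSketch ix = new BlockIndexSketch();
        ix.registerBlock(0L, 999L);
        ix.registerBlock(1000L, 1999L);
        System.out.println(ix.findBlock(1500L)); // 1000
    }
}
```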



&lt;h2&gt;
  
  
  Benchmarking
&lt;/h2&gt;

&lt;p&gt;The Time Series API ships with a benchmarking tool. Three modes of operation are provided - real time insert, batch insert and query. For details of how to download and run see the &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#benchmarking"&gt;benchmarking&lt;/a&gt; section of the README.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Time Benchmarking
&lt;/h3&gt;

&lt;p&gt;As a simple example, let's insert 10 seconds of data for a single time series, with observations being made once per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./timeSeriesBenchmarker.sh &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;AEROSPIKE_HOST_IP&amp;gt;  &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;AEROSPIKE_NAMESPACE&amp;gt; &lt;span class="nt"&gt;-m&lt;/span&gt; realTimeWrite &lt;span class="nt"&gt;-p&lt;/span&gt; 1 &lt;span class="nt"&gt;-c&lt;/span&gt; 1 &lt;span class="nt"&gt;-d&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Aerospike Time Series Benchmarker running in real time insert mode

Updates per second : 1.000
Updates per second per time series : 1.000

Run time : 0 sec, Update count : 1, Current updates/sec : 1.029, Cumulative updates/sec : 1.027
Run time : 1 sec, Update count : 2, Current updates/sec : 1.000, Cumulative updates/sec : 1.013
Run time : 2 sec, Update count : 2, Current updates/sec : 0.000, Cumulative updates/sec : 0.672
...
Run time : 8 sec, Update count : 9, Current updates/sec : 1.000, Cumulative updates/sec : 1.003
Run time : 9 sec, Update count : 10, Current updates/sec : 1.000, Cumulative updates/sec : 1.003

Run Summary

Run time : 10 sec, Update count : 10, Cumulative updates/sec : 0.997

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use another utility, ./timeSeriesReader.sh, to view the output. This can be run for a named time series or, alternatively, will select a time series at random.&lt;/p&gt;

&lt;p&gt;Here is sample output for our simple example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./timeSeriesReader.sh -h &amp;lt;AEROSPIKE_HOST_IP&amp;gt;  -n &amp;lt;AEROSPIKE_NAMESPACE&amp;gt;

Running TimeSeriesReader

No time series specified - selecting series AFNJFKSKDV

Name : AFNJFKSKDV Start Date : 2022-02-22 12:17:13.294 End Date 2022-02-22 12:17:23.185 Data point count : 11

Timestamp,Value
2022-02-22 12:17:13.294,97.37854
2022-02-22 12:17:14.247,97.34929
2022-02-22 12:17:15.263,97.33103
...
2022-02-22 12:17:22.212,97.31197
2022-02-22 12:17:23.185,97.29315

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that sample points were generated over a ten second period, with the series given a random name.&lt;/p&gt;

&lt;p&gt;The benchmarker can be run at greater scale using the -c (time series count) flag. You may also wish to make use of the -z (thread count) flag in order to achieve the required throughput; the benchmarker will warn you if the required throughput is not being achieved.&lt;/p&gt;

&lt;p&gt;Another real time option is acceleration via the -a flag, which runs the simulation at an accelerated rate. For instance, if you wished to insert points every 30 seconds over a 1 hour period (120 points), you could shorten the run by supplying '-a 30'. This speeds the simulation up by a factor of 30, so it takes only 120 seconds; higher factors are also possible. The benchmarker will indicate the actual update rates. For example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./timeSeriesBenchmarker.sh -h &amp;lt;AEROSPIKE_HOST&amp;gt;  -n &amp;lt;AEROSPIKE_NAMESPACE&amp;gt; -m realTimeWrite -c 5 -p 10 -a 10 -d 10
Aerospike Time Series Benchmarker running in real time insert mode

Updates per second : 5.000
Updates per second per time series : 1.000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
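&lt;p&gt;The effect of the -a flag is straightforward arithmetic. Here is a sketch of the example above (the method names are my own, not part of the benchmarker):&lt;/p&gt;

```java
// Arithmetic behind the -a (acceleration) flag: the simulated period is
// compressed by the acceleration factor, so the run finishes sooner while
// emitting the same number of points.
public class AccelerationSketch {
    static long pointCount(long simulatedPeriodSeconds, int observationIntervalSeconds) {
        return simulatedPeriodSeconds / observationIntervalSeconds;
    }

    static long wallClockSeconds(long simulatedPeriodSeconds, int accelerationFactor) {
        return simulatedPeriodSeconds / accelerationFactor;
    }

    public static void main(String[] args) {
        // 1 hour of 30-second observations, run with -a 30
        System.out.println(pointCount(3600, 30));       // 120 points
        System.out.println(wallClockSeconds(3600, 30)); // completes in 120 seconds
    }
}
```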



&lt;h3&gt;
  
  
  Batch Insertion
&lt;/h3&gt;

&lt;p&gt;A disadvantage of the 'real time' benchmarker is precisely that - the loading occurs in real time. You may wish to build your sample time series as quickly as possible. The batch insert mode is provided for this purpose.&lt;/p&gt;

&lt;p&gt;In this mode, data points are loaded a block at a time - effectively as fast as the benchmarker will run. The invocation below, for example, will create 10 sample series (-c flag) over a period of 1 year (-r flag), with 30 seconds between each observation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./timeSeriesBenchmarker.sh -h &amp;lt;AEROSPIKE_HOST_IP&amp;gt;  -n &amp;lt;AEROSPIKE_NAMESPACE&amp;gt;  -m batchInsert -c 10 -p 30 -r 1Y 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here is a run at greater scale, together with its output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./timeSeriesBenchmarker.sh -h $HOST  -n test  -m batchInsert -c 1000 -p 30 -r 1Y -z 100 

Aerospike Time Series Benchmarker running in batch insert mode

Inserting 1051200 records per series for 1000 series, over a period of 31536000 seconds

Run time : 0 sec, Data point insert count : 0, Effective updates/sec : 0.000. Pct complete 0.000%
Run time : 1 sec, Data point insert count : 1046000, Effective updates/sec : 870216.306. Pct complete 0.100%
Run time : 2 sec, Data point insert count : 2568000, Effective updates/sec : 1146363.231. Pct complete 0.244%
Run time : 3 sec, Data point insert count : 4196000, Effective updates/sec : 1308796.007. Pct complete 0.399%
Run time : 4 sec, Data point insert count : 5806000, Effective updates/sec : 1372576.832. Pct complete 0.552%
...
Run time : 577 sec, Data point insert count : 1051077000, Effective updates/sec : 1820986.414. Pct complete 99.988%
Run time : 578 sec, Data point insert count : 1051158000, Effective updates/sec : 1817977.108. Pct complete 99.996%

Run Summary

Run time : 578 sec, Data point insert count : 1051200000, Effective updates/sec : 1816538.588. Pct complete 100.000%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
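&lt;p&gt;The record counts in this output are easy to check - a year contains 31,536,000 seconds, so 30 second intervals give 1,051,200 points per series, and just over a billion points across 1000 series. A quick sketch of the arithmetic:&lt;/p&gt;

```java
// Verifying the volumes reported by the batch insert benchmarker run.
public class BatchVolumeSketch {
    static long pointsPerSeries(long periodSeconds, int intervalSeconds) {
        return periodSeconds / intervalSeconds;
    }

    public static void main(String[] args) {
        long secondsPerYear = 365L * 24 * 3600;               // 31536000, as reported
        long perSeries = pointsPerSeries(secondsPerYear, 30); // -p 30 => 1051200 records per series
        long total = perSeries * 1000;                        // -c 1000 => 1051200000 data points
        System.out.println(perSeries + " per series, " + total + " in total");
    }
}
```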



&lt;h3&gt;
  
  
  Query Benchmarking
&lt;/h3&gt;

&lt;p&gt;Having two different methods for generating data now puts us in the position where we can consider query benchmarking. This is the third and final aspect of the benchmarking toolkit.&lt;/p&gt;

&lt;p&gt;Query benchmarking can be invoked via the 'query' mode. We choose how long to run the benchmarker for (-d flag) and the number of threads to use (-z flag).&lt;/p&gt;

&lt;p&gt;At runtime, the benchmarker scans the database to determine the time series available. Each iteration selects a series at random and calculates its average value. This necessitates pulling all the data points for the series to the client side and performing the calculation there, so it is a good test of the query capability. We can ensure the queries are consistent in terms of data point count by using the batch insert mode of the benchmarker, which gives all series the same number of data points.&lt;/p&gt;

&lt;p&gt;Sample invocation and output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./timeSeriesBenchmarker.sh -h $HOST -n test -m query -z 1 -d 120 

Aerospike Time Series Benchmarker running in query mode

Time series count : 1000, Average data point count per query 1051200

Run time : 0 sec, Query count : 0, Current queries/sec 0.000, Current latency 0.000s, Avg latency 0.000s, Cumulative queries/sec 0.000
Run time : 1 sec, Query count : 1, Current queries/sec 1.003, Current latency 0.604s, Avg latency 0.604s, Cumulative queries/sec 0.999
Run time : 2 sec, Query count : 3, Current queries/sec 2.002, Current latency 0.585s, Avg latency 0.591s, Cumulative queries/sec 1.499
Run time : 3 sec, Query count : 5, Current queries/sec 2.000, Current latency 0.515s, Avg latency 0.561s, Cumulative queries/sec 1.666
Run time : 4 sec, Query count : 7, Current queries/sec 2.000, Current latency 0.583s, Avg latency 0.567s, Cumulative queries/sec 1.750
...
Run time : 120 sec, Query count : 241, Current queries/sec 2.000, Current latency 0.471s, Avg latency 0.496s, Cumulative queries/sec 2.008

Run Summary

Run time : 120 sec, Query count : 242, Cumulative queries/sec 2.016, Avg latency 0.496s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Simulation
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client"&gt;Aerospike Time Series API&lt;/a&gt; contains a realistic simulator, which the benchmarker makes use of.&lt;/p&gt;

&lt;p&gt;Many time series, over short periods at least, follow a &lt;a href="https://en.wikipedia.org/wiki/Brownian_motion"&gt;Brownian motion&lt;/a&gt;, and the &lt;em&gt;TimeSeriesSimulator&lt;/em&gt; allows this to be simulated. The idea is that if we look at the &lt;em&gt;relative change&lt;/em&gt; in the observed value, the &lt;em&gt;expected&lt;/em&gt; mean change is proportional to the time between observations, and the &lt;em&gt;expected variance&lt;/em&gt; is similarly proportional to that period. Formally, let X(τ) be the observation of the subject property X at time τ, and let its value after a further time t be X(τ+t). The simulation distributes (X(τ+t) - X(τ)) / X(τ), i.e. the relative change in X, as a normal distribution with mean μt and variance σ&lt;sup&gt;2&lt;/sup&gt;t.&lt;/p&gt;

&lt;center&gt;(X(τ + t) - X(τ)) / X(τ) ~ N(μt, σ&lt;sup&gt;2&lt;/sup&gt;t)&lt;/center&gt;  
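&lt;p&gt;A minimal simulator along these lines might look as follows. This is my own illustrative sketch, not the &lt;em&gt;TimeSeriesSimulator&lt;/em&gt; source - each step applies a relative change drawn from a normal distribution with mean μ·dt and variance σ&lt;sup&gt;2&lt;/sup&gt;·dt:&lt;/p&gt;

```java
import java.util.Random;

// Minimal Brownian-motion style simulator (an illustration only, not the
// actual TimeSeriesSimulator source). Each observation applies a relative
// change drawn from N(mu * dt, sigma^2 * dt) to the previous value.
public class BrownianSketch {
    static double[] simulate(double initialValue, double annualDrift, double annualVolatility,
                             double intervalSeconds, int steps, long seed) {
        Random random = new Random(seed);
        double dt = intervalSeconds / (365.0 * 24 * 3600); // interval as a fraction of a year
        double[] series = new double[steps + 1];
        series[0] = initialValue;
        for (int i = 1; i <= steps; i++) {
            double relativeChange =
                annualDrift * dt + annualVolatility * Math.sqrt(dt) * random.nextGaussian();
            series[i] = series[i - 1] * (1 + relativeChange);
        }
        return series;
    }

    public static void main(String[] args) {
        // 120 observations at 30 second intervals, 10% annual drift, 20% annual volatility
        double[] series = simulate(100.0, 0.1, 0.2, 30, 120, 42L);
        System.out.println(series.length); // 121 values including the starting point
    }
}
```

&lt;p&gt;Setting the volatility to zero leaves pure drift, which is a useful sanity check on the implementation.&lt;/p&gt;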
  

&lt;p&gt;More detail is available in the &lt;a href="https://github.com/aerospike-examples/aerospike-time-series-client#simulation"&gt;simulation&lt;/a&gt; section of the README, but it is useful to see that the net effect of the above is to produce sample series such as the one shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmsr5agwbnlv3a7qmd5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmsr5agwbnlv3a7qmd5r.png" alt="Image description" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see it looks very much like the sort of graph we might see for a stock price.&lt;/p&gt;

&lt;p&gt;More complex time series e.g. those seen for temperatures might be simulated by concatenating several series together, with different drifts and volatilities, allowing values to trend both up and down. Mean reverting series can be simulated by setting the drift to zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Life Performance
&lt;/h2&gt;

&lt;p&gt;As a test, performance was examined on an Aerospike cluster deployed on 3 i3en.2xlarge AWS instances. This instance type was selected because the &lt;a href="https://docs.aerospike.com/operations/plan/ssd/ssd_certification"&gt;ACT&lt;/a&gt; rating of its drives is 300k, making the arithmetic simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writes
&lt;/h3&gt;

&lt;p&gt;In simple terms, this cluster can then sustain 100k writes/sec (see Performance Considerations) * 1.5KB * 3 (number of instances) = 450MB/sec of write throughput.&lt;/p&gt;

&lt;p&gt;We know our average write is ~8KB, and we assume replication factor 2 for resilience purposes. Sustainable updates per second is then 450MB / 2 (replication factor) / 8KB ≈ 28,000.&lt;/p&gt;
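&lt;p&gt;Laying the arithmetic out explicitly (a sketch using the figures above; the ~28,000 in the text is this value rounded down):&lt;/p&gt;

```java
// Write throughput arithmetic for the 3 node i3en.2xlarge cluster.
public class ThroughputSketch {
    // Sustainable updates/sec =
    //   writes/sec/node * write size (KB) * nodes / replication factor / update size (KB)
    static long sustainableUpdatesPerSec(long writesPerSecPerNode, double actWriteKb,
                                         int nodes, int replicationFactor, double updateSizeKb) {
        double clusterKbPerSec = writesPerSecPerNode * actWriteKb * nodes; // 450,000 KB/s = 450MB/s
        return Math.round(clusterKbPerSec / replicationFactor / updateSizeKb);
    }

    public static void main(String[] args) {
        // 100k writes/sec per node, 1.5KB ACT write size, 3 nodes,
        // replication factor 2, ~8KB average update
        System.out.println(sustainableUpdatesPerSec(100_000, 1.5, 3, 2, 8.0)); // 28125
    }
}
```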

&lt;p&gt;In practice a &lt;em&gt;50k&lt;/em&gt; update rate was easily sustained using the real time benchmarker. The reason the value is higher is that larger writes do not necessarily have a larger penalty than small writes. Also, the ACT rating guarantees operations are sub 1ms in latency 95% of the time, a guarantee not necessarily needed for time series inserts. &lt;/p&gt;

&lt;p&gt;The cost of such a cluster would be $23k per year using on-demand pricing ($0.90 / hour / instance) or $16k per year ($0.61 / hour / instance) using a reserved pricing plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reads
&lt;/h3&gt;

&lt;p&gt;Queries retrieving &lt;strong&gt;1 million points per query&lt;/strong&gt; (1 year of observations every 30 seconds) were able to run at the rate of two per second, with &lt;strong&gt;end to end latency of ~0.5 seconds&lt;/strong&gt; for a sustained period using the benchmarking tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions
&lt;/h2&gt;

&lt;p&gt;At the time of writing, this is an initial release of the API, so further developments should be expected. Possible future iterations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data compression following the &lt;a href="https://www.vldb.org/pvldb/vol8/p1816-teller.pdf"&gt;Gorilla&lt;/a&gt; approach which potentially allows data footprint to be reduced by 90%&lt;/li&gt;
&lt;li&gt;Labelling of data to support easy retrieval of multiple properties for a subject. For example, several sensors may be attached to an industrial machine - it may be convenient to retrieve all of these series simultaneously for analysis purposes.&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop"&gt;REPL&lt;/a&gt; (read-eval-print loop) capability to support interactive analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Download
&lt;/h2&gt;

&lt;p&gt;The Time Series Client is available at Maven Central - &lt;a href="https://search.maven.org/artifact/io.github.aerospike-examples/aero-time-series-client"&gt;aero-time-series-client&lt;/a&gt;. You can download it directly or add the snippet below to your pom.xml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.github.aerospike-examples&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;aero-time-series-client&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;LATEST&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;Images courtesy of Unsplash - left to right&lt;br&gt;
&lt;a href="https://unsplash.com/@joshmillerdp"&gt;https://unsplash.com/@joshmillerdp&lt;/a&gt;&lt;br&gt;
&lt;a href="https://unsplash.com/@markusspiske"&gt;https://unsplash.com/@markusspiske&lt;/a&gt;&lt;br&gt;
&lt;a href="https://unsplash.com/@publicpowerorg"&gt;https://unsplash.com/@publicpowerorg&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Aerospike Document API</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Thu, 17 Jun 2021 09:45:01 +0000</pubDate>
      <link>https://forem.com/aerospike/aerospike-document-api-3e5</link>
      <guid>https://forem.com/aerospike/aerospike-document-api-3e5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.aerospike.com"&gt;Aerospike&lt;/a&gt; is a high performance, distributed, scalable, key value database. Aerospike leverages SSD technology to achieve levels of throughput and low latency exceeding even those obtained with in memory products. This allows hardware costs to be reduced 10x or more and data density to be increased 10x or more versus any other high performance solution.&lt;/p&gt;

&lt;p&gt;Aerospike has a significant number of distinguishing characteristics versus competitor products. Here we focus on the &lt;a href="https://www.aerospike.com/docs/guide/cdt.html"&gt;Collection Data Type (CDT) API&lt;/a&gt;. The CDT API facilitates list and map oriented operations within objects, thereby reducing network overhead and client side computation. It is highly efficient, adding little overhead to read or write calls, and composable, allowing construction of complex overall operations.&lt;/p&gt;

&lt;p&gt;The CDT API contains within it all the primitives required to build a sophisticated document API along the lines suggested by Stefan Gössner in his well-known &lt;a href="https://goessner.net/articles/JsonPath/"&gt;JsonPath&lt;/a&gt; proposal, which is in turn modelled on the &lt;a href="https://en.wikipedia.org/wiki/XPath"&gt;XPath&lt;/a&gt; standard for XML. For those not familiar with it, XPath supports CRUD operations via a filesystem-navigation-like syntax. JsonPath is essentially a port of this idea to JSON.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike/aerospike-document-lib"&gt;Aerospike Document API&lt;/a&gt; provides such an API, using the CDT API. Below we demonstrate the way in which CRUD operations can be executed at arbitrary points within a JSON document using the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Creation
&lt;/h2&gt;

&lt;p&gt;First we need a JSON document to work with. Consider the below, expressing some subjective (and, for space reasons, selective) highlights of &lt;a href="https://en.wikipedia.org/wiki/Tommy_Lee_Jones"&gt;Tommy Lee Jones&lt;/a&gt;' career.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forenames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Tommy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Lee"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"surname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jones"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date_of_birth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1946&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selected_filmography"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2012"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Lincoln"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Men In Black 3"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2007"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"No Country For Old Men"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2002"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Men in Black 2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1997"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Men in Black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Volcano"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1994"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Natural Born Killers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Cobb"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1991"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"JFK"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1980"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"Coal Miner's Daughter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Barn Burning"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"imdb_rank"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"https://www.imdb.com/list/ls050274118/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"best_films_ranked"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://www.rottentomatoes.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"films"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"The Fugitive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"No Country For Old Men"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Men In Black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Coal Miner's Daughter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Lincoln"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"https://medium.com/the-greatest-films-according-to-me/10-greatest-films-of-tommy-lee-jones-97426103e3d6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"films"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"The Three Burials of Melquiades Estrada"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"The Homesman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"No Country for Old Men"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"In the Valley of Elah"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Coal Miner's Daughter"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the document API, we add this to our database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create a document client via an existing aerospikeClient&lt;/span&gt;
&lt;span class="nc"&gt;AerospikeDocumentClient&lt;/span&gt; &lt;span class="n"&gt;documentClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AerospikeDocumentClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Convert JSON string to a JsonNode&lt;/span&gt;
&lt;span class="nc"&gt;JsonNode&lt;/span&gt; &lt;span class="n"&gt;jsonNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JsonConverters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;convertStringToJsonNode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesJson&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Construct an appropriate key&lt;/span&gt;
&lt;span class="nc"&gt;Key&lt;/span&gt; &lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;AEROSPIKE_NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;AEROSPIKE_SET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"src/test/resources/tommy-lee-jones.json"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Add to database using a named bin&lt;/span&gt;
&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonNode&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reads
&lt;/h2&gt;

&lt;p&gt;We can find out the name of Jones' best film according to 'Rotten Tomatoes' using the path &lt;code&gt;$.best_films_ranked[0].films[0]&lt;/code&gt;. Hopefully the use of keys and list indexes is immediately intuitive when considered in the context of the above document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.best_films_ranked[0].films[0]"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will return&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The Fugitive"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are not limited to retrieving primitives. An expression such as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"$.selected_filmography.1980"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will retrieve a list&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Coal Miner's Daughter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"Barn Burning"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Updates
&lt;/h2&gt;

&lt;p&gt;We can add to the document. The snippet below will add a 2019 element to the filmography.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_2019Films&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;();&lt;/span&gt;
&lt;span class="n"&gt;_2019Films&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Ad Astra"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.selected_filmography.2019"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;_2019Films&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can update document nodes directly - this syntax updates Jones' &lt;a href="https://www.imdb.com/"&gt;IMDb&lt;/a&gt; ranking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.imdb_rank.rank"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Append operations may also be used. For example, we can append to 'Rotten Tomatoes' list of best films using the reference &lt;code&gt;$.best_films_ranked[0].films&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.best_films_ranked[0].films"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"Rolling Thunder"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.best_films_ranked[0].films"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"The Three Burials"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deletion
&lt;/h2&gt;

&lt;p&gt;We can similarly use JsonPath-like expressions to support node deletion. The line below will remove the Medium rankings from the document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;documentClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tommyLeeJonesDBKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentBinName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"$.best_films_ranked[1]"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;The library has been published to Maven Central as &lt;em&gt;aerospike-document-api&lt;/em&gt;. Add&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.aerospike&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;aerospike-document-api&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.1.2&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to your &lt;em&gt;pom.xml&lt;/em&gt; to use.&lt;/p&gt;

&lt;p&gt;For full details of the API see &lt;a href="https://github.com/aerospike/aerospike-document-lib"&gt;aerospike-document-lib&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tie Breaker Functionality for Aerospike Multi-Site Clustering</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Fri, 06 Nov 2020 09:58:19 +0000</pubDate>
      <link>https://forem.com/aerospike/tie-breaker-functionality-for-aerospike-multi-site-clustering-36kd</link>
      <guid>https://forem.com/aerospike/tie-breaker-functionality-for-aerospike-multi-site-clustering-36kd</guid>
      <description>&lt;p&gt;Aerospike's &lt;a href="https://www.aerospike.com/lp/exploring-data-consistency-aerospike-enterprise-edition/" rel="noopener noreferrer"&gt;Strong Consistency (SC)&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/architecture/rack-aware.html" rel="noopener noreferrer"&gt;rack awareness&lt;/a&gt; features allow construction of &lt;a href="http://pages.aerospike.com/rs/229-XUE-318/images/Aerospike_Solution_Brief_Multi-site_Clustering.pdf" rel="noopener noreferrer"&gt;multi-site database clusters&lt;/a&gt; in which each site has a full copy of the data and all sites are perfectly synchronized.&lt;/p&gt;

&lt;p&gt;This allows database operation to continue, without loss of data, in the event of a network partition between sites or the loss of a site.&lt;/p&gt;

&lt;p&gt;The gold standard configuration is to use three data centres, as represented in the diagram below. In this diagram, each data centre corresponds to a physical rack, and a replication factor of three has been chosen. As we have three racks, rack awareness will distribute the records so that there is exactly one copy of each record on each rack. Clients can be configured to read either from the master records, which are distributed uniformly across the three racks, or from the local version, be it master or replica, using &lt;a href="https://www.aerospike.com/blog/aerospike-4-5-2-relaxed-consistency-for-increased-availability/" rel="noopener noreferrer"&gt;relaxed reads&lt;/a&gt;. In the diagram the relaxed read approach is shown, which optimises read latency. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgdvqdryqayj4t7g8zcs7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgdvqdryqayj4t7g8zcs7.png" alt="3-dc-sc"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the event of one of the three data centres becoming unavailable, there will be no data loss as each of the three data centres contains a full copy of the data. Aerospike's resilience features will ensure that, in the event of failure, client applications automatically move to reading and writing from available data centres. &lt;/p&gt;

&lt;p&gt;For those interested in the performance of such a configuration, my colleague &lt;a href="https://medium.com/@jagjeet.singh" rel="noopener noreferrer"&gt;Jagjeet Singh&lt;/a&gt; has written an excellent &lt;a href="https://medium.com/aerospike-developer-blog/multi-site-clustering-and-rack-awareness-c987e3abe0a6?source=friends_link&amp;amp;sk=9fb729a6a1e21c5a596eeb4efde9b233" rel="noopener noreferrer"&gt;blog&lt;/a&gt; detailing results for a cluster that spans the east and west coasts of the USA.&lt;/p&gt;
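&lt;p&gt;For reference, a client opts into the rack-aware, relaxed reads described above roughly as follows. This is a minimal sketch assuming the Aerospike Java client; the host name and rack id are placeholders. It needs a running cluster, so it is illustrative rather than self-verifying.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Replica;

public class RackAwareReadsSketch {
    public static void main(String[] args) {
        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.rackAware = true; // declare which rack (data centre) this client sits in
        clientPolicy.rackId = 1;       // must match the rack-id configured on the local nodes
        // Relaxed reads: prefer this rack's copy (master or replica), fall back otherwise
        clientPolicy.readPolicyDefault.replica = Replica.PREFER_RACK;

        AerospikeClient client =
                new AerospikeClient(clientPolicy, new Host("aerospike-dc1.example.com", 3000));
        // ... reads issued with the default policy now favour the local rack
        client.close();
    }
}
```

&lt;p&gt;With &lt;code&gt;PREFER_RACK&lt;/code&gt;, the client reads from whichever copy sits in its own rack, falling back to other racks only if the local copy is unavailable.&lt;/p&gt;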

&lt;p&gt;An alternate arrangement is to run across two data centres as shown below. In this diagram, each data centre again corresponds to a physical rack, and a replication factor of two has been chosen. As we have two racks, rack awareness will distribute the records so that there is exactly one copy of each record on each rack. For reasons explained below, we have an odd/even split across the two DCs. Only the red nodes are initially active. The green node is on standby and is not part of the cluster. Again, the clients are shown operating in 'relaxed read' mode where they have a 'preferred' rack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkv305ck5zc1kvocg5p2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkv305ck5zc1kvocg5p2e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once more, in the event of one of the two data centres becoming unavailable, there will be no data loss as each of the two data centres contains a full copy of the data. &lt;/p&gt;

&lt;p&gt;Should the minority data centre (DC2) fail, client applications will automatically move to reading and writing from the available data centre (DC1).&lt;/p&gt;

&lt;p&gt;Should the majority data centre (DC1) fail, the minority cluster (DC2) will block writes until a &lt;a href="https://www.aerospike.com/docs/operations/configure/consistency/index.html#create-the-roster" rel="noopener noreferrer"&gt;'roster'&lt;/a&gt; command is issued, indicating that the minority cluster (DC2) should take over as the definitive master. The standby (green) node is also added to the cluster at this point for capacity purposes.&lt;/p&gt;

&lt;p&gt;The odd/even arrangement is necessary because, were the two data centres to contain an equal number of nodes, our &lt;a href="https://www.aerospike.com/docs/architecture/consistency.html#roster-of-nodes" rel="noopener noreferrer"&gt;strong consistency rules&lt;/a&gt; would have the effect of 50% of partitions being active in each data centre, which is unlikely to be the desired outcome.&lt;/p&gt;

&lt;p&gt;Two trade-offs are being made here in order to guarantee consistency. The first is the need for potential operator intervention, and the second is the uneven balance of the two sides. Automation can be used to deal with the first, and a 'spare' node might well be considered a reasonable price to pay for consistency. &lt;/p&gt;

&lt;p&gt;An alternative is available, however, via a small change made in a recent server release - &lt;a href="https://www.aerospike.com/blog/aerospike5-2/" rel="noopener noreferrer"&gt;5.2&lt;/a&gt;. It allows us to add a &lt;a href="https://www.aerospike.com/docs/reference/configuration/#stay-quiesced" rel="noopener noreferrer"&gt;permanently quiesced&lt;/a&gt; node to the cluster. A quiesced node is one that does not master data, but may still participate in partition availability decisions. We can use such a node as a 'tie-break node', as shown in the arrangement below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl97jsdj6ouv85q9ftaco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl97jsdj6ouv85q9ftaco.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the event of either DC1 or DC2 becoming unavailable, there will be no data loss, as each of the two data centres contains a full copy of the data. If this event were a network partition making DC1 unreachable from DC2 and DC3, the cluster would automatically reconfigure to serve all writes and reads from DC2, thanks to the extra vote from the tie-break node, which makes DC2+DC3 a majority cluster. Similar behaviour occurs if DC2 is unreachable from DC1 and DC3. Finally, the unavailability of DC3 would have no consequence, as DC1+DC2 forms a majority.&lt;/p&gt;

&lt;p&gt;We have eliminated the need for operator intervention in the majority failure case in the above scenario as well as avoiding the need for a fully specified 'spare' node (needed previously to accommodate necessary migrations to ensure full replication of data). This is because our 'tie break node' has no capacity requirements associated with it - it is there solely for decision making purposes. &lt;/p&gt;

&lt;p&gt;The trade-off is the need for a third data centre. It can be seen however that this still offers an advantage over the 'gold standard' in that we reduce our inventory by 1/3. An additional iteration might be to double the number of tie-breaker nodes in DC3. Although not strictly necessary this might assuage any concerns around single point of failure.&lt;/p&gt;

&lt;p&gt;Let's see how this works in practice. In the diagram below, I distribute my cluster across three AWS availability zones. The data nodes are in us-east-1a/b, with the tie-breaker in us-east-1c.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpntolw7bq93brliv3jba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpntolw7bq93brliv3jba.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Aerospike admin tool can be used to show cluster state. The IP addresses of each node are highlighted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frjevmtbgb2hj263clzxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frjevmtbgb2hj263clzxa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next we add some data - 1m objects in this case - and again inspect state. We can see below (red highlight) that we have 1m master records and 1m replica records, and the green highlight shows how our servers have been separated into three racks, corresponding to the availability zones. The tie-break node, 10.0.0.68, in us-east-1c, is in a rack of its own and is not managing any data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fir9m9c01sv7l2ozkhnot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fir9m9c01sv7l2ozkhnot.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can simulate a network partition by completely firewalling off us-east-1a. Let's say we do this while we have active reads and writes taking place. The screenshot below shows this happening at approx 13:51:25. We can see that we get no errors on the read side (because replicas are tried in the event of timeouts or failures) and some write timeouts (these are the writes in flight at the time of the network partition). We also see (last 5 lines) the client removing the unavailable nodes from its internal node list, after which normal operation resumes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmvj3agi2p59kqwcabn4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmvj3agi2p59kqwcabn4g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the server side, we can see below that we lose all the nodes in rack 100001, corresponding to those with the 10.0.1.* addresses. The number of master records stays as expected (green highlight), while new prole (replica) records must be created (blue highlight) because immediately after the network partition we hold only one copy of each record. This is reflected in the migration statistics (purple highlight).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx1oolhr69a69fxz6bdiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx1oolhr69a69fxz6bdiu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the migrations are complete (purple highlight), we have a full complement of master and prole (replica) objects (green highlight).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fybul9lbpncjthyb0qrxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fybul9lbpncjthyb0qrxi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We can use the new tie-breaker capability to build fully resilient distributed Aerospike clusters, while minimising hardware usage.&lt;/p&gt;




&lt;p&gt;Title image : &lt;a href="https://unsplash.com/@arnok" rel="noopener noreferrer"&gt;https://unsplash.com/@arnok&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Automated Aerospike All Flash Setup</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Wed, 28 Oct 2020 14:15:42 +0000</pubDate>
      <link>https://forem.com/aerospike/automated-aerospike-all-flash-setup-3ho6</link>
      <guid>https://forem.com/aerospike/automated-aerospike-all-flash-setup-3ho6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Aerospike is a key-value database that makes the most of SSD/flash technology in order to offer best-in-class throughput and latency at petabyte scale.&lt;/p&gt;

&lt;p&gt;Standard Aerospike usage will have the primary key index in DRAM and the data on SSD. Although Aerospike's usage of DRAM is very low at 64 bytes per object, for very large numbers of objects (100bn+) users might wish to consider the all-flash mode in which the primary key index is also placed on disk. More detail at &lt;a href="https://www.aerospike.com/docs/operations/configure/namespace/index/index.html#flash-index"&gt;all flash usage&lt;/a&gt;.&lt;/p&gt;
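
&lt;p&gt;A quick back-of-envelope calculation shows why this matters at that scale (a sketch using the 64 bytes per object figure above, before any replication):&lt;/p&gt;

```python
# DRAM needed for the primary key index in a standard (index-in-memory) deployment
objects = 100_000_000_000           # 100bn objects
bytes_per_entry = 64                # primary key index entry size in bytes
index_dram = objects * bytes_per_entry
print(index_dram / 10**12)          # 6.4 - i.e. 6.4TB of DRAM for the index alone
```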

&lt;p&gt;There are a number of non-trivial steps to go through to set up all flash. For that reason I've extended &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; to allow automation of this process. This article walks through the automated process. It's envisaged that this will be useful for those evaluating the feature, or looking to get up and running with it quickly.&lt;/p&gt;

&lt;p&gt;A working knowledge of &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; is assumed. This &lt;a href="https://dev.to/aerospike/ansible-for-aerospike-43ln"&gt;introductory article&lt;/a&gt; may also be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  All Flash Calculations
&lt;/h2&gt;

&lt;p&gt;In order to correctly configure a system for all flash, you need to know the number of &lt;a href="https://www.aerospike.com/docs/reference/configuration/#partition-tree-sprigs"&gt;partition-tree-sprigs&lt;/a&gt; appropriate for the object count your database will hold. You can think of a partition tree sprig as a mini primary key index - sprigs keep the primary key tree shallow, allowing record locations to be looked up more rapidly. More detail at &lt;a href="https://discuss.aerospike.com/t/faq-what-are-sprigs/4936"&gt;sprigs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This matters for all-flash because we size the system so that each sprig fits inside a single disk block, minimising read and write overhead.&lt;/p&gt;

&lt;p&gt;You can find details of the calculation &lt;a href="https://www.aerospike.com/docs/operations/plan/capacity/index.html#all-flash-index-device-space"&gt;here&lt;/a&gt;, but to make life easier a spreadsheet can be found in &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; at  &lt;code&gt;assets/all-flash-calculator.xlsx&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frnec3phngppeza78e67p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frnec3phngppeza78e67p.png" alt="all-flash-calculator.xlsx" width="782" height="950"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Populate the yellow cells - # of objects, replication factor and object size.&lt;/p&gt;

&lt;p&gt;The spreadsheet will calculate required partition-tree-sprigs.&lt;/p&gt;

&lt;p&gt;It will also determine the fraction of available disk space that should be given over to the primary key index, based on the object size. In the screenshot, we can see that for 100m records, replication factor 2, average record size 1024 bytes, the overhead per record is 172 bytes and the overall record footprint is 2220 bytes, so approx 1/13 of the disk space should be allocated to the index.&lt;/p&gt;
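
&lt;p&gt;The disk split can be reproduced in a couple of lines - a sketch using the spreadsheet's own outputs for the worked example above:&lt;/p&gt;

```python
# reproduce the spreadsheet's index:data disk split
index_overhead = 172     # index bytes per record (spreadsheet output)
footprint = 2220         # total disk bytes per record (spreadsheet output)

# one device partition in every 'partitions_per_device' is given to the index
partitions_per_device = round(footprint / index_overhead)
print(partitions_per_device)  # 13
```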

&lt;h2&gt;
  
  
  Using Aerospike-Ansible
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;vars/cluster-config.yml&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;partitions_per_device&lt;/code&gt; to the value given in the spreadsheet - 13 in the example. The first partition on each device is used for the all flash index to ensure the correct index:data disk space ratio.&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;partition_free_sprigs: YOUR_VALUE&lt;/code&gt; - YOUR_VALUE would be 1024 for this example&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will also need to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;all_flash: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;enterprise: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Provide a path to a valid Aerospike feature key using &lt;code&gt;feature_key: /your/path/to/key&lt;/code&gt;. You must therefore be either a licensed Aerospike customer, or running an Aerospike trial.&lt;/li&gt;
&lt;/ul&gt;
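
&lt;p&gt;Putting the above together, the relevant section of &lt;code&gt;vars/cluster-config.yml&lt;/code&gt; would look something like this (a sketch using the example values; the feature key path is a placeholder):&lt;/p&gt;

```yaml
# vars/cluster-config.yml - all flash settings for the worked example
all_flash: true
enterprise: true
feature_key: /your/path/to/key
partitions_per_device: 13        # from the spreadsheet
partition_free_sprigs: 1024      # from the spreadsheet
```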

&lt;p&gt;Having done that&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ansible-playbook aws-setup-plus-aerospike-install.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;You should check that the aggregate disk space across your cluster exceeds the amount recommended in the spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;p&gt;Once the setup process is complete, log into one of your cluster nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./scripts/cluster-quick-ssh.sh 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then start &lt;code&gt;asadm&lt;/code&gt; (the admin tool) and run the &lt;code&gt;info&lt;/code&gt; command&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgj56tha4n6sh9dnksl7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgj56tha4n6sh9dnksl7x.png" alt="asadm" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The index type comes up as 'flash' as per the highlight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Load
&lt;/h2&gt;

&lt;p&gt;You can follow the instructions in &lt;a href="https://github.com/aerospike-examples/aerospike-ansible#using-the-benchmarking-client"&gt;benchmarking&lt;/a&gt; to quickly load some data into the new configuration.&lt;/p&gt;

&lt;p&gt;As before, we can use asadm to examine the (highlighted) disk footprint of the primary key index - in this case for 10m records (20m including replicas). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe885gviz51aq1goug5ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe885gviz51aq1goug5ov.png" alt="asadm-2" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; tooling makes it easy to set up all flash for Aerospike and benefit from the DRAM saving it offers.&lt;/p&gt;




&lt;p&gt;Cover image &lt;a href="https://unsplash.com/@kreyatif"&gt;Michał Mancewicz&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
    </item>
    <item>
      <title>Using Aerospike Connect For Spark</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Wed, 21 Oct 2020 15:08:21 +0000</pubDate>
      <link>https://forem.com/aerospike/using-aerospike-connect-for-spark-3poi</link>
      <guid>https://forem.com/aerospike/using-aerospike-connect-for-spark-3poi</guid>
      <description>&lt;p&gt;&lt;a href="https://www.aerospike.com"&gt;Aerospike&lt;/a&gt; is a highly scalable key value database offering best in class performance. It is typically deployed into real-time environments managing terabyte to petabyte data volumes. &lt;/p&gt;

&lt;p&gt;Aerospike will typically run alongside other scalable distributed software, such as Kafka for system coupling, or Spark for analytics. The &lt;a href="https://www.aerospike.com/docs/connect/"&gt;Aerospike Connect&lt;/a&gt; product line makes integration as easy as possible.&lt;/p&gt;

&lt;p&gt;This article looks at how Aerospike Spark Connect works in practice by offering a comprehensive and easily reproduced end-to-end example using &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Database Cluster Setup
&lt;/h2&gt;

&lt;p&gt;First take a look at &lt;a href="https://dev.to/aerospike/ansible-for-aerospike-43ln"&gt;Ansible for Aerospike&lt;/a&gt; which explains how to use &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this example I set &lt;code&gt;cluster_instance_type&lt;/code&gt; to c5d.18xlarge in &lt;code&gt;vars/cluster-config.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Follow the instructions up to and including &lt;a href="https://github.com/aerospike-examples/aerospike-ansible#one-touch-setup"&gt;one touch setup&lt;/a&gt;. You'll get as far as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-playbook aws-setup-plus-aerospike-install.yml
ansible-playbook aerospike-java-client-setup.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will give you a 3 node cluster by default, plus a client instance with relevant software installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spark Cluster Setup
&lt;/h2&gt;

&lt;p&gt;This is done via&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ansible-playbook spark-cluster-setup.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this example, prior to running, I set &lt;code&gt;spark_instance_type&lt;/code&gt; to c5d.4xlarge in &lt;code&gt;vars/cluster-config.yml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This playbook creates a 3 node Spark cluster, of the given instance type, with Spark installed and running. It also installs Aerospike Spark Connect.&lt;/p&gt;

&lt;p&gt;Note you will need to set &lt;code&gt;enterprise: true&lt;/code&gt; and provide a path to a valid Aerospike feature key using &lt;code&gt;feature_key: /your/path/to/key&lt;/code&gt; in &lt;code&gt;vars/cluster-config.yml&lt;/code&gt;. You must therefore be either a licensed Aerospike customer, or running an Aerospike trial.&lt;/p&gt;

&lt;p&gt;Near the end of the process you will see&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK [Spark master IP &amp;amp; master internal url] ************************************************************************************************************************************************************************
ok: [localhost] =&amp;gt; {
    "msg": "Spark master is 3.88.237.103. Spark master internal url is spark://10.0.2.122:7077."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make a note of the Spark master internal url - it is needed later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load Data
&lt;/h2&gt;

&lt;p&gt;Our example makes use of 20m records from the &lt;a href="https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/"&gt;1bn NYC Taxi ride&lt;/a&gt; corpus, available in compressed form at &lt;a href="https://aerospike-ken-tune.s3.amazonaws.com/nyc-taxi-data/trips_xaa.csv.gz"&gt;https://aerospike-ken-tune.s3.amazonaws.com/nyc-taxi-data/trips_xaa.csv.gz&lt;/a&gt;. We load to Aerospike using &lt;a href="https://github.com/aerospike/aerospike-loader"&gt;aerospike loader&lt;/a&gt;, which is installed on the client machine set up above. First of all we get the addresses of the hosts in the Aerospike cluster - these are needed later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ./scripts/cluster-ip-address-list.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Adds cluster ips to this array- AERO_CLUSTER_IPS
Use as ${ AERO_CLUSTER_IPS[index]}
There are 3 entries

##########################################################

cluster IP addresses : Public : 3.87.14.39, Private : 10.0.2.58
cluster IP addresses : Public : 3.89.113.231, Private : 10.0.0.234
cluster IP addresses : Public : 23.20.193.64, Private : 10.0.1.95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/aerospike/aerospike-loader"&gt;aerospike loader&lt;/a&gt; requires a config file to load the data into Aerospike. This maps csv column positions to named and typed bins. A sample entry looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pkup_datetime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"column_position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"encoding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yyyy-MM-dd hh:mm:ss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"dst_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is provided in the repo at &lt;code&gt;recipes/aerospike-spark-demo/nyc-taxi-data-aero-loader-config.json&lt;/code&gt;. We upload this to the client instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source ./scripts/client-ip-address-list.sh 
scp -i .aws.pem ./recipes/aerospike-spark-demo/nyc-taxi-data-aero-loader-config.json ec2-user@${AERO_CLIENT_IPS[0]}:~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next get the data onto the client machine. There's more than one way to do this, but you need to plan ahead as the dataset is 7.6GB when uncompressed. I used the commands below, but the details will depend on your drives and filesystem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./scripts/client-quick-ssh.sh # to log in, followed by

sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir /data
sudo mount -t ext4 /dev/nvme1n1 /data
sudo chmod 777 /data

wget -P /data https://aerospike-ken-tune.s3.amazonaws.com/nyc-taxi-data/trips_xaa.csv.gz
gunzip /data/trips_xaa.csv.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we load our data in, using the config file we uploaded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~/aerospike-loader
./run_loader -h 10.0.0.234 -p 3000 -n test -c ~/nyc-taxi-data-aero-loader-config.json /data/trips_xaa.csv 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note we're using one of the cluster IP addresses we recorded earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Spark
&lt;/h2&gt;

&lt;p&gt;Log into one of the Spark nodes. Via &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; there is a utility script for this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./scripts/spark-quick-ssh.sh 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start up a Spark shell, using the Spark master URL we saw when running the Spark cluster setup playbook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/spark/bin/spark-shell --master spark://10.0.2.122:7077
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import relevant libraries&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nc"&gt;SQLContext&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SaveMode&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.SparkConf&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Date&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.text.SimpleDateFormat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supply the Aerospike configuration - note we supply the cluster IP used previously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aerospike.seedhost"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"10.0.0.234"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aerospike.namespace"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define a view and a function we will be using&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sqlContext&lt;/span&gt;
&lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;udf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;register&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"getYearFromSeconds"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleDateFormat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"yyyy"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;taxi&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.aerospike.spark.sql"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"aerospike.set"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"nyc-taxi-data"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;
&lt;span class="nv"&gt;taxi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"taxi"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we run our queries&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Journeys grouped by cab type&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT cab_type,count(*) count FROM taxi GROUP BY cab_type"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;+--------+--------+&lt;/span&gt;                                                             
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;cab_type&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+--------+--------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;20000000&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+--------+--------+&lt;/span&gt;

&lt;span class="c1"&gt;// Average fare based on different passenger count&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT passenger_cnt, round(avg(total_amount),2) avg_amount FROM taxi GROUP BY passenger_cnt ORDER BY passenger_cnt"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;+-------------+----------+&lt;/span&gt;                                                      
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;passenger_cnt&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;avg_amount&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+----------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;10.86&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;14.63&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;15.75&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;15.87&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;15.85&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;14.76&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;15.42&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;23.74&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;19.52&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;      &lt;span class="mf"&gt;34.9&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+----------+&lt;/span&gt;

&lt;span class="c1"&gt;// No of journeys for different numbers of passengers&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT passenger_cnt,getYearFromSeconds(pkup_datetime) trip_year, count(*) count FROM taxi GROUP BY passenger_cnt, getYearFromSeconds(pkup_datetime) order by passenger_cnt"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;+-------------+---------+--------+&lt;/span&gt;                                              
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;passenger_cnt&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;trip_year&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;   &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+---------+--------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="mi"&gt;4106&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;16557518&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1473578&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;507862&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;160714&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;939276&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="mi"&gt;355846&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;492&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;494&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;114&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+---------+--------+&lt;/span&gt;

&lt;span class="c1"&gt;// Number of trips for each passenger count/distance combination&lt;/span&gt;
&lt;span class="c1"&gt;// Ordered by trip count, descending&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sqlContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT passenger_cnt,getYearFromSeconds(pkup_datetime) trip_year,round(trip_distance) distance,count(*) trips FROM taxi GROUP BY passenger_cnt,getYearFromSeconds(pkup_datetime),round(trip_distance) ORDER BY trip_year,trips desc"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;+-------------+---------+--------+-------+&lt;/span&gt;                                      
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;passenger_cnt&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;trip_year&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="n"&gt;trips&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+---------+--------+-------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;5321230&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;3500458&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2166462&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1418494&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;918460&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;868210&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;6.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;653646&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;488416&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;433746&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;8.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;345728&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;305578&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;302120&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;9.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;226278&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;199968&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;199522&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;163928&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;145580&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;137152&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;122714&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="mf"&gt;11.0&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;117570&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+-------------+---------+--------+-------+&lt;/span&gt;
&lt;span class="n"&gt;only&lt;/span&gt; &lt;span class="n"&gt;showing&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This shows how quickly you can get up and running with a large data corpus. The example used 20m rows, but the approach extends readily to the full dataset. It also shows how little effort is needed to provision the underlying cluster with the &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; tooling.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>spark</category>
    </item>
    <item>
      <title>Aerospike on EKS (AWS K8s)</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Mon, 27 Jul 2020 10:12:18 +0000</pubDate>
      <link>https://forem.com/aerospike/aerospike-on-eks-aws-k8s-m5b</link>
      <guid>https://forem.com/aerospike/aerospike-on-eks-aws-k8s-m5b</guid>
      <description>&lt;p&gt;Much like Java, Kubernetes offers the promise of 'write once, run anywhere'. The wry riposte (a little unfairly) in the early days of Java, was 'write once, debug everywhere'. To a certain extent this is the position we are in today with the various flavours of Kubernetes out there  -  getting up and running is different on Google Cloud Platform vs AWS vs Azure vs local (e.g. Minikube), and a handful of well placed pointers can be a great time saver.&lt;/p&gt;

&lt;p&gt;This article is about getting Aerospike up and running on Amazon's Kubernetes Service  -  &lt;a href="https://aws.amazon.com/eks/"&gt;EKS&lt;/a&gt;. As it happens, there is little that is EKS-specific to say about Aerospike itself; the article is instead about the things you need to do to get EKS up and running so you can start to run Aerospike on top of it. For users of Aerospike, we just want to make it as easy as possible, wherever it is you'd like to run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;There are four things you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://eksctl.io/"&gt;eksctl&lt;/a&gt; - the AWS command line utility allowing you to administer (e.g. setup/teardown) your AWS Kubernetes cluster. We use it here to set up the EKS cluster itself. Details of how to install can be found at &lt;a href="https://github.com/weaveworks/eksctl#installation"&gt;https://github.com/weaveworks/eksctl#installation&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://helm.sh/"&gt;helm&lt;/a&gt; - the package manager for Kubernetes applications. We use it here to deploy our Aerospike cluster. Details of how to install at &lt;a href="https://helm.sh/docs/intro/install/"&gt;https://helm.sh/docs/intro/install/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl - the Kubernetes client allowing you to deploy and administer Kubernetes applications. We use it here to perform ad-hoc Kubernetes operations. Installation details at &lt;a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/"&gt;https://kubernetes.io/docs/tasks/tools/install-kubectl/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://aws.amazon.com/cli/"&gt;AWS CLI&lt;/a&gt;, allowing command line management of AWS services. We use it here for credential management. Installation details at &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html"&gt;https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below, this is scripted in full for CentOS - the script can readily be adapted for use in other environments. You need to run it using sudo or as root.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
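&lt;p&gt;In outline, the script covers the four installs above. A minimal sketch for CentOS follows - it writes the steps to a file for review rather than executing them directly, and the download URLs are illustrative, so check each project's installation page for the current instructions.&lt;/p&gt;

```shell
# Write the prerequisite installs to a script for review (illustrative URLs).
cat <<'EOF' > install-eks-prereqs.sh
#!/bin/bash
set -euo pipefail

# 1. eksctl - administers the EKS cluster itself
curl -sL "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz" \
  | tar xz -C /usr/local/bin

# 2. helm - deploys the Aerospike chart
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# 3. kubectl - ad-hoc Kubernetes operations
curl -sLo /usr/local/bin/kubectl \
  "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x /usr/local/bin/kubectl

# 4. AWS CLI v2 - credential management
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip -q awscliv2.zip && ./aws/install
EOF
chmod +x install-eks-prereqs.sh
```

&lt;p&gt;Review the generated script, then run it using sudo or as root.&lt;/p&gt;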


&lt;h2&gt;
  
  
  Credentials
&lt;/h2&gt;

&lt;p&gt;In order to use &lt;em&gt;eksctl&lt;/em&gt; you will need to set up an AWS user whose credentials &lt;em&gt;eksctl&lt;/em&gt; will run under. This is a slightly involved topic, and I have therefore split it out into a separate article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aerospike/aws-credentials-for-eks-3a2i"&gt;https://dev.to/aerospike/aws-credentials-for-eks-3a2i&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article shows you how to set up the &lt;a href="https://aws.amazon.com/iam/"&gt;IAM&lt;/a&gt; policy you need as well as user and group configuration. It concludes with use of &lt;em&gt;aws configure&lt;/em&gt; to store credentials so &lt;em&gt;eksctl&lt;/em&gt; can make use of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aerospike Up and Running
&lt;/h2&gt;

&lt;p&gt;Now we're in a position to launch our Aerospike cluster. First we need to create our Kubernetes cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl create cluster &lt;span class="nt"&gt;--name&lt;/span&gt; aero-k8s-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be warned, this takes something of the order of 30 minutes to complete. Once complete, type &lt;em&gt;kubectl config get-contexts&lt;/em&gt; to verify correct setup. A &lt;em&gt;context&lt;/em&gt; contains Kubernetes access parameters - the user, cluster and default namespace if set. Your output should be similar to&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo9dmdwo6eynzwzf8v32d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo9dmdwo6eynzwzf8v32d.png" alt="Alt Text" width="800" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we retrieve the helm chart that supports cluster setup, and add it to our local repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add aerospike https://aerospike.github.io/aerospike-kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are now able to install our Aerospike cluster. We will also enable the monitoring capability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;demo aerospike/aerospike &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;enableAerospikeMonitoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--set&lt;/span&gt; rbac.create&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyvsb39yxu99tlq03l36m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyvsb39yxu99tlq03l36m.png" alt="Alt Text" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the last line in the screenshot suggests, we can take a look at our Kubernetes objects via&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all &lt;span class="nt"&gt;--namespace&lt;/span&gt; default &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"release=demo, chart=aerospike-5.0.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1kz7d8ta6gvtcq18c0b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1kz7d8ta6gvtcq18c0b4.png" alt="Alt Text" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All your pods should be in the running state before proceeding  -  if not, you can use the command below to wait until they are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Access Grafana Monitoring 
&lt;/h2&gt;

&lt;p&gt;In the screenshot above, you can see that the Grafana monitoring service is running on port 80. We will make that available locally on port 8080 via port forwarding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward service/demo-grafana 8080:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can connect to this using a browser - &lt;em&gt;&lt;a href="http://yourhostname:8080"&gt;http://yourhostname:8080&lt;/a&gt;&lt;/em&gt;. The credentials are admin/admin; note that Grafana will make you change the password on first login. Go to Home -&amp;gt; Cluster Overview at top left and you should see something like&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnjdsak8fugb2a25p09p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnjdsak8fugb2a25p09p1.png" alt="Alt Text" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that you can do the above port forward in any environment, not just where you did the creation. You will need to make sure you have the same credentials in &lt;em&gt;~/.aws/credentials&lt;/em&gt; and the required kubectl context. Below I have added my eks credentials to my aws credentials file under the &lt;em&gt;eks&lt;/em&gt; heading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbabylk66g3vzdd9k716b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbabylk66g3vzdd9k716b.png" alt="Alt Text" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;eksctl&lt;/em&gt; then has a neat utility allowing you to get your K8s context into your alternative environment. I'm taking care to run the command under the eks profile (-p flag) and in the region where the K8s cluster was created (-r flag - get this from the cluster name if you are not sure).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl utils write-kubeconfig &lt;span class="nt"&gt;--cluster&lt;/span&gt; aero-k8s-demo &lt;span class="nt"&gt;-p&lt;/span&gt; eks &lt;span class="nt"&gt;-r&lt;/span&gt; &amp;lt;REGION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using Aerospike
&lt;/h2&gt;

&lt;p&gt;We'll demonstrate use of Aerospike via our benchmarking software. From the &lt;em&gt;kubernetes-aws&lt;/em&gt; project&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; aero-client-deployment.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will create a single pod with a container with our benchmarking software installed. You can see the Dockerfile for the images used in the aerospike-java-client-build directory.&lt;/p&gt;

&lt;p&gt;The commands below respectively:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve the name of the container running the java client&lt;/li&gt;
&lt;li&gt;Run the run_benchmarks command against that container.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CONTAINER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"app=aerospike-java-client"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"{.items[0].metadata.name}"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nv"&gt;$CONTAINER&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
/aerospike-client-java/benchmarks/run_benchmarks &lt;span class="nt"&gt;-h&lt;/span&gt; demo-aerospike
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will be similar to&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj7ggowebuuv6p05n5m1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj7ggowebuuv6p05n5m1z.png" alt="Alt Text" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you go back to Grafana, change 'Cluster Overview' to 'Namespace view' at top left, change the time period (top right) to 'Last 5 minutes' and wait a couple of minutes, your output will be similar to&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdmnq76r6jjpq4fbc5rta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdmnq76r6jjpq4fbc5rta.png" alt="Alt Text" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tidying Up
&lt;/h2&gt;

&lt;p&gt;To remove the Java client&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; aero-client-deployment.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To uninstall the Aerospike stack&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm uninstall demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete your EKS cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl delete cluster aero-k8s-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The above gets you up and running with Aerospike on EKS, the AWS Kubernetes service. Fundamentally it is about the steps you need to take to get your EKS cluster running. From that point on (&lt;em&gt;helm repo add&lt;/em&gt; onwards) you can use the same steps against any Kubernetes cluster, be it EKS, GCP or Minikube based.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>kubernetes</category>
      <category>eks</category>
    </item>
    <item>
      <title>AWS Credentials for EKS</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Mon, 27 Jul 2020 10:06:27 +0000</pubDate>
      <link>https://forem.com/aerospike/aws-credentials-for-eks-3a2i</link>
      <guid>https://forem.com/aerospike/aws-credentials-for-eks-3a2i</guid>
      <description>&lt;p&gt;&lt;em&gt;What you need to know before you can create AWS Kubernetes clusters using the command line&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://eksctl.io/" rel="noopener noreferrer"&gt;eksctl&lt;/a&gt; is the AWS command line utility allowing you to administer (e.g. setup/teardown) your AWS Kubernetes cluster. This article details how you configure the credentials you need to use the service. This article is useful as this is not detailed on the eksctl &lt;a href="https://eksctl.io/" rel="noopener noreferrer"&gt;website&lt;/a&gt; and is non-trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM Overview
&lt;/h2&gt;

&lt;p&gt;Credentials in AWS are managed using &lt;a href="https://aws.amazon.com/iam/" rel="noopener noreferrer"&gt;IAM&lt;/a&gt;  -  AWS Identity and Access Management. Broadly speaking, you create &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_access-management.html" rel="noopener noreferrer"&gt;policies&lt;/a&gt; which are granular aggregations of permissions on AWS objects. You associate these with &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_groups.html" rel="noopener noreferrer"&gt;groups&lt;/a&gt; to which you add &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html" rel="noopener noreferrer"&gt;users&lt;/a&gt;. If a user has been created for programmatic access use, the user will have an &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html" rel="noopener noreferrer"&gt;access key id&lt;/a&gt; and a secret access key which can be stored on disk for use in conjunction with the &lt;a href="https://aws.amazon.com/cli/" rel="noopener noreferrer"&gt;AWS command line&lt;/a&gt; interface. The same mechanism is used by eksctl.&lt;/p&gt;
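&lt;p&gt;For orientation, a policy document is a JSON object of the following general shape. This skeleton is illustrative only - the actions shown are examples, and the full set of permissions you actually need is in the &lt;em&gt;eks.iam.policy.template&lt;/em&gt; file discussed below.&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["eks:CreateCluster", "eks:DescribeCluster"],
      "Resource": "*"
    }
  ]
}
```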

&lt;p&gt;In this article we set up the eksctl account in accordance with the principle of 'least privilege' - the account should have sufficient privileges to execute actions as needed, but no more.&lt;/p&gt;

&lt;p&gt;Below we go through the steps in the above process in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM Policy Setup
&lt;/h2&gt;

&lt;p&gt;The eksctl &lt;a href="https://eksctl.io/" rel="noopener noreferrer"&gt;website&lt;/a&gt; does not detail the set of IAM privileges needed to run eksctl, and trial and error is not recommended. Guidance can, however, be found in issue 204 below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fryzow0wq12z6h8ax6v01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fryzow0wq12z6h8ax6v01.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As this is still somewhat complicated (and incomplete) I'm going to make use of this, but simplify the process for you.&lt;/p&gt;

&lt;p&gt;First of all pull down &lt;a href="https://github.com/aerospike-examples/kubernetes-aws" rel="noopener noreferrer"&gt;https://github.com/aerospike-examples/kubernetes-aws&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 bash
git clone https://github.com/aerospike-examples/kubernetes-aws
cd kubernetes-aws


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The policy you need is in &lt;em&gt;eks.iam.policy.template&lt;/em&gt;. Some permissions, however, are account specific - if you search for the text &lt;em&gt;account-id&lt;/em&gt; in the template you will see a placeholder that needs replacing with your own account id.&lt;/p&gt;

&lt;p&gt;Find your account id by logging into the AWS console. Select 'My Account'.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9vzvjpa5h4lexbhztrey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9vzvjpa5h4lexbhztrey.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will see your account id in the next screen. Copy this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8nvrwt4us6o9x6mrkwkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8nvrwt4us6o9x6mrkwkx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the &lt;em&gt;kubernetes-aws&lt;/em&gt; project you just cloned, run&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
./make-policy.sh YOUR_ACCOUNT_ID


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result will be saved as &lt;em&gt;eks.iam.policy&lt;/em&gt;.&lt;/p&gt;
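
&lt;p&gt;For reference, the substitution the script performs can be sketched as follows. This is a hedged, stand-alone illustration, not the script itself: it assumes the placeholder in the template is the literal text &lt;em&gt;account-id&lt;/em&gt;, and it writes a tiny stand-in template rather than the real one.&lt;/p&gt;

```python
from pathlib import Path

account_id = "123456789012"  # replace with your own 12-digit account id

# tiny stand-in for eks.iam.policy.template (illustrative content only)
Path("eks.iam.policy.template").write_text(
    '{ "Resource": "arn:aws:iam::account-id:role/eks-example" }\n'
)

# replace every occurrence of the placeholder with the account id
template = Path("eks.iam.policy.template").read_text()
Path("eks.iam.policy").write_text(template.replace("account-id", account_id))

print(Path("eks.iam.policy").read_text())
```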

&lt;p&gt;Copy the contents of &lt;em&gt;eks.iam.policy&lt;/em&gt; to the clipboard.&lt;/p&gt;

&lt;p&gt;Select the IAM Service in the AWS console (Services-&amp;gt;IAM) and click 'Policies'.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fktnpjnbpr1f7ijxlwvtk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fktnpjnbpr1f7ijxlwvtk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, 'Create Policy'. Select 'JSON' rather than 'Visual Editor', remove the JSON you see, and replace it with the contents of &lt;em&gt;eks.iam.policy&lt;/em&gt;. Your screen should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjvr2gayz0sg38rwb9jy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjvr2gayz0sg38rwb9jy4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now click 'Review Policy'. Give your policy a name e.g. &lt;em&gt;EKS&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp04drubj149ma7x9gqc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp04drubj149ma7x9gqc0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally click 'Create Policy', bottom right of the above screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM Group Setup
&lt;/h2&gt;

&lt;p&gt;In this section we create an IAM group and add the EKS policy to it. &lt;/p&gt;

&lt;p&gt;Select 'Groups' from the left-hand IAM menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2armwrcdn6ykawvgqtra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2armwrcdn6ykawvgqtra.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Create New Group'. Give your group a name e.g. EKS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5044o6udwkmfa0ysuf2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5044o6udwkmfa0ysuf2j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Next Step'. Search for the policy you created and select.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1egnlk56iu864fe5g6r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1egnlk56iu864fe5g6r6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Next Step', followed by 'Create Group'. You should see your new group, EKS, appear in the group listing screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8gjker6mus3vxr2cn7i0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8gjker6mus3vxr2cn7i0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create IAM User
&lt;/h2&gt;

&lt;p&gt;Now we create a user and associate it with the EKS group. Select 'Users' from the left-hand menu above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hxa6xu39kxdruric8cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hxa6xu39kxdruric8cq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Add User'. Give your user a name e.g. EKS and check the 'programmatic access' access type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqt9df1rhlmavjqppyqk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqt9df1rhlmavjqppyqk9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Next: Permissions'. 'Add User To Group' will be selected by default. Check the 'EKS' group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw94qhn3upq4lu56119uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw94qhn3upq4lu56119uv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click 'Next: Tags', then 'Next: Review', and finally 'Create User'. You will see the screen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdh2uiguztwfnq439g1r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdh2uiguztwfnq439g1r6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep this screen in your browser - you will need it for the steps below.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CLI Credential Setup
&lt;/h2&gt;

&lt;p&gt;We are now in a position to cache our credentials on disk so they can be used by the AWS CLI or eksctl.&lt;/p&gt;

&lt;p&gt;You will need the AWS CLI. Installation details may be found at &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the environment in which you will be using the AWS CLI / eksctl, type &lt;em&gt;aws configure&lt;/em&gt; and fill in the access key and secret access key, which you can obtain from the screen above. You will also be asked for the default AWS region you wish to use. If you are curious, your credentials are stored in &lt;em&gt;~/.aws/credentials&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fotjejo7tfahoy4fa5jjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fotjejo7tfahoy4fa5jjn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have pixelated my keys as a matter of good practice. I could equally have left them visible, deleted the user immediately after taking the screenshot, and then recreated it; the new secret key would have been completely different.&lt;/p&gt;

&lt;p&gt;Note that you will need to click 'show' to see the secret access key in the screen above. &lt;em&gt;You are only able to do this once&lt;/em&gt;. If you do not record the key for use in the &lt;em&gt;aws configure&lt;/em&gt; step, you will need to request another one. This is not a big problem; see the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Key / Secret Key access
&lt;/h2&gt;

&lt;p&gt;IAM makes it easy to rotate keys and manage accounts. Having created your user above, you can access it via 'Users' in the IAM menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxw6dpy2bfyz4du5m955e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxw6dpy2bfyz4du5m955e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we select 'EKS' we see the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8mwjrx24k4zltvediy3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8mwjrx24k4zltvediy3l.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have tabbed to 'Security Credentials' above.&lt;/p&gt;

&lt;p&gt;Note that you can make a set of credentials inactive via 'Make Inactive' and request a new set via 'Create Access Key', which again gives you one-time access to the new secret key. Together these operations support key rotation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fccu7wgpj77k7j6eqdgo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fccu7wgpj77k7j6eqdgo6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article we showed you how to set up credentials for eksctl in accordance with the best practice of least privilege. In &lt;a href="https://dev.to/aerospike/aerospike-on-eks-aws-k8s-m5b"&gt;https://dev.to/aerospike/aerospike-on-eks-aws-k8s-m5b&lt;/a&gt; we make use of this when detailing how to set up an Aerospike cluster on EKS.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Multi Record Transactions for Aerospike</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Thu, 16 Jul 2020 11:55:35 +0000</pubDate>
      <link>https://forem.com/aerospike/multi-record-transactions-for-aerospike-49ab</link>
      <guid>https://forem.com/aerospike/multi-record-transactions-for-aerospike-49ab</guid>
      <description>&lt;p&gt;Aerospike is a high performance key-value database. It’s aimed at institutions and use-cases that need high throughput ( 100k tps+), with low latency (95% completion in &amp;lt;1ms), while managing large amounts of data (Tb+) with 100% uptime, scalability and low cost.&lt;/p&gt;

&lt;p&gt;Because that’s what we offer, we don’t, at the server level, support multi-record transactions (MRT). If that’s what you need, you will require two-phase locking and two-phase commit (assuming a distributed system). This will slow your system down and will scale non-linearly: if you double your transaction volume, your time spent waiting for locks will more than double. If Aerospike did that it wouldn’t be a high performance database. In fact it would be just like any number of other databases, and what would be the point in that?&lt;/p&gt;

&lt;p&gt;Furthermore, believe it or not, despite the fact that our customers ask for many things, multi-record transactions are not high on the list. This is because, outside of a relational database, they’re less necessary in practice than people think. In an RDBMS you do need MRT, because you shard your insert/update across tables in a non-natural way. In a key-value or key-object database, an insert/update that might span multiple tables is actually a single record change.&lt;/p&gt;

&lt;p&gt;Even the textbook example of a change that supposedly has to be atomic, the transfer of money between two bank accounts, is not actually atomic in the real world. You can see this for yourself if you contemplate the fact that bank transfers are not instantaneous — far from it. That is because bank transfers cross system boundaries — and the credits and debits are not co-ordinated as a distributed transaction.&lt;/p&gt;

&lt;p&gt;This article is however concerned with multi-record transactions and specifically executing them using Aerospike. Although the server will not natively support them, they can be achieved in software via use of the capabilities Aerospike offers.&lt;/p&gt;

&lt;p&gt;Although, as discussed above, the need for MRT in key value databases is more limited than you might expect, there may well still be use cases that demand it. An example might be a workflow system - taking items off a queue and dispatching to other queues. The transfer of a work item needs to be atomic — you don’t want a work item to potentially be processed twice, or not at all, due to a transactional failure.&lt;/p&gt;

&lt;p&gt;To that end, I’ve put together &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn"&gt;multi-record-txn&lt;/a&gt;, a package that supports atomic multi record updates in Aerospike. At the heart of this is our ability to create locks using the primitives we offer. A record can be created with a CREATE_ONLY flag and this is used for locking purposes — if a lock record for an object already exists, CREATE_ONLY will fail.&lt;/p&gt;
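
&lt;p&gt;To make the idea concrete, here is a toy in-memory sketch of the create-only lock pattern, with a dict standing in for the lock record set. The names are illustrative; this is not the multi-record-txn API itself:&lt;/p&gt;

```python
# Toy sketch of CREATE_ONLY locking: creating the lock "record" fails if
# it already exists, so only one writer can hold it. A dict stands in for
# Aerospike; names are illustrative, not the multi-record-txn API.
class LockExists(Exception):
    pass

locks = {}  # stand-in for the lock record set

def acquire_lock(record_key, txn_id):
    if record_key in locks:           # a CREATE_ONLY put would fail here
        raise LockExists(record_key)
    locks[record_key] = txn_id

def release_lock(record_key):
    locks.pop(record_key, None)

acquire_lock("account:123", "txn-1")
try:
    acquire_lock("account:123", "txn-2")   # second writer is refused
except LockExists:
    print("lock already held")
release_lock("account:123")
```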

&lt;p&gt;We build on this by storing the state of our records prior to update in a transaction record, which is deleted once the updates are complete. Following that, we release our locks.&lt;/p&gt;

&lt;p&gt;Rollback, if needed, is accomplished by restoring the stored values. We supply calls for rolling back records of more than a certain age, and for releasing locks, similarly of a supplied longevity. Full details can be found in the project &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If it’s that easy, why don’t you put it into the product? Good question. The API comes with &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn#caveats"&gt;caveats&lt;/a&gt;, the principal one being that dirty reads are possible, i.e. reads of values that may be rolled back. Specifically, we are not offering isolation. The API does, however, offer the ability to lock records prior to update and to check locks, so in principle, if you were to insist on all gets being preceded by locks, full isolation would be achieved. That said, you wouldn’t have a high performance database any more. The correct thing to do, then, is to use it judiciously, as you would any tool. Also offered is the ability to do optimistic locking by supplying the generation (see the &lt;a href="https://www.aerospike.com/docs/guide/FAQ.html"&gt;Aerospike FAQ&lt;/a&gt;) of the record, with transaction failure occurring if the generation count does not match what is expected. This keeps you safe when other database users may non-transactionally update records, without incurring the overhead of locking.&lt;/p&gt;
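
&lt;p&gt;The generation check can be pictured with another in-memory sketch: refuse the write when the caller's expected generation no longer matches the record's current one. Again, the structure is illustrative, not the real API:&lt;/p&gt;

```python
# Illustrative optimistic-locking check: each write must quote the
# generation it last read; a stale writer is rejected. An in-memory dict
# stands in for Aerospike and its per-record generation counter.
class GenerationMismatch(Exception):
    pass

db = {}  # key: (value, generation)

def put_with_generation(key, value, expected_gen):
    current_gen = db.get(key, (None, 0))[1]
    if current_gen != expected_gen:
        raise GenerationMismatch((key, current_gen, expected_gen))
    db[key] = (value, current_gen + 1)

put_with_generation("acct", 100, expected_gen=0)
put_with_generation("acct", 150, expected_gen=1)
try:
    put_with_generation("acct", 999, expected_gen=1)  # stale writer
except GenerationMismatch:
    print("stale generation rejected")
```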

&lt;p&gt;It is worth saying at this point, that the value of this API is much greater if using Aerospike Enterprise ( rather than Community Edition ) and in particular making use of &lt;a href="https://www.aerospike.com/docs/architecture/consistency.html"&gt;Strong Consistency&lt;/a&gt;. Strong consistency gives you the guarantee that duplicate records in your database ( necessary for resilience purposes ) will not ever experience divergence. If you do not have this guarantee ( which very few databases in our performance range offer ) then there is potential for this to occur in the event of network partitions and process crashes. Divergence of records here would mean locks or transaction records being lost in a sub-cluster experiencing a partition event ( or process crash ). Strong Consistency gives you a guarantee this will not happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Multi record put that will succeed or fail atomically&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6xjx66lxmeqc2gqo72m6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6xjx66lxmeqc2gqo72m6.png" alt="Alt Text" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Atomic multi-record incorporating generation check&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd99squw9ory892yxpyk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd99squw9ory892yxpyk6.png" alt="Alt Text" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what happens if my transaction fails part way through?&lt;/p&gt;

&lt;h2&gt;
  
  
  Rollback
&lt;/h2&gt;

&lt;p&gt;If they can, transactions will unwind themselves. This happens on a lock-acquire or generation-check exception.&lt;/p&gt;

&lt;p&gt;This may not be possible in the event of, for example, a network failure. For that we have the rollback call. This allows (see below) rollback of all transactions older than a user-specified age, together with any orphan locks, i.e. locks not associated with transactions (the absence of a transaction record means the transaction completed, but the transaction process failed while unwinding the locks).&lt;/p&gt;
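
&lt;p&gt;Structurally, the age-based rollback can be sketched like this (in-memory stand-ins throughout; the real call operates against Aerospike sets, and the field names here are invented for illustration):&lt;/p&gt;

```python
# Sketch of age-based rollback: for each transaction record older than the
# cutoff, restore the saved prior values and drop its locks. Dicts stand in
# for Aerospike sets; field names are illustrative.
import time

def rollback_expired(records, txns, locks, max_age_seconds, now=None):
    now = time.time() if now is None else now
    for txn_id, txn in list(txns.items()):
        if now - txn["started"] > max_age_seconds:
            records.update(txn["saved"])      # restore pre-update values
            for key in txn["saved"]:
                locks.pop(key, None)          # release the locks
            del txns[txn_id]

records = {"a": 42}                                   # partially updated
txns = {"t1": {"started": 0.0, "saved": {"a": 7}}}    # pre-update state
locks = {"a": "t1"}
rollback_expired(records, txns, locks, max_age_seconds=60, now=120.0)
print(records)
# prints: {'a': 7}
```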

&lt;p&gt;&lt;em&gt;Rollback of expired transactions / locks&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnuitcyvq0i30gb7zvq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnuitcyvq0i30gb7zvq0.png" alt="Alt Text" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;But how do I know it’s sound? Test it, of course. The &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn"&gt;README&lt;/a&gt; goes into some detail on this point, and the test classes even more so. If you think something has been missed, let me know.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Detail
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn/blob/master/FAQ.md"&gt;FAQ&lt;/a&gt; covers a number of questions asked to date. Please do read it, along with the &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn#caveats"&gt;caveats&lt;/a&gt; section of the &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn"&gt;README&lt;/a&gt; (in fact, all of the README), if you are considering using this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further direction
&lt;/h2&gt;

&lt;p&gt;An obvious next step to take would be to incorporate two-phase locking i.e. supporting shared read locks as well as exclusive write locks in order to reduce contention.&lt;/p&gt;

&lt;p&gt;Another possibility is that the locks are on the records themselves, rather than being separate records — this might optimize single record use.&lt;/p&gt;

&lt;p&gt;Finally, in this &lt;a href="http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html"&gt;post&lt;/a&gt;, Martin Kleppmann notes that this method may be problematic if there is the possibility that transactions can pause for unexpectedly long periods (exceeding the period you would reasonably expect the transaction to complete in). If you are considering using this API, you should consider the points he raises. His suggestion of ‘fencing tokens’ is an option for incorporating into this API if there is further interest.&lt;/p&gt;

&lt;p&gt;To that end, please let me know if you use this API, and then I’ll know if there’s appetite for more.&lt;/p&gt;

&lt;p&gt;Any questions/comments — please feed back through the GitHub &lt;a href="https://github.com/aerospike-examples/atomicMultiRecordTxn/issues"&gt;issues&lt;/a&gt; facility.&lt;/p&gt;

</description>
      <category>aerospike</category>
    </item>
    <item>
      <title>Record Aggregation in Aerospike For Performance and Economy</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Fri, 10 Jul 2020 07:37:29 +0000</pubDate>
      <link>https://forem.com/aerospike/record-aggregation-in-aerospike-for-performance-and-economy-4lk6</link>
      <guid>https://forem.com/aerospike/record-aggregation-in-aerospike-for-performance-and-economy-4lk6</guid>
      <description>&lt;p&gt;A strong differentiator for Aerospike vs other key value databases is its DRAM economy. Every object has a 64 byte DRAM footprint no matter what the size of the actual object is. You can manage a billion objects using only 128Gb of DRAM, allowing for 2x replication.&lt;/p&gt;

&lt;p&gt;Great news! A billion is a pretty big number and 3 * 512GB nodes gets me to 12bn. Within reason I can have as many objects as I like. I should start making them right away with no further thought required.&lt;/p&gt;

&lt;p&gt;Hold your horses, cowboy. It might not be as simple as that.&lt;/p&gt;

&lt;p&gt;For instance, what if your objects are very small? Worst case, they’re so small that they’re of the order of 64 bytes, so now your memory footprint is similar to your disk footprint. It might even be the case that your memory to disk ratio is such that your DRAM is full when your disk is half empty. In a bare metal situation, buy less disk / more DRAM for sure, but you might be in the cloud where you’re stuck with certain DRAM / storage ratios. Or maybe these machines were handed to you for re-purposing.&lt;/p&gt;

&lt;p&gt;A technique informally known as blocking can help you. You store your objects within larger objects. Your API turns this into an implementation detail. Blocking can reduce your memory footprint, helping with the small object use case.&lt;/p&gt;

&lt;p&gt;Aerospike lets you do this by offering a comprehensive &lt;a href="https://www.aerospike.com/docs/guide/cdt-list-ops.html"&gt;List&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/guide/cdt-map-ops.html"&gt;Map&lt;/a&gt; API which allows you to place objects within objects as well as retrieving them in an efficient manner. List structures can be used for storing structured data such as time series of a regular frequency while maps can be used to reduce your key space. The API offered is a distinguishing feature of Aerospike when contrasted with other key value databases.&lt;/p&gt;
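
&lt;p&gt;The idea can be illustrated with a toy in-memory sketch, where a nested dict stands in for an Aerospike record holding a Map bin. The function names here are illustrative, not part of the Aerospike API:&lt;/p&gt;

```python
# Toy illustration of "blocking": many small logical objects stored as
# entries of one larger container object. A nested dict stands in for an
# Aerospike record with a Map bin; names are illustrative only.
containers = {}

def put_device(container_key, device_id, metadata):
    containers.setdefault(container_key, {})[device_id] = metadata

def get_device(container_key, device_id):
    # analogous to a server-side map getByKey: only the requested entry
    # comes back, not the whole container
    return containers[container_key][device_id]

put_device(17, "dev-001", {"firmware": "1.2.0"})
put_device(17, "dev-002", {"firmware": "1.1.9"})
print(get_device(17, "dev-001"))
```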

&lt;p&gt;Let’s look again at our example. Suppose your key space is composed of device ids, and these are fundamentally &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUIDs&lt;/a&gt; — 128 bit numbers or 32 digit hexadecimal numbers. Let’s say you anticipate you may need to store as many as 15bn of these, but each record is only around 200 bytes. Your DRAM requirement with Aerospike would be of the order of&lt;/p&gt;

&lt;p&gt;&lt;em&gt;64(bytes) * 15bn * 2(replication) = ~2Tb&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not the end of the world, but you could do better by working smarter.&lt;/p&gt;

&lt;p&gt;Assume also that we want to keep our physical object size below 128kb — a good &lt;a href="https://discuss.aerospike.com/t/faq-write-block-size/681"&gt;starting point&lt;/a&gt; for optimal block size on a flash device, which is recommended for Aerospike. We can fit 655 of our 200-byte objects into 128kb.&lt;br&gt;
If each physical object contains 655 actual objects, we require 15bn / 655 = 22.9m container object keys. The question then is how we map from a device id to the container (physical) object key, and how we reliably look up a logical object inside the container object. The answer is that we do this using &lt;a href="https://en.wikipedia.org/wiki/Mask_(computing)"&gt;bit-masking&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A 128 bit key space can be converted into a key space of size 2, 4, …, 65536, …, 2²⁰, … keys by AND-ing the key with a binary number composed of 1, 2, …, 16, …, 20, … leading ones followed by trailing zeros. For our example we need a bit mask whose key space is the first power of two above our required number of keys, which can be calculated as&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ceiling(log(22.9 *10⁶) / log(2)) = 25 bits&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This gives us a key space of size 2²⁵ = ~33.5m so we’ve got our maths correct.&lt;/p&gt;

&lt;p&gt;Let’s look at how we make use of this in Aerospike&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This function stores our device metadata inside a physical object. As described, the physical object key is derived using bit masking. Note this is efficient from a network capacity point of view — only the metadata gets sent across the network, not the full physical object.&lt;/p&gt;

&lt;p&gt;We also need to see how to retrieve our object&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The construction of the physical key is as before. This time we use the &lt;em&gt;getByKey&lt;/em&gt; operation to retrieve the device metadata.&lt;/p&gt;

&lt;p&gt;An important point to note is that only the metadata requested is transmitted across the network not the entire physical object. This consideration applies in general to the calls offered by the List/Map API. This is what we mean by ‘economy’ in the article title.&lt;/p&gt;

&lt;p&gt;Finally, a code snippet showing how to calculate the bit mask using the ‘natural’ inputs.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
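
&lt;p&gt;As a hedged sketch, the sizing arithmetic and the device-id-to-container-key mapping described above can be written as follows. The helper name is illustrative, and integer division by a power of two stands in for the AND-and-shift:&lt;/p&gt;

```python
import math
import uuid

total_objects = 15_000_000_000   # 15bn logical objects
objects_per_block = 655          # 200-byte objects per 128kb block

# container keys needed, and the first power of two that covers them
container_keys_needed = math.ceil(total_objects / objects_per_block)
mask_bits = math.ceil(math.log2(container_keys_needed))
print(container_keys_needed, mask_bits, 2 ** mask_bits)
# prints: 22900764 25 33554432

def container_key(device_id, bits=mask_bits):
    # keep the leading `bits` bits of the 128-bit UUID; dividing by
    # 2 ** (128 - bits) is equivalent to AND-ing with a mask of `bits`
    # leading ones and then dropping the trailing zeros
    return device_id.int // 2 ** (128 - bits)

print(container_key(uuid.UUID("123e4567-e89b-12d3-a456-426614174000")))
```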


&lt;p&gt;The net benefit of all the above is that the memory footprint will be reduced in this case to &lt;/p&gt;

&lt;p&gt;&lt;em&gt;2²⁵ (keys) * 64 (DRAM cost per record) * 2 (rep factor) = 4Gb&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;from &lt;/p&gt;

&lt;p&gt;&lt;em&gt;15bn * 64 (DRAM cost per record) * 2 (rep factor) = ~1.8TB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reduction factor is 447, slightly less than the '655' quoted above as on average, our 128kb blocks will not be completely filled.&lt;/p&gt;

&lt;p&gt;Before we close, worth noting our Enterprise only &lt;a href="https://www.aerospike.com/docs/operations/configure/namespace/index/index.html#flash-index"&gt;‘all flash’&lt;/a&gt; capability which allows both index and data to be placed on disk thus reducing DRAM usage to very low levels. This was developed specifically with the use cases of small objects and/or very large numbers of objects (~10¹² = 1 trillion ) in mind. It will engender higher levels of latency (~5ms vs ~1ms at the 95th percentile ) but it’s still competitive vs any other database out there.&lt;/p&gt;

&lt;p&gt;The above solution is a good example of a differentiating feature, our &lt;a href="https://www.aerospike.com/docs/guide/cdt-list-ops.html"&gt;List&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/guide/cdt-map-ops.html"&gt;Map&lt;/a&gt; API , providing a distinguishing optimisation under constraints. The technique of ‘blocking’ can also be made use of for time series data which I hope to explore in a future article.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image with thanks to &lt;a href="https://unsplash.com/@nananadolgo"&gt;Nana Smirnova&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aerospike</category>
    </item>
    <item>
      <title>Ansible For Aerospike</title>
      <dc:creator>Ken Tune</dc:creator>
      <pubDate>Wed, 01 Jul 2020 07:59:31 +0000</pubDate>
      <link>https://forem.com/aerospike/ansible-for-aerospike-43ln</link>
      <guid>https://forem.com/aerospike/ansible-for-aerospike-43ln</guid>
      <description>&lt;p&gt;Working as a Solution Architect for Aerospike I often have occasion to create Aerospike clusters on the fly. These might be for benchmarking, demonstration purposes or investigation of a query from a prospect or customer.&lt;/p&gt;

&lt;p&gt;Although Aerospike is straightforward to &lt;a href="https://www.aerospike.com/docs/operations/index.html"&gt;configure and install&lt;/a&gt;, the relatively small number of steps you have to go through does begin to add up in aggregate. This becomes more significant when you might add in the need to configure TLS, encryption on disk, strong consistency, java benchmarking client or rack awareness to name a few of the options available.&lt;/p&gt;

&lt;p&gt;One possibility is to make use of &lt;a href="https://github.com/aerospike/aerospike-kubernetes/tree/master/helm"&gt;Aerospike Kubernetes&lt;/a&gt; but if what you’re doing demands high levels of performance, casual use will steer you towards using VMs.&lt;/p&gt;

&lt;p&gt;To that end I’ve put together &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;aerospike-ansible&lt;/a&gt; a collection of Ansible scripts allowing configurable automation of the build of Aerospike clusters. The focus is on doing this on AWS as that’s the platform that Ansible best supports, but with a little &lt;a href="https://docs.ansible.com/ansible/2.3/intro_inventory.html"&gt;inventory&lt;/a&gt; nous you can leverage these scripts on bare metal or alternate cloud provider environments*.&lt;/p&gt;

&lt;p&gt;The scripts go beyond simply building clusters. The configurable deployment of Aerospike &lt;a href="https://github.com/aerospike/aerospike-client-java/tree/master/benchmarks"&gt;Java benchmarking&lt;/a&gt; clients and the Prometheus/Grafana based &lt;a href="https://github.com/aerospike/aerospike-monitoring"&gt;Aerospike Monitoring&lt;/a&gt; is also handled by the repository.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike-examples/aerospike-ansible"&gt;README&lt;/a&gt; goes into full detail, but key supported options are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance type&lt;/li&gt;
&lt;li&gt;Hosts per AZ&lt;/li&gt;
&lt;li&gt;Community/Enterprise&lt;/li&gt;
&lt;li&gt;Encryption at rest (*)&lt;/li&gt;
&lt;li&gt;TLS&lt;/li&gt;
&lt;li&gt;Strong Consistency — including roster setup with rack awareness (*)&lt;/li&gt;
&lt;li&gt;Prometheus/Grafana monitoring stack&lt;/li&gt;
&lt;li&gt;Aerospike Version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(*) Aerospike Enterprise Only&lt;/p&gt;

&lt;p&gt;There’s also a &lt;a href="https://youtu.be/fWKACehyJHc"&gt;video&lt;/a&gt; showing end to end setup of the full Aerospike cluster/client/monitoring stack on a fresh Vagrant instance. It’s around 25 minutes long but it also goes through possible wrinkles relating to installation of Ansible plus &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html"&gt;IAM&lt;/a&gt; setup. Colleagues tell me these scripts take as little as 5 minutes to use for the first time, even with zero knowledge of Ansible.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/fWKACehyJHc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I’m envisaging this being helpful to those looking to get Aerospike up and running for the first time. It allows easy setup for proof of concept and development work, while providing a production grade facility. You can tear it down just as easily as you stood it up, keeping costs low.&lt;/p&gt;

&lt;p&gt;Providing additional recipes for operational procedures such as rolling upgrades or cluster migration is very much on the cards, so I hope to get further assets and blog posts out in this area.&lt;/p&gt;

&lt;p&gt;Any questions/comments — please feed back through the GitHub &lt;a href="https://github.com/aerospike-examples/aerospike-ansible/issues"&gt;issues&lt;/a&gt; facility.&lt;/p&gt;

&lt;p&gt;[*] See the README for Google Cloud Platform details&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>ansible</category>
      <category>devops</category>
      <category>nosql</category>
    </item>
  </channel>
</rss>
