<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: DolphinDB</title>
    <description>The latest articles on Forem by DolphinDB (@dolphindb).</description>
    <link>https://forem.com/dolphindb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F766486%2Fde2034e3-e2f4-433c-8a5b-4b1330770161.png</url>
      <title>Forem: DolphinDB</title>
      <link>https://forem.com/dolphindb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dolphindb"/>
    <language>en</language>
    <item>
      <title>Introduction to backtesting strategy: Historical data replay in DolphinDB</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Thu, 29 Aug 2024 03:08:33 +0000</pubDate>
      <link>https://forem.com/dolphindb/introduction-to-backtesting-strategy-historical-data-replay-in-dolphindb-1ole</link>
      <guid>https://forem.com/dolphindb/introduction-to-backtesting-strategy-historical-data-replay-in-dolphindb-1ole</guid>
      <description>&lt;p&gt;In DolphinDB, we can import historical data into a stream table in chronological order as “real-time data” so that the same script can be used both for backtesting and real-time trading. Regarding streaming in DolphinDB please refer to &lt;em&gt;&lt;a href="https://github.com/dolphindb/Tutorials_EN/blob/master/streaming_tutorial.md" rel="noopener noreferrer"&gt;DolphinDB Streaming Tutorial&lt;/a&gt;&lt;/em&gt; .&lt;/p&gt;

&lt;p&gt;This article introduces functions replay and replayDS and then demonstrates the process of data replaying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p90r0kgn2jvyyo8mbbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p90r0kgn2jvyyo8mbbj.png" alt="Image description" width="720" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Functions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;replay&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replay(inputTables, outputTables, [dateColumn], [timeColumn], [replayRate], [absoluteRate=true], [parallelLevel=1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Function replay injects data from specified tables or data sources into stream tables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;‘inputTables’ is a table or a tuple. Each element of the tuple is an unpartitioned table or a data source generated by function replayDS.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘outputTables’ is a table or a tuple of tables, or a string or a string vector. The number of elements of outputTables must be the same as the number of elements of inputTables. If it is a vector, it is a list of the names of the shared stream tables where the replayed data of the corresponding tables of inputTables are saved. If it is a tuple, each element is a shared stream table where the replayed data of the corresponding table in inputTables are saved. The schema of each table in outputTables must be identical to the schema of the corresponding table in inputTables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘dateColumn’ and ‘timeColumn’ are strings indicating the date column and time column in inputTables. If neither is specified, the first column of the table is chosen as ‘dateColumn’. If there is a ‘dateColumn’, it must be one of the partitioning columns. If only ‘timeColumn’ is specified, it must be one of the partitioning columns. If information about date and time comes from the same column (e.g., DATETIME, TIMESTAMP), use the same column for both ‘dateColumn’ and ‘timeColumn’. Data are replayed in batches determined by the smallest unit of time in ‘timeColumn’ (or ‘dateColumn’ if ‘timeColumn’ is not specified). For example, if the smallest unit of time in ‘timeColumn’ is second, then all data in the same second are replayed in the same batch; if ‘timeColumn’ is not specified, then all data in the same day are replayed in the same batch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘replayRate’ is an integer indicating the number of rows to be replayed per second. If it is unspecified or negative, data are replayed at the maximum speed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘absoluteRate’ is a Boolean value indicating how ‘replayRate’ is interpreted. The default value is true.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regarding ‘replayRate’ and ‘absoluteRate’:&lt;/p&gt;

&lt;p&gt;(1) If ‘replayRate’ is a positive integer and absoluteRate=true, replay at the speed of ‘replayRate’ rows per second.&lt;/p&gt;

&lt;p&gt;(2) If ‘replayRate’ is a positive integer and absoluteRate=false, replay at ‘replayRate’ times the original speed of the data. For example, if the difference between the maximum and the minimum values of ‘dateColumn’ or ‘timeColumn’ is n seconds, then it takes n/replayRate seconds to finish the replay.&lt;/p&gt;

&lt;p&gt;(3) If ‘replayRate’ is unspecified or negative, replay at the maximum speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;‘parallelLevel’ is a positive integer. When the size of individual partitions in the data sources is too large relative to memory size, we need to use function replayDS to further divide individual partitions into smaller data sources. 'parallelLevel' indicates the number of threads loading data into memory from these smaller data sources simultaneously. The default value is 1. If 'inputTables' is a table or a tuple of tables, the effective 'parallelLevel' is always 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;replayDS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replayDS(sqlObj, [dateColumn], [timeColumn], [timeRepartitionSchema])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Function replayDS generates a group of data sources to be used as the inputs of function replay. It splits a SQL query into multiple subqueries based on 'timeRepartitionSchema' over 'timeColumn' within each 'dateColumn' partition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;‘sqlObj’ is a table or metacode with a SQL statement indicating the data to be replayed. The table object of “select from” must use a DATE type column as one of the partitioning columns.&lt;/li&gt;
&lt;li&gt;‘dateColumn’ and ‘timeColumn’ are strings indicating the date column and time column. If neither is specified, the first column of the table is chosen as ‘dateColumn’. If there is a ‘dateColumn’, it must be one of the partitioning columns. If only ‘timeColumn’ is specified, it must be one of the partitioning columns. If information about date and time comes from the same column (e.g., DATETIME, TIMESTAMP), use the same column for both ‘dateColumn’ and ‘timeColumn’. Function replayDS and the corresponding function replay must use the same set of 'dateColumn' and 'timeColumn'.&lt;/li&gt;
&lt;li&gt;‘timeRepartitionSchema’ is a TIME or NANOTIME type vector. ‘timeRepartitionSchema’ delimits multiple data sources on the dimension of ‘timeColumn’ within each ‘dateColumn’ partition. For example, if timeRepartitionSchema=[t1, t2, t3], then there are 4 data sources within a day: [00:00:00.000, t1), [t1, t2), [t2, t3) and [t3, 23:59:59.999).&lt;/li&gt;
&lt;/ul&gt;
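
&lt;p&gt;As a sketch of how ‘timeRepartitionSchema’ is typically supplied (the database path and table name here are assumptions for illustration), the following splits each day of a hypothetical ‘trades’ table into hourly data sources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// hypothetical database path and table name
trades = loadTable("dfs://trades_db", "trades")
// 9 time boundaries produce 10 data sources within each 'date' partition
ds = replayDS(&amp;lt;select * from trades&amp;gt;, `date, `time, 08:00:00.000 + (1..9) * 3600000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;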

&lt;p&gt;&lt;strong&gt;Replay a single in-memory table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replay(inputTable, outputTable, `date, `time, 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Replay a single table using data sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To replay a single table with a large number of rows, we can use function replayDS together with function replay. Function replayDS delimits multiple data sources on the dimension of 'timeColumn' within each 'dateColumn' partition. Parameter 'parallelLevel' of function replay specifies the number of threads loading data into memory from these smaller data sources simultaneously. In this example, 'parallelLevel' is set to 2.&lt;/p&gt;
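
&lt;p&gt;The steps above can be sketched as follows (a minimal sketch, assuming a hypothetical table ‘trades’ and a shared stream table ‘st’ with the same schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ds = replayDS(&amp;lt;select * from trades&amp;gt;, `date, `time, 08:00:00.000 + (1..9) * 3600000)
// replay at 1000 rows per second with 2 loading threads
replay(ds, st, `date, `time, 1000, true, 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;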

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqu7ln2m0ce0760kvx6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqu7ln2m0ce0760kvx6z.png" alt="Image description" width="632" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay multiple tables simultaneously using data sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To replay multiple tables simultaneously, assign a tuple of these table names to parameter ‘inputTables’ of function replay and specify the output tables. Each of the output tables corresponds to an input table and should have the same schema as the corresponding input table. All input tables should have identical 'dateColumn' and 'timeColumn'.&lt;/p&gt;
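
&lt;p&gt;A minimal sketch of replaying two tables together (the table and stream-table names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ds1 = replayDS(&amp;lt;select * from trades&amp;gt;, `date, `time, 08:00:00.000 + (1..9) * 3600000)
ds2 = replayDS(&amp;lt;select * from quotes&amp;gt;, `date, `time, 08:00:00.000 + (1..9) * 3600000)
// each output stream table must match the schema of its input table
replay([ds1, ds2], [st1, st2], `date, `time, 1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;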

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frizqoi2b3mple9zhnwbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frizqoi2b3mple9zhnwbz.png" alt="Image description" width="676" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cancel replay&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If function replay was called with submitJob, we can use getRecentJobs to get jobId, then cancel the replay with command cancelJob.&lt;/p&gt;
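
&lt;p&gt;For example (the job alias and description below are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;submitJob("replayJob", "replay trades", replay, ds, st, `date, `time, 1000)
getRecentJobs()      // look up the jobId of the replay job
cancelJob(jobId)     // jobId obtained from the output of getRecentJobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;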

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt7x72fj2svzp22vpcup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt7x72fj2svzp22vpcup.png" alt="Image description" width="188" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If function replay was called directly, we can use getConsoleJobs in another GUI session to get the jobId, then cancel the replay with command cancelConsoleJob.&lt;/p&gt;
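
&lt;p&gt;In the second session, the sketch looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs = getConsoleJobs()                // run in another session
cancelConsoleJob(jobs.rootJobId[0])    // assumes the replay is the first listed job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;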

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyamotibcztc90oxcxne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyamotibcztc90oxcxne.png" alt="Image description" width="272" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. How to use replayed data
&lt;/h2&gt;

&lt;p&gt;Replayed data are streaming data. We can subscribe to and process the replayed data in the following 3 ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscribe in DolphinDB. Write user-defined functions in DolphinDB to process streaming data.&lt;/li&gt;
&lt;li&gt;Subscribe in DolphinDB. To conduct real-time calculations with streaming data, use DolphinDB’s built-in streaming aggregators such as time-series aggregator, cross-sectional aggregator and anomaly detection engine. They are very easy to use and have excellent performance. In section 3.2, we use a cross-sectional aggregator to calculate the intrinsic value of an ETF.&lt;/li&gt;
&lt;li&gt;Subscribe with a third-party client through DolphinDB’s streaming API.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Replay level 1 stock quotes to calculate ETF intrinsic value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this example, we replay the level 1 stock quotes in US stock markets on 2007/08/17, and calculate the intrinsic value of an ETF with the built-in cross-sectional aggregator in DolphinDB. The following shows the schema of the input table ‘quotes’ and a preview of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpz4be23nmw34q0t50fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpz4be23nmw34q0t50fg.png" alt="Image description" width="502" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246lj3gxmrk3wv8nge7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246lj3gxmrk3wv8nge7q.png" alt="Image description" width="266" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhogib5rgcwgvpqpj3nh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhogib5rgcwgvpqpj3nh.png" alt="Image description" width="492" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5h65cdv20uh8cwxnfam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5h65cdv20uh8cwxnfam.png" alt="Image description" width="720" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(1) To replay a large amount of data, loading all data into memory first may cause an out-of-memory problem. Instead, we first use function replayDS and specify parameter 'timeRepartitionSchema' to divide the data into 60 parts based on the column 'time'.&lt;/p&gt;
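
&lt;p&gt;A sketch of this step (the database path "dfs://TAQ" and table name are assumptions; cutPoints generates 61 boundaries, i.e. 60 buckets over the trading day):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;quotes = loadTable("dfs://TAQ", "quotes")
trs = cutPoints(09:30:00.000..16:00:00.000, 60)
rds = replayDS(&amp;lt;select * from quotes&amp;gt;, `date, `time, trs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;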

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4otlnpqljomd2t94rlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4otlnpqljomd2t94rlu.png" alt="Image description" width="653" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(2) Define the output stream table ‘outQuotes’.&lt;/p&gt;
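
&lt;p&gt;A possible definition that reuses the schema of ‘quotes’ (the capacity 100:0 is an arbitrary choice):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sch = select name, typeString as type from quotes.schema().colDefs
share streamTable(100:0, sch.name, sch.type) as outQuotes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;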

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6zjm8nneex9vs6eqxdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6zjm8nneex9vs6eqxdy.png" alt="Image description" width="626" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(3) Define a dictionary of the ETF component weights and function etfVal to calculate the ETF intrinsic value. For simplicity we use an ETF with only 6 component stocks.&lt;/p&gt;
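
&lt;p&gt;A minimal sketch with illustrative weights (the tickers come from the example below; the weights are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defg etfVal(weights, sym, price) {
    return wsum(price, weights[sym])
}
weights = dict(STRING, DOUBLE)
weights[`AAPL] = 0.1
weights[`IBM]  = 0.1
weights[`MSFT] = 0.1
weights[`NTES] = 0.1
weights[`AMZN] = 0.1
weights[`GOOG] = 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;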

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fero0khk6gt73fdhj5v7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fero0khk6gt73fdhj5v7z.png" alt="Image description" width="454" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(4) Define a streaming aggregator to subscribe to the output stream table ‘outQuotes’. We specify a filtering condition for the subscription so that only data with stock symbols AAPL, IBM, MSFT, NTES, AMZN or GOOG are published to the aggregator. This significantly reduces unnecessary network overhead and data transfer.&lt;/p&gt;
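
&lt;p&gt;A hedged sketch of the filtered subscription (the aggregator handle ‘etfAggr’ is assumed to have been created with DolphinDB’s cross-sectional engine; setStreamTableFilterColumn designates the column that the ‘filter’ parameter applies to):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;setStreamTableFilterColumn(outQuotes, `Symbol)
subscribeTable(tableName="outQuotes", actionName="etfDemo", offset=-1,
    handler=append!{etfAggr}, msgAsTable=true, filter=`AAPL`IBM`MSFT`NTES`AMZN`GOOG)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;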

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp42m35audgh3yvax0u1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp42m35audgh3yvax0u1.png" alt="Image description" width="619" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(5) Start replaying data at the specified speed of 100,000 rows per second. The streaming aggregator conducts real-time calculations with the replayed data.&lt;/p&gt;
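
&lt;p&gt;Submitted as a background job, this step might look like the following (the job alias is an assumption; ‘outQuotes’ is passed by name per the string form of ‘outputTables’):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;submitJob("replay_quotes", "replay quotes at 100k rows/s", replay, rds, `outQuotes, `date, `time, 100000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;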

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt25a1pbxjwm65qfp5kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt25a1pbxjwm65qfp5kt.png" alt="Image description" width="658" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(6) Check ETF intrinsic values&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07z8ujazkawggtqahlbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07z8ujazkawggtqahlbo.png" alt="Image description" width="343" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe60enjdifpb5m3qy6glk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe60enjdifpb5m3qy6glk.png" alt="Image description" width="284" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Performance testing
&lt;/h2&gt;

&lt;p&gt;We tested data replaying in DolphinDB on a server with the following configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server: DELL PowerEdge R730xd&lt;/li&gt;
&lt;li&gt;CPU: Intel Xeon(R) CPU E5-2650 v4 (24 cores, 48 threads, 2.20 GHz)&lt;/li&gt;
&lt;li&gt;RAM: 512 GB (32 GB × 16, 2666 MHz)&lt;/li&gt;
&lt;li&gt;Hard disk: 17 TB HDD (1.7 TB × 10, read speed 222 MB/s, write speed 210 MB/s)&lt;/li&gt;
&lt;li&gt;Network: 10 Gigabit Ethernet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DolphinDB script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4htq53ysqwtf1reogvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4htq53ysqwtf1reogvm.png" alt="Image description" width="666" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When replaying at maximum speed (parameter ‘replayRate’ is not specified) and the output table is not subscribed, it only takes about 100 seconds to replay 336,305,414 rows of data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dolphindb/Tutorials_EN/blob/master/historical_data_replay.md?source=post_page-----497e24af596d--------------------------------" rel="noopener noreferrer"&gt;https://github.com/dolphindb/Tutorials_EN/blob/master/historical_data_replay.md?source=post_page-----497e24af596d--------------------------------&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dolphindb</category>
      <category>backtesting</category>
      <category>datareplay</category>
      <category>marketdata</category>
    </item>
    <item>
      <title>Helpful Tools for Quant丨Efficiently Calculate Transaction Costs from Tick Data</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Wed, 28 Aug 2024 01:48:54 +0000</pubDate>
      <link>https://forem.com/dolphindb/helpful-tools-for-quantgun-efficiently-calculate-transaction-costs-from-tick-data-3ggb</link>
      <guid>https://forem.com/dolphindb/helpful-tools-for-quantgun-efficiently-calculate-transaction-costs-from-tick-data-3ggb</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwexqcy83g5pckqqhzopd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwexqcy83g5pckqqhzopd.png" alt="Image description" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The calculation of transaction costs from tick data often involves two tables: trade and nbbo. As the timestamps of both tables are at nanosecond level, there are virtually no exact matches between the timestamps of the two tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3bh3f6e3dfzhxohe9os.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3bh3f6e3dfzhxohe9os.png" alt="Image description" width="720" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, in order to calculate transaction costs, we need to locate the most recent quote before each trade (of the same stock). We may also need to calculate the average quotes within a specific window relative to each trade. &lt;strong&gt;These non-exact joins are frequently used in quant finance, but they are not supported in most databases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ui0d7icjmkuyozyqw0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ui0d7icjmkuyozyqw0h.png" alt="Image description" width="720" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, DolphinDB provides you with &lt;strong&gt;asof join&lt;/strong&gt; and &lt;strong&gt;window join&lt;/strong&gt; for these scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gnixdc2e20m44yospbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gnixdc2e20m44yospbn.png" alt="Image description" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltkt1tc3wuftgt651qgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltkt1tc3wuftgt651qgs.png" alt="Image description" width="720" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take a look at the following case!&lt;/p&gt;

&lt;p&gt;The data used in this example is the &lt;strong&gt;high-frequency data&lt;/strong&gt; from the &lt;strong&gt;New York Stock Exchange&lt;/strong&gt;, consisting of two tables: trade and nbbo, respectively containing 27 million and 78 million records. The DolphinDB script is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trade = loadTable("dfs://EQY", "trade")
select count(*) from trade

nbbo = loadTable("dfs://EQY", "nbbo")
select count(*) from nbbo

// asof join
timer TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_aj from aj(trade,nbbo,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol

// window join
timer TC2 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_wj from pwj(trade,nbbo,-100000000:0,&amp;lt;[avg(Offer_Price) as Offer_Price, avg(Bid_Price) as Bid_Price]&amp;gt;,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol

select * from ej(TC1,TC2,`symbol) where symbol in `AAPL`MS`EBAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just &lt;strong&gt;one line of script&lt;/strong&gt; can implement complex calculation logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// asof join
timer TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_aj from aj(trade,nbbo,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It takes &lt;strong&gt;339 milliseconds&lt;/strong&gt; to complete the calculation of transaction costs with &lt;strong&gt;asof join&lt;/strong&gt;, which is more than &lt;strong&gt;100 times faster than the equivalent calculation in Python pandas&lt;/strong&gt;, and the script to calculate with &lt;strong&gt;window join&lt;/strong&gt; takes &lt;strong&gt;402 milliseconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figb3qeanqsnz0gy8v1jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figb3qeanqsnz0gy8v1jh.png" alt="Image description" width="720" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While calculating transaction costs, DolphinDB delivers excellent performance with concise code. For a visual demonstration of the operations covered in this article, take a minute to watch this demo!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/2HfTRLDYUrY" rel="noopener noreferrer"&gt;https://youtu.be/2HfTRLDYUrY&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dolphindb</category>
      <category>database</category>
      <category>stockmarket</category>
      <category>hft</category>
    </item>
    <item>
      <title>Mastering Pairwise Correlations Calculation of Securities Through Coding</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Tue, 27 Aug 2024 07:23:22 +0000</pubDate>
      <link>https://forem.com/dolphindb/mastering-pairwise-correlations-calculation-of-securities-through-coding-251l</link>
      <guid>https://forem.com/dolphindb/mastering-pairwise-correlations-calculation-of-securities-through-coding-251l</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futjhqcguw56snu0kmdsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futjhqcguw56snu0kmdsr.png" alt="Image description" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Calculating pairwise correlations of securities sheds light on the relationship between different securities, assisting in investment decision-making and risk management. In this article, you’re going to learn &lt;strong&gt;how to calculate the pairwise correlations of multiple securities using high frequency data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The tool we use is &lt;a href="https://www.dolphindb.com/alone/alone.php?id=75" rel="noopener noreferrer"&gt;DolphinDB&lt;/a&gt; and the data we use is the &lt;strong&gt;high-frequency quote data&lt;/strong&gt; of US stocks on August 1, 2007. The raw data is 16.1 GB with 380 million records. We will calculate the pairwise correlations of the &lt;strong&gt;most actively traded 500 stocks&lt;/strong&gt; in that day.&lt;/p&gt;

&lt;p&gt;Let’s delve into the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(*) from loadTable("dfs://TAQ", "quotes") where date= 2007.08.01

def getStockCorrelation(dateValue, num){
    quotes = loadTable("dfs://TAQ", "quotes")

    syms = (exec count(*) from quotes where date = dateValue, time between 09:30:00 : 15:59:59, 0&amp;lt;bid, bid&amp;lt;ofr, ofr&amp;lt;bid*1.1 group by Symbol order by count desc).Symbol[0:num]
    priceMatrix = exec avg(bid + ofr)/2.0 as price from quotes where date = dateValue, Symbol in syms, 0&amp;lt;bid, bid&amp;lt;ofr, ofr&amp;lt;bid*1.1, time between 09:30:00 : 15:59:59 pivot by time.minute() as minute, Symbol
    retMatrix = each(def(x):ratios(x)-1, priceMatrix)
    correlationMatrix = corrMatrix(retMatrix[1:,].ffill())

    mostCorrelated = select * from table(correlationMatrix.columnNames() as sym, correlationMatrix).unpivot(`sym, syms).rename!(`sym`corrSym`corr) context by sym having rank(corr,false) between 1:10
    return mostCorrelated
}

dateValue = 2007.08.01
num = 500
timer mostCorrelated = getStockCorrelation(dateValue,num)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we generate a vector containing the tickers of the 500 stocks with the largest number of quote records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;syms = (exec count(*) from quotes where date = dateValue, time between 09:30:00 : 15:59:59, 0&amp;lt;bid, bid&amp;lt;ofr, ofr&amp;lt;bid*1.1 group by Symbol order by count desc).Symbol[0:num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F236bl1skzfylgqirigq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F236bl1skzfylgqirigq0.png" alt="Image description" width="720" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why do we use the “exec” statement?&lt;/p&gt;

&lt;p&gt;The syntax of the “exec” statement is the same as that of the “select” statement, except that &lt;strong&gt;select&lt;/strong&gt; always returns a &lt;strong&gt;table&lt;/strong&gt;, whereas &lt;strong&gt;exec&lt;/strong&gt; can return a &lt;strong&gt;scalar, vector, matrix, or table&lt;/strong&gt;, which is more convenient for certain operations.&lt;/p&gt;

&lt;p&gt;Next, we select the records for these 500 stocks and filter out dirty data, then generate a &lt;strong&gt;minute-level mid-price matrix&lt;/strong&gt; with stock tickers as column labels and minutes as row labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;priceMatrix = exec avg(bid + ofr)/2.0 as price from quotes where date = dateValue, Symbol in syms, 0&amp;lt;bid, bid&amp;lt;ofr, ofr&amp;lt;bid*1.1, time between 09:30:00 : 15:59:59 pivot by time.minute() as minute, Symbol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifwgjd0rukbgd0h8rikr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifwgjd0rukbgd0h8rikr.png" alt="Image description" width="720" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then convert the minute-level mid-price matrix into a &lt;strong&gt;stock return matrix&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retMatrix = each(def(x):ratios(x)-1, priceMatrix)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqdapkfe7qtr7qlweil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckqdapkfe7qtr7qlweil.png" alt="Image description" width="720" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on the stock return matrix, we calculate the correlation between every two columns to get a &lt;strong&gt;500 by 500 pairwise correlation matrix&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;correlationMatrix = corrMatrix(retMatrix[1:,].ffill())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcqyprjb2j7umpxo3y24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcqyprjb2j7umpxo3y24.png" alt="Image description" width="720" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, it takes only four lines of code to get the pairwise correlation matrix from the high-frequency data in DolphinDB.&lt;/p&gt;
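&lt;p&gt;For readers more familiar with Python, the same pipeline can be sketched roughly in pandas. This is an illustrative re-implementation, not DolphinDB code: the DataFrame &lt;code&gt;quotes&lt;/code&gt; with columns &lt;code&gt;minute&lt;/code&gt;, &lt;code&gt;Symbol&lt;/code&gt;, &lt;code&gt;bid&lt;/code&gt;, and &lt;code&gt;ofr&lt;/code&gt; is an assumed in-memory equivalent of the partitioned table, and the function name is ours.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def pairwise_correlations(quotes, num=500):
    # Keep valid quotes only (mirrors the filter 0 < bid < ofr < bid*1.1).
    q = quotes[(quotes["bid"] > 0) & (quotes["bid"] < quotes["ofr"])
               & (quotes["ofr"] < quotes["bid"] * 1.1)].copy()
    # The `num` most actively quoted symbols.
    syms = q["Symbol"].value_counts().index[:num]
    q = q[q["Symbol"].isin(syms)]
    # Minute-level mid-price matrix: minutes as rows, symbols as columns.
    q["mid"] = (q["bid"] + q["ofr"]) / 2.0
    price = q.pivot_table(index="minute", columns="Symbol", values="mid", aggfunc="mean")
    # Minute returns, dropping the first (all-NaN) row and forward-filling gaps.
    ret = price.pct_change(fill_method=None).iloc[1:].ffill()
    # Pairwise correlation matrix across symbols.
    return ret.corr()
```

&lt;p&gt;Note that the DolphinDB version runs distributed over the partitioned table, which this in-memory sketch does not attempt to replicate.&lt;/p&gt;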

&lt;p&gt;In addition, running the script only takes &lt;strong&gt;2.5 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57kbsq5atl8e7iylzk4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57kbsq5atl8e7iylzk4p.png" alt="Image description" width="720" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To verify the accuracy, we get the 3 stocks with the highest correlation with Lehman Brothers from the pairwise correlation matrix. The outputs are Morgan Stanley, Goldman Sachs, and Merrill Lynch, which are all investment bank stocks that are supposed to be highly correlated with Lehman Brothers.&lt;/p&gt;

&lt;p&gt;In this example, DolphinDB demonstrates excellent performance on large-scale data analysis with elegant code and minimal coding effort. The calculation process can be encapsulated into a function to be used in scripts as well as through various APIs such as Python, C++, Java, C#, Go, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addFunctionView(getStockCorrelation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow the video below to try the efficient calculation 👇&lt;br&gt;
&lt;a href="https://youtu.be/kmz5OsoDBs0" rel="noopener noreferrer"&gt;https://youtu.be/kmz5OsoDBs0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>stockmarket</category>
      <category>coding</category>
      <category>dolphindb</category>
    </item>
    <item>
      <title>Accelerating Data Analysis: Embracing the Efficient Query and Aggregation Calculations</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Mon, 26 Aug 2024 08:18:06 +0000</pubDate>
      <link>https://forem.com/dolphindb/accelerating-data-analysis-embracing-the-efficient-query-and-aggregation-calculations-5ed6</link>
      <guid>https://forem.com/dolphindb/accelerating-data-analysis-embracing-the-efficient-query-and-aggregation-calculations-5ed6</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfjblv0t0sca60aziezi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfjblv0t0sca60aziezi.png" alt="Image description" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Efficient query and calculation capabilities play a pivotal role in handling large-scale datasets, and ensuring timely and accurate data analysis. In this article, we will explore &lt;strong&gt;how DolphinDB’s query and advanced calculation features significantly enhance data processing and analysis workflows through practical code examples&lt;/strong&gt;👇&lt;/p&gt;

&lt;p&gt;In this case, we choose the &lt;strong&gt;TSDB engine&lt;/strong&gt; as the DolphinDB storage engine. We use a server with 16 CPU cores, 512 GB of memory, and 4 SSDs. The read speed of each disk is about 400 MB/s.&lt;/p&gt;

&lt;p&gt;The data we use is four years of US stock exchange level 1 quotes. The &lt;strong&gt;original data size&lt;/strong&gt; is around &lt;strong&gt;12.75 TB&lt;/strong&gt;. &lt;strong&gt;After compression&lt;/strong&gt;, the disk space occupied by DolphinDB is &lt;strong&gt;2.12 TB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All these data are stored in one table in DolphinDB, and there are almost &lt;strong&gt;270 billion records&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ud1i4zf1zo3uo6lwy3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ud1i4zf1zo3uo6lwy3p.png" alt="Image description" width="720" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, let’s do the query.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can get the data of Lehman Brothers at 3:59:59 PM, August 21, 2007 through the code below. It turns out that DolphinDB takes only 7.9 ms to extract one second’s data of a stock from 270 billion (12.75 TB) records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(*) from quotes 

timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypru6fmeys7roynyad6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypru6fmeys7roynyad6y.png" alt="Image description" width="720" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The query result is assigned to the variable x, and the content of x can be shown in the data browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2qw0hx2ebivvtgaboui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2qw0hx2ebivvtgaboui.png" alt="Image description" width="720" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next, let’s calculate the average bid-ask spread of Lehman Brothers in each minute on Aug 31, 2007.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bid-ask spread is defined as the ask price minus the bid price, divided by the average of the ask and bid prices. We also add some filtering conditions to ensure the validity of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, ofr&amp;gt;0, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
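&lt;p&gt;As a sanity check of the spread definition itself, here is a small Python/numpy sketch (our own illustration, not part of the DolphinDB script) computing the same relative spread with the same validity filter:&lt;/p&gt;

```python
import numpy as np

def relative_spread(bid, ofr):
    # Relative bid-ask spread: (ofr - bid) divided by the bid/ask midpoint,
    # i.e. (ofr - bid) / (ofr + bid) * 2, as in the query above.
    bid = np.asarray(bid, dtype=float)
    ofr = np.asarray(ofr, dtype=float)
    # Validity filter mirroring the query's conditions: ofr > bid > 0, ofr/bid < 1.2.
    valid = (bid > 0) & (ofr > bid) & (ofr / bid < 1.2)
    return np.where(valid, (ofr - bid) / (ofr + bid) * 2, np.nan)
```

&lt;p&gt;Invalid quotes come back as NaN rather than being dropped, so the output aligns with the input.&lt;/p&gt;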



&lt;p&gt;Run the script, and we can see that &lt;strong&gt;execution takes 89 milliseconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87vffy2icu42jireuu8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87vffy2icu42jireuu8r.png" alt="Image description" width="494" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The calculation result is plotted in this graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folz32ed36aaznug0fuwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folz32ed36aaznug0fuwo.png" alt="Image description" width="720" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then we’ll increase the data volume by removing the stock symbol restriction and calculate the average bid-ask spread per minute across all stocks.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data of all stocks in the whole market for a whole day is about 10 GB. Run the script, and it takes &lt;strong&gt;2.1 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0jn8jgc0uwlec2cje6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0jn8jgc0uwlec2cje6z.png" alt="Image description" width="573" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a well-designed system, the time it takes to finish a query should be linearly related to the amount of data the query consumes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu5gv361ccd63iztp7i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu5gv361ccd63iztp7i4.png" alt="Image description" width="720" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So this time, we perform the same calculation on 3 days of data.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It takes &lt;strong&gt;5.47 seconds&lt;/strong&gt;, about 3 times as long as the previous query, which means that the time consumption is directly proportional to the amount of data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihq25tavlut4sgoxntzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihq25tavlut4sgoxntzx.png" alt="Image description" width="587" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, DolphinDB delivers excellent performance both for querying small amounts of data and for aggregate calculations over large amounts of data.&lt;/p&gt;

&lt;p&gt;The complete script is displayed as follows 👇👀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")

select count(*) from quotes 

timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, ofr&amp;gt;0, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr&amp;gt;bid, bid&amp;gt;0, ofr/bid&amp;lt;1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also follow the video below to experience the efficient query and aggregation calculations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/A_XoHP9NboE" rel="noopener noreferrer"&gt;https://youtu.be/A_XoHP9NboE&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>coding</category>
      <category>stockmarket</category>
      <category>dolphindb</category>
    </item>
    <item>
      <title>High Frequency Data Analysis: Converting High-frequency Signals to Discrete Buy/Sell Signals</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Fri, 23 Aug 2024 03:01:17 +0000</pubDate>
      <link>https://forem.com/dolphindb/high-frequency-data-analysis-converting-high-frequency-signals-to-discrete-buysell-signals-46de</link>
      <guid>https://forem.com/dolphindb/high-frequency-data-analysis-converting-high-frequency-signals-to-discrete-buysell-signals-46de</guid>
      <description>&lt;p&gt;In high-frequency trading, we generate high-frequency signals from trade and quote tick data and analyze these signals to identify trading opportunities. This tutorial demonstrates how to convert high-frequency signals into discrete buy/sell/hold signals. Essentially, the problem is to convert an array of floating-point values into an array of only 3 integers: +1 (buy), 0 (hold) and -1 (sell).&lt;/p&gt;

&lt;p&gt;The conversion rules can be quite simple. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;+1 if signal &amp;gt; t1&lt;/li&gt;
&lt;li&gt;-1 if signal &amp;lt; t2&lt;/li&gt;
&lt;li&gt;0 otherwise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Such conversion rules can be easily implemented in DolphinDB with function &lt;code&gt;iif&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iif(signal &amp;gt; t1, 1, iif(signal &amp;lt;t2, -1, 0))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
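&lt;p&gt;For comparison, the same simple rule can be written as a nested &lt;code&gt;np.where&lt;/code&gt; in Python (the thresholds and sample values here are illustrative):&lt;/p&gt;

```python
import numpy as np

# Illustrative thresholds and signals; the numpy analogue of the nested iif call.
t1, t2 = 10, -10
signal = np.array([20.12, 8.78, -20.68, 0.7])
# +1 above t1, -1 below t2, 0 otherwise.
direction = np.where(signal > t1, 1, np.where(signal < t2, -1, 0))
# direction -> [1, 0, -1, 0]
```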



&lt;p&gt;However, to avoid too frequent reversals of trading direction, we usually adopt a more complex set of rules: if a signal is above the t1 threshold, it’s a buy signal (+1) and the subsequent signals remain buy signals until one falls below the t10 threshold. Similarly, if a signal is below the t2 threshold, it is a sell signal (-1) and the subsequent signals remain sell signals until a signal exceeds the t20 threshold. The relationship between the thresholds is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t1 &amp;gt; t10 &amp;gt; t20 &amp;gt; t2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above rules, the value of a trade signal is determined not only by the current signal value but also by the state of the previous signal. This is a typical example of path dependence, which is commonly considered difficult to handle with vector operations and thus very slow in scripting languages, including DolphinDB.&lt;/p&gt;

&lt;p&gt;In some cases, however, a path dependence problem can be solved with vector operations. The problem above is one such example. The next section describes how to solve it with vector operations.&lt;/p&gt;

&lt;p&gt;First, identify the signals that fall into ranges where the state is fully determined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if signal &amp;gt; t1, state=1&lt;/li&gt;
&lt;li&gt;if signal &amp;lt; t2, state=-1&lt;/li&gt;
&lt;li&gt;if t20&amp;lt;= signal &amp;lt;= t10, state=0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The states of the signals in the ranges of [t2,t20] and [t10,t1] are determined by the states of the signals preceding these ranges.&lt;/p&gt;
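&lt;p&gt;The mark-then-forward-fill idea can also be sketched in Python (an illustrative re-implementation of the rule using pandas; the function name &lt;code&gt;to_directions&lt;/code&gt; is ours):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def to_directions(signal, t1, t10, t20, t2):
    # Mark the determined states, leave the rest as NaN, then forward-fill.
    s = pd.Series(signal, dtype=float)
    state = pd.Series(np.nan, index=s.index)
    state[s > t1] = 1                    # determined buy
    state[s < t2] = -1                   # determined sell
    state[(s >= t20) & (s <= t10)] = 0   # determined hold
    # Signals in (t2, t20) and (t10, t1) inherit the previous determined state;
    # leading undetermined signals default to hold (0).
    return state.ffill().fillna(0).astype(int)
```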

&lt;p&gt;The DolphinDB script for implementing the above rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;direction = (iif(signal&amp;gt;t1, 1, iif(signal&amp;lt;t2, -1, iif(signal between t20:t10, 0, NULL)))).ffill().nullFill(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s run a simple test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t1= 10
t10 = 6
t20 = -6
t2 = -10
signal = 20.12 8.78 4.39 -20.68 -8.49 -6.98 0.7 2.08 8.97 12.41
direction = (iif(signal&amp;gt;t1, 1, iif(signal&amp;lt;t2, -1, iif(signal between t20:t10, 0, NULL)))).ffill().nullFill(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The equivalent script in pandas would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t1=60
t10=50
t20=30
t2=20
signal=pd.Series([20.12,8.78,4.39,-20.68,-8.49,-6.98,0.7,2.08,8.97,12.41])
start=time.time()
direction1=(signal.apply(lambda signal: 1 if signal &amp;gt; t1 else(-1 if signal&amp;lt;t2 else(0 if t20 &amp;lt; signal &amp;lt; t10 else np.nan)))).ffill().fillna(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test below generates a random array of 10 million signals between 0 and 100 to test the performance of DolphinDB and pandas. The test environment setup is as follows:&lt;/p&gt;

&lt;p&gt;CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz&lt;/p&gt;

&lt;p&gt;Memory: 16 GB&lt;/p&gt;

&lt;p&gt;OS: Windows 10&lt;/p&gt;

&lt;p&gt;The executions take 171.73ms (DolphinDB) and 3.28 seconds (pandas), respectively.&lt;/p&gt;

&lt;p&gt;DolphinDB script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t1= 10
t10 = 6
t20 = -6
t2 = -10
signal = rand(100.0, 10000000)
direction = (iif(signal&amp;gt;t1, 1, iif(signal&amp;lt;t2, -1, iif(signal between t20:t10, 0, NULL)))).ffill().nullFill(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pandas script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
t1= 60
t10= 50
t20= 30
t2= 20
signal= pd.Series(np.random.random(10000000) * 100)
start= time.time()
direction1=(signal.apply(lambda signal: 1 if signal &amp;gt; t1 else(-1 if signal&amp;lt;t2 else(0 if t20 &amp;lt; signal &amp;lt; t10 else np.nan)))).ffill().fillna(0)
end= time.time()
print(end- start)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>hft</category>
      <category>dolphindb</category>
      <category>database</category>
      <category>pandas</category>
    </item>
    <item>
      <title>How to Downsample Your Data Efficiently</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Thu, 22 Aug 2024 03:33:09 +0000</pubDate>
      <link>https://forem.com/dolphindb/how-to-downsample-your-data-efficiently-3037</link>
      <guid>https://forem.com/dolphindb/how-to-downsample-your-data-efficiently-3037</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcymt81c4o24l7jtailva.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcymt81c4o24l7jtailva.jpeg" alt="Image description" width="720" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tired of spending countless minutes downsampling your data? Look no further!&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn how to efficiently downsample &lt;strong&gt;6.48 billion high-frequency records&lt;/strong&gt; to &lt;strong&gt;61 million minute-level records&lt;/strong&gt; in only &lt;strong&gt;41 seconds&lt;/strong&gt; in DolphinDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The basic configuration of the DolphinDB server is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16 CPU cores&lt;/li&gt;
&lt;li&gt;256 GB memory&lt;/li&gt;
&lt;li&gt;4 SSDs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A DolphinDB cluster with 4 data nodes is deployed, and each node uses an SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1mtvw5fvgh8935zv00t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1mtvw5fvgh8935zv00t.png" alt="Image description" width="720" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data we use is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the level 1 quotes of August 2007 from the New York Stock Exchange&lt;/li&gt;
&lt;li&gt;around 272 GB, with 6.48 billion records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Downsampling can be performed with a SQL statement in DolphinDB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc0e4xzjvsfriuxpxc1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbc0e4xzjvsfriuxpxc1e.png" alt="Image description" width="720" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the SQL query may involve multiple partitions, DolphinDB breaks down the job into several tasks and assigns the tasks to the corresponding data nodes for parallel execution. When all the tasks are completed, the system merges the intermediate results from the nodes to return the final result.&lt;/p&gt;

&lt;p&gt;The script is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db = database("dfs://TAQ")
quotes = db.loadTable("quotes")
select count(*)  from quotes where date between 2007.08.01 : 2007.08.31

model=select  top 1 symbol,date, minute(time) as minute, bid, ofr from quotes where date=2007.08.01,symbol=`EBAY
if(existsTable("dfs://TAQ", "quotes_minute_sql"))
 db.dropTable("quotes_minute_sql")
db.createPartitionedTable(model, "quotes_minute_sql", `date`symbol)

timer{
 minuteQuotes=select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol,date,bar(time, 60) as minute
 loadTable("dfs://TAQ", "quotes_minute_sql").append!(minuteQuotes)
}

select count(*)  from loadTable("dfs://TAQ", "quotes_minute_sql")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The frequency can be adjusted as needed simply by modifying &lt;code&gt;bar(time, 60)&lt;/code&gt;. Here, 60 means the data is downsampled to 1-minute intervals, as the timestamp values have second precision.&lt;/p&gt;
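&lt;p&gt;For reference, the same one-minute bucketing can be sketched in pandas with &lt;code&gt;resample&lt;/code&gt; (an illustrative single-symbol example with made-up ticks, not the distributed DolphinDB job):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical second-precision ticks for one symbol (values are made up).
ticks = pd.DataFrame(
    {"bid": [10.0, 10.2, 10.4, 10.6], "ofr": [10.1, 10.3, 10.5, 10.7]},
    index=pd.to_datetime([
        "2007-08-01 09:30:05", "2007-08-01 09:30:40",
        "2007-08-01 09:31:10", "2007-08-01 09:31:50",
    ]),
)
# The pandas analogue of bar(time, 60): bucket timestamps into 1-minute bins
# and average within each bin; change "1min" to adjust the frequency.
minute_quotes = ticks.resample("1min").mean()
```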

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7wtgz4b21dmbs760lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7wtgz4b21dmbs760lx.png" alt="Image description" width="720" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table “quotes_minute_sql“ is created with &lt;code&gt;createPartitionedTable&lt;/code&gt; and the downsampled result can be appended to this table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvu1wty8pw3mazy372xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvu1wty8pw3mazy372xb.png" alt="Image description" width="659" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can execute the script and visit the web-based user interface to check the resource usage. It’s shown that all CPU cores have participated in the downsampling. On each data node, 15 tasks are running concurrently as data is being read from disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fgp016m4h6226hechgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fgp016m4h6226hechgy.png" alt="Image description" width="720" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we come back to VS Code and check the execution status, we find that it takes only &lt;strong&gt;41 seconds&lt;/strong&gt; to complete the data downsampling, which generates 61 million minute-level records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks6hvvfshbniuthe54fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks6hvvfshbniuthe54fx.png" alt="Image description" width="720" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DolphinDB exhibits outstanding performance in data downsampling due to the following reasons:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jobs are executed in a distributed manner, and resources of different nodes can be utilized simultaneously;&lt;/li&gt;
&lt;li&gt;Compression reduces the disk I/O;&lt;/li&gt;
&lt;li&gt;Columnar storage and vectorized computation improve the efficiency of aggregation.&lt;/li&gt;
&lt;/ol&gt;
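To make the downsampling step concrete, here is a minimal pure-Python sketch of tick-to-minute aggregation (the function name and tick layout are hypothetical; DolphinDB performs the same aggregation distributedly with SQL and optimized built-in aggregate functions):

```python
from collections import defaultdict

def downsample_to_minutes(ticks):
    """Aggregate (epoch_ms, price, volume) ticks into minute-level OHLCV bars.
    A single-process sketch of the downsampling job described above."""
    buckets = defaultdict(list)
    for ts_ms, price, vol in ticks:
        minute = ts_ms // 60_000          # bucket key: minutes since epoch
        buckets[minute].append((price, vol))
    bars = {}
    for minute, rows in sorted(buckets.items()):
        prices = [p for p, _ in rows]
        bars[minute] = {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": sum(v for _, v in rows),
        }
    return bars

# Three ticks: two in the first minute, one in the second
ticks = [(0, 10.0, 100), (20_000, 10.5, 50), (61_000, 10.2, 70)]
bars = downsample_to_minutes(ticks)
# → two minute bars; the first has open=10.0, close=10.5, volume=150
```

The DolphinDB version replaces the Python loop with a `group by` aggregation that runs in parallel on every partition, which is where the 41-second figure comes from.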

&lt;p&gt;To learn detailed operations of data downsampling, take a look at this demo!&lt;br&gt;
&lt;a href="https://youtu.be/0vRuiz1Lf6Y" rel="noopener noreferrer"&gt;https://youtu.be/0vRuiz1Lf6Y&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>dolphindb</category>
      <category>quant</category>
      <category>sql</category>
    </item>
    <item>
      <title>A Simpler Way to Calculate WorldQuant 101 Alphas</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Wed, 21 Aug 2024 02:31:10 +0000</pubDate>
      <link>https://forem.com/dolphindb/a-simpler-way-to-calculate-worldquant-101-alphas-m32</link>
      <guid>https://forem.com/dolphindb/a-simpler-way-to-calculate-worldquant-101-alphas-m32</guid>
      <description>&lt;p&gt;The formulas of &lt;strong&gt;101 quantitative trading alphas&lt;/strong&gt; used by WorldQuant were presented in the paper &lt;em&gt;&lt;a href="https://arxiv.org/ftp/arxiv/papers/1601/1601.00991.pdf" rel="noopener noreferrer"&gt;101 Formulaic Alphas&lt;/a&gt;&lt;/em&gt;. However, some formulas are complex, leading to challenges in calculation.&lt;/p&gt;

&lt;p&gt;Take the formula of &lt;strong&gt;Alpha#98&lt;/strong&gt; as an example; the data we use is daily data from the New York Stock Exchange.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kakushadze et al 2015 Alpha #98: 
(rank(decay_linear(correlation(vwap, sum(adv5, 26.4719), 4.58418), 7.18088)) -
rank(decay_linear(Ts_Rank(Ts_ArgMin(correlation(rank(open), rank(adv15), 20.8187), 8.62571), 6.95668), 8.07206)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This formula involves both &lt;strong&gt;cross-sectional data&lt;/strong&gt; and &lt;strong&gt;time series data&lt;/strong&gt;, and the calculation uses nested functions with up to 6 levels, which is extremely difficult to implement in most systems.&lt;/p&gt;
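One of the building blocks that appears twice in the formula, decay_linear(x, d), is a linearly weighted moving average. A plain-Python sketch (the function name mirrors the paper's; DolphinDB expresses it as `mavg(x, 1..d)`):

```python
def decay_linear(series, d):
    """Linearly weighted moving average over the last d points:
    weights 1..d from oldest to newest (newest weighted most),
    normalized so they sum to 1."""
    weights = list(range(1, d + 1))
    total = sum(weights)
    out = []
    for i in range(d - 1, len(series)):
        window = series[i - d + 1 : i + 1]
        out.append(sum(w * x for w, x in zip(weights, window)) / total)
    return out

vals = decay_linear([1.0, 2.0, 3.0, 4.0], 3)
# → [2.333..., 3.333...]: each value is (1*x[t-2] + 2*x[t-1] + 3*x[t]) / 6
```

The paper's fractional window lengths (e.g. 7.18088) are conventionally rounded to integers before applying operators like this one.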

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5mwaekfbfi5naxfbmko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5mwaekfbfi5naxfbmko.png" alt="Image description" width="720" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We can significantly reduce the development cost of complex calculations such as Alpha#98 by using DolphinDB’s built-in functions with panel data (in matrix form).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ovalgklgbxburgg9e2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ovalgklgbxburgg9e2p.png" alt="Image description" width="720" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11od5mfbwbowrcd91j9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11od5mfbwbowrcd91j9n.png" alt="Image description" width="720" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why is DolphinDB’s code so concise and elegant? 😲&lt;/p&gt;

&lt;p&gt;On the one hand, using &lt;strong&gt;panel data&lt;/strong&gt; provided by DolphinDB to implement the Alpha#98 factor simplifies the calculation logic and makes the code very concise. Panel data is a matrix that combines cross-sectional data and time-series data.&lt;/p&gt;
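The panel layout can be illustrated with a tiny plain-Python sketch (hypothetical numbers): rows are dates, columns are stocks, so a cross-sectional operator like rank() works along a row while a time-series operator works down a column.

```python
panel = [
    # AAA   BBB   CCC    -- one column per stock
    [10.0, 20.0, 15.0],  # day 1
    [11.0, 19.0, 16.0],  # day 2
    [12.0, 21.0, 14.0],  # day 3
]

def cross_sectional_rank(row):
    """rank() in the alpha formulas: rank each stock against the
    others on the same day (0-based here for simplicity)."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0] * len(row)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def ts_delta(panel, col):
    """A time-series operator: day-over-day change of one stock's column."""
    series = [row[col] for row in panel]
    return [b - a for a, b in zip(series, series[1:])]

day3_ranks = cross_sectional_rank(panel[2])   # ranks within day 3
aaa_delta = ts_delta(panel, 0)                # AAA's daily changes
```

With the data in this shape, a formula that mixes both kinds of operators becomes a chain of matrix functions rather than repeated pivoting, which is exactly what the DolphinDB script below exploits.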

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxzvii7ptp7m0v8pb67p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxzvii7ptp7m0v8pb67p.png" alt="Image description" width="674" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, DolphinDB has &lt;strong&gt;more than 1,500 built-in functions&lt;/strong&gt; and many of them are optimized. You can implement all 101 alphas with DolphinDB built-in functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful8tbp0tyjpmb7bk25wq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful8tbp0tyjpmb7bk25wq.png" alt="Image description" width="720" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a look at the DolphinDB script, which stays very close to the original formula.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def alpha98Panel(vwap, open, vol){
    return tsRank(mavg(mcorr(vwap, msum(mavg(vol, 5), 26), 5), 1..7)) - tsRank(mavg(mrank(9 - mimin(mcorr(tsRank(open), 
    tsRank(mavg(vol, 15)), 21), 9), true, 7), 1..8))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the script: it takes only &lt;strong&gt;986 ms&lt;/strong&gt; to generate a matrix with 7,162 columns and 252 rows and to calculate Alpha#98 on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4vgpkd36u1o6yi7o33a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4vgpkd36u1o6yi7o33a.png" alt="Image description" width="720" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, DolphinDB supports &lt;strong&gt;unified stream and batch processing&lt;/strong&gt;. It provides the &lt;code&gt;streamEngineParser&lt;/code&gt; function to automatically form a pipeline of stream engines to carry out the specified metrics calculation. You can directly use the Alpha#98 function as the &lt;em&gt;metrics&lt;/em&gt;, with no need to modify the script for metrics calculation.&lt;/p&gt;

&lt;p&gt;Click the demo below to get more info! 👇&lt;br&gt;
&lt;a href="https://youtu.be/B0lYBAI_FEc" rel="noopener noreferrer"&gt;https://youtu.be/B0lYBAI_FEc&lt;/a&gt;&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>hft</category>
      <category>database</category>
      <category>quant</category>
    </item>
    <item>
      <title>DolphinDB to Offer Access to ICE Consolidated Feed and Historical Tick Data</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Tue, 20 Aug 2024 07:30:41 +0000</pubDate>
      <link>https://forem.com/dolphindb/dolphindb-to-offer-access-to-ice-consolidated-feed-and-historical-tick-data-5gb3</link>
      <guid>https://forem.com/dolphindb/dolphindb-to-offer-access-to-ice-consolidated-feed-and-historical-tick-data-5gb3</guid>
      <description>&lt;p&gt;In March 2023, DolphinDB, Inc., a leading provider of time-series databases, and Intercontinental Exchange (NYSE: ICE), agreed to offer access to the ICE Consolidated Feed and its historical tick data services on DolphinDB’s high-performance time series database. &lt;strong&gt;This new offering provides a more efficient and reliable workflow for DolphinDB clients to access ICE’s market data for faster deployment of trading strategy research, testing and executions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DolphinDB is a high-performance distributed time-series database, which has been widely adopted by brokerage firms, mutual funds, hedge funds, banks, and exchanges. Through this agreement, DolphinDB users can easily access and analyze a diverse range of market data from ICE for crucial insights regarding investment decisions and risk management.&lt;/p&gt;

&lt;p&gt;The ICE Consolidated Feed provides a broad range of real-time and intra-day data and analytics from over 600 markets and third-party data sources. Global organizations including major banks, brokerage firms, and investment management firms utilize the ICE Consolidated Feed to power their pricing and analytics data requirements. Together, ICE and DolphinDB can offer clients an efficient way to manage, process and analyze massive amounts of market data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz8g10is72mwudvdrfcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz8g10is72mwudvdrfcm.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“We are excited to work with ICE to bring a comprehensive and efficient suite of data solutions to our clients,” said Dr. Davis Zhou, CEO and founder of DolphinDB. “This integration brings together the products and resources of both parties, which transforms and streamlines data acquisition, processing, and analysis throughout the investment research process. We look forward to a collaboration where we bring out the best in each other to accelerate the development in fintech.”&lt;/p&gt;

&lt;p&gt;“We are pleased to work with DolphinDB, and offer more clients access to ICE’s suite of market data offerings, by leveraging a streamlined integration for fast deployment and reliability,” said Magnus Cattan, Head of ICE Fixed Income and Data Services in APAC. “Our data provides a competitive option to DolphinDB’s clients for accessing in-depth cross-asset global market data.”&lt;/p&gt;

&lt;p&gt;Both DolphinDB and ICE are committed to delivering the highest quality solutions for quantitative investment research, and this collaboration further solidifies their shared vision of advancing the industry through collaboration and innovation. This collaboration will leverage DolphinDB’s data storage and analytics capabilities and ICE’s global market data reach &amp;amp; expertise to deliver comprehensive, reliable and timely financial data solutions to global market participants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About DolphinDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DolphinDB offers a unified platform for data warehouse, analytics and streaming workloads. At its core it is a high-performance distributed time-series database. With a fully featured programming language, over 1500 built-in functions and a suite of stream computing engines, DolphinDB enables rapid development of high performance applications for mission critical tasks in global trading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Intercontinental Exchange&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intercontinental Exchange, Inc. (NYSE: ICE) is a Fortune 500 company that designs, builds and operates digital networks to connect people to opportunity. We provide financial technology and data services across major asset classes that offer our customers access to mission-critical workflow tools that increase transparency and operational efficiencies. We operate exchanges, including the New York Stock Exchange, and clearing houses that help people invest, raise capital and manage risk across multiple asset classes. Our comprehensive fixed income data services and execution capabilities provide information, analytics and platforms that help our customers capitalize on opportunities and operate more efficiently. At ICE Mortgage Technology, we are transforming and digitizing the U.S. residential mortgage process, from consumer engagement through loan registration. Together, we transform, streamline and automate industries to connect our customers to opportunity.&lt;/p&gt;

</description>
      <category>dolphindb</category>
      <category>hft</category>
      <category>quant</category>
      <category>database</category>
    </item>
    <item>
      <title>Integrating Crypto Market Data From Multiple Sources</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Tue, 20 Aug 2024 07:06:39 +0000</pubDate>
      <link>https://forem.com/dolphindb/integrating-crypto-market-data-from-multiple-sources-l34</link>
      <guid>https://forem.com/dolphindb/integrating-crypto-market-data-from-multiple-sources-l34</guid>
      <description>&lt;p&gt;Cryptocurrency trading professionals typically use Python, Java, Rust, or C++ for exchange API access. Single-source data access presents inherent limitations, such as data loss from server downtime, delayed updates, and limited data quality and refresh rates, which can potentially impact trading decisions.&lt;/p&gt;

&lt;p&gt;This article introduces a DolphinDB solution for accessing market streams (with Binance Exchange as an example) through both WebSocket and REST APIs. Performance tests show that this integrated approach enhances continuous data access, broadens coverage, and improves data quality by providing more timely and frequent market updates. With higher-frequency data refreshes within shorter intervals, traders can achieve superior market responsiveness and make more informed decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Background
&lt;/h2&gt;

&lt;p&gt;Binance offers two ways for accessing the market streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;REST API&lt;/strong&gt;: Suitable for retrieving static data but requires polling for real-time updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebSocket API&lt;/strong&gt;: Designed for accessing real-time market streams. By maintaining a persistent connection, it enables low-latency retrieval of the latest data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Binance provides multiple &lt;a href="https://developers.binance.com/docs/binance-spot-api-docs/rest-api" rel="noopener noreferrer"&gt;base endpoints&lt;/a&gt;, each with distinct performance and stability characteristics. We recommend that users test the performance and stability of each base endpoint on the designated server in their specific business context, and determine the optimal number of data sources and retrieval methods based on their needs. The proposed approach combines the strengths of both methods, enhancing performance and stability by subscribing to data on different base URLs with the same account and setting different processing frequencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Implementation Overview
&lt;/h2&gt;

&lt;p&gt;The integrated market data solution comprises the following components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9lbb43cvi8m4ghv3bel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9lbb43cvi8m4ghv3bel.png" alt="Image description" width="720" height="1098"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessing Market Streams from Binance Exchange&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each cryptocurrency pair, market data is accessed from Binance via both WebSocket API (with subscription frequency at 100 ms) and REST API (with polling interval of 50 ms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing Data from WebSocket API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Subscribing to the data from WebSocket API, and writing it to a shared stream table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persisting the stream table to a DFS partitioned table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Processing Data from REST API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Setting up HTTP requests to fetch data from REST API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Writing the data to a shared stream table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persisting the stream table to a DFS partitioned table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cleaning and Integrating Received Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creating a shared dictionary to maintain the latest timestamp for each trading pair.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Defining rules for filtering data with latest timestamps, and integrating selected streams into a shared stream table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upon receiving new data, comparing timestamps of incoming data with those in the table to ensure only the latest data is written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persisting the stream table to a DFS partitioned table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
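The filtering rule in the last two bullets can be sketched in plain Python (names such as `make_merger` are hypothetical; the DolphinDB version in Section 4 uses a shared `syncDict` and a stream table instead of a local dict and list):

```python
def make_merger():
    """Keep, per trading pair, only messages whose event timestamp is
    strictly newer than the latest one seen so far."""
    latest_ts = {}          # pair -> newest event time seen so far
    merged = []             # stands in for the integrated stream table

    def on_message(pair, event_ts, payload, source):
        seen = latest_ts.get(pair, -1)
        if event_ts > seen:
            latest_ts[pair] = event_ts
            merged.append((pair, event_ts, payload, source))
            return True     # accepted as the freshest update
        return False        # stale or duplicate: dropped

    return on_message, merged

on_message, merged = make_merger()
on_message("btcusdt", 100, "depth-a", "wss")    # accepted
on_message("btcusdt", 100, "depth-a", "rest")   # same timestamp: dropped
on_message("btcusdt", 150, "depth-b", "rest")   # newer: accepted
# merged now holds two records, one per distinct timestamp
```

Whichever source delivers a given update first wins, which is what lets the merged table refresh faster than either source alone.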

&lt;h2&gt;
  
  
  3. Accessing Market Streams
&lt;/h2&gt;

&lt;p&gt;This section outlines the steps for accessing market streams from Binance’s WebSocket and REST APIs using the DolphinDB WebSocket and HttpClient plugins.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Accessing Data From WebSocket API
&lt;/h3&gt;

&lt;p&gt;We establish a WebSocket subscription task using the DolphinDB WebSocket plugin to access real-time market depth data from Binance WebSocket API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Install and load the DolphinDB WebSocket plugin. This step can be skipped if the plugin is already loaded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;installPlugin("WebSocket")
loadPlugin("WebSocket")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Create a shared stream table wssDepthST and persist it to disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;colNames = `type`eventTime`transactionTime`code`firstId`lastId`finalId`bidPrice`bidQty`askPrice`askQty`currentTime
colTypes = [SYMBOL, TIMESTAMP, TIMESTAMP, SYMBOL, LONG, LONG, LONG, DOUBLE[], DOUBLE[], DOUBLE[], DOUBLE[], TIMESTAMP]
enableTableShareAndPersistence(table=streamTable(10000:0, colNames, colTypes), tableName=`wssDepthST, cacheSize=12000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Define functions for handling market data.&lt;/p&gt;

&lt;p&gt;The market depth data accessed through WebSocket API is formatted as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"e": "depthUpdate",  "E": 1571889248277, "T": 1571889248276, "s": "BTCUSDT", "U": 390497796, "u": 390497878, "pu": 390497794,
 "b": [["7403.89", "0.002"], ["7403.90", "3.906"], ["7404.00", "1.428"] ,["7404.85", "5.239"], ["7405.43","2.562"]],
 "a": [["7405.96","3.340" ], ["7406.63", "4.525"], ["7407.08", "2.475"], ["7407.15", "4.800"], ["7407.20","0.175"]]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following script defines functions for processing the incoming data in DolphinDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Define parseDepth for parsing market depth data
def parseDepth(rawData){
    rawDataDict = parseExpr(rawData).eval().data
    if (rawDataDict["e"]=="depthUpdate"){
        // Process bid data
        bTransposed = rawDataDict.b.double().matrix(b).transpose()
        bidPrice = bTransposed[0].enlist()
        bidQty = bTransposed[1].enlist()
        // Process ask data
        aTransposed = rawDataDict.a.double().matrix(a).transpose()
        askPrice = aTransposed[0].enlist()
        askQty = aTransposed[1].enlist()
        // Extract other relevant fields
        type = rawDataDict["e"]
        eventTime = timestamp(rawDataDict["E"])
        transactionTime = timestamp(rawDataDict["T"])
        code = rawDataDict["s"]
        firstId = rawDataDict["U"]
        lastId = rawDataDict["u"]
        finalId = rawDataDict["pu"]
        currentTime = gmtime(now())
        // Construct a table
        return table(typ, eventTime, transactionTime, code, firstId, 
          lastId, finalId, bidPrice, bidQty, askPrice, askQty, currentTime)
    }
}
// WebSocket event handlers
def onOpen(ws){
    writeLog("WebSocket opened to receive data")
}

def onMessage(mutable streamTable, ws, dataTable){
    for (data in dataTable[`msg]){
        res = parseDepth(data)
        streamTable.append!(res)
    }
}

def onError(ws, error){
    writeLog("WebSocket failed to receive data: " + error.string())
}

def onClose(ws, statusCode, msg){
    writeLog("Connection is closed, status code: " + statusCode.string() + ", " + 
      msg.string())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
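For comparison, the parsing step alone can be sketched in Python against the sample message shown earlier (a simplified stand-in for the DolphinDB `parseDepth` function above; field names follow Binance's payload):

```python
import json

SAMPLE = ('{"e": "depthUpdate", "E": 1571889248277, "T": 1571889248276, '
          '"s": "BTCUSDT", "b": [["7403.89", "0.002"], ["7403.90", "3.906"]], '
          '"a": [["7405.96", "3.340"], ["7406.63", "4.525"]]}')

def parse_depth(raw):
    """Decode a depthUpdate message and split the bid/ask arrays of
    [price, qty] string pairs into float columns."""
    msg = json.loads(raw)
    if msg.get("e") != "depthUpdate":
        return None                       # ignore other event types
    return {
        "code": msg["s"],
        "event_time_ms": msg["E"],
        "bid_price": [float(p) for p, _ in msg["b"]],
        "bid_qty":   [float(q) for _, q in msg["b"]],
        "ask_price": [float(p) for p, _ in msg["a"]],
        "ask_qty":   [float(q) for _, q in msg["a"]],
    }

row = parse_depth(SAMPLE)
# → row["code"] == "BTCUSDT", row["bid_price"] == [7403.89, 7403.9]
```

Storing each side as a price column and a quantity column mirrors the array-vector (`DOUBLE[]`) columns of the wssDepthST schema.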



&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Create a WebSocket subscription task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Connect to Binance WebSocket API
url = "wss://fstream.binance.com/stream?streams=btcusdt@depth20@100ms"
config = dict(STRING, ANY)
// Create a WebSocket subscription task
ws = WebSocket::createSubJob(url, onOpen, onMessage{streamTable=wssDepthST}, onError, onClose, "wssDepth", config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;: Manage the subscription.&lt;/p&gt;

&lt;p&gt;After establishing the subscription, use &lt;code&gt;getSubJobStat&lt;/code&gt; to view the subscription status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Check subscription status
WebSocket::getSubJobStat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To cancel the subscription, use &lt;code&gt;cancelSubJob&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Cancel subscription
WebSocket::cancelSubJob("wssDepth")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the complete script for Binance WebSocket data access, refer to Appendix &lt;a href="https://docs.dolphindb.com/en/Tutorials/integrating_crypto_market_data.html#topic_pzt_dsm_fcc" rel="noopener noreferrer"&gt;binanceDepthWS&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Accessing Data From REST API
&lt;/h3&gt;

&lt;p&gt;We create a task to send periodic HTTP requests through the HttpClient Plugin for accessing the latest market depth data from Binance REST API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Install and load the DolphinDB HttpClient plugin. This step can be skipped if the plugin is already loaded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;installPlugin("httpClient")
loadPlugin("httpClient")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Create a shared stream table restDepthST and persist it to disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;colNames = `eventTime`transactionTime`code`lastUpdateId`bidPrice`bidQty`askPrice`askQty`currentTime
colTypes = [TIMESTAMP, TIMESTAMP, SYMBOL, LONG, DOUBLE[], DOUBLE[], DOUBLE[], DOUBLE[], TIMESTAMP]
enableTableShareAndPersistence(table=streamTable(100000:0, colNames, colTypes), tableName=`restDepthST, cacheSize=12000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Define functions for handling market data.&lt;/p&gt;

&lt;p&gt;The market depth data accessed through REST API is formatted as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"lastUpdateId": 1027024,"E": 1589436922972,"T": 1589436922959,
 "bids": [["4.00000000", "431.00000000"]],
 "asks": [["4.00000200", "12.00000000"]]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define a &lt;code&gt;getDepth&lt;/code&gt; function that sends HTTP requests at regular intervals (every 50 ms) to continuously fetch the latest market data. The retrieved data is parsed and ingested into a stream table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Define getDepth to fetch market data
def getDepth(mutable restDepthST, code, baseUrl){
    do{
        try{
        param = dict(string,string)
        param['symbol'] = code;
        res = httpClient::httpGet(baseUrl,param,10000)

        rawDataDict = parseExpr(res.text).eval()
        b = double(rawDataDict.bids)
        bTransposed = matrix(b).transpose()
        bidPrice = bTransposed[0]
        bidQty = bTransposed[1]
        a = double(rawDataDict.asks)
        aTransposed = matrix(a).transpose()
        askPrice = aTransposed[0]
        askQty = aTransposed[1]

        eventTime = timestamp(rawDataDict["E"])
        transactionTime = timestamp(rawDataDict["T"])
        lastUpdateId = long(rawDataDict["lastUpdateId"])
        currentTime = gmtime(now())

        resTable = table(eventTime as eventTime, transactionTime as transactionTime, 
            code as code, lastUpdateId as lastUpdateId, [bidPrice] as bidPrice, 
            [bidQty] as bidQty, [askPrice] as askPrice, [askQty] as askQty, currentTime as currentTime)
        restDepthST.append!(resTable)
        sleep(50) 
        }
        catch(ex){
            print(ex)
            continue
        }
    }while(true)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
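The shape of this polling loop, fetch, handle, sleep, and keep going on errors, can be sketched in Python with an injected fetcher so the sketch needs no network (`poll_depth` and the fake responses are hypothetical; the real `getDepth` job loops forever):

```python
import time

def poll_depth(fetch, handle, interval_s=0.05, max_iters=3):
    """Call fetch at a fixed interval, hand each successful result to
    handle, and log-and-continue on errors, like getDepth above."""
    for _ in range(max_iters):            # the real job uses do { } while(true)
        try:
            handle(fetch())
        except Exception as exc:
            print("fetch failed:", exc)   # log and keep polling
        time.sleep(interval_s)

received = []
fake_responses = iter([{"lastUpdateId": 1},
                       RuntimeError("timeout"),
                       {"lastUpdateId": 2}])

def fake_fetch():
    item = next(fake_responses)
    if isinstance(item, Exception):
        raise item                        # simulate a transient HTTP failure
    return item

poll_depth(fake_fetch, received.append, interval_s=0, max_iters=3)
# → received == [{"lastUpdateId": 1}, {"lastUpdateId": 2}]
```

Catching the exception inside the loop is the key design choice: a transient timeout costs one tick of data instead of killing the background job.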



&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Set the base endpoint and submit the job with the name &lt;code&gt;getDepth_BTC&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;baseUrl = "https://fapi.binance.com/fapi/v1/depth"
submitJob("getDepth_BTC","getDepth_BTC", getDepth, restDepthST, "btcusdt", baseUrl)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;: Manage submitted jobs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// View recent jobs
getRecentJobs()
// View intermediate job information
getJobMessage("getDepth_BTC")
// Cancel the job
cancelJob('getDepth_BTC')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the complete script for Binance REST data access, refer to Appendix binanceDepthREST.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Integrating Multi-Source Data
&lt;/h2&gt;

&lt;p&gt;The stream tables wssDepthST and restDepthST receive market data from the Binance WebSocket and REST APIs. Due to the differences in subscription methods and delays, rules for aligning and cleaning such multi-source market data are required before integration. These rules compare the timestamp of each incoming record with that in the table to determine whether the record provides the most recent information.&lt;/p&gt;

&lt;p&gt;Specifically, we implement data integration by creating a shared dictionary to store the latest timestamps for each cryptocurrency pair and using a stream table latestDepthST to store the integrated market data. The table subscribes to wssDepthST and restDepthST with handling rules to update records with latest timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Create a shared dictionary for maintaining the latest timestamps for each trading pair to track the most recent update time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;latestTs=syncDict(SYMBOL,TIMESTAMP)
latestTs['btcusdt'] = 2024.05.16 06:39:17.513
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Set rules for filtering and ingesting data from restDepthST and wssDepthST into table latestDepthST. By extracting data and event times, the rules check if the record has the latest timestamp. If true, the shared dictionary is updated and new records are appended to latestDepthST.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def toMerged(mutable latestTs, routeName, msg){ 
    symbolID = exec code from msg limit 1
    Timets = exec eventTime from msg limit 1
    lts = latestTs[symbolID]
    if(Timets &amp;gt; lts){
        latestTs[symbolID] = Timets
        appendTestTb = select code, eventTime, lastUpdateId, bidPrice, bidQty, askPrice, askQty, currentTime, routeName as dataroute from msg
        objByName("latestDepthST").append!(appendTestTb)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Subscribe to wssDepthST and restDepthST and specify the handler with filtering rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subscribeTable(tableName="restDepthST", actionName="restTolive", offset=-1, handler=toMerged{latestTs, 'restapi'}, msgAsTable=true, batchSize=1, throttle=0.001)
subscribeTable(tableName="wssDepthST", actionName="wssTolive", offset=-1, handler=toMerged{latestTs, 'wssapi'}, msgAsTable=true, batchSize=1, throttle=0.001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method allows subscriptions to multiple sources to be integrated into one table. Rules should be specified based on the format of the obtained market data.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Performance Test
&lt;/h2&gt;

&lt;p&gt;We test the data coverage of using WebSocket and REST APIs separately and together by evaluating the data volume received within 24 hours. Larger data volume indicates higher refresh rates on average.&lt;/p&gt;

&lt;p&gt;In our test environment, we use two methods to obtain BTCUSDT spot depth data: WebSocket with a 100-ms frequency (maximum available speed) and REST polling with a 50-ms interval. We then filter and insert the latest market data into a table. While the 50-ms update speed is ideal for balancing update frequency and market data acquisition across multiple currency pairs given bandwidth limitations, it’s important to note that the REST polling method actually yields higher data volumes due to its faster 50 ms interval.&lt;/p&gt;

&lt;p&gt;After 24 hours of data acquisition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST polling: 1,383,480 records&lt;/li&gt;
&lt;li&gt;WebSocket subscription: 644,808 records&lt;/li&gt;
&lt;li&gt;Duplicate records: 10,020&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latest market table latestDepthST receives 1,753,067 new records, with the specific composition as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqgz2r6s1so42nos3wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqgz2r6s1so42nos3wx.png" alt="Image description" width="448" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the results, we can see that the number of new records added to latestDepthST is greater than the number of records obtained separately through WebSocket and REST. Moreover, a higher proportion of REST data in the table indicates that this multi-source integration approach can increase the frequency and timeliness of data acquisition compared to a single access method. Additionally, since the data obtained from WebSocket and REST APIs are pushed from different servers, this method helps avoid problems caused by a single server failure. This further demonstrates that multi-source market data access can obtain market streams more quickly, reducing the risks associated with network fluctuations and exchange service failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;This article introduces a DolphinDB-based solution for multi-source market data access and integration in the cryptocurrency market. By integrating multiple data sources and implementing efficient data cleansing and storage, DolphinDB not only improves the refresh frequency and timeliness of data acquisition but also ensures data continuity and reliability. The test results show that this approach significantly increases the amount of market data obtained and shortens the data refresh interval, providing traders with more reliable and timely market data support.&lt;/p&gt;

&lt;p&gt;In summary, DolphinDB’s multi-source market data integration solution significantly enhances the cryptocurrency market data acquisition process, providing traders with a superior framework for strategic trading decisions and activities.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Appendix
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.dolphindb.com/en/Tutorials/data/integrating_crypto_market_data/wssDepthST.csv" rel="noopener noreferrer"&gt;wssDepthST (sample table)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.dolphindb.com/en/Tutorials/script/integrating_crypto_market_data/binanceDepthWS.dos" rel="noopener noreferrer"&gt;binanceDepthWS (script for Section 3.1)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.dolphindb.com/en/Tutorials/script/integrating_crypto_market_data/binanceDepthREST.dos" rel="noopener noreferrer"&gt;binanceDepthREST (script for Section 3.2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.dolphindb.com/en/Tutorials/script/integrating_crypto_market_data/binanceDepthJoint.dos" rel="noopener noreferrer"&gt;binanceDepthJoint (script for Chapter 4)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Unified Stream and Batch Processing of WorldQuant 101 Alphas in DolphinDB</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Fri, 17 May 2024 02:32:47 +0000</pubDate>
      <link>https://forem.com/dolphindb/unified-stream-and-batch-processing-of-worldquant-101-alphas-in-dolphindb-3def</link>
      <guid>https://forem.com/dolphindb/unified-stream-and-batch-processing-of-worldquant-101-alphas-in-dolphindb-3def</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bl6jnetorz81j01s2uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bl6jnetorz81j01s2uc.png" alt="Image description" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 2015, the formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. To conveniently calculate these 101 alphas in DolphinDB, you can use the functions encapsulated in the module wq101alpha (wq101alpha.dos).&lt;/p&gt;

&lt;p&gt;This module has the following advantages over Python:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better Performance: DolphinDB outperforms Python by a median of 15.5x. For 27.5% of the alphas, DolphinDB is more than 100 times faster than Python.&lt;/li&gt;
&lt;li&gt;Unified Stream and Batch Processing: Functions defined in this module can be used for both stream and batch processing.&lt;/li&gt;
&lt;li&gt;Simplicity: Most of the 101 alphas can be written with DolphinDB built-in functions. No need to develop new functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All scripts in this tutorial are compatible with DolphinDB V2.00.8, 1.30.20 or higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Naming Conventions
&lt;/h2&gt;

&lt;p&gt;All function names in module wq101alpha start with “WQAlpha” followed by a number from 1 to 101, for example, WQAlpha1, WQAlpha2, etc.&lt;/p&gt;

&lt;p&gt;The following is a list of parameters for alphas defined in the wq101alpha module. Each alpha may use a different set of parameters (see appendix).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotew2nvxpp9ghme0v6yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotew2nvxpp9ghme0v6yu.png" alt="Image description" width="506" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two types of alphas are defined in the wq101alpha module: alphas with and without industry information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxglihvx7hg3ebop9hmvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxglihvx7hg3ebop9hmvl.png" alt="Image description" width="515" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Examples
&lt;/h2&gt;

&lt;p&gt;This chapter explains how to calculate alphas with specific examples.&lt;/p&gt;

&lt;p&gt;2.1 Environment Setup&lt;br&gt;
Add the module file wq101alpha.dos to [home]/modules.&lt;/p&gt;

&lt;p&gt;The [home] directory is specified by the configuration parameter home. (To check the value of home, use getHomeDir())&lt;/p&gt;

&lt;p&gt;2.2 Data Preparation&lt;br&gt;
You can simulate daily data with DailyDataSimulation and industry information with IndustryInfo.&lt;/p&gt;

&lt;p&gt;Alternatively, if you already have tables of daily data and industry information, you need to perform an equal join to combine these two tables, and make sure that the column names are consistent with the parameters defined in the module.&lt;/p&gt;
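&lt;p&gt;The equal join mentioned above can be sketched as follows (stdlib Python with made-up data; in DolphinDB this is the &lt;code&gt;ej&lt;/code&gt; function):&lt;/p&gt;

```python
# Equal join on the security identifier: each daily row is combined with
# the matching industry-info row. Column names follow the module's
# conventions; the data is made up.
daily = [
    {"securityid": "s1", "tradetime": "2010.01.04", "close": 10.2},
    {"securityid": "s2", "tradetime": "2010.01.04", "close": 20.5},
]
info = {
    "s1": {"indclass": "Tech", "cap": 1.0e9},
    "s2": {"indclass": "Energy", "cap": 5.0e8},
}
joined = [{**row, **info[row["securityid"]]}
          for row in daily if row["securityid"] in info]
```

&lt;p&gt;Each joined row then carries both the price columns and the industry columns expected by the module.&lt;/p&gt;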

&lt;p&gt;If you need to change the column names, you can use function prepareData in module prepare101 (add prepare101.dos to [home]/modules). It converts the column names to the defined parameters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rawData is a table containing non-industry information.&lt;/li&gt;
&lt;li&gt;infoData is a table containing industry information.&lt;/li&gt;
&lt;li&gt;startTime and endTime determine the start time and end time of the data.&lt;/li&gt;
&lt;li&gt;Other parameters are column names to be converted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Import the wq101alpha module and load the data you prepared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use wq101alpha
use prepare101
login('admin', '123456')
rawData = loadTable("dfs://k_day_level", "k_day")
infoData = select * from loadTable("dfs://info", "info_data")
startTime = timestamp(2010.01.01)
endTime = timestamp(2010.01.31)
data = prepareData(rawData=rawData, startTime=startTime, endTime=endTime, securityidName="securityid", tradetimeName="tradetime", openName="open", closeName="close", highName="high", lowName="low", volumeName="vol", vwapName="vwap", infoSecurityidName="securityid", capName="cap", indclassName="indclass", infoData=infoData)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.3 Calculating Alphas Without Industry Information&lt;br&gt;
In the wq101alpha module, the calculation of alphas without industry information is generally conducted on two dimensions: time-series and cross-section. For these factors, you need to prepare a matrix as the input, and then call function WQAlpha#. Check Appendix for specific parameters.&lt;/p&gt;
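&lt;p&gt;The pivot that produces this matrix input can be sketched in plain Python (illustrative data; in DolphinDB it is the &lt;code&gt;exec ... pivot by&lt;/code&gt; statement): long-format rows become a matrix with one row per timestamp and one column per security.&lt;/p&gt;

```python
# Long-format (tradetime, securityid, close) rows pivoted into a matrix:
# one row per timestamp, one column per security. Illustrative data.
rows = [
    ("2010.01.04", "s1", 10.0), ("2010.01.04", "s2", 20.0),
    ("2010.01.05", "s1", 11.0), ("2010.01.05", "s2", 19.5),
]
times = sorted({t for t, _, _ in rows})
secs = sorted({s for _, s, _ in rows})
cell = {(t, s): v for t, s, v in rows}
close_matrix = [[cell.get((t, s)) for s in secs] for t in times]
```

&lt;p&gt;Time-series alphas then operate down the columns of this matrix, and cross-sectional alphas across its rows.&lt;/p&gt;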

&lt;p&gt;For example, you can calculate alpha 1 and alpha 2 as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use wq101alpha
input1 = exec close from data where tradetime between startTime : endTime pivot by tradetime, securityid
res1 = WQAlpha1(input1)

input2 = dict(`vol`close`open, panel(data.tradetime, data.securityid, [data.vol, data.close, data.open]))
res2 = WQAlpha2(input2.vol, input2.close, input2.open)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To save you the time spent specifying parameters, the prepare101 module provides functions prepare# and calAlpha#.&lt;/p&gt;

&lt;p&gt;Function prepare# prepares the parameters required for each alpha, and function calAlpha# encapsulates functions prepare# and WQAlpha#.&lt;/p&gt;

&lt;p&gt;Take alpha 1 as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def prepare1(data, startTime, endTime){
    p = exec close from data where tradetime between startTime : endTime pivot by tradetime, securityid
    return p
}

def calAlpha1(data, startTime, endTime){
    input = prepare1(data, startTime, endTime)
    return WQAlpha1(input)
}

//call the module
use prepare101

res = calAlpha1(data, startTime, endTime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to matrices, the parameters of functions WQAlpha41, WQAlpha54, and WQAlpha101 can also be vectors.&lt;/p&gt;

&lt;p&gt;For example, you can calculate alpha 101 using a SQL statement as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use wq101alpha

res = select tradetime, securityid, `alpha101 as factorname, WQAlpha101(close, open, high, low) as val from data where tradetime between startTime : endTime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;2.4 Calculating Alphas with Industry Information&lt;br&gt;
To calculate alphas with industry information, you need to specify a table as the input.&lt;/p&gt;

&lt;p&gt;Take alpha 48 as an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use wq101alpha

res = WQAlpha48(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also use function calAlpha# in the prepare101 module.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def calAlpha48(data, startTime, endTime){
    input = select * from data where tradetime between startTime : endTime
    return WQAlpha48(input)
}

//call the module
use prepare101

res = calAlpha48(data, startTime, endTime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The alpha calculation in the paper 101 Formulaic Alphas adopts several industry classifications, such as &lt;code&gt;IndClass.subindustry&lt;/code&gt;, &lt;code&gt;IndClass.industry&lt;/code&gt;, and &lt;code&gt;IndClass.sector&lt;/code&gt;. For the sake of convenience, only &lt;code&gt;IndClass&lt;/code&gt; is used in this module.&lt;/p&gt;

&lt;p&gt;Functions in the wq101alpha module return a matrix or a table. You can save the results to a database if needed. Please refer to wq101alphaStorage.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Performance Comparison
&lt;/h2&gt;

&lt;p&gt;Our tests show that the wq101alpha module of DolphinDB outperforms Python pandas and NumPy.&lt;/p&gt;

&lt;p&gt;Hardware&lt;br&gt;
CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz&lt;/p&gt;

&lt;p&gt;OS: 64-bit CentOS Linux 7 (Core)&lt;/p&gt;

&lt;p&gt;Data&lt;br&gt;
We use one year of simulated daily data for performance testing (see TestData).&lt;/p&gt;

&lt;p&gt;3.1 DolphinDB vs. Python Pandas&lt;br&gt;
We compare the performance of alpha calculation implemented by DolphinDB module wq101alpha and Python pandas.&lt;/p&gt;

&lt;p&gt;The following is the main script for performance testing of the wq101alpha module (see full script in wq101alphaDDBTime):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;times = array(INT, 0)
defs()
for (i in 1:102){
    if (i in passList) times.append!(NULL)
    else{
        print(i)
        alphaName = exec name from defs() where name = "wq101alpha::WQAlpha"+string(i)
        alphaSyntax = exec syntax from defs() where name = "wq101alpha::WQAlpha"+string(i)
        function = alphaName + alphaSyntax
        t1 = time(now())
        res = parseExpr(function[0]).eval()
        t2 = time(now())
        times.append!(t2 - t1)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The following is the main script for performance testing of Python pandas (see full script in wq101alphaPyTime):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;times = []

nofunc = [48, 56, 58, 59, 63, 67, 69, 70, 76, 79, 80, 82, 87, 89, 90, 91, 93, 97, 100]

for i in range(1, 102):
    if i in nofunc:
        times.append('no function')
        continue
    else:
        factor = getattr(Alphas, "alpha{:03d}".format(i))
    try:
        t1 = time.time()
        res = factor(stock)
        t2 = time.time()
        times.append(t2 - t1)
    except Exception:
        times.append('error')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The execution time of all 101 alphas can be found in PerformanceComparison.&lt;/p&gt;

&lt;p&gt;Alphas that have not yet been implemented in Python pandas or whose results are questionable are excluded.&lt;/p&gt;

&lt;p&gt;69 alphas are available for comparison (in milliseconds):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb9gzducffmzj19qhw7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb9gzducffmzj19qhw7s.png" alt="Image description" width="720" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjfzslfcppym76ehjexs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjfzslfcppym76ehjexs.png" alt="Image description" width="720" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result shows that wq101alpha in DolphinDB outperforms Python pandas. DolphinDB is faster than Python by a median of 15.5x. For 27.5% of the alphas, DolphinDB is more than 100 times faster than Python.&lt;/p&gt;

&lt;p&gt;3.2 DolphinDB vs. NumPy&lt;br&gt;
Considering NumPy may have better performance than pandas, we choose 11 alphas that are time-consuming in pandas and implement them with NumPy. See partialAlphaNumpyTime for test results of NumPy.&lt;/p&gt;

&lt;p&gt;Performance comparison of DolphinDB and NumPy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdxxzzwigdmma2qopx9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdxxzzwigdmma2qopx9i.png" alt="Image description" width="720" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that while NumPy is faster than pandas, DolphinDB outperforms both.&lt;/p&gt;

&lt;p&gt;DolphinDB has optimized the implementation of its window functions. In comparison, window calculations implemented in NumPy with numpy.lib.stride_tricks.sliding_window_view are not similarly optimized.&lt;/p&gt;
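&lt;p&gt;The gap can be understood with a plain-Python sketch (illustrative only, not DolphinDB's actual implementation): recomputing each window from scratch costs O(n·w), while an incremental rolling sum costs O(n).&lt;/p&gt;

```python
# Why incremental window functions win: the naive version recomputes each
# window from scratch (O(n*w)); the incremental one maintains a running
# sum (O(n)). Plain-Python sketch, not DolphinDB's implementation.
def rolling_mean_naive(xs, w):
    return [sum(xs[i - w + 1:i + 1]) / w if i >= w - 1 else None
            for i in range(len(xs))]

def rolling_mean_incremental(xs, w):
    out, s = [], 0.0
    for i, x in enumerate(xs):
        s += x                      # add the value entering the window
        if i >= w:
            s -= xs[i - w]          # drop the value leaving the window
        out.append(s / w if i >= w - 1 else None)
    return out
```

&lt;p&gt;Both functions produce the same rolling means; only the amount of repeated work differs.&lt;/p&gt;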

&lt;h2&gt;
  
  
  4. Stream Processing
&lt;/h2&gt;

&lt;p&gt;Most alphas are complex to implement in real time and require more than one stream engine. DolphinDB provides the streamEngineParser function, which automatically forms a pipeline of stream engines to carry out the specified metric calculations. In streamEngineParser, you can directly call functions in the module wq101alpha.&lt;/p&gt;

&lt;p&gt;See full script in wq101alphaStreamTest for the implementation of real-time alpha calculations.&lt;/p&gt;

&lt;p&gt;Take alpha 1 for example:&lt;/p&gt;

&lt;p&gt;Define the schemata of input and output tables.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inputSchema = table(1:0, ["SecurityID","TradeTime","close"], [SYMBOL,TIMESTAMP,DOUBLE])
resultStream = table(10000:0, ["SecurityID","TradeTime","factor"], [SYMBOL,TIMESTAMP,DOUBLE])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Call the wq101alpha module and use WQAlpha1 as the metric for the streamEngineParser function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use wq101alpha
metrics = &amp;lt;[WQAlpha1(close)]&amp;gt;
streamEngine = streamEngineParser(name="WQAlpha1Parser", metrics=metrics, dummyTable=inputSchema, outputTable=resultStream, keyColumn="SecurityID", timeColumn="TradeTime", triggeringPattern='perBatch', triggeringInterval=4000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Check the status of the stream engines with function getStreamEngineStat().&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getStreamEngineStat()

// output
ReactiveStreamEngine-&amp;gt;
name            user  status lastErrMsg numGroups ...
--------------- ----- ------ ---------- --------- ...
WQAlpha1Parser0 admin OK                0         0
WQAlpha1Parser2 admin OK                0         0

CrossSectionalEngine-&amp;gt;
name            user  status lastErrMsg numRows ...
--------------- ----- ------ ---------- ------- ...
WQAlpha1Parser1 admin OK                0       2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Append data to the stream engines and check the output table &lt;code&gt;resultStream&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamEngine.append!(data)
//check the result
res = exec factor from resultStream pivot by TradeTime, SecurityID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;p&gt;This tutorial introduces how to calculate the 101 alphas with DolphinDB built-in functions in the wq101alpha module. The module features efficiency, speed, and simplicity, and achieves unified batch and stream processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: Required Parameters for Each Alpha
&lt;/h2&gt;

&lt;p&gt;Alphas without industry information&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ukwu8puz58fc75gulyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ukwu8puz58fc75gulyu.png" alt="Image description" width="720" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alphas with industry information&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyucpkneqy2fblspbkg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyucpkneqy2fblspbkg9.png" alt="Image description" width="720" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>alpha</category>
      <category>database</category>
    </item>
    <item>
      <title>Helpful Tools for Quant丨Efficiently Calculate Transaction Costs from Tick Data</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Fri, 17 May 2024 02:24:45 +0000</pubDate>
      <link>https://forem.com/dolphindb/helpful-tools-for-quantgun-efficiently-calculate-transaction-costs-from-tick-data-58h4</link>
      <guid>https://forem.com/dolphindb/helpful-tools-for-quantgun-efficiently-calculate-transaction-costs-from-tick-data-58h4</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbh2c1jfwk9ep78ayybl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbh2c1jfwk9ep78ayybl.png" alt="Image description" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The calculation of transaction costs from tick data often involves two tables: trade and nbbo. As the timestamps of both tables are at nanosecond level, there are virtually no exact matches between the timestamps of the two tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsanl287cd80m7xp1lia4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsanl287cd80m7xp1lia4.png" alt="Image description" width="720" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, in order to calculate transaction costs, we need to locate the most recent quote before each trade (of the same stock). We may also need to calculate the average quotes within a specific window relative to each trade. These non-exact joins are frequently used in quant finance, but they are not supported in most databases.&lt;/p&gt;
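&lt;p&gt;The semantics of an asof join can be sketched with a binary search over sorted quote timestamps (stdlib Python with made-up data; DolphinDB's &lt;code&gt;aj&lt;/code&gt; performs this natively and far faster):&lt;/p&gt;

```python
import bisect

# Sketch of asof-join semantics: for each trade, find the most recent
# quote at or before the trade's timestamp for the same symbol.
quotes = {  # symbol -> (sorted timestamps, matching (bid, offer) quotes)
    "AAPL": ([100, 250, 400],
             [(10.0, 10.2), (10.1, 10.3), (10.2, 10.4)]),
}

def asof_quote(symbol, trade_ts):
    ts_list, books = quotes[symbol]
    i = bisect.bisect_right(ts_list, trade_ts) - 1  # last quote <= trade_ts
    return books[i] if i >= 0 else None             # None: no quote yet
```

&lt;p&gt;A trade at timestamp 300, for example, is matched to the quote at 250; a trade before the first quote gets no match.&lt;/p&gt;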

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvudwlp6rf5rguy60nt83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvudwlp6rf5rguy60nt83.png" alt="Image description" width="720" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For these scenarios, DolphinDB provides asof join and window join.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72r7w8bfakp75849brv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72r7w8bfakp75849brv.png" alt="Image description" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qtsun3dqo9hpls8cyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2qtsun3dqo9hpls8cyr.png" alt="Image description" width="720" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take a look at the following case!&lt;/p&gt;

&lt;p&gt;The data used in this example is the high-frequency data from the New York Stock Exchange, consisting of two tables: trade and nbbo, respectively containing 27 million and 78 million records. The DolphinDB script is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trade = loadTable("dfs://EQY", "trade")
select count(*) from trade

nbbo = loadTable("dfs://EQY", "nbbo")
select count(*) from nbbo

// asof join
timer TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_aj from aj(trade,nbbo,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol

// window join
timer TC2 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_wj from pwj(trade,nbbo,-100000000:0,&amp;lt;[avg(Offer_Price) as Offer_Price, avg(Bid_Price) as Bid_Price]&amp;gt;,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol

select * from ej(TC1,TC2,`symbol) where symbol in `AAPL`MS`EBAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just one line of script can implement complex calculation logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// asof join
timer TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as TC_aj from aj(trade,nbbo,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by symbol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It takes 339 milliseconds to complete the calculation of transaction costs with asof join, which is more than 100 times faster than the equivalent calculation in Python pandas; the window join script takes 402 milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq6etpbswkw15io9pkdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq6etpbswkw15io9pkdm.png" alt="Image description" width="720" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While calculating transaction costs, DolphinDB delivers excellent performance with concise code. For a visual demonstration of the operations covered in this article, take a minute to watch this demo:&lt;br&gt;
&lt;a href="https://youtu.be/2HfTRLDYUrY"&gt;https://youtu.be/2HfTRLDYUrY&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Time Zones in DolphinDB</title>
      <dc:creator>DolphinDB</dc:creator>
      <pubDate>Fri, 17 May 2024 02:19:46 +0000</pubDate>
      <link>https://forem.com/dolphindb/time-zones-in-dolphindb-4fe</link>
      <guid>https://forem.com/dolphindb/time-zones-in-dolphindb-4fe</guid>
      <description>&lt;p&gt;This tutorial introduces how to address time zone-related issues when storing and importing temporal values to DolphinDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi644dmto0ad882ybkh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi644dmto0ad882ybkh6.png" alt="Image description" width="720" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Storing Temporal Values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Importing Temporal Values through DolphinDB Java API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Importing Temporal Values from MySQL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Importing Temporal Values from a CSV File&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conclusion&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DolphinDB has the following temporal data types:&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:640/format:webp/1*1YsJiGssgMWE3PHPMWHXjA.png"&gt;https://miro.medium.com/v2/resize:fit:640/format:webp/1*1YsJiGssgMWE3PHPMWHXjA.png&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Storage of Temporal Values
&lt;/h2&gt;

&lt;p&gt;Some databases store temporal values as Unix timestamps with a time zone offset. The time zone offset is the difference between local time and UTC (Coordinated Universal Time).&lt;/p&gt;

&lt;p&gt;In DolphinDB, temporal values are stored as local timestamps without time zone information. A local timestamp is equivalent to Unix timestamp + UTC offset (in seconds).&lt;/p&gt;
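&lt;p&gt;This identity can be checked with a short Python snippet, using a trade time of 2022.02.20 09:35:03 in UTC-5 (the example used in Scenario One):&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

# Verify "local timestamp = Unix timestamp + UTC offset (in seconds)"
# with the trade time 2022.02.20 09:35:03 in UTC-5.
utc_minus_5 = timezone(timedelta(hours=-5))
trade = datetime(2022, 2, 20, 9, 35, 3, tzinfo=utc_minus_5)

unix_ts = int(trade.timestamp())                            # seconds since epoch (UTC)
offset = int(utc_minus_5.utcoffset(trade).total_seconds())  # -18000 for UTC-5
local_ts = unix_ts + offset                                 # the local timestamp DolphinDB stores
```

&lt;p&gt;Here unix_ts is 1645367703 and local_ts is 1645349703, matching the values in Scenario One.&lt;/p&gt;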

&lt;p&gt;For scenarios that do not require storing time zone information, you can save the temporal values directly to DolphinDB. The data will be retrieved without any time zone conversion.&lt;/p&gt;

&lt;p&gt;In scenarios that require storing temporal values with time zone information, use gmtime to convert temporal values from local time to UTC time for storage. Use localtime (or convertTZ) to convert the data back to local time (or a specified time zone) for retrieval.&lt;/p&gt;
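&lt;p&gt;The same round trip can be mirrored in Python (a stdlib sketch using fixed UTC offsets; it only illustrates the conversions that gmtime and convertTZ perform, not the DolphinDB API itself):&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

# Store UTC, display in a chosen zone. Fixed offsets, illustrative only.
beijing = timezone(timedelta(hours=8))   # UTC+8
tokyo = timezone(timedelta(hours=9))     # UTC+9

# A deal recorded in Beijing time ...
local = datetime(2022, 5, 20, 11, 30, 5, tzinfo=beijing)
# ... is converted to UTC for storage (what gmtime does) ...
as_utc = local.astimezone(timezone.utc)
# ... and converted to the viewer's zone on retrieval
# (what localtime / convertTZ do).
back_tokyo = as_utc.astimezone(tokyo)
```

&lt;p&gt;The Beijing time 11:30:05 becomes 03:30:05 UTC in storage and is displayed as 12:30:05 to a viewer in Tokyo.&lt;/p&gt;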

&lt;p&gt;Two scenarios are presented below. By comparing the implementation of two storage solutions (with or without time zone information), we demonstrate the best practices for each scenario.&lt;/p&gt;

&lt;p&gt;2.1 Scenario One&lt;br&gt;
Context&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The stock records from a New York-based exchange are stored in a database. A column contains the trade time, e.g., 2022.02.20 09:35:03.&lt;/li&gt;
&lt;li&gt;The database is deployed in Beijing (UTC+8).&lt;/li&gt;
&lt;li&gt;Some of the client applications that consume the data are deployed in Beijing; the others are deployed in San Francisco (UTC-8).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Business Requirements&lt;/p&gt;

&lt;p&gt;All users, whether in Beijing or San Francisco, must see the trade data in the exchange’s local time (UTC-5). In this section, we will use the trade time of a single record, 2022.02.20 09:35:03, as an example.&lt;/p&gt;

&lt;p&gt;2.1.1 Solution One: Storing Time with Time Zone Information&lt;br&gt;
A database that stores time with time zone information processes the temporal values as follows:&lt;/p&gt;

&lt;p&gt;Convert the local trade time (UTC-5) 2022.02.20 09:35:03 to a local timestamp 1645349703.&lt;br&gt;
Get the Unix timestamp (1645367703) by subtracting the time zone offset. Save it to the database.&lt;br&gt;
If a user queries the database from a business application in Beijing (UTC+8) or Western US (UTC-8), the time is retrieved based on the time zone of the server. Therefore, additional coding would be required to display the local time of the New York-based exchange.&lt;br&gt;
In this scenario, we can see that storing temporal values with time zone information would introduce extra time conversions.&lt;/p&gt;
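The conversions in Solution One can be sketched with Java's java.time package (an illustration, not DolphinDB code; America/New_York stands in for the exchange's UTC-5 zone):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TradeTimestamps {
    public static void main(String[] args) {
        LocalDateTime trade = LocalDateTime.of(2022, 2, 20, 9, 35, 3);

        // Step 1: the "local timestamp" counts the wall-clock value against a zero offset
        long localTs = trade.toEpochSecond(ZoneOffset.UTC);
        System.out.println(localTs); // 1645349703

        // Step 2: the Unix timestamp interprets the same wall clock in the
        // exchange's zone (America/New_York, UTC-5 on this date)
        long unixTs = trade.atZone(ZoneId.of("America/New_York")).toEpochSecond();
        System.out.println(unixTs); // 1645367703
    }
}
```

The two values differ by exactly the 5-hour offset, which is the extra conversion this solution introduces.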

&lt;p&gt;2.1.2 Solution Two: Storing Local Time (Without Time Zone Information)&lt;br&gt;
DolphinDB stores time without time zone information. This is how it handles this scenario:&lt;/p&gt;

&lt;p&gt;Convert the local trade time 2022.02.20 09:35:03 (UTC-5) to a local timestamp, 1645349703, and save it to the database.&lt;br&gt;
When a user queries this record from a business application, convert the saved timestamp back into the DATETIME format 2022.02.20 09:35:03.&lt;br&gt;
In this scenario, storing the time without time zone information is obviously better. By directly storing the local time, there’s no need for Unix timestamp conversions and offset-related calculations in data retrieval.&lt;/p&gt;
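Solution Two's round trip can be sketched in java.time as well (illustrative only): the wall-clock value is counted against a zero offset when stored, and the identical wall clock comes back at retrieval with no time zone arithmetic.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class LocalTimeRoundTrip {
    public static void main(String[] args) {
        // Store: the wall clock 2022.02.20 09:35:03 counted against a zero offset
        long stored = LocalDateTime.of(2022, 2, 20, 9, 35, 3)
                                   .toEpochSecond(ZoneOffset.UTC);
        System.out.println(stored); // 1645349703

        // Retrieve: the same wall clock comes back - no offset calculation involved
        LocalDateTime retrieved = LocalDateTime.ofEpochSecond(stored, 0, ZoneOffset.UTC);
        System.out.println(retrieved); // 2022-02-20T09:35:03
    }
}
```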

&lt;p&gt;2.2 Scenario Two&lt;br&gt;
In some scenarios, time zone information must be taken into consideration. DolphinDB also provides built-in functions for time zone conversion.&lt;/p&gt;

&lt;p&gt;Context&lt;/p&gt;

&lt;p&gt;An online deal was made between a Chinese buyer and a Japanese vendor at 2022.05.20 11:30:05 Beijing Time (UTC+8), which is 2022.05.20 12:30:05 Tokyo Time (UTC+9).&lt;br&gt;
The e-commerce company has two data centers in China and Japan, which store transaction data from China and Japan, respectively.&lt;br&gt;
Business Requirements&lt;/p&gt;

&lt;p&gt;When users browse their orders, the transaction time must be displayed in their current time zone, i.e., the Chinese buyer sees the transaction time in Beijing Time and the Japanese vendor sees it in Tokyo time.&lt;/p&gt;

&lt;p&gt;2.2.1 Solution One: Storing Time with Time Zone Information&lt;br&gt;
Use the function gmtime to convert the Beijing time 2022.05.20 11:30:05 into UTC time, and get the corresponding Unix timestamp with the built-in function long. For the database server deployed in China, the script should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gmtime(2022.05.20 11:30:05)
2022.05.20 03:30:05
$ long(gmtime(2022.05.20 11:30:05))
1653017405
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the local transaction time is first converted to UTC time, the Unix timestamp, 1653017405, is the same for both the Chinese buyer and the Japanese vendor.&lt;/p&gt;

&lt;p&gt;When users browse the order, the client application retrieves the Unix timestamp from the database and converts it to the local system time. Converted from the same Unix timestamp 1653017405, the transaction time displayed to the Chinese buyer is 2022.05.20 11:30:05 (UTC+8) and 2022.05.20 12:30:05 (UTC+9) to the Japanese vendor.&lt;/p&gt;
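This client-side step can be sketched in Java (illustrative; any client library that converts a Unix timestamp to the user's zone behaves the same way):

```java
import java.time.Instant;
import java.time.ZoneId;

public class OrderTimeDisplay {
    public static void main(String[] args) {
        // The Unix timestamp stored in the database
        Instant txn = Instant.ofEpochSecond(1653017405L);

        // The same instant rendered in each user's time zone
        System.out.println(txn.atZone(ZoneId.of("Asia/Shanghai")).toLocalDateTime()); // 2022-05-20T11:30:05
        System.out.println(txn.atZone(ZoneId.of("Asia/Tokyo")).toLocalDateTime());    // 2022-05-20T12:30:05
    }
}
```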

&lt;p&gt;In DolphinDB, you can convert UTC time to your local time zone using the built-in function localtime. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ localtime(2022.05.20 03:30:05)
2022.05.20 11:30:05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.2.2 Solution Two: Storing Local Time (Without Time Zone Information)&lt;br&gt;
To store the transaction time in the local time zone, the following extra steps are required:&lt;/p&gt;

&lt;p&gt;Decide which time zone should be the “local” time zone.&lt;br&gt;
Calculate the time zone offset between the “local” time zone and the other time zone during the time conversion.&lt;br&gt;
In this scenario, it is clear that we should adopt solution one: store the Unix timestamp of the transaction time and let the server automatically convert the timestamp to the local time when the data is retrieved. If we store the time without any conversion, we would have to convert the time between time zones later at retrieval, and handle other potential issues (daylight saving time, for example) that arise in such manual conversion.&lt;/p&gt;
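The daylight saving time pitfall mentioned above shows up in a small java.time example (illustrative; the US Eastern zone entered DST on 2022.03.13, so the wall clock 02:30 never occurred that day):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstPitfall {
    public static void main(String[] args) {
        ZoneId eastern = ZoneId.of("America/New_York");

        // US DST began on 2022-03-13 at 02:00 local time: clocks jumped straight to 03:00
        ZonedDateTime beforeGap = LocalDateTime.of(2022, 3, 13, 1, 30, 0).atZone(eastern);

        // One hour later the zone-aware wall clock reads 03:30; naive
        // fixed-offset arithmetic would produce the nonexistent 02:30 instead
        System.out.println(beforeGap.plusHours(1).toLocalTime()); // 03:30
    }
}
```

A zone-aware library handles such transitions automatically; hand-rolled offset math does not.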

&lt;p&gt;3. Importing Time Using DolphinDB Java API&lt;br&gt;
This section uses the DolphinDB Java API 1.30.17.1.&lt;/p&gt;

&lt;p&gt;There are four common classes that represent dates and times in Java: LocalDate, LocalTime, LocalDateTime, and Date. We recommend using LocalDate, LocalTime and LocalDateTime, introduced in Java 8, with the DolphinDB Java API to process temporal values. The Date class should be used with caution (see Section 3.2 “Date Class: Use with Caution”).&lt;/p&gt;

&lt;p&gt;3.1 Best Practices&lt;br&gt;
Using LocalDateTime&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LocalDateTime now = LocalDateTime.now();//obtain the current date-time
LocalDateTime time = LocalDateTime.of(2022,2,20,9,35,3);// the of method obtains an instance of LocalDateTime from year, month, day, hour, minute, second and nanosecond

// time-related operations
time.isAfter(now);
time.isBefore(now);
System.out.println(time.plusDays(1L));
System.out.println(time.minusHours(8));

// get the current time from the system clock in the specified time-zone
LocalDateTime.now(ZoneId.of("Australia/Sydney"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LocalDateTime represents a local date-time object without timezone information. It is recommended to use this class to represent temporal values if you want to import them into a DolphinDB database.&lt;/p&gt;

&lt;p&gt;Scenario: Import NYSE stock data from a CSV file (see appendix). The date and time values must be consistent with those in the CSV file.&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:720/format:webp/1*4fma1hlaFSI3jieCstbEPw.png"&gt;https://miro.medium.com/v2/resize:fit:720/format:webp/1*4fma1hlaFSI3jieCstbEPw.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1. Create a shared in-memory table in DolphinDB&lt;br&gt;
Create a table and specify its schema based on the data in the CSV file. Below is an example of creating a shared in-memory table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conn.run("t = table(100:0,`symbol`datetime`bid`ofr`bidsize`ofrsize`mode`ex`mmid,[SYMBOL,DATETIME,DOUBLE,DOUBLE,LONG,LONG,INT,CHAR,SYMBOL])\n" +
"share t as timeTest;");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Parse the CSV file&lt;/p&gt;

&lt;p&gt;In this example we use OpenCSV, an open-source third-party library, to parse CSV files. You can also use your preferred tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSVReader csvReader = new CSVReader(reader) 
String[] record;
csvReader.readNext();// skip first line
while ((record = csvReader.readNext()) != null) {
    // record[i] corresponds to a cell in a row. See appendix for complete script
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Define the method for handling temporal values&lt;/p&gt;

&lt;p&gt;All parsed columns are of the STRING type. Use the DateTimeFormatter class with the pattern “yyyy.MM.dd H:mm:ss” to convert the strings into LocalDateTime objects. Then convert the LocalDateTime objects to local timestamps so they can be added to the BasicDateTimeVector constructed in the next step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public static int getTime(String timeStr){
    DateTimeFormatter df = DateTimeFormatter.ofPattern("yyyy.MM.dd H:mm:ss");
    LocalDateTime ldt = LocalDateTime.parse(timeStr,df);
    return Utils.countSeconds(ldt);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. Batch insert data into DolphinDB&lt;/p&gt;

&lt;p&gt;Construct a BasicVector for each column so that all columns can be inserted in one batch. See the appendix for the full script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public static void write() throws IOException {
    // example of list construction
    LinkedList&amp;lt;String&amp;gt; symbolList = new LinkedList&amp;lt;&amp;gt;(); 
    try (Reader reader = Files.newBufferedReader(Paths.get(CSV)));
        // read csv file and insert data into lists
    } catch (IOException | CsvValidationException ex) {
        ex.printStackTrace();
    }

    List&amp;lt;Entity&amp;gt; data = Arrays.asList(
        new BasicSymbolVector(symbolList),
        new BasicDateTimeVector(dtList),
        new BasicDoubleVector(bidList),
        new BasicDoubleVector(ofrList),
        new BasicLongVector(bidSizeList),
        new BasicLongVector(ofrSizeList),
        new BasicIntVector(modeList),
        new BasicByteVector(exList),
        new BasicSymbolVector(mmidList)
    );
    conn.run("tableInsert{\"timeTest\"}", data);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5. Validate the result&lt;/p&gt;

&lt;p&gt;Get the entire table using the Java API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BasicTable res = (BasicTable) conn.run("select * from timeTest");
System.out.println(res.getString());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get the same table using the DolphinDB GUI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t = select * from timeTest
print(t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results of both queries are as follows:&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:720/format:webp/1*tOu21W7QyM3KJIlg9HlfYg.png"&gt;https://miro.medium.com/v2/resize:fit:720/format:webp/1*tOu21W7QyM3KJIlg9HlfYg.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The “datetime” column in both of the above results is consistent with the dates and times in the original CSV file.&lt;/p&gt;

&lt;p&gt;3.2 Date Class: Use with Caution&lt;br&gt;
Users who are not familiar with how the Date class is stored and printed in Java may find the temporal values “inconsistent“ before and after the data import to DolphinDB.&lt;/p&gt;

&lt;p&gt;1. Instantiate the Date class&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Date date = new Date();
System.out.println(date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that while the date object is printed in system time format, it is stored as a Unix timestamp (of the LONG type).&lt;/p&gt;

&lt;p&gt;2. Import data to DolphinDB&lt;/p&gt;

&lt;p&gt;(1) Execute the script below to create a shared in-memory table with two columns in DolphinDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conn.run("t = table(100:0,`ts`value,[TIMESTAMP,INT]);share t as timeJava;");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(2) Insert a record to the table through Java API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public static void write() throws IOException {
    DBConnection conn = new DBConnection();
    conn.connect("localhost", 8848, "admin", "123456");
    LinkedList&amp;lt;Long&amp;gt; dtList = new LinkedList&amp;lt;&amp;gt;();// ts
    LinkedList&amp;lt;Integer&amp;gt; valueList = new LinkedList&amp;lt;&amp;gt;();// value
    Date date = new Date();
    System.out.println(date);
    dtList.add(date.getTime());
    valueList.add(1);
    List&amp;lt;Entity&amp;gt; data = Arrays.asList(
    new BasicTimestampVector(dtList),
    new BasicIntVector(valueList)
    );
    conn.run("tableInsert{\"timeJava\"}", data);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(3) Print the date object that was inserted into the DolphinDB table to your console with System.out.println(date). Meanwhile, check the inserted record in the DolphinDB GUI. Compare the two temporal values:&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:720/format:webp/0*aVrREisBTplzcuY8.png"&gt;https://miro.medium.com/v2/resize:fit:720/format:webp/0*aVrREisBTplzcuY8.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two values do not match. The reason is as follows:&lt;/p&gt;

&lt;p&gt;The Date object is instantiated from the Unix timestamp returned by System.currentTimeMillis(). For more information about the constructor of Date, see the official documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/**
* Allocates a &amp;lt;code&amp;gt;Date&amp;lt;/code&amp;gt; object and initializes it so that
* it represents the time at which it was allocated, measured to the
* nearest millisecond.
*
* @see     java.lang.System#currentTimeMillis()
*/
public Date() {
 this(System.currentTimeMillis());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Java, the string representation (printed output) of the Date class is customized by overriding its toString method, which involves time zone conversion (see the source code below). Therefore, although a Date is stored as a Unix timestamp, it is printed based on the time zone of the server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public String toString() {
        // "EEE MMM dd HH:mm:ss zzz yyyy";
        BaseCalendar.Date date = normalize();
        StringBuilder sb = new StringBuilder(28);
        int index = date.getDayOfWeek();
        if (index == BaseCalendar.SUNDAY) {
            index = 8;
        }
        convertToAbbr(sb, wtb[index]).append(' ');                        // EEE
        convertToAbbr(sb, wtb[date.getMonth() - 1 + 2 + 7]).append(' ');  // MMM
        CalendarUtils.sprintf0d(sb, date.getDayOfMonth(), 2).append(' '); // dd
        CalendarUtils.sprintf0d(sb, date.getHours(), 2).append(':');   // HH
        CalendarUtils.sprintf0d(sb, date.getMinutes(), 2).append(':'); // mm
        CalendarUtils.sprintf0d(sb, date.getSeconds(), 2).append(' '); // ss
        TimeZone zi = date.getZone();
        if (zi != null) {
            sb.append(zi.getDisplayName(date.isDaylightTime(), TimeZone.SHORT, Locale.US)); // zzz
        } else {
            sb.append("GMT");
        }
        sb.append(' ').append(date.getYear());  // yyyy
        return sb.toString();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DolphinDB, on the other hand, stores the Unix timestamp of the imported time in the LONG type and displays it in the UTC time zone.&lt;/p&gt;

&lt;p&gt;Therefore, although temporal values appear “inconsistent“ before and after the import to DolphinDB, nothing is in fact modified during the process.&lt;/p&gt;
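The same effect can be reproduced with a fixed timestamp (a sketch; 1645367703000 ms corresponds to 2022.02.20 14:35:03 UTC):

```java
import java.util.Date;

public class DateDisplay {
    public static void main(String[] args) {
        // A fixed Unix timestamp in milliseconds: 2022-02-20T14:35:03 UTC
        Date date = new Date(1645367703000L);

        // toString renders the timestamp in the JVM's default time zone...
        System.out.println(date);

        // ...while the stored value itself is time-zone free; its UTC view is
        // what DolphinDB displays
        System.out.println(date.toInstant()); // 2022-02-20T14:35:03Z
    }
}
```

The first line varies with the machine's time zone; the second does not. The underlying long never changes.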

&lt;p&gt;4. Importing Time from MySQL&lt;/p&gt;

&lt;p&gt;4.1 MySQL Data Types and Time Zones&lt;br&gt;
This section discusses these temporal data types in MySQL 5.7: DATETIME, TIMESTAMP, DATE and TIME.&lt;/p&gt;

&lt;p&gt;MySQL converts TIMESTAMP values from the current time zone to Unix timestamps for storage, and back from Unix timestamps to the current time zone for retrieval.&lt;br&gt;
Such conversion does not occur for DATE, TIME or DATETIME.&lt;br&gt;
Use the command below to check the values of the time zone variables in MySQL. The output shows that the system time zone is “CST” (China Standard Time, i.e., UTC+8) and that the server time zone follows the system time zone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; show variables like '%time_zone%';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| system_time_zone | CST    |
| time_zone        | SYSTEM |
+------------------+--------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execute the following commands to create a database and a table in MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; USE testdb;
mysql&amp;gt; CREATE TABLE testTable(
    -&amp;gt; date DATE NOT NULL,
    -&amp;gt; time TIME NOT NULL,
    -&amp;gt; ts TIMESTAMP NOT NULL
    -&amp;gt; );
Query OK, 0 rows affected (0.11 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert a record. Check the data in the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; select * from testTable;
+------------+----------+---------------------+
| date       | time     | ts                  |
+------------+----------+---------------------+
| 2022-06-14 | 23:13:15 | 2022-06-14 23:13:15 |
+------------+----------+---------------------+
1 row in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the session time zone to UTC+9 and check the table again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
mysql&amp;gt; set time_zone='+09:00';
Query OK, 0 rows affected (0.01 sec)

mysql&amp;gt; show variables like '%time_zone%';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| system_time_zone | CST    |
| time_zone        | +09:00 |
+------------------+--------+
2 rows in set (0.00 sec)

mysql&amp;gt; select * from testTable;
+------------+----------+---------------------+
| date       | time     | ts                  |
+------------+----------+---------------------+
| 2022-06-14 | 23:13:15 | 2022-06-15 00:13:15 |
+------------+----------+---------------------+
1 row in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare the query results before and after the time zone change: the “date” and “time” columns contain fixed DATE and TIME values and therefore do not change; the TIMESTAMP values in the “ts” column move forward 1 hour as the time zone changed from UTC+8 to UTC+9.&lt;/p&gt;
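The 1-hour shift can be verified independently with java.time (illustrative):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class TimestampShift {
    public static void main(String[] args) {
        // The ts value as displayed under the original session time zone (UTC+8)
        ZonedDateTime utc8 = LocalDateTime.of(2022, 6, 14, 23, 13, 15)
                                          .atZone(ZoneOffset.ofHours(8));

        // The same instant displayed under time_zone='+09:00'
        System.out.println(utc8.withZoneSameInstant(ZoneOffset.ofHours(9))
                               .toLocalDateTime()); // 2022-06-15T00:13:15
    }
}
```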

&lt;p&gt;4.2 Importing MySQL Temporal Values into DolphinDB&lt;br&gt;
The DolphinDB MySQL plugin allows you to smoothly import data from MySQL into DolphinDB with automatic data type conversion. The supported temporal data types are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F352lwjecwmx3qbbio1dc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F352lwjecwmx3qbbio1dc.png" alt="Image description" width="492" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DolphinDB also has the following additional temporal data types: MONTH, MINUTE, SECOND, NANOTIME, NANOTIMESTAMP&lt;/p&gt;

&lt;p&gt;First, import a CSV file containing NYSE stock data into MySQL. The global time_zone value is set to UTC.&lt;/p&gt;

&lt;p&gt;The CSV file and the script for importing it into MySQL can be downloaded in the Appendix. To better compare the DATE, TIME and TIMESTAMP values in the result, a “ts” column (of the TIMESTAMP type) containing the trades’ Unix timestamps (which are 5 hours ahead of the exchange’s local time) is added to the table.&lt;/p&gt;

&lt;p&gt;Check the imported data in MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; select * from taqTs limit 5;
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
| symbol | date       | time     | bid   | ofr   | bidsiz | ofrsiz | mode | ex  | mmid | ts                  |
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
| A      | 2022-02-20 | 06:24:34 |  1.00 |  0.00 |      1 |      0 |   12 | 'P' | null | 2022-02-20 11:24:34 |
| A      | 2022-02-20 | 06:24:56 |  0.00 |  0.00 |      0 |      0 |   12 | 'P' | null | 2022-02-20 11:24:56 |
| A      | 2022-02-20 | 07:02:00 |  0.00 | 54.84 |      0 |      1 |   12 | 'T' | null | 2022-02-20 12:02:00 |
| A      | 2022-02-20 | 07:02:00 | 22.17 | 54.84 |      1 |      1 |   12 | 'T' | null | 2022-02-20 12:02:00 |
| A      | 2022-02-20 | 07:08:22 | 37.00 | 54.84 |      1 |      1 |   12 | 'T' | null | 2022-02-20 12:08:22 |
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load the MySQL plugin in DolphinDB and import the table from MySQL into DolphinDB as an in-memory table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// load plugin
loadPlugin("/DolphinDB/server/plugins/mysql/PluginMySQL.txt");

// connect to MySQL Server
conn = mysql::connect(`127.0.0.1, 3306, `root, "root", `testdb)

// import data
taqDdb = mysql::load(conn, "SELECT * FROM taqTs");

// view imported data
select * from taqDdb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The temporal values in the result are consistent with the DATE, TIME and TIMESTAMP values in MySQL. Therefore, to import Unix timestamps of the TIMESTAMP type from MySQL to DolphinDB, make sure that the global time_zone system variable is set to UTC.&lt;/p&gt;

&lt;p&gt;Note that in MySQL the time_zone variable can be set at global and session level. DolphinDB stores the imported data from MySQL according to the global time zone. Therefore, if the temporal values are inconsistent before and after the import, check whether your session time zone is different from the global time zone in MySQL.&lt;/p&gt;

&lt;p&gt;For example, set time_zone = '+08:00' in a MySQL session and then query the table. The query result shows that in this particular session, the “ts” column values are displayed in UTC+8 time instead of UTC time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; set time_zone='+08:00';
Query OK, 0 rows affected (0.00 sec)

mysql&amp;gt; select * from taqTs limit 5;
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
| symbol | date       | time     | bid   | ofr   | bidsiz | ofrsiz | mode | ex  | mmid | ts                  |
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
| A      | 2022-02-20 | 06:24:34 |  1.00 |  0.00 |      1 |      0 |   12 | 'P' | null | 2022-02-20 19:24:34 |
| A      | 2022-02-20 | 06:24:56 |  0.00 |  0.00 |      0 |      0 |   12 | 'P' | null | 2022-02-20 19:24:56 |
| A      | 2022-02-20 | 07:02:00 |  0.00 | 54.84 |      0 |      1 |   12 | 'T' | null | 2022-02-20 20:02:00 |
| A      | 2022-02-20 | 07:02:00 | 22.17 | 54.84 |      1 |      1 |   12 | 'T' | null | 2022-02-20 20:02:00 |
| A      | 2022-02-20 | 07:08:22 | 37.00 | 54.84 |      1 |      1 |   12 | 'T' | null | 2022-02-20 20:08:22 |
+--------+------------+----------+-------+-------+--------+--------+------+-----+------+---------------------+
5 rows in set (0.00 sec)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect DolphinDB to MySQL again and import the above table. As the MySQL global time zone is still UTC and the session time zone is not specified, the “ts” column in the imported table remains unchanged.&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:720/format:webp/0*17zp6F3zhn0KVCPF.png"&gt;https://miro.medium.com/v2/resize:fit:720/format:webp/0*17zp6F3zhn0KVCPF.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we change the MySQL global time zone to UTC+8 through set global time_zone='+08:00'. Reconnect to MySQL to import the table and check the result in DolphinDB. As TIMESTAMP values are stored as Unix timestamps and converted back to the current time zone for retrieval in MySQL, the “ts” column values are displayed in UTC+8 time in DolphinDB:&lt;br&gt;
&lt;a href="https://miro.medium.com/v2/resize:fit:720/format:webp/0*8zubZMz6s6PsjSvW.png"&gt;https://miro.medium.com/v2/resize:fit:720/format:webp/0*8zubZMz6s6PsjSvW.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. Importing Time from a CSV File&lt;br&gt;
With the functions loadText and loadTextEx, you can import CSV files into in-memory tables or DFS tables, respectively. This section uses loadText as an example to describe how to import time from a CSV file.&lt;/p&gt;

&lt;p&gt;Note: It is recommended to check the data type of each column with the extractTextSchema function before loading. If a column does not have the expected data type, enter the correct data type of the column in the schema table. For a temporal column, we also need to specify a format such as “MM/dd/yyyy” in the schema table. For more information, see Parsing and Format of Temporal Variables and Importing Text Files Tutorial.&lt;/p&gt;

&lt;p&gt;Load the CSV file into DolphinDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = loadText("/data/taq.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify the DATE type for the “date“ column values and TIME type for the “time“ column values. The temporal values will be stored without time zone information.&lt;/p&gt;

&lt;p&gt;DolphinDB provides time zone conversion functions for retrieving temporal values from the database. Take the “data” table as an example:&lt;/p&gt;

&lt;p&gt;gmtime — converts local time (UTC+8 in this example) to GMT (UTC+0 zone):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gmData = select gmtime(concatDateTime(date,time)) as utc, * from data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;localtime — converts GMT (UTC+0 zone) to local time zone (UTC+8 in this example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localData = select localtime(utc) as localtime,* from gmData;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;convertTZ — converts time between two specified time zones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convertTZ(2022.04.25T08:25:45,"US/Eastern","Asia/Shanghai");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DolphinDB stores temporal values without time zone information. Instead, it lets you decide in which time zone you would like to store and retrieve temporal values by offering various time zone conversion functions.&lt;/p&gt;
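For reference, the convertTZ call above can be mirrored with java.time (illustrative; note that US/Eastern observes daylight saving time, i.e., UTC-4, on this date):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class ConvertTz {
    public static void main(String[] args) {
        // Mirrors convertTZ(2022.04.25T08:25:45,"US/Eastern","Asia/Shanghai")
        LocalDateTime converted = LocalDateTime.of(2022, 4, 25, 8, 25, 45)
                .atZone(ZoneId.of("US/Eastern"))                 // EDT (UTC-4) on this date
                .withZoneSameInstant(ZoneId.of("Asia/Shanghai")) // UTC+8
                .toLocalDateTime();
        System.out.println(converted); // 2022-04-25T20:25:45
    }
}
```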

&lt;p&gt;6. Conclusion&lt;br&gt;
You may encounter the following issues with temporal data storage in DolphinDB:&lt;/p&gt;

&lt;p&gt;After data migration, temporal values displayed in DolphinDB are different from how they are displayed in the source database.&lt;br&gt;
DolphinDB stores the timestamps retrieved from the source database as integers without making any modifications. The examples of Java API and MySQL plugin describe the reason for and solution to the “inconsistency” in time display.&lt;/p&gt;

&lt;p&gt;Storing temporal values without time zone information.&lt;br&gt;
Save the temporal values (of DATE, TIME or TIMESTAMP type) directly to DolphinDB. The data will be retrieved without any time zone conversion.&lt;/p&gt;

&lt;p&gt;Storing temporal values with time zone information.&lt;br&gt;
Use gmtime to convert temporal values from local time to UTC time for storage. Use localtime (or convertTZ) to convert the data back to local time (or a specified time zone) for retrieval.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
