Forem: MangoQuant

Qlib - Model Training & Prediction

MangoQuant — Sat, 18 Oct 2025 07:37:32 +0000

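We already got the quick start running earlier and can train a model by editing the YAML config file. For more flexibility, though, you need to write the code by hand, and that is what this post covers.
I started from the official documentation, but the examples did not run as-is in my environment (macOS 11, qlib 0.9.7, Python 3.11), so I adapted them and added some metric-analysis code. Let's walk through it together.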
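Since whether the code runs seems to depend on the environment, it can help to print your own versions and compare them with mine. Here is a minimal sketch using only the standard library; the package list is my own choice for illustration (`pyqlib` is the name qlib is published under on PyPI), and `describe_environment` is a hypothetical helper, not part of qlib.

```python
import platform
from importlib import metadata


def describe_environment(packages=("pyqlib", "lightgbm", "pandas")):
    """Return a dict with the Python version, OS, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "os": f"{platform.system()} {platform.release()}",
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is simply not installed in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```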

1. Initialization

# Complete merged script: train the model + compute IC metrics
# reference: https://qlib.readthedocs.io/en/latest/component/model.html

import qlib
import pandas as pd
from qlib.contrib.model.gbdt import LGBModel
from qlib.contrib.data.handler import Alpha158
from qlib.utils import init_instance_by_config, flatten_dict
from qlib.workflow import R
from qlib.workflow.record_temp import SignalRecord, PortAnaRecord
from qlib.contrib.eva.alpha import calc_ic

# Initialize Qlib with the local data path
qlib.init(provider_uri="~/Documents/code/my_develop/qlib_data/cn_data_snapshot", region="cn")

market = "csi300"
benchmark = "SH000300"

# Data handler configuration
data_handler_config = {
    "start_time": "2008-01-01",
    "end_time": "2020-08-01",
    "fit_start_time": "2008-01-01",
    "fit_end_time": "2014-12-31",
    "instruments": market,
}

First we import the Python packages we need. Change the provider_uri path to point to your own data directory.
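Before calling qlib.init, a quick check of the data directory can save some confusion. The sketch below assumes the usual layout of a Qlib data bundle (`calendars`, `instruments`, and `features` sub-directories); `missing_qlib_dirs` is a hypothetical helper of mine, not a qlib function.

```python
from pathlib import Path

# Sub-directories a Qlib data bundle normally contains (assumed layout).
EXPECTED_SUBDIRS = ("calendars", "instruments", "features")


def missing_qlib_dirs(provider_uri):
    """Return the expected sub-directories that are absent under provider_uri."""
    root = Path(provider_uri).expanduser()  # expand a leading "~"
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    path = "~/Documents/code/my_develop/qlib_data/cn_data_snapshot"
    missing = missing_qlib_dirs(path)
    if missing:
        print(f"{path} is missing: {', '.join(missing)}")
    else:
        print(f"{path} looks like a Qlib data directory")
```

An empty result only means the folders exist; it does not verify that the data inside covers the date range in data_handler_config.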

2. Configuration


# Task configuration: model + dataset
task = {
    "model": {
        "class": "LGBModel",
        "module_path": "qlib.contrib.model.gbdt",
        "kwargs": {
            "loss": "mse",
            "colsample_bytree": 0.8879,
            "learning_rate": 0.0421,
            "subsample": 0.8789,
            "lambda_l1": 205.6999,
            "lambda_l2": 580.9768,
            "max_depth": 8,
            "num_leaves": 210,
            "num_threads": 20,
        },
    },
    "dataset": {
        "class": "DatasetH",
        "module_path": "qlib.data.dataset",
        "kwargs": {
            "handler": {
                "class": "Alpha158",
                "module_path": "qlib.contrib.data.handler",
                "kwargs": data_handler_config,
            },
            "segments": {
                "train": ("2008-01-01", "2014-12-31"),
                "valid": ("2015-01-01", "2016-12-31"),
                "test": ("2017-01-01", "2020-08-01"),
            },
        },
    },
}

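For the configuration details, see my earlier post, "Qlib - Workflow Configuration Explained".
The two are essentially the same thing; one is written as YAML and the other as a Python dict.
This example uses LightGBM, which I also covered earlier in my post on the paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree".
The features are the Alpha158 factors (also covered earlier), that is, a set of 158 indicators; the LightGBM model is then trained and used for prediction on top of them.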
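To make the `class` / `module_path` / `kwargs` pattern more concrete, here is a simplified sketch of how such a config entry can be turned into an object. It is only an illustration of the idea, not qlib's actual init_instance_by_config, which accepts more input forms. `build_from_config` is a hypothetical name, and a standard-library class stands in for LGBModel so the sketch runs without qlib.

```python
import importlib


def build_from_config(config):
    """Import config["module_path"], look up config["class"], and call it with config["kwargs"]."""
    module = importlib.import_module(config["module_path"])
    cls = getattr(module, config["class"])
    return cls(**config.get("kwargs", {}))


if __name__ == "__main__":
    # Same three keys as task["model"], but pointing at a stdlib class.
    config = {
        "class": "Fraction",
        "module_path": "fractions",
        "kwargs": {"numerator": 3, "denominator": 4},
    }
    print(build_from_config(config))
```

This is also why the YAML and dict forms are interchangeable: once the YAML is parsed, it is the same nested dict.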

3. Training the Model


def main():
    print("[Step 1] Initializing model and dataset...")
    model = init_instance_by_config(task["model"])
    dataset = init_instance_by_config(task["dataset"])

    print("[Step 2] Starting the experiment and training the model...")
    with R.start(experiment_name="workflow"):
        R.log_params(**flatten_dict(task))
        model.fit(dataset)

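The code is wrapped in a main() function that is only called under the `if __name__ == '__main__':` guard; without it, the script fails on macOS. The likely reason is that Python on macOS starts worker processes with the spawn method, which re-imports the script in every worker; the repeated initialization messages in the log below are consistent with that.
We first build the model and the dataset,
then train the model with model.fit(dataset).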
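The call R.log_params(**flatten_dict(task)) records the nested task config as flat key/value parameters. As a rough sketch of what flattening means here, the function below joins nested keys with a dot. It is my own simplified version for illustration, not qlib's flatten_dict, and the separator is an assumption.

```python
def flatten(nested, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with sep."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dicts and merge their flattened entries.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


if __name__ == "__main__":
    # A small excerpt of the task config from this post.
    task_excerpt = {
        "model": {
            "class": "LGBModel",
            "kwargs": {"loss": "mse", "max_depth": 8},
        }
    }
    for name, value in flatten(task_excerpt).items():
        print(name, "=", value)
```

With flat names like these, each hyperparameter becomes its own entry in the experiment record, which makes runs easier to compare.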

4. Model Prediction

        print("[Step 3] Generating prediction signals...")
        recorder = R.get_recorder()
        sr = SignalRecord(model, dataset, recorder)
        sr.generate()

        print("[Step 4] Getting the recorder_id of the current experiment for reading results later...")
        recorder_id = recorder.id
        print(f"recorder_id of the current experiment: {recorder_id}")

This step runs the model on the test set and saves the predictions locally,
so they can be used in the analysis that follows.
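The saved files live under the recorder's artifact location, which the script in this post turns into a local path by stripping the `file://` prefix with str.replace. If your path contains spaces or other escaped characters, a slightly more careful conversion may be needed. Here is a minimal standard-library sketch; `local_path_from_uri` is a hypothetical helper, and I am assuming the artifact location is either a `file://` URI or a plain path.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_uri(uri):
    """Convert a file:// URI (or a plain path) into a local filesystem path."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        # url2pathname also decodes percent-escapes such as %20.
        return url2pathname(parsed.path)
    if parsed.scheme == "":
        # No scheme at all: treat the input as a plain local path.
        return uri
    raise ValueError(f"not a local artifact location: {uri}")


if __name__ == "__main__":
    print(local_path_from_uri("file:///tmp/my%20runs/artifacts"))
```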

5. Evaluation: Computing IC

    # Read the prediction results via recorder_id
    print("[Step 5] Reading predictions and computing IC...")
    recorder = R.get_recorder(experiment_name="workflow", recorder_id=recorder_id)
    print("Saved artifacts:", recorder.list_artifacts())

    # Resolve the local artifact path
    artifact_path = recorder.artifact_uri.replace("file://", "")
    pred = pd.read_pickle(f"{artifact_path}/pred.pkl")
    label = pd.read_pickle(f"{artifact_path}/label.pkl")

    print("Predictions (first 5 rows):")
    print(pred.head())
    print("Predictions (last 5 rows):")
    print(pred.tail())
    print("Prediction dates:", pred.index.get_level_values('datetime').unique())

    print("Labels (first 5 rows):")
    print(label.head())
    print("Labels (last 5 rows):")
    print(label.tail())

    # Compute IC; calc_ic returns a pair: (daily IC, daily Rank IC)
    ic = calc_ic(pred['score'], label['LABEL0'])

    print("[Step 6] IC statistics:")
    print("IC mean:", ic[0].mean())
    print("IC std:", ic[0].std())
    print("IC mean absolute value:", ic[0].abs().mean())
    print("Rank IC mean:", ic[1].mean())
    print("Rank IC std:", ic[1].std())
    print("Rank IC mean absolute value:", ic[1].abs().mean())

if __name__ == '__main__':
    main()

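This code reads the locally saved predictions and labels and analyzes the results.
I print the head and tail of each first, to check the data format and compare the dates.
Then it computes IC and Rank IC, which makes the model's performance easy to judge.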
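If IC and Rank IC are new to you: IC is conventionally the per-day Pearson correlation between the predicted scores and the labels across stocks, and Rank IC is the same correlation computed on ranks (Spearman). The sketch below works through that on a tiny made-up dataset using only the standard library. It illustrates the idea and is not qlib's calc_ic implementation; the dates and numbers are invented.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (both need non-zero variance)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def daily_ic(rows):
    """rows: iterable of (date, score, label). Returns {date: (ic, rank_ic)}."""
    by_date = defaultdict(lambda: ([], []))
    for date, score, label in rows:
        by_date[date][0].append(score)
        by_date[date][1].append(label)
    return {
        date: (pearson(s, l), pearson(ranks(s), ranks(l)))
        for date, (s, l) in by_date.items()
    }


if __name__ == "__main__":
    # Invented data: three stocks on each of two days.
    rows = [
        ("2017-01-03", 0.1, 0.01), ("2017-01-03", 0.2, 0.04), ("2017-01-03", 0.3, 0.09),
        ("2017-01-04", 0.3, -0.02), ("2017-01-04", 0.1, 0.01), ("2017-01-04", 0.2, 0.00),
    ]
    result = daily_ic(rows)
    for date, (ic, rank_ic) in result.items():
        print(date, "IC:", round(ic, 4), "Rank IC:", round(rank_ic, 4))
    print("IC mean:", mean(ic for ic, _ in result.values()))
```

On the first day the scores order the stocks exactly like the labels, and on the second day exactly in reverse, so the daily values have opposite signs. This is why the script reports the mean of the absolute values alongside the plain mean.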

6. Run Results

Run the script with python qlib_model_demo.py; it produces the output below (the run takes about 2 minutes):

(freq) test1@budas-MacBook-Pro user % python qlib_model_demo.py
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
[56122:MainThread](2025-10-18 11:31:21,785) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56122:MainThread](2025-10-18 11:31:21,795) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56122:MainThread](2025-10-18 11:31:21,796) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[Step 1] Initializing model and dataset...
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
[56133:MainThread](2025-10-18 11:31:28,381) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56131:MainThread](2025-10-18 11:31:28,382) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56136:MainThread](2025-10-18 11:31:28,382) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56135:MainThread](2025-10-18 11:31:28,383) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56132:MainThread](2025-10-18 11:31:28,384) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56134:MainThread](2025-10-18 11:31:28,385) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56133:MainThread](2025-10-18 11:31:28,388) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56136:MainThread](2025-10-18 11:31:28,388) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56133:MainThread](2025-10-18 11:31:28,388) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56136:MainThread](2025-10-18 11:31:28,388) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56135:MainThread](2025-10-18 11:31:28,389) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56131:MainThread](2025-10-18 11:31:28,389) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56135:MainThread](2025-10-18 11:31:28,389) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56131:MainThread](2025-10-18 11:31:28,389) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56134:MainThread](2025-10-18 11:31:28,390) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56132:MainThread](2025-10-18 11:31:28,390) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56134:MainThread](2025-10-18 11:31:28,390) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56132:MainThread](2025-10-18 11:31:28,390) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError. XGBModel is skipped(optional: maybe installing xgboost can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
ModuleNotFoundError.  PyTorch models are skipped (optional: maybe installing pytorch can fix it).
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
[56222:MainThread](2025-10-18 11:32:31,653) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56223:MainThread](2025-10-18 11:32:31,653) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56226:MainThread](2025-10-18 11:32:31,653) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56225:MainThread](2025-10-18 11:32:31,653) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56224:MainThread](2025-10-18 11:32:31,654) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56227:MainThread](2025-10-18 11:32:31,657) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[56223:MainThread](2025-10-18 11:32:31,658) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56226:MainThread](2025-10-18 11:32:31,658) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56225:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56222:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56223:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56225:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56226:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56222:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56224:MainThread](2025-10-18 11:32:31,659) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56224:MainThread](2025-10-18 11:32:31,660) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56227:MainThread](2025-10-18 11:32:31,661) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[56227:MainThread](2025-10-18 11:32:31,662) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56122:MainThread](2025-10-18 11:32:34,882) INFO - qlib.timer - [log.py:127] - Time cost: 73.083s | Loading data Done
[56122:MainThread](2025-10-18 11:32:36,646) INFO - qlib.timer - [log.py:127] - Time cost: 0.462s | DropnaLabel Done
[56122:MainThread](2025-10-18 11:32:39,763) INFO - qlib.timer - [log.py:127] - Time cost: 3.116s | CSZScoreNorm Done
[56122:MainThread](2025-10-18 11:32:39,826) INFO - qlib.timer - [log.py:127] - Time cost: 4.943s | fit & process data Done
[56122:MainThread](2025-10-18 11:32:39,827) INFO - qlib.timer - [log.py:127] - Time cost: 78.029s | Init data Done
[Step 2] Starting the experiment and training the model...
[56122:MainThread](2025-10-18 11:32:39,849) INFO - qlib.workflow - [exp.py:258] - Experiment 376922499957687719 starts running ...
[56122:MainThread](2025-10-18 11:32:40,631) INFO - qlib.workflow - [recorder.py:345] - Recorder 09c2896631e24499baebacb64603256c starts running under Experiment 376922499957687719 ...
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
    --dirstat-by-file[=<param1,param2>...]
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
    --stat[=<width>[,<name-width>[,<count>]]]
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
    --ignore-space-at-eol
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
    --color-words[=<regex>]
                          equivalent to --word-diff=color --word-diff-regex=<regex>
    --color-moved[=<mode>]
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
    --relative[=<prefix>]
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
    --ignore-submodules[=<when>]
                          ignore changes to submodules in the diff generation
    --submodule[=<format>]
                          specify how differences in submodules are shown
    --ita-invisible-in-index
                          hide 'git add -N' entries from the index
    --ita-visible-in-index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --rotate-to <path>    show the change in the specified path first
    --skip-to <path>      skip the output to the specified path
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       Output to a specific file

[56122:MainThread](2025-10-18 11:32:40,664) INFO - qlib.workflow - [recorder.py:378] - Fail to log the uncommitted code of $CWD(/Users/test1/Documents/code/my_develop/qlib_data/user) when run git diff.
fatal: not a git repository (or any of the parent directories): .git
[56122:MainThread](2025-10-18 11:32:40,693) INFO - qlib.workflow - [recorder.py:378] - Fail to log the uncommitted code of $CWD(/Users/test1/Documents/code/my_develop/qlib_data/user) when run git status.
error: unknown option `cached'
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
    --dirstat-by-file[=<param1,param2>...]
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
    --stat[=<width>[,<name-width>[,<count>]]]
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
    --ignore-space-at-eol
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
    --color-words[=<regex>]
                          equivalent to --word-diff=color --word-diff-regex=<regex>
    --color-moved[=<mode>]
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
    --relative[=<prefix>]
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
    --ignore-submodules[=<when>]
                          ignore changes to submodules in the diff generation
    --submodule[=<format>]
                          specify how differences in submodules are shown
    --ita-invisible-in-index
                          hide 'git add -N' entries from the index
    --ita-visible-in-index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --rotate-to <path>    show the change in the specified path first
    --skip-to <path>      skip the output to the specified path
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       Output to a specific file

[56122:MainThread](2025-10-18 11:32:40,721) INFO - qlib.workflow - [recorder.py:378] - Fail to log the uncommitted code of $CWD(/Users/test1/Documents/code/my_develop/qlib_data/user) when run git diff --cached.
Training until validation scores don't improve for 50 rounds
[20]    train's l2: 0.990585    valid's l2: 0.99431
[40]    train's l2: 0.986931    valid's l2: 0.993693
[60]    train's l2: 0.984352    valid's l2: 0.99349
[80]    train's l2: 0.982319    valid's l2: 0.993382
[100]   train's l2: 0.980442    valid's l2: 0.99331
[120]   train's l2: 0.97871 valid's l2: 0.993247
[140]   train's l2: 0.976987    valid's l2: 0.993334
[160]   train's l2: 0.97536 valid's l2: 0.993338
Early stopping, best iteration is:
[122]   train's l2: 0.978519    valid's l2: 0.993238
【Step 3】生成预测信号...
[56122:MainThread](2025-10-18 11:33:25,686) INFO - qlib.workflow - [record_temp.py:198] - Signal record 'pred.pkl' has been saved as the artifact of the Experiment 376922499957687719
'The following are prediction results of the LGBModel model.'
                          score
datetime   instrument
2017-01-03 SH600000   -0.042865
           SH600008    0.005925
           SH600009    0.030596
           SH600010   -0.013973
           SH600015   -0.141758
【Step 4】获取当前实验的 recorder_id，用于后续读取结果...
当前实验的 recorder_id 为：09c2896631e24499baebacb64603256c
[56122:MainThread](2025-10-18 11:33:25,736) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
【Step 5】读取预测结果并计算 IC...
已保存的 artifacts： ['label.pkl', 'pred.pkl']
预测结果（前5行）：
                          score
datetime   instrument
2017-01-03 SH600000   -0.042865
           SH600008    0.005925
           SH600009    0.030596
           SH600010   -0.013973
           SH600015   -0.141758
预测结果（后5行）：
                          score
datetime   instrument
2020-07-31 SZ300413   -0.078162
           SZ300433   -0.101778
           SZ300498   -0.054418
           SZ300601   -0.147531
           SZ300628    0.030925
预测结果时间范围： DatetimeIndex(['2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
               '2017-01-13', '2017-01-16',
               ...
               '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23',
               '2020-07-24', '2020-07-27', '2020-07-28', '2020-07-29',
               '2020-07-30', '2020-07-31'],
              dtype='datetime64[ns]', name='datetime', length=871, freq=None)
标签结果（前5行）：
                         LABEL0
datetime   instrument
2017-01-03 SH600000   -0.001831
           SH600008   -0.002398
           SH600009    0.001493
           SH600010    0.003520
           SH600015   -0.007142
标签结果（后5行）：
                         LABEL0
datetime   instrument
2020-07-31 SZ300413   -0.037566
           SZ300433   -0.031677
           SZ300498   -0.006531
           SZ300601    0.090264
           SZ300628    0.004142
【Step 6】IC 指标统计：
IC 均值： 0.04993267859655785
IC 标准差： 0.12446777545374337
IC 绝对值均值： 0.10684702003092757
Rank IC 均值： 0.051507508261451146
Rank IC 标准差： 0.12273969195405407
Rank IC 绝对值均值： 0.10527763281820797
(freq) test1@budas-MacBook-Pro user %

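VII. Results Analysis

Some of the messages in the output are unimportant and can be ignored. For example, the long git diff usage text is printed because Qlib's recorder failed to run git diff --cached in the working directory (see the "Fail to log the uncommitted code" line); it does not affect training.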

[56227:MainThread](2025-10-18 11:32:31,662) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data_snapshot')}
[56122:MainThread](2025-10-18 11:32:34,882) INFO - qlib.timer - [log.py:127] - Time cost: 73.083s | Loading data Done
[56122:MainThread](2025-10-18 11:32:36,646) INFO - qlib.timer - [log.py:127] - Time cost: 0.462s | DropnaLabel Done
[56122:MainThread](2025-10-18 11:32:39,763) INFO - qlib.timer - [log.py:127] - Time cost: 3.116s | CSZScoreNorm Done
[56122:MainThread](2025-10-18 11:32:39,826) INFO - qlib.timer - [log.py:127] - Time cost: 4.943s | fit & process data Done
[56122:MainThread](2025-10-18 11:32:39,827) INFO - qlib.timer - [log.py:127] - Time cost: 78.029s | Init data Done

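The log lines above show the data loading and preprocessing steps and how long each one took. Initializing the data took about 78 s in total, of which about 73 s was spent loading.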

【Step 2】启动实验并训练模型...
[56122:MainThread](2025-10-18 11:32:39,849) INFO - qlib.workflow - [exp.py:258] - Experiment 376922499957687719 starts running ...
[56122:MainThread](2025-10-18 11:32:40,631) INFO - qlib.workflow - [recorder.py:345] - Recorder 09c2896631e24499baebacb64603256c starts running under Experiment 376922499957687719 ...

...
[56122:MainThread](2025-10-18 11:32:40,721) INFO - qlib.workflow - [recorder.py:378] - Fail to log the uncommitted code of $CWD(/Users/test1/Documents/code/my_develop/qlib_data/user) when run git diff --cached.
Training until validation scores don't improve for 50 rounds
[20]    train's l2: 0.990585    valid's l2: 0.99431
[40]    train's l2: 0.986931    valid's l2: 0.993693
[60]    train's l2: 0.984352    valid's l2: 0.99349
[80]    train's l2: 0.982319    valid's l2: 0.993382
[100]   train's l2: 0.980442    valid's l2: 0.99331
[120]   train's l2: 0.97871 valid's l2: 0.993247
[140]   train's l2: 0.976987    valid's l2: 0.993334
[160]   train's l2: 0.97536 valid's l2: 0.993338
Early stopping, best iteration is:
[122]   train's l2: 0.978519    valid's l2: 0.993238

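This part is the training process. Early stopping selected iteration 122 as the best one (122 trees): the validation loss did not improve during the following 50 rounds, so training stopped.
At the best iteration, training L2 = 0.9785 and validation L2 = 0.9932. The training error is lower than the validation error, which indicates mild overfitting, but nothing severe; the early-stopping strategy did its job.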
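
To make the early-stopping rule concrete, here is a minimal sketch of the patience logic in plain Python. It is not LightGBM's actual code: the function name best_iteration and the toy loss curve are made up for illustration, and it only assumes the behaviour the log describes ("Training until validation scores don't improve for 50 rounds").

```python
def best_iteration(valid_losses, patience):
    """Return (best_round, stop_round), both 1-based.

    Training stops once `patience` rounds have passed without a new
    minimum of the validation loss; stop_round is None if that never
    happens within the given curve.
    """
    best_round, best_loss = 0, float("inf")
    for round_no, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_round, best_loss = round_no, loss
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_no
    return best_round, None


# Toy validation curve (invented): improves for 3 rounds, then drifts upward.
losses = [0.9950, 0.9940, 0.9932, 0.9933, 0.9934, 0.9935]
print(best_iteration(losses, patience=3))  # (3, 6)
```

Under the same rule, a best iteration of 122 with a patience of 50 means training would stop at round 172, which fits the log: the last printed round is 160 and the model is rolled back to round 122.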

【Step 3】生成预测信号...
[56122:MainThread](2025-10-18 11:33:25,686) INFO - qlib.workflow - [record_temp.py:198] - Signal record 'pred.pkl' has been saved as the artifact of the Experiment 376922499957687719
'The following are prediction results of the LGBModel model.'
                          score
datetime   instrument
2017-01-03 SH600000   -0.042865
           SH600008    0.005925
           SH600009    0.030596
           SH600010   -0.013973
           SH600015   -0.141758
【Step 4】获取当前实验的 recorder_id，用于后续读取结果...
当前实验的 recorder_id 为：09c2896631e24499baebacb64603256c
[56122:MainThread](2025-10-18 11:33:25,736) INFO - qlib.timer - [log.py:127] - Time cost: 0.000s | waiting `async_log` Done
【Step 5】读取预测结果并计算 IC...
已保存的 artifacts： ['label.pkl', 'pred.pkl']
预测结果（前5行）：
                          score
datetime   instrument
2017-01-03 SH600000   -0.042865
           SH600008    0.005925
           SH600009    0.030596
           SH600010   -0.013973
           SH600015   -0.141758
预测结果（后5行）：
                          score
datetime   instrument
2020-07-31 SZ300413   -0.078162
           SZ300433   -0.101778
           SZ300498   -0.054418
           SZ300601   -0.147531
           SZ300628    0.030925
预测结果时间范围： DatetimeIndex(['2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
               '2017-01-13', '2017-01-16',
               ...
               '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23',
               '2020-07-24', '2020-07-27', '2020-07-28', '2020-07-29',
               '2020-07-30', '2020-07-31'],
              dtype='datetime64[ns]', name='datetime', length=871, freq=None)
标签结果（前5行）：
                         LABEL0
datetime   instrument
2017-01-03 SH600000   -0.001831
           SH600008   -0.002398
           SH600009    0.001493
           SH600010    0.003520
           SH600015   -0.007142
标签结果（后5行）：
                         LABEL0
datetime   instrument
2020-07-31 SZ300413   -0.037566
           SZ300433   -0.031677
           SZ300498   -0.006531
           SZ300601    0.090264
           SZ300628    0.004142
【Step 6】IC 指标统计：
IC 均值： 0.04993267859655785
IC 标准差： 0.12446777545374337
IC 绝对值均值： 0.10684702003092757
Rank IC 均值： 0.051507508261451146
Rank IC 标准差： 0.12273969195405407
Rank IC 绝对值均值： 0.10527763281820797

This part mainly shows the prediction results and the evaluation metrics.

Let's summarize them in a table:

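Metric	Value	Rule of thumb
IC mean	0.050	below 0.03 ≈ ineffective; 0.05 ≈ "usable"; 0.1+ ≈ "excellent"
Mean absolute IC	0.107	same scale as above; the higher the absolute value, the better
IC std	0.124	high volatility; the direction of the signal is unstable from day to day
Rank IC mean	0.052	nearly the same as IC, so the rank-based (monotonic) relationship adds no extra information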
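
For readers who want to see what these numbers mean, here is a minimal sketch of a per-day IC and Rank IC in plain Python. It is not Qlib's calc_ic implementation: the helper names and the toy data are invented for illustration, and ties in the ranking are not averaged. The idea is simply that IC is the cross-sectional Pearson correlation between score and label on each day, Rank IC is the same correlation computed on ranks, and the table reports the mean and standard deviation over days.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank positions starting at 1 (ties are not averaged, a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def daily_ic(rows):
    """rows: iterable of (date, score, label). Returns two dicts keyed by date."""
    by_day = defaultdict(lambda: ([], []))
    for date, score, label in rows:
        by_day[date][0].append(score)
        by_day[date][1].append(label)
    ic = {d: pearson(s, l) for d, (s, l) in by_day.items()}
    rank_ic = {d: pearson(ranks(s), ranks(l)) for d, (s, l) in by_day.items()}
    return ic, rank_ic


# Toy data (invented for illustration): two days, four stocks each.
rows = [
    ("2017-01-03", 1.0, 1.0), ("2017-01-03", 2.0, 4.0),
    ("2017-01-03", 3.0, 9.0), ("2017-01-03", 4.0, 16.0),
    ("2017-01-04", 1.0, 0.3), ("2017-01-04", 2.0, -0.1),
    ("2017-01-04", 3.0, 0.2), ("2017-01-04", 4.0, 0.1),
]
ic, rank_ic = daily_ic(rows)
ic_values = list(ic.values())
print("IC mean:", mean(ic_values), "IC std:", stdev(ic_values))

# Ratio of IC mean to IC std (often called ICIR) for the run in this article,
# using the mean and std printed in the log above.
article_icir = 0.04993267859655785 / 0.12446777545374337
print(f"article ICIR: {article_icir:.2f}")  # article ICIR: 0.40
```

On the first toy day the label is a monotonic but nonlinear function of the score, so Rank IC is exactly 1 while IC is slightly below 1. In the real run the two are almost equal, which is what the last row of the table is saying.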

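So this setup is only just at the usable level, but at least we got it running end to end.
What remains is to tune the model and raise the IC.

Keep going 💪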

VIII. Full Code

The complete demo is available at https://github.com/JizhiXiang/Quant-Strategy.

Qlib - Hands-on Data Download

MangoQuant — Fri, 17 Oct 2025 07:00:37 +0000

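Many people hear about Qlib, the quant toolkit, and want to try it, but get stuck at the very first step: they don't know how to download the data. This post covers how to do that.

I. Download (static data package)

1. Download via script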

# download 1d
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn

# download 1min
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data_1min --region cn --interval 1min

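# US stocks download (command found in the source file /Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py)
python scripts/get_data.py qlib_data --target_dir <qlib_data_1d_dir> --interval 1d
# US stocks update
python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --trading_date 2021-06-01

[Treat the commands above as reference only; the commands I actually ran are shown below. My qlib version is 0.9.7.]

In my hands-on test, this A-share data only goes up to 2020. It appears to be the official static data package, meant for experiments.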

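2. Hands-on: download A-share 1d data

I downloaded it to a custom path:

python scripts/get_data.py qlib_data --target_dir ../qlib_data/cn_data --region cn

3. Hands-on: download US stock 1d data

If you have datasets from several different sources, it is best to keep them in separate directories, like this: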

qlib_data/
 ├── cn_data_tushare/      # A-share data source (with updates)
 ├── cn_data_snapshot/     # official static package (2020-09-25)
 ├── us_data_yahoo/        # US stocks
 └── hk_data_yahoo/        # Hong Kong stocks

My command:

# US stocks download
python scripts/get_data.py qlib_data --target_dir ../qlib_data/us_data_yahoo --interval 1d --region us

The download finished, and it was quite fast.

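II. Check Data Health

Qlib provides a script for checking the health of the data.
It checks the following:

Whether any data is missing in the DataFrame.
Whether any OHLCV column has large step changes above a threshold.
Whether any required column (OHLCV) is missing from the DataFrame.
Whether the 'factor' column is missing from the DataFrame.

Run the following commands to check whether the data is healthy.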

# for daily data
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data

# for 1min data
python scripts/check_data_health.py check_data --qlib_dir ~/.qlib/qlib_data/cn_data_1min --freq 1min

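The following parameters can also be adjusted:

freq: data frequency.
large_step_threshold_price: maximum allowed price change.
large_step_threshold_volume: maximum allowed change in trading volume.
missing_data_num: maximum number of missing data points allowed.

1. Hands-on: data health

Using my actual directory:

python scripts/check_data_health.py check_data --qlib_dir ../qlib_data/cn_data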

(freq) test1@budas-MacBook-Pro qlib % ls
CHANGELOG.md        LICENSE         SECURITY.md     pyproject.toml      tests
CHANGES.rst     MANIFEST.in     build_docker_image.sh   qlib
CODE_OF_CONDUCT.md  Makefile        docs            scripts
Dockerfile      README.md       examples        setup.py
(freq) test1@budas-MacBook-Pro qlib % python scripts/check_data_health.py check_data --qlib_dir  ../qlib_data/cn_data
[25774:MainThread](2025-10-16 18:01:32,906) INFO - qlib.Initialization - [config.py:452] - default_conf: client.
[25774:MainThread](2025-10-16 18:01:34,294) INFO - qlib.Initialization - [__init__.py:75] - qlib successfully initialized based on client settings.
[25774:MainThread](2025-10-16 18:01:34,294) INFO - qlib.Initialization - [__init__.py:77] - data_path={'__DEFAULT_FREQ': PosixPath('/Users/test1/Documents/code/my_develop/qlib_data/cn_data')}
                              open        close          low         high        volume  factor
instrument datetime
SH000905   2007-01-15  1881.467041  1986.538940  1881.467041  1986.538940  3.881263e+09     NaN
           2007-01-16  1991.689941  2055.020020  1991.040039  2055.020020  4.807442e+09     NaN
           2007-01-17  2064.290039  2035.650024  1992.540039  2093.889893  5.883289e+09     NaN
           2007-01-18  2025.849976  2085.399902  2002.660034  2085.399902  5.094946e+09     NaN
           2007-01-19  2097.500000  2159.639893  2097.500000  2159.649902  5.792517e+09     NaN
...                            ...          ...          ...          ...           ...     ...
           2020-09-21  6492.225098  6445.964844  6434.323730  6515.336914  1.260237e+10     NaN
           2020-09-22  6392.024902  6358.651855  6341.483887  6462.975098  1.213059e+10     NaN
           2020-09-23  6382.281250  6392.032715  6356.163574  6411.204590  9.504884e+09     NaN
           2020-09-24  6353.693848  6244.638672  6243.282227  6354.682617  1.128015e+10     NaN
           2020-09-25  6271.700684  6236.890625  6206.182129  6285.471680  8.779878e+09     NaN

[3285 rows x 6 columns]
2025-10-16 18:03:15.435 | INFO     | __main__:check_required_columns:144 - ✅ The columns (OLHCV) are complete and not missing.
2025-10-16 18:03:15.820 | INFO     | __main__:check_missing_factor:172 - ✅ The `factor` column already exists and is not empty.

Summary of data health check (3875 files checked):
-------------------------------------------------
2025-10-16 18:03:15.820 | WARNING  | __main__:check_data:189 - There is missing data.
             open  high  low  close  volume
instruments
SH000903        4     4    4      4       4
SH600000        3     3    3      3       3
SH600004       53    53   53     53      53
SH600006       96    96   96     96      96
SH600007       32    32   32     32      32
...           ...   ...  ...    ...     ...
SZ300771        2     2    2      2       2
SZ300772        2     2    2      2       2
SZ300773        2     2    2      2       2
SZ300802        1     1    1      1       1
SH000905        0     0    0      0       0

[3613 rows x 5 columns]
2025-10-16 18:03:15.825 | WARNING  | __main__:check_data:192 - The OHLCV column has large step changes.
            col_name        date   pct_change
instruments
SH000300      volume  2016-01-08     3.216515
SH000903      volume  2016-01-08     3.425825
SH600000      volume  1999-12-15  1396.263672
SH600004      volume  2003-05-23   634.929382
SH600006      volume  1999-11-16    40.254677
...              ...         ...          ...
SZ300869        open  2020-08-26     0.502253
SZ300869        high  2020-08-25     0.513636
SZ300869         low  2020-08-25     0.557708
SZ300877        high  2020-08-25     0.698236
SZ300877       close  2020-08-25     0.768708

[4744 rows x 3 columns]

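2. Analysis

The script first prints a basic-information block, using CSI 500 (SH000905) as a sample: 3285 daily bars from 2007-01-15 to 2020-09-25, with 6 fields: open / close / low / high / volume / factor. The factor column is all NaN for this index, but that is fine; it will be computed later.
None of the five OHLCV columns is entirely missing, and the factor column exists.
Warning 1, missing data (missing rows): the table shows how many bars are missing for each instrument. This can be caused by trading suspensions and similar events; as long as not too many are missing, the impact is small.
Warning 2, large step changes: for each stock and each column, a day-over-day change above the threshold is flagged as an abnormal jump. Judging from the flagged rows, the threshold is about 50% for prices and about 3x for volume.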
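
To make the two warnings concrete, here is a minimal sketch of both checks in plain Python. It is not the code in check_data_health.py: the function names and the sample prices are invented, and None stands in for a missing bar.

```python
def missing_count(values):
    """Number of missing bars (represented here as None)."""
    return sum(v is None for v in values)


def large_steps(values, threshold):
    """Return (index, relative_change) for every step above `threshold`.

    The change is |today / yesterday - 1|; steps next to a missing bar
    or a zero value are skipped.
    """
    flagged = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev is None or cur is None or prev == 0:
            continue
        change = abs(cur / prev - 1.0)
        if change > threshold:
            flagged.append((i, change))
    return flagged


# Invented closing prices with one missing bar and one jump of about 58%.
closes = [10.0, 10.2, None, 10.1, 16.0, 15.8]
print(missing_count(closes))     # 1
print(large_steps(closes, 0.5))  # one flagged step, at index 4
```

A jump like this is not always bad data; the health check only tells you where to look.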

III. Inspect the Downloaded Data

1. Inspect A-share data

import qlib
from qlib.data import D

qlib.init(provider_uri="/Users/test1/Documents/code/my_develop/qlib_data/cn_data", region="cn")

symbol = "SH000905"
df = D.features(
    instruments=[symbol],
    fields=["$open", "$close", "$low", "$high", "$volume"],
    start_time="2010-01-01",
    end_time="2025-01-01"
)

if df.empty:
    print(f"No data found for {symbol}; check the symbol or the time range.")
else:
    print(f"{symbol}: {len(df)} records in total.")
    print(df.head())
    print(df.tail())
    df.to_csv(f"{symbol}_data.csv", encoding="utf-8-sig")
    print("Saved successfully!")

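Make sure the provider_uri path is correct.
Pay attention to the symbol format (exchange prefix plus code, for example SH000905).
Also, instruments=[symbol] is a list, so you can pass in several stocks at once.
You can also combine this with my previous post to get the stock list.

My run shows that the tail of the data does not actually reach 2025 (the data package is incomplete), so it is worth printing and checking the result from time to time.

2. Inspect US stock data

Take Apple (AAPL) as an example. It went public in 1980, but the official static data package only covers 2000 to 2020.
Now run the code below, remembering to adjust the configuration parameters.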

import qlib
from qlib.data import D

qlib.init(provider_uri="/Users/test1/Documents/code/my_develop/qlib_data/us_data_yahoo", region="us")

symbol = "AAPL"
df = D.features(
    instruments=[symbol],
    fields=["$open", "$close", "$low", "$high", "$volume"],
    start_time="1980-01-01",
    end_time="2025-01-01"
)

if df.empty:
    print(f"No data found for {symbol}; check the symbol or the time range.")
else:
    print(f"{symbol}: {len(df)} records in total.")
    print(df.head())
    print(df.tail())
    df.to_csv(f"{symbol}_data.csv", encoding="utf-8-sig")
    print("Saved successfully!")

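This is a way to check whether the data is sufficient. We find that nothing after 2020 is available, which leads to the next section: updating the data.

IV. Update Data

# US stocks update
python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir ../qlib_data/us_data_yahoo --trading_date 2019-01-01 --end_date 2022-01-01 --region us

# Specific US stocks only
python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir ../qlib_data/us_data_yahoo --trading_date 2019-01-01 --end_date 2022-01-01 --region us --instruments AAPL,MSFT

Unfortunately, the request for the US symbol list (_get_eastmoney) failed on all 5 attempts, so another approach may be needed.
Result: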

(freq) test1@budas-MacBook-Pro qlib % python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir ../qlib_data/us_data_yahoo --trading_date 2019-01-01 --end_date 2022-01-01 --region us --instruments AAPL,MSFT
2025-10-17 14:40:09.386 | INFO     | collector:get_instrument_list:266 - get US stock symbols......
2025-10-17 14:40:10.913 | WARNING  | data_collector.utils:wrapper:558 - _get_eastmoney: 1 :request error
2025-10-17 14:40:14.190 | WARNING  | data_collector.utils:wrapper:558 - _get_eastmoney: 2 :request error
2025-10-17 14:40:17.669 | WARNING  | data_collector.utils:wrapper:558 - _get_eastmoney: 3 :request error
2025-10-17 14:40:20.949 | WARNING  | data_collector.utils:wrapper:558 - _get_eastmoney: 4 :request error
2025-10-17 14:40:24.227 | WARNING  | data_collector.utils:wrapper:558 - _get_eastmoney: 5 :request error
Traceback (most recent call last):
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py", line 1021, in <module>
    fire.Fire(Run)
  File "/Users/test1/py_env/freq/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/py_env/freq/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/py_env/freq/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py", line 988, in update_data_to_bin
    self.download_data(delay=delay, start=trading_date, end=end_date, check_data_length=check_data_length)
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py", line 802, in download_data
    super(Run, self).download_data(max_collector_count, delay, start, end, check_data_length, limit_nums)
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/base.py", line 402, in download_data
    _class(
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py", line 86, in __init__
    super(YahooCollector, self).__init__(
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/base.py", line 80, in __init__
    self.instrument_list = sorted(set(self.get_instrument_list()))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/yahoo/collector.py", line 267, in get_instrument_list
    symbols = get_us_stock_symbols() + [
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/utils.py", line 359, in get_us_stock_symbols
    _all_symbols = _get_eastmoney() + _get_nasdaq() + _get_nyse()
                   ^^^^^^^^^^^^^^^^
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/utils.py", line 554, in wrapper
    _result = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/test1/Documents/code/my_develop/qlib/scripts/data_collector/utils.py", line 316, in _get_eastmoney
    raise ValueError("request error")
ValueError: request error

Another option is to download the data from some other source first and then convert it to the qlib format.

Reference: official documentation

Paper: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree"

MangoQuant — Thu, 16 Oct 2025 02:24:24 +0000

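Background

The earlier posts covered basic Qlib usage. Digging deeper, it turns out that Qlib ships with many models and algorithms. Today we start with LightGBM, the model used in the QuickStart, and look at it in detail.

Code: https://github.com/microsoft/LightGBM

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Paper: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (link)
Authors: Guolin Ke¹, Qi Meng², Thomas Finley³, Taifeng Wang¹, Wei Chen¹, Weidong Ma¹, Qiwei Ye¹, Tie-Yan Liu¹

¹Microsoft Research, ²Peking University, ³Microsoft Redmond

Conference: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, USA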

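Abstract

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm with several efficient implementations such as XGBoost and pGBRT. Although these implementations adopt many engineering optimizations, their efficiency and scalability are still unsatisfactory ==when the feature dimension is high and the data size is large==. A major reason is that, for each feature, they need to scan all data instances to estimate the information gain of all possible split points, which is very time-consuming.

To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS ==excludes a large proportion of instances with small gradients== and uses only the rest to estimate the information gain. We prove that, since instances with larger gradients play a more important role in computing the information gain, GOSS can obtain a quite accurate estimate with a much smaller data size. EFB ==bundles mutually exclusive features (features that rarely take nonzero values at the same time) to reduce the number of features==. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve good results with a constant approximation ratio.

We call the new GBDT implementation with GOSS and EFB LightGBM. Experiments on multiple public datasets show that LightGBM speeds up the training of conventional GBDT by up to over 20 times while achieving almost the same accuracy.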

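1 Introduction

GBDT is widely used for its efficiency, accuracy, and interpretability, and performs well in tasks such as multi-class classification, click prediction, and learning to rank. However, with the arrival of big data (both the number of features and the number of instances have grown sharply), GBDT faces new challenges, especially in the trade-off between accuracy and efficiency.

Conventional GBDT implementations need to scan all instances for every feature to estimate the information gain of all possible split points. Their computational cost is therefore proportional to both the number of features and the number of instances, which makes training very slow on large data.

To address this challenge, we propose the following two techniques:

1.1 Gradient-based One-Side Sampling (GOSS)

Although GBDT has no native notion of instance weights, we notice that instances with different gradients play different roles in the computation of information gain. Instances with larger gradients (that is, under-trained instances) contribute more to the information gain. Therefore, when down-sampling, ==we should keep the instances with large gradients, randomly drop only instances with small gradients, and adjust weights so that the data distribution stays consistent==.

1.2 Exclusive Feature Bundling (EFB)

==High-dimensional data is usually sparse, and many features almost never take nonzero values at the same time (one-hot encoded features, for example). We can bundle these "exclusive" features into a single feature==, thereby reducing the number of features. We prove that this problem is NP-hard, but a greedy algorithm can achieve good results with a constant approximation ratio.

We call the GBDT implementation that combines these two techniques LightGBM. Experiments show a significant training speedup with almost no loss of accuracy.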

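2 Preliminaries

2.1 GBDT and Its Complexity Analysis

GBDT is an ensemble model; in each round it trains one decision tree that ==fits the negative gradients== (residuals). The main training cost lies in finding the best split points. The common approaches are the pre-sorted algorithm (exact but slow) and the histogram-based algorithm (efficient and memory-friendly). We build our optimizations on the latter.

The complexity of the histogram-based algorithm is:

Histogram building: O(#data × #feature)
Split-point finding: O(#bin × #feature)

Since #bin ≪ #data, histogram building is the bottleneck. ==If we can reduce #data or #feature==, training speeds up substantially.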
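
The two costs above are easy to see in code. Below is a minimal single-feature sketch in plain Python, not LightGBM's implementation: the function names and toy data are invented, the feature is assumed to be already discretized into bin indices, and the gain is the variance-gain form G_L²/n_L + G_R²/n_R built from gradient sums.

```python
def build_histogram(bin_ids, grads, n_bins):
    """One pass over all instances: O(#data) for this feature."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bin_ids, grads):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan the bin boundaries: O(#bin) for this feature.

    Returns (bin_index, gain); instances with bin <= bin_index go left.
    """
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_gain = None, float("-inf")
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data (invented): 8 instances in 4 bins; gradients flip sign after bin 1.
bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hist = build_histogram(bin_ids, grads, n_bins=4)
print(best_split(*hist))  # (1, 8.0)
```

The loop in build_histogram touches every instance, while best_split only touches the bins, which is why reducing #data (GOSS) or #feature (EFB) attacks the expensive part.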

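2.2 Related Work

XGBoost: supports both the pre-sorted and the histogram-based algorithm, and is one of the fastest GBDT implementations available.
SGB (Stochastic Gradient Boosting): randomly samples instances in each round, at the cost of accuracy.
Feature dimensionality reduction: for example PCA or feature selection, which may lose information.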

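3 Gradient-based One-Side Sampling (GOSS)

3.1 Algorithm Description

==GOSS keeps the instances with large gradients, randomly samples the instances with small gradients, and rescales the contribution of the sampled ones with a weight factor==, so that the information gain estimate stays accurate.

Algorithm steps:

Sort the instances by the absolute value of their gradients in descending order;
Keep the top a×100% instances with the largest gradients;
Randomly sample b×100% of the full dataset from the remaining instances;
Assign the sampled small-gradient instances a weight of (1−a)/b;
Build histograms and estimate the information gain using the weighted instances.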
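
The first four steps can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions, not LightGBM's code: the function name goss_sample, the seed argument, and the toy gradients are all invented here.

```python
import random


def goss_sample(grads, a, b, seed=0):
    """Return (indices, weights) of the instances selected by GOSS.

    a: fraction of all instances kept because they have the largest |gradient|
    b: fraction of all instances drawn at random from the remaining ones
    """
    n = len(grads)
    # Step 1: sort instance indices by |gradient|, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top_n, rand_n = int(a * n), int(b * n)
    # Step 2: keep the top a*100% unconditionally.
    top, rest = order[:top_n], order[top_n:]
    # Step 3: draw b*100% (of the full data) from the rest, without replacement.
    sampled = random.Random(seed).sample(rest, rand_n)
    # Step 4: up-weight the sampled small-gradient instances by (1 - a) / b.
    weight = (1 - a) / b
    return top + sampled, [1.0] * top_n + [weight] * rand_n


# Toy gradients (invented): 100 instances with distinct magnitudes 0.00 .. 0.99.
grads = [(-1) ** i * i / 100 for i in range(100)]
indices, weights = goss_sample(grads, a=0.2, b=0.1)
print(len(indices), sum(weights))
```

With a = 0.2 and b = 0.1 only 30 of the 100 instances are used, but the weights add up to roughly 100 again, which is how the weighting keeps the sampled data representative of the full set.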

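(==Blogger's note, to make this easier to follow:==
Think of GOSS as "a teacher who only grades the key questions in an online exam":

First, score every instance by how wrong it is (= the absolute gradient |g_i|).
The larger the gradient, the more wrong the answer, and the more important it is to keep. Line the whole class up by |g_i| from high to low, and keep all of the top a×100% "hardest-hit" students after class.
From the remaining students, who "got it almost right", randomly draw another b×100% so that they are not ignored completely.
To keep the score distribution of the sampled class the same as the original, give each of the drawn "good students" an amplification factor weight = (1−a)/b. Intuitively, good students originally make up a (1−a) fraction of the class but only a b fraction was drawn, so each of them has to "stand for" (1−a)/b students.
Use this weighted subsample to compute histograms, information gain, and split points: it is faster, while the distribution stays close to that of the original class.

What are a and b?

They are user-tunable hyperparameters, not fixed constants.
Typical ranges: a ∈ [0.05, 0.2] (keep the 5%–20% of instances with the largest gradients) and b ∈ [0.05, 0.2] (draw another 5%–20% from the rest), for an overall sampling rate of about a + b, commonly 10%–30%. The paper's own experiments use values between 0.05 and 0.1 for both.
LightGBM's default settings: top_rate (a) = 0.2 and other_rate (b) = 0.1. If accuracy drops, increase a first (large-gradient instances matter more).)

3.2 Theoretical Analysis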

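We define the information gain as the variance reduction after a split. GOSS estimates this gain from the sampled instances. With probability at least 1 − δ, the approximation error $\mathcal{E}(d) = |\tilde{V}_j(d) - V_j(d)|$ is bounded by:

$$\mathcal{E}(d) \le C_{a,b}^2 \ln(1/\delta) \cdot \max\left\{\frac{1}{n_l^j(d)}, \frac{1}{n_r^j(d)}\right\} + 2 D C_{a,b} \sqrt{\frac{\ln(1/\delta)}{n}}$$

where:

$C_{a,b} = \frac{1-a}{\sqrt{b}} \max_{x_i \in A^c} |g_i|$, with $A^c$ the set of small-gradient instances outside the top a×100%;
$D = \max(\bar{g}_l^j(d), \bar{g}_r^j(d))$, the larger of the mean absolute gradients in the left and right child;
$n_l^j(d)$ and $n_r^j(d)$ are the numbers of instances in the left and right child when feature j is split at d, and n is the total number of instances.

(Readers interested in the theoretical proof can refer to the original paper.)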

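Conclusions:

As the number of instances n grows large, the error tends to 0;
GOSS outperforms random sampling (SGB), especially when the range of information gain is large;
Sampling increases the diversity of the base learners, which helps generalization.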

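4 Exclusive Feature Bundling (EFB)

4.1 Algorithm Description

In high-dimensional sparse data, many features almost never take nonzero values at the same time (one-hot features, for example). We bundle such features into a single "exclusive feature bundle", thereby reducing the number of features.

(==Blogger's note:==
One-hot encoding is the most common way to map a discrete (categorical) value to a binary vector:

Suppose a feature has N distinct categories.
Create an all-zero vector of length N.
When an instance belongs to category i, set position i of the vector to 1 and leave the rest at 0.

Result: each category corresponds to exactly one "1" position, hence the name "one-hot".

Example

The feature "color" with values {red, green, blue} → a 3-dimensional one-hot vector

red → [1, 0, 0]
green → [0, 1, 0]
blue → [0, 0, 1]

)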

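Two key questions:

Which features can be bundled? → This reduces to a graph coloring problem: features are vertices, and an edge connects two features that are not mutually exclusive (that is, that conflict);
How are the features merged? → Assign each feature an offset so that the value ranges do not overlap, then merge them into one new feature.

Theorem 4.1: Finding the optimal bundling of exclusive features is NP-hard.

Greedy algorithm (Algorithm 3):

Build the conflict graph;
Sort the features by degree in descending order;
Go through the features and try to add each one to an existing bundle (total conflicts ≤ γ); otherwise create a new bundle.

Merge algorithm (Algorithm 4):

Assign each feature an offset;
Merge into one new feature, keeping the original feature values distinguishable.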
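
Both algorithms can be sketched in plain Python. This is a simplified illustration rather than the paper's exact pseudocode or LightGBM's code: the function names and toy columns are invented, features are assumed to hold non-negative integer (already binned) values, and the conflict between a feature and a bundle is approximated by summing pairwise conflicts.

```python
def conflicts(col_a, col_b):
    """Number of rows where both features are nonzero."""
    return sum(1 for x, y in zip(col_a, col_b) if x != 0 and y != 0)


def greedy_bundle(columns, max_conflict=0):
    """columns: dict of feature name -> list of values. Returns a list of bundles."""
    names = list(columns)
    # Degree of each feature in the conflict graph.
    degree = {
        n: sum(conflicts(columns[n], columns[m]) > 0 for m in names if m != n)
        for n in names
    }
    bundles = []  # each entry: {"names": [...], "conflict": int}
    for name in sorted(names, key=lambda n: degree[n], reverse=True):
        for bundle in bundles:
            extra = sum(conflicts(columns[name], columns[m]) for m in bundle["names"])
            if bundle["conflict"] + extra <= max_conflict:
                bundle["names"].append(name)
                bundle["conflict"] += extra
                break
        else:
            # No existing bundle can take this feature: start a new one.
            bundles.append({"names": [name], "conflict": 0})
    return [b["names"] for b in bundles]


def merge_bundle(columns, names):
    """Merge features into one column by shifting each into its own value range."""
    merged = [0] * len(columns[names[0]])
    offset = 0
    for name in names:
        for i, v in enumerate(columns[name]):
            if v != 0:
                merged[i] = v + offset
        offset += max(columns[name])
    return merged


# Toy data (invented): a one-hot encoded color plus a dense "size" feature.
columns = {
    "red":   [1, 0, 0, 0],
    "green": [0, 1, 0, 0],
    "blue":  [0, 0, 1, 0],
    "size":  [2, 1, 3, 1],
}
bundles = greedy_bundle(columns)
print(bundles)                            # [['size'], ['red', 'green', 'blue']]
print(merge_bundle(columns, bundles[1]))  # [1, 2, 3, 0]
```

The three one-hot columns collapse into a single column whose values 1, 2, 3 still identify the original feature, so histograms are built for 2 bundles instead of 4 features.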

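Complexity:

Bundling stage: O(#feature²), performed only once;
After merging: the histogram-building complexity drops from O(#data × #feature) to O(#data × #bundle), with #bundle ≪ #feature.

(Again, readers interested in the proofs can refer to the original paper.)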

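5 Experiments

5.1 Datasets

Dataset	#data	#feature	Type	Task	Metric
Allstate	12M	4,228	sparse	binary classification	AUC
Flight Delay	10M	700	sparse	binary classification	AUC
LETOR	2M	136	dense	ranking	NDCG@10
KDD10	19M	29M	sparse	binary classification	AUC
KDD12	119M	54M	sparse	binary classification	AUC

5.2 Experimental Setup

Compared methods: XGBoost (pre-sorted and histogram-based), lgb_baseline (without GOSS/EFB), SGB (random sampling)
Parameters: GOSS sampling rates a = 0.05~0.1, b = 0.05~0.1; EFB conflict rate γ = 0
Threads: 16; iterations: fixed number, with early stopping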

5.3 Results

Training time (average seconds per iteration)

Dataset	xgb_exa	xgb_his	lgb_baseline	EFB_only	LightGBM
Allstate	10.85	2.63	6.07	0.71	0.28
Flight Delay	5.94	1.05	1.39	0.27	0.22
LETOR	5.55	0.63	0.49	0.46	0.31
KDD10	108.27	OOM	39.85	6.33	2.85
KDD12	191.99	OOM	168.26	20.23	12.67

==LightGBM is the fastest on every dataset==, with a speedup of up to about 21x over lgb_baseline (Allstate: 6.07 s vs 0.28 s per iteration)
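
The speedup figures follow directly from the table. Here is a quick check in Python, using only the lgb_baseline and LightGBM columns above (the variable names are my own):

```python
# (lgb_baseline, LightGBM) seconds per iteration, copied from the table above.
times = {
    "Allstate": (6.07, 0.28),
    "Flight Delay": (1.39, 0.22),
    "LETOR": (0.49, 0.31),
    "KDD10": (39.85, 2.85),
    "KDD12": (168.26, 12.67),
}

speedup = {name: base / lgb for name, (base, lgb) in times.items()}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
```

Allstate comes out at about 21.7x, which is where the "over 20 times" claim comes from; on the dense LETOR dataset the gain is much smaller (about 1.6x), since EFB has little to bundle there.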

Test accuracy (AUC / NDCG@10)

Dataset	xgb_exa	xgb_his	lgb_baseline	SGB	LightGBM
Allstate	0.6070	0.6089	0.6093	0.6064±7e-4	0.6093±9e-5
Flight Delay	0.7601	0.7840	0.7847	0.7780±8e-4	0.7846±4e-5
LETOR	0.4977	0.4982	0.5277	0.5239±6e-4	0.5275±5e-4
KDD10	0.7796	OOM	0.78735	0.7759±3e-4	0.78732±1e-4
KDD12	0.7029	OOM	0.7049	0.6989±8e-4	0.7051±5e-5

LightGBM's accuracy is ==almost identical to the best baseline==, with no significant loss

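5.4 GOSS Analysis

Speedup: GOSS alone brings roughly a 2x speedup;
Accuracy: at the same sampling rate, GOSS is consistently more accurate than SGB

5.5 EFB Analysis

Speedup: EFB brings a significant speedup on sparse datasets (for example 6.3x on KDD10);
Reasons: fewer features, no computation on zero values, and a better cache hit rate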

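6 Conclusion

We proposed LightGBM, an efficient GBDT implementation that combines GOSS and EFB. Theoretical analysis and experimental results show that:

GOSS can reduce the number of instances while keeping the information gain estimate accurate;
EFB effectively reduces the number of features, and is especially suited to high-dimensional sparse data;
LightGBM achieves a speedup of up to over 20x on several large-scale datasets with no loss of accuracy.

Future work:

Study how to choose a and b optimally in GOSS;
Further improve EFB so that it also applies to dense features.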