Dataset and Protocol


The dataset used in this contest was acquired using a XIMEA SNm4x4 VIS camera. The videos were captured at 25 frames per second (FPS). Each frame was originally captured in 2D, with 16 bands arranged in a mosaic pattern. Each frame is then converted to a 3D cube in which the first two dimensions index the location of each pixel and the third dimension indexes the band number (code provided). The 16 bands cover the range from 470 nm to 620 nm, and each band has a size of 512×256 pixels. RGB videos were also acquired at the same frame rate from a viewpoint very close to that of the hyperspectral videos. False-color videos generated from the hyperspectral videos are also provided.
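
As a rough illustration of the 2D-to-3D conversion described above (not the provided conversion code), the following sketch rearranges a mosaic frame into a band cube with NumPy. The function name `mosaic_to_cube`, the 4×4 tile size, and the row-major band ordering inside each tile are assumptions made for this example; the provided code defines the actual ordering.

```python
import numpy as np


def mosaic_to_cube(frame: np.ndarray, pattern: int = 4) -> np.ndarray:
    """Rearrange a 2D mosaic frame into an (H/pattern, W/pattern, pattern**2) cube.

    Assumption: the band index of a pixel is its row-major position inside
    its pattern x pattern tile. The real sensor ordering may differ.
    """
    height, width = frame.shape
    if height % pattern or width % pattern:
        raise ValueError("frame size must be a multiple of the mosaic pattern size")
    rows, cols = height // pattern, width // pattern
    # Split each axis into (tile index, offset inside the tile).
    tiles = frame.reshape(rows, pattern, cols, pattern)
    # Move both in-tile offsets to the end, then flatten them into one band axis.
    return tiles.transpose(0, 2, 1, 3).reshape(rows, cols, pattern * pattern)
```

With this layout, `cube[r, c, b]` holds the raw pixel at row `4 * r + b // 4` and column `4 * c + b % 4` of the mosaic frame.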

Camera Calibration

The camera calibration process involves two steps: dark calibration and spectral correction. Dark calibration aims to remove the influence of noise produced by the camera sensor. It is done by subtracting a dark frame, captured with the lens covered by a cap, from the captured image. The goal of spectral correction is to reduce the distortion of the spectral responses. It is done by applying a sensor-specific spectral correction matrix to the acquired image.
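
The two calibration steps can be sketched as follows. This is a minimal example under stated assumptions, not the calibration code used for the dataset: the name `calibrate`, the clipping of negative values after dark subtraction, and the shape of the correction matrix (output bands × input bands) are all choices made for illustration.

```python
import numpy as np


def calibrate(cube: np.ndarray, dark_cube: np.ndarray, correction: np.ndarray) -> np.ndarray:
    """Apply dark calibration followed by spectral correction.

    cube, dark_cube: arrays of shape (H, W, B) holding the captured and dark frames.
    correction: matrix of shape (B_out, B), assumed to be supplied with the sensor.
    """
    # Dark calibration: subtract the dark frame from the captured frame.
    corrected = cube.astype(np.float64) - dark_cube.astype(np.float64)
    # Assumption: negative values after subtraction are noise and are set to zero.
    corrected = np.clip(corrected, 0.0, None)
    # Spectral correction: map each pixel spectrum s to correction @ s.
    return corrected @ correction.T
```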

Image Registration

The hyperspectral and color sequences are registered so that they describe almost the same scene. This is done by manually selecting matching points in the first frame of both the hyperspectral and color videos and then calculating a geometric transformation. The resulting transformation matrix was applied to all subsequent color frames to align them with the corresponding hyperspectral frames.
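
The text does not state which type of geometric transformation was used. As one possible sketch, assuming an affine model, the transformation can be estimated from the manually selected point pairs by least squares. The function names `estimate_affine` and `apply_affine` are hypothetical.

```python
import numpy as np


def estimate_affine(src_pts, dst_pts) -> np.ndarray:
    """Least-squares 2x3 affine matrix mapping src points onto dst points.

    Assumption: an affine model is adequate; the dataset authors may have
    used a different transformation type.
    """
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    if src.ndim != 2 or src.shape[1] != 2 or src.shape != dst.shape:
        raise ValueError("expected two (N, 2) arrays of matching points")
    if len(src) < 3:
        raise ValueError("at least three point pairs are needed")
    # Homogeneous source coordinates: each row is [x, y, 1].
    design = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T  # shape (2, 3)


def apply_affine(matrix: np.ndarray, pts) -> np.ndarray:
    """Transform (N, 2) points with a 2x3 affine matrix."""
    pts = np.asarray(pts, dtype=np.float64)
    return pts @ matrix[:, :2].T + matrix[:, 2]
```

Estimating the matrix once from the first frame and reusing it for every later frame mirrors the procedure described above, and relies on the two cameras keeping a fixed relative position.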

Image Conversion

To ensure a fair comparison, the hyperspectral videos were converted to false-color videos using CIE color matching functions. This produces strictly spatially aligned hyperspectral and false-color videos.
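
A minimal sketch of such a conversion is given below. It assumes the caller supplies the CIE color matching functions already sampled at the band centre wavelengths (the hypothetical `cmf` argument, one row per band), and it uses the standard XYZ to linear sRGB matrix with a simple peak normalization. Gamma encoding is omitted, and the exact procedure used for the dataset may differ.

```python
import numpy as np

# Standard XYZ (D65) to linear sRGB conversion matrix.
XYZ_TO_LINEAR_SRGB = np.array([
    [3.2406, -1.5372, -0.4986],
    [-0.9689, 1.8758, 0.0415],
    [0.0557, -0.2040, 1.0570],
])


def cube_to_false_color(cube: np.ndarray, cmf: np.ndarray) -> np.ndarray:
    """Project an (H, W, B) cube to an (H, W, 3) false-color image in [0, 1].

    cmf: (B, 3) array of x-bar, y-bar, z-bar values at each band's centre
    wavelength. These values must come from the caller; none are bundled here.
    """
    # Weighted sum over bands gives approximate XYZ tristimulus values.
    xyz = cube.astype(np.float64) @ cmf
    rgb = xyz @ XYZ_TO_LINEAR_SRGB.T
    # Assumption: out-of-gamut negatives are clipped, then the image is
    # scaled by its peak value so that it fits in [0, 1].
    rgb = np.clip(rgb, 0.0, None)
    peak = rgb.max()
    return rgb / peak if peak > 0 else rgb
```

Because every output pixel is computed only from the spectrum at the same location, the false-color frame keeps the pixel grid of the hyperspectral frame, which is why the two stay spatially aligned.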

For details on the above steps, please refer to: F. Xiong, J. Zhou, and Y. Qian. "Material based object tracking in hyperspectral videos", IEEE Trans. Image Process., vol. 29, no. 1, pp. 3719-3733, 2020.


A single upright bounding box is provided for the location of the target object in each frame. The bounding box is represented by its centre location, height, and width. The labels for the hyperspectral and color videos were generated independently. The labels for the hyperspectral videos can be used directly on the false-color videos.
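
Many tracking tools expect corner coordinates rather than a centre and size, so a small conversion helper may be useful. The sketch below is illustrative only: the `CenterBox` type, its field names, and its field order are assumptions and do not describe the layout of the label files.

```python
from typing import NamedTuple, Tuple


class CenterBox(NamedTuple):
    """Upright box given by its centre and size (hypothetical field order)."""
    cx: float
    cy: float
    width: float
    height: float


def to_corners(box: CenterBox) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a centre-based box."""
    half_w = box.width / 2.0
    half_h = box.height / 2.0
    return (box.cx - half_w, box.cy - half_h, box.cx + half_w, box.cy + half_h)
```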


The whole dataset contains 40 sets of videos for training and 35 sets of videos for testing. Every video is labelled with its associated challenge factors, drawn from eleven attributes: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR).

Dataset and Source Code Links

Source Code Link:

Evaluation code:

2D image to hyperspectral cube conversion code: