Image processing, animal tracking, and pose estimation. (A) Schematic illustrating the setup of the top-view video camera and the depth sensor above the cage. The top-view camera and depth sensor were placed as close together as possible to minimize parallax. Unit: millimeters. (B) MATLAB-generated schematic showing 3D registration of the top-view video camera and the depth sensor into a common coordinate system. Locations of the checkerboard patterns (Methods and Fig. S3) used for calibration are shown on the left, and the calculated positions of the two cameras are shown on the right. (C) Pose estimation using information from both the top-view camera and the depth sensor. An ellipse that best fits an animal detected in the segmented 3D video frames is used to describe the position, orientation, shape, and scale of the animal. Head orientation is determined from the animal's standing position, moving direction, and a set of features extracted using a previously developed machine learning algorithm (Methods). The pose of an animal is thus described by an ellipse with five parameters: centroid position (x, y), length of the long axis (l), length of the short axis (s), and head orientation (θ). (D) Validation of pose estimation against ground truth (manually annotated ellipses in individual video frames). Each histogram shows the distribution of differences between pose estimation and ground truth for an individual pose parameter or for overall performance (see Methods for the definitions of these differences). Numbers in parentheses at the top of each plot give the percentage of frames to the left of the dashed lines, which mark the 98th percentiles of the differences between two independent human observers (Fig. S5). n = 634 frames.
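To make the ellipse parameterization in (C) concrete, the sketch below fits an equivalent ellipse to a segmented animal via image moments. This is a minimal illustration assuming the animal is available as a 2D binary mask after segmentation; it is not the authors' implementation, and it leaves the 180° head/tail ambiguity unresolved (the caption notes that standing position, moving direction, and learned features are used for that step).

```python
import numpy as np

def ellipse_pose(mask):
    """Fit an ellipse to a binary mask via image moments.

    Returns (x, y, l, s, theta): centroid, long-axis length,
    short-axis length, and long-axis orientation in radians.
    """
    ys, xs = np.nonzero(mask)           # pixel coordinates of the animal
    x, y = xs.mean(), ys.mean()         # centroid from first moments

    # Second central moments -> 2x2 covariance of the pixel cloud
    cov = np.cov(np.stack([xs - x, ys - y]))
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

    # Axis lengths: a common convention is 4*sqrt(eigenvalue), i.e. an
    # ellipse spanning roughly two standard deviations along each axis
    l = 4.0 * np.sqrt(evals[1])         # long axis
    s = 4.0 * np.sqrt(evals[0])         # short axis

    # Orientation of the long axis; ambiguous by 180 degrees, so head
    # direction must be resolved with additional cues as described above
    theta = np.arctan2(evecs[1, 1], evecs[0, 1])
    return x, y, l, s, theta

# Toy usage: an elongated rectangular blob standing in for a segmented animal
mask = np.zeros((120, 160), dtype=bool)
mask[40:80, 30:110] = True
print(ellipse_pose(mask))
```

The five returned values correspond directly to the pose parameters (x, y, l, s, θ) listed in panel (C).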