The PiGraphs dataset is composed of 30 reconstructed scenes and 63 interaction recordings.
To create the reconstructed scenes, we use a volumetric fusion framework on scans obtained with a Structure sensor. Each scene includes a surface mesh with a labeled segmentation and a set of labeled voxels.
The 63 interaction recordings are videos of five subjects (four male, one female) with skeletal tracking provided by a Kinect v2 device. The total recording duration is about two hours (100k frames at 15 Hz), with an average recording length of 2 minutes and an average of 4.9 action annotations per recording.
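The quoted totals can be cross-checked with simple arithmetic; a minimal sketch (the constants are the figures stated above, and the variable names are ours, not part of the dataset release):

```python
# Sanity-check the recording statistics quoted in the text.
FRAMES_TOTAL = 100_000   # approximate total frame count across all recordings
FPS = 15                 # skeleton tracking rate quoted in the text
NUM_RECORDINGS = 63      # number of interaction recordings

total_seconds = FRAMES_TOTAL / FPS
total_hours = total_seconds / 3600                      # ~1.85 h, i.e. "about two hours"
avg_minutes = total_seconds / NUM_RECORDINGS / 60       # ~1.8 min, i.e. "about 2 minutes"

print(f"total duration: {total_hours:.2f} h")
print(f"average recording length: {avg_minutes:.1f} min")
```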
In total, there are 298 annotated actions, with an average action duration of 8.4 s.
There are 43 observed verb-noun pairs, covering 13 common action verbs (e.g., look, sit, stand, lie, grasp, read) and 19 object categories associated with these verbs (e.g., couch, bed, keyboard, monitor).