How to Perform Action Recognition on Keypoints with ST-GCN++

Action recognition is a computer vision task aimed at identifying human actions in visual data, using machine learning techniques to analyze motion and appearance patterns. It differs from traditional classification in that it focuses on temporal dynamics in videos, and it has applications in surveillance, healthcare, sports analysis, and more.

Leonard So

What is Action Recognition?

Action recognition is a computer vision task that focuses on identifying and understanding human actions or activities from visual data, typically images or videos. The goal of action recognition is to develop algorithms and models that can automatically detect and classify various actions performed by humans or objects in a given scene.

In action recognition, the primary task is to recognize the specific action or activity being performed by analyzing the motion and appearance patterns within the visual data. This involves detecting and tracking relevant objects, understanding their spatial and temporal relationships, and extracting meaningful features that represent the actions. These features can include information about motion trajectories, pose configurations, object interactions, and more.

Action recognition is a challenging task due to factors such as variations in lighting conditions, camera viewpoints, occlusions, and the complex nature of human motion. Researchers in this field often employ machine learning techniques, including deep learning, to extract relevant features and train models capable of accurately recognizing and classifying actions. These models are typically trained on large datasets containing labeled examples of various actions.

How is Action Recognition Different From Classification?

Action recognition differs from traditional classification in the type of information that is being extracted and analyzed. Action recognition focuses primarily on temporal information, while classification focuses on spatial information.

Temporal information encompasses aspects such as the order and timing of events, motion trajectories, and the dynamics of objects or individuals in the video. For instance, in recognizing the action of "jumping," action recognition needs to observe the progression of the individual leaving the ground, reaching the peak of the jump, and landing back down. Recognizing this action requires capturing the temporal dependencies and changes over the course of several frames.

In classification, however, each image or frame is treated as an independent entity, and the model doesn't take into account the order or timing of these frames. It is concerned solely with the spatial characteristics of the individual frame. For instance, when classifying a static image of a cat, the model focuses on the cat's appearance, shape, and context within that single frame, without considering any motion or temporal context. This loss of temporal context can lead to severe misclassifications on video data, which action recognition can resolve by looking at how frames change over time.
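To make the difference concrete, here is a minimal sketch using made-up hip heights for a short clip. The values, the function names, and the simple "compare the last frame with the first" rule are all assumptions for illustration, not part of any real model. It shows that two clips with identical frames can describe opposite actions once frame order is considered.

```python
# Hypothetical hip heights (in pixels above the floor), one value per frame.
standing_up = [40, 45, 55, 70, 85, 90]
sitting_down = list(reversed(standing_up))


def frame_level_summary(heights):
    """Order-agnostic view: the set of frames with their ordering thrown away."""
    return sorted(heights)


def temporal_label(heights):
    """Order-aware view: compare the end of the clip with its start."""
    net_change = heights[-1] - heights[0]
    if net_change > 0:
        return "stand up"
    if net_change < 0:
        return "sit down"
    return "no change"


# Both clips contain exactly the same frames...
same_frames = frame_level_summary(standing_up) == frame_level_summary(sitting_down)
# ...but they describe opposite actions once the order is taken into account.
labels = (temporal_label(standing_up), temporal_label(sitting_down))
print(same_frames, labels)
```

A per-frame classifier only ever sees the information in `frame_level_summary`, which is why it cannot tell these two clips apart.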

What are Some Action Recognition Applications?

Since action recognition involves appearance and motion patterns, there is a wide range of applications that can benefit from such analysis.

A person punching/slapping another person with keypoints and an action label overlaid.
Action recognition used for surveillance [source].
Basketball players playing with action labels tracking each player.
Action recognition applied to sports: tracking the actions of basketball players [source].

Below is a non-exhaustive list of common applications for action recognition:

  • Surveillance and Security: Identifying suspicious or abnormal activities in security camera footage.
  • Healthcare: Monitoring patient movements for rehabilitation and assessing physical therapy exercises.
  • Sports Analysis: Analyzing sports events to track player movements and extract insights for coaching and analysis.
  • Human-Computer Gesture Interaction: Enabling gesture-based interfaces and controlling devices through actions.
  • Automated Driving: Understanding pedestrian and driver actions for autonomous vehicles to navigate safely.
  • Entertainment: Enhancing virtual reality experiences and character animations in games and movies.

What is ST-GCN++?

Spatial-Temporal Graph Convolutional Network (ST-GCN) is a type of neural network architecture designed specifically for action recognition tasks in videos. It is a deep learning model that leverages both spatial and temporal information in video sequences to improve the accuracy of action recognition. ST-GCN is particularly effective for capturing the dynamic patterns and interactions between different body parts during actions.

ST-GCN architecture diagram depiction.
ST-GCN architecture. [Source]

The architecture of ST-GCN is based on graph convolutional networks and is well-suited for modeling actions as a sequence of human body poses. It treats the human body as a graph, where nodes represent body joints (such as elbows, knees, and shoulders) and edges represent the spatial relationships between these joints. Temporal information is incorporated by processing multiple frames of a video sequence.
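The idea of treating the skeleton as a graph can be sketched in a few lines of plain Python. The five-joint skeleton, the function names, and the coordinates below are hypothetical, and the aggregation step has no learned weights, so this only illustrates how information flows between connected joints in each frame. The real network uses learned graph convolutions and a more elaborate adjacency normalization.

```python
# Toy skeleton: 0=head, 1=neck, 2=left shoulder, 3=right shoulder, 4=hip (hypothetical).
NUM_JOINTS = 5
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]  # pairs of joints connected by a bone


def build_adjacency(num_joints, edges):
    """Symmetric adjacency matrix with self-loops, stored as nested lists."""
    adj = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1.0
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return adj


def row_normalize(adj):
    """Scale each row so that its entries sum to 1."""
    normalized = []
    for row in adj:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


def aggregate(frame, norm_adj):
    """Spatial step: each joint becomes the average of itself and its neighbours."""
    dims = len(frame[0])
    return [
        [sum(norm_adj[i][j] * frame[j][d] for j in range(len(frame))) for d in range(dims)]
        for i in range(len(frame))
    ]


norm_adj = row_normalize(build_adjacency(NUM_JOINTS, EDGES))

# Two frames of (x, y) joint coordinates; the temporal axis is the outer list.
clip = [
    [[0.0, 10.0], [0.0, 8.0], [-2.0, 8.0], [2.0, 8.0], [0.0, 4.0]],
    [[0.0, 9.0], [0.0, 7.0], [-2.0, 7.0], [2.0, 7.0], [0.0, 4.0]],
]
smoothed_clip = [aggregate(frame, norm_adj) for frame in clip]
print(smoothed_clip[0][0])  # head mixed with its only neighbour, the neck
```

Applying the same spatial step to every frame and then combining neighbouring frames is, in spirit, what the spatial and temporal parts of the architecture do.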

ST-GCN++ is a recent variant that implements simple modifications to the original architecture by removing the complicated attention mechanisms while reducing the computational overhead. These optimization techniques work in tandem to further support the goal of achieving real-time action recognition.

What can ST-GCN++ Achieve?

ST-GCN++, just like its variants in the ST-GCN family, has demonstrated superior performance in action recognition tasks compared to earlier architectures. Its ability to model the dynamics of actions over time makes it well-suited for recognizing a wide range of complex actions accurately. It is robust to variations in the speed at which actions are performed. It can recognize actions even when the timing and tempo vary, making it suitable for real-world scenarios where actions may not always follow a fixed rhythm.
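One simple preprocessing idea that helps with tempo variation is to resample every clip to a fixed number of frames before it reaches the model. The sketch below is an illustration under that assumption, with a hypothetical helper name and made-up values. It is not a description of what ST-GCN++ does internally.

```python
def resample_indices(total_frames, clip_len):
    """Pick clip_len frame indices spread evenly over a clip of total_frames frames."""
    if total_frames <= 0 or clip_len <= 0:
        raise ValueError("total_frames and clip_len must be positive")
    return [min(total_frames - 1, (i * total_frames) // clip_len) for i in range(clip_len)]


# The same motion recorded slowly (16 frames) and quickly (8 frames).
slow_clip = [float(t) for t in range(16)]      # height rises by 1 per frame
fast_clip = [float(2 * t) for t in range(8)]   # height rises by 2 per frame

slow_fixed = [slow_clip[i] for i in resample_indices(len(slow_clip), 4)]
fast_fixed = [fast_clip[i] for i in resample_indices(len(fast_clip), 4)]
print(slow_fixed, fast_fixed)
```

After resampling, both recordings yield the same four-frame sequence, so a downstream model sees the same input regardless of how fast the action was performed.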

Architectural diagram using multi-modal inputs for traffic speed forecasting.
Example of multimodal context-based ST-GCN for traffic speed forecasting, which incorporates structured data such as passenger volume and weather conditions. [Source]

ST-GCN++ is also not limited to visual data: it can be combined with other modalities such as audio or text to perform multimodal action recognition. This means recognizing actions not only based on visual cues but also by integrating information from other sources, enhancing the model's performance and robustness.

Training a Custom ST-GCN++ Model for Sit-Stand Recognition

ST-GCN++ is highly customizable for 2D and 3D use cases. For the purpose of this tutorial, we will focus primarily on the 2D scenario. The ST-GCN++ model zoo contains pretrained models on the NTU-RGB+D dataset, but the annotations follow the HRNet 2D skeleton, which is similar to the COCO-Pose skeleton format containing 17 keypoints. This may be useful for generic action recognition applications, but certain use cases may benefit from a more detailed skeleton with a larger number of keypoints.

NTU-RGB+D 2D skeleton with 25 keypoints. [Source]

We will be re-training ST-GCN++ to produce a sit-stand recognition model using a subset of the NTU-RGB+D dataset with the original 25 keypoints.

1. Setting Up Your Environment

To get started, simply clone our Datature Resources repository and navigate to `example-scripts/action-recognition`. To set up PYSKL, which provides the training environment, run the following commands in your terminal:

git clone
cd pyskl
# This command runs well with conda 22.9.0. If you are running an earlier conda version and hit errors, try updating conda first.
conda env create -f pyskl.yaml
conda activate pyskl
pip install -e .

Other miscellaneous packages are listed in `requirements.txt` in the `action-recognition` root folder. To install these, simply run the command:

cd ..
pip install -r requirements.txt

2. Data Preparation (Optional)

The sit-stand subset of the NTU-RGB+D dataset has been preprocessed and conveniently included as a pickle file (`sit_stand.pkl`) that can directly be used for training. If you wish to use a different subset or a custom dataset, do check out this guide to generate your own pickle file.
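If you do build your own pickle file, it helps to check its structure before training. The sketch below assumes the layout we understand the PYSKL custom-dataset guide to describe: a `split` dictionary plus a list of `annotations`, each with a `keypoint` entry nested as person, frame, joint, and (x, y). The field names, split names, label ids, and helper functions here are assumptions to verify against the guide. Real files store the keypoints as NumPy arrays, whereas this sketch uses nested lists to stay dependency-free.

```python
import pickle

NUM_KEYPOINTS = 25  # NTU RGB+D skeleton used in this tutorial


def make_annotation(clip_name, label, keypoint, img_shape):
    """Build one annotation; keypoint is nested as [person][frame][joint] -> [x, y]."""
    return {
        "frame_dir": clip_name,           # assumed field name: unique clip identifier
        "label": label,                   # assumed ids: 0 = sit down, 1 = stand up
        "img_shape": img_shape,           # (height, width)
        "original_shape": img_shape,
        "total_frames": len(keypoint[0]) if keypoint else 0,
        "keypoint": keypoint,
    }


def check_annotation(anno, num_keypoints=NUM_KEYPOINTS):
    """Raise ValueError if the keypoint nesting does not match the expected shape."""
    for person in anno["keypoint"]:
        if len(person) != anno["total_frames"]:
            raise ValueError("every person needs one entry per frame")
        for frame in person:
            if len(frame) != num_keypoints:
                raise ValueError(f"expected {num_keypoints} joints, got {len(frame)}")
            if any(len(joint) != 2 for joint in frame):
                raise ValueError("each joint must be an [x, y] pair")


# One person, three frames, all joints at the origin (placeholder values).
one_person = [[[0.0, 0.0] for _ in range(NUM_KEYPOINTS)] for _ in range(3)]
anno = make_annotation("clip_000", 0, [one_person], (1080, 1920))
check_annotation(anno)

dataset = {"split": {"train": ["clip_000"], "val": []}, "annotations": [anno]}
restored = pickle.loads(pickle.dumps(dataset))  # round trip, as pickle.dump/load would do
```

Running a check like this on every annotation catches a wrong keypoint count early, before it surfaces as a shape error deep inside training.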

The training configuration file will also need to be updated to reflect the path to the updated pickle file. Do note that the PYSKL library does not provide a convenient way to define a custom skeleton, either in the training configuration file or through arguments to the training command. Instead, you will need to add your custom skeleton directly into the PYSKL code in `pyskl/pyskl/utils/` (function `get_layout()` on line 97) as shown in the following code snippet:

self.num_node = NUM_KEYPOINTS
self.inward = [(kp1, kp2), ...]  # list of your pairs of adjacent keypoints
self.center = CENTER_KEYPOINT_ID
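Before editing the library code, it can be worth sanity-checking the three values in the snippet above (keypoint count, adjacent pairs, and center keypoint). The helper below is a hypothetical standalone check, not part of PYSKL. It assumes keypoint indices are 0-based and that every keypoint should be reachable from the center through the listed pairs.

```python
from collections import deque


def validate_layout(num_node, inward, center):
    """Return a list of problems found in a skeleton definition (empty if none)."""
    if num_node <= 0:
        return ["num_node must be positive"]
    problems = []
    if not 0 <= center < num_node:
        problems.append(f"center {center} is outside 0..{num_node - 1}")
    neighbours = {i: set() for i in range(num_node)}
    for a, b in inward:
        if not (0 <= a < num_node and 0 <= b < num_node):
            problems.append(f"pair ({a}, {b}) uses an index outside 0..{num_node - 1}")
            continue
        if a == b:
            problems.append(f"pair ({a}, {b}) connects a keypoint to itself")
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Breadth-first search from the center to find keypoints no bone reaches.
    start = center if 0 <= center < num_node else 0
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    unreachable = sorted(set(range(num_node)) - seen)
    if unreachable:
        problems.append(f"keypoints not connected to the center: {unreachable}")
    return problems


# Toy 5-keypoint skeleton with keypoint 1 as the center.
print(validate_layout(5, [(0, 1), (2, 1), (3, 1), (4, 1)], 1))
```

An empty list means the definition is internally consistent. Any message points at an off-by-one index or a forgotten bone.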

You will also need to add the adjacent keypoint pairs of your custom skeleton into `pyskl/pyskl/datasets/pipelines/` (function `__init__()` of class `JointToBone` on line 295) as shown in the following code snippet:

self.pairs = ((kp1, kp2), ...) # list of your pairs of adjacent keypoints
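To see why these pairs matter, here is a sketch of the joint-to-bone idea: each bone is the vector from one keypoint to its neighbour in the pair. The function name, the (joint, parent) ordering of each pair, and the coordinates are assumptions for illustration. Check the `JointToBone` class for the exact convention it uses.

```python
def joints_to_bones(frame, pairs):
    """Convert per-joint (x, y) coordinates into per-joint bone vectors.

    frame: list of (x, y) tuples, one per keypoint.
    pairs: (joint, parent) index pairs; the bone stored at `joint` is joint - parent.
    Keypoints that never appear as `joint` keep a zero vector.
    """
    bones = [(0.0, 0.0)] * len(frame)
    for joint, parent in pairs:
        jx, jy = frame[joint]
        px, py = frame[parent]
        bones[joint] = (jx - px, jy - py)
    return bones


# Toy 3-keypoint chain: hip -> spine -> head (hypothetical coordinates).
frame = [(0.0, 0.0), (0.0, 2.0), (1.0, 3.0)]
pairs = ((1, 0), (2, 1))
bones = joints_to_bones(frame, pairs)
print(bones)  # [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
```

If a pair references the wrong keypoint, the resulting bone vectors describe limbs that do not exist, so this list needs to match your skeleton exactly.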

3. Training and Validation

Once the dataset has been prepared, we are now ready to start the training by running the command below. You should start to see training and validation logs after a short period of initialization.

bash pyskl/tools/ configs/ \
    --validate \
    --test-last

Using the above script, you should see outputs such as those displayed below:

Screenshot of code training output in terminal.
Evaluation metrics as part of training logs.

Understanding The Evaluation Metrics

To evaluate the model, we use two metrics that are common in image classification tasks: Top-1 accuracy and Top-5 accuracy. Top-1 accuracy measures the proportion of samples for which the model's highest-scoring prediction matches the true label. Top-5 accuracy is a more relaxed metric: a sample counts as correct if the true label appears anywhere among the model's five highest-scoring predictions.
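The two metrics are the same computation with a different cut-off. The sketch below is a plain-Python illustration with made-up scores and a hypothetical function name. It is not the evaluation code that produced the training logs.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Four samples, three classes (made-up scores).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 2, 0, 1]

print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```

Note that once `k` is at least the number of classes, every sample counts as a hit, so the metric stops being informative.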

ST-GCN++ Top-1 Accuracy over 16 epochs.
Graph of ST-GCN++ Top-1 Accuracy across 16 training epochs. Higher Top-1 accuracy means that the model is classifying the right actions more reliably.

Top-1 accuracy is often used as the primary metric to evaluate a model's performance in tasks where precision and exactness are crucial. Top-5 accuracy is more commonly used in tasks where there is a certain level of ambiguity or when there are multiple correct answers (e.g., recognizing different species of animals in a photo). In our case, the majority of the data consists of a single human subject and the model only has two classes, which makes Top-5 accuracy trivially 100%. The metric to focus on is therefore the Top-1 accuracy of 97.5%.

Integrating ST-GCN++ with YOLOv8 Pose Estimation

To test out our newly trained model on unseen data, we will first use a pose estimation model to detect all keypoints, which are then fed into ST-GCN++ as inputs. A custom sit-stand pose estimation model has already been trained with Ultralytics YOLOv8 on the same 25 keypoints.

GIF of test video of person sitting down with keypoint skeleton detected.
Pose estimation results using YOLOv8-Pose Small 640x640.

We have provided a testing script (``) that runs both the pose estimation and action recognition stages sequentially. The input is a single video which should contain one action, either sit-to-stand or sitting down, and the output is that same video with the corresponding action label overlaid. You can run the following code to start the evaluation.

python input/sample.avi output/output.mp4 \
  --config configs/ \
  --checkpoint test/stgcn_model.pth \
  --pose-checkpoint test/ \
  --skeleton test/nturgbd_skeleton.txt \
  --det-score-thr 0.7 \
  --label-map tools/sit_stand.txt
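Conceptually, the first stage turns a video into one keypoint set per frame, and low-confidence detections are discarded before the sequence reaches ST-GCN++. The sketch below illustrates that hand-off with made-up detections. The data structure, the function names, and the interpretation of `--det-score-thr` as a per-detection confidence filter are assumptions, and the actual script may handle this differently.

```python
DET_SCORE_THR = 0.7  # mirrors the --det-score-thr value in the command above


def select_person(detections, score_thr=DET_SCORE_THR):
    """Return the keypoints of the most confident detection above the threshold, or None."""
    kept = [det for det in detections if det["score"] >= score_thr]
    if not kept:
        return None
    return max(kept, key=lambda det: det["score"])["keypoints"]


def build_sequence(frames, score_thr=DET_SCORE_THR):
    """Collect one keypoint set per frame, skipping frames without a confident detection."""
    sequence = []
    for detections in frames:
        keypoints = select_person(detections, score_thr)
        if keypoints is not None:
            sequence.append(keypoints)
    return sequence


# Made-up pose-estimation output: one list of detections per frame,
# with a single (x, y) keypoint per detection to keep the example short.
frames = [
    [{"score": 0.9, "keypoints": [(1.0, 2.0)]}, {"score": 0.75, "keypoints": [(9.0, 9.0)]}],
    [{"score": 0.4, "keypoints": [(1.0, 2.5)]}],   # below threshold, frame skipped
    [],                                            # nothing detected, frame skipped
    [{"score": 0.8, "keypoints": [(1.0, 3.0)]}],
]
sequence = build_sequence(frames)
print(len(sequence))  # 2
```

The resulting sequence is what an action recognition model would consume. Raising the threshold trades missed frames for cleaner keypoints.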

The GIF below displays the output for an input video of the subject sitting down. The overlay shows both the action label and the keypoints that were used as inputs for the evaluation.

GIF of same test video but with the correct action label of sitting down.
Using ST-GCN++ in the second stage to produce an action label (image top-left).

As this example and the Top-1 accuracy shown above demonstrate, ST-GCN++ can successfully differentiate between sitting down and standing up from skeleton data in a video. While this is a relatively simple example with only two classes, it can certainly be extended to a larger number of classes if you desire.

Try It On Your Own Data

To try it on your own data, you should train your own YOLOv8 pose estimation model with Ultralytics, which is made easy with their API, as seen here. You will also need your own action recognition data and potentially keypoint data, unless you follow the article directly and use NTU RGB+D. In that case, please apply for access to the dataset here; you can then use the data to train your own YOLOv8 pose estimation model, followed by the ST-GCN++ model. All the code we use can be found in our Resources GitHub repository here.

Our Developer’s Roadmap

Action recognition is one of the extension applications that we are looking to integrate with keypoint detection models through Datature's Nexus. With Datature’s video annotation tools such as video interpolation and video tracking, keypoint labelling on videos will be more streamlined and time-saving for users on our platform to kickstart their action recognition journey. To facilitate more practical use cases with action recognition, we plan to introduce other features such as 3D reconstruction estimation, which opens up avenues for depth estimation in a 2D environment.

Want to Get Started?

If you have questions, feel free to join our Community Slack to post your questions or contact us about how Action Recognition fits in with your usage. 

For more detailed information about the Action Recognition functionality, customization options, or answers to any common questions you might have, read more about the process on our Developer Portal.


References

  • Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • H. Duan, J. Wang, K. Chen, D. Lin, "PYSKL: Towards Good Practices for Skeleton Action Recognition", Proceedings of the 30th ACM International Conference on Multimedia, pp. 7351-7354, October 2022.
  • Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, Alex C. Kot, "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
  • S. Yan, Y. Xiong, D. Lin, "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 1, April 2018.
