Always wanted to try your hand at learning ASL (American Sign Language) but not sure where to start? What better way to start than to familiarize yourself with the ASL alphabet! In this follow-along tutorial, we’re going to build an ASL alphabet detection model to aid our learning, one that can translate ASL letters in real time!
By the end of this tutorial, you will be able to replicate this entire project in under 20 minutes - not including the training bit (which might take an hour or so). Before we begin, be sure to sign up for a Nexus account (which entitles you to 500 GPU training minutes) and download the accompanying dataset contributed by David Lee. Here is a summary of the steps we're going to take:
- Image & Annotation Upload
- Inspecting Labels and Performing Sanity Checks with Metadata Query
- Setting up model training pipeline
- Running model training
- Inspecting and validating model
- Model Iteration & next steps
Once you're logged in to the Nexus platform, you'll be prompted to create your first project. Let's go ahead and call it "ASL Detection" - or feel free to name it anything else.
The first thing we want to do is to upload our image dataset. Go ahead and select the entire 'train' folder. You'll be prompted to upload all files in the folder and thumbnails will be generated shortly.
Next, instead of spending hours labelling our images, let's upload our annotations. The dataset we downloaded comes with labels in the CSV four corner detection format, so all we have to do is to upload the single .csv file.
Note: You can upload all the annotations without splitting them into train / valid, as our platform matches each label to the filename of its image. This also means that filenames must match exactly. For example, an image file named "image1.png" must be referenced as "image1.png" in the .csv file. Check out the full list of supported annotation formats in our documentation here
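If you'd like to double-check the filenames before uploading, a quick local script can do it. The sketch below is illustrative only: the `filename` column name and the sample file names are assumptions, so adjust them to match the actual header of your annotation file.

```python
import csv
import io
from pathlib import Path


def find_mismatches(csv_text, image_names, filename_column="filename"):
    """Return (labels without an image, images without a label)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    labelled = {row[filename_column] for row in reader}
    images = set(image_names)
    return sorted(labelled - images), sorted(images - labelled)


def list_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """List image file names (not full paths) found directly in `folder`."""
    return [p.name for p in Path(folder).iterdir() if p.suffix.lower() in extensions]


# Tiny in-memory example; the column layout here is hypothetical.
sample_csv = "filename,label\nimage1.png,A\nimage2.png,B\n"
missing_images, unlabelled = find_mismatches(sample_csv, ["image1.png", "image3.png"])
print(missing_images)  # ['image2.png']
print(unlabelled)      # ['image3.png']
```

On a real dataset, you would pass the contents of the .csv file and `list_images("train")` instead of the in-memory sample.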
Inspecting Annotations + Sanity Checks
Next, it's time to inspect our labels. With 26 different classes in our dataset, it’s imperative that we ensure all images are labeled correctly. The last thing we want is for labeling errors in the dataset to produce a model that has its letters jumbled up! The first way we could inspect our dataset is by using the built-in annotator and sifting through all 700+ images. However, a better way would be to use Metadata Query - our advanced search and filtering engine that makes pre-training sanity checks an absolute breeze.
Navigate to the “Images” tab. We’re going to be inspecting all the relevant images for each letter. Metadata Query essentially allows us to run queries like “show me all images that are labelled with an ‘A’” or “show me all images that have more than one label”. Metadata Query also supports operators and comparisons for more complex logical queries, read more here!
Running the sanity check “instances > 1” - which is essentially the same as “show me all images that have more than 1 label” - we notice that there is a single image with duplicate bounding boxes. Clicking on the image brings you to our built-in annotator, where we can correct this labeling error by deleting the duplicate bounding box.
In addition, Metadata Query provides us with a quick overview of all the images associated with a given label, which gives us a quick way to spot any outliers. Here, we can see that an image of the letter "M" has been wrongly labeled as "N" - tough to spot since both letters resemble each other!
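For the curious, the same "more than one label" idea can be approximated offline on the annotation CSV. This is only a sketch of the check, not how Metadata Query works internally, and the column names (`filename`, `xmin`, `ymin`, `xmax`, `ymax`, `label`) are assumed rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def images_with_multiple_boxes(csv_text):
    """Group annotation rows by image and keep images with more than one box."""
    boxes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        box = (row["label"], int(row["xmin"]), int(row["ymin"]),
               int(row["xmax"]), int(row["ymax"]))
        boxes[row["filename"]].append(box)
    return {name: found for name, found in boxes.items() if len(found) > 1}


def exact_duplicates(found):
    """Return boxes that repeat with an identical label and identical corners."""
    seen, dupes = set(), []
    for box in found:
        if box in seen:
            dupes.append(box)
        seen.add(box)
    return dupes


# Made-up rows for illustration; the second image carries the same box twice.
sample_csv = (
    "filename,xmin,ymin,xmax,ymax,label\n"
    "a1.jpg,10,10,50,50,A\n"
    "b1.jpg,20,20,60,60,B\n"
    "b1.jpg,20,20,60,60,B\n"
)
flagged = images_with_multiple_boxes(sample_csv)
print(list(flagged))                        # ['b1.jpg']
print(exact_duplicates(flagged["b1.jpg"]))  # [('B', 20, 20, 60, 60)]
```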
Once all the images have been labeled and corrected, we can use the tag distribution graph on the project overview page to ensure that there are no data imbalances which may cause the model to be biased.
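The idea behind the tag distribution graph is a simple count of annotations per class. Here is a minimal sketch with made-up labels; the `imbalance_ratio` helper is a hypothetical convenience for this example, not a platform feature.

```python
from collections import Counter


def class_distribution(labels):
    """Count how many annotations each class has."""
    return Counter(labels)


def imbalance_ratio(counts):
    """Ratio of the most common class count to the least common one."""
    return max(counts.values()) / min(counts.values())


# Placeholder labels, not the real dataset.
labels = ["A", "A", "A", "B", "B", "C"]
counts = class_distribution(labels)
print(counts.most_common())     # [('A', 3), ('B', 2), ('C', 1)]
print(imbalance_ratio(counts))  # 3.0
```

A ratio close to 1 suggests the classes are roughly balanced, while a large ratio hints that some letters may need more examples.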
Building your training pipeline
Now let's get to the fun bit: building your model training workflow. Create a new workflow and name it however you want. A best practice we have is to name your workflow based on your selected model parameters, i.e. model title, model architecture, number of epochs, and applied augmentations. This helps a ton when we look back at our artifacts and wish to figure out the general configurations used.
Datature Training Workflow Setup
Simply right-click on the canvas and select the modules. A full workflow should consist of the Dataset, Augmentations and Model modules.
Dataset - Clicking on the card allows you to select your train-test split ratio as well as an option for you to shuffle your dataset.
Augmentations - This is where we select relevant augmentations as a pre-processing step that enhances our dataset on the fly, improving the model's ability to generalize to unseen variations. You may select as many augmentations as make sense for your dataset or use case by ticking the checkboxes. For users who like full control of their parameters, toggling Advanced Mode lets you enter the probability of each augmentation. Our library supports up to 30 augmentations, ranging from positional (vertical / horizontal flips) to color space augmentations that account for variances in lighting conditions. In the context of this model, we’re going to enable (i) horizontal flip, (ii) random rotate and (iii) motion blur - since the end goal is for our model to make inferences on video, we want to account for movement.
Model - This is where we select the base model architecture to train our model on. Datature utilizes state-of-the-art model architectures for transfer learning, so feel free to select the model that suits your use case, as we understand that some users may or may not be willing to trade accuracy for computational complexity and latency. For this tutorial, we will be using the setup as shown above.
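To make the Dataset module's split-and-shuffle option a little more concrete, here is a minimal sketch of the idea in plain Python. It is not the platform's implementation, and the 0.3 ratio and fixed seed are arbitrary choices for the example.

```python
import random


def train_test_split(items, test_ratio=0.3, shuffle=True, seed=0):
    """Split items into (train, test); optionally shuffle first."""
    items = list(items)
    if shuffle:
        # A seeded generator keeps the split reproducible between runs.
        random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_ratio)
    return items[n_test:], items[:n_test]


filenames = [f"image{i}.png" for i in range(10)]
train, test = train_test_split(filenames, test_ratio=0.3)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters when files are ordered by letter, since an unshuffled split could leave some letters out of the training set entirely.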
Now that all our modules have been set up and connected, let's go ahead and preview our augmentations.
Clicking on Preview Augmentations at the bottom bar provides you with a preview of how the augmentations will be applied to your dataset. Tip: the one thing to keep in mind when selecting augmentations is that we want our data to account for any potential variations we may encounter in the production environment.
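If you're wondering what these augmentations do under the hood, the sketch below shows a toy version of two of them (horizontal flip and a horizontal motion blur) on a tiny grayscale image stored as nested lists. It is a simplified illustration, not Datature's implementation; random rotate is left out for brevity, and the per-augmentation probabilities `p_flip` and `p_blur` are hypothetical parameter names.

```python
import random


def horizontal_flip(image, box):
    """Mirror the image left-to-right and mirror the bounding box with it."""
    width = len(image[0])
    xmin, ymin, xmax, ymax = box
    flipped = [row[::-1] for row in image]
    return flipped, (width - xmax, ymin, width - xmin, ymax)


def motion_blur(image, kernel=3):
    """Average each pixel with its horizontal neighbours (edges are clamped)."""
    half = kernel // 2
    blurred = []
    for row in image:
        last = len(row) - 1
        blurred.append([
            sum(row[min(max(x + d, 0), last)] for d in range(-half, half + 1)) / kernel
            for x in range(len(row))
        ])
    return blurred


def augment(image, box, rng, p_flip=0.5, p_blur=0.5):
    """Apply each augmentation independently with its own probability."""
    if rng.random() < p_flip:
        image, box = horizontal_flip(image, box)
    if rng.random() < p_blur:
        image = motion_blur(image)
    return image, box


image = [[0, 0, 9, 0], [0, 0, 9, 0]]
box = (2, 0, 3, 2)  # (xmin, ymin, xmax, ymax) in pixels, upper bounds exclusive
flipped, flipped_box = horizontal_flip(image, box)
print(flipped[0])             # [0, 9, 0, 0]
print(flipped_box)            # (1, 0, 2, 2)
print(motion_blur(image)[0])  # [0.0, 3.0, 3.0, 3.0]
```

Note that positional augmentations have to move the bounding box along with the pixels, otherwise the label would point at the wrong part of the image.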
Once you're happy with your augmentations and workflow, selecting Run Training will provide you with a final configuration summary based on the parameters you've chosen. You'll also be able to specify the hardware acceleration and train your models on up to 8 GPUs based on your batch size and model selection.
Now it's time to sit back and monitor your model training in real time. This is great for teams to spot early signs of overfitting, which allows them to stop a training run early. Once training is completed, you'll be able to look at key metrics for computer vision such as loss functions, precision and recall. Smoothing functions are also available on the TensorBoard and graphs can be re-arranged to your liking!
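As a rough illustration of what curve smoothing and an "early sign of overfitting" look like in code, here is a small sketch. The smoothing is a plain exponential moving average, which may differ in detail from what TensorBoard does, and the `overfitting_epoch` rule (validation loss rising while training loss keeps falling) is just one simple heuristic with a made-up `patience` parameter.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed


def overfitting_epoch(train_loss, val_loss, patience=2):
    """Return the first epoch at which validation loss has risen for `patience`
    consecutive epochs while training loss kept falling, or None if it never does."""
    streak = 0
    for epoch in range(1, len(val_loss)):
        val_up = val_loss[epoch] > val_loss[epoch - 1]
        train_down = train_loss[epoch] < train_loss[epoch - 1]
        streak = streak + 1 if (val_up and train_down) else 0
        if streak >= patience:
            return epoch
    return None


# Made-up loss curves for illustration only.
train_curve = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_curve = [1.1, 0.9, 0.8, 0.85, 0.9, 1.0]
print(overfitting_epoch(train_curve, val_curve))  # 4
print(smooth([1.0, 0.0], weight=0.5))             # [1.0, 0.5]
```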
Models trained on the Datature Nexus platform are stored as 'artifacts'. Let's go ahead and generate a TensorFlow model (with support for more models coming soon). This takes anywhere between 5 - 10 minutes and we can download the model to our local machine afterwards or choose to generate a project secret on our API settings page if we're intending to use Portal to visualize our neural network model.
Validating our ASL Translation Model
We've successfully trained our model...now what? It's time to inspect our model visually! Even though the TensorBoard provides us with standard metrics like precision, recall and the various loss functions, it’s always a good idea to move past aggregate metrics by actually visualizing how our model makes predictions on never-before-seen images and data collected from the production environment.
Our tool of choice is Portal, our open-source library that lets anyone visualize and inspect the performance of their model in minutes. Portal can be loaded as an executable file or run as a web application (more details on GitHub). Once Portal is successfully initialized, we'll want to register and load our model. If you've downloaded your model locally, all you need to do is paste the entire folder path and load the model once it has been registered. Alternatively, you can enter your project secret and model key.
Once we've loaded in our model, we'll go ahead and load in sample images and videos from our 'test' dataset under the assets folder. Selecting Analyze on Portal initiates our loaded model to run an inference on the current asset to return any objects which we have labeled. There are a ton of other cool features on Portal like Confidence Thresholds, IoU, Class Filtering and Bulk Analysis, which we will leave the curious to discover on this tutorial: Inspect Model Inferences on Images and Videos with Portal.
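To give a feel for what Confidence Thresholds, IoU and Class Filtering mean, here is a small standalone sketch. It does not use Portal's code or API; the prediction dictionaries with `label`, `score` and `box` keys are a hypothetical format chosen for this example.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    overlap_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1])
             - intersection)
    return intersection / union if union else 0.0


def filter_predictions(predictions, min_confidence=0.5, classes=None):
    """Keep predictions at or above the threshold, optionally only some classes."""
    return [
        p for p in predictions
        if p["score"] >= min_confidence and (classes is None or p["label"] in classes)
    ]


# Made-up predictions for illustration only.
predictions = [
    {"label": "A", "score": 0.91, "box": (10, 10, 50, 50)},
    {"label": "M", "score": 0.42, "box": (12, 12, 52, 52)},
    {"label": "N", "score": 0.63, "box": (100, 100, 140, 140)},
]
kept = filter_predictions(predictions, min_confidence=0.5)
print([p["label"] for p in kept])           # ['A', 'N']
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333333333333333
```

Raising the confidence threshold hides uncertain detections, while IoU measures how much a predicted box overlaps a reference box (1.0 means a perfect match, 0.0 means no overlap).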
Once the model had run an inference on all of our assets, we compared the ground truth labels to the predicted labels and summarized the performance of the model as follows:
Model Results @ 50% Confidence Threshold
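A comparison of this kind can also be tallied with a short script. The sketch below computes per-class precision and recall from paired ground truth and predicted labels; the labels are made-up placeholders, not the results from this tutorial, and `None` is an assumed convention for "no detection returned".

```python
from collections import Counter


def per_class_precision_recall(truth, predicted):
    """Compute (precision, recall) per class from paired labels.

    `predicted` entries may be None when the model returned no detection.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, predicted):
        if p == t:
            tp[t] += 1
        else:
            fn[t] += 1
            if p is not None:
                fp[p] += 1
    report = {}
    for label in sorted(set(truth) | {p for p in predicted if p is not None}):
        predicted_n = tp[label] + fp[label]
        actual_n = tp[label] + fn[label]
        report[label] = (
            tp[label] / predicted_n if predicted_n else 0.0,
            tp[label] / actual_n if actual_n else 0.0,
        )
    return report


truth = ["A", "M", "M", "N"]
predicted = ["A", "N", "M", None]
print(per_class_precision_recall(truth, predicted))
# {'A': (1.0, 1.0), 'M': (1.0, 0.5), 'N': (0.0, 0.0)}
```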
Although we managed to obtain some inferences from the model, there is still room for improvement, especially when running inferences on images taken with my own webcam. We think these mispredictions and model confusion could be due to the following:
- Dataset consisted of images taken with a mobile camera - which tends to yield higher resolution images compared to a webcam
- Training data did not include images with my background - hence the model was not "normalized" to my environment + face and performed poorly on webcam images
- Distance - the model had a hard time recognizing ASL letters from the webcam at further distances, possibly because the training set consisted of images captured fairly close to the camera
Next Step: Model Iteration
Now that we've hypothesized potential explanations for the model's mispredictions, it comes down to iterating on the model. The logical next step in our case would be to train the model using images captured from my webcam and background, in order to increase the variation in a dataset that is made up mainly of images captured with a phone. This also helps the model "learn" from images in which the hands are located slightly further away from the camera.
And there we go! That concludes our tutorial on training a very rudimentary ASL character detector! Although this is just the first step - we hope that this paves the way for further research in the field where not just letters but full words and sentences can be translated based on video data.
Try it for yourself!
Now that you've seen the capabilities of Datature Nexus and Portal, it's time to apply them to your own industry's use case! We've seen Datature's users develop computer vision use cases ranging from defect detection models that automate the assessment and grading of fruits in their factories to human traffic counters in retail stores. If you'd like to experiment with the model yourself, feel free to download the trained TensorFlow model here.
The possibilities of computer vision are endless and whether you're developing a proof-of-concept model or fine-tuning model performance, our platform allows you to do it in a data-centric manner. For more inspiration about the possibilities of computer vision for your industry, check out our Solutions Page to see how we're helping users solve their industry's toughest problems. That's all from us and we can't wait to see what you'll come up with!
If you have more questions, feel free to join our Community Slack and post them there. If you have trouble building your own model as you fight with CUDA or tensor mismatches, simply use our platform, Nexus, to build one in a couple of hours for free!