What is Dataset Management?
Dataset management covers the wide range of operations used by data scientists to monitor and modify their dataset to meet the needs of their use case. This covers operations such as adding, removing, filtering, and creating hierarchical structures for assets.
Why is Dataset Management Important?
In machine learning, it is critical for practitioners to understand the various facets of their training dataset, as the quality of the data directly affects the training and predictive performance of the trained model. It is therefore important that they have tools to supplement and correct different aspects of their dataset.
Dataset management has an impact at every step of the machine learning pipeline. At the asset uploading stage, data may need to be tracked at various levels of detail, such as when it was collected and which smaller subcategories it falls under, so that it can be filtered later. At the annotation and dataset analysis stage, these tags are useful for filtering and grouping, so that users can inspect specific subsets of the data and spot potential issues such as outliers and inconsistencies. At the model training stage, being able to remove images of a certain type, or add more of them when necessary, helps ensure that the model trains on a well-balanced and sufficiently informative dataset. Finally, at the model validation stage, it can be useful to know which types or groups of images were used for training before you validate with what should be a comparable and representative test dataset.
Additionally, careful dataset management is a useful tool for accountability, as it makes it easy for others to understand where data is sourced and how it is organized. For researchers, for example, it is useful to track training results against the different sets or subsets of data being used. In deployment pipelines where cyclical processes like active learning and model retraining occur with new sets of data, it can be very useful to track what data is being added, so that any debugging or post-training analysis can uncover where differences in results may have occurred. Dataset management practices facilitate all of these possibilities.
How Does Asset Group Management Work on Datature?
Asset Group Management begins at the asset uploading stage. When you upload a new set of images to your project, you will first confirm the images that you are uploading. The next step is to assign the new images to pre-existing or new groups, or to leave them under the root (main) dataset group. In the example below, custom augmented images are being uploaded, and you can either place them in pre-existing groups or create a new group with a custom name, in this case augmentations. New images can be assigned to multiple groups.
Once the images are uploaded under their respective groups, you can filter for an individual group or a combination of groups on the Assets page. To do this, use the Select Groups button at the top right corner, below the image uploading section. There you can enter one or more group names and choose whether the filter should return only images that belong to all of the listed groups or images that belong to at least one of them. For easy access, we also display the existing list of group names below. After you select the Filter button, the Assets page will display only the relevant images. To clear the filter, close the filter tag at the top left of the images, and all images will be displayed again.
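To make the difference between the two filter modes concrete, here is a minimal sketch in plain Python. It does not use Datature's API; the asset names, the group names, and the filter_assets helper are all hypothetical and only model the "all groups" versus "at least one group" behavior described above.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical data: each filename maps to the set of groups it belongs to.
# An empty set stands for an image that sits only in the main dataset group.
assets: Dict[str, Set[str]] = {
    "cat_001.jpg": {"augmentations"},
    "cat_002.jpg": set(),
    "dog_001.jpg": {"augmentations", "night"},
    "dog_002.jpg": {"night"},
}


def filter_assets(
    assets: Dict[str, Set[str]], groups: Iterable[str], match_all: bool
) -> List[str]:
    """Return the sorted filenames that match the selected groups.

    match_all=True  keeps images that belong to every selected group.
    match_all=False keeps images that belong to at least one of them.
    """
    wanted = set(groups)
    if not wanted:
        # No groups selected: nothing to filter on, so show everything.
        return sorted(assets)
    if match_all:
        return sorted(name for name, g in assets.items() if wanted <= g)
    return sorted(name for name, g in assets.items() if not wanted.isdisjoint(g))


selected = ["augmentations", "night"]
in_all_groups = filter_assets(assets, selected, match_all=True)
in_any_group = filter_assets(assets, selected, match_all=False)
print(in_all_groups)  # ['dog_001.jpg']
print(in_any_group)   # ['cat_001.jpg', 'dog_001.jpg', 'dog_002.jpg']
```

With the same two groups selected, the "all groups" mode returns only the image tagged with both, while the "at least one group" mode returns every image tagged with either.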
On the right of the filtering function is a Settings button, which is for Group Management. In these settings, you can add, rename, and remove groups.
If you want to create your groups before the relevant images are uploaded, you can do so on this page by typing a new group name and selecting Add Group.
To rename an existing group, select it under the Current Groups tab. This shows extra information about the group, such as the number of occurrences in the group, and lets you rename it by typing a new name and selecting the arrow.
Finally, as a new way to reorganize and remove images, you can select the small button next to the name of an existing group. The following menu will appear, and its checkbox option lets you choose between removing only the organizational group, which moves those images to the main group, and deleting all the images in the group as well. This is helpful if you decide to retroactively remove previously added groups of image data. It gives more flexibility and ease in asset inflow and outflow: you no longer have to worry about missing some images or removing the wrong ones, because you can remove by group tag, which is handled entirely on our system.
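The two removal options can be illustrated with a small sketch in plain Python. This is not how the platform implements it; the remove_group helper, its delete_assets flag, and the sample data are hypothetical and only model the choice between dropping the group tag and deleting the tagged images.

```python
from typing import Dict, Set

# Hypothetical data: each filename maps to the set of groups it belongs to.
# An empty set stands for an image that sits only in the main dataset group.
assets: Dict[str, Set[str]] = {
    "cat_001.jpg": {"augmentations"},
    "cat_002.jpg": set(),
    "dog_001.jpg": {"augmentations", "night"},
    "dog_002.jpg": {"night"},
}


def remove_group(
    assets: Dict[str, Set[str]], group: str, delete_assets: bool = False
) -> Dict[str, Set[str]]:
    """Return a new mapping with `group` removed; the input is left untouched.

    delete_assets=False drops only the tag, so the images stay in the dataset.
    delete_assets=True  also drops every image that carried the tag.
    """
    result: Dict[str, Set[str]] = {}
    for name, groups in assets.items():
        if group in groups and delete_assets:
            continue  # the image is deleted together with its group
        result[name] = groups - {group}
    return result


tag_removed = remove_group(assets, "augmentations")
images_deleted = remove_group(assets, "augmentations", delete_assets=True)
print(sorted(tag_removed))     # all four images are still present
print(sorted(images_deleted))  # ['cat_002.jpg', 'dog_002.jpg']
```

In the first case every image survives and simply loses the tag; in the second, any image that carried the tag is removed, including one that also belonged to another group.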
Beyond the asset level, Asset Group Management also has an impact at the Annotator level. While annotating, you can use the same filtering as on the Assets page, located at the bottom left corner of the Annotator, to more easily find images that need to be annotated. This is useful if the annotation job is split between multiple annotators, since groups can be used to indicate which images should be annotated. In the same vein, there is a filename search for finding specific images in the Annotator by filename.
Asset Group Management can be an incredibly useful way to make simple tasks easier to navigate, and it makes dataset manipulation more efficient and accessible.
Want to Get Started?
Asset Group Management is available on all accounts, so all users on Nexus can enjoy and reap the benefits of our new feature. You can try it out with pre-existing projects as well!
Our Developer’s Roadmap
Overall, Asset Group Management is an incredibly useful tool that will continue to be integrated into other parts of our pipeline. Versioning in an iterative process like machine learning model training and deployment is incredibly important and we will continue to push for tools that contribute to transparency, accessibility, and collaboration.
If you have questions, feel free to join our Community Slack to post your questions or contact us about how Asset Group Management fits in with your usage.
For more detailed information about the Asset Group Management functionality, customization options, or answers to any common questions you might have, read more about dataset management on our Developer Portal.