
The text generation AI ChatGPT is gaining attention. When given a theme, it can generate text based on that theme in natural language, and it can also support programming, marking the arrival of a highly advanced AI. However, such AIs learn from text data and code that already exist on the internet, so in specific fields that require high levels of expertise and confidentiality, such as medical records, there may not be enough information for the AI to learn from. To solve problems in these fields using AI, it is still necessary to incorporate various forms of tacit knowledge, such as human experience, wisdom, and intuition, into algorithms. Therefore, human annotation work is still required in many situations.
Annotation tools are essential for annotation work, which adds information to each item in a large dataset. However, a search for "annotation tools" returns many names, each with different supported file formats and functions, making it difficult to decide which tool to use. In this article, we focus on text annotation, introduce three key points to consider when choosing an annotation tool, and compare six annotation tools.
- Table of Contents
- 1. Three Points to Choose an Annotation Tool
- 1-1. Purpose
- 1-2. Features and Usability
- 1-3. Management
- 2. Comparison of 6 Annotation Tools
- 2-1. FastLabel
- 2-2. brat
- 2-3. LabelBox
- 2-4. CVAT
- 2-5. VoTT
- 2-6. labelimg
- 3. Summary
- 4. Human Science Annotation Agency Services
- 4-1. Extensive track record of creating 48 million training data items
- 4-2. Resource Management Without Using Crowdsourcing
- 4-3. Utilizing the latest data annotation tools
- 4-4. Fully equipped security room within the company
1. Three Points to Choose an Annotation Tool
1-1. Purpose
Text annotation tools need to be selected based on the type of AI model you want to build in-house. Typical types of text annotation include "named entity recognition," "sentiment analysis," and "text classification," and the optimal annotation tool differs for each. For "named entity recognition," the tool must let you enclose specific words in a sentence in span tags. For "sentiment analysis" of dialogue, tagging each sentence is useful. For "text classification," which categorizes a text as a whole, a function for tagging the entire text is required. Since the types of annotation that can be performed vary by tool, choose a tool that fits your purpose.
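The three levels of tagging described above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class names (`Span`, `AnnotatedText`), the field names, and the sample labels are hypothetical and do not correspond to any particular tool's format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """A labeled character range, as used for named entity recognition."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


@dataclass
class AnnotatedText:
    """One text carrying word-level, sentence-level, and whole-text tags."""
    text: str
    spans: List[Span] = field(default_factory=list)              # span tags
    sentence_labels: Dict[int, str] = field(default_factory=dict)  # sentence index -> tag
    document_label: Optional[str] = None                          # tag for the whole text

    def span_text(self, span: Span) -> str:
        # Return the substring covered by a span tag.
        return self.text[span.start:span.end]


doc = AnnotatedText(text="Alice joined Acme. She loves the job.")
doc.spans.append(Span(0, 5, "PERSON"))      # named entity recognition
doc.spans.append(Span(13, 17, "ORG"))
doc.sentence_labels[0] = "neutral"          # sentence-level sentiment
doc.sentence_labels[1] = "positive"
doc.document_label = "career"               # classification of the whole text

for span in doc.spans:
    print(span.label, doc.span_text(span))
```

A tool that only offers whole-text tagging could fill `document_label` but not `spans`, which is why the annotation type should be decided before the tool.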
1-2. Features and Usability
In annotation work that processes vast amounts of data, the functionality and usability (operability) of the tool are crucial. In terms of operability, check whether the UI (button arrangement and screen layout) is intuitive enough to operate without a manual, whether shortcut keys are well supported, and whether actions such as data loading run smoothly, since all of these affect productivity. In terms of functionality, check whether the tool can create the data needed for AI training, for example whether span tags can be associated with each other.
Additionally, annotation tools are broadly divided into cloud-based and local installation types. Cloud-based tools require no installation and can be used immediately after creating an account and logging in.
On the other hand, local types can operate without transferring data to external cloud servers, providing peace of mind in terms of data security management. However, some local tools have a high barrier to entry, requiring downloads from code hosting services such as GitHub or installation by executing commands. Many of them also lack features for bulk data management, which makes data management cumbersome and makes them less suitable for collaborative work.
Furthermore, the data formats that can be output by each tool vary. Whether the desired output format is supported is also one of the important points to consider when choosing a tool.
1-3. Management
When working with many annotators on a single project, the management functions for annotators and tasks (the smallest unit of annotation work) are also crucial points not to be overlooked. For example, being able to check the daily progress of annotators (number of annotations, number of completed tasks, number of rejections, etc.) and the status of each task (annotated, reviewed, under review, on hold, etc.) can facilitate smooth management operations and also help ensure quality.
Most local tools lack such management features, but many cloud tools come equipped with management functions, making them effective for projects involving large amounts of data carried out by multiple people over an extended period.
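When a tool lacks these management functions, the same figures can be tallied outside the tool. The following is a minimal sketch, assuming a hypothetical task log of (annotator, task ID, status) tuples; the status names and the `summarize` and `rejection_rate` helpers are illustrative, not part of any tool.

```python
from collections import Counter, defaultdict

# Hypothetical task log: (annotator, task_id, status)
tasks = [
    ("annotator_a", "t1", "reviewed"),
    ("annotator_a", "t2", "rejected"),
    ("annotator_b", "t3", "annotated"),
    ("annotator_b", "t4", "on_hold"),
    ("annotator_b", "t5", "reviewed"),
]


def summarize(task_log):
    """Count tasks per status for each annotator."""
    per_annotator = defaultdict(Counter)
    for annotator, _task_id, status in task_log:
        per_annotator[annotator][status] += 1
    return {name: dict(counts) for name, counts in per_annotator.items()}


def rejection_rate(counts):
    """Share of an annotator's tasks that were rejected (0.0 if no tasks)."""
    total = sum(counts.values())
    return counts.get("rejected", 0) / total if total else 0.0


summary = summarize(tasks)
for name, counts in summary.items():
    print(name, counts, f"rejection rate: {rejection_rate(counts):.0%}")
```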
2. Comparison of 6 Annotation Tools
Here we introduce six representative annotation tools. FastLabel, brat, and LabelBox support text annotation, while CVAT, VoTT, and labelimg are for image or video annotation.
2-1. FastLabel
FastLabel is a cloud-based annotation tool that supports images, videos, text, audio, 3D, and automatic annotation.
FastLabel's text annotation supports "Named Entity Recognition", "Classification", and "Pair Classification".
"Named Entity Recognition" is an annotation that extracts specified words or sentences from the text. "Classification" allows you to categorize the entire text as a whole into specified types. Additionally, "Pair Classification" enables you to compare and classify two texts side by side.
In addition, FastLabel operates smoothly, with pages and menus loading quickly. It also supports auto-annotation, which can reduce manual labor costs. Furthermore, its project management features allow you to track work progress and review data entirely within the tool.
2-2. brat
brat stands for "brat rapid annotation tool" and is an open-source, locally installed tool that is used in a browser. It supports extracting named entities from text and annotating the relations between them. By linking proper nouns to sources like Wikipedia, it is also possible to normalize nouns. Multiple users can access the annotation data and work simultaneously.
To use brat, Python 2 is required, and installation is done by entering commands in a terminal or similar. Classification labels cannot be configured within the tool; instead, you need to write them directly into the label configuration files in the installed brat directory. You must also create, in advance, the file to which the annotation data will be written. The homepage explains installation and these settings only briefly, so the barrier between installation and the start of annotation work is somewhat high. Furthermore, there are no project management features such as review functions or progress and status tracking, so when several people work on a project, you need to set up a separate management process to compensate.
There are many external forums about projects that use this tool, so you can refer to a variety of examples. It is well suited to annotation work in academic research.
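The preparation steps for brat described above (writing a label configuration file and creating the annotation file in advance) can be sketched as follows. This is a minimal sketch, not an official setup script: it assumes the labels go in a file named `annotation.conf` with `[entities]`, `[relations]`, `[events]`, and `[attributes]` sections, and that each `.txt` document needs a `.ann` file beside it. The labels `Person`, `Organization`, and `Works-for` and the helper names are hypothetical, and the parser handles only simple contiguous entity lines.

```python
from pathlib import Path
import tempfile

# Hypothetical label set; replace with the labels of your own project.
ANNOTATION_CONF = """\
[entities]
Person
Organization

[relations]
Works-for Arg1:Person, Arg2:Organization

[events]

[attributes]
"""


def prepare_collection(directory: Path) -> list:
    """Write the label config and create an empty .ann for every .txt lacking one."""
    (directory / "annotation.conf").write_text(ANNOTATION_CONF, encoding="utf-8")
    created = []
    for txt in sorted(directory.glob("*.txt")):
        ann = txt.with_suffix(".ann")
        if not ann.exists():
            ann.touch()
            created.append(ann)
    return created


def parse_entity(line: str):
    """Parse a contiguous entity line: ID, TAB, 'Label start end', TAB, text."""
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t")
    label, start, end = type_and_offsets.split(" ")
    return ann_id, label, int(start), int(end), text


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "doc1.txt").write_text("Alice joined Acme.", encoding="utf-8")
    created_names = [p.name for p in prepare_collection(root)]

entity = parse_entity("T1\tPerson 0 5\tAlice")
print(created_names, entity)
```

Because the output is plain text with character offsets, it is straightforward to convert into whatever format a training pipeline expects.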
2-3. LabelBox
LabelBox is a cloud-based annotation tool. It supports various annotations for images, videos, text, DICOM-compliant medical data, and map data such as COG. The paid version offers a wealth of features, while the free version is positioned as a trial version with limited functionality. The text annotation in the free version supports classification on a sentence-by-sentence basis. It can be used for sentiment analysis of dialogue, among other applications.
The paid version supports various text annotations such as named entity extraction and text classification. Additionally, if you use already annotated data, auto-annotation is also possible. It also includes management features for reviews and progress tracking, making it suitable for large-scale projects or ongoing projects.
2-4. CVAT
CVAT (Computer Vision Annotation Tool) is a locally installed open-source annotation tool developed and released by Intel Corporation.
It supports image and video annotation with rectangles, polygons, lines, points, circles, and cubes, and also includes an automatic annotation feature. Automatic annotation can be performed on over 80 predefined objects (such as cars, people, airplanes, bicycles, and dogs).
CVAT does not have a feature for returning problematic images directly to the annotator during checks. However, the URLs of problematic images can be recorded in an input field called "Issue Tracker," so annotators can follow the link to the indicated image and make corrections.
Additionally, it is characterized by a very rich variety of exportable data formats (CVAT, COCO, Datumaro, CamVid, Cityscapes, etc.).
2-5. VoTT
VoTT (Visual Object Tagging Tool) is a locally installed open-source annotation tool developed by Microsoft.
It supports image and video annotation, operates smoothly, and features a user interface that is intuitive enough for those without annotation experience to use.
VoTT provides installers for Windows, Mac, and Linux, making it easy for anyone to install.
However, it has no features for managing annotators, tracking task progress, or reviewing work, so projects involving multiple people need alternative management methods.
It supports output formats such as Azure Custom Vision service, Microsoft Cognitive Toolkit (CNTK), PascalVOC, TensorFlow records, VoTT JSON, and CSV.
2-6. labelimg
Labelimg is an open-source tool for image annotation that supports bounding box annotation.
During installation, you will need to use the terminal or similar to input commands, but compared to other tools, this step is simple.
You just need to place the file defining the class names as classes.txt in the specified folder, prepare a folder for images and a folder for annotation output files, and specify the paths for each folder in the tool to start the annotation work. It can be used locally, so it also supports annotation work that cannot be done with cloud tools.
Like VoTT, it has no management features such as task assignment, progress management, or work feedback, so if multiple people are working together, alternative management methods will be necessary.
The output formats supported are PascalVOC and YOLO.
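The two output formats describe the same box differently: PascalVOC stores pixel corner coordinates (xmin, ymin, xmax, ymax), while YOLO stores a class index followed by the box center and size normalized by the image dimensions. The conversion can be sketched as below. The class list, box, and image size are made-up sample values, and the function names are our own; the class index is assumed to follow the order of names in classes.txt.

```python
# Assumed contents of classes.txt, one class name per line, in this order.
classes = ["car", "person"]


def voc_to_yolo(box, image_width, image_height):
    """Convert (xmin, ymin, xmax, ymax) in pixels to normalized YOLO values."""
    xmin, ymin, xmax, ymax = box
    x_center = (xmin + xmax) / 2 / image_width
    y_center = (ymin + ymax) / 2 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return x_center, y_center, width, height


def yolo_line(class_name, box, image_width, image_height):
    """Build one line of a YOLO label file: class index plus four values."""
    class_index = classes.index(class_name)
    values = voc_to_yolo(box, image_width, image_height)
    return f"{class_index} " + " ".join(f"{v:.6f}" for v in values)


line = yolo_line("person", (100, 50, 300, 250), 400, 500)
print(line)  # 1 0.500000 0.300000 0.500000 0.400000
```

Since the YOLO line refers to classes only by index, the order of names in classes.txt must stay the same for everyone working on the same dataset.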
3. Summary
In this article, we explained three points to consider when choosing an annotation tool and compared six annotation tools.
As the number of annotation tools has increased recently, it is important to choose and utilize the most suitable annotation tool for your company's purposes in order to streamline the time-consuming annotation tasks as much as possible.
If you want to reduce the cost of implementing annotation tools, considering outsourcing the annotation itself is also an effective option. Our company offers a wide range of services from consultation on annotation tools to the outsourcing of annotation, so please feel free to reach out to us.
4. Human Science Annotation Agency Services
4-1. Extensive track record of creating 48 million training data items
At Human Science, we participate in AI model development projects across various industries, including natural language processing, medical support, automotive, IT, manufacturing, and construction. To date, we have provided over 48 million items of high-quality training data through direct transactions with many companies, including GAFAM. We handle a wide range of annotation projects in any industry, from small-scale projects to long-term, large-scale projects with 150 annotators. Companies that want to implement AI models but are unsure where to start are encouraged to consult with us.
4-2. Resource Management Without Using Crowdsourcing
At Human Science, we do not use crowdsourcing; instead, we advance projects with personnel directly contracted by our company. We form teams that can deliver maximum performance based on a solid understanding of each member's practical experience and their evaluations from previous projects.
4-3. Utilizing the latest data annotation tools
AnnoFab, one of the annotation tools adopted by Human Science, allows customers to check progress and give feedback via the cloud even while a project is underway. Work data cannot be saved on local machines, which also takes security into consideration.
4-4. Fully equipped security room within the company
At Human Science, we have a security room that meets ISMS standards within our Shinjuku office. This allows us to handle even highly confidential projects on-site while ensuring security. We consider the protection of confidentiality to be extremely important for all projects. Our staff undergoes continuous security training, and we exercise the utmost caution in handling information and data, even for remote projects.