Some parts of this page may be machine-translated.

 

[Spin-off] The Surprising Difficulty of Data Annotation? ~Tips for Choosing a Data Annotation Outsourcing Company Based on Difficulty~




Spin-off blog project
- Annotation that supports AI in the DX era. The unexpected difficulty of annotation.
What is the unexpected difficulty of annotation?
~Tips for selecting an annotation outsourcing service based on difficulty~

Our company has published a number of blog posts about data annotation and AI, mainly sharing general knowledge and know-how. Put into words, data annotation may seem simple at first glance, but it is work that unavoidably depends on people and involves a great deal of "ambiguity", so it requires a lot of communication between the people involved. As a result, ensuring quality and productivity takes considerable experience and know-how that tidy theory alone cannot provide.

 

We therefore believe that knowing the specific problems that arise in real annotation work, and how they are handled, can be a useful guide to making data annotation succeed.

 

What actually happens on site at our company, and what specific responses and measures do we take? Unlike our regular blog posts, this spin-off series, titled "Data Annotation: Supporting AI in the DX Era. The Realities of the Analog Field", shares the realities of the field, including our own distinctive practices and commitments.

 

>> Selected past blog posts

How to outsource annotation work? 7 tips

How to deal with edge cases that cannot be covered in the specification document

7 Tips to Successfully Lead Annotations

Teacher data is essential for creating good teachers.

 


1. What is the unexpected difficulty of data annotation?

This time, I would like to talk about the difficulty of data annotation, a topic that cannot be avoided when considering outsourcing and selecting an annotation vendor.
I think everyone reading this blog can easily picture "difficult annotation" as work where labeling requires specialized or domain knowledge. In the medical field or in manufacturing, for example, it is often hard to make judgments and assign labels (say, for unusual appearance defects) without being familiar with the field, and many customers considering outsourcing annotation probably have concerns in this area.

 

However, there are also tasks that are unexpectedly difficult for other reasons, and when outsourcing data annotation it is important to be aware that they can lead to lower quality and higher costs than expected. So this time, setting aside tasks that require specialized or domain knowledge, I would like to discuss what makes data annotation difficult from the perspective of those working in an annotation department.

 

Many types of label classes

Before starting annotation, a worker first gets a general grasp of the label types and the specifications. This is only a rule of thumb, but a person can keep roughly 10 label types in mind at most. Labels for everyday objects may be an exception, but once there are many label types, typically more than 15 to 20, workers have to keep referring back to the specifications and procedures, which lowers productivity and efficiency and increases cost and project duration. In addition, as the number of labels grows, more of them naturally resemble one another, which leads to confusion and labeling errors. These problems ease as workers gain experience, but on small projects with tight deadlines, workers may only become familiar with the labels around the time the project ends.

 

Many exceptions and edge cases

This is common in the annotation of natural-language text. When there are many exceptions or edge cases that the specifications do not cover, the worker stops and checks the Q&A sheet, which accumulates answers about the specifications and how edge cases were handled. If the case is still unclear, the worker consults the PM or the reviewers/QA staff who know the specifications well. There are also many cases that even the PM cannot decide, and in those cases the PM raises the question with the customer and discusses it to settle the direction.

 

To ensure quality, the PM needs to compile these Q&A cases, share them with all workers, and create an environment where they are easy to view and check. In natural-language annotation it is inevitable that exceptions and edge cases accumulate, but as they do, information may fail to reach some workers, or workers may be too absorbed in their work to check every detail. This raises the probability of errors, so the PM needs to judge whether each case affects all workers or only a specific worker, and announce updates to the Q&A sheet accordingly. Because workers cannot remember every detail, it is also important to convey the underlying policy and direction in a generalized, easy-to-understand form. Holding meetings with workers and explaining things orally is also effective for ensuring quality. In short, a variety of management measures are needed.

 

Ambiguity is high and there is no absolute correct answer

This is also common in the annotation of text and dialogue, for example labeling dialogue text by the type of emotion expressed. Annotation tasks like this, where judgments differ widely from person to person, tend to be more difficult overall. With expressions of emotion, for example, if that is how the worker perceives it, then in a sense there is no other correct answer, and the worker can only label it as they perceive it.

 

To ensure quality in annotation that depends on individual sensibilities like this, it is important to select and assign workers carefully and to manage how well each person suits the task. Moreover, at the start of a project workers label carefully while checking the specifications and procedures, so their labeling tendencies match the annotation specifications. As the work is repeated, however, their sense can dull, the direction and boundaries between labels gradually shift, and without anyone noticing, the annotation results can drift away from the specifications.
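
One way to notice this kind of gradual drift is to compare the label distribution of a worker's early batch with that of a later batch. The following is only a minimal sketch, not a method described in this article: the function names, the sample data, and the threshold value are all hypothetical, and the distance used (total variation distance) is just one simple choice.

```python
from collections import Counter

def label_distribution(labels):
    """Map each label to its share of the given list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def distribution_shift(early, recent):
    """Total variation distance between two label distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    p = label_distribution(early)
    q = label_distribution(recent)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical labels from one worker: the first batch vs. a later batch.
early_batch = ["neutral"] * 50 + ["joy"] * 30 + ["anger"] * 20
recent_batch = ["neutral"] * 70 + ["joy"] * 20 + ["anger"] * 10

REVIEW_THRESHOLD = 0.15  # assumed value; would need tuning per project

shift = distribution_shift(early_batch, recent_batch)
needs_review = shift > REVIEW_THRESHOLD
print(f"shift={shift:.2f}, needs_review={needs_review}")
```

A large shift does not prove that the worker has drifted, since the data itself may have changed between batches. It is only a prompt for the PM to look at that worker's recent labels against the specifications.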

 

Data annotation with no absolute correct answer is often handled with a "consensus check", in which multiple people annotate the same material and the answer is decided by majority vote or agreement rate. For this kind of annotation it is common not to conduct third-party rechecks or reviews, and even when they are conducted they often have little effect. It is therefore important for the project manager to check each worker's labeling tendencies regularly and give corrective guidance in order to ensure quality.
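
As a rough illustration of the consensus check described above, the sketch below takes the labels that several workers gave to the same item and returns the majority label together with the agreement rate. The item IDs, the labels, and the rule of returning no label on a tie are assumptions made for the example, not a description of any particular vendor's process.

```python
from collections import Counter

def consensus(labels):
    """Return (majority_label, agreement_rate) for one item.

    labels: the labels that different workers assigned to the same item.
    majority_label is None when the top vote is tied.
    agreement_rate is the share of workers who chose the most common label.
    """
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    agreement_rate = top_votes / len(labels)
    return (None if tied else top_label), agreement_rate

# Hypothetical emotion labels from three workers per utterance.
annotations = {
    "utt-001": ["joy", "joy", "joy"],
    "utt-002": ["anger", "sadness", "anger"],
    "utt-003": ["joy", "anger", "sadness"],
}

results = {item: consensus(votes) for item, votes in annotations.items()}
for item, (label, rate) in results.items():
    print(item, label, round(rate, 2))
```

Items with a low agreement rate or a tie, such as `utt-003` here, are natural candidates to escalate to the PM or to add to the Q&A sheet.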

 

2. Tips for Outsourcing Data Annotation Based on Difficulty

Do not increase the number of labels or classes carelessly.

AI development has its own goals and objectives, so to some extent this is unavoidable. However, once you start thinking "let's label this too, just in case," the number of labels and classes keeps growing. This is ultimately a question of balance and trade-offs against the development goals, but it is important to define the goals of the AI development and the annotation specifications clearly, and not to increase the number of labels and classes carelessly.

 

Choose an annotation vendor that excels in handling exceptions and edge cases, as well as information management.

Exceptions and edge cases are inevitable in data annotation, and much of the time spent managing an annotation project goes into handling them. Apart from setting up and preparing the project, it is no exaggeration to say that annotation management begins and ends with handling edge cases.

 

If edge cases are not properly managed and communicated to everyone, the result is not only work mistakes and errors. The reviewers and checkers responsible for QA will not know how edge cases and exceptions should be handled either, which makes their checks meaningless. The PM therefore needs the know-how to manage this information properly and communicate it thoroughly to everyone involved in the annotation process. This type of information is especially hard to convey to workers through text alone, so it may be necessary to hold meetings, convey the nuances verbally, and confirm on the spot that people have understood. In this sense, it is more effective to entrust the work to a vendor that focuses on preventing errors than to one that simply finds and fixes them by brute force, as this ultimately lowers costs and stabilizes quality.

 

Choose a vendor with extensive experience in data annotation work with high ambiguity.

As mentioned earlier, data annotation of this kind has its own know-how and management methods, such as consensus checks and majority voting. Even for this kind of annotation, where conventional review-style checks are often not performed, the expected quality cannot be ensured if the work is simply left to run unattended. For text annotation such as emotion labeling in particular, human sensitivity and perception are often important factors, and the characteristics of each person need to be understood in advance in order to assign suitable people. In that sense, it is worth entrusting the work to a vendor with experience in this kind of annotation and checking, which can handle the management of personnel suitability and other fine-grained management.
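
One simple way to get a first picture of each person's labeling characteristics is to have candidates label the same trial batch and see how often each one matches the majority. This is a sketch under stated assumptions: the worker IDs, the trial data, and the choice to skip items with a tied vote are all hypothetical.

```python
from collections import Counter, defaultdict

def worker_agreement(trial):
    """Share of items on which each worker matched the majority label.

    trial: {item_id: {worker_id: label}}.
    Items whose top vote is tied have no majority and are skipped.
    """
    matched = defaultdict(int)
    counted = defaultdict(int)
    for votes in trial.values():
        ranked = Counter(votes.values()).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # no majority for this item
        majority = ranked[0][0]
        for worker, label in votes.items():
            counted[worker] += 1
            matched[worker] += int(label == majority)
    return {worker: matched[worker] / counted[worker] for worker in counted}

# Hypothetical trial batch labeled by three workers.
trial = {
    "utt-001": {"A": "joy", "B": "joy", "C": "joy"},
    "utt-002": {"A": "anger", "B": "anger", "C": "sadness"},
    "utt-003": {"A": "neutral", "B": "neutral", "C": "joy"},
    "utt-004": {"A": "joy", "B": "anger", "C": "sadness"},
}

rates = worker_agreement(trial)
print(rates)
```

A low rate does not by itself mean a worker is wrong, since the majority can also be mistaken. It only indicates whose perception differs from the group and may need a closer look against the specifications.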

 

3. Summary

What has been described so far may not mean that data annotation is difficult in the purest sense of the word. However, neglecting these factors can greatly affect the quality, cost, and delivery time of data annotation, whether it is done in-house or outsourced to an annotation vendor, and can prevent you from getting the results you want. In that sense it is difficult. Put simply, data annotation that requires high expertise is difficult because the correct answer cannot be determined, which means the expected quality cannot be guaranteed. Seen that way, the need for high expertise and the factors described above make annotation difficult in the same way.

 

If you are the one designing the annotation specifications, you have studied and thought them through in depth, so the factors described above may not strike you as particularly difficult. But for annotators who have only general knowledge or are seeing the specifications for the first time, it is natural that the hurdle is high. You get a quote from an annotation outsourcing company and the cost is higher than expected. You open the delivered data and find many errors. You are asked to revisit the cost after the work is done. We hope this article helps you avoid situations like these.

 

 

Author:

Kazuhiro Sugimoto

Annotation Department Group Manager

 

・Previous position as a Project Manager for a Tier 1 automotive parts manufacturer, with experience in quality design and improvement guidance for production lines, as well as managing model line construction projects and consulting teams for business efficiency improvement (lean improvement) across multiple departments.
・In current position, involved in launching and expanding the data annotation business, as well as directing the construction and improvement of management systems for data annotation projects, after experience in management systems such as ISO and knowledge management promotion. Holds a QC Level 1 certification and is a member of the Japan Association of Public Universities.



 

 

 
