How to Ensure and Improve the Quality of Teacher Data: A Guide to Practical Methods

AI is being adopted in a growing number of fields. Because it can learn to process qualitative data as well as quantitative data, the range of areas where it is used will continue to expand. AI improves its recognition accuracy through learning. Methods include "supervised learning," which uses labeled training data called teacher data, and "unsupervised learning," where AI finds patterns in data on its own without needing labels. This article mainly introduces practical methods for creating training data during an AI development project.

Table of Contents

1. What is Teacher Data?
2. Importance of Teacher Data Quality
2-1. Benefits of High-Quality Teacher Data
2-2. What is High-Quality Teacher Data?
2-3. Disadvantages of Low-Quality Teacher Data
3. How to Create Basic Teacher Data
3-1. Setting the Objective
3-2. Collecting Data
3-3. Perform Data Annotation
4. Specific Practices for Ensuring the Quality of Teacher Data
4-1. Clarify the purpose
4-2. Ensure Quality and Appropriate Quantity of Collected Data
4-3. Ensure the Quality of Data Annotation
4-4. Consideration for Security, Privacy, and Copyright
5. Summary
6. Data Annotation Services by Human Science Co., Ltd.

1. What is Teacher Data?

For AI to recognize the necessary information in data, it must first learn. The data used for this learning is called teacher data. For example, for AI to recognize whether an image contains a car, we prepare an image of a car and attach the information "there is a car in this photo" to it as teacher data. The process of adding such information to data is called data annotation.

2. Importance of Teacher Data Quality

The quality of teacher data is a very important factor in determining the accuracy of AI, because AI learns from teacher data and then makes predictions on unknown data. To avoid problems such as incorrect recognition caused by poor-quality teacher data, it is important to ensure quality when the teacher data is created.

2-1. Benefits of High-Quality Teacher Data

AI trained with high-quality teacher data can make more accurate judgments on unknown data. This reduces the need to retrain the AI or correct the teacher data, so the project can progress without slowing AI development.

2-2. What is High-Quality Teacher Data?

We have mentioned the benefits of high quality for AI learning, but what exactly does "high quality" mean?

In order to achieve the goal of AI development, it is necessary to clearly define the requirements for AI to correctly recognize data and to define what kind of training data is needed. Based on this definition, it is necessary to prepare work instructions and specifications for annotation, and create training data according to them.

One thing to keep in mind here is that annotation accuracy and annotation quality are not the same thing. Let's consider the example of annotating a car with a bounding box. You might think that for high quality, the bounding box must fit the car perfectly. However, if the specification does not require that level of accuracy, a box with some margin can still be high quality. What matters is following the specification faithfully, rather than blindly pursuing accuracy.

In short, high-quality teacher data is teacher data that has been annotated accurately according to the work instructions and specifications.
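As a rough illustration of judging a bounding box against a specification rather than against a pixel-perfect fit, the sketch below accepts a box if it fully contains a reviewer's reference box and leaves no more than an allowed margin on any side. The `Box` type, the `meets_spec` function, and the margin values are assumptions made for this example, not part of any particular tool or specification.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixels: top-left (x1, y1), bottom-right (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def meets_spec(annotated: Box, reference: Box, max_margin_px: float) -> bool:
    """Return True if the annotated box contains the reference box and
    no side extends more than max_margin_px beyond it."""
    margins = (
        reference.x1 - annotated.x1,  # slack on the left
        reference.y1 - annotated.y1,  # slack on the top
        annotated.x2 - reference.x2,  # slack on the right
        annotated.y2 - reference.y2,  # slack on the bottom
    )
    return all(0 <= m <= max_margin_px for m in margins)


reference = Box(100, 50, 300, 200)   # hypothetical reviewer's tight box
loose_box = Box(96, 47, 304, 203)    # 3 to 4 px of slack on every side

print(meets_spec(loose_box, reference, max_margin_px=5))  # True
print(meets_spec(loose_box, reference, max_margin_px=2))  # False
```

The same box passes under a 5-pixel tolerance and fails under a 2-pixel one, which is the point: whether an annotation is "high quality" depends on what the specification asks for.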

2-3. Disadvantages of Low-Quality Teacher Data

When AI is trained with low-quality training data, misrecognition becomes hard to avoid. In such cases, you incur not only the cost of correcting or adding training data but also the associated management cost. Therefore, when considering outsourcing data annotation, weigh quality and other factors as well, not just cost reduction.

For example, suppose the specification says "create a bounding box that fits perfectly around the car," but a rough bounding box is drawn instead. The box may then include background that is not part of the car, and the model may learn from that background as noise, which can lead to misrecognition. As a result, the training data may have to be recreated, incurring unnecessary cost and delaying development, which can hurt the project.

3. How to Create Basic Teacher Data

Creating teacher data is divided into three main steps. First, we set the goal of the AI. Next, we collect the necessary data. And finally, we perform data annotation on the collected data.

3-1. Setting the Objective

First, we set the purpose of the AI. For example, the purpose might be to develop AI that performs image recognition while driving, for use in autonomous driving. Simply saying "image recognition" does not tell the AI what to learn, so we need to decide which kinds of objects we want it to learn.

3-2. Collecting Data

Collect the necessary data for the set purpose. In the example above, this would include images taken with an onboard camera.

3-3. Perform Data Annotation

We annotate the collected data, for example with bounding boxes, to create training data that lets AI recognize the data correctly.

Note: Meaning of Data Annotation
The word "annotation" originally means adding notes to something.
For example, it is like putting a sticky note on a book, or marking a word with an asterisk and writing a note in the margin.
When creating training data, we mark the specific parts of the data that the AI needs to recognize (for example with bounding boxes or segmentation for images, or underlines for text) and attach labels to them (this is also known as labeling), so that the AI can learn from the data.
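To make the idea of "a location linked with a label" concrete, here is a minimal sketch of how one annotated image might be represented. The class names, field names, file name, and coordinates are all hypothetical; real annotation tools each define their own formats.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BoxAnnotation:
    """One labeled region: a label plus a bounding box in pixels."""
    label: str
    x: int        # left edge
    y: int        # top edge
    width: int
    height: int


@dataclass
class AnnotatedImage:
    """An image file together with all of its annotations."""
    file_name: str
    annotations: List[BoxAnnotation] = field(default_factory=list)


record = AnnotatedImage(
    file_name="frame_000123.jpg",  # hypothetical onboard-camera frame
    annotations=[
        BoxAnnotation("car", x=412, y=230, width=180, height=95),
        BoxAnnotation("pedestrian", x=120, y=210, width=40, height=110),
    ],
)

# Serialize the record so it can be stored alongside the image.
print(json.dumps(asdict(record), indent=2))
```

Each entry pairs where something is (the box) with what it is (the label), which is the information the AI learns from.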

4. Specific Practices for Ensuring the Quality of Teacher Data

Here, we will introduce the basic practical methods for creating teacher data.

4-1. Clarify the purpose

As mentioned in the section on setting the objective, it is difficult for AI to learn if the purpose is vague. The same applies when creating training data: vague instructions make accurate data annotation impossible. As a result, the quality of the training data cannot be assured, and AI learning will not go well. Clarifying the purpose helps you avoid these problems.

4-2. Ensure Quality and Appropriate Quantity of Collected Data

In order for AI to learn, a certain amount of organized teacher data is necessary. The amount varies depending on the complexity of the goal, so it cannot be said definitively, but in the case of images, it often requires thousands to tens of thousands of images.

Use your company's data

If your company has a large amount of data accumulated, you can utilize it. Even data that was saved without assuming AI usage, such as meeting minutes, call logs, images, and videos, may have the potential for creating new value by utilizing AI. Since there is no need to collect new data, the development period can also be shortened. There are also tools that can extract images from videos, so if you can prepare the necessary amount of data, utilizing your company's data would be a good choice.

Use surveys

Surveys, including email surveys, are also a good way to collect data. In the past, methods such as street surveys, mail, and phone calls were very time-consuming and costly, but with SNS and crowdsourcing it is now relatively easy and inexpensive to obtain the unfiltered opinions of the audience you want to analyze with AI.

Use open datasets

If your company is unable to gather such a large amount of data on its own, you may also consider utilizing open datasets such as COCO.
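Before relying on an open dataset, it helps to check how many annotations it actually contains for the classes you care about. The sketch below counts bounding boxes per category in a small, made-up dictionary that follows the general layout of COCO-style annotation files ("images", "annotations", and "categories", with boxes given as [x, y, width, height]). The sample values and the function name are assumptions for illustration; a real file would be loaded with `json.load` first.

```python
from collections import Counter

# A tiny hand-written sample in a COCO-like layout (not real COCO data).
coco_like = {
    "images": [
        {"id": 1, "file_name": "a.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "b.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3, "bbox": [50, 60, 200, 120]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [300, 80, 40, 110]},
        {"id": 12, "image_id": 2, "category_id": 3, "bbox": [10, 200, 180, 100]},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 3, "name": "car"},
    ],
}


def count_boxes_per_category(dataset: dict) -> Counter:
    """Count annotations per category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])


counts = count_boxes_per_category(coco_like)
for name, n in counts.most_common():
    print(name, n)
# car 2
# person 1
```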

Be mindful of data bias

Prepare a variety of data. For example, when collecting images from car-mounted cameras, include a range of scenes such as urban areas, highways, mountainous areas, night scenes, and rainy weather, not just urban areas. This helps prevent the AI from overfitting to specific situations, where accuracy improves only for those images and drops for all others.
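A simple way to notice this kind of bias early is to tag each collected image with its scene and look at the distribution. The sketch below flags scenes whose share of the dataset falls below a chosen threshold. The scene names, the 5% threshold, and the function name are assumptions for this example; the right threshold depends on the project.

```python
from collections import Counter
from typing import List


def underrepresented_scenes(
    scene_tags: List[str], expected_scenes: List[str], min_share: float
) -> List[str]:
    """Return the expected scenes whose share of the data is below min_share."""
    counts = Counter(scene_tags)
    total = len(scene_tags)
    if total == 0:
        return list(expected_scenes)
    return [s for s in expected_scenes if counts[s] / total < min_share]


# Hypothetical scene tags for 100 collected images.
tags = ["urban"] * 70 + ["highway"] * 20 + ["night"] * 8 + ["rain"] * 2
expected = ["urban", "highway", "mountain", "night", "rain"]

print(underrepresented_scenes(tags, expected, min_share=0.05))
# ['mountain', 'rain']
```

Here mountain scenes are missing entirely and rainy scenes make up only 2% of the data, so both would be candidates for additional collection.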

Include negative samples as well

In addition to data that contains the target you want to recognize, include data that does not contain it. This type of data is called a negative sample and is effective in improving the recognition accuracy of AI.

4-3. Ensure the Quality of Data Annotation

In order to ensure the quality of teacher data, it is necessary to perform accurate data annotation. To do this, it is important to determine the specifications for annotation, design standards and rules so that anyone can produce the same quality teacher data, and implement various management methods to create high-quality teacher data in accordance with those specifications.

Determine the specifications of data annotation

The specifications for data annotation are determined from the requirements that were clearly defined when setting the objective.

Designing Work Processes and Rules

We create work instructions and specifications based on the data annotation specifications. When creating them, devise a work method that is less prone to mistakes. Data annotators will not all understand the work instructions and specifications to the same degree, so their labeling may vary. The work process needs to be designed to minimize such questions and variations, for example by including a step that not only checks the annotated data but also feeds the results back to the annotators.
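One common way to measure how much annotators vary is to have two of them label the same items and compute an agreement score. The sketch below uses Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is one possible metric chosen for illustration, not a method prescribed by this article, and the labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on how often each annotator uses each label.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["car", "car", "truck", "car", "bus", "truck", "car", "bus"]
annotator_b = ["car", "car", "truck", "truck", "bus", "truck", "car", "car"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.6
```

A low score on a shared batch suggests the instructions are being read differently, which is a cue to revise the wording or give feedback before annotation continues at scale.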

Quality Management

To ensure the quality of data annotation, it is important to determine specifications and design processes. However, appropriate management is essential for annotation to actually proceed according to them. Always watch for discrepancies between the work and the work instructions and specifications, and in addition to checks and feedback, hold meetings and individual hearings as needed to manage quality.
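As one possible form such checks could take, the sketch below draws a random sample of annotated items for review and decides whether a batch passes based on the share of rejected items. The 10% sample rate, the 5% error threshold, and the function names are assumptions for this example; actual values should come from the project's own quality criteria.

```python
import random
from typing import List


def pick_review_sample(item_ids: List[int], sample_rate: float, seed: int = 0) -> List[int]:
    """Randomly choose items to review; a fixed seed keeps the choice reproducible."""
    if not item_ids:
        return []
    rng = random.Random(seed)
    k = min(len(item_ids), max(1, round(len(item_ids) * sample_rate)))
    return sorted(rng.sample(item_ids, k))


def batch_passes(num_reviewed: int, num_rejected: int, max_error_rate: float) -> bool:
    """Return True if the rejected share of reviewed items is within the limit."""
    if num_reviewed <= 0:
        raise ValueError("At least one item must be reviewed")
    return num_rejected / num_reviewed <= max_error_rate


batch = list(range(200))                      # hypothetical item IDs in one batch
to_review = pick_review_sample(batch, sample_rate=0.1)
print(len(to_review))                         # 20

print(batch_passes(num_reviewed=20, num_rejected=1, max_error_rate=0.05))  # True
print(batch_passes(num_reviewed=20, num_rejected=2, max_error_rate=0.05))  # False
```

A batch that fails can be returned to the annotator together with the reviewer's comments, which ties the check to the feedback step.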

4-4. Consideration for Security, Privacy, and Copyright

Depending on the case, the data being annotated may need to be handled in a highly secure environment. It is also possible to use a dataset without realizing that it infringes privacy or contains copyrighted material. Carefully check the conditions under which the data may be used beforehand.

5. Summary

The accuracy of AI recognition is influenced by the quality of the training data. Even if the goal is clearly defined and the necessary amount and variety of data can be collected, the desired results cannot be obtained if the quality of the training data is low. To avoid this, we have explained practical methods for ensuring and improving quality. Data annotation in particular, which is necessary for creating training data, relies on manual work, so people-centered management is essential. These points may seem obvious, but they are easy to overlook, so make sure you have a firm grasp of them.

By implementing these practices, the quality of teacher data can be ensured, and improvement in AI accuracy can be expected. This will lead to the success of the project if the set objectives are achieved, and will allow for a smooth progression to the next step, such as tackling more challenging objectives or releasing products.

6. Data Annotation Services by Human Science Co., Ltd.

A track record of creating 48 million pieces of teacher data

At Human Science, we are involved in AI model development projects in various industries such as natural language processing, medical support, automotive, IT, manufacturing, and construction. Through direct transactions with numerous companies including GAFAM, we have provided over 48 million pieces of high-quality training data. We handle annotation projects in any industry, from small-scale projects to large-scale projects with 150 annotators. If your company is interested in introducing AI models but unsure of where to start, please consult with us.

Resource Management without Using Crowdsourcing

At Human Science, we do not use crowdsourcing and instead directly contract with workers to manage projects. We carefully assess each member's practical experience and evaluations from previous projects to form a team that can perform to the best of their abilities.

Utilize the latest data annotation tools

AnnoFab, one of the annotation tools Human Science has adopted, allows customers to check progress and provide feedback in the cloud even while a project is underway. Work data cannot be saved to local machines, which also helps with security.

Equipped with a security room within the company

At Human Science, we have a security room that meets the ISMS standards in our Shinjuku office. This allows us to provide on-site support for highly confidential projects and ensure security. We consider confidentiality to be extremely important for all projects at our company. We continuously provide security education to our staff and pay close attention to the handling of information and data, even for remote projects.