
Annotation projects often use crowdsourcing to bring in a large number of workers and process a large volume of data, and many annotation vendors build their services on it. In this article, I explain the advantages and disadvantages of using crowdsourcing, or annotation vendors that rely on it, and offer tips and points to consider when using it for annotation work.
- Table of Contents
- 1. Benefits of Crowdsourcing
- 1-1. Scope and Diversity of Work
- 1-2. Delivery and Cost
- 2. Disadvantages of Crowdsourcing
- 2-1. Variation in Quality
- 2-2. Security
- 2-3. Specific Domain Knowledge and Difficulty
- 2-4. Communication and Education
- 3. Summary (Tips for Utilizing Crowdsourcing)
- 4. Human Science Annotation Agency Services
1. Benefits of Crowdsourcing
1-1. Scope and Diversity of Work
Annotation often requires a large amount of training data, and preparing it in a short period inevitably requires many workers. Hiring people and getting them started takes considerable time and effort, so when many workers are needed, crowdsourcing, with its large pool of human resources, is an effective option. It works particularly well when the annotation is relatively simple, workers do not need long-term training, and the skill requirements are not particularly high.
Crowdsourcing is also effective when you want to collect diverse data. Its large pool of workers makes it relatively easy to collect data in volume. For example, AI development for OCR requires a significant amount of handwritten text; gathering a small amount of handwriting from each of many people produces a diverse training dataset, which helps the model learn and improves accuracy. This kind of data collection is where crowdsourcing excels, and it is useful whenever diversity in the training data matters, not only volume.
1-2. Delivery and Cost
Hiring a large number of workers in-house takes significant cost and effort. Even annotation vendors that keep a pool of registered workers need considerable time and money to secure enough people for a specific project and put them to work, and this is inevitably reflected in their fees. Crowdsourcing platforms and the vendors that use them already have a large pool of workers, so they can secure staff relatively quickly and cheaply and start work early. As a result, hiring costs are lower and delivery times are generally shorter.
Beyond recruitment, deploying many workers also shortens the time the work itself takes, which in turn shortens the period during which personnel and work data must be managed. Although a large amount of data and many people must be managed in a short period, this generally tends to improve cost efficiency and reduce the total cost.
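The relationship between headcount, delivery time, and total cost can be illustrated with a rough calculation. The sketch below is only an illustration: the function `estimate_project`, its parameters, and every number are hypothetical, and it assumes the daily management cost stays fixed regardless of team size, which is a simplification, since managing more people usually takes more effort.

```python
import math

def estimate_project(items, seconds_per_item, workers, hours_per_day,
                     hourly_rate, daily_management_cost):
    """Rough duration and cost of an annotation project (hypothetical model)."""
    # Total annotation effort does not depend on how many workers share it.
    total_hours = items * seconds_per_item / 3600
    # More workers means fewer calendar days to finish the same effort.
    days = math.ceil(total_hours / (workers * hours_per_day))
    labor = total_hours * hourly_rate
    # Assumption: management overhead is a fixed amount per calendar day.
    management = days * daily_management_cost
    return {"days": days, "labor": labor,
            "management": management, "total": labor + management}

# Hypothetical project: 100,000 items at 36 seconds each.
small_team = estimate_project(100_000, 36, workers=5, hours_per_day=5,
                              hourly_rate=10, daily_management_cost=200)
large_team = estimate_project(100_000, 36, workers=50, hours_per_day=5,
                              hourly_rate=10, daily_management_cost=200)

print(small_team["days"], small_team["total"])  # 40 days, total 18000.0
print(large_team["days"], large_team["total"])  # 4 days, total 10800.0
```

Under these assumptions the labor cost is identical for both teams, and the saving comes entirely from the shorter management period. If per-worker onboarding or supervision costs were added, the advantage of the larger team would shrink accordingly.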
2. Disadvantages of Crowdsourcing
2-1. Variation in Quality
Using many workers for a short period to mass-produce training data is a significant advantage, but keeping annotations and training data quality consistent across workers is often difficult in that situation. As the number of workers grows, work instructions, points of caution, specification changes, and rules for handling edge cases often fail to reach everyone. Confirming that workers have actually understood those instructions is an important part of managing an annotation project, but it becomes harder as headcount grows, which increases the risk of quality variation.
In addition, crowdsourcing contracts are typically made per project or per task, which can weaken workers' sense of loyalty to the client company. Whatever measures are taken, workers' commitment to the quality of the work tends to be low.
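One common way to detect the quality variation described in this section is to have several workers label the same items and compare their answers. The following is a minimal sketch of that idea, not a method described in this article: the function names, the sample data, and the 0.75 review threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def majority_vote(labels):
    """Return (winning_label, share_of_votes) for one item."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

def pairwise_agreement(annotations):
    """Mean fraction of shared items on which each pair of workers agrees.

    annotations: dict of worker_id -> dict of item_id -> label
    """
    scores = []
    for a, b in combinations(sorted(annotations), 2):
        shared = annotations[a].keys() & annotations[b].keys()
        if not shared:
            continue  # this pair has no items in common
        same = sum(annotations[a][i] == annotations[b][i] for i in shared)
        scores.append(same / len(shared))
    return sum(scores) / len(scores) if scores else None

def flag_for_review(annotations, min_share=0.75):
    """List items whose majority label wins less than min_share of the votes."""
    by_item = {}
    for worker_labels in annotations.values():
        for item, label in worker_labels.items():
            by_item.setdefault(item, []).append(label)
    return sorted(item for item, labels in by_item.items()
                  if majority_vote(labels)[1] < min_share)

# Hypothetical labels from three workers on three images.
annotations = {
    "w1": {"img1": "cat", "img2": "dog", "img3": "cat"},
    "w2": {"img1": "cat", "img2": "dog", "img3": "dog"},
    "w3": {"img1": "cat", "img2": "cat", "img3": "dog"},
}

agreement = pairwise_agreement(annotations)
needs_review = flag_for_review(annotations)
print(round(agreement, 3))  # 0.556
print(needs_review)         # ['img2', 'img3']
```

A low agreement score or a long review list does not say which worker is wrong, only that the instructions may not have reached everyone in the same way. Raw agreement also ignores agreement that occurs by chance, so a chance-corrected measure such as Cohen's kappa may be preferable in practice.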
Related Columns
>How to deal with edge cases that cannot be covered in specifications
>Management of Annotation Work in Human Science
2-2. Security
Crowdsourcing workers often work from home. Annotation data, however, frequently includes sensitive information and requires a high level of confidentiality. When the work has to be done in a security room on the vendor's premises, crowdsourcing generally cannot accommodate it. Security training is also difficult: because contracts are often project-based and temporary, it is generally hard to secure the soft side of security, such as continuous education. Security measures that rely solely on the hard side, such as equipment and systems, have inherent limits, and a worker who lacks security knowledge and awareness may cause a breach without intending to or even realizing it. This is a common occurrence.
Related Columns
>Creating an annotation work environment on-site (in our company's security room)
2-3. Specific Domain Knowledge and Difficulty
Some annotation work is highly difficult and requires specific domain knowledge and expertise. Even in the large talent pool of crowdsourcing, few individuals have specialized knowledge in a given field. To build expertise and domain knowledge and to ensure quality in difficult annotation work, it is crucial to keep the same workers on the project as far as possible and to invest in their skill development. In crowdsourcing, however, contracts are often project-based, which is generally said to make it difficult to retain the same workers over a longer period while continuously training them.
Related Columns
>What is the surprising difficulty of annotation?
2-4. Communication and Education
The security and domain knowledge education described above, as well as making sure that work instructions are understood and followed, all require communication with workers. Communication matters even more for annotation tasks that involve judging ambiguous items.
Furthermore, domain knowledge often includes many tacit elements, so text-based manuals and materials alone are often insufficient for training. It is usually more effective to show the actual screen and explain while demonstrating the work. This requires real-time Q&A in meetings, but as the number of participants increases, meeting times become harder to coordinate. Training and instructions are then delayed, and as a result they may not be reflected in the deliverables.
At first glance, annotation may seem like a simple task, and one might think that such training and communication matter little as long as work instructions and procedures are carefully prepared. However, that impression may come from already being well-versed in the annotation specifications and the domain. Moreover, exceptions and edge cases inevitably arise as the work progresses, which makes this communication essential.
3. Summary
I hope the discussion so far has clarified the advantages and disadvantages of crowdsourcing. If you understand them and use crowdsourcing accordingly, it is a very effective means of annotation and data collection.
To make it effective, however, it is important to decide when to use it based on factors such as the purpose of the AI development, the development phase, the annotation specifications, the required quality level, the characteristics of the work, and security requirements. What I have described is general and does not apply in every case. Many annotation vendors that use crowdsourcing continuously refine their services to overcome its disadvantages and make the most of its advantages.
Therefore, when considering outsourcing annotation or data collection, we recommend not relying solely on email or inquiry forms. Instead, hold a meeting to discuss the points raised in this article and obtain a quote based on that discussion.
4. Human Science Annotation Agency Services
A rich track record of creating 48 million pieces of training data
At Human Science, we participate in AI model development projects across various industries, including natural language processing, medical support, automotive, IT, manufacturing, and construction. Through direct transactions with many companies, including GAFAM, we have provided over 48 million pieces of high-quality training data. We handle a wide range of data labeling projects in any industry, from small-scale projects to long-term, large-scale projects with 150 annotators. If your company wants to implement AI models but doesn't know where to start, please feel free to consult with us.
Resource management without using crowdsourcing
At Human Science, we do not use crowdsourcing; instead, we advance projects with personnel directly contracted by our company. We form teams that can deliver maximum performance based on a solid understanding of each member's practical experience and their evaluations from previous projects.
Utilize the latest data labeling tools
AnnoFab, one of the data labeling tools adopted by Human Science, lets customers check progress and provide feedback in the cloud even while a project is in progress. Work data cannot be saved to local machines, which also helps with security.
Equipped with a security room in-house
At Human Science, we have a security room that meets ISMS standards within our Shinjuku office. This allows us to handle even highly confidential projects on-site while ensuring security. We consider the protection of confidentiality to be extremely important for all projects. Our staff undergoes continuous security training, and we exercise the utmost caution in handling information and data, even for remote projects.