
 

What is fine-tuning in LLMs?


07/24/2024




Since the emergence of LLMs (Large Language Models), exemplified by ChatGPT, many people have been able to interact with them easily and have been amazed by their ability to handle tasks such as complex programming, document creation, information retrieval, and analysis as naturally as a human conversation partner. The movement to leverage this capability in business and to promote digital transformation (DX) is also accelerating. LLMs promise dramatic efficiency gains across business activities, including content creation, marketing, market analysis, document search, and knowledge management. In practice, however, companies often fail to achieve the expected results even after adopting LLMs. In this article, we explain the challenges associated with utilizing LLMs and discuss ways to address them.

Table of Contents

1. Challenges in Utilizing LLMs
2. What is Fine-Tuning?
3. Summary
4. Human Science Annotation, LLM RAG Data Structuring Agency Service

1. Challenges in Utilizing LLMs

The overwhelming advantage of LLMs over traditional language models is their ability to understand and generate language conversationally, with accuracy comparable to, or even exceeding, that of humans; some models can already reach the passing standard of the national medical licensing examination. This capability was achieved by scaling up the three key factors that determine AI accuracy: computational power, data volume, and number of parameters. Yet while LLMs may seem all-powerful, they have their own challenges.

An LLM is trained on a vast amount of text collected from the internet. It does not, however, continuously learn from the enormous volume of text generated in real time; instead, its knowledge stops at a certain cutoff date (for example, September 2021 for GPT-3.5, the model behind the free version of ChatGPT). In addition, while an LLM excels at generating general answers across a wide range of fields, it cannot learn about specialized domains or internal company data that is not available on the internet (and even if some of it were, the amount would not be sufficient), so it cannot answer such questions accurately. In some cases it produces plausible-looking but incorrect information, a phenomenon known as "hallucination."

For example, consider utilizing LLMs in the manufacturing industry. Each company accumulates a vast amount of documentation daily across operations such as design and development, production management, quality control, and maintenance and inspection. Most of these company-specific documents are not publicly available on the internet, however, so LLMs cannot learn from them. Using a publicly available LLM as-is may therefore yield no answer, or hallucinations, and fail to deliver the desired results.

Various methods have been researched and devised to address these challenges when applying LLMs to specific purposes. Among them, fine-tuning and RAG are the most widely known. In this article, we focus on fine-tuning.

Reference Blog

>LLM and RAG: An Explanation of the Utilization of Generative AI in Business

2. What is Fine-Tuning?

Fine-tuning is a method of taking a pre-trained LLM and training it further on additional data tailored to a specific purpose. Fine-tuning itself predates LLMs as a technique in deep learning with neural networks, where a new layer is attached to the final layer of the network and trained on a new dataset. Its advantage is that, because it builds on a pre-trained model, it can be performed with a dataset that is very small compared to pre-training (though still in the thousands to tens of thousands of examples).
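The pre-LLM form of fine-tuning described above can be sketched in a few lines of PyTorch: freeze the pre-trained layers, attach a new output layer, and train only that layer. This is a minimal illustration only; the model, dimensions, and task below are hypothetical stand-ins, not a real pre-trained network.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained network: a frozen feature extractor.
# In practice this would be a large model with weights learned on a big corpus.
pretrained_body = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
)
for p in pretrained_body.parameters():
    p.requires_grad = False  # keep the pre-trained weights fixed

# Fine-tuning: attach a new output layer and train only its parameters.
new_head = nn.Linear(32, 4)  # 4 task-specific classes (hypothetical)
model = nn.Sequential(pretrained_body, new_head)

optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on dummy task data.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
loss = loss_fn(model(x), y)
loss.backward()   # gradients flow only into the new head
optimizer.step()

print(model(x).shape)  # torch.Size([8, 4])
```

Because the body is frozen, only the small new head is updated, which is why a comparatively small dataset can suffice.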

LLM providers such as OpenAI offer APIs for fine-tuning, allowing users to fine-tune hosted models. The basic process is to prepare a dataset, submit it for the LLM to learn from, and evaluate the results. The required data format varies by provider, but it generally consists of <input, output> pairs. For example, OpenAI's training data format is JSON, consisting of a prompt (question) and a completion (answer).

Example
{"prompt":"What are the features of model HS024?\n##\n", "completion":"As a new feature, it is equipped with a LiDAR sensor that can perceive three-dimensional space.\n###\n"}
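As a sketch of how such records might be assembled into a training file, each <input, output> pair becomes one JSON object per line (the JSONL format commonly used for fine-tuning uploads). The second Q&A pair, the file name, and the delimiter strings below are hypothetical examples, not real product data:

```python
import json

# Hypothetical <input, output> pairs for a product Q&A fine-tuning dataset.
pairs = [
    ("What are the features of model HS024?",
     "As a new feature, it is equipped with a LiDAR sensor that can perceive "
     "three-dimensional space."),
    ("When was model HS024 released?",
     "Model HS024 was released in 2023."),  # hypothetical answer
]

SEP, STOP = "\n##\n", "\n###\n"  # markers for prompt end / completion end

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for prompt, completion in pairs:
        record = {"prompt": prompt + SEP, "completion": completion + STOP}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Basic validation: every line parses and has exactly the required fields.
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        assert set(record) == {"prompt", "completion"}
```

Consistent end-of-prompt and end-of-completion markers help the model learn where an answer should stop.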

The amount of training data needed varies with the complexity and difficulty of the objective, but at least several thousand to tens of thousands of examples are typically required. Tuning with less is possible, but it carries the risk of the model forgetting what it learned during pre-training, or of overfitting, so it is important to prepare a sufficient amount of data.

The quality of the dataset also matters. Naturally, the data must not contain incorrect information, and it must comprehensively cover everything you want the model to learn. For example, to enable searches over your company's product information, you need to prepare comprehensive data about each product: its specifications, features, design, release date, and so on.
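Some of these quality requirements can be checked mechanically before training. The sketch below (all records and topic names are hypothetical) flags empty fields and exact duplicates, and reports which required topics the dataset does not yet mention:

```python
def check_dataset(records, required_topics):
    """Return (problems, missing_topics) for a list of prompt/completion dicts."""
    seen = set()
    problems = []
    for i, r in enumerate(records):
        # Empty prompts or completions teach the model nothing useful.
        if not r["prompt"].strip() or not r["completion"].strip():
            problems.append(f"record {i}: empty field")
        # Exact duplicates waste training budget and can bias the model.
        key = (r["prompt"], r["completion"])
        if key in seen:
            problems.append(f"record {i}: duplicate")
        seen.add(key)
    # Crude coverage check: is each required topic mentioned anywhere?
    text = " ".join(r["prompt"] + " " + r["completion"] for r in records).lower()
    missing = [t for t in required_topics if t not in text]
    return problems, missing

# Hypothetical records: the second is an exact duplicate of the first.
records = [
    {"prompt": "What are the features of model HS024?",
     "completion": "It has a LiDAR sensor."},
    {"prompt": "What are the features of model HS024?",
     "completion": "It has a LiDAR sensor."},
]
problems, missing = check_dataset(
    records, ["specifications", "features", "release date"])
print(problems)  # ['record 1: duplicate']
print(missing)   # ['specifications', 'release date']
```

Simple substring matching is only a first pass; in practice, judging coverage and correctness still requires human review.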

Creating such data often has to be done manually, and given the amount and quality required, the labor involved is considerable. When fine-tuning in-house, the work of creating training data comes on top of regular operations, which can make it difficult to allocate the necessary resources.

3. Summary

LLMs possess a versatility unmatched by previous AI. That does not mean, however, that they can be used as-is for specialized fields or for the latest information.

To utilize LLMs in business, they must be customized for specific purposes, and fine-tuning, the method highlighted here, is one way to do so. As we have seen, collecting and creating the training datasets for fine-tuning requires substantial human effort. That work, however, often does not require the specialized knowledge expected of IT engineers. In such cases, outsourcing to a vendor specializing in training data creation can be a good option.

Our company has extensive experience and a strong track record in annotation work for creating training data, beginning with natural language processing annotation well before the advent of LLMs. We hope to leverage this experience to support our clients' digital transformation by creating training data for fine-tuning LLMs.

4. Human Science Annotation, LLM RAG Data Structuring Agency Service

A rich track record of creating 48 million pieces of training data

At Human Science, we are involved in AI model development projects across various industries, including medical support, automotive, IT, manufacturing, and construction, with a focus on natural language processing. Through direct transactions with many companies, including GAFAM, we have provided over 48 million high-quality training data items. We accommodate all types of annotation, data labeling, and data structuring, from small-scale projects to long-term large projects with a team of 150 annotators, regardless of industry.

Resource management without using crowdsourcing

At Human Science, we do not use crowdsourcing; instead, we advance projects with personnel directly contracted by our company. We form teams that can deliver maximum performance based on a solid understanding of each member's practical experience and their evaluations from previous projects.

Supports not only annotation but also the creation and structuring of generative AI LLM datasets

In addition to labeling and annotation for discriminative AI, we also support the structuring of document data for building generative AI and LLM RAG systems. Since our founding, manual (documentation) production has been one of our core businesses, and we leverage the know-how gained from a deep understanding of diverse document structures to provide optimal solutions.

Equipped with a security room in-house

At Human Science, we maintain a security room that meets ISMS standards within our Shinjuku office, so we can ensure security even for projects handling highly confidential data. We consider the protection of confidentiality extremely important for every project. Even for remote projects, our information security management system has received high praise from clients, as we implement not only hardware measures but also continuous security training for our personnel.

 

 

 
