
Large language models (LLMs), exemplified by ChatGPT, have improved dramatically in performance since their emergence. They can perform a wide range of tasks, such as generating text and program code, analyzing information, and gathering information, with prompts (instructions to the AI) given in everyday natural language. Unlike earlier "discriminative" AI models, which had to be developed from scratch for each specific task, the barrier to adoption is far lower, and LLMs are now used in fields as varied as business, research and development, and entertainment. In business in particular, improving operational efficiency through digital transformation (DX) has become an essential challenge, and efforts to put LLMs to work are gathering pace.
Many LLMs are developed under the leadership of American companies such as Meta and Microsoft. These models are trained primarily on English and adapted to other languages through additional training. For everyday use in Japanese their performance is generally sufficient, but accuracy remains a challenge in more specialized fields, and the same holds for business applications. LLMs specialized for Japanese, reflecting Japan's distinctive culture and business practices, could further accelerate digital transformation at domestic companies. Accordingly, a growing number of companies and research institutions are developing Japanese-specific LLMs, and this article introduces several of them.
1. Why a Japanese Language-Specific LLM?
The capabilities of LLMs are realized primarily by learning from the vast amount of data available on the internet. As of January 2024, English accounted for over 50% of the languages used in web content, with Spanish in second place at 6% and Japanese at approximately 4% (Statista: Languages most frequently used for web content as of January 2024, by share of websites, February 2024). As the name suggests, LLMs improve their performance by training on massive datasets: the more data, the more broadly they can learn, raising their versatility, generative ability, and accuracy. Because many of the companies developing LLMs are based in English-speaking regions, English is the language in which LLMs perform best.
Of course, many LLMs support Japanese, and their capabilities are evident in tools like ChatGPT. However, far less Japanese-language data is available on the internet than English, so depending on the question, these models may fail to generate accurate responses. Japanese also poses unique challenges: it interweaves multiple scripts (kanji, hiragana, and katakana), word boundaries are not marked explicitly, and sentences often must be interpreted from their surrounding context to a degree less required in English. Furthermore, many existing LLMs still have room to improve in generating responses that naturally reflect Japanese-specific cultural nuances and expressions, such as honorifics and phrasing. The next section introduces LLMs currently being developed with a focus on Japanese; first, a short example makes the word-boundary issue concrete.
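The sketch below is a minimal illustration, not part of any particular LLM's pipeline: it contrasts a Japanese morphological analyzer with a byte-level subword tokenizer trained mostly on English text. The fugashi/unidic-lite and transformers packages, and the use of GPT-2's tokenizer as the English-centric example, are assumptions made for illustration.

```python
# A minimal sketch of the Japanese word-boundary problem.
# Assumes: pip install "fugashi[unidic-lite]" transformers
from fugashi import Tagger
from transformers import AutoTokenizer

text = "東京都に住んでいます"  # "I live in Tokyo" -- no spaces between words

# Morphological analysis with MeCab (via fugashi): linguistically aware splits
tagger = Tagger()
print([word.surface for word in tagger(text)])
# e.g. ['東京', '都', 'に', '住ん', 'で', 'い', 'ます']

# A subword tokenizer trained mostly on English splits the same text
# very differently (GPT-2 here is illustrative, not an endorsement).
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize(text))
# GPT-2's byte-level BPE falls back to byte fragments for Japanese,
# producing many more, less meaningful tokens.
```

Tokenizers and training data tuned to Japanese avoid this fragmentation, which is one motivation for the Japanese-specific models introduced next.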
2. Three Recommended Japanese-Specific LLMs
●CyberAgentLM3
This is a Japanese-specific LLM developed by CyberAgent, Inc. As of July 2024, the model has 22.5 billion parameters and has recorded performance on par with Meta's "Meta-Llama-3-70B-Instruct" on the Japanese-language evaluation benchmark "Nejumi LLM Leaderboard 3," placing it among the top-class Japanese LLMs.
Reference URL: CyberAgentLM3 Demo
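As a rough idea of how such a model can be tried locally, here is a minimal sketch using the Hugging Face transformers library. It assumes the checkpoint is published on the Hugging Face Hub as cyberagent/calm3-22b-chat and that sufficient GPU memory is available for a 22.5-billion-parameter model; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: chatting with CyberAgentLM3 via Hugging Face transformers.
# Assumes the checkpoint "cyberagent/calm3-22b-chat" and enough GPU memory
# for a 22.5B-parameter model (device_map="auto" offloads where possible).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberagent/calm3-22b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "日本語で自己紹介してください。"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```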
●ao-Karasu
Lightblue Inc. is a startup offering services that add AI assistant features to chat tools such as Slack. In December 2023 it released "Qarasu," a Japanese LLM with top-class performance in Japan. Development continued after that release, and "ao-Karasu," announced in March 2024 just four months later, is a 72-billion-parameter LLM whose performance reportedly surpasses GPT-3.5.
Reference URL: ao-Karasu: Cutting-edge 72B Japanese LLM Development
●ELYZA
ELYZA, a startup born from the Matsuo Laboratory at the University of Tokyo, develops LLMs focused on Japanese. Its models are based on Meta's "Llama 3" and further trained on a proprietary Japanese dataset; the commercially usable "Llama-3-ELYZA-JP-8B" is a lightweight model with 8 billion parameters. In automatic evaluations on two benchmarks of Japanese performance (ELYZA Tasks 100 and Japanese MT-Bench), it achieved performance comparable to "GPT-3.5 Turbo" and "Gemini 1.0 Pro."
As of June 2024, a 70-billion-parameter model is also in development, with Japanese performance reportedly surpassing GPT-4; a glimpse of its capabilities is available in the demo version.
Reference URL: ELYZA LLM for JP Demo Version
3. Summary
We have introduced three Japanese-specific LLMs. Of course, many other companies and research institutions are also developing excellent Japanese-specific LLMs, some of which rival ChatGPT in performance, and business adoption of these models is expected to advance further. Japanese-specific LLMs can accelerate DX initiatives, such as improving operational efficiency and knowledge management by combining internal data with technologies like RAG (retrieval-augmented generation). On the other hand, much internal data is unstructured: emails, documents containing charts and images, meeting minutes, and business reports are difficult to use as training or reference data in their original form. Implementing an LLM therefore requires structuring this data first. Since such work can be difficult to carry out in-house, we recommend considering external vendors specializing in data structuring as you proceed with LLM adoption.
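To make the RAG workflow mentioned above concrete, the sketch below shows only the retrieval step: cleaned document chunks are embedded, and the chunk most similar to a question is retrieved as context for the LLM. The embedding model name is an assumption for illustration; a production system would add real document parsing and structuring (the work discussed above), a vector database, and the actual LLM call.

```python
# Minimal RAG retrieval sketch: chunk -> embed -> retrieve by cosine similarity.
# Assumes: pip install sentence-transformers (the model name is illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

# In practice these chunks come from structured internal documents
# (emails, minutes, reports) after cleaning and segmentation.
chunks = [
    "経費精算は月末までに申請してください。",            # "Submit expense reports by month-end."
    "社内Wi-Fiのパスワードは総務部が管理しています。",    # "General Affairs manages the Wi-Fi password."
    "年次有給休暇は入社6ヶ月後に付与されます。",          # "Paid leave is granted 6 months after joining."
]

model = SentenceTransformer("intfloat/multilingual-e5-small")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

question = "有給はいつからもらえますか？"  # "When do I get paid leave?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vecs @ q_vec
best = int(np.argmax(scores))
print(chunks[best])  # retrieved context to prepend to the LLM prompt
```

The quality of this retrieval step depends directly on how well the underlying documents have been cleaned and segmented, which is why the data structuring discussed above matters so much.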
4. Human Science: Annotation and LLM RAG Data Structuring Services
A rich track record of creating 48 million pieces of training data
At Human Science, we take part in AI model development projects across a wide range of industries, including medical support, automotive, IT, manufacturing, and construction, with natural language processing as our starting point. Through direct transactions with many companies, including GAFAM, we have delivered over 48 million items of high-quality training data. Regardless of industry, we handle all kinds of annotation, data labeling, and data structuring, from small-scale projects to long-term, large-scale projects staffed by a team of 150 annotators.
Resource management without using crowdsourcing
At Human Science, we do not use crowdsourcing; projects are carried out by personnel under direct contract with our company. We form teams that can deliver maximum performance based on a solid understanding of each member's practical experience and their evaluations from previous projects.
Support not only for annotation but also for creating and structuring generative AI and LLM datasets
In addition to labeling and annotation for discriminative AI, we support the structuring of document data for building generative AI and LLM RAG systems. Manual production has been a core business and service since our founding, and we leverage the know-how gained from a deep understanding of diverse document structures to provide optimal solutions.
Equipped with a security room in-house
At Human Science, we operate a security room that meets ISMS standards within our Shinjuku office, so we can maintain security even on projects that handle highly confidential data. We regard confidentiality as critically important for every project. Even for remote projects, our information security management has earned high marks from clients: we implement hardware safeguards and provide ongoing security training to our personnel.