
In addition to overseas AI companies, domestic companies and research institutions in Japan are also focusing on the development and implementation of "Japanese-compatible LLMs (Large Language Models)." However, "strong in Japanese" means different things for different models, and the direction of their strengths varies greatly.
This article compares global LLMs with Japan-originated LLMs and organizes the key points to keep in mind when selecting a Japanese-language LLM.
Table of Contents
- 1. Why is Japanese a difficult language for LLMs?
- 1.1 Structural Mismatches with LLMs Designed Based on English Assumptions
- 1.2 The Difficulty of Tokenization and the Problem of Orthographic Variations
- 1.3 The Characteristic of Japanese Having Many "Different Expressions with the Same Meaning"
- 2. Comparison Between Global LLMs and Japan-Originated LLMs
- 2.1 Global Multilingual LLMs
- 2.2 LLMs Specialized and Optimized for Japanese
- 3. Characteristics of Japanese LLMs Revealed by Various Evaluation Results
- 4. How to Choose the "Optimal Japanese LLM" by Use Case
- 5. "Data Quality" Supporting the Accuracy of Japanese LLMs
- 5.1 Challenges Unique to Japanese Data: Orthographic Variations, Ambiguity, and Unstructured Data
- 5.2 High-quality Annotation and Data Structuring Support Accuracy
- 5.3 In RAG and Business-specific LLMs, Data Design Determines Outcomes
- 6. Summary: Human Science Annotation Support
- 6.1 Human Science Solutions
- 6.2 For Companies Facing These Challenges
1. Why is Japanese a difficult language for LLMs?
In LLM development, Japanese has language-specific difficulties as outlined below.
1.1 Structural Mismatches with LLMs Designed Based on English Assumptions
The difficulty in developing Japanese LLMs is not simply that "Japanese is complex." A major underlying reason is that LLM research and development has centered on English, and assumptions that work well for English often do not carry over directly to Japanese.
For example, English is a language where meaning is largely determined by word order, but Japanese indicates relationships within a sentence through particles. The Japanese equivalents of "He saw her" and "Her, he saw" (for instance, "Kare ga kanojo wo mita" and "Kanojo wo kare ga mita") have almost the same meaning despite the difference in word order. On the other hand, if particles such as "ga," "wo," or "ni" are used incorrectly, who did what to whom can change significantly. In Japanese, the role of particles is more important than word order, so the model needs the ability to accurately handle these subtle function words.
Also, in Japanese, subjects and objects are often omitted. Even in short sentences like "Confirmed" or "Please," it is necessary to determine from the surrounding context who confirmed what and what is being requested. This is especially important in conversations and business documents. Since it is necessary to supplement and understand information that is not explicitly stated, simply processing sentences in isolation is insufficient.
Furthermore, the appropriate use of honorifics and writing styles is also a major challenge. In Japanese, the suitable way of expression changes depending on the relationship with the other person, the situation, and one's position within the organization. Although "miru" (to see), "goran ni naru" (honorific form of see), and "haiken suru" (humble form of see) have similar meanings, the contexts in which they should be used differ. Even if the content is correct, mistakes in the direction or level of politeness of honorifics can result in unnatural sentences or expressions that may be considered rude.
In this way, Japanese LLMs require the ability to simultaneously handle multiple elements such as particles, ellipsis, honorifics, and writing styles, not just word order. Simply applying English models directly to Japanese makes it difficult to consistently generate expressions that are natural in Japanese and appropriate to the context.
1.2 The Difficulty of Tokenization and the Problem of Orthographic Variations
One of the major challenges for Japanese LLMs is the issue of what unit to use when handling text. Generally, LLMs do not split text by words or morphemes themselves but by units called subwords. Subwords are chunks of characters smaller than or close to words, used by the model to process sentences efficiently.
In English, because there are spaces between words, it is relatively easy to use word boundaries when splitting text into subwords. In contrast, Japanese is basically written without spaces, so it is difficult to determine how much text should be treated as a single meaningful unit. For example, the sentence "機械学習を使う" ("kikai gakushuu wo tsukau," using machine learning) can be split into units such as "機械学習" (machine learning), or "機械" (machine) and "学習" (learning), followed by the object particle "を" and the verb "使う" (use), depending on the model and how its vocabulary was built.
Furthermore, Japanese contains a mixture of kanji, hiragana, katakana, alphanumeric characters, and symbols. It is not uncommon for several different spellings to be used for the same word, such as "引っ越し," "引越し," "引越," and "ひっこし" for "hikkoshi" (moving house). When such spelling variations occur, even words with the same meaning may be treated as different strings within the model.
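The effect of vocabulary construction on segmentation can be sketched in a few lines of Python. This is a minimal illustration, not how any particular LLM tokenizer works: it uses greedy longest-match segmentation over two hypothetical vocabularies (`VOCAB_A`, `VOCAB_B`), whereas real subword tokenizers such as BPE or unigram models learn their vocabularies from data. The Japanese sentence "機械学習を使う" is assumed here as the rendering of "using machine learning."

```python
def greedy_segment(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown text falls back to single characters."""
    max_len = max(map(len, vocab))
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                pieces.append(piece)
                i += n
                break
    return pieces


SENTENCE = "機械学習を使う"  # assumed rendering of "using machine learning"

# Two hypothetical vocabularies (illustrative only).
VOCAB_A = {"機械学習", "を", "使う"}
VOCAB_B = {"機械", "学習", "を", "使", "う"}

# Whitespace gives English word boundaries for free, but not Japanese.
print("using machine learning".split())   # ['using', 'machine', 'learning']
print(SENTENCE.split())                   # ['機械学習を使う']

# The same sentence is cut into different units depending on the vocabulary.
print(greedy_segment(SENTENCE, VOCAB_A))  # ['機械学習', 'を', '使う']
print(greedy_segment(SENTENCE, VOCAB_B))  # ['機械', '学習', 'を', '使', 'う']
```

The second vocabulary produces five pieces for what the first treats as three, which is the kind of difference that changes how much context a model can fit and how well it handles technical terms.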
This issue also affects the quality of search and generation. If expressions with the same meaning are not understood as the same, related information cannot be effectively linked. Additionally, if rare notations or technical terms are split too finely, it can become difficult to understand the context and generate natural text. In Japanese LLMs, it is important not only to simply split characters but also to grasp meaningful units by considering variations in notation and context.
1.3 The Characteristic of Japanese Having Many "Different Expressions with the Same Meaning"
Japanese has a wide variety of expressions that convey the same meaning. For example, phrases that translate literally as "This time is difficult," "I will have to pass," and "I will consider it positively" can all work as a polite refusal, or can differ in meaning, depending on the context. Since Japanese often avoids direct negation and conveys intentions indirectly, judging solely by the surface words can lead to misunderstandings. LLMs need the ability to read not only the literal meaning but also what is intended in the given situation.
Also, in Japanese communication, implicit assumptions and relationships can be important. Expressions such as "Daijoubu desu" (It's okay), "Kekkou desu" (No, thank you), and "Kangaete okimasu" (I will consider it) can be interpreted as either affirmative or negative. Since the way these are received varies depending on the context—such as meetings, customer service, internal communications, or emails—the model needs to understand the background of the utterance as well.
There are also challenges in terms of data. Compared to English, the amount of high-quality publicly available Japanese text is limited. Not only everyday conversations and news articles, but data from specialized fields necessary for practical use—such as law, medicine, administration, finance, manufacturing, and customer support—tend to be particularly difficult to collect. It is not only the small volume of data but also the bias in fields and writing styles that can easily lead to biases in the model's output.
Furthermore, evaluating Japanese LLMs is not easy. Compared to English, where there is a wealth of existing evaluation frameworks, designing metrics to measure practical quality in Japanese itself presents challenges. Not only accuracy, but also naturalness, politeness, consistency of style, contextual appropriateness, and avoiding a translated tone are important.
For example, even if the content of the response is correct, if the honorific language is unnatural, the tone is too formal for a business document, or conversely too casual, it becomes difficult to use in actual applications. Therefore, improving the quality of Japanese LLMs requires careful evaluation not only of benchmark scores but also of the naturalness when read by humans and whether the expressions are suitable for business use.
2. Comparison Between Global LLMs and Japan-Originated LLMs
2.1 Global Multilingual LLMs
The following are representative examples of global LLMs.
・OpenAI GPT
GPT is the model family behind OpenAI's ChatGPT service. GPT-5.5 was released in April 2026. Its agent capabilities were greatly enhanced so that it can quickly grasp user intent and autonomously complete tasks that span multiple tools, such as writing and debugging code, web research, data analysis, creating documents and spreadsheets, and operating software. Latency is kept at the same level as GPT-5.4, while intelligence and token efficiency have improved. However, because it is a closed commercial model, care must be taken regarding data handling and operating costs.
・Anthropic Claude
Anthropic officially released its latest flagship model, Claude Opus 4.7, in April 2026. It is better at carrying complex software development and long, multi-stage tasks through to completion while verifying its own work, which makes it easier to delegate tasks to it as an agent. Its high-resolution image understanding has also advanced, improving accuracy in practical visual tasks such as reading screenshots and charts. Additionally, an effort setting that adjusts the depth of reasoning lets users balance latency and quality according to their needs. Note that Mythos, the top-tier Claude model, is offered with limited availability for cybersecurity reasons, so the Opus series is effectively the highest-performance model available to the general public.
・Google Gemini
Google announced Gemini 3.1 Pro in February 2026. It can handle different formats of information such as long texts, PDFs, and images all at once, covering a wide range from summarization, organization, and comparison to explanations of complex problems, practical reasoning, and coding support. The delivery channels are broadly divided into the Gemini app and NotebookLM (for general users) and Gemini API and Vertex AI (for developers and enterprises), with some features in the latter still in Preview. Since the available models and limits vary depending on the plan and delivery channel, it is recommended to check the latest information for the relevant channel before implementation.
2.2 LLMs Specialized and Optimized for Japanese
Many Japan-originated LLMs are designed with Japanese as the primary language of their training corpora and evaluation benchmarks, and they emphasize minimizing the awkwardness that often arises when English-centric models are adapted to Japanese. Representative examples of Japan-originated LLMs are as follows.
・ELYZA LLM
ELYZA continues to develop and socially implement Japanese-specialized models, and in March 2026, Llama-3.1-ELYZA-JP-70B was selected as a domestic LLM for trial use on the Digital Agency's government AI platform "Gennai." It is characterized by a strong focus on practical use, including the ability to handle the unique styles and fixed expressions of administrative documents, suppression of hallucinations, and operational design that takes confidentiality into account. The government plans to start large-scale demonstrations within government agencies from May 2026, conduct trials on Gennai in August 2026, and proceed with evaluations for paid procurement from April 2027 onward. Because the base model is Meta's Llama 3.1, its license conditions (such as restrictions based on the number of monthly active users) must be taken into account, and implementation design must also consider the GPU costs of hosting the model in-house. The selection emphasizes conformity to administrative document formats, proprietary benchmarks, and security measures for handling confidential information, and companies deploying the model in-house need to organize their requirements along similar lines.
・Rakuten Rakuten AI
Rakuten developed this ultra-large-scale model with support from the GENIAC program of the Ministry of Economy, Trade and Industry and NEDO. It was announced in December 2025, and Rakuten AI 3.0 was released as an open-weight model in March 2026. Based on DeepSeek V3, it employs a Mixture of Experts (MoE) architecture with approximately 700 billion parameters and achieves high performance while activating only about 40 billion parameters during inference. The license is Apache 2.0, which permits commercial use, including internal use and integration into products, subject to the license's notice and attribution conditions. However, the model itself is ultra-large-scale, and on-premises operation requires significant GPU resources and inference costs. It is recommended to plan implementation strategies such as quantization and distributed inference in advance, and to decide whether self-hosting or API usage is more practical.
・Fujitsu Takane
Developed jointly by Fujitsu and Cohere, this enterprise-oriented LLM was launched in September 2024. It excels in Japanese syntactic and semantic understanding and is designed for operation in secure private environments such as on-premises installations or domestic data centers. In February 2026, it was reported that in a pilot project for processing public comments at central government ministries, Takane automatically classified comments as approving or disapproving and summarized them, processing approximately 120,000 characters in about 10 minutes. Rather than serving as a general-purpose chat AI, it is used primarily for integration into business operations under strict requirements such as governance, auditing, and domestic data handling. Implementation requires a design that accounts for security requirements and operational frameworks, including audit logs, access control, and data handling.
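As a rough aid for the sizing point raised for Rakuten AI 3.0 above, the following sketch estimates the memory needed just to hold the weights of a model with approximately 700 billion parameters. It is back-of-the-envelope arithmetic under stated assumptions, not a deployment guide: the 80 GB per-GPU figure is a hypothetical value, the bytes-per-parameter table assumes plain FP16, INT8, and 4-bit storage, and KV cache, activations, and framework overhead are ignored. Note that in an MoE model all experts normally have to be resident in memory, so the roughly 40 billion active parameters reduce compute per token rather than the memory footprint.

```python
import math

# Storage cost per parameter for common precisions (in bytes).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

TOTAL_PARAMS = 700e9   # approximate total parameters (from the article)
ACTIVE_PARAMS = 40e9   # approximate parameters activated per token (from the article)


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(num_params: float, precision: str, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count; gpu_memory_gb=80 is a hypothetical device size."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


for precision in BYTES_PER_PARAM:
    print(
        precision,
        f"{weight_memory_gb(TOTAL_PARAMS, precision):.0f} GB",
        f">= {min_gpus(TOTAL_PARAMS, precision)} GPUs",
    )
```

Under these assumptions, FP16 weights alone come to about 1,400 GB, and even 4-bit quantization leaves about 350 GB, which is why quantization and distributed inference need to be planned together rather than treated as alternatives.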
3. Characteristics of Japanese LLMs Revealed by Various Evaluation Results
In evaluating Japanese LLMs, Japanese-specific benchmarks such as JGLUE and JMMLU are emphasized. They measure how well a model understands Japanese itself: JGLUE evaluates aspects such as grammar and reading comprehension, and JMMLU evaluates knowledge, including knowledge related to Japanese culture.
As an overall trend, Japanese-specialized models tend to excel in tasks involving expressions unique to Japanese, such as honorifics, idioms, and cultural backgrounds. On the other hand, global models like GPT and Gemini, which are trained on vast multilingual corpora, demonstrate superior reasoning ability and versatility, achieving high standards across a wide range of applications including translation, long-form summarization, and code generation.
For example, Fujitsu Takane has recorded world-class scores on JGLUE and excels in analyzing language structure. In contrast, global models such as GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro tend to have an advantage in advanced mathematics and science evaluations, as well as practical reasoning and coding scenarios.
No single model can be called "the best" across all benchmarks. It is necessary to choose a model according to the content of the task and the quality required.
4. How to Choose the "Optimal Japanese LLM" by Use Case
Given the characteristics of each LLM described above, it is important not to rank Japanese LLMs on a single scale of performance, but to select the model appropriate to each use case.
Below is a summary of representative options by intended use.
| Intended Use | Suitable LLM (Examples) | Reason |
|---|---|---|
| Research and Verification | OpenAI GPT | Offers some of the most advanced reasoning, coding, and agent functions, and is often used as a research baseline. Its API, tool integration, and evaluation ecosystem are mature, which makes verification efficient. |
| Business Chat & Document Summarization | Anthropic Claude | Excels at long-form reading comprehension and summarization, and generates natural, consistent Japanese text. Well suited to business chat applications such as summarizing meeting minutes, internal FAQs, and document organization. |
| Enterprise Use & High-Security Environments | Fujitsu Takane | Strongly oriented toward domestic closed environments and operations at government agencies and financial institutions, with strengths in Japanese business documents and syntactic analysis. Well positioned to meet the procurement and governance requirements of Japanese companies. |
| Cost-focused PoC & Trial Implementation | Rakuten AI | The open license makes it easy to adopt, and implementation and customization in closed environments can proceed at relatively low licensing cost. Well suited to PoCs and small-scale trial implementations, provided the hosting resources are planned for. |
5. "Data Quality" Supporting the Accuracy of Japanese LLMs
When LLMs are discussed, attention tends to go first to parameter counts and new architectures. However, when companies actually try to use Japanese LLMs in their own operations, they often face issues such as "the Japanese output is not as natural as expected" and "terminology in specialized fields is unstable."
The underlying cause is often that attention is placed solely on model selection, while the question of "how to ensure the quality of training data" is overlooked. In practice, the variable with the greatest effect on an LLM's output quality is often not the architecture but the accuracy and consistency of the data provided.
For example, even when only a very small portion of a model's parameters is adjusted, thorough quality control of the training data can significantly improve output accuracy. This suggests that "creating good data" can be a more cost-effective investment than "choosing a good model."
5.1 Challenges Unique to Japanese Data: Orthographic Variations, Ambiguity, and Unstructured Data
Why is Japanese data difficult for LLMs to handle? The reason is that Japanese data itself poses several structural challenges that are far less prominent in languages such as English.
・Abundance of orthographic variations
In Japanese, words with the same meaning exist in multiple written forms. Examples include "AI," "artificial intelligence," and the katakana spelling "ei-ai"; "customer" and its honorific form; and "start menu" written with or without a middle dot between the words. Because of combinations of kanji, hiragana, katakana, and full-width/half-width characters, even identical concepts are processed as different tokens.
・Context-dependent ambiguity
Japanese often omits subjects, has relatively free word order, and relies heavily on particles and context for interpreting meaning. Furthermore, the system of expressions such as honorifics, humble language, and polite speech includes the relationships between speakers, requiring very advanced understanding from the model. Since appropriate phrasing differs completely between business documents and chats, if the training data lacks consistency, output quality becomes unstable.
・Existence of unstructured data
Much of the data held by companies includes diagrams, images, and annotations. There is a large volume of business documents such as manuals, specifications, and design documents where text alone cannot accurately convey the context. Therefore, processing flows based solely on text are prone to information loss and breaks in context, which leads to a decrease in the output accuracy of LLMs.
5.2 High-quality Annotation and Data Structuring Support Accuracy
So, how should we address these challenges unique to the Japanese language? The key lies in "careful annotation incorporating human judgment" and "data structuring."
If ambiguous labels or incorrect tags are mixed into the training data, the model learns the wrong standard as "correct," which increases output variability. Because interpreting context and nuance in Japanese is delicate, many cases remain that automatic labeling alone cannot handle. High-quality annotation depends on the following three elements.
・Clear guidelines and rule design: If work proceeds with vague definitions and judgment criteria, interpretations will vary among individuals, resulting in a loss of overall data consistency. Especially in languages like Japanese, which have a wide range of expressions, verbalizing how to handle ambiguous cases in advance will ultimately affect the model's accuracy.
・Multi-stage checking system: By incorporating processes such as mutual reviews and re-verifications, it is possible to suppress label variations caused by individual differences and biases. Organizing cases with differing judgments through a consensus-building process also leads to improvements in the accuracy of the guidelines themselves.
・Text normalization and preprocessing: Cleansing before training is also important, including unifying full-width and half-width characters, Unicode normalization, and absorbing variations in okurigana. Combining the probabilistic judgment of LLMs with rule-based pipelines that process data deterministically allows notation to be unified efficiently and accurately.
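The rule-based part of the normalization step above can be sketched as follows. This is a minimal example under stated assumptions: `unicodedata.normalize` with NFKC is from Python's standard library, while the `VARIANTS` dictionary and the `normalize_text` function are hypothetical and would in practice be built from the annotation guidelines for a given domain.

```python
import re
import unicodedata

# Hypothetical variant dictionary: surface form -> canonical form.
# The first three entries are spelling variants of "hikkoshi" (moving house).
VARIANTS = {
    "引越し": "引っ越し",
    "引越": "引っ越し",
    "ひっこし": "引っ越し",
    "スタート・メニュー": "スタートメニュー",
}

# Longest variants first, so "引越し" is matched before its prefix "引越".
_VARIANT_PATTERN = re.compile(
    "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
)


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization, then map known variants to one spelling."""
    # NFKC folds full-width alphanumerics to ASCII and
    # half-width katakana to standard full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    return _VARIANT_PATTERN.sub(lambda m: VARIANTS[m.group(0)], text)


print(normalize_text("ＡＩ"))                # AI
print(normalize_text("ｽﾀｰﾄ･ﾒﾆｭｰ"))          # スタートメニュー
print(normalize_text("引越しの見積もり"))     # 引っ越しの見積もり
```

A dictionary like this handles only the variants someone has listed. Cases that depend on context, such as a variant spelling inside a company name that should not be rewritten, are where human review or model-based judgment would take over.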
5.3 In RAG and Business-specific LLMs, Data Design Determines Outcomes
In enterprises where the adoption of RAG (Retrieval-Augmented Generation) and business-specific LLMs is accelerating, the quality of data design directly translates into performance differences. RAG is a mechanism that searches for relevant information from internal company documents and passes it as context to the LLM to generate responses. At first glance, it may seem that the model's capability is being tested, but in reality, no matter how excellent the model is, accurate answers cannot be reached unless the referenced document data is well organized.
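The retrieve-then-generate flow described above can be illustrated with a toy example. This is a sketch under simplifying assumptions, not a production design: retrieval here uses character-bigram overlap (a simple choice for text without spaces) instead of an embedding model or search engine, the documents in `DOCS` are invented, and the prompt is only assembled, not sent to any LLM.

```python
def bigrams(text: str) -> set[str]:
    """Character bigrams; usable for Japanese because no word boundaries are needed."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def score(query: str, doc: str) -> float:
    """Jaccard similarity between the bigram sets of query and document."""
    q, d = bigrams(query), bigrams(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    return sorted(docs, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Assemble the context-plus-question prompt that would be passed to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Invented internal documents (illustrative only).
DOCS = [
    "引っ越しの手続きは総務部に申請してください。",
    "経費精算は毎月25日までに提出してください。",
]

print(build_prompt("引っ越しの申請先は？", DOCS))

# The same question with a variant spelling matches the document less strongly.
print(score("引っ越しの申請先は？", DOCS[0]) > score("引越しの申請先は？", DOCS[0]))  # True
```

Even in this toy setup, a spelling variant in the query lowers the match score against the right document, which shows why notation and document structure affect answers before the model is involved at all.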
Whether it is RAG or a business-specific LLM, the difference in outcomes is determined not by "which model to choose," but by data design: "what kind of data, at what granularity and structure, is prepared and provided." If you aim to improve the accuracy of a Japanese LLM, investment in data quality deserves at least as much weight as model selection, and in some cases more.
6. Summary: Human Science Annotation Support
6.1 Human Science Solutions
Human Science has a track record of creating over 48 million pieces of labeled data and has supported AI development projects in fields including natural language processing, healthcare, IT, manufacturing, and automotive. Instead of relying on crowdsourcing, we maintain a team of directly contracted specialists, which lets us achieve both quality and security.
In addition to annotation and curation, we also support the structuring of document data and data preparation for building generative AI, LLM, and RAG systems. We have a dedicated security room that meets ISMS standards, ensuring the safe handling of highly confidential data.
6.2 For Companies Facing These Challenges
●You are struggling to improve the accuracy of your Japanese LLM
●Your annotation quality is inconsistent
●You want to focus internal resources on model development
●You are looking for a contractor who can safely handle confidential data
We also offer consultations on annotation and data preparation for Japanese LLMs from the consideration stage.
Please feel free to contact us.










