化けがな: Japanese character translation from unpaired data

Alex Leung
Jun 1, 2023


The Japanese writing system is a mixed system that includes two syllabic scripts: hiragana and katakana. While their grammatical roles are not analogous to the English upper and lower case alphabet, for the purposes of this project we can treat them as such: two different sets of symbols with a 1:1 mapping, representing the same set of sounds that make up words.
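The project itself works on handwritten images, but the 1:1 mapping between the two scripts is easy to see in text form: in Unicode, the standard hiragana and katakana blocks are laid out in parallel, offset by a constant. A minimal Python illustration (text conversion only, unrelated to the model):

```python
# The 1:1 correspondence between the scripts is visible even in Unicode: the
# standard hiragana block (U+3041..U+3096) and katakana block (U+30A1..U+30F6)
# list the same sounds in the same order, offset by 0x60 code points.
KANA_OFFSET = 0x60

def hiragana_to_katakana(text: str) -> str:
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

def katakana_to_hiragana(text: str) -> str:
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
print(katakana_to_hiragana("カタカナ"))  # かたかな
```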

Handwriting recognition, and character recognition in general, is not a new problem in machine learning; the MNIST dataset is one of the most widely used datasets for beginner computer vision projects.

The main inspiration for this project comes from a 2021 video from Tom7, where he created several exotic fonts using an ML model which converts letters from upper to lower case, and vice versa. In this project, we embark on a similar endeavor to translate between hiragana and katakana symbols.

This project uses datasets ETL-4 and ETL-5, which contain about 6,000 hiragana and 10,000 katakana samples respectively (approximately 150 samples per character). Unlike the conversion between pre-defined upper and lower case typefaces, this results in an unpaired dataset, a challenge that we overcome using a dual learning framework. Due to limited access to compute resources, I opted to build upon an architecture called TextCaps, based on CapsNet, both of which are lightweight and proven on character recognition tasks.

Model Training

Classifier — Encoder

Generally, the outputs of classifier networks are (n x 1) tensors, where n is the number of possible classes and the ith element represents the predicted probability that the input belongs to the ith class. Our capsule-based classifier instead outputs (n x d) tensors; each d-dimensional capsule serves as a representation of the input image, something a single float cannot do. We perform classification by calculating the norm of each of the n capsules; the predicted class is the one with the largest norm.
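As a concrete illustration, here is a minimal sketch of that readout, assuming a PyTorch-style (n x d) capsule output (the project's actual framework may differ):

```python
import torch

def capsule_predict(capsules: torch.Tensor) -> torch.Tensor:
    """Classify from a capsule layer output.

    capsules: (batch, n_classes, d) tensor, one d-dimensional capsule per class.
    The predicted class is the capsule with the largest L2 norm.
    """
    lengths = capsules.norm(dim=-1)     # (batch, n_classes)
    return lengths.argmax(dim=-1)       # (batch,)

# Example: 46 kana classes, 16-dimensional capsules, batch of 8.
caps = torch.randn(8, 46, 16)
print(capsule_predict(caps).shape)      # torch.Size([8])
```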

Each classifier is trained independently using an image as input and its labeled class as the target. The training is regularized by decoder networks that attempt to recreate the original image from the capsules, forcing the d-dimensional capsules to meaningfully encode it. These decoders are discarded at the end of classifier training.

Our classifiers follow the TextCaps architecture, which is made up of several convolution layers followed by a dynamic routing capsule layer.
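A minimal PyTorch sketch of this classifier training step follows. The image size (64×64), capsule dimension (16), layer sizes, and loss weighting are assumptions for illustration; the primary-capsule and dynamic-routing stack is collapsed into a single layer for brevity, and the classification loss here is cross entropy over capsule lengths rather than the margin loss of the original CapsNet. The real TextCaps and repo code will differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CLASSES, CAP_DIM, IMG = 46, 16, 64   # assumed sizes, not taken from the repo

class CapsuleClassifier(nn.Module):
    """Conv feature extractor followed by a class-capsule layer.

    The primary-capsule and dynamic-routing machinery is collapsed into a
    single linear layer here; it simply produces one d-dimensional capsule
    per class, which is all the rest of the pipeline needs.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.capsules = nn.Linear(128 * (IMG // 8) ** 2, N_CLASSES * CAP_DIM)

    def forward(self, x):                            # x: (batch, 1, IMG, IMG)
        h = self.features(x).flatten(1)
        return self.capsules(h).view(-1, N_CLASSES, CAP_DIM)

class Decoder(nn.Module):
    """Reconstructs the input image from the capsule of the true class.

    Used only to regularize classifier training; discarded afterwards.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CAP_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG * IMG), nn.Sigmoid(),
        )

    def forward(self, caps, labels):
        true_caps = caps[torch.arange(caps.size(0)), labels]   # (batch, CAP_DIM)
        return self.net(true_caps).view(-1, 1, IMG, IMG)

def classifier_step(model, decoder, images, labels, recon_weight=0.0005):
    """Classification loss on capsule lengths plus reconstruction regularizer."""
    caps = model(images)                             # (batch, n_classes, d)
    cls_loss = F.cross_entropy(caps.norm(dim=-1), labels)
    recon_loss = F.mse_loss(decoder(caps, labels), images)   # images assumed in [0, 1]
    return cls_loss + recon_weight * recon_loss
```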

Training step for the classifier network.

Generator

The key to overcoming the unpaired data problem is recognizing that we have two parallel, symmetrical datasets. Dual learning leverages two models, each well trained in its own domain, to produce two models that can translate between the domains in a closed loop.

We hold fixed our previously trained classifiers (discriminators): one for hiragana (H) symbols and one for katakana (K) symbols.

Now we need two generators: one from hiragana to katakana (HK) and another from katakana to hiragana (KH). These take the encoded capsule representation and use it to generate the appropriate character image.

Let’s say we have a hiragana input image h of a given character. The generated katakana image k’ would be expected to look like the katakana form of that same character (for example, an input あ should produce something resembling ア). One training step would go as follows:

  • Step 1: Given hiragana input image h, use HK to generate katakana image k’.
  • Step 2: Use classifier K on k’; calculate a cross entropy loss based on whether k’ was classified as the correct class. This loss enforces that HK can generate characters that look like the correct katakana.
  • Step 3: Use k’ as input to KH to generate h’. Calculate a reconstruction loss between h and h’; this loss enforces that KH and HK retain the structure of the original image.
  • Steps 4–6: Repeat symmetrically using an input image k from the katakana dataset (a code sketch of one such step follows below).
One dual training step starting from one hiragana image.
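A sketch of that loop in the same hedged PyTorch style, reusing the capsule classifier from the earlier sketch. HK and KH are the trainable generators; the classifiers are held fixed simply by leaving their parameters out of the optimizer.

```python
import torch
import torch.nn.functional as F

def dual_step(h_images, h_labels, HK, KH, clf_H, clf_K, recon_weight=1.0):
    """One dual-learning step starting from hiragana images.

    clf_H / clf_K are the frozen capsule classifiers; HK / KH are the
    trainable generators. The symmetric half of the step, starting from
    katakana images, is identical with all roles swapped. recon_weight is
    an assumed weighting, not taken from the repo.
    """
    # Step 1: encode h with the hiragana classifier and generate katakana k'.
    with torch.no_grad():
        h_caps = clf_H(h_images)        # (batch, n_classes, d)
    k_fake = HK(h_caps)                 # generated katakana image k'

    # Step 2: the katakana classifier should recognize k' as the right class.
    k_caps = clf_K(k_fake)              # gradients flow through the frozen classifier
    cls_loss = F.cross_entropy(k_caps.norm(dim=-1), h_labels)

    # Step 3: translate k' back to hiragana and compare with the original.
    h_back = KH(k_caps)
    recon_loss = F.mse_loss(h_back, h_images)

    return cls_loss + recon_weight * recon_loss
```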

Our generators are composed of a simple fully connected layer that takes the capsule representations as input, followed by several convolutional layers.
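A corresponding generator sketch, again with assumed sizes, and using transposed convolutions to grow the feature map back to image resolution (the actual repo layers may differ):

```python
import torch.nn as nn

N_CLASSES, CAP_DIM, IMG = 46, 16, 64    # same assumed sizes as the earlier sketches

class Generator(nn.Module):
    """Maps a capsule representation to a character image in the other script."""
    def __init__(self):
        super().__init__()
        # Fully connected layer takes the flattened (n_classes x d) capsules
        # and produces a small spatial feature map...
        self.fc = nn.Linear(N_CLASSES * CAP_DIM, 128 * 8 * 8)
        # ...which transposed convolutions grow back to image resolution.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, capsules):                     # capsules: (batch, n_classes, d)
        h = self.fc(capsules.flatten(1)).view(-1, 128, 8, 8)
        return self.deconv(h)                        # (batch, 1, IMG, IMG)
```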

Results

Classifier Accuracy

Below are the results of the classifier training, evaluated on a 10% holdout set across 46 classes:

  • Hiragana Classifier: 80% accuracy
  • Katakana Classifier: 99% accuracy

The discrepancy in accuracy can be partially attributed to the ETL-4 dataset being a little older: its images contain more noise, and the frame containing the actual character is smaller. While more effort could be put into individually tuning the two classifiers and applying different preprocessing techniques, we opt to move on to the qualitative results of the image generation.

Generative Accuracy

The table below shows a sample of inputs, generated outputs, and the result of feeding each generated output back into the complementary generator to recreate the original image. The images are compared against typeface references.

Writing Style

Our model is effectively an encoder-decoder network, which means it should theoretically be able to learn different writing styles for the same character and encode them into the capsule representation. In the image below, a set of three different input images of the same character results in three different generated outputs.

化けがな

Now that we’ve learned about hiragana and katakana and seen a few examples of both, it’s time to discuss the title of this project, 化けがな (bakegana). The name comes from two Japanese words:

  • 化け, which means “transformation”, characterizing the transformation of the characters which our model is able to perform, and
  • お化け, which means “ghost”, from the white ghost-like outputs which seem to fade into the darkness of the black background.

The code for this project can be found at https://github.com/alexcleung/bakegana.
