Poster

Final Paper: https://tinyurl.com/csci1470-oracle

GitHub Link: https://github.com/dxhw/CSCI1470-Final-Oracle.git

Title:

Summarizes the main idea of your project. We will use TensorFlow to classify oracle bone script, the oldest form of written Chinese characters, into modern Chinese using the HUST-OBS dataset introduced by Wang et al. (2024). HUST-OBS uses 5 sources of oracle bone script and was validated via a ResNet-50 model. Wang et al. (2024) aggregate the five sources of oracle bone characters together. We believe the characters from different sources may have substantial variation even if they should be labeled the same. We propose looking into a model’s ability to generalize across these sources by holding back some of them entirely from training and measuring its test accuracy on them. This exploration has interesting implications for deciphering source-specific unknown characters.

Who:

Names and logins of all your group members.
Michelle Ding - mding16
Doren Hsiao-Wecksler - dhsiaowe
Vivian Li - vli18
Louise Weng - lweng1

Introduction:

What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper. If you are doing something new, detail how you arrived at this topic and what motivated you. What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? etc.

Oracle Bone Script (OBS) (甲骨文) is the oldest form of written Chinese, dating back to the Shang Dynasty 3000 years ago. Our project attempts to classify images of OBS into modern Chinese characters by using a dataset created by Wang et al. (2024): the HUST-OBS dataset. The HUST-OBS dataset was created by aggregating character images from 5 different sources: two books (sources X and L), two websites (sources G and Y), and a database created for the study of OBS (source H). The same character can look very different from one source to another, which raises the question of whether image classification generalizes across sources.

The existing work by Wang et al. trained and tested a ResNet-50 model on the aggregated sources using a straightforward 8:2 train/test split. This project instead proposes a novel approach: altering the train/test split to examine the model's ability to generalize across different sources of characters, addressing the potential variation in character representations across these sources.

By training the model on certain sources and testing it on others, the research aims to uncover how well the model can generalize its learned features across different styles of the same characters, which is crucial for the interpretation of undeciphered characters or those with varying representations in different sources. One can imagine, for instance, a character that is deciphered in some handwritten styles, but not in others. Our project aims to show the likelihood that a ResNet-50 model trained on this data would be able to decipher this new representation form.

Related Work:

Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.

Blog source: "CNN reveals the secrets of 3000-year-old oracle bones in China", Published By : INDIAai, Mar 17, 2020 https://indiaai.gov.in/news/cnn-reveals-the-secrets-of-3000-year-old-oracle-bones-in-china This blog details the efforts of a team from China’s Southwest University and Capital Normal University of Beijing that "applied a multi-regional convolutional neural network (CNN) to study oracle bones." These researchers used over 1400 tortoise shells and 300 ox bone rubbings as a dataset. Their deep learning techniques include two Conv-Pooling-ReLU layers and two fully connected layers + a multi-feature fusion subnet made up of four Auto-Encoding layers. The goal is to "identify, classify and hopefully assist in repairing ancient Chinese books and documents." Their model currently produces results that align with oracle bone experts.

Data:

What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?

HUST-OBS Introduction: We will use the HUST-OBS dataset, which is a significant compilation of Oracle Bone Script (OBS) images. OBS is among the earliest forms of written Chinese, dating back 3,000 years, and it offers profound insights into the historical and cultural contexts of the Shang Dynasty. The dataset comprises 140,053 images in total, including 77,064 images of 1,588 deciphered scripts and 62,989 images of 9,411 undeciphered characters. This diverse collection, sourced from various origins, has been meticulously reviewed and corrected by experts in oracle bone studies, ensuring the reliability and authenticity of the data.

Preprocessing: For our project, we will only be using the 77,064 images of 1,588 deciphered scripts. There are 5 sources total (referencing from Wang et al. 2024 table 3):

SOURCE	TYPE	Letter	Number of Deciphered Images
New Compilation of Oracle Bone Scripts	Book	X	17609
Oracle Bone Script: Six Digit Numerical Code	Book	L	9609
YinQiWenYuan	Website	Y	1697
GuoXueDaShi	Website	G	16259
HWOBC	Database	H	31890

The original paper aggregated all 5 sources together and split them into a train and test set with an 8:2 ratio. For our project, we propose an alternative dataset processing approach that explores whether a model trained on some sources can accurately classify images from others. We chose which sources to group together for train and test primarily to get reasonable train/test group sizes. Specifically, we aimed for approximately an 8:2 ratio of train to test data, but group 3 is slightly less even because we wanted to test more source combinations. The table below explains our 4 groups of datasets. Groups 1-3 are our own splits and Control is the train/test split that Wang et al. used to validate the dataset.

Group	Description	Size of Train Dataset	Size of Test Dataset	# Removed from Test Dataset	size_test/total_size	Num classes*
1	Train: ['H', 'X', 'L', 'Y'] Test: ['G']	60805	16254	4	0.211	1777
2	Train: ['H', 'G', 'L', 'Y'] Test: ['X']	59455	16791	125	0.22	1656
3	Train: ['H', 'X', 'G'] Test: ['L', 'Y']	65757	10309	153	0.136	1628
Control	Wang et al. train/test split	61662	15402	/	0.2	1588
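The source-based grouping in the table above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the record layout (source letter, label, image path) and the sample entries are hypothetical, and the real HUST-OBS JSON is organized differently. It shows the two ideas behind the table: whole sources are held out for testing, and test pairs whose label never appears in training are removed and counted.

```python
# Hypothetical records: (source_letter, label, image_path).
# The real HUST-OBS JSON layout and file naming may differ.
records = [
    ("H", "人", "H_0001.png"),
    ("X", "人", "X_0001.png"),
    ("G", "人", "G_0001.png"),
    ("G", "山", "G_0002.png"),  # label that no training source contains
    ("L", "水", "L_0001.png"),
    ("G", "水", "G_0003.png"),
]


def split_by_source(records, train_sources, test_sources):
    """Hold out whole sources for testing and drop unseen test labels."""
    train = [r for r in records if r[0] in train_sources]
    train_labels = {label for _, label, _ in train}
    candidates = [r for r in records if r[0] in test_sources]
    # A classifier cannot predict a class it never trained on,
    # so those image/label pairs are removed from the test set.
    test = [r for r in candidates if r[1] in train_labels]
    removed = len(candidates) - len(test)
    return train, test, removed


# Mirrors the shape of Group 1: train on H, X, L, Y and test on G.
train, test, removed = split_by_source(records, {"H", "X", "L", "Y"}, {"G"})
test_fraction = len(test) / (len(train) + len(test))
print(len(train), len(test), removed, round(test_fraction, 3))
```

The `removed` count corresponds to the "# Removed from Test Dataset" column, and `test_fraction` to the size_test/total_size column.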

Preprocessing steps:

We started with a JSON containing a list of all labels to image paths.
We processed the JSON to create 5 separate JSONs, one for each source.
Based on the number of image/label pairs for each source, we split up the sources into the 3 groups from the table above.
Because the sources do not all contain the same Chinese character labels, we needed to make sure that the test groups did not contain characters the model had not been trained on. We therefore removed a number of image/label pairs from each test set (see the "# Removed from Test Dataset" column in the table above).
Then we began preprocessing the images and labels. In order to ensure that our results are comparable to Wang et al., we had to replicate their preprocessing steps with the same parameters. This preprocessing includes:

Image	Labels
Convert from grayscale to RGB	Converting labels to tensors
Add salt and pepper noise	Encoding labels from 0 to num_classes - 1 to ensure proper one-hot encoding
Noise removal, segmentation, and feature extraction	One hot encoding on labels
Random gaussian blur
Padding
Color adjustments (adding randomness to brightness, contrast, saturation, hue)
Random rotation
Normalization
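A few of the steps listed above can be illustrated in plain Python. This is a hedged sketch only: the noise amount, the nested-list image representation, and the function names are assumptions made for illustration, and Wang et al.'s actual parameters are not reproduced here. It covers the label encoding to 0..num_classes - 1, one-hot encoding, and salt and pepper noise on a grayscale image.

```python
import random


def encode_labels(labels):
    """Map each distinct label to an index in 0..num_classes-1."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]


def salt_and_pepper(image, amount, rng):
    """Set roughly `amount` of the pixels to black (0) or white (255)."""
    noisy = [row[:] for row in image]  # copy so the input is left untouched
    for r, row in enumerate(noisy):
        for c in range(len(row)):
            if rng.random() < amount:
                noisy[r][c] = rng.choice((0, 255))
    return noisy


labels = ["山", "人", "山", "水"]
indices, classes = encode_labels(labels)
targets = one_hot(indices, len(classes))

rng = random.Random(0)  # fixed seed so the illustration is repeatable
image = [[128] * 4 for _ in range(4)]  # hypothetical 4x4 grayscale image
noisy = salt_and_pepper(image, amount=0.1, rng=rng)
```

Encoding labels as contiguous indices first is what makes the one-hot rows line up with the model's output units.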

Methodology:

What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Our project uses a ResNet-50 built from scratch using TensorFlow. We aimed to replicate the ResNet-50 model first described in He et al.’s 2015 paper “Deep Residual Learning for Image Recognition” and also match the PyTorch-based model used for validation by Wang et al. This included some non-standard work in TensorFlow to mimic PyTorch, particularly concerning padding with Conv2D layers.
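One well-known source of this kind of mismatch is that TensorFlow's padding='same' pads asymmetrically for strided convolutions, while a PyTorch Conv2d pads both sides equally. The sketch below only does the padding arithmetic for a single axis; the 224-pixel input side is an assumed example, and we are assuming this is the sort of discrepancy involved rather than documenting our exact fix.

```python
import math


def tf_same_padding(size, kernel, stride):
    """(before, after) padding that padding='same' applies along one axis."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    before = total // 2
    return before, total - before


def torch_output_size(size, kernel, stride, padding):
    """Output length of one PyTorch Conv2d axis with symmetric `padding`."""
    return (size + 2 * padding - kernel) // stride + 1


# ResNet-50 stem: 7x7 convolution with stride 2 (PyTorch uses padding=3).
print(tf_same_padding(224, 7, 2))       # (2, 3): asymmetric
print(torch_output_size(224, 7, 2, 3))  # 112, with 3 pixels on each side
```

Both frameworks produce a 112-pixel output here, but the pixels are aligned differently, so the activations differ. A common workaround is an explicit zero-padding layer followed by a Conv2D with padding='valid'.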

We have chosen a ResNet-50 model not only to align with Wang et al. but also because it is an architecture very well-suited for image recognition tasks. The ResNet architecture is built primarily from a set of residual blocks. These blocks contain shortcut connections that bypass one or more layers, effectively addressing the vanishing gradient problem by allowing gradients to flow through the network more easily during training. Our model, MyResNet50, implements a series of residual blocks, each consisting of convolutional layers with Batch Normalization and ReLU activation functions, enhancing the model's ability to learn from the complex, high-dimensional data inherent to oracle bone inscriptions.
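The shortcut connection described above can be written compactly. Following the notation of He et al., a residual block computes its output by adding the input back onto the learned mapping; the gradient expression next to it is just the chain rule applied to that sum (shown for an identity shortcut and ignoring the ReLU applied after the addition):

```latex
y = \mathcal{F}(x, \{W_i\}) + x,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial \mathcal{F}}{\partial x} + I \right)
```

The identity term I gives the gradient a direct path to earlier layers even when the gradient through the learned mapping is small, which is why these blocks help with vanishing gradients.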

We initialize our weights using a Glorot (Xavier) Uniform Initializer. Our optimizer is Adam (a difference from Wang et al., who use a basic Stochastic Gradient Descent optimizer with a Cosine Annealing Schedule) and our loss function is Categorical Cross Entropy. Swapping from SGD to Adam lets us train faster: the model converges more quickly, without waiting for the learning rate to decay over the course of many epochs. Categorical Cross Entropy is the standard loss function for multi-class classification tasks like ours.
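To make the learning-rate point concrete, here is a small sketch of a single-cycle cosine annealing schedule of the kind paired with SGD. The epoch count and learning rates are hypothetical values chosen for illustration and are not taken from Wang et al.

```python
import math


def cosine_annealing(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate at `epoch` under a single-cycle cosine schedule."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Hypothetical settings: 100 epochs, starting at 0.1 and decaying to 0.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_annealing(epoch, 100, 0.1), 4))
```

The rate only becomes small late in the schedule, so a run that is cut short never reaches the low-learning-rate phase; this is the waiting that an adaptive optimizer such as Adam lets us avoid.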

Throughout the training epochs, the model is exposed to images of oracle bone script, and it learns to reduce classification error by updating weights via back-propagation. We maintain the model's robustness by evaluating its performance on a separate test set, ensuring the model's ability to generalize to unseen data. The accuracy metric guides us in monitoring the model's prediction capabilities, ensuring that we maintain a stringent standard for correct interpretation of the historical script.

Metrics:

What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance. What are your base, target, and stretch goals?

Our task is a simple classification task, so the primary metric for evaluating the performance of our model is accuracy: the proportion of correctly predicted characters out of the total number of predictions made. We plan on using the train/test split from Wang et al.’s paper to analyze the baseline performance of our model before moving on to our new data. For our new train/test splits, the accuracy values themselves matter less than what they tell us about a standard ResNet-50's ability to generalize across these source groups, which we can extrapolate to its ability to classify an undeciphered character in a style that may or may not be represented in previously deciphered work. This generalization ability may be high or low, so there is no fixed threshold for success in generalization. We do not necessarily aim for the model to be perfect at this task, just to learn how successful it is. However, following our hypothesis, “success” would mean that the test accuracy when holding back one or more of the five sources is lower than the accuracy of our model on the simple 8:2 train/test split, but goes up with training. (That is, it is clear that our test set has not leaked into our train set and we have sensible test-accuracy numbers coming out of our model, but there is some clear level of generalization ability being learned.)
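The accuracy metric described above is simple enough to spell out. The sketch below is illustrative only: the score rows and labels are made-up numbers, and the helper names are our own for this example.

```python
def top1(scores):
    """Index of the highest score in each row of per-class scores."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def accuracy(predicted, actual):
    """Proportion of predictions that match the true class indices."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical model outputs for four test images over three classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]
actual = [0, 1, 2, 2]
print(accuracy(top1(scores), actual))
```

Computing this same number on the control split and on each held-out-source split is what lets the two be compared directly.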

Base Goal: Our model approaches the baseline performance of the ResNet-50 model used in Wang et al.'s study when tested on the original dataset, and shows some ability to generalize from train to test on our novel source-based splits.

Target Goal: Our model replicates the original test accuracy (94.3%) when using the same train/test split as the original study. It also shows substantial ability to generalize with source-based splits (>50% test accuracy).

Stretch Goal: Our model improves upon Wang et al.’s model: it not only performs well on the validation split (>94.3% test accuracy) but also shows substantial ability to generalize, with near-comparable results on our source-based splits (>80% accuracy).

Ethics:

Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

What broader societal issues are relevant to your chosen problem space? The project's focus on deciphering Oracle Bone Script ties into broader societal issues concerning cultural heritage and historical understanding. By employing deep learning to interpret ancient scripts, the research contributes to preserving and disseminating knowledge about the Shang Dynasty, one of the earliest recorded periods in Chinese history. This endeavor not only aids historians and archaeologists in their research but also has implications for cultural preservation and education. It fosters a deeper connection with the past, offering insights into the language, rituals, and societal structures of ancient civilizations. Additionally, the project touches on the ethical use of AI in humanities, raising discussions about the integration of technology in traditional fields of study and the potential reshaping of historical research methodologies.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? The HUST-OBS dataset, while a valuable resource, raises several ethical considerations regarding data collection, labeling, and representativeness. Given that the dataset includes both deciphered and undeciphered characters, there's a significant responsibility to ensure the accuracy of labels, especially since they have been reviewed and corrected by experts. Any mislabeling or bias in interpretation could perpetuate inaccuracies in the understanding of OBS and, by extension, the Shang Dynasty's history.

Division of labor:

Briefly outline who will be responsible for which part(s) of the project.
Michelle Ding: Data Preprocessing & Analysis
Doren Hsiao-Wecksler: Image Preprocessing, Model Testing and Debugging, Model Training
Vivian Li: Poster Design
Louise Weng: Model Building and Model Research

Reflection (Check-in 3):

Introduction

We will use TensorFlow to classify oracle bone script, the oldest form of written Chinese characters, into modern Chinese using the HUST-OBS dataset introduced by Wang et al (2024). HUST-OBS uses 5 sources of oracle bone script and was validated via a ResNet-50 model. Wang et al (2024) aggregates the five sources of oracle bone characters together. We believe the characters from different sources may have substantial variation even if they should be labeled the same. We propose looking into a model’s ability to generalize across these sources by holding back some of them entirely from training and finding its testing accuracy on them. This exploration has interesting implications in the deciphering of source-specific unknown characters.

Challenges:

The hardest part of the project was the preprocessing. Because all the data was aggregated, we had to work with a really large JSON file with no nested structure. It was difficult to split the JSON into the sources we needed (detailed on the Devpost), but we managed to do so after looking into the documentation for Python’s json, csv, and os packages. There was also a challenge with file paths being in Chinese. When we processed the JSON, it encoded the paths into UTF-8, so it was a challenge decoding the paths and figuring out which ones were affected. Overall, the preprocessing work was meaningful because we were able to understand the shape of our data and how it related to the historical sources. We ended up making a Google Sheet that reflects this: https://docs.google.com/spreadsheets/d/1Tv5O81nddjk2me7LWUMDVhAj1yUDOyJHw5BtlFztKhE/edit?usp=sharing
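One behavior of Python's json module that can cause this kind of confusion with Chinese paths is that it escapes non-ASCII characters by default when writing. The sketch below assumes that is the relevant effect; the entry, its keys, and the path are hypothetical and do not reflect the real dataset layout.

```python
import json

# Hypothetical entry; the real HUST-OBS paths and keys may differ.
entry = {"label": "人", "path": "甲骨文/人/X_0001.png"}

escaped = json.dumps(entry)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(entry, ensure_ascii=False)  # characters are kept as-is

print(escaped)
print(readable)

# Both forms decode back to the same Python strings.
decoded = json.loads(escaped)
print(decoded["path"] == entry["path"])
```

When writing the readable form to disk, opening the file with encoding="utf-8" keeps the result independent of the platform's default encoding.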

Insights: Are there any concrete results you can show at this point?

Yes, we have the preprocessed data, with JSON entries attached to each character, and the model code to show. We have also moved all of the images into Google Drive so that we can use Google Colab, and have confirmed that all of the images were uploaded properly.

How is your model performing compared with expectations?

Our preprocessing showed us some good results for what the dataset looks like more concretely, which we had not been entirely sure about before. Our model will be mostly finished once we connect the dataset that's been preprocessed with the model code (they are currently in separate files).

Plan: Are you on track with your project?

Yes. We have not specifically assigned ourselves parts of the assignment, but generally, it has matched our division of labor section above.

What do you need to dedicate more time to? Finalizing our model and training it. We will need to train it multiple times (once for each of the splits we want to run), which will take a significant amount of time but should be doable with the time we have left. We should be able to finalize the model quickly and will then use the remaining time to train and analyze the results of our model.