Inspiration

We have long been interested in applying ML (machine learning) to different kinds of data. When TMLCC came along, it felt like setting out to roam the martial-arts world: it opened our eyes to what MOFs actually are and to where ML could be applied.

What we did

Honestly, it is just an ordinary data science process, nothing flashy. What matters is knowing, before anything goes into a model, what kind of data you have: is anything missing, and how are the features correlated? When choosing a model, we ran whatever came to mind first, and if it looked promising, then

“Tune it, folks!”

After that comes validation. If the result is good, submit it; if not, keep tuning ….. and then at submission time, well, "Failed, again and again."

We sat around wondering which models to use, and in the end simply went with what we were comfortable with:

XGBoost: in our experience it is easy to tune, finds an answer quickly, speeds up well, and shows which features matter and which do not.
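Random forest: it grows as many randomized decision trees (Decision Tree) as you ask for, sampling again and again until the forest is full (hence the name), and then lets the whole forest answer together by averaging the trees instead of crowning one champion …. “so nobody gets to be 'best in the forest'”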
Neural Network: the trendy new model, popular in Thailand (and abroad too). It needs a lot of time, but once you manage to tune the architecture, ordinary ML "could be tuned for three lifetimes and still not keep up".
Symbolic regression (Genetic programming): this model starts from a set of rules, or equation building blocks, that we supply, so we can tell whether those rules are significant for the data.
Voting Regressor: put simply, it takes the models we think "will be accurate" (here we chose Gradient boosting, Random Forest, and Linear regression) and has them vote. A model whose predictions land closer to the data gets more weight, and the predictions of all the models are then combined in a weighted average to give the new prediction.
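
The weighted vote described above can be sketched in a few lines of plain Python. This is only an illustration: deriving the weights from the inverse of each model's validation error is our assumption about one reasonable weighting scheme, and every number below is made up rather than taken from the competition. In practice a library class such as scikit-learn's `VotingRegressor` does the averaging.

```python
# Sketch of weighted voting for regression.
# Model names, predictions, and validation errors are made-up illustration values.

def inverse_error_weights(val_errors):
    """Turn validation errors into weights that sum to 1 (lower error, higher weight)."""
    inverse = {name: 1.0 / err for name, err in val_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_vote(predictions, weights):
    """Combine per-model predictions for one sample into a weighted average."""
    return sum(weights[name] * pred for name, pred in predictions.items())


# Hypothetical validation error of each base model.
val_errors = {"gradient_boosting": 10.0, "random_forest": 20.0, "linear_regression": 40.0}
weights = inverse_error_weights(val_errors)

# Hypothetical predictions of the three models for a single MOF.
predictions = {"gradient_boosting": 100.0, "random_forest": 110.0, "linear_regression": 130.0}
combined = weighted_vote(predictions, weights)
print(weights, combined)
```

The combined value always lies between the smallest and largest individual prediction, pulled toward the models with the lower validation error.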

Challenges we ran into

Most of the group work full-time jobs (the author is a freelancer, though: “hire me, I'm hungry”), so we met rarely and had little time to work. Only when we realized the competition period was almost over did we go back and watch the recorded clips, and found out there was a loooot of material. Once the fire started spreading, we suddenly found the fire to get things done and submit. That left very little time for feature selection. There was a lot more we wanted to add, but we were afraid of running out of time, because the machine running the models was already groaning.

What we're proud of

We built the model, cleaned the data, and finished tuning in a little over one day. Normally this takes us longer.
Our knowledge of MOF features started at zero, so we did not know which features to drop or which values to use. That pushed us to research the topic further.

What we learned

We learned that MOFs exist (not UFOs, although at a glance the words look alike).
We exchanged knowledge within the team.
We made good use of the time we had left.
We put the knowledge we already had to the test in the field.
We gained knowledge from "other teams" (haha).

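What's next for our team

This part was originally written in Thai so that it would be easy to read and so that readers could get into the spirit of the competition. As for what comes next: getting enough sleep and eating balanced meals from all five food groups.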

OFFICIAL

Inspiration

It cannot be denied that global warming is one of the biggest issues of our time. The more advanced the technology and the more data people have, the better the chance of finding ways to slow the rise in global temperature. Carbon capture with metal-organic frameworks (MOFs) is one of the technologies that can help. We believe that machine learning, applied with suitable methods, can help us understand and predict the carbon capture capability of each MOF and so find the most effective compound.

What it does

The model predicts CO2 working capacity from the given parameters using the following methods:

XGBoost: the method builds trees one after another, each new tree correcting the errors left by the previous ones; therefore, the data can be fitted quickly.
Random forest: the method builds many randomized decision trees and combines their predictions (by averaging, for regression) to reduce over-fitting.
Neural Network: the method learns complex, non-linear correlations between the parameters and the target, in effect giving more weight to the related parameters when creating the model.
Symbolic regression (Genetic programming): the method derives an empirical equation from the raw data, which helps explain the result from a mathematical perspective.
Voting Regressor: the method combines the predictions of the models we selected (Gradient boosting, Random Forest, and Linear regression) in a weighted average to find the best outcome.
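
To make the boosting idea behind XGBoost concrete, here is a toy sketch in plain Python. It assumes squared-error loss, one-dimensional made-up data, and decision stumps as the weak learners; it is not XGBoost's actual implementation, only an illustration of how each round fits the errors left by the previous rounds.

```python
# Toy gradient boosting for squared error on 1-D data, using decision stumps.
# Illustrates "each new learner corrects the remaining error" only;
# the data and the learning rate are made-up illustration values.

def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Return the training MSE after each boosting round."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # the new learner targets what is still wrong
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
mse_history = boost(xs, ys, n_rounds=10)
print(mse_history)
```

The training error shrinks round by round, which is the behaviour the XGBoost description above refers to.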

How we built it

Exploratory data analysis

Check data types and missing values.
Check the correlation between features.
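
As a rough sketch of these two checks, the snippet below counts missing values per column and computes a Pearson correlation in plain Python. The column names and values are hypothetical and are not the competition schema; a data-frame library would normally do this in one or two calls.

```python
import math

# Made-up rows for illustration; column names are hypothetical.
rows = [
    {"surface_area": 1000.0, "void_fraction": 0.30, "working_capacity": 50.0},
    {"surface_area": 2000.0, "void_fraction": None, "working_capacity": 90.0},
    {"surface_area": 3000.0, "void_fraction": 0.50, "working_capacity": 160.0},
    {"surface_area": 4000.0, "void_fraction": 0.45, "working_capacity": 200.0},
]


def count_missing(rows):
    """Count None values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


missing = count_missing(rows)
r = pearson(
    [row["surface_area"] for row in rows],
    [row["working_capacity"] for row in rows],
)
print(missing, r)
```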

Data cleaning & preparation

Drop missing values and incorrect data.
Feature selection (combining correlation analysis with theoretical chemistry knowledge).
Transform categorical data with a one-hot encoder.
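
A minimal sketch of the dropping and encoding steps, again in plain Python. The `topology` column and its values are hypothetical examples of a categorical feature, and the helper names are ours; the feature-selection step is not shown because it depended on domain judgment.

```python
# Made-up rows; "topology" stands in for any categorical column.
rows = [
    {"topology": "pcu", "surface_area": 1000.0},
    {"topology": "sra", "surface_area": None},
    {"topology": "acs", "surface_area": 1500.0},
    {"topology": "pcu", "surface_area": 2000.0},
]


def drop_missing(rows):
    """Keep only rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


clean = drop_missing(rows)
encoded = one_hot(clean, "topology")
print(encoded)
```

Encoding after cleaning means that categories appearing only in dropped rows (here `sra`) do not produce an all-zero column.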

Modelling

Split the data into train and test sets.
Train models with various algorithms/techniques: XGBoost, Random Forest, Neural Network, Genetic programming, and Voting Regressor (Gradient boosting, Random Forest, and Linear regression).
Validate the models using R2 (linear regression) and the log of the mean absolute error (MAE).
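
The split and the two validation metrics can be sketched as follows. The 80/20 split ratio, the fixed seed, and the base-10 logarithm are assumptions made for the example, and the target values are made up.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


train, test = train_test_split(list(range(10)), test_fraction=0.2)

# Made-up true and predicted working capacities for four test samples.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]
mae = mean_absolute_error(y_true, y_pred)
log_mae = math.log10(mae)  # assumption: base-10 log; the write-up does not state the base
r2 = r2_score(y_true, y_pred)
print(len(train), len(test), mae, log_mae, r2)
```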

Model selection

Choose the model by evaluating the mean squared error.
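
Selecting by mean squared error amounts to scoring every candidate on the same held-out targets and keeping the lowest score. In the sketch below the model names match the ones listed earlier, but all predictions are made up and say nothing about which model actually won.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up held-out targets and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0]
candidate_predictions = {
    "xgboost": [105.0, 145.0, 205.0],
    "random_forest": [110.0, 140.0, 210.0],
    "neural_network": [90.0, 170.0, 180.0],
}

scores = {
    name: mean_squared_error(y_true, preds)
    for name, preds in candidate_predictions.items()
}
best_model = min(scores, key=scores.get)  # lowest MSE wins
print(scores, best_model)
```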

Challenges we ran into

1. None of our team members has a background in pure chemistry or MOFs. The resources provided by the competition organizer were sufficient, but working through them in such a short period was time-consuming. We therefore split responsibilities: members with a chemical engineering background took on the domain knowledge about MOFs, and members with a data science background took on the machine learning modelling.
2. Because of our time constraints (most of our team members work full time), we could only manage a group discussion about 1-2 times a week, even though we find working as a team more powerful and effective.
3. Modelling

3.1 Feature selection: we applied both MOF chemistry and statistical knowledge to understand how each feature affects CO2 working capacity.
3.2 Fine-tuning the models was slow because time was limited.

Accomplishments that we're proud of

It took us one day to create and tune the model. Usually, this process takes us longer.
How could we know which parameters to select as features for our model with little (zero) understanding of MOFs? We researched the topic to understand it better (at least better than zero 😊, and we are more confident now).

What we learned

We learned a lot about MOFs and the opportunities for applying machine learning in chemistry.
We exchanged knowledge within the team.
We made the most of our remaining time.
We used the knowledge we had to test ourselves in the field.
We gained knowledge from other teams.

What's next for Wonderland

Participating in this competition helped our team learn more about MOFs from a chemistry point of view, as well as about research opportunities in machine learning. As the competition goal is to find the best prediction of CO2 working capacity, “data cleansing” is a crucial step to filter unrelated parameters out of the model. This process requires high-level knowledge of chemistry to make sure that the model is consistent with theory. In our case, we conducted an analysis of the functional groups and found that some of them have a high correlation with CO2 working capacity while others do not. Thus, we strongly believe that statistical analysis can help identify potential parameters (both chemical and physical properties) that affect CO2 working capacity. After achieving the competition goal, we saw an opportunity to use machine learning to design MOFs in a variety of ways through the selection of organic linkers, metal nodes, topology, and functional groups, since all of these inputs are provided.