Inspiration

I've been working on tabular datasets for the past few years, and I managed to build a rough AutoML system that beat the 'auto sklearn' solution to some extent. After I met PyTorch, I was deeply attracted by its simplicity and power, but I failed to find a satisfying solution for tabular datasets that was 'carefree' enough. So I decided to take advantage of my knowledge and build one myself, and here comes carefree-learn, which aims to provide out-of-the-box tools for training neural networks on tabular datasets with PyTorch.

What it does

Here's the documentation, which covers most of the following statements.

carefree-learn provides high level APIs for PyTorch to simplify the training on tabular datasets. It features:

  • A scikit-learn-like interface with a much more 'carefree' usage. In fact, carefree-learn provides an end-to-end pipeline on tabular datasets, which AUTOMATICALLY deals with:
    • Detection of redundant feature columns which can be excluded (all SAME, all DIFFERENT, etc).
    • Detection of feature column types (whether a feature column is a string column / numerical column / categorical column).
    • Imputation of missing values.
    • Encoding of string columns and categorical columns (Embedding or One Hot Encoding).
    • Pre-processing of numerical columns (Normalize, Min Max, etc.).
    • And much more...
  • Can either fit / predict directly from some numpy arrays, or fit / predict indirectly from some files located on your machine.
  • Easy-to-use saving and loading. By default, everything will be wrapped into a zip file! (See the short sketch after the installation note below.)
  • Distributed Training, which means hyper-parameter tuning can be very efficient in carefree-learn.
  • Support for many convenient deep learning functionalities, including:
    • Early stopping.
    • Model persistence.
    • Learning rate schedulers.
    • And more...
  • Some 'translated' machine learning algorithms, including:
    • Trainable (Neural) Naive Bayes
    • Trainable (Neural) Decision Tree
  • Some brand-new techniques which may boost vanilla Neural Network (NN) performance on tabular datasets.
  • Highly customizable for developers. We have already wrapped (almost) every single functionality / process into a single module (a Python class), and they can be replaced or enhanced either directly from the source code or from your local code with the help of some pre-defined registration functions provided by carefree-learn.
  • Full utilization of the WIP cf* ecosystem, such as:
    • carefree-toolkit: provides a lot of 'stand-alone' utility classes & functions which can be leveraged in your own projects.
    • carefree-data: a lightweight tool to read -> convert -> process ANY tabular dataset. It also utilizes Cython to accelerate critical procedures.

To try carefree-learn, you can install it with pip install carefree-learn.
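As a quick illustration of the saving & loading feature, the workflow roughly looks like the sketch below (the exact save / load entry points here are assumptions on my side, so please refer to the documentation for the precise APIs):

import cflearn
import numpy as np

x = np.random.random([1000, 10])
y = np.random.random([1000, 1])

m = cflearn.make().fit(x, y)
# assumption: everything (model + data processing) gets packed into a single zip file
cflearn.save(m)
# assumption: `load` restores the trained wrapper(s) from that zip file, ready for `predict`
loaded = cflearn.load()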

How I built it

I structured the carefree-learn backend into three modules: Model, Pipeline and Wrapper (a rough sketch of how these layers fit together follows the list below):

  • Model: In carefree-learn, a Model should implement the core algorithms.
    • It assumes that the input data in the training process is already 'batched, processed, nice and clean', but not yet 'encoded'.
      • Fortunately, carefree-learn has pre-defined some useful methods which can encode categorical columns easily.
    • It does not care about how to train a model; it only focuses on how to make predictions from the inputs, and how to calculate losses from them.
  • Pipeline: In carefree-learn, a Pipeline should implement the high-level parts, as listed below:
    • It assumes that the input data is already 'processed, nice and clean', but it should take care of getting input data into batches, because in real applications batching is essential for performance.
    • It should take care of the training loop, which includes updating parameters with an optimizer, reporting metrics, checkpointing, early stopping, logging, etc.
  • Wrapper: In carefree-learn, a Wrapper should implement the preparation and API part.
    • It should not make any assumptions about the input data: it could already be 'nice and clean', but it could also be 'dirty and messy'. Therefore, it needs to transform the original data into 'nice and clean' data and then feed it to the Pipeline. The data transformations include:
      • Imputation of missing values.
      • Transforming string columns into categorical columns.
      • Processing numerical columns.
      • Processing label column (if needed).
    • It should implement some algorithm-agnostic methods (e.g. predict, save, load, etc.).
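To make this division of labour concrete, here is a purely illustrative sketch of how the three layers stack on top of each other. The class and method names below are simplified placeholders of my own, not the actual carefree-learn APIs:

class Model:
    """Core algorithm: predictions & losses on batched, processed (but possibly un-encoded) data."""
    def forward(self, batch): ...
    def loss(self, forward_results, labels): ...

class Pipeline:
    """Batching + the training loop on processed data."""
    def __init__(self, model):
        self.model = model
    def make_batches(self, x, y): ...
    def fit(self, processed_x, processed_y):
        for batch, labels in self.make_batches(processed_x, processed_y):
            forward_results = self.model.forward(batch)
            loss = self.model.loss(forward_results, labels)
            # optimizer step, metrics, checkpointing, early stopping, logging, ...

class Wrapper:
    """Raw ('dirty and messy') data in, clean data out, plus the user-facing APIs."""
    def __init__(self, pipeline):
        self.pipeline = pipeline
    def transform(self, raw_x, raw_y): ...  # imputation, string -> categorical, processing, ...
    def fit(self, raw_x, raw_y):
        self.pipeline.fit(*self.transform(raw_x, raw_y))
        return self
    def predict(self, raw_x): ...
    def save(self, path): ...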

It is worth mentioning that carefree-learn uses registrations to manage the code structure.
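Conceptually, the registration mechanism boils down to a very small pattern, something like the generic sketch below (this is not the actual carefree-learn implementation, just an illustration of the idea):

model_registry = {}

def register_model(name):
    # decorator that records a class under `name`, so configs can refer to it by string
    def _register(cls):
        model_registry[name] = cls
        return cls
    return _register

@register_model("my_fcnn_variant")
class MyFCNNVariant:
    ...

# later on, a pipeline can look models up purely from a config string
model_cls = model_registry["my_fcnn_variant"]

This is what makes the 'highly customizable' claim above possible: developers can register their own replacements from local code without touching the carefree-learn source.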

Challenges I ran into

The biggest challenge I ran into was building the system itself. I needed to make sure that users can use it easily and that developers can extend it without spending too much effort. This took me days to design & refactor.

The second challenge was the data processing module (carefree-data). Since the target of carefree-learn is to fit (almost) any tabular dataset with high performance, I needed to implement a whole bunch of data processing methods in carefree-data, in an automatic manner. This again took me days to design & optimize.

Another challenge was the multiprocessing part. Using CUDA together with multiprocessing is not easy, especially when I needed to do some fine-grained logging inside the worker processes. This aaagain took me days to experiment & resolve.

Accomplishments that I'm proud of

I've made training NNs on tabular datasets really easy now:

import cflearn
m = cflearn.make()
# fit np.ndarray
m.fit(x_np, y_np, x_cv_np, y_cv_np)
m.predict(x_test_np)
# fit python lists
m.fit(x_list, y_list, x_cv_list, y_cv_list)
m.predict(x_test_list)
# fit files
m.fit("x.txt", x_cv="x_cv.txt")
m.predict("x_test.txt")

Although the demand for working with tabular datasets is not that large, I'll be very happy if carefree-learn could help someone who needs it.

I'm also proud that I've written some documentation for carefree-learn.

What I learned

How to build an easy-to-use (Deep Learning?) system :) How to write documentation :D How to make videos XD

What's next for carefree-learn

The next step is to run some benchmark tests and optimize carefree-learn's performance. I'm pretty sure it can reach a satisfying level with some tuned default settings.

And, as always, bug fixing XD


Updates

Updates (2020.08.25)

This update implemented two commonly used ensemble methods (bagging, adaboost). However, the related code is still WIP, so the main purpose of this update is to show the potential of carefree-learn:

import cflearn
from cfdata.tabular import TaskTypes
# `config`, `train_file` & `test_file` are assumed to be defined elsewhere
ensemble = cflearn.Ensemble(TaskTypes.CLASSIFICATION, config)
results = ensemble.adaboost(train_file)
predictions = results.pattern.predict(test_file)
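Presumably bagging can be used in a similar fashion (the call below is my assumption based on the description above, so the exact signature may differ):

# assumed to mirror the adaboost API shown above
results = ensemble.bagging(train_file)
predictions = results.pattern.predict(test_file)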

Updates (2020.08.16)

This update is mainly about miscellaneous fixes, but I've also introduced a toy example to reveal the power of carefree-learn - the famous Titanic competition!

Here is the source code:

import os
import cflearn

from cfdata.tabular import *

file_folder = os.path.dirname(__file__)


def test():
    train_file = os.path.join(file_folder, "train.csv")
    test_file = os.path.join(file_folder, "test.csv")
    data_config = {"label_name": "Survived"}
    # tune hyper-parameters for the `tree_dnn` model on the training file
    hpo = cflearn.tune_with(
        train_file,
        model="tree_dnn",
        temp_folder="__hpo__",
        task_type=TaskTypes.CLASSIFICATION,
        data_config=data_config,
        num_parallel=0
    )
    # train the model 10 times with the best hyper-parameters found above
    results = cflearn.repeat_with(
        train_file,
        **hpo.best_param,
        models="tree_dnn",
        temp_folder="__repeat__",
        num_repeat=10, num_jobs=0,
        data_config=data_config
    )
    # ensemble the 10 trained models and predict on the test file
    ensemble = cflearn.EnsemblePattern(results.patterns["tree_dnn"])
    predictions = ensemble.predict(test_file).ravel()
    # recover the PassengerId column from the raw test file
    x_te, _ = results.transformer.data.read_file(test_file, contains_labels=False)
    id_list = DataTuple.with_transpose(x_te, None).xT[0]
    # Score : achieved ~0.79
    with open("submissions.csv", "w") as f:
        f.write("PassengerId,Survived\n")
        for test_id, prediction in zip(id_list, predictions):
            f.write(f"{test_id},{prediction}\n")


if __name__ == '__main__':
    test()

As you can see, carefree-learn doesn't need explicit data pre-processing - it can take files as inputs and predict with files directly! Moreover, some common practices, such as hyper-parameter tuning (cflearn.tune_with) and ensembling (cflearn.repeat_with & cflearn.EnsemblePattern), can be completed in a few lines of code. These APIs also hide some other common practices (such as cross validation) under the hood, so the final performance is quite promising (I can achieve ~0.79 and the best run achieved 0.81+, which is almost SOTA performance among other (more complicated) neural network solutions 1 2 3 4).

Updates (2020.08.01)

Experiments

The Experiments class is much more powerful and much easier to use now:

import cflearn
import numpy as np

from cfdata.tabular import *

def main():
    x, y = TabularDataset.iris().xy
    experiments = cflearn.Experiments()
    experiments.add_task(x, y, model="fcnn")
    experiments.add_task(x, y, model="fcnn")
    experiments.add_task(x, y, model="tree_dnn")
    experiments.add_task(x, y, model="tree_dnn")
    results = experiments.run_tasks(num_jobs=2)
    # {'fcnn': [Task(fcnn_0), Task(fcnn_1)], 'tree_dnn': [Task(tree_dnn_0), Task(tree_dnn_1)]}
    print(results)
    ms = {k: list(map(cflearn.load_task, v)) for k, v in results.items()}
    # {'fcnn': [FCNN(), FCNN()], 'tree_dnn': [TreeDNN(), TreeDNN()]}
    print(ms)
    # experiments could be saved & loaded easily
    saving_folder = "__temp__"
    experiments.save(saving_folder)
    loaded = cflearn.Experiments.load(saving_folder)
    ms_loaded = {k: list(map(cflearn.load_task, v)) for k, v in loaded.tasks.items()}
    # {'fcnn': [FCNN(), FCNN()], 'tree_dnn': [TreeDNN(), TreeDNN()]}
    print(ms_loaded)
    assert np.allclose(ms["fcnn"][1].predict(x), ms_loaded["fcnn"][1].predict(x))

if __name__ == '__main__':
    main()

We can see that experiments.run_tasks returns a bunch of Tasks, which can easily be converted into models through cflearn.load_task.

It is important to wrap the code in a main() function on some platforms (e.g. Windows), because running code in parallel will cause issues if we don't do so. Here's an explanation.

Benchmark

The Benchmark class is implemented for easier benchmark testing:

import cflearn
import numpy as np

from cfdata.tabular import *

def main():
    x, y = TabularDataset.iris().xy
    benchmark = cflearn.Benchmark(
        "foo",
        TaskTypes.CLASSIFICATION,
        models=["fcnn", "tree_dnn"]
    )
    benchmarks = {
        "fcnn": {"default": {}, "sgd": {"optimizer": "sgd"}},
        "tree_dnn": {"default": {}, "adamw": {"optimizer": "adamw"}}
    }
    msg1 = benchmark.k_fold(3, x, y, num_jobs=2, benchmarks=benchmarks).comparer.log_statistics()
    """
    ~~~  [ info ] Results
    ===============================================================================================================================
    |        metrics         |                       acc                        |                       auc                        |
    --------------------------------------------------------------------------------------------------------------------------------
    |                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
    --------------------------------------------------------------------------------------------------------------------------------
    |    fcnn_foo_default    |    0.780000    | -- 0.032660 -- |    0.747340    |    0.914408    |    0.040008    |    0.874400    |
    --------------------------------------------------------------------------------------------------------------------------------
    |      fcnn_foo_sgd      |    0.113333    |    0.080554    |    0.032780    |    0.460903    |    0.061548    |    0.399355    |
    --------------------------------------------------------------------------------------------------------------------------------
    |   tree_dnn_foo_adamw   | -- 0.833333 -- |    0.077172    | -- 0.756161 -- | -- 0.944698 -- | -- 0.034248 -- | -- 0.910451 -- |
    --------------------------------------------------------------------------------------------------------------------------------
    |  tree_dnn_foo_default  |    0.706667    |    0.253684    |    0.452983    |    0.924830    |    0.060007    |    0.864824    |
    ================================================================================================================================
    """
    # save & load
    saving_folder = "__temp__"
    benchmark.save(saving_folder)
    loaded_benchmark, loaded_results = cflearn.Benchmark.load(saving_folder)
    msg2 = loaded_results.comparer.log_statistics()
    assert msg1 == msg2

if __name__ == '__main__':
    main()

Misc

  • Integrated trains.
  • Integrated Tracker from carefree-toolkit.
  • Integrated native amp from PyTorch.
  • Implemented FocalLoss (see the note at the end of this section).
  • Implemented cflearn.zoo.

  • Introduced CI.
  • Fixed some bugs.
  • Simplified some APIs.
  • Optimized some default settings.
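For reference, FocalLoss presumably refers to the focal loss popularized in the object detection literature: FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t), where p_t is the predicted probability of the true class; the (1 - p_t)^γ factor down-weights easy, well-classified samples so that training focuses on the hard ones.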

What's next

I've already done some experiments on several benchmark datasets with Benchmark and achieved satisfying performance. However, large-scale benchmark testing is not done yet, limited by my lack of GPU cards XD

So the next step is to do large-scale benchmark testing and optimize carefree-learn's performance, in a much more general way.

In the meantime, I'll do some research and implement some SOTA methods on tabular datasets (e.g. Deep Sparse Network, β-LASSO MLP, ...).

And, as always, bug fixing XD
