Binary Matrix Decomposition

Introduction

It may sound cool because it uses the words 'Matrix' and 'Decomposition', but it's really just the inverse of matrix multiplication. We wanted to implement different statistical and data analysis techniques inside MongoDB. That may not be exactly what MongoDB is made for, but we wanted to try something out of the box, inside the database. We first considered implementing neural networks in MongoDB; other options were bootstrapping and randomization tests, but we finally decided to try binary matrix decomposition, also called binary matrix factorization (BMF).

We often find data that can be expressed as a binary matrix. These matrices can be very large and are often sparse (i.e. lots of zeros and only a small number of ones). Facebook likes, Netflix movie ratings, Spotify playlists, ... can all be seen as large sparse binary matrices.

BMF is a data mining technique to discover clusters or categories in the data. Both the rows (which often represent users, clients, ...) and the columns (products, web pages, objects, movies, ...) are grouped into the same set of categories. This allows us to better interpret the data and understand where there's room to improve. The discrepancies between our category model and the original data show us which users might be interested in new products or personalized recommendations.[^1]
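
As a quick illustration (our own toy example, not part of the implementation below): BMF approximates a binary matrix A by the Boolean product of two smaller binary matrices B and C, where (B ∘ C)[i][j] = 1 if B[i][k] = C[k][j] = 1 for at least one category k.

// Boolean matrix product: (B ∘ C)[i][j] = OR over k of (B[i][k] AND C[k][j])
function booleanProduct(B, C) {
    return B.map(function(rowB) {
        return C[0].map(function(_, j) {
            return rowB.some(function(b, k) { return b === 1 && C[k][j] === 1; }) ? 1 : 0;
        });
    });
}

// 3 users x 4 products, reconstructed from 2 categories
var B = [[1, 0], [1, 1], [0, 1]];          // users -> categories
var C = [[1, 1, 0, 0], [0, 0, 1, 1]];      // categories -> products
booleanProduct(B, C)                       // [[1,1,0,0],[1,1,1,1],[0,0,1,1]]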

How I built it

BMF itself is a hard problem (NP-hard). There are several algorithms that try to find a solution that is good enough, reachable in a reasonable amount of time and with limited memory. But since the data itself can be very large, BMF is often not a simple task. Storing the matrices in a database and implementing the factorization algorithm as database operations (in this case MongoDB pipeline commands) might help us overcome the memory constraints. At this moment, our implementation is more a proof of concept than a fully working analysis tool. But the fact that it seems possible to implement these types of algorithms in MongoDB itself is already an achievement.

Data format

We didn't store the matrix data in a typical JSON matrix format. The matrices can be very large, and we don't want to run into MongoDB's 16 MB document size limit. Since the matrices are often sparse, we can save memory by only storing the 1-values in a simple coordinate format. There are several other options to store sparse matrices in a condensed way [^2].

db.bmfdata.find({},{"_id":false})
{ "col" : 2, "row" : 1, "value" : 1 }
{ "col" : 6, "row" : 1, "value" : 1 }
{ "col" : 7, "row" : 1, "value" : 1 }
{ "col" : 8, "row" : 1, "value" : 1 }
{ "col" : 9, "row" : 1, "value" : 1 }
{ "col" : 10, "row" : 1, "value" : 1 }
{ "col" : 11, "row" : 1, "value" : 1 }
{ "col" : 6, "row" : 2, "value" : 1 }
{ "col" : 7, "row" : 2, "value" : 1 }
{ "col" : 8, "row" : 2, "value" : 1 }
{ "col" : 10, "row" : 2, "value" : 1 }
{ "col" : 11, "row" : 2, "value" : 1 }
{ "col" : 6, "row" : 3, "value" : 1 }
{ "col" : 7, "row" : 3, "value" : 1 }
{ "col" : 8, "row" : 3, "value" : 1 }
{ "col" : 9, "row" : 3, "value" : 1 }
{ "col" : 10, "row" : 3, "value" : 1 }
{ "col" : 11, "row" : 3, "value" : 1 }
{ "col" : 2, "row" : 4, "value" : 1 }
{ "col" : 3, "row" : 4, "value" : 1 }
...
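
For larger matrices it makes sense to index the coordinate fields, so the self-joins and per-cell updates used further below don't have to scan the whole collection. This wasn't part of our original listings; a minimal addition:

// not shown in the original listings: index the coordinate format
db.bmfdata.createIndex({row: 1, col: 1})
db.bmfdata.createIndex({col: 1})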

In our analysis, we use the same example dataset P. Miettinen uses to explain his ASSO algorithm [^3]. We created a simple node.js/express data server that serves the different matrices in JSON format, and a D3 script converts this into a typical visualisation of a binary matrix.

[Figure: Data matrix]

Not only the data, but also the intermediary matrices and the factor solution use this same coordinate format. With this format we can perform a Boolean matrix multiplication with a rather simple pipeline command: every matching pair (a 1 in B at (i,k) joined with a 1 in C at (k,j)) contributes one document to the $group stage, so count ends up holding the number of matches while value is the Boolean result (always 1).

// Boolean matrix product: solution B ∘ C -> product
db.solution.aggregate([
    // take the 1-cells of B ...
    {$match: {matrix: "B"}},
    // ... and join them with all 1-cells of C
    {$lookup: {
        from: "solution",
        pipeline: [{$match: {matrix: "C"}}],
        as: "C"
    }},
    {$unwind: {path: "$C"}},
    // keep the pairs where B's column index equals C's row index
    {$match: {$expr: {$eq: ["$col", "$C.row"]}}},
    // one resulting cell per (row of B, column of C)
    {$group: {
        _id: {row: "$row", col: "$C.col"},
        count: {$sum: 1},      // number of matching k's (the integer product)
        value: {$avg: 1}       // always 1: the Boolean OR
    }},
    {$project: {_id: 0, row: "$_id.row", col: "$_id.col", value: "$value", count: "$count"}},
    {$out: "product"}
])

Algorithm

There are several ways to perform BMF. We chose to implement the ASSO algorithm [^3]. This algorithm starts by calculating a so-called association matrix, which is used as the basis to identify column factors.
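
The construction of the association collection itself isn't shown in the listings below, which consume it as given. A minimal sketch of how it could be built in the same pipeline style (the rowCounts collection and the threshold tau are our own assumed names, and the confidence direction and threshold value may need tuning):

var tau = 0.8  // ASSO confidence threshold, a tuning parameter

// 1-counts per data row
db.bmfdata.aggregate([
    {$group: {_id: {row: "$row"}, count: {$sum: 1}}},
    {$project: {_id: 0, row: "$_id.row", count: "$count"}},
    {$out: "rowCounts"}
])

// pairwise confidences:
// a(i,j) = |columns where rows i and j both have a 1| / |columns where row i has a 1|
db.bmfdata.aggregate([
    {$lookup: {from: "bmfdata", localField: "col", foreignField: "col", as: "other"}},
    {$unwind: {path: "$other"}},
    {$group: {_id: {i: "$row", j: "$other.row"}, both: {$sum: 1}}},
    {$lookup: {
        from: "rowCounts",
        let: {i: "$_id.i"},
        pipeline: [{$match: {$expr: {$eq: ["$row", "$$i"]}}}],
        as: "total"
    }},
    {$unwind: {path: "$total"}},
    {$project: {
        _id: 0,
        row: "$_id.i",
        col: "$_id.j",
        value: {$divide: ["$both", "$total.count"]},
        binary: {$gte: [{$divide: ["$both", "$total.count"]}, tau]}
    }},
    {$out: "association"}
])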

[Figure: Association matrix]

The construction of the corresponding row factor and the selection of the best coverage is done in a greedy fashion: the candidate column × row vector pair that covers most of the 'not yet covered' cells in the original data is selected as the next rank-1 factor. The pipeline command below, which creates the column and corresponding row vectors and selects the best one, is the main part of our implementation.

// Calculate Cover  (group row vectors)
db.bmfdata.aggregate([
    // start from the data cells that are not yet covered by the solution so far
    {$match: {$expr: {$ne: ["$covered", true]}}},
    // join each cell (row, col) with the binary association entries for that row
    {$lookup: {
        from: "association",
        let: {assorow: "$row"},
        pipeline: [
            {$match: {$expr: {$and: [
                {$eq: ["$binary", true]},
                {$eq: ["$$assorow", "$row"]}
            ]}}}
        ],
        as: "asso"
    }},
    {$unwind: {path: "$asso"}},
    // count, per (candidate, data column), how many uncovered 1s the candidate hits
    {$group: {
        _id: {assocol: "$asso.col", col: "$col"},
        count: {$sum: 1}
    }},
    // rename: "col" is now the candidate, "row" the data column it may cover
    {$project: {_id: 0, row: "$_id.col", col: "$_id.assocol", count: "$count"}},
    {$sort: {row: 1, col: 1}},
    // per-data-column totals and already-covered counts
    {$lookup: {
        from: "coveredCounts",
        let: {row: "$row"},
        pipeline: [
            {$match: {$expr: {$eq: ["$$row", "$col"]}}}
        ],
        as: "covered"
    }},
    {$unwind: {path: "$covered", preserveNullAndEmptyArrays: true}},
    // size of each candidate's vector in the binary association matrix
    {$lookup: {
        from: "assoColCounts",
        let: {col: "$col"},
        pipeline: [
            {$match: {$expr: {$eq: ["$$col", "$col"]}}}
        ],
        as: "colCount"
    }},
    {$unwind: {path: "$colCount"}},
    // score for including this data column: the uncovered 1s it would cover (count),
    // minus a weight-2 penalty for the remaining cells the candidate would claim:
    // relative = count - 2 * ((colCount - count) - covered)
    {$project: {
        row: "$row", col: "$col", count: "$count",
        colCount: "$colCount.count",
        covered: "$covered.covered",
        relative: {$subtract: ["$count", {$multiply: [2, {$subtract: [{$subtract: ["$colCount.count", "$count"]}, "$covered.covered"]}]}]},
        include: {$gt: [{$subtract: ["$count", {$multiply: [2, {$subtract: [{$subtract: ["$colCount.count", "$count"]}, "$covered.covered"]}]}]}, 0]}
    }},
    // keep only the data columns with a positive gain for this candidate
    {$match: {include: true}},
    // collect those columns into the candidate's row vector and sum the gain
    {$group: {
        _id: {col: "$col"},
        count: {$sum: 1},
        sum: {$sum: "$count"},
        rowVector: {$addToSet: "$row"}
    }},
    {$project: {col: "$_id.col", count: "$count", sum: "$sum", rowVector: "$rowVector"}},
    {$out: "factorCover"}
])

> db.factorCover.find().sort({sum:-1})
{ "_id" : { "col" : 10 }, "col" : 10, "count" : 5, "sum" : 12, "rowVector" : [ 9, 6, 1, 7, 8 ] }
{ "_id" : { "col" : 11 }, "col" : 11, "count" : 4, "sum" : 7, "rowVector" : [ 8, 7, 9, 6 ] }
{ "_id" : { "col" : 9 }, "col" : 9, "count" : 4, "sum" : 4, "rowVector" : [ 8, 7, 9, 6 ] }
{ "_id" : { "col" : 1 }, "col" : 1, "count" : 1, "sum" : 1, "rowVector" : [ 2 ] }

By iterating this pipeline command and saving the resulting vectors to the solution collection, we can extract new vectors until the difference between our solution and the original data is minimal. A sketch of the full iteration loop follows after the listings below.

[Figure: Factor extraction]

// Select the highest-cover factor and add it to the solution as rank "rank"
// (rank is assumed to be initialized to 1 before the first iteration)
var factor = db.factorCover.find().sort({sum: -1}).limit(1).toArray()[0]
db.solution.remove({"rank": {$gte: rank}})
// save B solution: the candidate's binary association column becomes a column of B
db.association.find({col: factor.col, binary: true}).forEach(function(d) {
    db.solution.insert({"matrix": "B", "row": d.row, "col": rank, value: 1, rank: rank})
})
// save C solution: the covered data columns become a row of C
factor.rowVector.forEach(function(d) {
    db.solution.insert({"matrix": "C", "row": rank, "col": d, value: 1, rank: rank})
})

// Multiply solution: rerun the Boolean matrix multiplication pipeline from the
// Data format section above (solution B ∘ C -> product)   // specify rank??


// cover bookkeeping:
// mark the data cells that the current solution reproduces as covered ...
db.product.find().forEach(function(d) {
    db.bmfdata.updateOne({"row": d.row, "col": d.col}, {$set: {covered: true}})
})

// ... and mark the product cells that exist in the data; product cells that stay
// unmarked are "over-covered" (a 1 in the solution where the data has a 0)
db.bmfdata.find().forEach(function(d) {
    db.product.updateOne({"row": d.row, "col": d.col}, {$set: {covered: true}})
})

// per data column: total number of 1s and how many of them are covered by now
db.bmfdata.aggregate([
    {$group: {
        _id: {col: "$col"},
        count: {$sum: 1},
        covered: {$sum: {$cond: ["$covered", 1, 0]}}
    }},
    {$project: {_id: 0, col: "$_id.col", count: 1, covered: 1}},
    {$out: "coveredCounts"}
])


// record the quality of the solution at the current rank
var info = {
    "rank": rank,
    "covered": db.bmfdata.find({covered: {$eq: true}}).count(),      // data 1s reproduced
    "notCovered": db.bmfdata.find({covered: {$ne: true}}).count(),   // data 1s still missed
    "overCovered": db.product.find({covered: {$ne: true}}).count()   // solution 1s not in the data
}
db.rank.insert(info)
rank++
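
The listings above are the body of one iteration. A possible driver loop (our sketch; the helper functions are hypothetical wrappers around the scripts above, not existing code):

// hypothetical wrappers around the listings above:
// calculateCover()        -> the "Calculate Cover" aggregation
// extractFactor(f, rank)  -> the "save B/C solution" steps
// multiplyAndCover()      -> multiplication + cover bookkeeping + coveredCounts
var rank = 1
var maxRank = 5     // assumed stopping bound
while (rank <= maxRank) {
    calculateCover()
    var factor = db.factorCover.find().sort({sum: -1}).limit(1).toArray()[0]
    if (!factor || factor.sum <= 0) break   // nothing left worth covering
    extractFactor(factor, rank)
    multiplyAndCover()
    rank++
}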

Future

After successfully implementing the BMF algorithm, we'll need to prove its value with real data. Our plans for the future:

  • Test the implementation with larger datasets.
  • Implement later improvements to the original algorithm (like the Minimum Description Length criterion).
  • Implement other data mining techniques.
  • Try to implement neural networks with a similar in-database approach.

Conclusion

It started out as a challenge, just to see how far we could go in implementing data analysis techniques inside MongoDB, more a proof of concept than a real application. But now that we see it working, this might open up a whole new approach to analyzing data.

[^1]: Matrix factorization and Netflix recommendations: https://www.youtube.com/watch?v=ZspR5PZemcs

[^2]: Sparse matrix storage: https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

[^3]: ASSO algorithm: https://people.mpi-inf.mpg.de/~pmiettin/slides/BooleanMatrixFactorizationsForDataMining_NRC_slides.pdf and https://people.mpi-inf.mpg.de/~pmiettin/papers/miettinen14mdl4bmf.pdf
