Yandex open sources CatBoost, a gradient boosting machine learning library


Artificial intelligence now powers a growing number of
computing functions, and today the developer community is
getting another AI boost, courtesy of Yandex. The Russian
search giant (which, like its US counterpart Google, has
extended into a myriad of other business lines, from mobile to
maps and more) announced the launch of CatBoost, an open source
machine learning library based on gradient boosting, the
branch of ML that is specifically designed to help “teach”
systems when you have relatively little data, and especially
when the data may not all be sensory (such as audio, text or
imagery), but includes transactional or historical data, too.

CatBoost is making its debut in two ways today. (I think ‘Cat’,
by the way, is a shortening of ‘category’, not your feline
friend, although Yandex is enjoying the play on words. If you
visit the CatBoost site you will see what I mean.)

First, Yandex says that it is starting to use the new framework
itself across its own services, to replace MatrixNet, the
machine learning algorithm that up to now has been used at the
company for everything from ranking tasks and weather
forecasting to Yandex.Taxi services (which are now being spun
off into a $3.7 billion joint venture with Uber across Russian
markets) and recommendations. The switchover from MatrixNet to
CatBoost is happening now and will continue in the months
ahead.

Second, Yandex is offering the CatBoost library as a free
service, released under an Apache license, to any and all who
need or want to use gradient-boosting tech in their own
programs. “This is the pinnacle of a lot of years of work,”
Misha Bilenko, Yandex’s head of machine intelligence and
research, said in an interview. “We have been using a lot of
open source machine learning tools ourselves, so it’s good
karma to give something back.” He mentioned Google’s move to
open source TensorFlow back in 2015 and the establishment and
growth of Linux as two inspirations here.

Bilenko added that there are “no plans” to commercialise
CatBoost or close it off in any other proprietary way. “It’s
not a question of competitors,” he said. “We’d be glad to have
competitors use it as it’s foundational.”

The move is a significant contribution from Yandex to the open
source world, and just as Google has continued to expand and
update TensorFlow, the idea is that today’s CatBoost release is
the first iteration that will be updated and developed further,
Bilenko told me. Today, the library has three main features:

“Reduced overfitting,” which Yandex says helps you get better
results in a training program. It is “based on a proprietary
algorithm for constructing models that differs from the
standard gradient-boosting scheme.”
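For reference, the “standard gradient-boosting scheme” that the quote above contrasts with can be sketched in a few lines. This is a minimal illustration of textbook gradient boosting on squared loss with one-split trees (“stumps”), not CatBoost’s own algorithm, which the article does not describe. The data and all function names are made up for the example.

```python
def fit_stump(xs, residuals):
    """Pick the single threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one point on each side
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all x values equal): fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    current = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        current = [p + learning_rate * predict_stump(stump, x)
                   for p, x in zip(current, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data: low values for small x, high values for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = fit_boosting(xs, ys)
```

Each round nudges the predictions a little closer to the training targets, which is also why plain boosting can overfit if it runs for too many rounds on too little data.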

“Categorical features support,” in which your training results
are improved while letting you use non-numeric factors,
“instead of having to pre-process your data or spend time and
effort turning it to numbers.”
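To give a sense of the manual work this is meant to save, here is a rough sketch of one common way to turn a category into a number by hand: replace each value with a smoothed average of the target for that value. This illustrates the general technique only; the article does not say how CatBoost handles categories internally. The function name, the `prior_weight` parameter and the data are all hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets, prior_weight=1.0):
    """Replace each category with a smoothed mean of the target for that category.

    The global mean acts as a prior, so rare categories are pulled towards it.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return [
        (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in categories
    ]


# Made-up example: which city a user is in, and whether they clicked.
cities = ["moscow", "kazan", "moscow", "sochi", "kazan", "moscow"]
clicked = [1, 0, 1, 0, 0, 1]
encoded = target_encode(cities, clicked)
```

Note that this naive version uses each row’s own label when encoding that row, which can leak the answer into the features. That is one reason hand-rolled encodings take care to get right.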

It also offers interfaces that let you use CatBoost from the
command line or via an API for Python or R, including tools for
formula analysis and training visualisation.
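As a rough idea of what the Python route looks like, here is a small sketch that trains a classifier on a toy table with one text column and one numeric column. It assumes the `catboost` package is installed (for example via `pip install catboost`); the data, the column meanings and the `train_and_predict` helper are invented for the example, and the settings are kept tiny so it runs in moments.

```python
# Assumption: the `catboost` package is installed. The guard below only keeps
# this sketch importable on machines where it is not.
try:
    from catboost import CatBoostClassifier
except ImportError:
    CatBoostClassifier = None

# Made-up rows: [city, age]. The labels say whether the user clicked.
X = [
    ["moscow", 25], ["moscow", 31], ["kazan", 47], ["kazan", 52],
    ["moscow", 22], ["kazan", 38], ["moscow", 29], ["kazan", 61],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]


def train_and_predict():
    """Train on the toy rows and return predictions, or None without catboost."""
    if CatBoostClassifier is None:
        return None
    model = CatBoostClassifier(
        iterations=20,              # small on purpose; this is a toy run
        verbose=False,              # silence per-iteration logging
        random_seed=0,
        allow_writing_files=False,  # do not create a catboost_info directory
    )
    # cat_features=[0] marks column 0 (city) as categorical, so the raw
    # strings are passed in without converting them to numbers first.
    model.fit(X, y, cat_features=[0])
    return model.predict(X)


predictions = train_and_predict()
```

The point of interest is the `cat_features` argument: the city strings go in as they are, which is the “categorical features support” described above.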

While there are a number of other libraries out there to help
with gradient boosting or other solutions to help train machine
learning systems (XGBoost being one),
Bilenko argued that the benefit of CatBoost and other
frameworks put out there by large companies like Yandex is that
they are “battle tested” for accuracy.

“The dirty secret with a lot of machine
learning code is that it requires pretty extensive tuning,” he
said. “Ours requires little and provides pretty good
performance out of the box. That is a key
differentiator.”
