Creating your own estimator in scikit-learn
I had an interesting problem at work, and it finally pushed me to do something I had been thinking about for some time now.
My problem consists of using Recurrent Neural Networks (which were implemented in Lua here), to which I had to feed text files preprocessed by Python. The networks produce output to stdout, which was then parsed and preprocessed again by Python as input to another Python package, which again produces some output files, which were again parsed by Python :D . That's what I call fun. Of course, all parts of the process had some parameters which needed to be tuned by running on a validation set and then evaluated on a test set. Doing that manually seemed unbearable. Hence, I decided to create my own estimator using scikit-learn and then use Pipeline and GridSearchCV to automate the whole process and the parameter tuning.
In this little post I will just give a summary and an example of creating your own estimator. It is based on information from this page: Rolling your own estimator (scikit-learn docs). The example there is not very representative, so I will try to come up with something clearer.
You should be familiar with what objects are (the things you define with the class keyword).
Building an object
Decide what type of object you want to create. You need to choose one of these: Classifier, Clustering, Regressor or Transformer. A classifier is self-explanatory -- you give it some input X and get back the class to which each sample probably belongs (e.g. a Naive Bayes classifier). An example of a regressor is e.g. Linear Regression: you give it some input X and get estimates of a variable Y. Then there is clustering, which I'm not going into in more detail now. The last one, a transformer, is for transforming the data -- it takes X and returns a changed X. An example of this might be PCA.
After you have decided which one suits your needs, you subclass BaseEstimator and the appropriate class for your type (one of ClassifierMixin, RegressorMixin, ClusterMixin, TransformerMixin).
Now, of course, you need to decide what parameters the estimator receives, what input it takes and what output it returns. That's all you need for now.
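To make this concrete, here is a minimal sketch of such a subclass for a transformer (the class name, parameter and method bodies are made up for illustration):
from sklearn.base import BaseEstimator, TransformerMixin

class ScalingTransformer(BaseEstimator, TransformerMixin):
    """A hypothetical transformer which multiplies every value by a constant."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y=None):
        # nothing to learn for this toy example
        return self

    def transform(self, X):
        return [x * self.scale for x in X]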
Abiding by scikit-learn's rules
It's good to know some additional rules.
- All arguments of __init__ must have default values, so it's possible to initialize the classifier just by typing MyClassifier().
- No validation of the input parameters should happen in the __init__ method! That belongs in the fit method.
- All arguments of the __init__ method should have the same names as the attributes they become on the created object (see the Simplification part below about automating this). A compliant sketch follows after this list.
- Do not take data as an argument here! It belongs in the fit method.
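Putting these rules together, a minimal compliant __init__ looks like this (the parameter names are made up):
def __init__(self, intValue=0, stringParam="defaultValue"):
    # every argument has a default, no validation, no data,
    # and each attribute keeps the name of its argument
    self.intValue = intValue
    self.stringParam = stringParam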
Beyond these, your estimator should abide by the following rules.
get_params and set_params
All estimators must have get_params and set_params functions. They are inherited when you subclass BaseEstimator, and I would recommend not overriding these functions (just don't declare them in the definition of your classifier).
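You can check what the inherited methods give you with a trivial (made-up) subclass:
from sklearn.base import BaseEstimator

class TinyClassifier(BaseEstimator):
    def __init__(self, intValue=0, stringParam="defaultValue"):
        self.intValue = intValue
        self.stringParam = stringParam

clf = TinyClassifier()
print(clf.get_params())     # {'intValue': 0, 'stringParam': 'defaultValue'}
clf.set_params(intValue=5)  # returns the estimator itself, now with intValue=5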
Fit method
The fit method is where all the hard work is implemented. First, you should check the parameters. Second, you should take and process the data.
You'll almost surely want to add some new attributes to your object inside the fit method. Their names should end with an underscore, e.g. self.fitted_. And finally, you should return self. This is again for compatibility with the common scikit-learn interface.
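Schematically, a fit method might look like this (a hypothetical sketch with made-up parameter and attribute names):
def fit(self, X, y=None):
    # 1) check the parameters
    if not isinstance(self.intValue, int):
        raise ValueError("intValue must be an integer")
    # 2) do the hard work; learned attributes end with an underscore
    self.mean_ = sum(X) / float(len(X))
    # 3) return self for scikit-learn compatibility
    return self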
Response vector y
There might be cases when you do not need a response vector as input (as in the example below). Nevertheless, for implementation reasons you need to keep it in your method signatures if you want to use GridSearch and the like. It's good to give it a default value of None (as in the fit sketch above).
Additional requirements, score and GridSearch
As usual, everything that should be hidden from the user should start with an underscore (_).
For GridSearch to work correctly, it is usually necessary to override the score method to suit your needs. Why? GridSearch needs to recognize whether a given model is better than another, and it does so very simply: it calls the score method and then applies the rule "bigger is better". Hence, whatever crazy scoring function you have, you basically must represent it as a single number.
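For example, if your natural measure of quality is an error (where smaller is better), you can return its negative so that the bigger-is-better rule still holds. A hypothetical sketch:
def score(self, X, y):
    # GridSearch maximizes the score, so return the negative error
    errors = [abs(pred - true) for pred, true in zip(self.predict(X), y)]
    return -sum(errors) / float(len(errors))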
An example of MeanClassifier
Let's say we want to create a classifier which assigns an input to one of two classes: class 0 if a value is smaller than some number, and class 1 if the value is bigger than or equal to that number. What about the "some number"? Let's say it will be the mean + intValue of exactly 20 input values in a list. Our classifier will hence take one vector X (of length 20), and every record will be assigned 1 or 0 (resp. True or False). Furthermore, it is in a way an unsupervised classifier -- we do not input a response vector y.
Additionally, let's say we want to find the configuration of intValue, for a given input X, which yields the biggest number of True values (which in our case means values bigger than mean + intValue, so basically we want intValue to be as low as possible, but forget about that for now :D ).
Here is the code:
from sklearn.base import BaseEstimator, ClassifierMixin

class MeanClassifier(BaseEstimator, ClassifierMixin):
    """An example of a classifier"""

    def __init__(self, intValue=0, stringParam="defaultValue", otherParam=None):
        """
        Called when initializing the classifier
        """
        self.intValue = intValue
        self.stringParam = stringParam

        # THIS IS WRONG! Parameters should have the same name as attributes
        self.differentParam = otherParam

    def fit(self, X, y=None):
        """
        This should fit the classifier. All the "work" should be done here.

        Note: assert is not a good choice here and you should rather
        use a try/except block with exceptions. This is just for short syntax.
        """
        assert (type(self.intValue) == int), "intValue parameter must be integer"
        assert (type(self.stringParam) == str), "stringParam parameter must be string"
        assert (len(X) == 20), "X must be a list of exactly 20 numerical values"

        self.threshold_ = (sum(X) / len(X)) + self.intValue  # mean + intValue

        return self

    def _meaning(self, x):
        # returns True/False according to the fitted classifier
        # notice the underscore at the beginning
        return True if x >= self.threshold_ else False

    def predict(self, X, y=None):
        try:
            getattr(self, "threshold_")
        except AttributeError:
            raise RuntimeError("You must train the classifier before predicting data!")

        return [self._meaning(x) for x in X]

    def score(self, X, y=None):
        # counts the number of values bigger than the mean
        return sum(self.predict(X))
And now you should be able to use GridSearch as usual:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

X_train = [i for i in range(0, 100, 5)]
X_test = [i + 3 for i in range(-5, 95, 5)]
tuned_params = {"intValue": [-10, -1, 0, 1, 10]}

gs = GridSearchCV(MeanClassifier(), tuned_params)

# for some reason I have to pass y with the same shape,
# otherwise GridSearch throws an error. Not sure why.
gs.fit(X_test, y=[1 for i in range(20)])

gs.best_params_  # {'intValue': -10} # and that is what we expect :)
Simplification of the init method
The initialization in the __init__ method becomes very tedious when you have a lot of attributes, since you will be writing the following a lot:
def __init__(self, arg1, arg2, arg3, ..., argN):
    self.arg1 = arg1
    self.arg2 = arg2
    ...
    self.argN = argN
This task can be automated using the inspect module and setattr. Import it with import inspect, and then you can use:
def __init__(self, arg1, arg2, arg3, ..., argN):
    # print("Initializing classifier:\n")

    args, _, _, values = inspect.getargvalues(inspect.currentframe())
    values.pop("self")

    for arg, val in values.items():
        setattr(self, arg, val)
        # print("{} = {}".format(arg, val))
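A quick check (with a made-up class name) that this automated __init__ still cooperates with the inherited get_params:
import inspect
from sklearn.base import BaseEstimator

class AutoInitClassifier(BaseEstimator):
    def __init__(self, intValue=0, stringParam="defaultValue"):
        # copy every constructor argument to an attribute of the same name
        args, _, _, values = inspect.getargvalues(inspect.currentframe())
        values.pop("self")
        for arg, val in values.items():
            setattr(self, arg, val)

print(AutoInitClassifier(intValue=3).get_params())
# {'intValue': 3, 'stringParam': 'defaultValue'}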