
Passing categorical data to a Sklearn decision tree


There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:

Some advantages of decision trees are:

(...)

Can handle both numeric and categorical data. Other techniques typically specialize in analyzing data sets that have only one type of variable. See the algorithms for more information.

However, running the following script

gives the following error:
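The script and its traceback were not preserved in this copy. A minimal sketch that triggers the same kind of failure, assuming pandas and scikit-learn are installed and using made-up column names and values (not the original poster's data), might look like this:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: two string-valued columns and one numeric column.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

tree = DecisionTreeClassifier()
try:
    # The tree tries to convert every feature to a float and fails on strings.
    tree.fit(data[["A", "B", "C"]], data["Class"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The string target column is not the problem here; it is the string-valued feature columns that the tree cannot convert.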

I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?

Reply:


Contrary to the accepted answer, I would prefer to use the tools provided by Scikit-Learn for this purpose. The main reason for this is that they can be easily integrated into a pipeline.

Scikit-Learn itself offers very good classes for dealing with categorical data. Instead of writing your own custom function, consider using LabelEncoder, which was developed specifically for this purpose.

Note the following code from the documentation:
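The snippet itself is missing from this copy. A minimal sketch in the spirit of the LabelEncoder documentation, with illustrative city names as the assumed input, could be:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Learn the set of distinct labels; classes_ holds them in sorted order.
le.fit(["paris", "paris", "tokyo", "amsterdam"])
classes = list(le.classes_)

# Each label is replaced by its index in classes_.
encoded = le.transform(["tokyo", "tokyo", "paris"])

print(classes)
print(encoded)
```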

This will automatically encode them into numbers for your machine learning algorithms. It now also supports going back from integers to strings. You can do this by simply calling inverse_transform as follows:
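The call is missing from this copy. As a sketch of the round trip, assuming an encoder fitted on illustrative city names:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on some example labels (illustrative data, not from the original post).
le = LabelEncoder().fit(["paris", "paris", "tokyo", "amsterdam"])

codes = le.transform(["tokyo", "tokyo", "paris"])

# inverse_transform maps the integer codes back to the original strings.
decoded = le.inverse_transform(codes)

print(list(decoded))
```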

This returns the original string labels.

Also note that for many classifiers other than decision trees, such as logistic regression or SVM, you will want to encode your categorical variables using one-hot encoding. Scikit-learn also supports this through the OneHotEncoder class.
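A minimal one-hot encoding sketch, assuming a single made-up color column; the default output is a sparse matrix, so it is converted to a dense array here only for display:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, one row per sample (illustrative values).
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(X).toarray()

# Column order follows the sorted categories found during fit.
categories = list(encoder.categories_[0])

print(categories)
print(one_hot)
```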

Hope that helps!







(This is just a reformatting of my 2016 comment above ... it still applies.)

The accepted answer to this question is misleading.

Currently, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using label encoding converts the categories to integers, which the decision tree then treats as numeric. If your categorical data is not ordinal, this is no good - you end up with splits that do not make sense.

Using a OneHotEncoder is currently the only valid way: it allows arbitrary splits that do not depend on the label order, but it is computationally expensive.
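One way to combine one-hot encoding with a tree is a pipeline. The following is only a sketch under assumed column names (color, size) and made-up data, not a recommendation for any particular dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: one nominal column and one numeric column.
X = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size": [1, 2, 3, 2, 1, 3],
})
y = ["n", "y", "n", "y", "n", "n"]

# One-hot encode only the nominal column; pass the numeric one through as-is.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)

predictions = list(model.predict(X))
print(predictions)
```

Because each category gets its own binary column, the tree can isolate any single category in one split, regardless of how the labels would sort.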






(..)

Can handle both numeric and categorical data.

This just means that you can use

  • the DecisionTreeClassifier class for classification problems
  • the DecisionTreeRegressor class for regression.

In any case, you need to one-hot encode categorical variables before fitting a tree with sklearn, as follows:
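The original snippet is missing here. One possible sketch, assuming pandas is available and using pandas.get_dummies on made-up columns, is:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: A and B are categorical strings, C is numeric.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# get_dummies expands string columns into indicator columns
# and leaves numeric columns unchanged.
features = pd.get_dummies(data[["A", "B", "C"]])
feature_names = list(features.columns)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(features, data["Class"])

print(feature_names)
```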



For nominal categorical variables, I would not use LabelEncoder, but a one-hot encoder such as OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.


Sklearn decision trees do not handle conversion of categorical strings to numbers. I suggest you find a function (maybe this one) in Sklearn that does this, or manually write code like:
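The hand-written code is missing from this copy. A minimal sketch of such a manual mapping, with a hypothetical helper name (categories_to_ints) chosen here for illustration, might be:

```python
def categories_to_ints(values):
    """Map each distinct category to an integer code.

    Hypothetical helper: codes follow the sorted order of the distinct
    values, so the numbering is deterministic but otherwise arbitrary.
    """
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


codes, mapping = categories_to_ints(["a", "a", "b", "a"])
print(codes)
print(mapping)
```

Note that integer codes produced this way impose an order on the categories, so the earlier caveat about non-ordinal data applies to this approach as well.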



