How to understand XGBoost intuitively



1. The ensemble-algorithm idea

2. The basic idea of XGBoost

3. Installing XGBoost on macOS

4. Implementing the XGBoost algorithm in Python

In competitions, the XGBoost algorithm is widely used, and using it usually improves a model's accuracy. What does it actually do, from start to finish, that works so well? And how does it do it?

First, let's get an intuitive sense of what XGBoost is. The XGBoost algorithm is closely related to the decision tree algorithm, which I covered in another blog post:


In a decision tree, a sample is routed left or right at each split until it reaches a leaf node, where it is assigned a class. Decision trees can also perform regression tasks.

Look at the left side of the figure above: there are 5 samples, and we want to predict whether each of these 5 people is willing to play games. The 5 people are partitioned into leaf nodes, and each leaf node is assigned a weight: a positive number indicates the person is willing to play, and a negative number indicates they are not. The combination of leaf nodes and weights therefore fully determines the prediction for each person. The weight of the little boy's leaf node in "Tree1" is +2 (which can be understood as a score).

A single decision tree is generally not very effective on its own, and its decisions are too absolute. So we usually use an ensemble approach: if one tree is not very effective, what about two trees?

Look at "Tree2" on the right side of the figure: it differs from the left tree in that it splits on a different attribute, daily computer usage, rather than age or gender. Together, the two trees decide whether a person is willing to play games. The little boy's weight in "Tree1" is +2 and in "Tree2" is +0.9, so his final score is +2.9. Grandpa's final score is obtained through the same process.
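The additive scoring described above can be sketched in a few lines. The boy's scores (+2, +0.9) come from the text; grandpa's leaf values are an assumption for illustration, since the original figure is missing:

```python
# Each tree maps a person to a leaf score; the ensemble's prediction
# is simply the sum of the leaf scores across all trees.
tree1 = {"boy": +2.0, "grandpa": -1.0}   # grandpa's value is an assumed example
tree2 = {"boy": +0.9, "grandpa": -0.9}   # grandpa's value is an assumed example

def ensemble_score(person):
    # Sum this person's leaf scores over all trees in the ensemble.
    return tree1[person] + tree2[person]

print(ensemble_score("boy"))  # 2.0 + 0.9 = 2.9
```

Adding a third tree would simply add a third term to the sum; this is the core of the additive model discussed below.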

When doing classification or regression tasks, we need to keep in mind that a single classifier may not express the data very well. That is when we turn to this ensemble idea. The figure above shows only two classifiers, but in practice many weak classifiers can be combined into one strong classifier.

How is the ensemble represented inside XGBoost? How does it predict? What objective are we optimizing? The figure below answers these questions at a glance.


In XGBoost, trees are added one at a time, and each new tree is expected to improve the result. The figure below shows the additive (core) representation of XGBoost.

At the start there are no trees, so the prediction is 0. Adding a tree is equivalent to adding a function; adding a second tree adds another function, and so on. Here we must ensure that each newly added tree improves the overall result. "Improving the result" means that after adding a new tree, the value of the objective function (that is, the loss) decreases.

If there are too many leaf nodes, the risk of overfitting grows, so the number of leaf nodes must be limited. We therefore add a penalty term Ω(f_t) to the original objective function.


Here is a simple example of how the penalty term Ω(f_t) is calculated:

There are 3 leaf nodes in total, with weights 2, 0.1, and -1 respectively. Substituting them into Ω(f_t) gives the formula in the figure above. The penalty coefficients γ and λ are set by hand.
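The calculation can be sketched directly from the standard penalty Ω(f) = γT + ½λ·Σ w_j² (the coefficient values γ = λ = 1 below are only example choices):

```python
def omega(weights, gamma, lam):
    # gamma * T penalizes the number of leaves T;
    # the lambda term penalizes large leaf weights (L2 regularization).
    T = len(weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in weights)

# Three leaves with weights 2, 0.1, -1:
# omega = gamma*3 + 0.5*lambda*(4 + 0.01 + 1)
print(omega([2, 0.1, -1], gamma=1.0, lam=1.0))  # 3 + 0.5*5.01 = 5.505
```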

The complete objective function of the XGBoost algorithm is shown in the formula below; it is formed by adding the individual sample losses and the regularization penalty Ω(f_t).
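Since the original figure is missing, here is a reconstruction of that objective in the standard XGBoost notation (a sketch, not the original figure):

```latex
\text{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
```

where $l$ is the per-sample loss, $K$ is the number of trees, $T$ is the number of leaves in a tree, and $w_j$ are the leaf weights.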

This article does not present the derivation of the objective function in detail. The process is: take the partial derivative of the objective function with respect to the leaf weights, obtain the weights that minimize the objective, and substitute those weights back in. The result of this substitution is the minimum value of the objective function, as follows:
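The resulting formulas, reconstructed from the standard XGBoost derivation since the figure is missing: writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ for the sums of first and second derivatives over the samples in leaf $j$,

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T
```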

The first and second derivatives in the third formula (g_i and h_i) can both be computed from the training data once the loss function is specified, so they are known values. Here is an intuitive example of this process.

(One more note: Obj represents, for a given tree structure, the maximum reduction achievable in the objective; we can call it the structure score. The smaller it is, the better.)

For every expansion we still have to enumerate all possible splits. For a particular split, we compute the sum of the derivatives over the left subtree and over the right subtree (the first red box in the figure below), then compare them with the values before the split: use the objective to check whether, and by how much, the loss changed from before the split to after. Traverse all candidate splits and select the one with the largest improvement as the best split.


Install XGBoost with pip

The first step is to install Homebrew.

Homebrew is a package manager for macOS, similar to apt-get on Linux.

The second step is to install llvm

The third step is to install clang-omp

Someone mentioned that clang-omp has been removed from Homebrew. If you can't find clang-omp, you can try an alternative such as libomp.

The fourth step is to install XGBoost

Test it and you are done!
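The four steps above can be sketched as the following commands (hedged: package names and the Homebrew install URL may change over time, so check the official pages before running):

```shell
# Step 1: install Homebrew (official install script)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Step 2: install llvm
brew install llvm

# Step 3: install clang-omp; if it is no longer in Homebrew, try libomp instead
brew install libomp

# Step 4: install XGBoost via pip
pip install xgboost

# Test it: print the installed version
python -c "import xgboost; print(xgboost.__version__)"
```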


The pima-indians-diabetes.csv file contains 8 columns of numeric independent variables; the 9th column is a binary (0/1) dependent variable. Importing it into Python and modeling it with the XGBoost algorithm gives a prediction accuracy of 77.95%.

Results output:

The main class in Python's XGBoost package is XGBClassifier(), which takes a variety of parameters. The plot_importance() function is also worth attention. I will update this later.