Decision Tree Algorithm in Python: Decision Tree Implementation with scikit-learn

One of the cutest and most lovable supervised algorithms is the Decision Tree algorithm. It can be used for both classification and regression purposes. In the previous article we gave enough of an introduction to the working aspects of the decision tree algorithm. In this article, we are going to build a decision tree classifier in Python using the scikit-learn machine learning package, applied to the balance scale dataset. In short, this article explains how to implement a Decision Tree classifier on the Balance Scale data set; we will program our classifier in Python using its sklearn library.

Decision tree algorithm prerequisites

Before we start building the decision tree classifier in Python, please gain enough knowledge of how the decision tree algorithm works. If you don't yet have a basic understanding of the algorithm, you can spend some time on the earlier article. Once we have finished modeling the Decision Tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or is balanced. The great thing about using sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code. Before we get started, let's quickly look into the assumptions we make while creating the decision tree, and the decision tree algorithm pseudocode.
Assumptions we make while using the Decision tree:

1. In the beginning, the whole training set is considered at the root.
2. Feature values are preferred to be categorical. If values are continuous, they are discretized prior to building the model.
3. Records are distributed recursively on the basis of attribute values.
4. The order in which attributes are placed as the root or internal nodes of the tree is decided by some statistical approach.

Decision Tree Algorithm Pseudocode
1. Place the best attribute of our dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
3. Repeat steps 1 and 2 on each subset until you find leaf nodes in all the branches of the tree (a minimal Python sketch of this recursion appears after the next paragraph).

While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. But this tuning should be done carefully: by doing it blindly, our algorithm can overfit the training data and ultimately build a model that generalizes poorly.
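To make the three pseudocode steps concrete, here is a minimal, illustrative sketch of the recursion, using entropy/information gain as the statistical approach. The helper names (entropy, best_attribute, build_tree) are hypothetical, and this is a sketch of the idea, not sklearn's implementation.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 1: pick the attribute whose split gives the highest information gain."""
    base = entropy(labels)

    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 2-3: split on the best attribute and recurse until leaves are pure."""
    if len(set(labels)) == 1:                       # pure subset -> leaf node
        return labels[0]
    if not attributes:                              # nothing left to split on
        return Counter(labels).most_common(1)[0][0] # majority vote
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        tree[attr][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != attr],
        )
    return tree

Here rows is a list of dictionaries keyed by attribute name, so the returned tree is a nested dictionary mapping attribute values to subtrees or leaf labels.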
Sklearn Library Installation

Python's sklearn library holds tons of modules that help to build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, algorithms, and so on. It is similar to the caret library in R programming. To use it, we first need to install it. The best way to install data science libraries and their dependencies is by installing the Anaconda package; you can also install only the libraries you need. The sklearn library provides direct access to different modules for training our model with different machine learning algorithms, such as the decision tree.
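As a quick sanity check after installation (whether via Anaconda or pip), the following snippet simply verifies that the three libraries used in this article import correctly:

# If any of these imports fail, install the packages first, e.g. with
# `pip install numpy pandas scikit-learn` or through Anaconda.
import numpy
import pandas
import sklearn

print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)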
Balance Scale Data Set Description

The Balance Scale data set consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the Class attribute. The target attribute is the 1st column.

1. Class Name (target variable): 3 values (L, B, R)
   - "R": balance scale tips to the right
   - "L": balance scale tips to the left
   - "B": balance scale is balanced
2. Left-Weight: 5 values (1, 2, 3, 4, 5)
3. Left-Distance: 5 values (1, 2, 3, 4, 5)
4. Right-Weight: 5 values (1, 2, 3, 4, 5)
5. Right-Distance: 5 values (1, 2, 3, 4, 5)

Balance Scale Problem Statement

The problem we are going to address is to model a classifier for evaluating the balance tip's direction.

Decision Tree Classifier Implementation in Python with the sklearn Library

The modeled Decision Tree will compare a new record's metrics with the prior records (training data) that correctly classified the balance scale's tip direction.
Python packages used:

NumPy

NumPy is a Numeric Python module. It provides fast mathematical functions and robust data structures for the efficient computation of multi-dimensional arrays and matrices.
We use NumPy to read data files into NumPy arrays and for data manipulation.

Pandas

Pandas provides the DataFrame object for data manipulation and supports reading and writing data between different file formats. DataFrames can hold different types of data in a tabular structure; a toy illustration of both structures follows below.
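This small example is only illustrative; the column names are borrowed from the balance scale attributes rather than read from the real file.

import numpy as np
import pandas as pd

# NumPy array: homogeneous data with fast, vectorized arithmetic.
weights = np.array([1, 2, 3, 4, 5])
print(weights * 2)                      # element-wise multiplication

# pandas DataFrame: labeled columns that may hold different data types.
df = pd.DataFrame({"Left-Weight": weights, "Class": list("LLBRR")})
print(df.head())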
Scikit-learn

It's a machine learning library that includes various machine learning algorithms. We are using its train_test_split, DecisionTreeClassifier, and accuracy_score modules. If you haven't set up a machine learning environment on your system yet, the posts below will be helpful.

Importing Python Machine Learning Libraries

This section involves importing all the libraries we are going to use: numpy, pandas, and sklearn's train_test_split, DecisionTreeClassifier, and accuracy_score, along with from sklearn import tree. NumPy arrays and pandas dataframes will help us in manipulating the data; as discussed above, sklearn is a machine learning library. The imports are shown below.
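Put together, the imports might look like the following. Note that the original article predates scikit-learn 0.20, where train_test_split lived in sklearn.cross_validation; current versions provide it in sklearn.model_selection.

import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Older scikit-learn: from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split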
The train_test_split method will help us by splitting the data into train and test sets (in older scikit-learn versions it lived in the cross_validation module; current versions provide it in model_selection). The tree module will be used to build the Decision Tree classifier, and the accuracy_score function will be used to calculate accuracy metrics from the predicted class labels.

Data Import

For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we need to download the dataset; you can download it from the UCI Machine Learning Repository.
All the data values are separated by commas. After downloading the data file, we will use pandas' read_csv method to import the data into a pandas dataframe. Since our data is separated by commas and there is no header row in the file, we pass the header parameter as None and the sep parameter as ",".

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

The above snippet splits the data into training and test sets. X_train and y_train are the training data, while X_test and y_test belong to the test dataset. The test_size parameter is given the value 0.3, meaning the test set will be 30% of the whole dataset and the training set 70% of it. The random_state parameter seeds the pseudo-random number generator used for random sampling; if you want to replicate our results, use the same random_state value.
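Putting the loading and splitting steps together, a sketch might look like the following. The UCI download URL and the column slicing are assumptions based on the dataset description above, so adjust the path to wherever you saved balance-scale.data if it differs.

# Load the comma-separated balance scale data; it has no header row.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "balance-scale/balance-scale.data")
balance_data = pd.read_csv(url, sep=",", header=None)

print("Dataset length:", len(balance_data))
print("Dataset shape:", balance_data.shape)

# Column 0 is the target class (L, B, R); columns 1-4 are the features.
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]

# 70% training, 30% test; fixed random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)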
Decision Tree Training

Now we fit the decision tree algorithm on the training data, predict labels for the test dataset, and print the accuracy of the model under various parameter settings.

DecisionTreeClassifier: this is the classifier class for decision trees and the main entry point for the algorithm. Some important parameters are:

- criterion: defines the function used to measure the quality of a split. Sklearn supports "gini" for the Gini index and "entropy" for information gain. By default, it takes the "gini" value.
- splitter: defines the strategy used to choose the split at each node. Supports "best" to choose the best split and "random" to choose the best random split. By default, it takes the "best" value.
- max_features: defines the number of features to consider when looking for the best split. It accepts an integer, float, string, or None. If an integer is given, that many features are considered at each split; if a float, it is the fraction of features considered at each split; if "auto" or "sqrt", then max_features = sqrt(n_features); if "log2", then max_features = log2(n_features); if None, then max_features = n_features. By default, it takes the None value.
- max_depth: denotes the maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the None value.
- min_samples_split: the minimum number of samples required to split an internal node. If an integer, it is taken as the minimum count; if a float, it is taken as a fraction. By default, it takes the value 2.
- min_samples_leaf: the minimum number of samples required to be at a leaf node. If an integer, it is taken as the minimum count; if a float, it is taken as a fraction. By default, it takes the value 1.
- max_leaf_nodes: defines the maximum number of possible leaf nodes. If None, an unlimited number of leaf nodes is allowed. By default, it takes the None value.
- min_impurity_split: defines the threshold for early stopping of tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf. (Note that this parameter is deprecated in recent scikit-learn versions.)

Let's build classifiers using the gini index and information gain criteria, fit them using the fit method, and visualize the resulting trees; a sketch follows below.
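One way to build the two classifiers is sketched below. The max_depth and min_samples_leaf values are illustrative choices, not the only reasonable ones.

# Classifier using the gini index criterion.
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

# Classifier using entropy (information gain) as the criterion.
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

# One way to visualize a fitted tree (needs matplotlib, scikit-learn >= 0.21):
# import matplotlib.pyplot as plt
# tree.plot_tree(clf_gini)
# plt.show()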
Decision Tree Classifier with criterion gini index

Predicting the test records with the gini-trained classifier returns an array of class labels such as array(['L', 'L', 'L', 'R', 'R', 'R', ...], dtype=object), as in the sketch below.
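A minimal prediction sketch with the gini-trained model; the single record [4, 4, 3, 3] is a made-up example (a heavier weight farther out on the left, so we would expect 'L').

# Predict class labels for the whole test set.
y_pred_gini = clf_gini.predict(X_test)
print(y_pred_gini)     # e.g. array(['R', 'L', ...], dtype=object)

# Predict a single hypothetical record:
# Left-Weight=4, Left-Distance=4, Right-Weight=3, Right-Distance=3.
print(clf_gini.predict([[4, 4, 3, 3]]))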
Calculating the Accuracy Score

The accuracy_score function will be used to print the accuracy of the decision tree algorithm. By accuracy, we mean the ratio of correctly predicted data points to all predicted data points. Accuracy as a metric helps to understand the effectiveness of our algorithm. It takes four parameters: y_true, y_pred, normalize, and sample_weight. Of these, normalize and sample_weight are optional. The parameter y_true accepts an array of correct labels, and y_pred takes an array of predicted labels returned by the classifier. It returns the accuracy as a float value, as in the sketch below.
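A short usage sketch, reusing the classifiers and predictions from above:

# accuracy_score(y_true, y_pred) returns the fraction of correct predictions.
print("Accuracy with gini index:",
      accuracy_score(y_test, y_pred_gini) * 100)

y_pred_entropy = clf_entropy.predict(X_test)
print("Accuracy with information gain:",
      accuracy_score(y_test, y_pred_entropy) * 100)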
Finally, we print the accuracy for the decision tree classifiers built with the gini index and information gain criteria.

Conclusion

In this article, we have learned how to model the decision tree algorithm in Python using the Python machine learning library scikit-learn.
In the process, we learned how to split the data into train and test datasets. To model the decision tree classifier we used the information gain and gini index split criteria. In the end, we calculated the accuracy of these two decision tree models. I hope you like this post. If you have any questions, feel free to comment below.
If you want me to write on one particular topic, then do tell me in the comments below.

Hi Eli, it was a great intuition to think about continuous and categorical variables while modeling machine learning models. When it comes to the decision tree, it matters a lot: at each node we use a feature's value (categorical or continuous) to calculate the cutoff value, and the final decision depends on that cutoff. With categorical variables we end up with only a few candidate splits, whereas a continuous variable produces a large set of candidate cutoff values.