Discovering Random Forest and CNNs

A brief introduction to the Random Forest and CNN concepts in the field of artificial intelligence!

Have you heard of Random Forest and CNNs? What are they? What are they for? How are they used? Here is a brief description of both concepts, because, after all, staying ignorant is a choice. :)

Random Forest is a machine learning algorithm that is easy and flexible to use and can deliver excellent results, but to get them you have to be especially careful with its hyperparameters in order to avoid overfitting and other problems.

In practice, this supervised algorithm builds an ensemble of decision trees, its "forest", most of the time through the bagging method, which applies the idea that combining many models yields greater accuracy and more stable results.
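The bagging idea can be sketched in a few lines. This is only a toy illustration in plain NumPy, not the real Random Forest algorithm: it trains depth-1 "stumps" on bootstrap samples of a hypothetical 1-D dataset and predicts by majority vote.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 1-D dataset: the class is 1 whenever x > 5.
X = np.arange(20, dtype=float)
y = (X > 5).astype(int)

def fit_stump(X_sample, y_sample):
    """Fit a depth-1 'decision stump': pick the threshold with the fewest errors."""
    best_t, best_err = None, np.inf
    for t in X_sample:
        err = np.sum((X_sample > t).astype(int) != y_sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Bagging: train each stump on a bootstrap sample (drawn with replacement).
thresholds = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    thresholds.append(fit_stump(X[idx], y[idx]))

# Predict by majority vote across the ensemble.
def predict(x):
    votes = [int(x > t) for t in thresholds]
    return int(np.mean(votes) > 0.5)

print(predict(2.0), predict(10.0))  # → 0 1
```

Because each stump sees a slightly different resample of the data, the vote of the ensemble is more stable than any single stump.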

One of the great advantages of this algorithm is that it can be used for both classification and regression problems. A typical application is in recommendation systems where, based on the user's choices and answers, we have to recommend a product or service. Another advantage is the possibility of obtaining the importance of each feature to the model; with this we can reduce the number of features and avoid overfitting.
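As a sketch of the feature-importance idea, assuming scikit-learn and its built-in Iris dataset are available:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ ranks how much each feature contributed to the splits.
for name, score in zip(load_iris().feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

The scores sum to 1, so features with near-zero importance are candidates for removal.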

As already mentioned, Random Forest is a set of decision trees with some differences, mainly the use of randomness in choosing the features considered at each split, instead of relying only on information-gain and Gini-index calculations; another difference is that each tree is trained on a subset of the data, which helps avoid overfitting.

In this algorithm, the hyperparameters are very important: they can increase both the predictive power of the model and its speed. Using sklearn's hyperparameters we can control the number of trees built, the maximum number of features to be used at each split, the minimum number of samples a leaf must hold, and the number of processors the algorithm can use; we can also make the result of the model replicable and run cross-validation on the random forest.
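A minimal scikit-learn sketch of these hyperparameters (the values here are only illustrative, not tuned), again using the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=200,      # number of trees built
    max_features="sqrt",   # maximum features considered at each split
    min_samples_leaf=2,    # minimum samples a leaf must hold
    n_jobs=-1,             # use all available processors
    random_state=42,       # makes the result replicable
)

# 5-fold cross-validation over the forest.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```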

A recurring concern is overfitting, but with enough trees in the forest the classifier tends not to over-adjust to the training data. Another problem is that a large number of trees makes the model slow: many trees are used for better accuracy, but prediction time grows with them, so in real life the use of this algorithm in real-time scenarios should be thought through very carefully and always tested.


CNNs, or Convolutional Neural Networks, are adopted mainly for image recognition, a typical classification problem. They arose from Hubel and Wiesel's 1962 study of neurons, starting from the concept of filtering lines, curves, and edges, with each added layer turning this filtering into a more complex representation of the image.

The first step in using them for image classification is to convert the images to RGB format, allowing them to be interpreted as three-dimensional arrays, usually with three channels holding the values of each pixel.
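For example, a tiny hypothetical image can be represented in NumPy as a height × width × channels array:

```python
import numpy as np

# A hypothetical 4x4 RGB image: height x width x 3 channels,
# each pixel value in 0..255.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :, 0] = 255           # a pure-red image: red channel at maximum
print(image.shape)             # → (4, 4, 3)

# Pixel values are usually scaled to [0, 1] before being fed to a CNN.
scaled = image.astype(np.float32) / 255.0
```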

The next step is to apply convolutions, also called filters or kernels depending on the literature, to extract the most striking features of each image. The chosen output depth of each layer determines the number of filters adopted: the greater the depth, the more details will be analyzed.
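A minimal NumPy sketch of this filtering step, using a hypothetical vertical-edge kernel (like most frameworks, it actually computes cross-correlation, which is what "convolution" usually means in deep learning):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image, no padding."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter: it responds where pixel values change left to right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1, 1],
                        [-1, 1]], dtype=float)
print(conv2d(image, edge_kernel))  # strongest response in the middle column
```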

Another resource we can use in CNNs is activation functions, which can greatly affect the accuracy of the model, in addition to bringing non-linearity to the model and enabling better learning by the neural network.
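The most common choice in CNNs is ReLU, which can be sketched in one line:

```python
import numpy as np

def relu(x):
    """ReLU activation: zeroes out negatives, passes positives through."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 0.5],
                        [ 3.0, -0.1]])
print(relu(feature_map))  # negatives become 0, positives are unchanged
```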

The pooling layer is used to simplify the information from the previous layer. A widely used method for this layer is max pooling, which summarizes the data and helps avoid overfitting.
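Max pooling can be sketched as follows, assuming 2x2 windows with stride 2 (the most common configuration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling, stride 2: keep only the strongest activation per window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = feature_map[i:i+size, j:j+size].max()
    return out

fm = np.array([[1, 3, 0, 2],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]], dtype=float)
print(max_pool(fm))  # → [[4. 2.] [2. 8.]]
```

Each 4x4 block of activations is reduced to 2x2, which halves each spatial dimension while keeping the most salient responses.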

And finally, we have the fully connected layer, whose input is the output of the previous layer and whose output is the neurons that will be used in the image classification.
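A minimal sketch of a fully connected layer, with hypothetical sizes, is just a matrix product plus a bias (followed here by softmax to turn the scores into class probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Flattened output of the last pooling layer (sizes chosen for illustration).
flattened = rng.random(8)        # 8 input activations
weights = rng.random((3, 8))     # one row of weights per output class
bias = np.zeros(3)

# Fully connected layer: matrix product plus bias.
logits = weights @ flattened + bias

# Softmax converts the raw scores into probabilities that sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs, probs.argmax())     # the argmax is the predicted class
```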
