Machine Learning Tooling - Great selection of tools to make your Machine Learning life easier
(CATEGORICAL)
NOMINAL
variables with no natural order, i.e. characteristics (pet: cat / dog / hamster)
ORDINAL
variables can be ordered (level of education)
(NUMERICAL)
DISCRETE
variables can take only certain values
CONTINUOUS
fully arithmetical variables that can take any value in a range
IF your data is ORDINAL, it is sufficient to use simple integer encoding (0, 1, 2, 3, etc.). BUT! More often than not, such a solution confuses or slows down the ML algorithm (the integers get reinterpreted as arithmetic values, producing meaningless in-between predictions). That's why we use ONE HOT ENCODING (a boolean column for each category).
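A minimal sketch of both encodings using pandas; the toy data frame and column names are made up for illustration:

import pandas as pd

# hypothetical toy data: 'pet' is nominal, 'education' is ordinal
df = pd.DataFrame({"pet": ["cat", "dog", "hamster", "dog"],
                   "education": ["primary", "secondary", "tertiary", "secondary"]})

# ordinal feature: a simple integer mapping preserves the order
edu_order = {"primary": 0, "secondary": 1, "tertiary": 2}
df["education_enc"] = df["education"].map(edu_order)

# nominal feature: one boolean column per category (one hot encoding)
df = pd.get_dummies(df, columns=["pet"])
print(df)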
HAMMING DISTANCE
between boolean values, usual for one hot encoded tables
sum(e1 != e2 for e1, e2 in zip(a, b))
scipy.spatial.distance.hamming(a, b)
EUCLIDEAN DISTANCE
between real valued vectors, usual for tables
np.linalg.norm(x - y)
np.sqrt(np.sum(np.square(x-y)))
MANHATTAN DISTANCE
between real valued vectors, preferable for uniform grids and integer feature spaces (think of Manhattan's rectangular street grid, where you only have 4 directions of movement)
sum(abs(e1-e2) for e1, e2 in zip(a, b))
scipy.spatial.distance.cityblock(a, b)
MINKOWSKI DISTANCE
great exploratory tool; a generalization of the Euclidean and Manhattan distances.
sum(abs(e1-e2)**p for e1, e2 in zip(a, b)) ** (1/p)
scipy.spatial.minkowski_distance(a, b, p)   (Manhattan: p = 1, Euclidean: p = 2, and everything in between)
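A small usage sketch of the snippets above on made up vectors; note that scipy's hamming returns the fraction of differing positions, so multiply by the length to get a raw mismatch count:

import numpy as np
from scipy.spatial import distance, minkowski_distance

a = np.array([1, 0, 1, 1, 0])   # boolean / one hot style vectors
b = np.array([1, 1, 0, 1, 0])
x = np.array([1.0, 2.0, 3.0])   # real valued vectors
y = np.array([4.0, 0.0, 3.0])

print(distance.hamming(a, b) * len(a))        # 2.0 mismatches (scipy returns the fraction)
print(sum(e1 != e2 for e1, e2 in zip(a, b)))  # 2
print(np.linalg.norm(x - y))                  # Euclidean, ~3.61
print(distance.cityblock(x, y))               # Manhattan, 5
print(minkowski_distance(x, y, 1.5))          # Minkowski with p = 1.5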
POWERFUL SUPERVISED CLASSIFIERS
DECISION TREES
- Essentially chains of boolean decisions (splits) on the features.
- The more factors, the more SPLITS. The more SPLITS, the DEEPER the tree.
- Can do either classification or regression
- At the end of the tree are the LEAVES (leaf nodes), which make the prediction based on the previous splits.
- Quite good at mapping non-linear relationships
root node : entire sample that gets further divided
pruning : removing sub-nodes of a decision node
purity : subset composed of only a single class is considered pure
entropy : quantifies the randomness (disorder) within a set of class values. Used to calculate the homogeneity (impurity) of a sample. Completely homogeneous = 0, equally divided between two classes = 1. If a group's entropy is high, it is very diverse and cannot give us much info about other items that belong to the same group.
gini impurity : used at each node to decide which feature is best to split on. It equals 0 when all cases in the node fall into a single target category. The closer to 0 the better.
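A minimal numpy sketch of the two impurity measures above, with made up labels just to show the pure (0) and equally divided endpoints:

import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution: 0 = pure, 1 = 50/50 for two classes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: 0 when every case falls into a single class
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = ["spam"] * 10                 # completely homogeneous node
mixed = ["spam"] * 5 + ["ham"] * 5   # equally divided node
print(entropy(pure), gini(pure))     # zero impurity for the pure node
print(entropy(mixed), gini(mixed))   # 1.0  0.5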
RANDOM FOREST
- N tree estimators built on randomly sampled training data
- Random subsets of features when splitting nodes
- Final prediction is the average (regression) or majority vote (classification) of trees that are largely uncorrelated with each other (see the sketch after the strengths/weaknesses list below).
STRENGTHS
- Performs well on most problems
- Handles noisy or missing data
- Reduces risk of overfitting
- Handles categorical & continuous features
- Selects only most important features
- Handles extremely large datasets
WEAKNESSES
- Not easily interpretable
- Needs thorough tuning
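A minimal random forest sketch with scikit-learn; the synthetic dataset and the hyperparameter values are only placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N tree estimators, each grown on a bootstrap sample with a random feature subset per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))    # accuracy on held out data
print(rf.feature_importances_)     # which features the forest relied on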
GRADIENT BOOSTING
- Like RFC, but not so random. Kind of like tree-breeding.
- Evaluates a cost (loss) function on the previous trees and uses it to shape the next tree being created, which concentrates on the weakest (worst predicted) data points from the previous tree.
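This reads like gradient boosting; a minimal sketch under that assumption, using scikit-learn's GradientBoostingClassifier on placeholder data:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new tree is fit to the loss gradient of the current ensemble,
# i.e. it focuses on the examples the previous trees predicted worst
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))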
SUPPORT VECTOR MACHINE (SVM)
- Abstract concept of a machine that works as a LINEAR CLASSIFIER.
- Can be thought of as a combination of KNN and linear regression.
- Uses a boundary called hyperplane to partition data into groups
- Considers data as either linearly separable or not
- If data is not linearly separable, it maps the problem into a higher dimension (an extra "altitude" dimension) through a process called the KERNEL TRICK. This handles cases such as one class being clustered in the centre of the XY plane.
LINEAR KERNEL : simply a dot product xi * xj, good for linearly separable data
POLYNOMIAL KERNEL: (dot product + 1)^alpha, good for non-linearly-separable data
SIGMOID KERNEL: tanh(kappa * dotproduct - delta)
GAUSSIAN RBF KERNEL: exp(-gamma * ||xi - xj||^2); usually a good first kernel to try
C parameter - inverse of the tolerance for margin violations. The bigger the value, the narrower the margin (fewer misclassified points are tolerated).
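A minimal sketch of the kernels and the C parameter with scikit-learn's SVC; the concentric-circles dataset is a stand-in for the "one class in the centre of XY" situation:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# one class clustered in the centre: not linearly separable in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
poly = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)   # usual first try

print(linear.score(X, y))   # struggles: no separating hyperplane in the original space
print(poly.score(X, y))
print(rbf.score(X, y))      # kernel trick lifts the data, near perfect separation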
STRENGTHS:
- Universal application
- Resistant to noisy data and overfitting
- Easier to use than neural net
- High accuracy in data mining
WEAKNESSES:
- Requires thorough tuning
- Slow training
- Complex black box
K-NEAREST NEIGHBOURS (KNN)
- Predicts the value/label of an input based on the fitted data
- Input must have the same dimensions (features) as the fitted data
- Algorithm picks K neighbours of the input based on chosen proximity measure (closest points)
- Mean/mode of the K neighbours is the predicted value/label of the input
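A minimal KNN sketch with scikit-learn; K = 5 and the iris dataset are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5 neighbours, Euclidean proximity (p=2); the mode of their labels is the prediction
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))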
NAIVE BAYES
- Used in spam filtering
- Naive because it assumes the features (words) are independent and ignores their order. However, still effective.
- Zipf's law applies to the ranking of the words in a frequency table
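A minimal bag-of-words spam-filter sketch with scikit-learn's MultinomialNB; the tiny corpus and labels are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny made up corpus; labels: 1 = spam, 0 = ham
texts = ["win money now", "cheap pills win prize", "meeting at noon", "see you at lunch"]
labels = [1, 1, 0, 0]

# bag of words: word order is discarded, only frequencies are kept
vec = CountVectorizer()
X = vec.fit_transform(texts)

nb = MultinomialNB().fit(X, labels)
print(nb.predict(vec.transform(["win a cheap prize"])))   # likely spam, i.e. [1]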