MachineLearning.md

Handy Repos

Machine Learning Tooling - Great selection of tools to make your Machine Learning life easier

Handy knowledge

CURSE OF DIMENSIONALITY - THE FEWER DIMENSIONS, THE BETTER (PCA and LDA say hello!)

Major data categories:

(CATEGORICAL)

NOMINAL

categorical variables with no inherent order (pet: cat / dog / hamster) = characteristics

ORDINAL

variables can be ordered (level of education)

(NUMERICAL)

DISCRETE

variables can take only certain values (e.g. counts: 0, 1, 2, ...)

CONTINUOUS

fully arithmetical variables that can take any value within a range (e.g. height, temperature)

Encoding categorical data

IF your data is ORDINAL, it is sufficient to use simple integer encoding (0, 1, 2, 3 etc.). BUT! For NOMINAL data such a coding would mislead the ML algorithm (the integers imply an order and meaningless in-between values), that's why we use ONE HOT ENCODING (having a boolean column for each category)
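A minimal sketch with pandas (scikit-learn's OneHotEncoder is the other common route); the toy pet column is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"pet": ["cat", "dog", "hamster", "dog"]})

# ONE HOT ENCODING: one boolean column per category,
# so no false ordering is implied between cat / dog / hamster
encoded = pd.get_dummies(df, columns=["pet"])
print(encoded)
```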

Distances

Hamming (bitwise)

between boolean values, usual for one hot encoded tables

sum(e1 != e2 for e1, e2 in zip(a, b))

scipy.spatial.distance.hamming(a, b) (note: scipy returns the fraction of mismatched positions, not the count)

Euclidean

between real valued vectors, usual for tables

np.linalg.norm(x - y)

np.sqrt(np.sum(np.square(x-y)))

Manhattan

between real valued vectors, preferable for uniform grids and integer feature spaces - like the rectangular street grid of Manhattan, where you have 4 directions of movement

sum(abs(e1-e2) for e1, e2 in zip(a, b))

scipy.spatial.distance.cityblock(a, b)

Minkowski Distance

a great exploratory tool - a generalization of the Euclidean and Manhattan distances.

sum(abs(e1-e2)**p for e1, e2 in zip(a, b))**(1/p)

scipy.spatial.minkowski_distance(a, b, p) - p = 1 gives Manhattan, p = 2 gives Euclidean, and everything in between
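A quick sketch comparing the hand-rolled versions against scipy on two small integer vectors:

```python
import numpy as np
from scipy.spatial import distance, minkowski_distance

a = np.array([1, 0, 2, 3])
b = np.array([1, 1, 0, 3])

# Hamming: positions that differ (scipy gives the fraction, so scale by length)
print(sum(e1 != e2 for e1, e2 in zip(a, b)))   # 2
print(distance.hamming(a, b) * len(a))         # 2.0

# Euclidean and Manhattan
print(np.linalg.norm(a - b))                   # ~2.236 (sqrt of 5)
print(distance.cityblock(a, b))                # 3

# Minkowski generalizes both: p = 1 -> Manhattan, p = 2 -> Euclidean
print(minkowski_distance(a, b, 1), minkowski_distance(a, b, 2))
```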

MODELS:

DECISION TREES (CART)

POWERFUL SUPERVISED CLASSIFIERS

  • Essentially - chains of boolean decisions (splits).
  • The more factors, the more SPLITS. The more SPLITS, the DEEPER the tree.
  • Can do either classification or regression
  • At the end of the tree, at the bottom of the sub-trees, are the LEAVES (leaf nodes) that make the prediction based on the previous splits.
  • Quite good at mapping non-linear relationships
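A minimal sketch with scikit-learn's DecisionTreeClassifier on the built-in iris data; export_text prints the chain of boolean splits ending in leaves:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# max_depth caps how many rounds of splits the tree may make
tree = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)

# Every branch is a boolean test, every leaf makes the prediction
print(export_text(tree, feature_names=list(data.feature_names)))
```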

root node : entire sample that gets further divided

pruning : removing sub-nodes of a decision node

purity : subset composed of only a single class is considered pure

entropy : quantifies the randomness (disorder) within a set of class values. Used to calculate the homogeneity (impurity) of a sample. Completely homogeneous - 0, equally divided - 1. If a group's entropy is high, it is very diverse and cannot give us much info about other items that belong to the same group.

gini impurity : used at each node to decide which feature is best to split on. Equals 0 when all cases in the node fall into a single target category. The closer to 0, the better.
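A minimal sketch of both impurity measures in plain NumPy (binary log for entropy, so a 50/50 two-class split scores exactly 1):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure set, 1 for a 50/50 binary split
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() + 0.0)  # + 0.0 normalizes -0.0

def gini(labels):
    # Gini impurity: 0 when all cases fall into a single target category
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

print(entropy(["cat"] * 4))          # 0.0 (pure)
print(entropy(["cat", "dog"] * 2))   # 1.0 (equally divided)
print(gini(["cat", "dog"] * 2))      # 0.5 (worst case for two classes)
```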

RandomForestClassifier

  • N tree estimators built on randomly sampled training data
  • Random subsets of features when splitting nodes
  • Final prediction is the average (regression) or majority vote (classification) of the individual trees, which are largely uncorrelated with each other.
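A minimal sketch with scikit-learn's RandomForestClassifier on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with random feature subsets
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # held-out accuracy
print(clf.feature_importances_)    # which features the forest leaned on
```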

STRENGTHS

  • Performs well on most problems
  • Handles noisy or missing data
  • Reduces risk of overfitting
  • Handles both categorical & continuous features
  • Identifies the most important features
  • Handles large datasets well

WEAKNESSES

  • Not easily interpretable
  • Needs thorough tuning

Gradient Boosting

  • Like RFC, but not so random. Kind of like tree-breeding.
  • Each new tree is fitted using the gradient of a cost function evaluated on the previous ensemble, so it concentrates on the data points the earlier trees predicted worst (gradient descent in function space, not back-propagation).
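A minimal sketch with scikit-learn's GradientBoostingClassifier (dataset choice and hyperparameters are just illustrative defaults):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are built sequentially; each new one fits the residual errors
# of the ensemble so far, scaled down by learning_rate
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```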

Support vector machine (SVM)

  • Abstract concept of a machine that works as a LINEAR CLASSIFIER.
  • Can be thought of as a combination of KNN and linear regression
  • Uses a boundary called a hyperplane to partition data into groups
  • Considers data as either linearly separable or not
  • If data is not linearly separable, it maps the problem into a higher dimension (e.g. an added ALTITUDE feature) through a process called the KERNEL TRICK. That serves the case when data is accumulated in the centre of the XY plane.

LINEAR KERNEL : simply a dot product xi * xj, good for linearly separable data

POLYNOMIAL KERNEL: (dot product + 1)^alpha - good for non-linearly separable data

SIGMOID KERNEL: tanh(kappa * dotproduct - delta)

GAUSSIAN RBF KERNEL: usually a good first-try kernel

C parameter - inverse of the tolerance for margin violations. The bigger the value, the narrower the margin (a harder margin, with more risk of overfitting).
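A minimal sketch of the kernel trick with scikit-learn's SVC, using make_circles to produce exactly the "data accumulated in the centre of XY" case:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the XY plane
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)  # kernel trick adds the "altitude"

print(linear.score(X, y))  # poor: no straight hyperplane separates circles
print(rbf.score(X, y))     # near-perfect
```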

STRENGTHS:

  • Universal application
  • Resistant to noisy data and overfitting
  • Easier to use than a neural net
  • High accuracy in data mining

WEAKNESSES:

  • Requires thorough tuning
  • Slow training
  • Complex black box

KNeighborsClassifier

  • Predicts the value/label of an input based on the fitted data
  • The input must have the same dimensions (features) as the fitted data
  • The algorithm picks the K nearest neighbours of the input based on a chosen proximity measure (closest points)
  • The mean/mode of the K neighbours is the predicted value/label of the input
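A minimal sketch with scikit-learn's KNeighborsClassifier (the default proximity measure is Minkowski with p = 2, i.e. Euclidean):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each prediction is the majority label among the K = 5 closest points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```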

Naive Bayes Classifier

  • Used in spam filtering
  • Naive because it assumes the words are independent of each other and ignores their order. However, still effective.
  • Word frequencies in the frequency table follow Zipf's law (a word's frequency is roughly inversely proportional to its rank)
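A toy spam-filter sketch with scikit-learn's MultinomialNB; the four messages are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win free money now", "meeting at noon",
         "free prize win", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# Bag of words: only word counts survive, order is thrown away ("naive")
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money prize"])))  # ['spam']
```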