# Notes: Machine Learning in Production

*Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications*, Addison-Wesley

## Agile Manifesto

• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan

## Math

• `O(x^2)` denotes the terms `x^n` with `n ≥ 2` when `x` is small; the notation marks what can safely be ignored
• p-hacking — with ~20 repeated trials there is a high chance that at least one comes out with `p < 0.05` purely by chance (see the simulation below)
• Bonferroni correction — with `n` repeated trials, divide the significance threshold by `n`
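
A quick simulation makes the p-hacking and Bonferroni points concrete (a minimal sketch with numpy and scipy; the trial counts are made up):

```python
import numpy as np
from scipy import stats

# Simulate 20 experiments where the null hypothesis is TRUE (mean really is 0).
# Even so, some of them tend to come out "significant" at p < 0.05.
rng = np.random.default_rng(0)
n_trials, n_samples, alpha = 20, 30, 0.05

p_values = []
for _ in range(n_trials):
    sample = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    p_values.append(p)

print("uncorrected 'hits':", sum(p < alpha for p in p_values))
# Bonferroni correction: compare each p-value against alpha / n_trials instead.
print("Bonferroni 'hits': ", sum(p < alpha / n_trials for p in p_values))
```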

## Feature Encoding

• n-grams — combinations of multiple adjacent words, e.g. `White_House` as a single feature instead of `White` and `House` separately (see the sketch below)
• information loss (data processing inequality) — when you process data, you end up with less than or equal information, unless you join in extra data
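
A minimal n-gram sketch with scikit-learn's `CountVectorizer` (assuming scikit-learn is available; the example sentence is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and bigrams, so "White House" also
# becomes the single feature "white house".
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["The White House issued a statement"])
print(vectorizer.get_feature_names_out())  # includes 'white house' as one feature
```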

## Visualization

• Consider the audience. Less is more.
• autocorrelation plot for time series — the x-axis is the lag `k`; the y-axis is how much `y_t` is correlated with `y_{t-k}` (see the sketch below)
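
A minimal autocorrelation-plot sketch with pandas and matplotlib (the noisy sine series is made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Noisy sine wave with period 50: the autocorrelation should oscillate with the lag.
t = np.arange(500)
noise = 0.3 * np.random.default_rng(0).standard_normal(500)
series = pd.Series(np.sin(2 * np.pi * t / 50) + noise)

# x-axis: lag k; y-axis: correlation between y_t and y_{t-k}.
pd.plotting.autocorrelation_plot(series)
plt.show()
```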

## Metrics

• Jaccard similarity — the denominator is the size of the union (the numerator is the size of the intersection)
• cosine similarity — when the vectors represent set membership, the denominator is the geometric mean of the set sizes, so it is always at least as large as the Jaccard similarity
• MinHash — memory-efficient estimate of Jaccard similarity. Use `N` hash functions and keep, per hash function, the minimum hash value over the set; the fraction of hash functions whose minima agree between two sets estimates their Jaccard similarity (sketch below)
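
A minimal MinHash sketch in plain Python (the salted-SHA1 "hash functions" are an illustrative choice, not from the book):

```python
import hashlib

def minhash_signature(items, n_hashes=128):
    # For each of the N hash functions (here: SHA-1 salted with the index i),
    # keep only the minimum hash value over the set.
    return [
        min(int(hashlib.sha1(f"{i}:{x}".encode()).hexdigest(), 16) for x in items)
        for i in range(n_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of hash functions whose minima agree ~ |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "date", "elderberry"}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))  # true Jaccard = 3/5
```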

## Classification

• boosting (unlike the independently grown trees of a random forest) — each new tree predicts what the previous trees got wrong; this should lead to trees whose errors are less correlated (see the sketch below)
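
A minimal boosting-style loop where each shallow tree fits the residuals left by the trees before it (an illustrative sketch with scikit-learn, not the book's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

prediction = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # each tree corrects the previous ones

print("training MSE after boosting:", np.mean((y - prediction) ** 2))
```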

## Clustering

• leading eigenvector — community-detection algorithm for when you don't have a metric but do have a graph; splits nodes using the leading eigenvector of the modularity matrix
• modularity — an assignment of nodes into groups such that most interaction happens within groups and not between groups
• greedy Louvain — scalable greedy algorithm that maximizes graph modularity (see the sketch below)
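
A minimal sketch with networkx (assumes a recent networkx that ships `louvain_communities`; the toy graph is two cliques joined by a single edge):

```python
import networkx as nx
from networkx.algorithms import community

# Two 5-node cliques joined by one edge; Louvain should recover the two groups.
G = nx.barbell_graph(5, 0)
parts = community.louvain_communities(G, seed=42)
print(parts)                           # e.g. [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}]
print(community.modularity(G, parts))  # modularity score of that partition
```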

## Causal Inference

• Bayesian network — `P(X3|X2,X1) = P(X3|X2)` also `P(X1,X3|X2) = P(X1|X2)*P(X3|X2)` for graph `X1 -> X2 -> X3`
• average treatment effect = `E_test[Y] - E_control[Y]`
• confounding — hot weather leads to more lemonade sales; hot weather also leads to more crime; so does lemonade lead to crime? (No: the weather confounds both.)
• an intervention is different from conditioning on an observation; it disrupts the other dependencies in the graph
• `do(X_i)` operation — intervention on the DAG of variables that influence each other: delete all edges in `G` that point into `X_i` and set the value `X_i = x_i`
• Robins G-formula: `P(X_j|do(X_i=x_i)) = Σ_Z P(X_j|X_i=x_i, Z) P(Z)` — estimates the distribution of `X_j` under an intervention on `X_i`. Definition of `Z` (the back-door criterion): contains no descendant of `X_i`, and blocks every path between `X_i` and `X_j` that contains an arrow into `X_i` (see the sketch after this list)
• to test for causality you need to break the influence from other variables: make a new dataset with the variable fixed to a certain value, and collect it for all possible values of that variable
• `d-separation` — graphical criterion for reading conditional independence off the DAG: if every path between two variables is blocked by the conditioning set `Z`, they are independent given `Z`. Need to learn more.
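
A simulated version of the lemonade/crime confounder, comparing the naive difference of means with a back-door (g-formula) adjustment; the probabilities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.binomial(1, 0.5, n)          # hot day?
X = rng.binomial(1, 0.2 + 0.6 * Z)   # lemonade sold, driven by weather
Y = rng.binomial(1, 0.1 + 0.5 * Z)   # crime, driven by weather only (not by lemonade)

# Naive difference of observed means: badly confounded by Z.
naive = Y[X == 1].mean() - Y[X == 0].mean()

# G-formula / back-door adjustment: P(Y|do(X=x)) = sum_z P(Y|X=x, Z=z) P(Z=z).
adjusted = sum(
    (Y[(X == 1) & (Z == z)].mean() - Y[(X == 0) & (Z == z)].mean()) * (Z == z).mean()
    for z in (0, 1)
)

print(f"naive difference:    {naive:.3f}")     # clearly nonzero
print(f"adjusted difference: {adjusted:.3f}")  # close to zero: lemonade does not cause crime
```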

## Training

• Fitting `sin(x)` seems to need an NN with many parameters. Why? How many parameters do humans need to understand `sin(x)` perfectly? Need to think more about it (quick experiment below).
• When the training loss keeps falling while the validation loss stops improving or starts rising, the model is starting to overfit.
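
A quick parameter-count experiment for the `sin(x)` question (a sketch with scikit-learn's MLPRegressor; the layer sizes are arbitrary):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(X[:, 0])

for hidden in [(4,), (16,), (64,), (64, 64)]:
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=5000, random_state=0)
    model.fit(X, y)
    # Count weights and biases to see how many parameters a decent fit takes.
    n_params = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
    print(hidden, "params:", n_params, "train R^2:", round(model.score(X, y), 4))
```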

## Loss functions

• MSE - mean squared error
• MAE - mean absolute error
• the minimizer of MSE loss is the conditional expectation: `f(x_i) = E[Y|X=x_i]`
• the minimizer of MAE loss is the conditional median: `f(x_i) = Median(Y|X=x_i)`
• MSE gives well-behaved confidence intervals
• MAE deals well with extreme values (it is robust to outliers)
• Huber loss — when the absolute residual `|a - f(x)|` is less than `d`, it behaves like MSE; otherwise like MAE. It combines the good sides of MSE and MAE (sketch below).
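
A minimal numpy version of the Huber loss in the piecewise form above (the threshold `d` is chosen arbitrarily):

```python
import numpy as np

def huber(residual, d=1.0):
    # Quadratic (MSE-like) for |residual| <= d, linear (MAE-like) beyond d.
    r = np.abs(residual)
    return np.where(r <= d, 0.5 * r**2, d * (r - 0.5 * d))

residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])
print(huber(residuals))     # the outlier contributes linearly, not quadratically
print(0.5 * residuals**2)   # pure (half-)MSE would blow up on the 50.0 outlier
```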

## Software

• parallelisation reduces runtime only by a constant factor (the number of workers); it does not change the time complexity, so it does not help much when the problem size grows without bound
• a separate cache service (a separate process, e.g. Redis) is useful when multiple processes need to access the cache; caching in process memory would not work, since processes do not share memory
• conflict-free replicated data types (CRDTs) — express updates as commuting operations such as "increment by 1", so replicas can merge concurrent updates without conflicts (sketch below)
• events are a good fit when there are multiple readers of the same data
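
A minimal grow-only counter CRDT sketch (illustrative, not from the book): each replica increments only its own slot, and merge takes the element-wise max, so concurrent "increment by 1" operations converge.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merge = element-wise max."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        # A replica only ever increments its own slot.
        self.counts[self.replica_id] += 1

    def merge(self, other):
        # Merging is commutative, associative and idempotent, so replicas converge.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # both replicas converge to 3
```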