Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications, Addison-Wesley

Agile Manifesto

- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan

Math

- `O(x^2)` denotes terms `x^n, where n ≥ 2`, which can be ignored when `x` is small. Big-O notation is a way of stating what to ignore.
- `p-hacking` — with repeated trials (~20), there is a high chance you will get a result with `p < 0.05` by chance alone.

- Bonferroni correction — with `n` repeated trials, divide the significance threshold by `n` (equivalently, multiply each p-value by `n`).
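A minimal simulation of both points (my own sketch, not from the book; assumes `numpy` and `scipy`): run 20 t-tests under a true null hypothesis and compare the naive threshold with the Bonferroni-corrected one.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_trials, alpha, n_experiments = 20, 0.05, 1000
naive_hits, bonferroni_hits = 0, 0

for _ in range(n_experiments):
    # 20 independent A/B tests where A and B have the same distribution
    pvals = np.array([
        ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
        for _ in range(n_trials)
    ])
    naive_hits += (pvals < alpha).any()                  # any "significant" result?
    bonferroni_hits += (pvals < alpha / n_trials).any()  # corrected threshold

print("naive false-positive rate:     ", naive_hits / n_experiments)       # ~0.64
print("Bonferroni false-positive rate:", bonferroni_hits / n_experiments)  # ~0.05
```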

Feature Encoding

- n-grams — combinations of multiple words, e.g. `White_House` instead of `White` and `House` (see the sketch after this list)

- *information loss* (*data processing inequality*) — when you process data, you end up with at most the same amount of information, unless you join in extra data
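A hedged sketch of n-gram encoding using scikit-learn's `CountVectorizer` (my example, not from the book):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the White House issued a statement",
        "she painted the house white"]

# ngram_range=(1, 2) keeps unigrams and adds bigrams, so "white house"
# becomes a feature distinct from "white" and "house" alone
# (CountVectorizer lowercases by default).
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
```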

Visualization

- consider audience. Less is more.
- autocorrelation plot for time series — X axis is the lag; Y axis is how much `y(t)` is correlated with `y(t - lag)` (sketch below)
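A minimal way to compute the values behind such a plot (my sketch, assuming `pandas`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
y = pd.Series(np.sin(2 * np.pi * t / 25) + 0.3 * rng.normal(size=t.size))

# X axis: lag; Y axis: correlation of y(t) with y(t - lag).
# Correlation peaks again near the signal's period (25 here).
for lag in (1, 5, 12, 25):
    print(lag, round(y.autocorr(lag=lag), 2))
```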

Metrics

- Jaccard denominator is the size of the union
- cosine denominator is the geometric mean of the sizes of the sets, if the vector represents membership; cosine similarity is therefore larger than (or equal to) Jaccard
- minhash — memory-efficient estimate of Jaccard. Use N hash functions; for each set, keep the minimum hash value under each function; the fraction of functions where both sets' minima match estimates the Jaccard similarity (sketch below).
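A toy minhash implementation (my sketch; `hash(x) ^ seed` is a cheap stand-in for a real hash family, stable only within one Python process):

```python
import random

def minhash_signature(items, seeds):
    # One minimum per "hash function"
    return [min(hash(x) ^ s for x in items) for s in seeds]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of hash functions whose minima agree
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(64) for _ in range(500)]
A = {"a", "b", "c", "d"}
B = {"b", "c", "d", "e"}
print(estimate_jaccard(minhash_signature(A, seeds),
                       minhash_signature(B, seeds)))  # true Jaccard = 3/5 = 0.6
```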

Classification

- boosting (in contrast to random forest's bagging) — each tree predicts what the previous trees did not (the residuals), which should lead to trees with more independent errors (sketch below)
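A toy boosting loop to make the "fit what the previous trees missed" idea concrete (my sketch, assuming scikit-learn; real libraries such as XGBoost add regularization and much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

pred, lr, trees = np.zeros_like(y), 0.1, []
for _ in range(100):
    residual = y - pred                        # what previous trees got wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)               # shrink each tree's contribution
    trees.append(tree)

print("train MSE:", np.mean((y - pred) ** 2))
```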

Clustering

- Leading Eigenvector — a community-detection algorithm for when you don't have a metric but do have a graph
- modularity — assignment of nodes into groups such that most interaction is within groups and not between them
- Louvain — a greedy, scalable algorithm for maximizing graph modularity (sketch below)
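A small example of modularity-based community detection (my sketch; assumes `networkx`, which also ships `louvain_communities` in recent versions):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()

# Greedy modularity maximization (Clauset-Newman-Moore); Louvain is the
# scalable greedy relative of this approach.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
print("modularity:", modularity(G, communities))
```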

Causal Inference

- Bayesian network — `P(X3|X2,X1) = P(X3|X2)`, and also `P(X1,X3|X2) = P(X1|X2)*P(X3|X2)`, for the graph `X1 -> X2 -> X3`

- average treatment effect = `E_test[Y] - E_control[Y]`

- confounding — hot weather leads to more lemonade sales; hot weather leads to more crime; does lemonade lead to crime?
- intervention is different from conditional observation: it disrupts other dependencies in the graph. The `do(X_i)` operation — an intervention on the DAG of variables that influence each other — deletes all edges in `G` that point to `X_i` and sets the value `X_i = x_i`

- Robins G-formula: `P(X_j|do(X_i=x_i)) = Σ_Z P(X_j|X_i=x_i,Z) P(Z)` — estimates the distribution of `X_j` under an intervention on `X_i`. Definition of `Z`: not a descendant of `X_i`; blocks every path between `X_i` and `X_j` that contains an arrow into `X_i` (see the simulation after this list)

- to test for causality you need to break the influence from other variables: make a new dataset with the variable fixed to a certain value, and collect one for each possible value of the variable
- `d-separation` — something to do with independence and Markovity… need to learn more.
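A minimal simulation tying the lemonade/crime confounder to the G-formula (my own sketch, assuming `numpy`; here `Z` = hot weather, `X` = lemonade, `Y` = crime, and there is no true `X -> Y` effect):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.random(n) < 0.5                      # hot weather
X = rng.random(n) < np.where(Z, 0.8, 0.2)    # lemonade sales, driven by Z
Y = rng.random(n) < np.where(Z, 0.6, 0.1)    # crime, driven by Z only

# Naive conditioning: lemonade looks like it "causes" crime
print(Y[X].mean() - Y[~X].mean())            # ~0.3

# G-formula adjustment: P(X_j|do(X_i=x)) = sum_z P(X_j|X_i=x, Z=z) P(Z=z)
def p_do(x):
    return sum(Y[(X == x) & (Z == z)].mean() * (Z == z).mean()
               for z in (True, False))

print(p_do(True) - p_do(False))              # ~0: no causal effect
```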

Training

- Fitting `sin(x)` needs a NN with many parameters. Why? How many parameters do humans need to understand `sin(x)` *perfectly*? Need to think more about it (sketch below).
- When training loss starts to go lower than validation loss, the model starts to overfit.
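A quick sketch of the `sin(x)` point (my example, assuming scikit-learn): even this simple function takes thousands of weights to fit well, while a human "knows" it from a few properties.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(X[:, 0])

# Two hidden layers of 64 units: roughly 4k parameters for one sine wave.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X, y)
print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```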

Loss functions

- MSE - mean squared error
- MAE - mean absolute error
- MSE loss is minimized by the conditional expectation: `f(x_i) = E[Y|X=x_i]`
- MAE loss is minimized by the conditional median: `f(x_i) = Median(Y|X=x_i)`

- MSE has good confidence intervals
- MAE deals well with extreme values
- *Huber* loss — when the deviation `|a - f(x)|` is less than `d`, it behaves like MSE, otherwise like MAE. It combines the good sides of MSE and MAE (sketch below).
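A direct translation of that definition (my sketch, assuming `numpy`):

```python
import numpy as np

def huber(a, d=1.0):
    # a = residual; quadratic (MSE-like) inside |a| <= d, linear (MAE-like) outside
    return np.where(np.abs(a) <= d,
                    0.5 * a**2,
                    d * (np.abs(a) - 0.5 * d))

print(huber(np.array([0.1, 0.5, 3.0])))  # small residuals squared, large ones linear
```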

Software

- parallelisation reduces running time only by a constant factor N (the number of workers); this does not change the asymptotic complexity as the input grows without bound
- a separate cache service (a separate process, like Redis) is useful when multiple processes need to access the cache. Caching in process memory would not work, since processes do not share memory.
- *conflict-free* replicated data types (CRDTs) — "increment by 1" events are good when there are multiple writers, since the increments commute (sketch below)
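A toy grow-only counter CRDT (my sketch): each replica increments its own slot, and merge takes the elementwise max, so increments commute and concurrent writers never conflict.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1        # only touch our own slot

    def merge(self, other):
        # Elementwise max is commutative, associative, and idempotent
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # both replicas converge to 3
```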