Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications, Addison-Wesley
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
- O(x^2) denotes terms x^n where n > 2 and x is small. The notation says what to ignore.
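  e.g. (my reading, assuming this refers to Taylor-style approximation): for small x, sin(x) = x - x^3/6 + O(x^5); the O(x^5) term is exactly the part the notation tells us to ignore.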
- p-hacking — with repeated trials (~20 at the usual p < 0.05 threshold) there is a high chance you will get a significant-looking result by accident
- Bonferroni correction — with n repeated trials, divide the p threshold by n.
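A simulation sketch of both notes (sample sizes, the 0.05 threshold, and the t-test choice are my assumptions):

```python
# Sketch: run 20 A/A tests (no real effect) and see how often a
# "significant" p < 0.05 shows up by chance, with and without Bonferroni.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_trials, alpha = 20, 0.05

pvals = []
for _ in range(n_trials):
    a = rng.standard_normal(100)
    b = rng.standard_normal(100)  # same distribution: any "effect" is noise
    pvals.append(ttest_ind(a, b).pvalue)

print(sum(p < alpha for p in pvals))             # often >= 1 false positive
print(sum(p < alpha / n_trials for p in pvals))  # Bonferroni: usually 0
```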
- n-grams — combinations of multiple adjacent words, e.g. the single token White_House instead of the separate words White and House
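A minimal sketch (the ngrams helper is hypothetical, not from the book):

```python
# Join n adjacent tokens so multi-word concepts become single tokens.
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the White House issued a statement".split(), 2))
# ['the_White', 'White_House', 'House_issued', 'issued_a', 'a_statement']
```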
- information loss (data processing inequality) — when you process data you end up with less than or equal information, unless you join with extra data
- consider the audience. Less is more.
- autocorrelation plot for time series — X is the lag; Y is how much y_t is correlated with y_{t-lag}
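A sketch of the computation behind such a plot (synthetic series with period 50, my choice):

```python
# Autocorrelation by hand: corr(y_t, y_{t-lag}) for several lags.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
y = np.sin(2 * np.pi * t / 50) + 0.3 * rng.standard_normal(500)

def autocorr(y, lag):
    return np.corrcoef(y[lag:], y[:-lag])[0, 1]

for lag in (1, 10, 25, 50):
    print(lag, round(autocorr(y, lag), 2))
# high at lag 50 (the period), negative at lag 25 (anti-phase)
```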
- Jaccard denominator is the size of the union
- cosine denominator is the geometric mean of the set sizes, if the vector represents membership. Larger than Jaccard.
- minhash — memory-efficient estimate of Jaccard. Use N hash functions and count for how many of them the two sets share the same minimum element.
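A sketch tying the three similarity notes together (the salted built-in hash is used only for illustration):

```python
# Jaccard vs cosine on two sets, plus a MinHash estimate of Jaccard.
import math

A = {"red", "green", "blue", "yellow"}
B = {"red", "green", "purple"}

jaccard = len(A & B) / len(A | B)                 # 2/5 = 0.4
cosine = len(A & B) / math.sqrt(len(A) * len(B))  # 2/sqrt(12) ~ 0.58
print(jaccard, cosine)  # cosine >= Jaccard since sqrt(|A||B|) <= |A ∪ B|

N = 512  # number of hash functions
def minhash(s, salt):
    return min(hash((salt, x)) for x in s)

# Fraction of hash functions where both sets share the same minimum
# element; this estimates the Jaccard similarity (~0.4).
matches = sum(minhash(A, i) == minhash(B, i) for i in range(N))
print(matches / N)
```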
- boosting (in contrast to the bagging used in random forests) — each tree predicts what the previous trees did not, which should lead to more independent trees
- Leading Eigenvector — community-detection algorithm for when you don't have a metric but do have a graph
- modularity — assignment into groups such that most interaction is within groups and not between groups
- Greedy Louvain — scalable algorithm for maximizing graph modularity
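A sketch with networkx's greedy modularity maximization (recent networkx versions also ship louvain_communities for the Louvain method):

```python
# Two triangles joined by one bridge: modularity maximization
# should recover the two groups.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph([(1, 2), (2, 3), (1, 3),   # triangle 1
              (4, 5), (5, 6), (4, 6),   # triangle 2
              (3, 4)])                  # bridge between the groups

comms = greedy_modularity_communities(G)
print([sorted(c) for c in comms])       # [[1, 2, 3], [4, 5, 6]]
print(modularity(G, comms))             # higher = cleaner separation
```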
- Bayesian network — for the graph X1 -> X2 -> X3: P(X3|X2,X1) = P(X3|X2), and also P(X1,X3|X2) = P(X1|X2) * P(X3|X2)
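A simulation sketch of that conditional independence (the probabilities are arbitrary choices of mine):

```python
# Simulate X1 -> X2 -> X3 and check P(X3|X2) ~= P(X3|X2,X1).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.random(n) < 0.5
x2 = np.where(x1, rng.random(n) < 0.8, rng.random(n) < 0.2)
x3 = np.where(x2, rng.random(n) < 0.7, rng.random(n) < 0.1)

print(x3[x2].mean())       # P(X3=1 | X2=1)       ~ 0.7
print(x3[x2 & x1].mean())  # P(X3=1 | X2=1, X1=1) ~ 0.7 as well
```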
- average treatment effect = E_test[Y] - E_control[Y]
- confounding — hot weather leads to more lemonade; hot weather leads to more crime; does lemonade lead to crime?
- intervention is different from conditional observation; it disrupts other dependencies in the graph
- do(X_i) operation — intervention; applies to the DAG of variables that influence each other: delete all edges in G that point into X_i and set X_i = x_i
- Robins G-Formula: P(X_j|do(X_i=x_i)) = Σ_Z P(X_j|X_i=x_i,Z) P(Z) — estimates the distribution of X_j under an intervention on X_i. Definition of Z: not a descendant of X_i; blocks every path between X_i and X_j that contains an arrow into X_i
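A simulation sketch of the adjustment (effect sizes and probabilities are my assumptions):

```python
# Z confounds X and Y. The naive difference E[Y|X=1] - E[Y|X=0] is
# biased; the G-formula sum over Z recovers the true effect of 2.0.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
z = rng.random(n) < 0.5                         # confounder (hot weather)
x = rng.random(n) < np.where(z, 0.8, 0.2)       # treatment depends on Z
y = 2.0 * x + 3.0 * z + rng.standard_normal(n)  # true effect of X is 2.0

naive = y[x].mean() - y[~x].mean()              # inflated by the Z -> Y path

adjusted = sum(
    (y[x & (z == zv)].mean() - y[~x & (z == zv)].mean()) * (z == zv).mean()
    for zv in (True, False)
)
print(round(naive, 2), round(adjusted, 2))      # ~3.8 vs ~2.0
```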
- to test for causality you need to break the influence from other variables: make a new dataset with the variable fixed to a certain value, and collect one for each possible value of the variable.
- d-separation — something to do with independence and Markovity… need to learn more.
- sin(x): need a NN with many parameters to fit it. Why? How many parameters do humans need to understand sin(x) perfectly? Need to think more about it.
- When the training loss starts to go lower than the validation loss, the model starts to overfit.
- MSE - mean squared error
- MAE - mean absolute error
- MSE loss is minimized by the conditional expectation: f(x_i) = E[Y|X=x_i]
- MAE loss is minimized by the conditional median: f(x_i) = Median(Y|X=x_i)
- MSE has good confidence intervals
- MAE deals well with extreme values
- Huber loss — when the deviation |a - f(x)| is less than d it is MSE, otherwise it is MAE. It combines the good sides of MSE and MAE.
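A sketch covering the last few notes (toy data of mine):

```python
# The mean minimizes MSE, the median minimizes MAE, and Huber
# behaves like MSE near zero and MAE in the tails.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value
grid = np.linspace(0.0, 110.0, 2201)       # candidate predictions

mse = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - grid[:, None]).mean(axis=1)
print(grid[mse.argmin()])  # 22.0 = mean, dragged up by the outlier
print(grid[mae.argmin()])  # 3.0 = median, robust to it

def huber(r, d=1.0):
    # quadratic (MSE-like) for |r| <= d, linear (MAE-like) beyond d
    return np.where(np.abs(r) <= d, 0.5 * r**2, d * (np.abs(r) - 0.5 * d))

hub = huber(y[None, :] - grid[:, None]).mean(axis=1)
print(grid[hub.argmin()])  # close to the median
```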
- parallelisation reduces running time only by a constant factor N (the number of workers); this does not help much when the input size grows without bound
- a separate cache service (its own process, like Redis) is useful when multiple processes need to access the cache; caching in process memory would not work since processes do not share memory
- conflict-free replicated data types (CRDTs) — built from operations like “increment by 1” that merge cleanly in any order
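A minimal sketch of one classic CRDT, a grow-only counter (class and method names are mine):

```python
# Each replica increments only its own slot; merge is elementwise
# max, so merges commute and replicas converge in any sync order.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, by=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + by

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # 3 3: both replicas converge
```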
- events are good when there are multiple readers