Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications, Addison-Wesley
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
- O(x^2) denotes terms x^n where n > 2 and x is small. The notation says what to ignore.
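  e.g. (my reading, assuming this refers to Taylor-style approximation): for small x, sin(x) = x - x^3/6 + O(x^5); the O(x^5) term is exactly the part the notation tells us to ignore.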
- p-hacking — with repeated trials (~20 at the usual p < 0.05 threshold) there is a high chance you will get a significant-looking result by accident
- Bonferroni correction — with n repeated trials, divide the p threshold by n.
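A simulation sketch of both notes (sample sizes, the 0.05 threshold, and the t-test choice are my assumptions):

```python
# Sketch: run 20 A/A tests (no real effect) and see how often a
# "significant" p < 0.05 shows up by chance, with and without Bonferroni.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_trials, alpha = 20, 0.05

pvals = []
for _ in range(n_trials):
    a = rng.standard_normal(100)
    b = rng.standard_normal(100)  # same distribution: any "effect" is noise
    pvals.append(ttest_ind(a, b).pvalue)

print(sum(p < alpha for p in pvals))             # often >= 1 false positive
print(sum(p < alpha / n_trials for p in pvals))  # Bonferroni: usually 0
```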
- n-grams — combinations of multiple adjacent words, e.g. the single token White_House instead of the separate words White and House
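A minimal sketch (the ngrams helper is hypothetical, not from the book):

```python
# Join n adjacent tokens so multi-word concepts become single tokens.
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the White House issued a statement".split(), 2))
# ['the_White', 'White_House', 'House_issued', 'issued_a', 'a_statement']
```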
- information loss (data processing inequality) — when you process data you end up with less than or equal information, unless you join with extra data
- consider the audience. Less is more.
- autocorrelation plot for time series — X is the lag; Y is how much y_t is correlated with y_{t-lag}
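A sketch of the computation behind such a plot (synthetic series with period 50, my choice):

```python
# Autocorrelation by hand: corr(y_t, y_{t-lag}) for several lags.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
y = np.sin(2 * np.pi * t / 50) + 0.3 * rng.standard_normal(500)

def autocorr(y, lag):
    return np.corrcoef(y[lag:], y[:-lag])[0, 1]

for lag in (1, 10, 25, 50):
    print(lag, round(autocorr(y, lag), 2))
# high at lag 50 (the period), negative at lag 25 (anti-phase)
```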
- Jaccard denominator is the size of the union
- cosine denominator is the geometric mean of the set sizes, if the vector represents membership. Larger than Jaccard.
- minhash — memory-efficient estimate of Jaccard. Use N hash functions and count for how many of them the two sets share the same minimum element.
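A sketch tying the three similarity notes together (the salted built-in hash is used only for illustration):

```python
# Jaccard vs cosine on two sets, plus a MinHash estimate of Jaccard.
import math

A = {"red", "green", "blue", "yellow"}
B = {"red", "green", "purple"}

jaccard = len(A & B) / len(A | B)                 # 2/5 = 0.4
cosine = len(A & B) / math.sqrt(len(A) * len(B))  # 2/sqrt(12) ~ 0.58
print(jaccard, cosine)  # cosine >= Jaccard since sqrt(|A||B|) <= |A ∪ B|

N = 512  # number of hash functions
def minhash(s, salt):
    return min(hash((salt, x)) for x in s)

# Fraction of hash functions where both sets share the same minimum
# element; this estimates the Jaccard similarity (~0.4).
matches = sum(minhash(A, i) == minhash(B, i) for i in range(N))
print(matches / N)
```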
- boosting (in contrast to the bagging used in random forests) — each tree predicts what the previous trees did not, which should lead to more independent trees
- Leading Eigenvector — community-detection algorithm for when you don't have a metric but do have a graph
- modularity — assignment into groups such that most interaction is within groups and not between groups
- Greedy Louvain — scalable algorithm for maximizing graph modularity
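A sketch with networkx's greedy modularity maximization (recent networkx versions also ship louvain_communities for the Louvain method):

```python
# Two triangles joined by one bridge: modularity maximization
# should recover the two groups.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph([(1, 2), (2, 3), (1, 3),   # triangle 1
              (4, 5), (5, 6), (4, 6),   # triangle 2
              (3, 4)])                  # bridge between the groups

comms = greedy_modularity_communities(G)
print([sorted(c) for c in comms])       # [[1, 2, 3], [4, 5, 6]]
print(modularity(G, comms))             # higher = cleaner separation
```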
- Bayesian network — for the graph X1 -> X2 -> X3: P(X3|X2,X1) = P(X3|X2), and also P(X1,X3|X2) = P(X1|X2) * P(X3|X2)
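A simulation sketch of that conditional independence (the probabilities are arbitrary choices of mine):

```python
# Simulate X1 -> X2 -> X3 and check P(X3|X2) ~= P(X3|X2,X1).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.random(n) < 0.5
x2 = np.where(x1, rng.random(n) < 0.8, rng.random(n) < 0.2)
x3 = np.where(x2, rng.random(n) < 0.7, rng.random(n) < 0.1)

print(x3[x2].mean())       # P(X3=1 | X2=1)       ~ 0.7
print(x3[x2 & x1].mean())  # P(X3=1 | X2=1, X1=1) ~ 0.7 as well
```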
- average treatment effect = E_test[Y] - E_control[Y]
- confounding — hot weather leads to more lemonade; hot weather leads to more crime; does lemonade lead to crime?
- intervention is different from conditional observation; it disrupts other dependencies in the graph
- do(X_i) operation — intervention; applies to the DAG of variables that influence each other: delete all edges in G that point into X_i and set X_i = x_i
- Robins G-Formula: P(X_j|do(X_i=x_i)) = Σ_Z P(X_j|X_i=x_i,Z) P(Z) — estimates the distribution of X_j under an intervention on X_i. Definition of Z: not a descendant of X_i; blocks every path between X_i and X_j that contains an arrow into X_i
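A simulation sketch of the adjustment (effect sizes and probabilities are my assumptions):

```python
# Z confounds X and Y. The naive difference E[Y|X=1] - E[Y|X=0] is
# biased; the G-formula sum over Z recovers the true effect of 2.0.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
z = rng.random(n) < 0.5                         # confounder (hot weather)
x = rng.random(n) < np.where(z, 0.8, 0.2)       # treatment depends on Z
y = 2.0 * x + 3.0 * z + rng.standard_normal(n)  # true effect of X is 2.0

naive = y[x].mean() - y[~x].mean()              # inflated by the Z -> Y path

adjusted = sum(
    (y[x & (z == zv)].mean() - y[~x & (z == zv)].mean()) * (z == zv).mean()
    for zv in (True, False)
)
print(round(naive, 2), round(adjusted, 2))      # ~3.8 vs ~2.0
```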
- to test for causality you need to break the influence from other variables: make a new dataset with the variable fixed to a certain value, and collect one for each possible value of the variable.
- d-separation — something to do with independence and Markovity… need to learn more.
- sin(x): need a NN with many parameters to fit it. Why? How many parameters do humans need to understand sin(x) perfectly? Need to think more about it.
- When the training loss starts to go lower than the validation loss, the model starts to overfit.
- MSE - mean squared error
- MAE - mean absolute error
- MSE loss is minimized by the conditional expectation: f(x_i) = E[Y|X=x_i]
- MAE loss is minimized by the conditional median: f(x_i) = Median(Y|X=x_i)
- MSE has good confidence intervals
- MAE deals well with extreme values
- Huber loss — when the deviation |a - f(x)| is less than d it is MSE, otherwise it is MAE. It combines the good sides of MSE and MAE.
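A sketch covering the last few notes (toy data of mine):

```python
# The mean minimizes MSE, the median minimizes MAE, and Huber
# behaves like MSE near zero and MAE in the tails.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value
grid = np.linspace(0.0, 110.0, 2201)       # candidate predictions

mse = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - grid[:, None]).mean(axis=1)
print(grid[mse.argmin()])  # 22.0 = mean, dragged up by the outlier
print(grid[mae.argmin()])  # 3.0 = median, robust to it

def huber(r, d=1.0):
    # quadratic (MSE-like) for |r| <= d, linear (MAE-like) beyond d
    return np.where(np.abs(r) <= d, 0.5 * r**2, d * (np.abs(r) - 0.5 * d))

hub = huber(y[None, :] - grid[:, None]).mean(axis=1)
print(grid[hub.argmin()])  # close to the median
```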
- parallelisation reduces running time only by a constant factor N (the number of workers); this does not help much when the input size grows without bound
- a separate cache service (its own process, like Redis) is useful when multiple processes need to access the cache; caching in process memory would not work since processes do not share memory
- conflict-free replicated data types (CRDTs) — built from operations like “increment by 1” that merge cleanly in any order
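A minimal sketch of one classic CRDT, a grow-only counter (class and method names are mine):

```python
# Each replica increments only its own slot; merge is elementwise
# max, so merges commute and replicas converge in any sync order.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, by=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + by

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # 3 3: both replicas converge
```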
- events are good when there are multiple readers