Based on original questions from Vincent Granville, Brian Caffo, Roger Peng, Jeff Leek, Andrew Ng, DataIku, Yelp, Facebook, Quora, my dear friend Italo Farfan Verà and me. Some questions are deliberately left unanswered; hints can be found in other questions.
Feel free to contact me: LinkedIn or dahan1jonathan@gmail.com.
It’s a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. Mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting, and get an insight on how the model will generalize to an independent data set.
Examples: leave-one-out cross validation, K-fold cross validation
How to do it right?
Bias-variance trade-off for k-fold cross validation:
Leave-one-out cross-validation: gives approximately unbiased estimates of the test error since each training set contains almost the entire data set (\(n-1\) observations).
But: we average the outputs of n fitted models, each trained on an almost identical set of observations, so the outputs are highly correlated. Since the variance of a mean of correlated quantities increases with their correlation, the test error estimate from LOOCV has higher variance than the one obtained with k-fold cross validation
Typically, we choose \(k=5\) or \(k=10\), as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor high variance.
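As an illustration, here is a minimal sketch of k-fold cross-validation written by hand in R (the built-in mtcars data and a linear model are stand-ins; any model and data set could be substituted).
R code:
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))     # random fold assignment
cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)               # fit on the k-1 training folds
  mean((test$mpg - predict(fit, test))^2)                # test MSE on the held-out fold
})
mean(cv_mse)                                             # cross-validated estimate of the test error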
Common metrics in regression:
Mean Squared Error Vs Mean Absolute Error
RMSE gives a relatively high weight to large errors. The RMSE is most useful when large errors are particularly undesirable.
The MAE is a linear score: all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE.
\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}\)
\(MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|\)
Root Mean Squared Logarithmic Error
RMSLE penalizes an under-predicted estimate more heavily than an over-predicted one (the opposite of RMSE)
\(RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\)
Where \(p_i\) is the ith prediction, \(a_i\) the ith actual response, \(log(b)\) the natural logarithm of \(b\).
Weighted Mean Absolute Error
The weighted average of absolute errors. MAE and RMSE assume that each prediction provides equally precise information about the error variation, i.e. that the standard deviation of the error term is constant over all the predictions. Examples: recommender systems (differences between past and recent products)
\(WMAE = \frac{1}{\sum{w_i}} \sum_{i=1}^n w_i | y_i - \hat{y}_i |\)
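As a sketch, the metrics above written as plain R functions (y denotes the actual values, p the predictions and w the weights; these names are illustrative only).
R code:
rmse  <- function(y, p) sqrt(mean((y - p)^2))
mae   <- function(y, p) mean(abs(y - p))
rmsle <- function(y, p) sqrt(mean((log(p + 1) - log(y + 1))^2))
wmae  <- function(y, p, w) sum(w * abs(y - p)) / sum(w)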
Common metrics in classification:
Recall / Sensitivity / True positive rate:
High when FN low. Sensitive to unbalanced classes.
\(Sensitivity=\frac{TP}{TP+FN}\)
Precision / Positive Predictive Value
High when FP low. Sensitive to unbalanced classes.
\(Precision=\frac{TP}{TP+FP}\)
Specificity / True Negative Rate
High when FP low. Sensitive to unbalanced classes.
\(Specificity=\frac{TN}{TN+FP}\)
Accuracy
High when FP and FN are low. Sensitive to unbalanced classes (see “Accuracy paradox”)
\(Accuracy=\frac{TP+TN}{TN+TP+FP+FN}\)
ROC / AUC
ROC is a graphical plot that illustrates the performance of a binary classifier as its decision threshold varies (\(Sensitivity\) Vs \(1-Specificity\), or \(Sensitivity\) Vs \(Specificity\)). ROC and AUC are not sensitive to unbalanced classes.
AUC is the area under the ROC curve. A perfect classifier has AUC=1 and its ROC curve passes through the point (0,1): 100% sensitivity (no FN) and 100% specificity (no FP)
Logarithmic loss
The penalty grows without bound as a confident prediction approaches the wrong class: it’s better to be somewhat wrong than emphatically wrong!
\(\text{log loss} = -\frac{1}{n}\sum_{i=1}^n {(y_i\log(p_i) + (1 - y_i)\log(1 - p_i))}\)
Misclassification Rate
\(Misclassification={1 \over n}\sum_i I(y_i \neq \hat y_i)\)
F1-Score
Used when the target variable is unbalanced. \(F_1Score=2\frac{Precision\times Recall}{Precision+Recall}\)
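A minimal sketch computing these metrics from a 2x2 confusion matrix in R (actual and predicted are assumed to be 0/1 vectors; the helper name is hypothetical).
R code:
confusion_metrics <- function(actual, predicted) {
  TP <- sum(predicted == 1 & actual == 1)
  FP <- sum(predicted == 1 & actual == 0)
  TN <- sum(predicted == 0 & actual == 0)
  FN <- sum(predicted == 0 & actual == 1)
  recall    <- TP / (TP + FN)
  precision <- TP / (TP + FP)
  c(recall      = recall,
    precision   = precision,
    specificity = TN / (TN + FP),
    accuracy    = (TP + TN) / (TP + TN + FP + FN),
    f1          = 2 * precision * recall / (precision + recall))
}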
Ridge regression:
The Lasso:
In contrast, the global optimum is the optimal solution among all possible solutions
K-means clustering context:
It’s proven that the objective cost function will always decrease until a local optimum is reached.
Results will depend on the initial random cluster assignment
Determining if you have a local optimum problem:
Tendency of premature convergence
Different initialization induces different optima
Avoid local optima in a K-means context: repeat K-means and take the solution that has the lowest cost
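A quick illustration in R: kmeans() with nstart repeats the algorithm from several random initializations and keeps the solution with the lowest total within-cluster cost (the built-in iris data is used purely as an example).
R code:
set.seed(1)
km_once <- kmeans(iris[, 1:4], centers = 3, nstart = 1)    # a single random start
km_best <- kmeans(iris[, 1:4], centers = 3, nstart = 25)   # best of 25 random starts
c(km_once$tot.withinss, km_best$tot.withinss)              # total within-cluster sum of squares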
Validation using \(R^2\):
- % of variance retained by the model
- Issue: \(R^2\) is always increased when adding variables
- \(R^2=\frac{RSS_{tot}-RSS_{res}}{RSS_{tot}}=\frac{RSS_{reg}}{RSS_{tot}}=1-\frac{RSS_{res}}{RSS_{tot}}\)
Analysis of residuals:
- Heteroskedasticity (relation between the variance of the model errors and the size of an independent variable’s observations)
- Scatter plots residuals Vs predictors
- Normality of errors
- Etc. : diagnostic plots
Out-of-sample evaluation: with cross-validation
See question 3. “How to define/select metrics? Do you know compound metrics?”.
When using Precision/Recall curves.
Used for:
Statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.
Reduce the data from \(n\) to \(k\) dimensions: find the \(k\) vectors onto which to project the data so as to minimize the projection error.
Algorithm:
1) Preprocessing (standardization): PCA is sensitive to the relative scaling of the original variables
2) Compute covariance matrix \(\Sigma\)
3) Compute eigenvectors of \(\Sigma\)
4) Choose \(k\) principal components so as to retain \(x\)% of the variance (typically \(x=99\))
Applications:
1) Compression
- Reduce disk/memory needed to store data
- Speed up learning algorithm. Warning: mapping should be defined only on training set and then applied to test set
Limitations:
- PCA is not scale invariant
- The directions with largest variance are assumed to be of most interest
- Only considers orthogonal transformations (rotations) of the original variables
- PCA is only based on the mean vector and covariance matrix. Some distributions (multivariate normal) are characterized by this but some are not
- If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances
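A minimal sketch of the steps above with prcomp in R (iris is used as a stand-in data set); note that the rotation is learned on the training set only and then applied to the test set.
R code:
train <- iris[1:100, 1:4]
test  <- iris[101:150, 1:4]
pca <- prcomp(train, center = TRUE, scale. = TRUE)     # standardize, then eigen-decompose the covariance
summary(pca)                                           # proportion of variance per component
k <- which(summary(pca)$importance["Cumulative Proportion", ] >= 0.99)[1]   # keep 99% of the variance
train_reduced <- predict(pca, train)[, 1:k, drop = FALSE]
test_reduced  <- predict(pca, test)[, 1:k, drop = FALSE]   # same mapping applied to the test set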
False positive
Improperly reporting the presence of a condition when it’s not in reality. Example: HIV positive test when the patient is actually HIV negative
False negative
Improperly reporting the absence of a condition when in reality it’s the case. Example: not detecting a disease when the patient has this disease.
When false positives are more important than false negatives:
- In a non-contagious disease, where treatment delay doesn’t have any long-term consequences but the treatment itself is grueling
- HIV test: psychological impact
When false negatives are more important than false positives:
- If early treatment is important for good outcomes
- In quality control: a defective item passes through the cracks!
- Software testing: a test to catch a virus has failed
“Natural language processing”!
Major tasks:
- Machine translation
- Question answering: “what’s the capital of Canada?”
- Sentiment analysis: extract subjective information from a set of documents, identify trends or public opinions in the social media
- Information retrieval
Counts:
Distributions:
Too many features, dimension-reduction? PCA?
In terms of performance on several data sets:
- “Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?”
- “Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets”, A. Lacoste and F. Laviolette
- “Statistical Comparisons of Classifiers over Multiple Data Sets”, Janez Demsar
In terms of performance on a given data set:
- One wants to choose between two learning algorithms
- Need to compare their performances and assess the statistical significance
One approach (Not preferred in the literature):
- Multiple k-fold cross validation: run CV multiple times and take the mean and sd
- You have: algorithm A (mean and sd) and algorithm B (mean and sd)
- Is the difference meaningful? (Paired t-test)
Sign-test (classification context):
Simply counts the number of times A has a better metric than B and assumes this count comes from a binomial distribution. Then we can obtain a p-value for the \(H_0\) hypothesis: A and B are equal in terms of performance.
Wilcoxon signed rank test (classification context):
Like the sign-test, but the wins (A is better than B) are weighted and assumed to come from a symmetric distribution around a common median. Then, we obtain a p-value for the \(H_0\) hypothesis.
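A sketch of these tests in R on hypothetical per-fold accuracies of two classifiers A and B (all numbers are made up for illustration).
R code:
acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.85, 0.78, 0.80, 0.84)
acc_b <- c(0.78, 0.80, 0.81, 0.77, 0.82, 0.79, 0.83, 0.76, 0.79, 0.81)
t.test(acc_a, acc_b, paired = TRUE)                     # paired t-test
binom.test(sum(acc_a > acc_b), sum(acc_a != acc_b))     # sign test: wins of A over B
wilcox.test(acc_a, acc_b, paired = TRUE)                # Wilcoxon signed-rank test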
Other (without hypothesis testing):
- AUC
- F-Score
- See question 3
Kolmogorov-Smirnov test:
- Non-parametric test
- Compare a sample with a reference probability distribution or compare two samples
- Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution
- Or between the empirical distribution functions of two samples
- Null hypothesis (two-samples test): samples are drawn from the same distribution
- Can be modified as a goodness of fit test
- In our case: cumulative percentages of good, cumulative percentages of bad
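A short sketch of the two-sample test in R on simulated scores for the “good” and “bad” groups (the data are made up).
R code:
set.seed(7)
good <- rnorm(200, mean = 0.6, sd = 0.15)   # hypothetical scores of the good cases
bad  <- rnorm(200, mean = 0.4, sd = 0.15)   # hypothetical scores of the bad cases
ks.test(good, bad)                          # H0: both samples come from the same distribution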
Collaborative filtering:
- Technique used by some recommender systems
- Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources.
1. A user expresses his/her preferences by rating items (movies, CDs.)
2. The system matches this user’s ratings against other users’ and finds people with most similar tastes
3. With these similar users identified, the system recommends items that they have rated highly but that have not yet been rated by this user
n-grams:
- Contiguous sequence of n items from a given sequence of text or speech
- “Andrew is a talented data scientist”
- Bi-gram: “Andrew is”, “is a”, “a talented”.
- Tri-grams: “Andrew is a”, “is a talented”, “a talented data”.
- An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
- More concisely, n-gram model: \(P(X_i|X_{i-(n-1)}...X_{i-1})\): Markov model
- N-gram model: each word depends only on the \(n-1\) last words
Issues:
- when facing infrequent n-grams
- solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
- Methods: Good-Turing, Backoff, Kneser-Ney smoothing
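A tiny base-R sketch extracting the bi-grams of the example sentence above.
R code:
words <- strsplit("Andrew is a talented data scientist", " ")[[1]]
paste(head(words, -1), tail(words, -1))
# "Andrew is" "is a" "a talented" "talented data" "data scientist"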
Cosine distance:
- How similar are two documents?
- Perfect similarity/agreement: 1
- No agreement : 0 (orthogonality)
- Measures the orientation, not magnitude
Given two vectors A and B representing word frequencies:
\(\text{cosine-similarity}(A,B) = \frac{\left<A,B\right>}{||A||\cdot||B||}\)
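As a sketch, the same formula as a one-line R function applied to two hypothetical word-count vectors.
R code:
cosine_similarity <- function(A, B) sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
cosine_similarity(c(2, 1, 0, 3), c(1, 1, 1, 2))   # word counts of two toy documents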
How do you define good?
- good data: data relevant regarding the project/task to be handled
- good model: model relevant regarding the project/task
- good model: a model that generalizes on external data sets
Is there a universal good model?
- No, otherwise there wouldn’t be the overfitting problem!
- Algorithm can be universal but not the model
- Model built on a specific data set in a specific organization could be ineffective in other data set of the same organization
- Models have to be updated on a somewhat regular basis
Are there any models that are definitely not so good?
- “all models are wrong but some are useful” George E.P. Box
- It depends on what you want: predictive models or explanatory power
- If both are bad: bad model
Example with linear regression:
- F-statistic (ANOVA)
\(F=\frac{\frac{RSS_1-RSS_2}{p_2-p_1}}{\frac{RSS_2}{n-p_2}}\)
\(p_1\): number of parameters of model 1
\(p_2\): number of parameters of model 2
\(n\): number of observations
Under the null hypothesis that model 2 doesn’t provide a significantly better fit than model 1, \(F\) will have an \(F\) distribution with \((p_2-p_1,n-p_2)\) degrees of freedom. The null hypothesis is rejected if the \(F\) calculated from the data is greater than the critical value of the \(F\) distribution for some desired significance level.
Others: AIC/BIC (regression), cross-validation: assessing test error on a test/validation set
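An illustrative nested-model comparison in R with anova(), which performs the F-test above (mtcars is used as a stand-in data set).
R code:
model1 <- lm(mpg ~ wt, data = mtcars)              # p1 parameters
model2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # p2 > p1 parameters
anova(model1, model2)                              # H0: model 2 is not a significant improvement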
data reduction techniques other than PCA?:
Partial least squares: like PCR (principal component regression) but chooses the principal components in a supervised way. Gives higher weights to variables that are most strongly related to the response
step-wise regression?
- the choice of predictive variables are carried out using a systematic procedure
- Usually, it takes the form of a sequence of F-tests, t-tests, adjusted R-squared, AIC, BIC
- at any given step, the model is fit using unconstrained least squares
- can get stuck in local optima
- Better: Lasso
step-wise techniques:
- Forward-selection: begin with no variables, adding them when they improve a chosen model comparison criterion
- Backward-selection: begin with all the variables, removing them when it improves a chosen model comparison criterion
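A brief sketch of both directions with step() and AIC in R (mtcars as an example data set).
R code:
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
step(null, scope = formula(full), direction = "forward")   # forward selection
step(full, direction = "backward")                         # backward elimination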
Better than reduced data:
Example 1: If all the components have a high variance: which components to discard with a guarantee that there will be no significant loss of the information?
Example 2 (classification):
- One has 2 classes; the within class variance is very high as compared to between class variance
- PCA might discard the very information that separates the two classes
Better than a sample:
- When number of variables is high relative to the number of observations
Predict \(y\) from \(x\): 1) + 2)
Estimate the standard error of predictors: 1) + 2) + 3)
Get an unbiased estimation of \(y\) from \(x\): 1) + 2) + 3) + 4)
Make probability statements, hypothesis testing involving slope and correlation, confidence intervals: 1) + 2) + 3) + 4) + 5)
Note:
- Contrary to a common myth, linear regression doesn’t assume anything about the distributions of \(x\) and \(y\)
- It only makes assumptions about the distribution of the residuals
- And this is only needed for statistical tests to be valid
- Regression can be applied to many purposes, even if the errors are not normally distributed
Collinearity/Multicollinearity:
- In multiple regression: when two or more variables are highly correlated
- They provide redundant information
- In case of perfect multicollinearity: \(\beta=(X^TX)^{-1}X^Ty\) doesn’t exist, the design matrix isn’t invertible
- It doesn’t affect the model as a whole, doesn’t bias results
- The standard errors of the regression coefficients of the affected variables tend to be large
- The test of the hypothesis that a coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable (Type II error)
- Leads to overfitting
Remove multicollinearity:
- Drop some of affected variables
- Principal component regression: gives uncorrelated predictors
- Combine the affected variables
- Ridge regression
- Partial least square regression
Detection of multicollinearity:
- Large changes in the individual coefficients when a predictor variable is added or deleted
- Insignificant regression coefficients for the affected predictors, but a rejection of the joint hypothesis that those coefficients are all zero (F-test)
- VIF: the ratio of variances of the coefficient when fitting the full model divided by the variance of the coefficient when fitted on its own
- rule of thumb: \(VIF>5\) indicates multicollinearity
- Correlation matrix, but correlation is a bivariate relationship whereas multicollinearity is multivariate
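A quick sketch of the VIF of a single predictor, using the equivalent formulation \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors (mtcars as an example).
R code:
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared   # regress wt on the other predictors
vif_wt <- 1 / (1 - r2_wt)
vif_wt                                                           # above 5 suggests multicollinearity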
R squared/Adjusted R squared:
- \(R^2=\frac{RSS_{tot}-RSS_{res}}{RSS_{tot}}=\frac{RSS_{reg}}{RSS_{tot}}=1-\frac{RSS_{res}}{RSS_{tot}}\)
- Describes the percentage of the total variation described by the model
- \(R^2\) always increases when adding new variables: adjusted \(R^2\) incorporates the model’s degrees of freedom
F test:
- Evaluates the hypothesis \(H_0\): all regression coefficients are equal to zero Vs \(H_1\): at least one is not
- Indicates that \(R^2\) is reliable
RMSE:
- Absolute measure of fit (whereas \(R^2\) is a relative measure of fit)
Finding a split: methods vary, from greedy search (e.g. C4.5) to randomly selecting attributes and split points (random forests)
Purity measure: information gain, Gini coefficient, Chi Squared values
Stopping criteria: methods vary from minimum size, particular confidence in prediction, purity criteria threshold
Pruning: reduced error pruning, out of bag error pruning (ensemble methods)
Gini
\(Gini=1-\sum_{j}p_j^2\)
Information Gain/Deviance
\(InformationGain=-\sum_{j}p_j\log_2 p_j\)
Better than Gini when \(p_j\) are very small: multiplying very small numbers leads to rounding errors, we can instead take logs.
Random forest? (Intuition):
- Underlying principle: several weak learners combined provide a strong learner
- Builds several decision trees on bootstrapped training samples of data
- On each tree, each time a split is considered, a random sample of \(m\) predictors is chosen as split candidates, out of all \(p\) predictors
- Rule of thumb: at each split \(m=\sqrt{p}\)
- Predictions: by majority vote across the trees
Why is it good?
- Very good performance (decorrelates the trees)
- Can model non-linear class boundaries
- Generalization error for free: no cross-validation needed; the out-of-bag error gives an unbiased estimate of the generalization error as the forest is built
- Generates variable importance
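A minimal sketch with the randomForest package (assumed installed), showing the out-of-bag error and variable importance on the iris data.
R code:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)  # mtry ~ sqrt(p), here p = 4
rf                                                                   # prints the OOB error estimate
importance(rf)                                                       # variable importance (mean decrease in Gini)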
\(\log(odds)=\log\left(\frac{P(y=1|x)}{P(y=0|x)}\right)\) is a linear function of the input features
Minimization objective/Cost function:
\(J(\beta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\beta(x^{i}))+(1-y^{i})\log(1-h_\beta(x^{i}))\right]\)
Where: \(h_{\beta}(x)=g(\beta^{T}x)\) and \(g(z)=\frac{1}{1+e^{-z}}\) (sigmoid function)
Intuition:
- if \(y_i=0\), the contribution is \(-\log(1-h_\beta(x_i))\), which tends to \(\infty\) as \(h_\beta(x_i)\) moves away from 0 (towards 1)
- Conversely, when \(y_i=1\), the contribution is \(-\log(h_\beta(x_i))\), which tends to \(\infty\) as \(h_\beta(x_i)\) moves away from 1 (towards 0)
Interpretation of the coefficients: the increase of \(\log{odds}\) for the increase of one unit of a predictor, given all the other predictors are fixed.
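A small sketch with glm in R; each fitted coefficient is the change in log-odds per unit increase of its predictor (mtcars and the am variable are used purely for illustration).
R code:
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)$coefficients               # coefficients on the log-odds scale
exp(coef(fit))                          # odds ratios
head(predict(fit, type = "response"))   # predicted probabilities P(y=1|x)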
Illustrative example: compare the proportion of an inscribed hypersphere with radius \(r\) and dimension d to that of a hypercube with edges of length \(2r\)
- Volume of such a sphere is \(V_{sphere}=\frac{2r^d\pi^{d/2}}{d\Gamma(d/2)}\)
- The volume of the cube is: \(V_{cube}=(2r)^d\)
As d increases (space dimension), the volume of hypersphere becomes insignificant relative to the volume of the hypercube:
\[\lim_{d\to\infty}\frac{V_{sphere}}{V_{cube}}=\frac{\pi^{d/2}}{d2^{d-1}\Gamma(d/2)}=0\]
- Nearly all of the dimensional space is far away from the center
- It consists almost entirely of the corners of the hypercube, with no middle!
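A quick numerical check of the ratio above in R.
R code:
ratio <- function(d) pi^(d / 2) / (d * 2^(d - 1) * gamma(d / 2))
sapply(c(2, 5, 10, 20, 50), ratio)   # the hypersphere's share of the hypercube vanishes quickly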
Pick an end of rope. Of the remaining \(2n-1\) ends of rope, only one end creates a loop (the other end of the same piece of rope), leaving \(n-1\) untied pieces of rope. The rest of the time, two separate pieces of rope are tied together and there are again effectively \(n-1\) untied pieces of rope. The recurrence for the expected number of loops is therefore:
\[L_n = \frac{1}{2n-1}\left(1 + L_{n-1}\right) + \frac{2n-2}{2n-1}L_{n-1} = L_{n-1} + \frac{1}{2n-1}\]
Clearly, \(L_1=1\), so:
\[L_n = \sum_{i=1}^{n} \frac{1}{2i-1}\]
Thanks to Brian Tung.
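A Monte Carlo check of this result is sketched below (the simulation repeatedly ties two randomly chosen free ends until none remain).
R code:
simulate_loops <- function(n) {
  ends  <- rep(1:n, each = 2)   # piece id of each free end
  loops <- 0
  while (length(ends) > 0) {
    pick <- sample(length(ends), 2)
    if (ends[pick[1]] == ends[pick[2]]) {
      loops <- loops + 1                              # both ends belong to the same piece: a loop closes
    } else {
      ends[ends == ends[pick[2]]] <- ends[pick[1]]    # two different pieces merge into one
    }
    ends <- ends[-pick]
  }
  loops
}
set.seed(3)
mean(replicate(10000, simulate_loops(5)))   # compare with sum(1 / (2 * (1:5) - 1)) = 1.7873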
Statistical significance can be accessed using hypothesis testing:
- Stating a null hypothesis, which is usually the opposite of what we wish to test (classifiers A and B perform equivalently, treatment A is equal to treatment B)
- Then, we choose a suitable statistical test and statistics used to reject the null hypothesis
- Also, we choose a critical region for the statistics to lie in that is extreme enough for the null hypothesis to be rejected (p-value)
- We calculate the observed test statistics from the data and check whether it lies in the critical region
Common tests:
- One sample Z test
- Two-sample Z test
- One sample t-test
- paired t-test
- Two sample pooled equal variances t-test
- Two sample unpooled unequal variances t-test and unequal sample sizes (Welch’s t-test)
- Chi-squared test for variances
- Chi-squared test for goodness of fit
- ANOVA (for instance: are two regression models equal? F-test)
- Regression F-test (i.e.: is at least one of the predictors useful in predicting the response?)
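A few of the tests listed above, sketched in R on simulated data.
R code:
set.seed(2)
x <- rnorm(30, mean = 5.3); y <- rnorm(30, mean = 5.0)
t.test(x, mu = 5)                                      # one-sample t-test
t.test(x, y, var.equal = TRUE)                         # two-sample pooled t-test
t.test(x, y)                                           # Welch's t-test (unequal variances)
chisq.test(table(sample(1:6, 600, replace = TRUE)))    # chi-squared goodness of fit (fair die?)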
Examples:
1) Natural language
- Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
- The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent…
- “The” accounts for 7% of all word occurrences (70000 over 1 million)
- “of” accounts for 3.5%, followed by “and”…
- Only 135 vocabulary items are needed to account for half the English corpus!
Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
File size distribution of Internet Traffic
Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
Importance in classification and regression problems:
- Skewed distribution
- Which metrics to use? Accuracy paradox (classification), F-score, AUC
- Issue when using models that make assumptions on the linearity (linear regression): need to apply a monotone transformation on the data (logarithm, square root, sigmoid function…)
- Issue when sampling: your data becomes even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla) or an anomaly-detection approach
The CLT states that the arithmetic mean of a sufficiently large number of iterates of independent random variables will be approximately normally distributed regardless of the underlying distribution. i.e: the sampling distribution of the sample mean is normally distributed.
- Used in hypothesis testing
- Used for confidence intervals
- Random variables must be iid: independent and identically distributed
- Finite variance
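A one-line illustration in R: means of samples from a skewed (exponential) distribution look approximately normal.
R code:
set.seed(10)
sample_means <- replicate(5000, mean(rexp(40, rate = 1)))
hist(sample_means, breaks = 40)   # roughly normal, centered at 1, with sd about 1/sqrt(40)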
Types:
- Sampling bias: systematic error due to a non-random sample of a population causing some members to be less likely to be included than others
- Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all the variables have similar means
- Data: “cherry picking”, when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common flights that complete safely)
- Studies: performing experiments and reporting only the most favorable results
- Can lead to inaccurate or even erroneous conclusions
- Statistical methods can generally not overcome it
Why can missing-data handling make it worse?
- Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
- Missing-data handling will amplify this effect, since imputation is based mostly on the HIV-negative respondents
- Prevalence estimates will therefore be inaccurate
Differences between observational and experimental data:
- Observational data: measures the characteristics of a population by studying individuals in a sample, but doesn’t attempt to manipulate or influence the variables of interest
- Experimental data: applies a treatment to individuals and attempts to isolate the effects of the treatment on a response variable
Observational data: find 100 women age 30 of which 50 have been smoking a pack a day for 10 years while the others have been smoke-free for 10 years. Measure lung capacity for each of the 100 women. Analyze, interpret and draw conclusions from the data.
Experimental data: find 100 women age 20 who don’t currently smoke. Randomly assign 50 of the 100 women to the smoking treatment and the other 50 to the no smoking treatment. Those in the smoking group smoke a pack a day for 10 years while those in the control group remain smoke free for 10 years. Measure lung capacity for each of the 100 women.
Analyze, interpret and draw conclusions from data.
Outliers:
- An observation point that is distant from other observations
- Can occur by chance in any distribution
- Often, they indicate measurement error or a heavy-tailed distribution
- Measurement error: discard them or use robust statistics
- Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
- Three-sigma rule (normally distributed data): about 1 in 22 observations will differ from the mean by twice the standard deviation or more
- Three-sigma rule: about 1 in 370 observations will differ from the mean by three times the standard deviation or more
Three-sigma rules example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number (Poisson distribution).
If the nature of the distribution is known a priori, it is possible to see if the number of outliers deviate significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers can be approximated with a Poisson distribution with lambda=pn. Example: if one takes a normal distribution with a cutoff 3 standard deviations from the mean, p=0.3% and thus we can approximate the number of samples whose deviation exceed 3 sigmas by a Poisson with lambda=3
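A quick version of this computation in R (n = 1000 observations, 3-sigma cutoff).
R code:
n <- 1000
p <- 2 * pnorm(-3)     # ~0.0027: probability of deviating from the mean by more than 3 sigma
lambda <- n * p        # ~2.7, approximated by 3 in the text
ppois(5, lambda)       # probability of observing at most 5 such points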
Identifying outliers:
- No rigid mathematical method
- Subjective exercise: be careful
- Boxplots
- QQ plots (sample quantiles Vs theoretical quantiles)
Handling outliers:
- Depends on the cause
- Retention: when the underlying model is confidently known
- Regression problems: only exclude points which exhibit a large degree of influence on the estimated coefficients (Cook’s distance)
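A sketch of this check in R with cooks.distance; flagging points above 4/n is a common (not universal) rule of thumb, and mtcars is just an example data set.
R code:
fit <- lm(mpg ~ wt + hp, data = mtcars)
d <- cooks.distance(fit)
d[d > 4 / nrow(mtcars)]   # observations with large influence on the coefficients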
Inlier:
- Observation lying within the general distribution of other observed values
- Doesn’t perturb the results but are non-conforming and unusual
- Simple example: observation recorded in the wrong unit (°F instead of °C)
Identifying inliers:
- Mahalanobis distance
- Used to calculate the distance between two random vectors
- Difference with Euclidean distance: accounts for correlations
- Discard them
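A minimal sketch with the built-in mahalanobis() function in R (iris as a stand-in data set).
R code:
X  <- iris[, 1:4]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared Mahalanobis distances
head(sort(d2, decreasing = TRUE))                          # the most unusual observations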
Distribution: lognormal?
Test graphically with QQ plot: sample quantiles of \(\log(durations)\) Vs normal quantiles
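A sketch of that check in R; durations is simulated here as a stand-in for the real variable.
R code:
durations <- rlnorm(500, meanlog = 3, sdlog = 0.8)   # hypothetical lognormal durations
qqnorm(log(durations)); qqline(log(durations))       # points on a straight line => log(durations) ~ normal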
Advantages:
- Cost
- Large coverage of population
- Captures individuals who may not respond to surveys
- Regularly updated, allow consistent time-series to be built-up
Disadvantages:
- Restricted to data collected for administrative purposes (limited to administrative definitions. For instance: incomes of a married couple, not individuals, which can be more useful)
- Lack of researcher control over content
- Missing or erroneous entries
- Quality issues (addresses may not be updated or a postal code is provided only)
- Data privacy issues
- Underdeveloped theories and methods (sampling methods…)
Lift:
It’s a measure of the performance of a targeting model (or a rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. Lift is simply: target response/average response.
Suppose a population has an average response rate of 5% (mailing for instance). A certain model (or rule) has identified a segment with a response rate of 20%, then lift=20/5=4
Typically, the modeler seeks to divide the population into quantiles, and rank the quantiles by lift. He can then consider each quantile, and by weighing the predicted response rate against the cost, he can decide to market that quantile or not.
“if we use the probability scores on customers, we can get 60% of the total responders we’d get mailing randomly by only mailing the top 30% of the scored customers”.
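A sketch of a lift-by-decile computation in R on simulated scores and responses (all numbers are made up).
R code:
set.seed(5)
score    <- runif(1000)                                   # hypothetical model scores
response <- rbinom(1000, 1, plogis(4 * score - 3))        # simulated responses
decile <- cut(score, quantile(score, 0:10 / 10), include.lowest = TRUE, labels = 1:10)
lift <- tapply(response, decile, mean) / mean(response)   # response rate per decile / overall rate
round(lift, 2)                                            # decile 10 holds the highest scores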
KPI:
- Key performance indicator
- A type of performance measurement
- Examples: 0 defects, 10/10 customer satisfaction
- Relies upon a good understanding of what is important to the organization
More examples:
Marketing & Sales:
- New customer acquisition
- Customer attrition
- Revenue (turnover) generated by segments of the customer population
- Often done with a data management platform
IT operations:
- Mean time between failure
- Mean time to repair
Robustness:
- Statistics with good performance even if the underlying distribution is not normal
- Statistics that are not affected by outliers
- A learning algorithm that can reduce the chance of fitting noise is called robust
- Median is a robust measure of central tendency, while mean is not
- Median absolute deviation is also more robust than the standard deviation
Model fitting:
- How well a statistical model fits a set of observations
- Examples: AIC, \(R^2\), Kolmogorov-Smirnov test, Chi 2, deviance (glm)
Design of experiments:
The design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation.
In its simplest form, an experiment aims at predicting the outcome by changing the preconditions, the predictors.
- Selection of the suitable predictors and outcomes
- Delivery of the experiment under statistically optimal conditions
- Randomization
- Blocking: an experiment may be conducted with the same equipment to avoid any unwanted variations in the input
- Replication: performing the same combination run more than once, in order to get an estimate for the amount of random error that could be part of the process
- Interaction: when an experiment has 3 or more variables, the situation in which the joint effect of two variables on a third is not additive
80/20 rule:
- Pareto principle
- 80% of the effects come from 20% of the causes
- 80% of your sales come from 20% of your clients
- 80% of a company’s complaints come from 20% of its customers
Quality assurance:
- A way of preventing mistakes or defects in manufacturing products or when delivering services to customers
- In a machine learning context: anomaly detection
Six sigma:
- Set of techniques and tools for process improvement
- 99.99966% of products are defect-free (3.4 defects per 1 million)
- Specification limits are set 6 standard deviations from the process mean
Root cause analysis:
- Method of problem solving used for identifying the root causes or faults of a problem
- A factor is considered a root cause if removal of it prevents the final undesirable event from recurring
Identify a cause vs. a correlation:
- Correlation: statistical measure that describes the size and direction of a relationship between two or more variables. A correlation between two variables doesn’t imply that the change in one variable is the cause of the change in the values of the other variable
- Causation: indicates that one event is the result of the occurrence of the other event; there is a causal relationship between the two events
- Differences between the two types of relationships are easy to identify, but establishing a cause and effect is difficult
Example: sleeping with one’s shoes on is strongly correlated with waking up with a headache. Correlation-implies-causation fallacy: therefore, sleeping with one’s shoes on causes headaches.
More plausible explanation: both are caused by a third factor: going to bed drunk.
Identify a cause Vs a correlation: use of a controlled study
- In medical research, one group may receive a placebo (control) while the other receives a treatment. If the two groups have noticeably different outcomes, the different experiences may have caused the different outcomes
When data is skewed
Estimate a population mean:
- The general formula is \(ME=t\times \frac{s}{\sqrt n}\) or \(ME=z\times \frac{s}{\sqrt n}\)
- \(ME\) is the desired margin of error
- \(t\) is the t score or z score that we need to use to calculate our confidence interval
- \(s\) is the standard deviation
Example: we would like to start a study to estimate the average internet usage of households in one week for our business plan. How many households must we randomly select to be 95% sure that the sample mean is within 1 minute of the true mean of the population? A previous survey of household usage has shown a standard deviation of 6.95 minutes.
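A quick computation of this example in R, assuming a 95% confidence level (z ≈ 1.96).
R code:
ceiling((qnorm(0.975) * 6.95 / 1)^2)
## [1] 186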
Estimate a proportion:
- Similar: \(ME=z \times \sqrt{\frac{p(1-p)}{n}}\)
Example: a professor in Harvard wants to determine the proportion of students who support gay marriage. She asks “how large a sample do I need?”
She wants a margin of error of less than 2.5%, she has found a previous survey which indicates a proportion of 30%.
\(n=\frac{z^2\times 0.3\times 0.7}{0.025^2}\)
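The same computation for the proportion example, again assuming a 95% confidence level.
R code:
ceiling(qnorm(0.975)^2 * 0.3 * 0.7 / 0.025^2)
## [1] 1291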
Selection bias:
- An online survey about computer use is likely to attract people more interested in technology than is typical
Under coverage bias:
- Sample too few observations from a segment of population
Survivorship bias:
- Observations at the end of the study are a non-random set of those present at the beginning of the investigation
- In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist
Notes:
- Randomization: in randomized control trials, research participants are assigned by chance, rather than by choice to either the experimental group or the control group.
- Random sampling: obtaining data that is representative of the population of interest
Bayes rule: \(P(Actu_+|Pred_+)=\frac{P(Pred_+|Actu_+)\times P(Actu_+)}{P(Pred_+|Actu_+)\times P(Actu_+)+P(Pred_+|Actu_-)P(Actu_-)}\)
We have: \(\frac{sensitivity\times prevalence}{sensitivity\times prevalence + (1-specificity)\times (1-prevalence)}=\frac{0.997\times 0.001}{0.997\times 0.001 + 0.015\times 0.999}\approx 0.062\)
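Spelled out in R with the same numbers (sensitivity 99.7%, specificity 98.5%, prevalence 0.1%).
R code:
sensitivity <- 0.997; specificity <- 0.985; prevalence <- 0.001
(sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))   # ~0.062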
One-sided test, assume a Poisson distribution
H0: lambda = 0.01; H1: lambda < 0.01 (the observed rate of 10/1787 per person-day is below the benchmark)
R code:
ppois(10,1787*0.01)
## [1] 0.03237153
\(X \sim N(1020,50)\). The corresponding quantile: \(z=\frac{1200-1020}{50}=3.6\)
R code:
pnorm(3.6,lower.tail=F)
## [1] 0.0001591086
\(X \sim Poisson(\lambda=2.5 \times t)\)
R code:
ppois(3,lambda=2.5*4)
## [1] 0.01033605
Quick:
- Intervals take the form \(p \pm z \times \sqrt{\frac{1}{n} \times p \times (1-p)}\)
- We know that \(p(1-p)\) is maximized at \(\frac{1}{2}\) and \(z=1.96\) is the relevant quantile for a 95% confidence interval
- So: \(p \pm \frac{1}{\sqrt{n}}\) is a quick estimate for \(p\)
- Here: \(\frac {1}{\sqrt{100}}=0.1\), so the quick 95% interval is roughly \([46\%, 66\%]\)
- It’s not enough!
1100+c(-1,1)*qt(0.975,df=9-1)*10
## [1] 1076.94 1123.06
2*3/qt(0.975,df=8)
## [1] 2.601903
t confidence interval for the difference of the means assuming equal variances:
\((new-old) \pm t' \times s_p \times \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\)
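A sketch of this interval computed by hand in R on two hypothetical samples; it matches t.test(..., var.equal = TRUE).
R code:
new <- c(3.1, 3.4, 2.9, 3.6, 3.2); old <- c(2.8, 3.0, 2.7, 3.1, 2.9)     # made-up samples
n1 <- length(new); n2 <- length(old)
sp <- sqrt(((n1 - 1) * var(new) + (n2 - 1) * var(old)) / (n1 + n2 - 2))  # pooled standard deviation
mean(new) - mean(old) + c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * sp * sqrt(1 / n1 + 1 / n2)
t.test(new, old, var.equal = TRUE)$conf.int                              # same interval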
“Premature optimization is the root of all evil”; Donald Knuth
Parallel processing: for instance in R with a single machine.
- doParallel and foreach package
- doParallel: parallel backend, will select n-cores of the machine
- foreach: assigns tasks to each core
- using Hadoop on a single node
- using Hadoop on multi-node
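A minimal sketch with doParallel and foreach (both assumed installed); each iteration of the loop runs on its own worker.
R code:
library(doParallel)
library(foreach)
cl <- makeCluster(2)            # parallel backend with 2 cores
registerDoParallel(cl)
res <- foreach(i = 1:4, .combine = c) %dopar% {
  mean(rnorm(1e6, mean = i))    # some embarrassingly parallel work
}
stopCluster(cl)
res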
Faster algorithm:
- In computer science: Pareto principle; 90% of the execution time is spent executing 10% of the code
- Data structure: affect performance
- Caching: avoid unnecessary work
- Improve source code level
For instance: on early C compilers, WHILE(something) was slower than FOR(;;), because WHILE evaluated “something” and then had a conditional jump testing whether it was true, while FOR had an unconditional jump.
Telemedicine
- Heart patients wear a specialized monitor which gathers information regarding the heart’s state
- The collected data is sent to an electronic implanted device which sends back electric shocks to the patient for correcting incorrect rhythms
Product restocking
- Vending machines are capable of messaging the distributor whenever an item is running out of stock
R
- Focuses on better, user friendly data analysis, statistics and graphical models
- The closer you are to statistics, data science and research, the more you might prefer R
- Statistical models can be written with only a few lines in R
- The same piece of functionality can be written in several ways in R
- Mainly used for standalone computing or analysis on individual servers
- Large number of packages, for anything!
Python
- Used by programmers that want to delve into data science
- The closer you are working in an engineering environment, the more you might prefer Python
- Coding and debugging is easier mainly because of the nice syntax
- Any piece of functionality is always written the same way in Python
- When data analysis needs to be implemented with web apps
- Good tool to implement algorithms for production use
Hash tables:
- Average case \(O(1)\) lookup time
- Lookup time doesn’t depend on size
Even in terms of memory:
- \(O(n)\) memory
- Space scales linearly with number of elements
- Many small hash tables won’t take up significantly less space than one large one
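For illustration, R environments are hash tables with average O(1) lookup (a small sketch with hypothetical keys).
R code:
h <- new.env(hash = TRUE)
assign("user_42", list(name = "Ada", score = 10), envir = h)
get("user_42", envir = h)
exists("user_7", envir = h)   # FALSE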
In-database analytics:
- Integration of data analytics in data warehousing functionality
- Much faster, and corporate information is more secure: it doesn’t leave the enterprise data warehouse
Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines
The star schema is a traditional database schema with a central (fact) table (the “observations”, with database “keys” for joining with satellite tables, and with several fields encoded as ID’s). Satellite tables map ID’s to physical name or description and can be “joined” to the central fact table using the ID fields; these tables are known as lookup tables, and are particularly useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster.
Lookup tables:
- An array that replaces a runtime computation with a simpler array-indexing operation
Data acquisition
Acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, would be acquired following the established processes.
Data preparation
Also called data wrangling: cleaning the data and shaping it into a suitable form for later analyses. Involves exploratory data analysis and feature extraction.
Hypothesis & modelling
As in data mining, but applied to all the data rather than to samples: machine learning techniques are applied to the full data set. A key sub-step: model selection. This involves preparing a training set for model candidates, and validation and test sets for comparing model performances, selecting the best performing model, gauging model accuracy and preventing overfitting
Evaluation & interpretation
Steps 2 to 4 are repeated a number of times as needed; as the understanding of data and business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These may sometimes include step 5 and be performed in a pre-production environment.
Deployment
Operations
Regular maintenance and operations. Includes performance tests to measure model performance, and can alert when performance goes beyond a certain acceptable threshold
Optimization
Can be triggered by failing performance, or due to the need to add new data sources and retraining the model or even to deploy new versions of an improved model
Note: with increasing maturity and well-defined project goals, pre-defined performance can help evaluate feasibility of the data science project early enough in the data-science life cycle. This early comparison helps the team refine hypothesis, discard the project if non-viable, change approaches.
You want to update an algorithm when:
- You want the model to evolve as data streams through infrastructure
- The underlying data source is changing
- Example: a retail store model that remains accurate as the business grows
- Dealing with non-stationarity
Some options:
- Incremental algorithms: the model is updated every time it sees a new training example
Note: simple, and you always have an up-to-date model, but you can’t weight different parts of the data differently.
Sometimes mandatory: when data must be discarded once seen (privacy)
- Periodic re-training in “batch” mode: simply buffer the relevant data and update the model every so often
Note: more decisions and more complex implementations
How frequently?
- Is the sacrifice worth it?
- Data horizon: how quickly do you need the most recent training example to be part of your model?
- Data obsolescence: how long does it take before data is irrelevant to the model? Are some older instances more relevant than the newer ones?
Economics: generally, newer instances are more relevant than older ones. However, data from the same month, quarter or year of the last year can be more relevant than the same periods of the current year. In a recession period: data from previous recessions can be more relevant than newer data from different economic cycles.
All visual elements in charts and graphs that are not necessary to comprehend the information represented, or that distract the viewer from this information
Examples of unnecessary elements include:
- Unnecessary text
- Heavy or dark grid lines
- Ornamented chart axes
- Pictures
- Background
- Unnecessary dimensions
- Elements depicted out of scale to one another
- 3-D simulations in line or bar charts
Other answer:
- Depends on the context
- Is error acceptable? Fraud detection, quality assurance
Big data is high volume, high velocity and/or high variety information assets that require new forms of processing
- Volume: big data doesn’t sample, just observes and tracks what happens
- Velocity: big data is often available in real-time
- Variety: big data comes from texts, images, audio, video…
Difference big data/business intelligence:
- Business intelligence uses descriptive statistics with data with high density information to measure things, detect trends etc.
- Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low density information to reveal relationships and dependencies or to perform prediction of outcomes or behaviors
Long format: one column containing the values and another column listing the context of each value, e.g. columns Fam_id, year, fam_inc
Wide format: each different variable in a separate column, e.g. columns Fam_id, fam_inc96, fam_inc97, fam_inc98
Long Vs Wide:
- Data manipulations are much easier when data is in the wide format: summarize, filter
- Program requirements
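A sketch of converting between the two formats with base R’s reshape() on made-up family-income data.
R code:
wide <- data.frame(fam_id = 1:2,
                   fam_inc96 = c(40, 52), fam_inc97 = c(42, 55), fam_inc98 = c(45, 50))
long <- reshape(wide, direction = "long",
                varying = c("fam_inc96", "fam_inc97", "fam_inc98"),
                v.names = "fam_inc", timevar = "year", times = 96:98, idvar = "fam_id")
long   # one row per family-year, with columns fam_id, year, fam_inc
# reshape(long, direction = "wide", idvar = "fam_id", timevar = "year") goes back to wide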
Pareto rule:
- 80% of the effects come from 20% of the causes
- 80% of the sales come from 20% of the customers
Computer science: “simple and inexpensive beats complicated and expensive” - Rod Elder
Finance, rule of 72:
- Estimate the time needed for a money investment to double
- 100$ at a rate of 9%: 72/9=8 years
Rule of three (Economics):
- There are always three major competitors in a free market within one industry
Google API (Google Analytics, Picasa), Twitter API (interact with Twitter functions), GitHub API, LinkedIn API (users data)…
Bad visualization:
- Pie charts: difficult to make comparisons between items when area is used, especially when there are lots of items
- Color choice for classes: abundant use of red, orange and blue. Readers can think that the colors could mean good (blue) versus bad (orange and red) whereas these are just associated with a specific segment
- 3D charts: can distort perception and therefore skew data
- Using a solid line in a line chart: dashed and dotted lines can be distracting
Good visualization:
- Heat map with a single color: some colors stand out more than others, giving more weight to that data. A single color with varying shades show the intensity better
- Adding a trend line (regression line) to a scatter plot helps the reader identify trends