Association analysis examines how variables relate to each other, quantifying the strength, direction, and nature of relationships. Understanding associations is fundamental for identifying patterns, making predictions, and uncovering causal mechanisms in business analytics.
Correlation Analysis
Correlation measures the strength and direction of linear relationships between two variables, producing coefficients ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
Types of Correlation Coefficients:

Pearson Correlation (r):
- Measures linear relationships between continuous variables
- Assumes normally distributed data
- Sensitive to outliers
- Interpretation (by absolute value): |r| = 0.7-1.0 (strong), 0.4-0.7 (moderate), 0.1-0.4 (weak)

Spearman Rank Correlation (ρ):
- Measures monotonic relationships (not necessarily linear)
- Non-parametric alternative to Pearson
- Works with ordinal data
- Robust to outliers and non-normal distributions

Kendall’s Tau (τ):
- Measures ordinal associations
- Better suited to small sample sizes
- Handles tied ranks more robustly than Spearman
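To illustrate, here is a minimal Python sketch computing all three coefficients on synthetic data; the variable names and generated values are placeholders, not figures from this text:

```python
# Minimal sketch: comparing Pearson, Spearman, and Kendall on synthetic,
# roughly linear data. Variable names and values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ad_spend = rng.uniform(10, 100, size=200)          # e.g., monthly ad spend
revenue = 3.5 * ad_spend + rng.normal(0, 40, 200)  # noisy linear relationship

pearson_r, p_pearson = stats.pearsonr(ad_spend, revenue)
spearman_rho, p_spearman = stats.spearmanr(ad_spend, revenue)
kendall_tau, p_kendall = stats.kendalltau(ad_spend, revenue)

print(f"Pearson r    = {pearson_r:.3f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {p_spearman:.3g})")
print(f"Kendall tau  = {kendall_tau:.3f} (p = {p_kendall:.3g})")
```

On clean linear data all three agree in sign and rough magnitude; they diverge when outliers or monotonic-but-non-linear patterns appear, which is a useful diagnostic in itself.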
Key Considerations:
- Correlation ≠ Causation: Strong correlation doesn’t imply one variable causes changes in another
- Spurious correlations: Unrelated variables may correlate due to confounding factors or coincidence
- Non-linear relationships: Correlation coefficients may miss curved or complex patterns
- Range restrictions: Limited variable ranges can artificially reduce correlation
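To make the non-linear caveat concrete, the sketch below constructs a perfect quadratic relationship whose Pearson and Spearman coefficients are both near zero (the data are synthetic):

```python
# Sketch: a deterministic quadratic relationship that both Pearson and
# Spearman miss. Pearson fails because the relation is not linear;
# Spearman fails because it is not even monotonic.
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 200)
y = x ** 2  # perfectly determined by x, yet "uncorrelated"

print(f"Pearson r:    {stats.pearsonr(x, y)[0]:.3f}")   # ~0.0
print(f"Spearman rho: {stats.spearmanr(x, y)[0]:.3f}")  # ~0.0
```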
Business example: A correlation of r = 0.82 between advertising spend and sales revenue suggests a strong positive linear relationship, but further analysis is needed to establish causality and to account for seasonal effects, competitor actions, and other confounding factors.
Regression Analysis
Regression models relationships between variables to predict outcomes, estimate effects, and understand dependencies. It quantifies how changes in independent variables (predictors) relate to changes in dependent variables (outcomes).
Simple Linear Regression:
- Models the relationship between one predictor and one outcome
- Equation: Y = β₀ + β₁X + ε
- β₀ (intercept): predicted Y when X = 0
- β₁ (slope): change in Y for a one-unit increase in X
- ε (error term): unexplained variation
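As a concrete sketch, scipy's linregress estimates β₀ and β₁ by ordinary least squares; the data below are synthetic placeholders:

```python
# Sketch: simple linear regression with scipy. The data are synthetic;
# the true intercept is 20 and the true slope is 2.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 50, 100)                 # predictor
y = 20 + 2.5 * x + rng.normal(0, 10, 100)   # outcome with noise (epsilon)

fit = stats.linregress(x, y)
print(f"intercept (beta_0) = {fit.intercept:.2f}")
print(f"slope (beta_1)     = {fit.slope:.2f}")
print(f"R-squared          = {fit.rvalue ** 2:.3f}")
```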
Multiple Linear Regression:
- Models relationships with multiple predictors
- Equation: Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
- Controls for confounding variables
- Allows estimation of partial effects
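A minimal sketch of the multiple case using statsmodels OLS, again on synthetic data with two predictors:

```python
# Sketch: multiple linear regression via statsmodels OLS. Data are
# synthetic (true values: beta0 = 5, beta1 = 1.2, beta2 = -0.8).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))                                  # two predictors
y = 5 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)

X_design = sm.add_constant(X)      # prepend a column of 1s for the intercept
model = sm.OLS(y, X_design).fit()
print(model.params)                # [beta0, beta1, beta2] estimates
print(model.summary())             # coefficients, std errors, R^2, diagnostics
```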
Model Evaluation Metrics:
R-squared (R²):
- Proportion of variance explained by the model (0 to 1)
- Higher values indicate better fit
- Can be inflated by adding predictors

Adjusted R-squared:
- Penalizes model complexity
- Preferred for comparing models with different numbers of predictors

Root Mean Squared Error (RMSE):
- Average prediction error in the original units of the outcome
- Lower values indicate better predictions

Residual Analysis:
- Examining prediction errors to validate assumptions
- Checking for patterns, heteroscedasticity, and normality
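These metrics are straightforward to compute by hand; here is a minimal sketch with hypothetical inputs (y_true, y_pred, and k, the number of predictors):

```python
# Sketch: R^2, adjusted R^2, and RMSE computed from actuals and predictions.
import numpy as np

def evaluate(y_true, y_pred, k):
    resid = y_true - y_pred                        # residuals (prediction errors)
    ss_res = np.sum(resid ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    n = len(y_true)
    r2 = 1 - ss_res / ss_tot                       # variance explained
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalized for k predictors
    rmse = np.sqrt(np.mean(resid ** 2))            # error in original units
    return r2, r2_adj, rmse

y_true = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
y_pred = np.array([10.5, 11.5, 14.0, 11.5, 14.5])
print(evaluate(y_true, y_pred, k=1))
```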
Key Assumptions:
- Linearity: Relationship between variables is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of errors
- Normality: Residuals are normally distributed
- No multicollinearity: Predictors are not highly correlated (for multiple regression)
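Several of these assumptions can be checked directly. The sketch below, reusing a synthetic fit like the one above, applies three common diagnostics from statsmodels:

```python
# Sketch: assumption checks on a synthetic OLS fit.
# - Breusch-Pagan tests for heteroscedasticity (small p suggests a violation)
# - Durbin-Watson checks error independence (values near 2 are reassuring)
# - VIF flags multicollinearity (values above ~5-10 are a warning sign)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
y = 5 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)
X_design = sm.add_constant(X)
model = sm.OLS(y, X_design).fit()

bp_stat, bp_p, _, _ = het_breuschpagan(model.resid, X_design)
print(f"Breusch-Pagan p = {bp_p:.3f}")
print(f"Durbin-Watson   = {durbin_watson(model.resid):.2f}")
for i in range(1, X_design.shape[1]):   # skip the constant column
    print(f"VIF predictor {i} = {variance_inflation_factor(X_design, i):.2f}")
```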
Types of Relationships:
Positive Association:
- Both variables increase together
- Example: Education level and income

Negative Association:
- One variable increases as the other decreases
- Example: Price and demand

No Association:
- Variables are unrelated
- Changes in one don’t predict changes in the other

Non-linear Association:
- Curved or complex relationships
- May require polynomial regression or variable transformation
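To illustrate the last case, a quadratic fit via numpy.polyfit captures a curved association that a straight line misses (synthetic data):

```python
# Sketch: linear vs. quadratic fit on data with a curved relationship.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x - 0.3 * x ** 2 + rng.normal(0, 2, 100)

linear = np.polyfit(x, y, deg=1)   # straight-line coefficients
quad = np.polyfit(x, y, deg=2)     # quadratic coefficients

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"linear R^2:    {r2(y, np.polyval(linear, x)):.3f}")
print(f"quadratic R^2: {r2(y, np.polyval(quad, x)):.3f}")
```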
Practical Applications in Business:
Predictive Modeling:
- Sales forecasting based on historical data
- Customer lifetime value prediction
- Demand estimation

Risk Assessment:
- Credit scoring models
- Insurance premium calculation
- Investment risk evaluation

Optimization:
- Pricing strategies
- Resource allocation
- Marketing mix optimization

Causal Inference:
- Treatment effect estimation
- A/B testing analysis
- Policy impact evaluation
Business example: A subscription-based software company builds a multiple regression model to predict customer churn:
Churn Probability = β₀ + β₁(Login_Frequency) + β₂(Support_Tickets) + β₃(Feature_Usage) + β₄(Contract_Length) + ε
Results show:
- Login frequency has a strong negative effect (β₁ = -0.35): more active users are less likely to churn
- Support tickets have a positive effect (β₂ = 0.18): customers with issues are more likely to leave
- Feature usage has a moderate negative effect (β₃ = -0.22)
- Contract length has a strong negative effect (β₄ = -0.41): longer commitments reduce churn
The model achieves R² = 0.67, explaining 67% of churn variance. These insights inform retention strategies: improving onboarding to increase feature adoption, proactively addressing support issues, and incentivizing longer contracts.
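A hedged sketch of such a model in statsmodels: the column names follow the equation above, but the data-generating process is entirely synthetic, invented only so the example runs. Note that for a 0/1 churn outcome, logistic regression (smf.logit) is the more conventional choice than the linear probability model shown here.

```python
# Sketch: churn model fit as a linear probability model with statsmodels.
# All data are synthetic; the coefficient signs are chosen to loosely
# mirror the results described in the text, not to reproduce them.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "login_frequency": rng.poisson(12, n),
    "support_tickets": rng.poisson(2, n),
    "feature_usage": rng.uniform(0, 1, n),
    "contract_length": rng.choice([1, 12, 24], n),
})
# Hypothetical 0/1 churn indicator with the signs reported above
score = (-0.5 - 0.10 * df.login_frequency + 0.30 * df.support_tickets
         - 1.00 * df.feature_usage - 0.05 * df.contract_length)
df["churned"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-score))).astype(int)

lpm = smf.ols("churned ~ login_frequency + support_tickets"
              " + feature_usage + contract_length", data=df).fit()
print(lpm.params)      # estimated coefficients; signs should mirror the text
print(lpm.rsquared)    # proportion of churn variance explained
```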