Analytical Skills for Business (WS 2025/26)

Business Administration (M. A.)

Author
Affiliations

© Benjamin Gross

Hochschule Fresenius - University of Applied Science

Email: benjamin.gross@ext.hs-fresenius.de

Website: https://drbenjamin.github.io

Published

05.11.2025 12:38

Abstract

This document holds the course material for the Analytical Skills for Business course in the Master of Business Administration program. It discusses version control systems such as Git and GitHub for efficient team collaboration, offers an overview of no-code and low-code tools for data analytics including Tableau, Power BI, QlikView, makeML, PyCaret, RapidMiner, and KNIME, and introduces key programming languages such as R, Python, and SQL alongside essential programming concepts like syntax, libraries, variables, functions, objects, conditions, and loops. In addition, it covers working with modern development environments, including Unix-like systems, containers, APIs, Jupyter, and RStudio, and sets expectations for project submissions and evaluation.

Introduction

Computer science is the study of computers and computation, spanning theoretical and algorithmic foundations, the design of hardware and software, and practical uses of computing to process information. It encompasses core areas such as

  • algorithms and data structures
  • computer architecture
  • programming languages and software engineering
  • databases and information systems
  • networking and communications
  • graphics and visualization
  • human-computer interaction
  • intelligent systems.

The field draws on mathematics and engineering—using concepts like

  • binary representation
  • Boolean logic
  • complexity analysis

to reason about what can be computed and how efficiently.

Emerging in the 1960s as a distinct discipline, computer science now sits alongside computer engineering, information systems, information technology, and software engineering within the broader computing family. Its reach is inherently interdisciplinary, intersecting with domains from the natural sciences to business and the social sciences. Beyond technical advances, the discipline engages with societal and professional issues, including

  • reliability
  • security
  • privacy
  • intellectual property

in a networked world (Britannica, 2025).

Implementing version control systems

Version control systems are essential tools for managing code, tracking changes, and facilitating collaborative development in modern development projects (Çetinkaya-Rundel & Hardin, 2021). These systems enable teams to work efficiently on shared codebases while maintaining a complete history of all modifications, ensuring reproducibility and accountability in data analysis workflows.

Core Concepts

Version control systems provide systematic approaches to managing changes in documents, programs, and other collections of information:

  • Repository: A central storage location containing all project files and their complete revision history
  • Commit: A snapshot of the project at a specific point in time, representing a set of changes
  • Branch: An independent line of development allowing parallel work on different features
  • Merge: The process of integrating changes from different branches back together

Source: https://uidaholib.github.io/get-git/1why.html

Git: Distributed Version Control

Git is a distributed version control system that tracks changes in files and coordinates work among multiple contributors. It was created by Linus Torvalds (creator of Linux) in 2005 and has since become the de facto standard for version control in software development. Key characteristics include:

Local Repository: Each user maintains a complete copy of the project history, enabling offline work and faster operations.

Staging Area: An intermediate area where changes are prepared before being committed to the repository.

Branching and Merging: Lightweight branching allows for experimental development without affecting the main codebase. Merging integrates changes from different branches. In open-source projects, Pull Requests are often used to propose and discuss changes before merging; in the corporate world, the same mechanism is often called a Merge Request. There is also a difference between merging and rebasing: merging creates a new commit that combines the histories of two branches, while rebasing rewrites the commit history to create a linear sequence of changes, as shown in the figure below.

Source: https://i0.wp.com/digitalvarys.com/wp-content/uploads/2020/03/Git-Merge-and-Rebase.png?resize=1536%2C843&ssl=1

Distributed Workflow: No single point of failure, as every user has a complete backup of the project.
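As a minimal sketch of this workflow, the Git command line can be driven from R via system(); the file names, branch names, and commit messages below are illustrative assumptions, not part of the course repository.

# Initializing a local repository and committing a first snapshot
system("git init")
system("git add analysis.R")                          # staging a file
system('git commit -m "Add initial analysis script"') # committing the staged change

# Creating a feature branch, committing work on it, and merging it back
system("git switch -c feature/clean-data")            # lightweight branch
system('git commit -am "Clean raw sales data"')       # commit after editing tracked files
system("git switch main")                             # back to the main branch
system("git merge feature/clean-data")                # merge combines both histories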

GitHub: Cloud-Based Collaboration Platform

GitHub is a web-based hosting service for Git repositories that adds collaboration features and project management tools:

  • Remote Repositories: Centralized storage accessible from anywhere with internet connectivity.
  • Pull Requests: Structured code review process for integrating changes.
  • Issue Tracking: Built-in project management for tracking bugs and feature requests.
  • Actions and CI/CD: Automated workflows for testing and deployment.
  • Documentation: Integrated wiki and README support for project documentation.

The combination of Git and GitHub creates a powerful ecosystem for collaborative analytics projects, ensuring code quality, facilitating peer review, and maintaining comprehensive project documentation (GeeksforGeeks, 2024).

See also Collaborating with Git and GitHub by Prof. Dr. Huber for more on using Git and GitHub for collaboration.

For students, GitHub offers a free educational plan with additional features! It includes access to GitHub Copilot, an AI-based code completion tool. Such tools are powerful aids for your coding activities, but they are not a replacement for learning programming and coding yourself, as you can see on this image:

Source: LinkedIn

Comparison of Git and GitHub

This image shows the integration of Git in GitHub:

Source: https://marce10.github.io/ciencia_reproducible/intro_a_git_y_github.html

Business Analytics Applications

In business analytics contexts, version control systems provide:

  • Reproducible Analysis: Complete tracking of analytical scripts and data processing steps
  • Collaborative Research: Multiple analysts can work simultaneously on different aspects of projects
  • Model Versioning: Systematic management of machine learning models and their evolution
  • Data Governance: Audit trails for compliance and regulatory requirements
  • Backup and Recovery: Protection against data loss and accidental modifications

Further documentation

For more coverage of version control concepts, implementation strategies, and best practices, see:

Understanding version control systems is fundamental for modern business analytics, enabling collaborative development, ensuring reproducibility, and maintaining professional standards in data science projects.

Overview of programming languages

  • R: R is a programming language and free software environment used for statistical computing and graphics supported by the R Foundation for Statistical Computing. It is widely used among statisticians and data miners for developing statistical software and data analysis.
  • Python: Python is a versatile programming language widely used in data science and analytics. It has a rich ecosystem of libraries such as Pandas, NumPy, and Matplotlib that facilitate data manipulation, analysis, and visualization. It can also be used to develop software as it is a general-purpose programming language which utilizes an object-oriented programming paradigm.
  • SQL: SQL (Structured Query Language) is the standard language for managing and querying relational databases. It is essential for data extraction, transformation, and loading (ETL) or extraction, loading and transformation (ELT) processes in analytics workflows. Modern databases like PostgreSQL, MySQL, and SQLite use SQL for data manipulation and retrieval and just have slightly different dialects.
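As a small illustration of how these languages interact, the sketch below runs a SQL query from R against an in-memory SQLite database; the table and column names are made up for the example and assume the DBI and RSQLite packages are installed.

# Loading necessary libraries
library(DBI)
library(RSQLite)

# Creating an in-memory SQLite database and loading a small example table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders", data.frame(
  region  = c("North", "South", "North", "West"),
  revenue = c(1200, 800, 950, 400)
))

# Running a SQL query from R: total revenue per region
result <- dbGetQuery(con, "
  SELECT region, SUM(revenue) AS total_revenue
  FROM orders
  GROUP BY region
  ORDER BY total_revenue DESC
")
print(result)

# Closing the database connection
dbDisconnect(con)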

Elements of programming languages

  • Syntax: the set of rules that defines the combinations of symbols that are considered to be correctly structured programs in that language.
  • Libraries: collections of pre-written code that users can call upon to save time and effort.
  • Variables: named storage locations in a program that hold values.
  • Functions: reusable blocks of code that perform a specific task.
  • Objects: instances of classes that encapsulate data and behavior.
  • Conditions: statements that control the flow of execution based on certain criteria.
  • Loops: constructs that repeat a block of code multiple times.
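The short R sketch below illustrates these elements in one place; the variable names and values are illustrative only.

# Library: a collection of pre-written code
library(tidyverse)

# Variable: a named storage location holding a value
net_price <- 100

# Function: a reusable block of code that performs a specific task
add_vat <- function(price, rate = 0.19) {
  price * (1 + rate)
}

# Condition: controlling the flow of execution based on a criterion
if (net_price > 50) {
  print("Free shipping applies")
}

# Loop: repeating a block of code multiple times
for (price in c(100, 250, 399)) {
  print(add_vat(price))
}

# Object: a fitted model is an object encapsulating data and behavior
model <- lm(mpg ~ wt, data = mtcars)
print(class(model))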

Development environments

  • Unix-like systems: The most popular Unix-like system is Linux, a family of open-source operating systems based on the Linux kernel. It is mostly used in the form of distributions such as Ubuntu, Debian, Fedora, CentOS, and Alpine Linux. The latter is known for its simplicity and efficiency and is often used as the base image for Docker containers (container host OS).
  • Containers: Docker is a production-ready containerization service; hub.docker.com acts as the main public registry where Docker images can be stored and pulled. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
  • APIs (Application Programming Interfaces): An API is a set of definitions and protocols for building and integrating application software. It allows different software systems to communicate with each other and enables the integration of different systems so they can share data and functionality. Examples include RESTful APIs, SOAP APIs, and GraphQL APIs; a short request example is sketched after this list.
  • Jupyter: Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It supports various programming languages, including Python, R, and Julia. Jupyter is widely used for data analysis, machine learning, and scientific computing.
  • IDEs: (Integrated Development Environments): IDEs are software applications that provide comprehensive facilities to computer programmers for software development. They typically include a code editor, a debugger, and build automation tools. Examples of popular IDEs include:
    • RStudio: RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio provides a user-friendly interface for writing and debugging R code, as well as tools for data visualization and reporting.
      1. Install R from CRAN
      2. Install RStudio from RStudio
      3. Install git (if not already installed) from git-scm
      4. Follow the lecture to clone the Course GitHub repository and open the project in RStudio https://github.com/DrBenjamin/Analytical-Skills-for-Business
    • Visual Studio Code (VS Code): VS Code is a free source-code editor made by Microsoft for Windows, Linux and macOS. It includes support for debugging, embedded Git control, syntax highlighting, intelligent code completion, snippets, and code refactoring. It is highly customizable, allowing users to change the theme, keyboard shortcuts, preferences, and install extensions that add additional functionality.
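As referenced in the API item above, here is a minimal sketch of calling a REST API from R with the httr and jsonlite packages; the endpoint URL is a hypothetical placeholder, not a real service.

# Loading necessary libraries
library(httr)
library(jsonlite)

# Sending a GET request to a (hypothetical) REST endpoint
response <- GET("https://api.example.com/v1/sales?region=EU")

# Checking the HTTP status and parsing the JSON body into an R data frame
if (status_code(response) == 200) {
  sales <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  print(head(sales))
}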

Overview of no-code and low-code tools for data analytics

  • n8n

    • Follow these short introductions to n8n workflows:

      • Data downloading (import it with the context menu on the top right of the workflow page (…) - Import from File... or Import from URL...):
        1. Create an account on n8n.cloud
        2. Create a new workflow
        3. Add a new node and select “HTTP Request”
        4. Configure the node to make a GET request to https://minio.seriousbenentertainment.org:9000/data/Business_Report%20-%202025.csv
        5. Add a new node and select “Write Binary File”
        6. Configure the node to write the data to your local disk
      • Mediapipe (import it with the context menu on the top right of the workflow page (…) - Import from File... or Import from URL...):
        1. Create a new workflow
        2. Add a new node and select “HTTP Request” - configure the node to make a GET request to https://minio.seriousbenentertainment.org:9000/data/input.mp4
        3. Add a new node Extract from File
        4. Add a new node Build JSON
        5. Add a new node and select “MediaPipe”
        6. Add a new node HTTP Request - configure the node to make a POST request to http://212.227.102.172:8000/mediapipe
        7. Add a new node Github - configure the node to commit the results to the Github repository
  • Streamlit on Snowflake

    • Follow these short introductions to Snowflake:

      1. Create a free trial account on Snowflake
      2. Choose the data tutorial and follow the steps.
  • QGIS

  • Follow these short introductions to QGIS:

    1. Download and install QGIS from here
    2. Download the sample data from:
      • here and unzip them into one folder - it needs to contain all these files and sub-folders: Day2Data Day4Data Malawi-healthsites mw_districts mw_shp Malawi.qgz
    3. Open QGIS and open the project file Malawi.qgs
    4. Open Department of Forestry website and download some or all of the zip files.
    5. Add the layers via the menu Layer -> Add layer -> Add Vector Layer and select the shapefiles from the unzipped folders.
  • Tableau

    • Follow these short introductions to Tableau:

      1. Download and install Tableau Desktop which is free for students or use Tableau Public online.
      2. Download the sample data from here and open it in Tableau.
  • Power BI

    • Follow these short introductions to Power BI:

      1. We are using Power BI within MS Teams (it would also be available here to use as Power BI Desktop Application).
      2. Open an existing Power BI project and make yourself familiar with the interface.
  • KNIME

    • Follow these short introductions to KNIME:

      1. Download and install KNIME Analytics Platform from here
      2. Create a new KNIME workflow and add a CSV Reader node to read data from a CSV file. Configure the node to use Custom/KNIME URL and add https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv in the URL field. Configure the node to properly read column headers and data types.
      3. Explore the Data - Add a “Statistics” node to understand the data distribution - Use the “Missing Value” node to identify missing values in columns like Age, Cabin, and Embarked
      4. Handle Missing Values
        • For the Age column: use “Missing Value” node with mean or median imputation
        • For the Cabin column: create a new binary feature indicating whether cabin information is available
        • For the Embarked column: use mode imputation or remove the few rows with missing values
      5. Data Transformation
        • Use “Column Filter” to remove unnecessary columns (e.g., PassengerId, Name, Ticket)
        • Apply “String Manipulation” for categorical encoding
        • Use “Normalizer” node for numerical features if needed
      6. Export Cleaned Data
        • Use “CSV Writer” node to save the cleaned dataset
        • Verify the output and check data quality
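For comparison, the same cleaning steps can be sketched in R with the tidyverse, reading the Titanic CSV used in the KNIME workflow above; the imputation and filtering choices mirror the steps listed and are meant as one possible approach, not the only correct one.

# Loading necessary libraries
library(tidyverse)

# Reading the Titanic dataset used in the KNIME workflow
titanic <- read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv")

# Cleaning: imputation, feature creation, and column filtering
titanic_clean <- titanic %>%
  mutate(
    Age       = replace_na(Age, median(Age, na.rm = TRUE)),  # median imputation
    Has_Cabin = !is.na(Cabin)                                # binary cabin indicator
  ) %>%
  filter(!is.na(Embarked)) %>%                               # dropping the few missing rows
  select(-PassengerId, -Name, -Ticket, -Cabin)               # removing unneeded columns

# Exporting the cleaned dataset
write_csv(titanic_clean, "titanic_clean.csv")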

Descriptive statistics

Descriptive statistics summarizes and presents the main features of a dataset so you can understand what the data look like before modeling or inference. It organizes raw values into clear numerical summaries and visuals, without making probabilistic claims about a wider population. In analytics projects, this first pass helps you validate data quality, spot outliers, and communicate patterns to stakeholders.

See Types of Data and Types of Variables to build a data wrangling foundation for descriptive statistics.

What we typically summarize:

1. Central Tendency

Central tendency measures identify the typical or central value in a dataset.

  • The mean (arithmetic average) is sensitive to extreme values, making it suitable for symmetric distributions.
  • The median (middle value when sorted) is robust to outliers and preferred for skewed data.
  • The mode (most frequent value) is useful for categorical data and can identify the most common category or value. In business analytics, choosing the appropriate measure depends on data distribution and the presence of outliers.
# Loading necessary libraries
library(tidyverse)

# Creating a sample dataset
sample_data <- tibble(
  values = c(1, 2, 2, 3, 4, 5, 100)  # Example data with an outlier
)

# Calculating central tendency measures
central_tendency <- sample_data %>%
  summarise(
    mean = mean(values),
    median = median(values),
    mode = as.numeric(names(sort(table(values), decreasing = TRUE)[1]))
  )

# Printing the results
print(central_tendency)
# A tibble: 1 × 3
   mean median  mode
  <dbl>  <dbl> <dbl>
1  16.7      3     2

2. Variability

Variability measures quantify data spread around the center.

  • The range (maximum minus minimum) provides a simple spread indicator but is sensitive to outliers.
  • The variance measures average squared deviations from the mean.
  • The standard deviation (square root of the variance) expresses spread in original units, facilitating interpretation.
  • The interquartile range (IQR, the difference between the 75th and 25th percentiles) is robust to outliers and describes the spread of the middle 50% of the data. In business contexts, variability indicates risk, consistency, or process stability.
# Calculating variability measures
variability <- sample_data %>%
  summarise(
    range = max(values) - min(values),
    variance = var(values),
    sd = sd(values),
    iqr = IQR(values)
  )

# Printing the results
print(variability)
# A tibble: 1 × 4
  range variance    sd   iqr
  <dbl>    <dbl> <dbl> <dbl>
1    99    1351.  36.8   2.5

3. Distribution Shape

Distribution shape characteristics reveal asymmetry and tail behavior.

  • Skewness measures distribution asymmetry: positive skewness indicates a right tail (mean > median), negative skewness indicates a left tail (mean < median), and zero indicates symmetry.
  • Kurtosis measures tail heaviness and peakedness: high kurtosis indicates heavy tails with more outliers, low kurtosis indicates light tails.

These measures help identify data transformations needed and potential outliers in business analytics.

# Loading moments package for skewness and kurtosis
library(tidyverse)
library(moments)

# Calculating distribution shape measures
distribution_shape <- sample_data %>%
  summarise(
    skewness = skewness(values),
    kurtosis = kurtosis(values)
  )

# Printing the results
print(distribution_shape)
# A tibble: 1 × 2
  skewness kurtosis
     <dbl>    <dbl>
1     2.04     5.15

4. Frequencies and Percentiles

Frequency analysis counts occurrences and calculates proportions, revealing data distribution patterns.

  • Percentiles (quantiles) divide data into equal parts: quartiles (25%, 50%, 75%), deciles (10% intervals), or any custom percentile.

These measures identify data position and help detect outliers. In business, percentiles benchmark performance (e.g., top 10% customers) and support decision thresholds.

# Loading necessary libraries
library(tidyverse)

# Calculating frequencies and percentiles
frequencies <- sample_data %>%
  count(values) %>%
  mutate(proportion = n / sum(n))

# Printing frequency table
print(frequencies)
# A tibble: 6 × 3
  values     n proportion
   <dbl> <int>      <dbl>
1      1     1      0.143
2      2     2      0.286
3      3     1      0.143
4      4     1      0.143
5      5     1      0.143
6    100     1      0.143
# Calculating various percentiles
percentiles <- sample_data %>%
  summarise(
    q25 = quantile(values, 0.25),
    q50 = quantile(values, 0.50),  # median
    q75 = quantile(values, 0.75),
    q90 = quantile(values, 0.90),
    q95 = quantile(values, 0.95)
  )

# Printing percentiles
print(percentiles)
# A tibble: 1 × 5
    q25   q50   q75   q90   q95
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     2     3   4.5  43.0  71.5

Common methods used to summarize data:

5. Numerical Methods

Numerical methods systematically compute summary statistics and organize data into interpretable structures.

  • Summary metrics condense datasets into key statistics (mean, median, standard deviation, quartiles), enabling quick assessment.
  • Frequency tables tabulate value counts or binned ranges, revealing distribution patterns. These methods form the foundation for exploratory data analysis and inform subsequent statistical modeling in business analytics.
# Loading necessary libraries
library(tidyverse)

# Creating a more comprehensive summary
summary_metrics <- sample_data %>%
  summarise(
    count = n(),
    min = min(values),
    q1 = quantile(values, 0.25),
    median = median(values),
    mean = mean(values),
    q3 = quantile(values, 0.75),
    max = max(values),
    sd = sd(values),
    variance = var(values)
  )

# Printing summary metrics
print(summary_metrics)
# A tibble: 1 × 9
  count   min    q1 median  mean    q3   max    sd variance
  <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     7     1     2      3  16.7   4.5   100  36.8    1351.
# Creating a frequency table with binned ranges
binned_data <- sample_data %>%
  mutate(bin = cut(values, breaks = c(0, 2, 4, 6, Inf), 
                   labels = c("0-2", "2-4", "4-6", ">6"))) %>%
  count(bin) %>%
  mutate(percentage = n / sum(n) * 100)

# Printing frequency table
print(binned_data)
# A tibble: 4 × 3
  bin       n percentage
  <fct> <int>      <dbl>
1 0-2       3       42.9
2 2-4       2       28.6
3 4-6       1       14.3
4 >6        1       14.3

6. Graphical Methods

Graphical methods visualize data distributions and relationships, making patterns immediately recognizable.

  • Histograms display continuous data frequency distributions across bins.
  • Bar and pie charts compare categorical data frequencies or proportions.
  • Box plots show median, quartiles, and outliers, facilitating distributional comparisons.
  • Scatter plots reveal bivariate relationships and correlation patterns. In business analytics, visualizations communicate insights to non-technical stakeholders and guide data-driven decisions.
# Loading necessary libraries
library(tidyverse)
library(gridExtra)
library(ggplot2)

# Creating a larger dataset for better visualizations
set.seed(42)
large_data <- tibble(
  continuous_var = c(rnorm(100, mean = 50, sd = 10), rnorm(20, mean = 80, sd = 5)),
  category = sample(c("A", "B", "C", "D"), 120, replace = TRUE),
  x_var = rnorm(120, mean = 30, sd = 8),
  y_var = 2 * x_var + rnorm(120, mean = 0, sd = 5)  # y depends on x_var to show a relationship
)

# Creating multiple plots

# Histogram for continuous data
p1 <- ggplot(large_data, aes(x = continuous_var)) +
  geom_histogram(bins = 15, fill = "steelblue", color = "white") +
  labs(title = "Histogram (Continuous Data)", x = "Value", y = "Frequency") +
  theme_minimal()

# Bar chart for categorical data
p2 <- ggplot(large_data, aes(x = category)) +
  geom_bar(fill = "coral") +
  labs(title = "Bar Chart (Categorical Data)", x = "Category", y = "Count") +
  theme_minimal()

# Box plot
p3 <- ggplot(large_data, aes(y = continuous_var)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Box Plot (Median, Quartiles, Outliers)", y = "Value") +
  theme_minimal()

# Scatter plot for bivariate relationship
p4 <- ggplot(large_data, aes(x = x_var, y = y_var)) +
  geom_point(color = "darkblue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Scatter Plot (Bivariate Relationship)", x = "X Variable", y = "Y Variable") +
  theme_minimal()

# Arranging plots in a grid
grid.arrange(p1, p2, p3, p4, ncol = 2)

Business analytics context:

  • For monthly revenue by region, the mean signals typical performance, the standard deviation shows volatility, a box plot quickly flags outliers, and a bar chart compares regions. These summaries guide prioritization (e.g., regions with high variability may require deeper investigation) and set baselines for forecasting and experimentation.

  • For an accessible overview of types, methods, and examples, see ResearchMethod.net.

Measures of centrality, dispersion, and concentration

Descriptive analytics

Descriptive analytics examines data structures based on the number of variables analyzed simultaneously, enabling appropriate analytical techniques and visualizations for different data complexities.

Univariate data

Univariate analysis examines one variable at a time to understand its distribution, central tendency, and variability. It answers questions like “What is typical?” and “How much variation exists?” Common techniques include summary statistics (mean, median, mode, standard deviation), frequency distributions, and visualizations such as histograms, box plots, and bar charts. Business example: Analyzing monthly sales revenue to identify typical performance and outliers.

Bivariate data

Bivariate analysis examines relationships between two variables to identify associations, correlations, or dependencies. It answers questions like “How do these variables relate?” and “Does one variable predict another?” Common techniques include correlation coefficients (Pearson, Spearman), cross-tabulation, simple linear regression, and visualizations such as scatter plots and grouped bar charts. Business example: Examining the relationship between advertising spend and sales revenue to determine campaign effectiveness.

Multivariate data

Multivariate analysis examines three or more variables simultaneously to understand complex relationships, interactions, and patterns. It answers questions like “How do multiple factors jointly influence outcomes?” and “What hidden patterns exist?” Common techniques include multiple regression, principal component analysis (PCA), cluster analysis, and visualizations such as correlation matrices and heatmaps. Business example: Analyzing customer demographics, purchase history, and behavioral data simultaneously to identify market segments and predict customer lifetime value.
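As a minimal sketch of multivariate exploration, the example below computes a correlation matrix and a principal component analysis on the built-in mtcars dataset, which stands in here for typical numerical business data.

# Loading necessary libraries
library(tidyverse)

# Selecting a few numerical variables from the built-in mtcars dataset
numeric_vars <- mtcars %>%
  select(mpg, disp, hp, wt)

# Correlation matrix across all variable pairs
print(cor(numeric_vars))

# Principal component analysis on standardized variables
pca <- prcomp(numeric_vars, scale. = TRUE)
print(summary(pca))  # proportion of variance explained per component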

Techniques for Scores, Rankings, Metrics, and Composite Indicators

Performance measurement in business analytics requires systematic approaches to quantify complex phenomena through structured numerical representations.

Constructing

Construction involves designing and calculating meaningful measures from raw data. This includes:

  • Score development: Creating numerical values that represent performance levels (e.g., credit scores, customer satisfaction scores)
  • Ranking systems: Ordering entities based on specific criteria (e.g., sales performance rankings, market share positions)
  • Metric formulation: Defining key performance indicators (KPIs) that align with business objectives (e.g., conversion rates, ROI, customer lifetime value)
  • Composite indicators: Combining multiple metrics into single summary measures (e.g., balanced scorecards, economic indices, sustainability ratings)

Business example: Constructing a customer health score by combining purchase frequency, average order value, and engagement metrics with weighted coefficients.
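A minimal sketch of such a construction in R is shown below; the metrics, weights, and customer values are illustrative assumptions, and each metric is rescaled to 0-1 before weighting so the components are comparable.

# Loading necessary libraries
library(tidyverse)

# Hypothetical customer metrics
customers <- tibble(
  customer_id        = 1:5,
  purchase_frequency = c(12, 3, 8, 1, 6),
  avg_order_value    = c(80, 150, 60, 200, 95),
  engagement_score   = c(0.9, 0.4, 0.7, 0.2, 0.6)
)

# Rescaling each metric to 0-1 and combining with weighted coefficients
health_scores <- customers %>%
  mutate(across(purchase_frequency:engagement_score,
                ~ (.x - min(.x)) / (max(.x) - min(.x)))) %>%
  mutate(health_score = 0.4 * purchase_frequency +
                        0.3 * avg_order_value +
                        0.3 * engagement_score) %>%
  arrange(desc(health_score))  # ranking customers by the composite score

# Printing the scores and the resulting ranking
print(health_scores)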

Interpreting

Interpretation translates numerical results into actionable business insights. This requires:

  • Contextualization: Understanding what scores mean relative to benchmarks, targets, or historical performance
  • Trend analysis: Identifying patterns and changes over time
  • Comparative analysis: Assessing performance relative to competitors or industry standards
  • Threshold identification: Determining critical values that trigger decisions or actions

Business example: Interpreting a Net Promoter Score (NPS) of 45 as “good” for the retail industry, but identifying a downward trend requiring investigation.

Evaluating

Evaluation assesses the quality, validity, and effectiveness of measurement systems. Key considerations include:

  • Validity: Does the measure capture what it intends to measure?
  • Reliability: Are results consistent and reproducible?
  • Sensitivity: Does the metric detect meaningful changes?
  • Actionability: Can insights drive concrete business decisions?
  • Bias assessment: Identifying and mitigating systematic errors or unfair representations

Business example: Evaluating whether employee performance scores fairly represent actual contributions or reflect demographic biases, leading to refined evaluation criteria.

Visualizing and Exploration of Data Types

Effective data visualization transforms raw data into visual representations that reveal patterns, trends, and insights. Different data types require specific visualization approaches and exploratory techniques.

Categorical Data

Categorical data represents discrete groups or categories without inherent numerical ordering (nominal) or with meaningful ordering (ordinal). Visualization and exploration focus on frequency distributions and group comparisons.

Common visualizations:

  • Bar charts: Compare frequencies or proportions across categories
  • Pie charts: Show part-to-whole relationships for categorical breakdowns
  • Stacked bar charts: Display multiple categorical variables simultaneously
  • Tree maps: Visualize hierarchical categorical data with nested rectangles
  • Mosaic plots: Show relationships between multiple categorical variables

Exploratory techniques:

  • Frequency tables and cross-tabulations
  • Mode identification
  • Chi-square tests for independence
  • Contingency analysis

Business example: Visualizing customer segments by region using bar charts, analyzing product category preferences with pie charts, or exploring the relationship between customer type and purchase channel using mosaic plots.

Numerical Data

Numerical data consists of quantitative measurements on continuous or discrete scales. Visualization emphasizes distributions, central tendencies, variability, and relationships between variables.

Common visualizations:

  • Histograms: Display distribution of continuous variables
  • Box plots: Show median, quartiles, and outliers for comparison
  • Violin plots: Combine box plots with density curves
  • Scatter plots: Reveal relationships between two numerical variables
  • Line graphs: Track changes over ordered sequences
  • Heatmaps: Display correlation matrices or intensity patterns

Exploratory techniques:

  • Summary statistics (mean, median, standard deviation)
  • Distribution analysis (skewness, kurtosis)
  • Outlier detection
  • Correlation analysis
  • Trend identification

Business example: Using histograms to analyze revenue distributions, box plots to compare sales performance across quarters, scatter plots to examine the relationship between marketing spend and customer acquisition, or heatmaps to visualize product sales correlations.

Time Series Data

Time series data represents sequential observations measured at successive time points. Visualization and exploration focus on temporal patterns, trends, seasonality, and cyclical behaviors.

Common visualizations:

  • Line charts: Display trends and patterns over time
  • Area charts: Show cumulative values or multiple series stacked
  • Candlestick charts: Represent open, high, low, close values (financial data)
  • Seasonal plots: Reveal recurring patterns by season or period
  • Autocorrelation plots: Identify temporal dependencies
  • Control charts: Monitor process stability and detect anomalies

Exploratory techniques:

  • Trend analysis (moving averages, smoothing)
  • Seasonality detection (seasonal decomposition)
  • Stationarity testing
  • Lag analysis and autocorrelation
  • Change point detection
  • Anomaly identification

Business example: Tracking monthly revenue trends with line charts, identifying seasonal sales patterns using seasonal decomposition, monitoring website traffic fluctuations with control charts, or analyzing stock price movements with candlestick charts to inform investment decisions.
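The sketch below illustrates two of these techniques on simulated monthly revenue with base R: a classical seasonal decomposition and a 12-month moving average; the data are artificial and only meant to show the mechanics.

# Simulating three years of monthly revenue with trend and seasonality
set.seed(99)
monthly_revenue <- ts(
  100 + 1:36 + 15 * sin(2 * pi * (1:36) / 12) + rnorm(36, sd = 5),
  start = c(2022, 1), frequency = 12
)

# Classical decomposition into trend, seasonal, and random components
decomposed <- decompose(monthly_revenue)
plot(decomposed)

# 12-month centered moving average to smooth out seasonality
moving_avg <- stats::filter(monthly_revenue, rep(1 / 12, 12), sides = 2)
print(moving_avg)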

Handling Messy Data

Real-world data is rarely clean or analysis-ready. Data quality issues compromise analytical accuracy and business decisions. Handling messy data involves identifying, understanding, and resolving various data quality problems through systematic cleaning and transformation processes.

Common Data Quality Issues

Missing Values:

  • Complete absence: Entire records or fields are missing
  • Implicit missingness: Represented as NULL, NA, empty strings, or placeholder values (e.g., 999, -1)
  • Patterns: Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)

Handling strategies:

  • Deletion (listwise or pairwise) when data is MCAR and the missing percentage is low
  • Imputation: mean/median/mode for numerical data, forward/backward fill for time series
  • Model-based imputation: regression, k-nearest neighbors (KNN), multiple imputation
  • Flag creation: adding indicator variables to preserve information about missingness

Inconsistent Data:

  • Format variations: “New York” vs. “NY” vs. “new york”
  • Unit inconsistencies: mixing metric and imperial measurements
  • Date format variations: “MM/DD/YYYY” vs. “DD-MM-YYYY”
  • Encoding issues: character encoding problems (UTF-8, ASCII)

Handling strategies:

  • Standardization: converting to consistent formats and units
  • Text normalization: lowercasing, trimming whitespace, removing special characters
  • Date parsing with explicit format specifications
  • Encoding conversion and validation

Duplicate Records:

  • Exact duplicates: Identical records across all fields
  • Near duplicates: Records representing the same entity with slight variations
  • Logical duplicates: Same entity with different identifiers

Handling strategies:

  • Exact match removal using unique identifiers
  • Fuzzy matching for near duplicates (Levenshtein distance, Jaccard similarity)
  • Record linkage and deduplication algorithms
  • Master data management (MDM) approaches

Outliers and Anomalies:

  • Statistical outliers: Values beyond expected range (e.g., 3 standard deviations from the mean)
  • Domain outliers: Values impossible or implausible in business context (negative ages, future dates)
  • Data entry errors: Typos or transposition errors

Handling strategies:

  • Detection: Z-score, IQR method, isolation forests, DBSCAN clustering
  • Investigation: determine if outliers are errors or valid extreme values
  • Treatment: removal, capping/winsorizing, transformation, or separate analysis
  • Domain validation: implementing business rules and constraints

Structural Issues:

  • Wrong data types: Numbers stored as text, dates as strings
  • Improper granularity: Data too aggregated or too detailed for analysis
  • Denormalized structures: Redundant or poorly organized data
  • Schema mismatches: Incompatible structures when merging datasets

Handling strategies:

  • Type conversion with validation
  • Aggregation or disaggregation to appropriate levels
  • Normalization and restructuring (wide-to-long, long-to-wide transformations)
  • Schema mapping and harmonization
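The following sketch applies a few of these strategies with the tidyverse on a small made-up dataset (values and column names are illustrative): removing exact duplicates, standardizing categories, treating a placeholder code as missing, imputing with the median, and fixing a data type.

# Loading necessary libraries
library(tidyverse)

# A small messy dataset (hypothetical values)
messy <- tibble(
  customer_id = c(1, 2, 2, 3, 4, 5),
  country     = c("USA", "US", "US", "United States", "usa", "DE"),
  age         = c(34, NA, NA, 29, 999, 41),                   # NA and placeholder 999
  revenue     = c("1200", "850", "850", "640", "2300", "70")  # numbers stored as text
)

# Applying basic cleaning strategies
cleaned <- messy %>%
  distinct() %>%                                           # removing exact duplicates
  mutate(
    country = case_when(                                   # standardizing categories
      str_to_upper(country) %in% c("USA", "US", "UNITED STATES") ~ "US",
      TRUE ~ str_to_upper(country)
    ),
    age     = na_if(age, 999),                             # treating the placeholder as missing
    age     = replace_na(age, median(age, na.rm = TRUE)),  # median imputation
    revenue = as.numeric(revenue)                          # fixing the data type
  )

# Printing the cleaned data
print(cleaned)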

Data Cleaning Workflow

1. Profiling and Assessment:

  • Examine data distributions, completeness, and patterns
  • Generate data quality reports
  • Identify specific issues and their prevalence

2. Validation:

  • Define business rules and constraints
  • Implement validation checks (range checks, format checks, logical consistency)
  • Flag violations for review

3. Cleaning:

  • Apply correction strategies systematically
  • Document all transformations
  • Maintain audit trails

4. Verification:

  • Validate cleaned data against quality criteria
  • Compare before/after statistics
  • Conduct spot checks and sample reviews

5. Documentation:

  • Record all cleaning decisions and rationale
  • Create data dictionaries
  • Document assumptions and limitations

Tools and Techniques in R

R packages for data cleaning:

  • tidyverse (dplyr, tidyr): Data manipulation and reshaping
  • janitor: Data cleaning utilities and tabulation
  • naniar, mice: Missing data visualization and imputation
  • stringr, stringi: Text cleaning and manipulation
  • lubridate: Date-time parsing and manipulation
  • validate, assertr: Data validation
  • dedupr, RecordLinkage: Deduplication

Business example: A retail company discovers that their customer database contains 15% missing email addresses, duplicate customer records with slight name variations (e.g., “John Smith” vs. “J. Smith”), inconsistent country codes (“USA” vs. “US” vs. “United States”), and impossible birth dates (e.g., year 1899). The data cleaning process involves:

  1. Flagging missing emails for targeted collection campaigns
  2. Using fuzzy matching to merge duplicate customer profiles while preserving transaction history
  3. Standardizing country codes using ISO standards
  4. Validating and correcting birth dates using business rules (customers must be 18+ years old)
  5. Creating a cleaned, analysis-ready dataset that improves customer segmentation accuracy and marketing campaign targeting.

Association: Measuring Relationships Between Variables

Association analysis examines how variables relate to each other, quantifying the strength, direction, and nature of relationships. Understanding associations is fundamental for identifying patterns, making predictions, and uncovering causal mechanisms in business analytics.

Correlation Analysis

Correlation measures the strength and direction of linear relationships between two variables, producing coefficients ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.

Types of Correlation Coefficients:

Pearson Correlation (r):

  • Measures linear relationships between continuous variables
  • Assumes normally distributed data
  • Sensitive to outliers
  • Interpretation: r = 0.7-1.0 (strong), 0.4-0.7 (moderate), 0.1-0.4 (weak)

Spearman Rank Correlation (ρ):

  • Measures monotonic relationships (not necessarily linear)
  • Non-parametric alternative to Pearson
  • Works with ordinal data
  • Robust to outliers and non-normal distributions

Kendall’s Tau (τ):

  • Measures ordinal associations
  • Better for small sample sizes
  • More robust than Spearman for tied ranks

Key Considerations:

  • Correlation ≠ Causation: Strong correlation doesn’t imply one variable causes changes in another
  • Spurious correlations: Unrelated variables may correlate due to confounding factors or coincidence
  • Non-linear relationships: Correlation coefficients may miss curved or complex patterns
  • Range restrictions: Limited variable ranges can artificially reduce correlation

Business example: Analyzing the correlation between advertising spend and sales revenue (r = 0.82) suggests a strong positive linear relationship, but further analysis is needed to establish causality and account for seasonal effects, competitor actions, and other factors.
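A minimal sketch of these coefficients in R on simulated advertising and revenue data is shown below; the numbers are artificial, so the resulting coefficients will not match the r = 0.82 quoted above.

# Simulated monthly advertising spend and sales revenue
set.seed(7)
ad_spend <- runif(24, min = 10, max = 50)                # in thousand EUR
revenue  <- 5 * ad_spend + rnorm(24, mean = 0, sd = 20)  # linear relation plus noise

# Pearson, Spearman, and Kendall correlation coefficients
print(cor(ad_spend, revenue, method = "pearson"))
print(cor(ad_spend, revenue, method = "spearman"))
print(cor(ad_spend, revenue, method = "kendall"))

# Significance test for the Pearson correlation
print(cor.test(ad_spend, revenue))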

Regression Analysis

Regression models relationships between variables to predict outcomes, estimate effects, and understand dependencies. It quantifies how changes in independent variables (predictors) relate to changes in dependent variables (outcomes).

Simple Linear Regression:

  • Models relationship between one predictor and one outcome
  • Equation: Y = β₀ + β₁X + ε
  • β₀ (intercept): predicted Y when X = 0
  • β₁ (slope): change in Y for one-unit increase in X
  • ε (error term): unexplained variation

Multiple Linear Regression:

  • Models relationships with multiple predictors
  • Equation: Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
  • Controls for confounding variables
  • Allows estimation of partial effects

Model Evaluation Metrics:

R-squared (R²):

  • Proportion of variance explained by the model (0 to 1)
  • Higher values indicate better fit
  • Can be inflated by adding predictors

Adjusted R-squared:

  • Penalizes model complexity
  • Preferred for comparing models with different numbers of predictors

Root Mean Squared Error (RMSE):

  • Average prediction error in original units
  • Lower values indicate better predictions

Residual Analysis:

  • Examining prediction errors to validate assumptions
  • Checking for patterns, heteroscedasticity, and normality

Key Assumptions:

  • Linearity: Relationship between variables is linear
  • Independence: Observations are independent
  • Homoscedasticity: Constant variance of errors
  • Normality: Residuals are normally distributed
  • No multicollinearity: Predictors are not highly correlated (for multiple regression)
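The sketch below fits a multiple linear regression with base R's lm() on simulated data and reports the coefficients, R², adjusted R², and RMSE; the variable names and true coefficients are illustrative assumptions.

# Simulated data: revenue predicted by ad spend and number of sales reps
set.seed(11)
n <- 100
ad_spend   <- runif(n, 10, 50)
sales_reps <- sample(2:10, n, replace = TRUE)
revenue    <- 20 + 4 * ad_spend + 15 * sales_reps + rnorm(n, sd = 25)
reg_data   <- data.frame(revenue, ad_spend, sales_reps)

# Fitting the multiple linear regression model
model <- lm(revenue ~ ad_spend + sales_reps, data = reg_data)

# Coefficients, R-squared, and adjusted R-squared
print(summary(model))

# Root mean squared error in the original units of revenue
rmse <- sqrt(mean(residuals(model)^2))
print(rmse)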

Types of Relationships:

Positive Association:

  • Both variables increase together
  • Example: Education level and income

Negative Association:

  • One variable increases as the other decreases
  • Example: Price and demand

No Association:

  • Variables are unrelated
  • Changes in one don’t predict changes in the other

Non-linear Association:

  • Curved or complex relationships
  • May require polynomial regression or transformation

Practical Applications in Business:

Predictive Modeling:

  • Sales forecasting based on historical data
  • Customer lifetime value prediction
  • Demand estimation

Risk Assessment:

  • Credit scoring models
  • Insurance premium calculation
  • Investment risk evaluation

Optimization:

  • Pricing strategies
  • Resource allocation
  • Marketing mix optimization

Causal Inference:

  • Treatment effect estimation
  • A/B testing analysis
  • Policy impact evaluation

Business example: A subscription-based software company builds a multiple regression model to predict customer churn:

Churn Probability = β₀ + β₁(Login_Frequency) + β₂(Support_Tickets) + β₃(Feature_Usage) + β₄(Contract_Length) + ε

Results show:

  • Login frequency has a strong negative effect (β₁ = -0.35): more active users are less likely to churn
  • Support tickets have a positive effect (β₂ = 0.18): customers with issues are more likely to leave
  • Feature usage has a moderate negative effect (β₃ = -0.22)
  • Contract length has a strong negative effect (β₄ = -0.41): longer commitments reduce churn

The model achieves R² = 0.67, explaining 67% of churn variance. These insights inform retention strategies: improving onboarding to increase feature adoption, proactively addressing support issues, and incentivizing longer contracts.

Correlation vs. Regression

Correlation:

  • Symmetric relationship (X-Y same as Y-X)
  • No distinction between dependent and independent variables
  • Describes strength and direction only
  • Single coefficient summarizes relationship

Regression:

  • Asymmetric relationship (predicting Y from X)
  • Clear dependent and independent variables
  • Provides prediction equation
  • Estimates effect sizes and intercepts
  • Allows statistical inference about relationships

Both techniques complement each other in comprehensive data analysis, with correlation providing initial relationship exploration and regression enabling detailed modeling and prediction.

Inferential statistics

Basic concepts of statistical inference

Quantification of probability through random variables

Hypothesis testing

Hypothesis testing is a fundamental statistical method used to make inferences about population parameters based on sample data (Illowsky & Dean, 2018). It provides a systematic framework for evaluating claims about populations using sample evidence, enabling data-driven decision making in business contexts.

Core Concepts

A statistical hypothesis test involves formulating two competing hypotheses:

  • Null hypothesis (H₀): The status quo or default position, typically representing no effect or no difference
  • Alternative hypothesis (H₁ or Hₐ): The research hypothesis representing the effect or difference we seek to detect

The process involves calculating a test statistic from sample data and comparing it to a critical value or determining a p-value to make decisions about rejecting or failing to reject the null hypothesis (GeeksforGeeks, 2024).

Key Components

Test Statistics: Standardized measures that quantify how far sample data deviates from what would be expected under the null hypothesis.

Significance Level (α): The probability threshold for rejecting the null hypothesis, commonly set at 0.05 (5%).

P-value: The probability of observing test results at least as extreme as those obtained, assuming the null hypothesis is true.

Critical Region: The range of values for which the null hypothesis is rejected.

Types of Errors

  • Type I Error (α): Rejecting a true null hypothesis (false positive)
  • Type II Error (β): Failing to reject a false null hypothesis (false negative)
  • Statistical Power (1-β): The probability of correctly rejecting a false null hypothesis
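As a minimal sketch, the one-sample t-test below checks whether simulated customer satisfaction scores differ from a benchmark of 7.0; the data and the benchmark are illustrative assumptions.

# Simulating a sample of customer satisfaction scores
set.seed(123)
satisfaction <- rnorm(40, mean = 7.4, sd = 1.2)

# One-sample t-test: H0: mu = 7.0 versus H1: mu != 7.0
test_result <- t.test(satisfaction, mu = 7.0)

# Printing the test statistic, p-value, and confidence interval
print(test_result)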

Reference Materials

For comprehensive coverage of hypothesis testing concepts, methodologies, and applications, consult:

Understanding hypothesis testing is essential for making informed business decisions based on data analysis, forming the foundation for advanced statistical inference and predictive analytics in business contexts.

Confidence Intervals, P-Values, and Statistical Tests

Confidence intervals, p-values, and statistical tests are interconnected concepts in inferential statistics that work together to support evidence-based decision-making and quantify uncertainty in business analytics.

Confidence Intervals

Confidence intervals (CIs) provide a range of plausible values for population parameters based on sample data, expressing uncertainty around point estimates with a specified confidence level (typically 95%).

Interpretation:

  • A 95% CI means that if we repeatedly sampled from the population and calculated confidence intervals, approximately 95% of those intervals would contain the true population parameter
  • Wider intervals indicate greater uncertainty; narrower intervals suggest more precise estimates
  • The interval provides both a point estimate (usually the center) and a margin of error

Common Types:

  • CI for means: Estimates average values (e.g., average customer satisfaction score: 7.2 ± 0.5)
  • CI for proportions: Estimates percentages (e.g., conversion rate: 12% ± 2%)
  • CI for differences: Compares groups (e.g., A/B test difference: 3.5% ± 1.2%)
  • CI for regression coefficients: Quantifies predictor effects with uncertainty

Business application: A marketing team tests a new email campaign and observes a 15% click-through rate in a sample of 500 customers. The 95% CI is [12.3%, 17.7%], indicating they can be 95% confident the true population click-through rate lies within this range.
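The interval for this campaign can be sketched in base R with prop.test(), using 75 clicks out of 500 emails; prop.test() computes a score-based interval, so its bounds will differ slightly from the rounded figures quoted above.

# Observed campaign results: 75 clicks out of 500 emails (15%)
clicks <- 75
emails <- 500

# 95% confidence interval for the click-through rate
ci_result <- prop.test(clicks, emails)

# Printing the point estimate and confidence interval
print(ci_result)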

P-Values

P-values quantify the strength of evidence against the null hypothesis, representing the probability of observing results as extreme as or more extreme than those obtained, assuming the null hypothesis is true.

Interpretation Guidelines:

  • p < 0.01: Very strong evidence against H₀ (highly significant)
  • p < 0.05: Strong evidence against H₀ (statistically significant)
  • p < 0.10: Moderate evidence against H₀ (marginally significant)
  • p ≥ 0.10: Insufficient evidence to reject H₀ (not significant)

Key Considerations:

  • P-values do NOT measure the probability that the null hypothesis is true
  • P-values do NOT measure effect size or practical importance
  • Smaller p-values indicate stronger evidence, but don’t guarantee meaningful effects
  • Statistical significance (small p-value) ≠ practical significance (meaningful effect)
  • P-values can be influenced by sample size: large samples may yield significant p-values for trivial effects

Common Misinterpretations to Avoid:

  • ❌ “p = 0.03 means there’s a 3% chance the null hypothesis is true”
  • ❌ “p > 0.05 proves the null hypothesis is correct”
  • ✓ “p = 0.03 means that if the null hypothesis were true, we’d see results this extreme only 3% of the time”

Business application: Testing whether a new product feature increases user engagement, researchers obtain p = 0.02, indicating that if the feature had no effect, results this extreme would occur only 2% of the time—strong evidence the feature has an impact.

Common Statistical Tests

Statistical tests evaluate hypotheses about population parameters, each suited for specific data types and research questions.

Tests for Means:

t-test:

  • One-sample t-test: Compare sample mean to known value (e.g., Is average satisfaction score different from 7.0?)
  • Independent samples t-test: Compare means between two groups (e.g., Do customers in Region A spend more than Region B?)
  • Paired samples t-test: Compare means for related observations (e.g., before/after treatment comparisons)

ANOVA (Analysis of Variance):

  • Compares means across three or more groups
  • Identifies whether at least one group differs significantly
  • Example: Comparing sales performance across four sales regions

Tests for Proportions:

Z-test for proportions:

  • Compare sample proportion to known value
  • Compare proportions between two groups
  • Example: Is conversion rate significantly higher for Treatment A vs. Control?

Chi-square test:

  • Tests independence between categorical variables
  • Example: Is customer satisfaction level related to product category?

Tests for Associations:

Correlation test:

  • Evaluates significance of correlation coefficients
  • Tests whether observed correlation differs from zero
  • Example: Is there a significant relationship between advertising spend and revenue?

Regression F-test:

  • Tests overall model significance
  • Evaluates whether predictors collectively explain variance
  • Example: Do customer demographics significantly predict purchase behavior?

Non-parametric Tests:

When data violates normality assumptions, use distribution-free alternatives:

  • Mann-Whitney U test: Non-parametric alternative to independent t-test
  • Wilcoxon signed-rank test: Non-parametric alternative to paired t-test
  • Kruskal-Wallis test: Non-parametric alternative to ANOVA
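A few of these tests are sketched below in base R on simulated data; the group means, counts, and the cross-tabulation are illustrative assumptions.

# Simulated spend for two customer groups
set.seed(42)
group_a <- rnorm(50, mean = 120, sd = 20)
group_b <- rnorm(50, mean = 130, sd = 20)

# Independent samples t-test: do mean spends differ between the groups?
print(t.test(group_a, group_b))

# Non-parametric alternative without the normality assumption
print(wilcox.test(group_a, group_b))

# Chi-square test of independence on a categorical cross-tabulation
satisfaction_table <- matrix(c(60, 40, 30, 70), nrow = 2,
                             dimnames = list(channel   = c("Online", "Store"),
                                             satisfied = c("Yes", "No")))
print(chisq.test(satisfaction_table))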

Interconnections

These concepts work together in hypothesis testing:

Relationship between CIs and p-values:

  • If a 95% CI for a difference excludes zero, the corresponding p-value will be < 0.05
  • If a 95% CI includes zero, the p-value will be > 0.05
  • CIs provide more information than p-values: they show effect size, direction, and precision

Relationship between p-values and significance level (α):

  • If p < α (commonly 0.05), reject the null hypothesis
  • If p ≥ α, fail to reject the null hypothesis
  • The α level defines the Type I error rate (false positive rate)

Choosing the Right Test: Consider:

  1. Data type: Continuous, categorical, ordinal?
  2. Number of groups: One, two, or more?
  3. Independence: Related or unrelated samples?
  4. Assumptions: Normality, equal variances?
  5. Research question: Difference, association, or prediction?

Business Decision Framework:

  1. Statistical Significance: Does the p-value indicate a real effect? (p < 0.05)
  2. Effect Size: Is the magnitude practically meaningful? (examine CIs and Cohen’s d)
  3. Business Impact: Does this translate to meaningful outcomes? (revenue, retention, satisfaction)
  4. Cost-Benefit: Do benefits outweigh implementation costs?

Example Integration: An e-commerce company tests a new checkout design:

  • Sample data: 3,000 users per version (old vs. new)
  • Conversion rates: Old = 12.0%, New = 14.5%
  • Statistical test: Two-proportion z-test
  • Results:
    • Difference: 2.5 percentage points
    • 95% CI: [0.8%, 4.2%]
    • p-value: 0.004
  • Interpretation:
    • Statistically significant: p = 0.004 < 0.05 (strong evidence of a difference)
    • Effect size: 2.5 percentage point increase (20.8% relative improvement)
    • Precision: 95% confident the true improvement is between 0.8% and 4.2%
    • Business decision: Implement the new design; with 100,000 monthly users, this represents an estimated 2,500 additional conversions monthly
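The comparison above can be reproduced in base R with prop.test(); the counts follow from the stated rates and sample sizes (14.5% and 12.0% of 3,000 users each), and correct = FALSE disables the continuity correction so the result matches the hand-calculated z-test.

# Two-proportion z-test for the checkout example
checkout_test <- prop.test(
  x = c(435, 360),    # conversions: new design, old design
  n = c(3000, 3000),  # users per version
  correct = FALSE
)

# Printing the estimated difference, 95% CI, and p-value
print(checkout_test)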

This integrated approach ensures decisions are statistically sound, practically meaningful, and aligned with business objectives.

Inferential statistics in R

This chapter applies inferential statistics in the programming language R, translating theoretical knowledge into practical applications.

Predictive analytics

Data mining techniques

Regression analysis

Forecasting to predict future business outcomes

Literature

All references for this course.

Essential Readings

Further Readings

Example Exam

To prepare for the exam, please review the following example exam with solutions:

Example Exam