Data Science and Data Analytics (WS 2025)

International Business Management (B. A.)

Author
Affiliations

© Benjamin Gross

Hochschule Fresenius - University of Applied Sciences

Email: benjamin.gross@ext.hs-fresenius.de

Website: https://drbenjamin.github.io

Published

19.12.2025 11:52

Abstract

This document provides the course material for Data Science and Data Analytics (B. A. – International Business Management). Upon successful completion of the course, students will be able to: recognize important technological and methodological advancements in data science and distinguish between descriptive, predictive, and prescriptive analytics; demonstrate proficiency in classifying data and variables, collecting and managing data, and conducting comprehensive data evaluations; utilize R for effective data manipulation, cleaning, visualization, outlier detection, and dimensionality reduction; conduct sophisticated data exploration and mining techniques (including PCA, Factor Analysis, and Regression Analysis) to discover underlying patterns and inform decision-making; analyze and interpret causal relationships in data using regression analysis; evaluate and organize the implementation of a data analysis project in a business environment; and communicate the results and effects of a data analysis project in a structured way.

Scope and Nature of Data Science

Let’s start this course with some definitions and context.

Definition of Data Science:

The field of Data Science concerns techniques for extracting knowledge from diverse data, with a particular focus on ‘big’ data exhibiting ‘V’ attributes such as volume, velocity, variety, value and veracity.

Maneth & Poulovassilis (2016)

Definition of Data Analytics:

Data analytics is the systematic process of examining data using statistical, computational, and domain-specific methods to extract insights, identify patterns, and support decision-making. It combines competencies in data handling, analysis techniques, and domain knowledge to generate actionable outcomes in organizational contexts (Cuadrado-Gallego et al., 2023).

Definition of Business Analytics:

Business analytics is the science of posing and answering data questions related to business. Business analytics has rapidly expanded in the last few years to include tools drawn from statistics, data management, data visualization, and machine learning. There is increasing emphasis on big data handling to assimilate the advances made in data sciences. As is often the case with applied methodologies, business analytics has to be soundly grounded in applications in various disciplines and business verticals to be valuable. The bridge between the tools and the applications are the modeling methods used by managers and researchers in disciplines such as finance, marketing, and operations.

Pochiraju & Seshadri (2019)

There are many roles in the data science field, including (but not limited to):

Source: LinkedIn

For skills and competencies required for data science activities, see Skills Landscape.

Defining Data Science as an Academic Discipline

Data science emerges as an interdisciplinary field that synthesizes methodologies and insights from multiple academic domains to extract knowledge and actionable insights from data. As an academic discipline, data science represents a convergence of computational, statistical, and domain-specific expertise that addresses the growing need for data-driven decision-making in various sectors.

Data science draws from and interacts with multiple foundational disciplines:

  • Informatics / Information Systems:

    Informatics provides the foundational understanding of information processing, storage, and retrieval systems that underpin data science infrastructure. It encompasses database design, data modeling, information architecture, and system integration principles essential for managing large-scale data ecosystems. Information systems contribute knowledge about organizational data flows, enterprise architectures, and the sociotechnical aspects of data utilization in business contexts.

    See the Technical Applications & Data Analytics coursebook by Gross (2021) for further reading on foundations in informatics.

    Source: LinkedIn
  • Computer Science (algorithms, data structures, systems design):

    Computer science provides the computational foundation for data science through algorithm design, complexity analysis, and efficient data structures. Core contributions include machine learning algorithms, distributed computing paradigms, database systems, and software engineering practices. System design principles enable scalable data processing architectures, while computational thinking frameworks guide algorithmic problem-solving approaches essential for data-driven solutions.

    See also Analytical Skills for Business - 1 Introduction and the AI Universe overview graphic:

    Source: LinkedIn

    See the Overview on no-code and low-code tools for data analytics for a survey of such tools and related AI tooling.

  • Mathematics (linear algebra, calculus, optimization):

    Mathematics provides the theoretical backbone for data science through linear algebra (matrix operations, eigenvalues, vector spaces), calculus (derivatives, gradients, optimization), and discrete mathematics (graph theory, combinatorics). These mathematical foundations enable dimensionality reduction techniques, gradient-based optimization algorithms, statistical modeling, and the rigorous formulation of machine learning problems (see the figure below). Mathematical rigor ensures the validity and interpretability of analytical results.

    Source: LinkedIn
  • Statistics & Econometrics (inference, modeling, causal analysis):

    Statistics provides the methodological framework for data analysis through hypothesis testing, confidence intervals, regression analysis, and experimental design. Econometrics contributes advanced techniques for causal inference, time series analysis, and handling observational data challenges such as endogeneity and selection bias. These disciplines ensure rigorous uncertainty quantification, model validation, and the ability to draw reliable conclusions from data while understanding limitations and assumptions.

  • Social Science & Behavioral Sciences (contextual interpretation, experimental design):

    Social and behavioral sciences contribute essential understanding of human behavior, organizational dynamics, and contextual factors that influence data generation and interpretation. These disciplines provide expertise in experimental design, survey methodology, ethical considerations, and the social implications of data-driven decisions. They ensure that data science applications consider human factors, cultural context, and societal impact while maintaining ethical standards in data collection and analysis.

    Source: LinkedIn

    A recent Business Punk pilot study reveals that AI systems like ChatGPT, Claude, and Meta AI are developing genuine brand characteristics through their consistent tonality, personality, and emotional resonance with users. The research shows US-based AI models dominate in both global presence and emotional connection, while European systems remain functionally strong but lack distinctive brand identity. This transformation marks a shift where AI branding emerges in real-time through every interaction, making personality a core product quality and positioning systems like Claude and Meta AI as potential “superbrands” of the next decade.

    Source: Business Punk

The interdisciplinary nature of data science requires practitioners to develop competencies across these domains while maintaining awareness of how different methodological traditions complement and inform each other. This multidisciplinary foundation enables data scientists to approach complex problems with both technical rigor and contextual understanding, ensuring that analytical solutions are both technically sound and practically relevant.

For further reading on the academic foundations of data science, see the comprehensive analysis in Defining Data Science as an Academic Discipline.

Significance of Business Data Analysis for Decision-Making

Business data analysis has evolved from a supporting function to a critical strategic capability that fundamentally transforms how organizations make decisions, allocate resources, and compete in modern markets. The systematic application of analytical methods to business data enables evidence-based decision-making that reduces uncertainty, improves operational efficiency, and creates sustainable competitive advantages.

Strategic Decision-Making Framework

Business data analysis provides a structured approach to strategic decision-making through multiple analytical dimensions:

  • Evidence-Based Strategic Planning: Data analysis supports long-term strategic decisions by providing empirical evidence about market trends, competitive positioning, and organizational capabilities. Statistical analysis of historical performance data, market research, and competitive intelligence enables organizations to formulate strategies grounded in quantifiable evidence rather than intuition alone.

  • Risk Assessment and Mitigation: Advanced analytical techniques enable comprehensive risk evaluation across operational, financial, and strategic dimensions. Monte Carlo simulations, scenario analysis, and predictive modeling help organizations quantify potential risks and develop contingency plans based on probabilistic assessments of future outcomes.

  • Resource Allocation Optimization: Data-driven resource allocation models leverage optimization algorithms and statistical analysis to maximize return on investment across different business units, projects, and initiatives. Linear programming, integer optimization, and multi-criteria decision analysis provide frameworks for allocating limited resources to achieve optimal organizational outcomes.

Operational Decision Support

At the operational level, business data analysis transforms day-to-day decision-making through real-time insights and systematic performance measurement:

  • Performance Measurement and Continuous Improvement: Key Performance Indicators (KPIs) and statistical process control methods enable organizations to monitor operational efficiency, quality metrics, and customer satisfaction in real-time. Time series analysis, control charts, and regression analysis identify trends, anomalies, and improvement opportunities that drive continuous operational enhancement.

  • Forecasting and Demand Planning: Statistical forecasting models using techniques such as ARIMA, exponential smoothing, and machine learning algorithms enable accurate demand prediction for inventory management, capacity planning, and supply chain optimization. These analytical approaches reduce uncertainty in operational planning while minimizing costs associated with overstock or stockouts.

  • Customer Analytics and Personalization: Advanced customer analytics leverage segmentation analysis, predictive modeling, and behavioral analytics to understand customer preferences, predict churn, and optimize retention strategies. Clustering algorithms, logistic regression, and recommendation systems enable personalized customer experiences that increase satisfaction and loyalty.

Tactical Decision Integration

Business data analysis bridges strategic planning and operational execution through tactical decision support:

  • Pricing Strategy Optimization: Price elasticity analysis, competitive pricing models, and revenue optimization techniques enable dynamic pricing strategies that maximize profitability while maintaining market competitiveness. Regression analysis, A/B testing, and econometric modeling provide empirical foundations for pricing decisions.

  • Market Intelligence and Competitive Analysis: Data analysis transforms market research and competitive intelligence into actionable insights through statistical analysis of market trends, customer behavior, and competitive positioning. Multivariate analysis, factor analysis, and time series forecasting identify market opportunities and competitive threats.

  • Financial Performance Analysis: Financial analytics encompassing ratio analysis, variance analysis, and predictive financial modeling enable organizations to assess financial health, identify cost reduction opportunities, and optimize capital structure decisions. Statistical analysis of financial data supports both internal performance evaluation and external stakeholder communication.

Contemporary Analytical Capabilities

Modern business data analysis capabilities extend traditional analytical methods through integration of advanced technologies and methodologies:

  • Real-Time Analytics and Decision Support: Stream processing, event-driven analytics, and real-time dashboards enable immediate response to changing business conditions. Complex event processing and real-time statistical monitoring support dynamic decision-making in fast-paced business environments.

  • Predictive and Prescriptive Analytics: Machine learning algorithms, neural networks, and optimization models enable organizations to not only predict future outcomes but also recommend optimal actions. These advanced analytical capabilities support automated decision-making and strategic scenario planning.

  • Data-Driven Innovation: Analytics-driven innovation leverages data science techniques to identify new business opportunities, develop innovative products and services, and create novel revenue streams. Advanced analytics enable organizations to discover hidden patterns, correlations, and insights that drive innovation and competitive differentiation.

The significance of business data analysis for decision-making extends beyond technical capabilities to encompass organizational transformation, cultural change, and strategic competitive positioning. Organizations that successfully integrate analytical capabilities into their decision-making processes achieve superior performance outcomes, enhanced agility, and sustainable competitive advantages in increasingly data-driven markets.

For comprehensive coverage of business data analysis methodologies and applications, see Advanced Business Analytics and the analytical foundations outlined in Evans (2020).

For open access resources, visit Kaggle, a platform for data science competitions and datasets.

Types of Analytics

  • Descriptive Analytics: What happened?
  • Predictive Analytics: What is likely to happen?
  • Prescriptive Analytics: What should we do?

Source: https://datamites.com/blog/descriptive-vs-predictive-vs-prescriptive-analytics/
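As a minimal illustration of the three levels (a hedged sketch using hypothetical monthly sales figures, not course data):

# Descriptive: what happened?
sales <- c(120, 132, 128, 141, 150, 158, 163, 171)  # hypothetical monthly sales
month_index <- seq_along(sales)
mean(sales)
range(sales)

# Predictive: what is likely to happen next month?
fit <- lm(sales ~ month_index)
forecast_next <- predict(fit, newdata = data.frame(month_index = 9))
forecast_next

# Prescriptive: what should we do? (a simple decision rule applied to the forecast)
if (forecast_next > 175) "Increase production capacity" else "Keep current capacity"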

Data Analytic Competencies

Data analytic competencies encompass the ability to apply machine learning, data mining, statistical methods, and algorithmic approaches to extract meaningful patterns, insights, and predictions from complex datasets. They include proficiency in exploratory data analysis, feature engineering, model selection, evaluation, and validation. These skills ensure rigorous interpretation of data, support evidence-based decision-making, and enable the development of robust analytical solutions adaptable to diverse health, social, and technological contexts.

Types of Data

The structure and temporal dimension of data fundamentally influence analytical approaches and statistical methods. Understanding data types enables researchers to select appropriate modeling techniques and interpret results within proper contextual boundaries.

  • Cross-sectional data captures observations of multiple entities (individuals, firms, countries) at a single point in time. This structure facilitates comparative analysis across units but does not track changes over time. Cross-sectional studies are particularly valuable for examining relationships between variables at a specific moment and testing hypotheses about population characteristics.

  • Time-series data records observations of a single entity across multiple time points, enabling the analysis of temporal patterns, trends, seasonality, and cyclical behaviors. Time-series methods account for autocorrelation and temporal dependencies, supporting forecasting and dynamic modeling. This data structure is essential for economic indicators, financial markets, and environmental monitoring.

  • Panel (longitudinal) data combines both dimensions, tracking multiple entities over time. This structure offers substantial analytical advantages by controlling for unobserved heterogeneity across entities and modeling both within-entity and between-entity variation. Panel data methods support causal inference through fixed-effects and random-effects models, difference-in-differences estimation, and dynamic panel specifications.

Source: https://static.vecteezy.com
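A minimal sketch of the three structures above, using a small hypothetical country-year panel (illustrative values only):

# Building a small hypothetical panel of GDP growth rates
panel <- data.frame(
  country    = rep(c("DEU", "FRA"), each = 3),
  year       = rep(2021:2023, times = 2),
  gdp_growth = c(2.6, 1.8, -0.3, 6.8, 2.5, 0.9)
)

# Cross-sectional view: all countries at a single point in time
subset(panel, year == 2022)

# Time-series view: one country across all years
subset(panel, country == "DEU")

# Panel view: the full country-year structure
panel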

Additional data structures:

  • Geo-referenced / spatial data is data associated with specific geographic locations, enabling spatial analysis and visualization. Techniques such as Geographic Information Systems (GIS), spatial autocorrelation, and spatial regression models are employed to analyze patterns and relationships in spatially distributed data.

Source: https://www.slingshotsimulations.com/
  • Streaming / real-time data is continuously generated data that is processed and analyzed in real-time. This data structure is crucial for applications requiring immediate insights, such as fraud detection, network monitoring, and real-time recommendation systems.

Types of Variables

  • Continuous (interval/ratio) data is measured on a scale with meaningful intervals and a true zero point (ratio) or arbitrary zero point (interval). Examples include height, weight, temperature, and income. Continuous variables support a wide range of statistical analyses, including regression and correlation.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning continuous data to a data frame and displaying it as a table
continuous_data <- data.frame(
  Height_cm = c(170, 165, 180, 175, 160, 185, 172, 168, 178, 182),
  Weight_kg = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original table
print(continuous_data)
   Height_cm Weight_kg
1        170        70
2        165        60
3        180        80
4        175        75
5        160        55
6        185        90
7        172        68
8        168        62
9        178        78
10       182        85
# Ordering the data frame by Height_cm
ordered_data <- continuous_data %>% 
  arrange(Height_cm)

# Displaying the ordered data frame
print(ordered_data)
   Height_cm Weight_kg
1        160        55
2        165        60
3        168        62
4        170        70
5        172        68
6        175        75
7        178        78
8        180        80
9        182        85
10       185        90
# Practicing:
# 1. Assign continuous data to a data frame and display it as a table
# 2. Order the data frame by a specific column
  • Count data represents the number of occurrences of an event or the frequency of a particular characteristic. Count variables are typically non-negative integers and can be analyzed using Poisson regression or negative binomial regression.
# Assigning data with repeated values to a data frame (counts are derived below)
count_data <- data.frame(
  Height = c(170, 165, 182, 175, 165, 175, 175, 168, 175, 182),
  Weight = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original data as a table
print(count_data)
   Height Weight
1     170     70
2     165     60
3     182     80
4     175     75
5     165     55
6     175     90
7     175     68
8     168     62
9     175     78
10    182     85
# Counting the number of observations for each Height value
ordered_count_data <- count_data %>%
  arrange(desc(Height), Weight) %>%
  count(Height)

# Displaying the ordered count data
print(ordered_count_data)
  Height n
1    165 2
2    168 1
3    170 1
4    175 4
5    182 2
# Practicing:
# 1. Assign count data to a data frame and display it as a table
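As noted above, count outcomes are commonly modeled with Poisson regression; a minimal sketch on hypothetical complaint counts (not course data):

# Hypothetical counts of customer complaints per store
complaints <- data.frame(
  store_size   = c(10, 20, 15, 30, 25, 12, 28, 22),
  n_complaints = c(2, 5, 3, 9, 6, 2, 8, 5)
)

# Fitting a Poisson regression of complaint counts on store size
pois_model <- glm(n_complaints ~ store_size, family = poisson(), data = complaints)
summary(pois_model)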
  • Ordinal data represents categories with a meaningful order or ranking but no consistent interval between categories. Examples include survey responses (e.g., Likert scales) and socioeconomic status (e.g., low, medium, high). Ordinal variables can be analyzed using non-parametric tests or ordinal regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning ordinal data to a data frame and displaying it as a table
ordinal_data <- data.frame(
  Response = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
  Value = c(5, 4, 3, 2, 1)
)

# Displaying the original table
print(ordinal_data)
           Response Value
1 Strongly Disagree     5
2          Disagree     4
3           Neutral     3
4             Agree     2
5    Strongly Agree     1
# Ordering the data frame by Value
ordinal_data <- ordinal_data %>% 
  arrange(Value)

# Displaying the ordered data frame
print(ordinal_data)
           Response Value
1    Strongly Agree     1
2             Agree     2
3           Neutral     3
4          Disagree     4
5 Strongly Disagree     5
# Practicing:
# 1. Assign ordinal data to a data frame and display it as a table
# 2. Order the data frame by the ordinal value
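For ordinal data it is often useful to encode the ranking explicitly; a minimal sketch turning the Likert responses above into an ordered factor:

# Defining the ranking explicitly and converting Response to an ordered factor
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
ordinal_data$Response <- factor(ordinal_data$Response, levels = likert_levels, ordered = TRUE)

# Comparisons now respect the ordering (returns TRUE: "Strongly Agree" > "Neutral")
ordinal_data$Response[1] > ordinal_data$Response[3]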
  • Categorical (nominal / binary) data represents distinct categories without any inherent order. Nominal variables have two or more categories (e.g., gender, race, or marital status), while binary variables have only two categories (e.g., yes/no, success/failure). Categorical variables can be analyzed using chi-square tests or logistic regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning categorical data to a data frame and displaying it as a table
categorical_data <- data.frame(
  Event = c("A", "C", "B", "D", "E"),
  Category = c("Type1", "Type2", "Type1", "Type3", "Type2")
)

# Displaying the original table
print(categorical_data)
  Event Category
1     A    Type1
2     C    Type2
3     B    Type1
4     D    Type3
5     E    Type2
# Ordering the data frame by Event
categorical_data <- categorical_data %>% 
  arrange(Event)

# Displaying the table
print(categorical_data)
  Event Category
1     A    Type1
2     B    Type1
3     C    Type2
4     D    Type3
5     E    Type2
# Practicing:
# 1. Assign categorical data to a data frame and display it as a table
# 2. Order the data frame by a specific column
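As mentioned above, categorical variables can be analyzed with chi-square tests; a minimal goodness-of-fit sketch on the small example table (with such a tiny sample, R warns that the approximation may be inaccurate):

# Tabulating the Category variable from the example above
category_counts <- table(categorical_data$Category)
category_counts

# Chi-squared goodness-of-fit test against equal category frequencies
chisq.test(category_counts)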
  • Compositional or hierarchical structures represent data with a part-to-whole relationship or nested categories. Examples include demographic data (e.g., age groups within gender) and geographical data (e.g., countries within continents). Compositional data can be analyzed using techniques such as hierarchical clustering or multilevel modeling.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning hierarchical data to a data frame and displaying it as a table
hierarchical_data <- data.frame(
  Country = c("USA", "Canada" , "USA", "Canada", "Mexico"),
  State_Province = c("California", "Ontario", "Texas", "Quebec", "Jalisco"),
  Population_Millions = c(39.5, 14.5, 29.0, 8.5, 8.3)
)

# Displaying the table
print(hierarchical_data)
  Country State_Province Population_Millions
1     USA     California                39.5
2  Canada        Ontario                14.5
3     USA          Texas                29.0
4  Canada         Quebec                 8.5
5  Mexico        Jalisco                 8.3
# Ordering the data frame by Country and then State_Province
hierarchical_data <- hierarchical_data %>% 
  arrange(Country, State_Province)

# Displaying the ordered data frame
print(hierarchical_data)
  Country State_Province Population_Millions
1  Canada        Ontario                14.5
2  Canada         Quebec                 8.5
3  Mexico        Jalisco                 8.3
4     USA     California                39.5
5     USA          Texas                29.0
# Grouping the data by Country and summarizing total population
hierarchical_data_grouped <- hierarchical_data %>% 
  group_by(Country) %>%
  summarise(Total_Population = sum(Population_Millions))

# Displaying the grouped data frame
print(hierarchical_data_grouped)
# A tibble: 3 × 2
  Country Total_Population
  <chr>              <dbl>
1 Canada              23  
2 Mexico               8.3
3 USA                 68.5
# Practicing:
# 1. Assign hierarchical data to a data frame and display it as a table
# 2. Order the data frame by multiple columns

Source: https://www.collegedisha.com/

Some small datasets to start with

A custom R package ourdata has been created to provide some small datasets (and also some helper R functions) for practice. You can install it from GitHub using the following commands:

# Installing Github R packages
devtools::install_github("DrBenjamin/ourdata")
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/RtmpHOUweO/remotes4ca21964d22f/DrBenjamin-ourdata-9500135/DESCRIPTION’ ... OK
* preparing ‘ourdata’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘ourdata_0.5.0.tar.gz’

Using the package and exploring its documentation:

# Loading package
library(ourdata)

# Opening help of package
??ourdata

# Showing welcome message
ourdata()
This is the R package `ourdata` used for Data Science courses at the Fresenius University of Applied Sciences.
Type `help(ourdata)` to display the content.
Type `ourdata_website()` to open the package website.
Have fun in the course!
Benjamin Gross
# Printing some datasets
print(koelsch)
  Jahr   Koelsch
1 2017 186784000
2 2018 191308000
3 2019 179322000
4 2020 169182000
print(kirche)
  Jahr Austritte
1 2017    364711
2 2018    437416
3 2019    539509
4 2020    441390
# Opening the help page for the `combine` function from the `ourdata` R package
help(combine)
Help on topic 'combine' was found in the following packages:

  Package               Library
  dplyr                 /home/runner/R-library
  ourdata               /home/runner/R-library


Using the first match ...
# Using the `combine` function from the `ourdata` R package to combine two vectors into a data frame
ourdata::combine(kirche$Jahr, koelsch$Jahr, kirche$Austritte, koelsch$Koelsch)
'data.frame':   4 obs. of  3 variables:
 $ C1: chr  "2017" "2018" "2019" "2020"
 $ C2: num  364711 437416 539509 441390
 $ C3: num  1.87e+08 1.91e+08 1.79e+08 1.69e+08
    C1     C2        C3
1 2017 364711 186784000
2 2018 437416 191308000
3 2019 539509 179322000
4 2020 441390 169182000

A messy dataset example

We collect data from the OECD (Organisation for Economic Co-operation and Development), an international organization that works to build better policies for better lives. The dataset contains many columns (variables) with non-informative data and needs to be cleaned (wrangled) before analysis. First, load the data into R from the CSV file:

# Reading the dataset from a CSV file
preventable_deaths <- read.csv(
  "./topics/data/OECD_Preventable_Deaths.csv",
  stringsAsFactors = FALSE
)

or use the dataset directly from the ourdata R package:

# Reading the dataset from `ourdata` R package
library(ourdata)
preventable_deaths <- oecd_preventable

First we explore the data:

# Viewing structure of the dataset
str(preventable_deaths)

# Viewing first few rows
head(preventable_deaths)

# Checking dimensions
dim(preventable_deaths)

# Viewing column names
colnames(preventable_deaths)

# Summary statistics
summary(preventable_deaths)

# Checking for missing values
colSums(is.na(preventable_deaths))

Now we can start cleaning the data by removing non-informative columns and rows with missing values:

# Loading necessary library
library(dplyr)

# Selecting relevant columns for analysis
df_clean <- preventable_deaths %>%
  select(
    REF_AREA,
    Reference.area,
    TIME_PERIOD,
    OBS_VALUE
  ) %>%
  rename(
    country_code = REF_AREA,
    country = Reference.area,
    year = TIME_PERIOD,
    death_rate = OBS_VALUE
  )

# Converting year to numeric
df_clean$year <- as.numeric(df_clean$year)

# Converting death_rate to numeric (handling empty strings)
df_clean$death_rate <- as.numeric(df_clean$death_rate)

# Removing rows with missing death rates
df_clean <- df_clean %>%
  filter(!is.na(death_rate))

# Viewing cleaned data structure
str(df_clean)
'data.frame':   1162 obs. of  4 variables:
 $ country_code: chr  "AUS" "AUS" "AUS" "AUS" ...
 $ country     : chr  "Australia" "Australia" "Australia" "Australia" ...
 $ year        : num  2010 2011 2012 2013 2014 ...
 $ death_rate  : num  110 109 105 105 107 108 103 103 101 104 ...
# Summary of cleaned data
summary(df_clean)
 country_code         country               year        death_rate   
 Length:1162        Length:1162        Min.   :2010   Min.   : 34.0  
 Class :character   Class :character   1st Qu.:2013   1st Qu.: 75.0  
 Mode  :character   Mode  :character   Median :2016   Median :116.0  
                                       Mean   :2016   Mean   :128.7  
                                       3rd Qu.:2019   3rd Qu.:156.0  
                                       Max.   :2023   Max.   :453.0  

Showing some basic statistics of the cleaned data:

# Loading necessary library
library(dplyr)

# Printing overall statistics
cat("Mean death rate:", mean(df_clean$death_rate, na.rm = TRUE), "\n")
Mean death rate: 128.7151 
cat("Median death rate:", median(df_clean$death_rate, na.rm = TRUE), "\n")
Median death rate: 116 
cat("Standard deviation:", sd(df_clean$death_rate, na.rm = TRUE), "\n")
Standard deviation: 69.58822 
cat("Min death rate:", min(df_clean$death_rate, na.rm = TRUE), "\n")
Min death rate: 34 
cat("Max death rate:", max(df_clean$death_rate, na.rm = TRUE), "\n")
Max death rate: 453 
# Printing statistics by country
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    median_rate = median(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE),
    min_rate = min(death_rate, na.rm = TRUE),
    max_rate = max(death_rate, na.rm = TRUE),
    n_observations = n()
  ) %>%
  arrange(desc(mean_rate))
print(country_stats)
# A tibble: 46 × 7
   country        mean_rate median_rate sd_rate min_rate max_rate n_observations
   <chr>              <dbl>       <dbl>   <dbl>    <dbl>    <dbl>          <int>
 1 South Africa        330.        338     53.1      241      438             22
 2 Latvia              230.        226.    65.8      151      364             28
 3 Lithuania           225.        204.    67.9      134      340             28
 4 Romania             223.        220.    46.5      172      303             24
 5 Hungary             218.        208     72.4      141      375             28
 6 Mexico              216.        217     75.5      155      453             26
 7 Bulgaria            193.        186     45.5      156      378             26
 8 Brazil              185.        172.    51.6      133      356             24
 9 Slovak Republ…      178.        167     43.4      130      308             24
10 Estonia             172.        167     59.7      100      265             26
# ℹ 36 more rows
# Printing statistics by year
year_stats <- df_clean %>%
  group_by(year) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    median_rate = median(death_rate, na.rm = TRUE),
    n_countries = n_distinct(country)
  ) %>%
  arrange(year)
print(year_stats)
# A tibble: 14 × 4
    year mean_rate median_rate n_countries
   <dbl>     <dbl>       <dbl>       <int>
 1  2010      141.       128            45
 2  2011      135.       124.           44
 3  2012      133.       120            45
 4  2013      130.       116            45
 5  2014      126.       117            46
 6  2015      125.       114.           45
 7  2016      124.       112.           46
 8  2017      122.       111            45
 9  2018      121.       108.           45
10  2019      118.       106.           44
11  2020      137.       118.           42
12  2021      148.       118.           40
13  2022      120.       108.           35
14  2023      112.        99.5          14

To visualize the cleaned data, we can create a distribution plot of preventable death rates:

# Loading necessary library
library(ggplot2)

# Plotting distribution of death rates
ggplot(df_clean, aes(x = death_rate)) +
  geom_histogram(bins = 30, fill = "#9B59B6", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Preventable Death Rates",
    x = "Death Rate per 100,000",
    y = "Frequency",
    caption = "Source: OECD Health Statistics"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

Conceptual Framework: Knowledge & Understanding of Data

  • Clarify analytical purpose and domain context to guide data selection and interpretation.
  • Define entities, observational units, and identifiers to ensure accurate data representation.
  • Align business concepts with data structures for meaningful analysis.

Data Collection

Data collection forms the foundational stage of any data science project, requiring systematic approaches to gather information that aligns with research objectives and analytical requirements. As outlined in modern statistical frameworks, effective data collection strategies must balance methodological rigor with practical constraints (Çetinkaya-Rundel & Hardin, 2021).

Methods of Data Collection

Core Data Collection Competencies

The competencies required for effective data collection encompass both technical proficiency and methodological understanding (see Data Collection Competencies.pdf):

  • Source Identification and Assessment: Systematically identify internal and external data sources, evaluating their relevance, quality, and accessibility for the analytical objectives.

  • Data Acquisition Methods: Implement appropriate collection techniques including APIs (for instance see Spotify API tutorial and Postman Spotify tutorial), database queries, survey instruments, sensor networks, web scraping, and third-party vendor partnerships, ensuring methodological alignment with research design. A minimal acquisition sketch follows this list.

    For the two projects below, here are some data source recommendations (automatically created by Perplexity AI Deep Research):

  • Quality and Governance Framework: Establish protocols for assessing data provenance, licensing agreements, ethical compliance, and regulatory requirements (GDPR, industry-specific standards).

  • Methodological Considerations: Apply principles from research methodology to ensure data collection approaches support valid statistical inference and minimize bias introduction during the acquisition process.
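Returning to the data acquisition methods above, here is a minimal, offline sketch of API-style acquisition: parsing a JSON payload into a data frame. In practice the payload would come from a real endpoint (e.g., via the httr2 or curl packages); the values below are invented.

# Loading the jsonlite library for JSON parsing
library(jsonlite)

# A hypothetical JSON payload as it might be returned by a REST API
payload <- '[{"country":"DEU","year":2022,"value":101.2},
             {"country":"FRA","year":2022,"value":98.7}]'

# Parsing the payload into a data frame
api_df <- fromJSON(payload)
str(api_df)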

Contemporary Data Collection Landscape

Modern data collection operates within an increasingly complex ecosystem characterized by diverse data types, real-time requirements, and distributed sources. The integration of traditional survey methods with emerging IoT sensors, social media APIs, and automated data pipelines requires comprehensive competency frameworks that address both technical implementation and methodological validity.

For comprehensive coverage of data collection methodologies and best practices, refer to: Research Methodology - Data Collection

Data Management

Data Management in data science curricula requires a coherent, multi-faceted framework that spans data quality, FAIR stewardship, master data governance, privacy/compliance, and modern architectures. Data quality assessment and governance define objective metrics (completeness, accuracy, consistency, plausibility, conformance) and governance processes that balance automated checks with human oversight. FAIR data principles provide a practical blueprint for metadata-rich stewardship to support findability, accessibility, interoperability, and reuse through machine-actionable metadata and persistent identifiers.

Master Data Management ensures clean, trusted core entities across systems via governance and harmonization. Data privacy, security, and regulatory compliance embed responsible data handling and risk management, guided by purpose limitation, data minimization, accuracy, storage limitation, integrity/confidentiality, and accountability. Emerging trends in cloud-native data platforms, ETL/ELT (Extract, Transform, Load or Extract, Load, Transform), data lakes/lakehouses, and broader metadata automation shape scalable storage/compute and governance, enabling reproducible analytics and ML workflows. Together, these strands underpin trustworthy, discoverable, and compliant data inputs for research and coursework (Weiskopf & Weng, 2016; Wilkinson et al., 2016; GO FAIR Foundation, n.d.; Semarchy, n.d.; IBM, n.d.).

Tools like dbt, Apache Airflow, and data catalog platforms (e.g., Alation, Collibra) operationalize data management practices. The n8n automation tool can also be used to automate data workflows; see the Analytical Skills for Business course handbook on this topic. In late 2025, a free n8n certification course was added to the n8n documentation.

Data Evaluation

  • Define data quality dimensions and assessment frameworks.
  • Distinguish validation from verification in data pipelines.
  • Apply statistical methods and data profiling for evaluation.
  • Balance automated and manual (human-in-the-loop) evaluation approaches.
  • Implement tools, workflows, and governance for data evaluation.

Data evaluation ensures datasets are fit-for-purpose, reliable, and trustworthy throughout the analytical lifecycle. It encompasses systematic assessment of data quality dimensions (completeness, accuracy, consistency, validity, timeliness, uniqueness), validation and verification processes, and strategic application of statistical methods and data profiling. Quality dimensions provide measurable criteria for determining whether data meets analytical requirements, while assessment frameworks translate these into actionable metrics enabling objective measurement and contextual interpretation.

Validation and verification are complementary processes essential for data integrity. Validation checks occur at entry points, preventing bad data through constraint checks, format validation, and business rule enforcement. Verification involves post-collection checks ensuring data remains accurate and consistent over time and across system boundaries, supporting reproducibility and traceability. Together, validation acts as a gatekeeper while verification provides ongoing quality assurance.
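A minimal sketch of such rule-based checks, applied to the cleaned OECD data (df_clean) from the example above; each rule is a logical column, and the share of passing rows serves as a simple quality metric:

# Loading the dplyr library for data manipulation
library(dplyr)

# Defining simple validation rules as logical columns
quality_check <- df_clean %>%
  mutate(
    complete_rate  = !is.na(death_rate),
    plausible_rate = death_rate >= 0 & death_rate < 1000,
    valid_year     = year >= 2000 & year <= 2030
  )

# Verification-style summary: share of rows passing each rule
colMeans(quality_check[, c("complete_rate", "plausible_rate", "valid_year")])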

Statistical methods form the technical foundation for evaluation. Outlier detection techniques (z-score, IQR, DBSCAN) identify anomalous observations requiring investigation. Distribution checks assess whether data conforms to expected patterns, while profiling describes dataset structure, missingness patterns, and statistical properties. Regression analysis and hypothesis testing diagnose quality issues and quantify relationships between quality metrics and analytical outcomes.

Modern data evaluation balances automated and manual approaches. Automated evaluation offers speed, scalability, and consistency for large datasets through rule-based validation, statistical profiling, and machine learning-based anomaly detection. Manual evaluation contributes domain expertise, contextual understanding, and interpretative judgment that automated systems cannot replicate. Human-in-the-loop approaches combine automation’s efficiency with human interpretability, optimizing both throughput and quality.

Tools, workflows, and governance frameworks provide infrastructure for systematic evaluation across the data lifecycle. Data profiling tools (e.g., Pandas Profiling, Great Expectations, Deequ) automate quality assessment. Validation frameworks embed checks into ETL/ELT pipelines. Data lineage tracking and metadata management support traceability and impact analysis. Governance frameworks establish roles, responsibilities, and processes aligning evaluation practices with regulatory requirements and reproducibility needs.

Applications in the Programming Language R

Please consult How to Use R for Data Science by Prof. Dr. Huber for basic questions regarding R programming.

Core tidyverse Tooling

Fundamental packages:

  • dplyr for data manipulation (filter, mutate, summarize, joins).
# Loading necessary library
library(dplyr)

# Using dplyr to pipe
df_summary <- df_clean %>%
  group_by(country) %>%
  summarize(
    mean_death_rate = mean(death_rate, na.rm = TRUE),
    max_death_rate = max(death_rate, na.rm = TRUE),
    min_death_rate = min(death_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_death_rate))

# Displaying the summary statistics
print(df_summary)
# A tibble: 46 × 4
   country         mean_death_rate max_death_rate min_death_rate
   <chr>                     <dbl>          <dbl>          <dbl>
 1 South Africa               330.            438            241
 2 Latvia                     230.            364            151
 3 Lithuania                  225.            340            134
 4 Romania                    223.            303            172
 5 Hungary                    218.            375            141
 6 Mexico                     216.            453            155
 7 Bulgaria                   193.            378            156
 8 Brazil                     185.            356            133
 9 Slovak Republic            178.            308            130
10 Estonia                    172.            265            100
# ℹ 36 more rows
  • tidyr for data reshaping (pivoting, nesting, separating, unnesting).
# Loading necessary library
library(tidyr)

# Using tidyr to pivot data
df_wide <- df_clean %>%
  pivot_wider(
    names_from = year,
    values_from = death_rate
  )

# Displaying the wide format data
print(df_wide)
# A tibble: 46 × 16
   country_code country  `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017`
   <chr>        <chr>    <list> <list> <list> <list> <list> <list> <list> <list>
 1 AUS          Austral… <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 2 AUT          Austria  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 3 BEL          Belgium  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 4 CAN          Canada   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 5 CHL          Chile    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 6 COL          Colombia <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 7 CRI          Costa R… <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 8 CZE          Czechia  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 9 DNK          Denmark  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
10 EST          Estonia  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
# ℹ 36 more rows
# ℹ 6 more variables: `2018` <list>, `2019` <list>, `2020` <list>,
#   `2021` <list>, `2022` <list>, `2023` <list>
  • ggplot2 for layered grammar-based visualization. See the example above.

  • Additional helper packages:

    • readr for ingestion of CSV and other flat files.
    • readxl for ingestion of Excel files.
    • lubridate for date-time handling.
    • purrr for functional iteration.
    • stringr for text handling.
    • forcats for factor handling.
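A minimal sketch of some of these helpers with illustrative values (readr::read_csv() and readxl::read_excel() are omitted because they require a file or URL):

# Loading the helper libraries
library(lubridate)
library(stringr)
library(purrr)
library(forcats)

# lubridate: parsing day-month-year dates
dmy(c("01.03.2024", "15.07.2024"))

# stringr: pattern matching in character vectors
str_detect(c("OECD", "Eurostat"), "OECD")

# purrr: applying a function to every element of a list
map_dbl(list(1:3, 4:6), mean)

# forcats: reordering factor levels by frequency
fct_infreq(factor(c("A", "B", "A", "C", "A")))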

Data Visualization Principles

Choose encodings appropriate to variable types:

  • Continuous (quantitative) variables → position, length. Examples: x/y coordinates in scatter plots, bar heights, line positions.
  • Categorical (nominal) variables → color hue, shape, facets. Examples: different colors for groups, point shapes, separate panels.
  • Ordinal variables → ordered position, color saturation. Examples: ordered categories on an axis, gradient colors from light to dark.
  • Temporal variables → position along the x-axis, line connections. Examples: time on the horizontal axis, connected points showing progression.
  • Compositional (part-to-whole) variables → stacked position, area. Examples: stacked bars, proportional areas.

Emphasize clarity: reduce chart junk; apply perceptual best practices:

Clarity in data visualization requires removing unnecessary elements that distract from the data while applying principles of human perception to enhance understanding.

  • Remove decorative elements (3D effects, shadows, gradients)
  • Eliminate redundant labels and gridlines
  • Minimize non-data ink (borders, background colors)
  • Avoid unnecessary legends when direct labeling is possible
  • Use position over angle for quantitative comparisons (bar charts > pie charts)
  • Maintain consistent scales across comparable charts
  • Respect aspect ratios that emphasize meaningful patterns
  • Choose colorblind-friendly palettes
  • Ensure sufficient contrast between data and background
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Creating data for demonstration
country_subset <- df_clean %>%
  filter(country %in% c("Germany", "France", "United Kingdom")) %>%
  filter(year >= 2015)

# Example: Clean, minimal visualization
ggplot(country_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Preventable Death Rates (2015-2023)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),  # Removing unnecessary gridlines
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Support comparison, trend detection, and anomaly spotting:

Effective visualizations should facilitate three key analytical tasks: comparing values across groups, identifying trends over time, and detecting unusual patterns.

Support comparison:

  • Align items on common scales for direct comparison
  • Use small multiples (facets) for comparing across categories
  • Order categorical variables meaningfully (by value, alphabetically, or logically)
  • Keep consistent ordering across related charts

Enable trend detection:

  • Use connected lines for temporal data to show continuity
  • Add trend lines (linear, loess) to highlight overall patterns
  • Display sufficient time periods to establish meaningful trends
  • Avoid over-smoothing that hides important variations

Facilitate anomaly spotting:

  • Use reference lines or bands for expected ranges
  • Highlight outliers through color or annotation
  • Include context (confidence intervals, historical ranges)
  • Maintain consistent scales to make deviations visible
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Calculating statistics for anomaly detection
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE)
  )

# Joining back to identify anomalies
df_annotated <- df_clean %>%
  left_join(country_stats, by = "country") %>%
  mutate(
    z_score = (death_rate - mean_rate) / sd_rate,
    is_anomaly = abs(z_score) > 2  # Flagging values > 2 standard deviations
  )

# Example: Visualization supporting comparison, trends, and anomaly detection
selected_countries <- c("Germany", "France", "United Kingdom", "Italy", "Spain")
df_subset <- df_annotated %>% filter(country %in% selected_countries)

ggplot(df_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 0.8) +
  geom_point(aes(size = is_anomaly, alpha = is_anomaly)) +
  scale_size_manual(values = c(1.5, 3), guide = "none") +
  scale_alpha_manual(values = c(0.6, 1), guide = "none") +
  facet_wrap(~ country, ncol = 2) +  # Small multiples for comparison
  labs(
    title = "Preventable Death Rates: Trends and Anomalies",
    subtitle = "Larger points indicate statistical anomalies (>2 SD from country mean)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold")
  )

Detecting Outliers and Anomalies

  • Rule-based methods (IQR, z-scores): These classical univariate approaches identify outliers by establishing statistical thresholds, where observations falling beyond predefined boundaries (e.g., 1.5×IQR or ±3 standard deviations) are flagged as potential anomalies. While computationally efficient and interpretable, these methods assume underlying distributional properties and may overlook multivariate patterns.
# Loading necessary library
library(dplyr)

# Calculating z-scores for death_rate
df_zscores <- df_clean %>%
  group_by(country) %>%
  mutate(
    mean_rate = mean(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE),
    z_score = (death_rate - mean_rate) / sd_rate,
    is_outlier = abs(z_score) > 3  # Flagging z-scores beyond ±3
  ) %>%
  ungroup()

# Displaying observations flagged as outliers
outliers <- df_zscores %>%
  filter(is_outlier)
print(outliers)
# A tibble: 10 × 8
   country_code country     year death_rate mean_rate sd_rate z_score is_outlier
   <chr>        <chr>      <dbl>      <dbl>     <dbl>   <dbl>   <dbl> <lgl>     
 1 COL          Colombia    2021        304      142.    48.2    3.35 TRUE      
 2 CRI          Costa Rica  2021        209      114.    28.3    3.36 TRUE      
 3 MEX          Mexico      2020        445      216.    75.5    3.03 TRUE      
 4 MEX          Mexico      2021        453      216.    75.5    3.14 TRUE      
 5 SVK          Slovak Re…  2021        308      178.    43.4    3.00 TRUE      
 6 ARG          Argentina   2021        269      149.    27.1    4.42 TRUE      
 7 BRA          Brazil      2021        356      185.    51.6    3.31 TRUE      
 8 BGR          Bulgaria    2021        378      193.    45.5    4.07 TRUE      
 9 PER          Peru        2020        408      124.    90.4    3.14 TRUE      
10 PER          Peru        2021        447      124.    90.4    3.57 TRUE      
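The IQR rule mentioned above can be applied to the same data; a minimal sketch flagging values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within each country:

# Loading necessary library
library(dplyr)

# Flagging observations outside the IQR fences within each country
df_iqr <- df_clean %>%
  group_by(country) %>%
  mutate(
    q1  = quantile(death_rate, 0.25, na.rm = TRUE),
    q3  = quantile(death_rate, 0.75, na.rm = TRUE),
    iqr = q3 - q1,
    is_iqr_outlier = death_rate < q1 - 1.5 * iqr | death_rate > q3 + 1.5 * iqr
  ) %>%
  ungroup()

# Counting the flagged observations
sum(df_iqr$is_iqr_outlier)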
  • Robust statistics (median, MAD): Resistant measures such as the median and median absolute deviation (MAD) provide reliable central tendency and dispersion estimates that are less influenced by extreme values compared to mean and standard deviation. These statistics form the foundation for outlier detection in skewed or heavy-tailed distributions where parametric assumptions are violated.
# Loading necessary library
library(dplyr)

# Calculating robust statistics for death_rate
df_robust <- df_clean %>%
  group_by(country) %>%
  mutate(
    median_rate = median(death_rate, na.rm = TRUE),
    mad_rate = mad(death_rate, constant = 1, na.rm = TRUE), 
    robust_z = 0.6745 * (death_rate - median_rate) / mad_rate,
    is_robust_outlier = abs(robust_z) > 3.5
  ) %>%
  ungroup()

# Displaying observations flagged as robust outliers
robust_outliers <- df_robust %>%
  filter(is_robust_outlier)
print(robust_outliers)
# A tibble: 10 × 8
   country_code country    year death_rate median_rate mad_rate robust_z
   <chr>        <chr>     <dbl>      <dbl>       <dbl>    <dbl>    <dbl>
 1 COL          Colombia   2021        304       132       27       4.30
 2 MEX          Mexico     2020        445       217       34       4.52
 3 MEX          Mexico     2021        453       217       34       4.68
 4 ARG          Argentina  2020        194       144.       5.5     6.19
 5 ARG          Argentina  2021        269       144.       5.5    15.4 
 6 BRA          Brazil     2021        356       172.      31       4.01
 7 BGR          Bulgaria   2021        378       186       20.5     6.32
 8 PER          Peru       2020        408        97.5      6.5    32.2 
 9 PER          Peru       2021        447        97.5      6.5    36.3 
10 PER          Peru       2022        146        97.5      6.5     5.03
# ℹ 1 more variable: is_robust_outlier <lgl>
  • Model-based or multivariate detection (e.g., Mahalanobis distance, clustering residuals): Advanced techniques account for correlation structures and multidimensional relationships, enabling detection of outliers that appear normal in individual dimensions but are anomalous in multivariate space. Mahalanobis distance measures how many standard deviations an observation is from the distribution center, while clustering residuals identify observations that deviate from expected cluster membership patterns.
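A minimal sketch of Mahalanobis-based detection, using the two numeric columns of df_clean purely for illustration (real applications would use a richer set of variables):

# Selecting the numeric columns and computing squared Mahalanobis distances
X <- df_clean[, c("year", "death_rate")]
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Under approximate normality, squared distances follow a chi-squared distribution
# with df = number of variables; flagging observations beyond the 97.5% quantile
md_cutoff <- qchisq(0.975, df = ncol(X))
sum(md > md_cutoff)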

  • Distinguish errors vs. novel but valid observations: Critical analytical judgment is required to differentiate between measurement errors, data entry mistakes, and legitimate extreme values that represent rare but genuine phenomena. This distinction has profound implications for data quality management and scientific discovery, as premature removal of valid outliers may obscure important patterns or emerging trends.

Dimensionality Reduction

  • Motivation: mitigate multicollinearity, noise, and curse of dimensionality: High-dimensional datasets present computational challenges and statistical complications, including increased sparsity, model overfitting, and reduced discriminatory power of distance-based methods. Dimensionality reduction addresses these issues by transforming data into lower-dimensional representations while preserving essential variance and structural relationships.

  • Techniques: Principal Component Analysis (PCA), Factor Analysis, (optionally) t-SNE / UMAP (for exploration): PCA identifies orthogonal linear combinations of variables that maximize variance, creating uncorrelated components suitable for regression and classification tasks. Factor Analysis assumes latent constructs underlying observed variables, focusing on shared variance and theoretical interpretation, while non-linear methods like t-SNE and UMAP preserve local neighborhood structures for exploratory visualization of complex data manifolds.

# Loading necessary libraries
library(dplyr)

# Preparing data for PCA (using numeric variables only)
df_pca <- df_clean %>%
  select(year, death_rate) %>%
  na.omit()

# Performing PCA
pca_result <- prcomp(df_pca, scale. = TRUE)

# Displaying summary of PCA
summary(pca_result)
Importance of components:
                          PC1    PC2
Standard deviation     1.0189 0.9807
Proportion of Variance 0.5191 0.4809
Cumulative Proportion  0.5191 1.0000
# Showing principal component loadings
print(pca_result$rotation)
                  PC1       PC2
year        0.7071068 0.7071068
death_rate -0.7071068 0.7071068
# Calculating proportion of variance explained
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cat("Proportion of variance explained by PC1:", round(var_explained[1], 3), "\n")
Proportion of variance explained by PC1: 0.519 
cat("Proportion of variance explained by PC2:", round(var_explained[2], 3), "\n")
Proportion of variance explained by PC2: 0.481 
  • Interpretability vs. compression trade-offs: Dimensionality reduction inherently balances the competing objectives of achieving parsimonious data representations and maintaining interpretable relationships to original variables. While aggressive compression maximizes computational efficiency and reduces noise, it may obscure meaningful features and complicate domain-specific interpretation of analytical results.

Data Exploration and Mining

  • Structured EDA workflow: question → visualize → quantify → refine: Exploratory Data Analysis follows a systematic iterative process that begins with research questions, employs visualization to generate hypotheses, quantifies patterns through statistical measures, and refines understanding through successive analytical cycles. This disciplined approach prevents data dredging while ensuring comprehensive investigation of data characteristics and relationships.
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Step 1: Question - Are death rates declining over time?

# Step 2: Visualize
ggplot(df_clean, aes(x = year, y = death_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "EDA: Death Rates Over Time",
    x = "Year",
    y = "Deaths per 100,000"
  ) +
  theme_minimal()

# Step 3: Quantify
trend_model <- lm(death_rate ~ year, data = df_clean)
summary(trend_model)

Call:
lm(formula = death_rate ~ year, data = df_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-89.80 -54.93 -13.16  27.95 327.80 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1544.5523  1087.6865   1.420    0.156
year          -0.7023     0.5395  -1.302    0.193

Residual standard error: 69.57 on 1160 degrees of freedom
Multiple R-squared:  0.001459,  Adjusted R-squared:  0.0005978 
F-statistic: 1.694 on 1 and 1160 DF,  p-value: 0.1933
# Step 4: Refine - Examine by country
country_trends <- df_clean %>%
  group_by(country) %>%
  summarise(
    correlation = cor(year, death_rate, use = "complete.obs")
  ) %>%
  arrange(correlation)

print(head(country_trends))
# A tibble: 6 × 2
  country      correlation
  <chr>              <dbl>
1 South Africa      -0.663
2 Israel            -0.473
3 Korea             -0.338
4 Japan             -0.324
5 Luxembourg        -0.317
6 Lithuania         -0.290
  • PCA for variance structure: Principal Component Analysis reveals the underlying variance-covariance structure of multivariate data, identifying dimensions of maximum variability and reducing redundancy among correlated variables. By examining eigenvalues and component loadings, analysts determine the effective dimensionality of the dataset and detect dominant patterns in the data.
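Eigenvalues and a scree plot make this examination concrete. The sketch below reuses pca_result; with only two input variables the plot is trivial here, but the same calls apply unchanged to wider datasets.

# Eigenvalues (variances of the principal components)
eigenvalues <- pca_result$sdev^2
print(round(eigenvalues, 3))

# Scree plot of component variances (look for the "elbow")
screeplot(pca_result, type = "lines", main = "Scree Plot of Principal Components")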

  • Factor Analysis for latent constructs: This technique assumes that observed variables are manifestations of unobservable latent factors, making it particularly valuable for psychometric research and construct validation. Unlike PCA, Factor Analysis models measurement error explicitly and focuses on shared rather than total variance, facilitating theoretical interpretation of underlying psychological or economic constructs.

# Loading necessary libraries
library(dplyr)
library(tidyr)

# Creating a wider dataset for factor analysis
# First, get unique country-year combinations by averaging any duplicates
df_wide_fa <- df_clean %>%
  filter(year >= 2018) %>%
  select(country, year, death_rate) %>%
  group_by(country, year) %>%
  summarise(death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = death_rate, names_prefix = "year_") %>%
  select(-country) %>%
  na.omit()

# Performing factor analysis using base R stats package
if(nrow(df_wide_fa) > 10 && ncol(df_wide_fa) >= 3) {
  # Determining number of factors (minimum of 2 or ncol-1)
  n_factors <- min(2, ncol(df_wide_fa) - 1)
  
  # Using factanal from stats package
  fa_result <- factanal(df_wide_fa, factors = n_factors, rotation = "varimax")
  
  # Displaying factor loadings
  cat("Factor Loadings:\n")
  print(fa_result$loadings)
  
  # Calculating proportion of variance explained
  loadings_sq <- fa_result$loadings^2
  var_explained <- colSums(loadings_sq) / nrow(loadings_sq)
  cat("\nProportion of variance explained by each factor:\n")
  print(var_explained)
  
  # Displaying uniquenesses (proportion of variance not explained by factors)
  cat("\nUniqueness (1 - communality):\n")
  print(fa_result$uniquenesses)
} else {
  cat("Note: Not enough observations or variables for factor analysis.\n")
  cat("Factor analysis requires at least 3 variables and sufficient observations.\n")
}
Factor Loadings:

Loadings:
          Factor1 Factor2
year_2018 0.759   0.650  
year_2019 0.754   0.655  
year_2020 0.673   0.736  
year_2021 0.731   0.668  
year_2022 0.775   0.630  
year_2023 0.729   0.664  

               Factor1 Factor2
SS loadings      3.262   2.676
Proportion Var   0.544   0.446
Cumulative Var   0.544   0.990

Proportion of variance explained by each factor:
  Factor1   Factor2 
0.5437206 0.4460732 

Uniqueness (1 - communality):
 year_2018  year_2019  year_2020  year_2021  year_2022  year_2023 
0.00500000 0.00500000 0.00500000 0.01875942 0.00500000 0.02905113 
  • Regression Analysis for relationships and predictive structure: Linear and non-linear regression models quantify relationships between dependent and independent variables, enabling both explanatory analysis of associations and predictive modeling of outcomes. These methods provide parameter estimates, statistical inference, and diagnostic tools to assess model adequacy and identify influential observations.
# Loading necessary library
library(dplyr)

# Preparing data with a categorical variable
df_regression <- df_clean %>%
  mutate(
    time_period = case_when(
      year < 2015 ~ "Early",
      year >= 2015 & year < 2020 ~ "Middle",
      year >= 2020 ~ "Recent"
    )
  ) %>%
  filter(!is.na(time_period))

# Simple linear regression
model_simple <- lm(death_rate ~ year, data = df_regression)
summary(model_simple)

Call:
lm(formula = death_rate ~ year, data = df_regression)

Residuals:
   Min     1Q Median     3Q    Max 
-89.80 -54.93 -13.16  27.95 327.80 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1544.5523  1087.6865   1.420    0.156
year          -0.7023     0.5395  -1.302    0.193

Residual standard error: 69.57 on 1160 degrees of freedom
Multiple R-squared:  0.001459,  Adjusted R-squared:  0.0005978 
F-statistic: 1.694 on 1 and 1160 DF,  p-value: 0.1933
# Multiple regression with categorical predictor
model_multiple <- lm(death_rate ~ year + time_period, data = df_regression)
summary(model_multiple)

Call:
lm(formula = death_rate ~ year + time_period, data = df_regression)

Residuals:
   Min     1Q Median     3Q    Max 
-98.10 -54.76 -12.67  27.19 319.30 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)       6960.638   3073.466   2.265   0.0237 *
year                -3.393      1.528  -2.221   0.0265 *
time_periodMiddle    5.755      8.892   0.647   0.5177  
time_periodRecent   31.218     14.975   2.085   0.0373 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 69.32 on 1158 degrees of freedom
Multiple R-squared:  0.01036,   Adjusted R-squared:  0.007793 
F-statistic:  4.04 on 3 and 1158 DF,  p-value: 0.007176
# Extracting and interpreting coefficients
cat("\nCoefficient interpretation:\n")

Coefficient interpretation:
cat("Year coefficient:", round(coef(model_simple)[2], 3), "\n")
Year coefficient: -0.702 
cat("This means death rate changes by", round(coef(model_simple)[2], 3), 
    "per 100,000 per year\n")
This means death rate changes by -0.702 per 100,000 per year
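The regression bullet above also mentions identifying influential observations. A minimal sketch using Cook's distance on model_multiple follows; the 4/n cut-off is a common rule of thumb, not a strict standard.

# Calculating Cook's distance for the multiple regression model
cooks_d <- cooks.distance(model_multiple)

# Flagging observations above the rough 4/n threshold
n_obs <- nobs(model_multiple)
influential <- which(cooks_d > 4 / n_obs)
cat("Number of potentially influential observations:", length(influential), "\n")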
  • Clustering (k-means, hierarchical) for pattern discovery (optional topic): Unsupervised clustering algorithms partition observations into homogeneous groups based on similarity metrics, revealing natural taxonomies and segment structures within data. K-means optimizes within-cluster variance through iterative assignment, while hierarchical methods create nested groupings that can be visualized through dendrograms to inform cluster selection; a hierarchical sketch follows the k-means example below.
# Loading necessary libraries
library(dplyr)
library(ggplot2)

# Preparing data for clustering - average death rate by country
df_cluster <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_death_rate = mean(death_rate, na.rm = TRUE),
    trend = cor(year, death_rate, use = "complete.obs")
  ) %>%
  na.omit()

# K-means clustering
set.seed(123)
kmeans_result <- kmeans(df_cluster[, c("mean_death_rate", "trend")], centers = 3)

# Adding cluster assignment to data
df_cluster$cluster <- as.factor(kmeans_result$cluster)

# Visualizing clusters
ggplot(df_cluster, aes(x = mean_death_rate, y = trend, color = cluster)) +
  geom_point(size = 3) +
  labs(
    title = "K-means Clustering of Countries",
    x = "Mean Death Rate",
    y = "Time Trend (correlation)",
    color = "Cluster"
  ) +
  theme_minimal()

# Displaying cluster centers
cat("\nCluster Centers:\n")

Cluster Centers:
print(kmeans_result$centers)
  mean_death_rate       trend
1        91.57995 -0.13751640
2       159.63393  0.08403368
3       240.58291 -0.15086402
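For the hierarchical variant, the sketch below standardizes the same two country-level features (a deliberate choice here, since mean death rate and the trend correlation are on very different scales) and cuts the dendrogram into three groups for comparison with the k-means solution.

# Hierarchical clustering on standardized country features
cluster_features <- scale(df_cluster[, c("mean_death_rate", "trend")])
hc_result <- hclust(dist(cluster_features), method = "ward.D2")

# Plotting the dendrogram with country labels
plot(hc_result, labels = df_cluster$country, cex = 0.5,
     main = "Hierarchical Clustering of Countries")

# Cutting the tree into 3 clusters and cross-tabulating with k-means
hc_clusters <- cutree(hc_result, k = 3)
table(hierarchical = hc_clusters, kmeans = kmeans_result$cluster)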

Causal Inference with Regression Analysis

  • Distinguish association vs. causation: Statistical association measures correlation between variables without implying directionality, whereas causal inference attempts to establish that changes in one variable directly produce changes in another. Demonstrating causality requires careful consideration of temporal precedence, theoretical mechanisms, and elimination of alternative explanations through research design and statistical controls.
# Loading necessary library
library(dplyr)

# Example: Association between year and death rate
association <- cor(df_clean$year, df_clean$death_rate, use = "complete.obs")
cat("Association (correlation) between year and death rate:", round(association, 3), "\n")
Association (correlation) between year and death rate: -0.038 
cat("\nThis shows association, but does NOT prove that time causes death rate changes.\n")

This shows association, but does NOT prove that time causes death rate changes.
cat("Potential confounders: healthcare improvements, policy changes, etc.\n")
Potential confounders: healthcare improvements, policy changes, etc.
# Demonstrating how confounders can affect interpretation
# Creating a hypothetical scenario
df_confound <- df_clean %>%
  mutate(
    developed = country %in% c("Germany", "France", "United Kingdom", 
                                "United States", "Japan")
  )

# Correlation in developed vs developing countries
cor_developed <- df_confound %>%
  filter(developed == TRUE) %>%
  summarise(cor = cor(year, death_rate, use = "complete.obs")) %>%
  pull(cor)

cor_developing <- df_confound %>%
  filter(developed == FALSE) %>%
  summarise(cor = cor(year, death_rate, use = "complete.obs")) %>%
  pull(cor)

cat("\nCorrelation in developed countries:", round(cor_developed, 3), "\n")

Correlation in developed countries: 0.002 
cat("Correlation in developing countries:", round(cor_developing, 3), "\n")
Correlation in developing countries: -0.046 
cat("\nDifferent correlations suggest development level may be a confounder.\n")

Different correlations suggest development level may be a confounder.
  • Model specification and confounding control: Proper model specification identifies relevant covariates and functional forms to isolate the causal effect of interest while controlling for confounding variables that influence both treatment and outcome. Omitted variable bias, measurement error, and incorrect functional forms threaten causal identification, necessitating theory-driven variable selection and specification testing.
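As a minimal sketch of confounding control, the code below re-uses the hypothetical developed indicator constructed earlier (df_confound) and compares the year coefficient with and without that control; this illustrates the mechanics of adding a covariate, not a credible identification strategy.

# Loading necessary library
library(dplyr)

# Model without the (hypothetical) development control
model_unadjusted <- lm(death_rate ~ year, data = df_confound)

# Model adjusting for the development indicator constructed above
model_adjusted <- lm(death_rate ~ year + developed, data = df_confound)

# Comparing the year coefficient before and after adjustment
cat("Year coefficient, unadjusted:", round(coef(model_unadjusted)["year"], 4), "\n")
cat("Year coefficient, adjusted:  ", round(coef(model_adjusted)["year"], 4), "\n")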

  • Assumptions: linearity, independence, homoskedasticity, exogeneity: Valid causal inference from regression requires that relationships are linear in parameters, observations are independent, error variance is constant across predictor levels, and explanatory variables are uncorrelated with the error term. Violations of these assumptions bias coefficient estimates, invalidate standard errors, and compromise hypothesis tests, requiring diagnostic assessment and remedial measures.

# Loading necessary libraries
library(ggplot2)
library(dplyr)
library(lmtest)

# Fitting a regression model
model <- lm(death_rate ~ year, data = df_clean)

# Checking assumptions through diagnostic plots
par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

# Testing for homoskedasticity (Breusch-Pagan test)
bp_test <- bptest(model)
cat("\nBreusch-Pagan test for homoskedasticity:\n")

Breusch-Pagan test for homoskedasticity:
cat("p-value:", bp_test$p.value, "\n")
p-value: 0.2263175 
if(bp_test$p.value < 0.05) {
  cat("Evidence of heteroskedasticity (non-constant variance)\n")
} else {
  cat("No strong evidence against homoskedasticity\n")
}
No strong evidence against homoskedasticity
# Checking for normality of residuals
shapiro_test <- shapiro.test(head(residuals(model), 5000))  # Shapiro-Wilk test accepts at most 5000 observations
cat("\nShapiro-Wilk test for normality of residuals:\n")

Shapiro-Wilk test for normality of residuals:
cat("p-value:", shapiro_test$p.value, "\n")
p-value: 9.063082e-29 
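The independence assumption can be probed with an autocorrelation test. The sketch below applies the Durbin-Watson test from lmtest to the same model; with panel data ordered by country and year this is only a rough check, not a formal panel diagnostic.

# Testing for autocorrelation in the residuals (Durbin-Watson test)
dw_test <- dwtest(model)
print(dw_test)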
  • Interpretation of coefficients and marginal effects: Regression coefficients represent the expected change in the dependent variable associated with a one-unit change in the independent variable, holding other factors constant. Marginal effects extend this interpretation to non-linear models and interaction terms, quantifying how the impact of one variable varies across levels of another variable.
# Loading necessary library
library(dplyr)

# Creating interaction term
df_interaction <- df_clean %>%
  mutate(
    recent_period = ifelse(year >= 2020, 1, 0),
    year_centered = year - mean(year)
  )

# Model with interaction
model_interaction <- lm(death_rate ~ year_centered * recent_period, 
                        data = df_interaction)
summary(model_interaction)

Call:
lm(formula = death_rate ~ year_centered * recent_period, data = df_interaction)

Residuals:
    Min      1Q  Median      3Q     Max 
-104.90  -54.62  -13.11   28.45  318.36 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  123.948      2.607  47.541  < 2e-16 ***
year_centered                 -2.313      0.807  -2.866  0.00423 ** 
recent_period                 56.979     22.746   2.505  0.01238 *  
year_centered:recent_period   -6.947      4.376  -1.588  0.11267    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 69.25 on 1158 degrees of freedom
Multiple R-squared:  0.01215,   Adjusted R-squared:  0.00959 
F-statistic: 4.747 on 3 and 1158 DF,  p-value: 0.002694
# Interpreting coefficients
coefs <- coef(model_interaction)
cat("\nCoefficient Interpretation:\n")

Coefficient Interpretation:
cat("--------------------------------------\n")
--------------------------------------
cat("Main effect of year:", round(coefs["year_centered"], 3), "\n")
Main effect of year: -2.313 
cat("  -> In pre-2020 period, each year is associated with a", 
    round(coefs["year_centered"], 3), "change in death rate\n\n")
  -> In pre-2020 period, each year is associated with a -2.313 change in death rate
cat("Interaction effect:", round(coefs["year_centered:recent_period"], 3), "\n")
Interaction effect: -6.947 
cat("  -> In 2020+, the year effect changes by an additional", 
    round(coefs["year_centered:recent_period"], 3), "\n")
  -> In 2020+, the year effect changes by an additional -6.947 
cat("  -> Total effect in 2020+:", 
    round(coefs["year_centered"] + coefs["year_centered:recent_period"], 3), "\n")
  -> Total effect in 2020+: -9.26 
  • Sensitivity and robustness checks: Assessing the stability of causal conclusions across alternative model specifications, sample restrictions, and analytical choices strengthens inference and identifies fragile results dependent on specific assumptions. Techniques include varying control variables, testing different functional forms, examining subgroup effects, and conducting placebo tests to validate identification strategies.
# Loading necessary library
library(dplyr)

# Base model
model1 <- lm(death_rate ~ year, data = df_clean)

# Alternative specification 1: Adding squared term
df_robust <- df_clean %>%
  mutate(year_sq = year^2)
model2 <- lm(death_rate ~ year + year_sq, data = df_robust)

# Alternative specification 2: Different time periods
df_subset1 <- df_clean %>% filter(year >= 2015)
model3 <- lm(death_rate ~ year, data = df_subset1)

df_subset2 <- df_clean %>% filter(year < 2020)
model4 <- lm(death_rate ~ year, data = df_subset2)

# Comparing coefficients across models
cat("Robustness Check: Year Coefficient Across Specifications\n")
Robustness Check: Year Coefficient Across Specifications
cat("-------------------------------------------------------\n")
-------------------------------------------------------
cat("Model 1 (Base):", round(coef(model1)["year"], 4), "\n")
Model 1 (Base): -0.7023 
cat("Model 2 (Quadratic):", round(coef(model2)["year"], 4), "\n")
Model 2 (Quadratic): -1105.846 
cat("Model 3 (2015+):", round(coef(model3)["year"], 4), "\n")
Model 3 (2015+): 0.9826 
cat("Model 4 (Pre-2020):", round(coef(model4)["year"], 4), "\n")
Model 4 (Pre-2020): -2.3126 
cat("\nInterpretation: If coefficients are similar across specifications,\n")

Interpretation: If coefficients are similar across specifications,
cat("this suggests the relationship is robust to model choices.\n")
this suggests the relationship is robust to model choices.
# Comparing R-squared values
cat("\nModel Fit Comparison:\n")

Model Fit Comparison:
cat("Model 1 R-squared:", round(summary(model1)$r.squared, 4), "\n")
Model 1 R-squared: 0.0015 
cat("Model 2 R-squared:", round(summary(model2)$r.squared, 4), "\n")
Model 2 R-squared: 0.0042 
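Note that the large year coefficient in Model 2 reflects the strong collinearity between year and its square: in a quadratic specification the linear term is the slope at year = 0, far outside the observed range. One way to keep it comparable to the base model is to center year before squaring, as sketched below.

# Loading necessary library
library(dplyr)

# Centering year before adding the quadratic term
df_centered <- df_clean %>%
  mutate(
    year_c = year - mean(year),
    year_c_sq = year_c^2
  )

model2_centered <- lm(death_rate ~ year_c + year_c_sq, data = df_centered)
cat("Model 2 (centered quadratic), linear term:", round(coef(model2_centered)["year_c"], 4), "\n")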

Project Report and Presentation

You will find all formal instructions for the project report and presentation in the guidelines.

Presentation timetable:

Assessment criteria for the report:

  • Clarity of Research Question (and Objectives)
  • Data Understanding and Preparation
  • Appropriateness of Methods and Analysis
  • Implementation and Code Quality (in R)
  • Interpretation of Results
  • Visualization and Communication
  • Structure and Presentation of the Report
  • Critical Reflection and Limitations

Example linear regression analysis and storytelling:

# Loading necessary libraries
library(ourdata)
library(dplyr)

# Creating data frames from the ourdata package datasets
hdi_df <- hdi
imr_df <- imr

# Combining two data frames
combined_df <- ourdata::combine(imr_df$name, hdi_df$country, imr_df$`deaths/1`, hdi_df$HumanDevelopmentIndex_2024, 
                                col1 = "Country", col2 = "IMR", col3 = "HDI")
'data.frame':   181 obs. of  3 variables:
 $ Country: chr  "Afghanistan" "Somalia" "Central African Republic" "Equatorial Guinea" ...
 $ IMR    : num  101.3 83.6 80.5 77.4 71.2 ...
 $ HDI    : num  0.496 0.404 0.414 0.674 0.467 0.419 0.416 0.388 0.493 0.419 ...
# or combining with dplyr
combined_df <- inner_join(imr_df, hdi_df, by = c("name" = "country")) %>%
  select(Country = name, IMR = `deaths/1`, HDI = HumanDevelopmentIndex_2024)

# Creating scatter plot
plot(combined_df$HDI, combined_df$IMR, main = "Influence of HDI (Human Development Index) on IMR (Infant Mortality Rate)",
     sub = "HDI = Independent Variable / IMR = Dependent Variable", ylab = "IMR", xlab = "HDI")
     
# Adding regression line
model <- lm(combined_df$IMR ~ combined_df$HDI)
abline(model, col = "red")

# Showing summary of the model
summary(model)

Call:
lm(formula = combined_df$IMR ~ combined_df$HDI)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.387  -5.474  -0.017   4.470  54.530 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       99.893      3.623   27.57   <2e-16 ***
combined_df$HDI -107.103      4.775  -22.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.723 on 179 degrees of freedom
Multiple R-squared:  0.7376,    Adjusted R-squared:  0.7361 
F-statistic: 503.2 on 1 and 179 DF,  p-value: < 2.2e-16

Conclusion and interpretation of the results:

The fitted linear model (IMR ~ HDI) shows a very strong, negative association: the HDI coefficient is −107.10 (SE = 4.78, t = −22.43, p < 2e-16).

This means that a one-unit increase in HDI is associated with a predicted decrease in infant mortality of about 107 deaths per 1,000 live births. Since HDI ranges roughly from 0 to 1, a more realistic 0.1-point increase in HDI corresponds to roughly 10.7 fewer deaths per 1,000 live births.

Practical meaning: Higher human development (better education, income and health components of HDI) is strongly associated with lower infant mortality.

Caveats and diagnostics to check: Correlation ≠ causation; this is observational data. Confounding (e.g., health spending, access to care, urbanization) could drive both HDI and IMR. Avoid causal claims without further analysis.

Summary: HDI and IMR are strongly negatively associated; higher HDI is linked to substantially lower infant mortality (HDI explains about 74% of the variation in IMR here), but confirmatory causal analysis and diagnostic checks are needed before policy attribution.
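To support the storytelling, the fitted coefficients can be turned into concrete predictions. The sketch below is simple arithmetic on the estimated intercept and slope from the summary above; the results are model predictions, not observed outcomes.

# Translating the fitted coefficients into predicted IMR values
b <- coef(model)
predict_imr <- function(hdi) unname(b[1] + b[2] * hdi)

cat("Predicted IMR at HDI = 0.5:", round(predict_imr(0.5), 1), "deaths per 1,000\n")
cat("Predicted IMR at HDI = 0.9:", round(predict_imr(0.9), 1), "deaths per 1,000\n")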

Example of a heatmap visualization:

# Loading necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)

# Preparing data for heatmap
heatmap_data <- df_clean %>%
  filter(year >= 2010 & year <= 2023) %>%
  group_by(country, year) %>%
  summarise(mean_death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop")

# Creating heatmap
ggplot(heatmap_data, aes(x = year, y = country, fill = mean_death_rate)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma", na.value = "grey50") +
  labs(
    title = "Heatmap of Mean Preventable Death Rates by Country (2010-2023)",
    x = "Year",
    y = "Country",
    fill = "Mean Death Rate"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 6),
    plot.title = element_text(face = "bold")
  )

Literature

All references for this course.

Essential Readings

Further Readings