Data Science and Data Analytics (WS 2025/26)

International Business Management (B. A.)

Author
Affiliations

© Benjamin Gross

Hochschule Fresenius - University of Applied Sciences

Email: benjamin.gross@ext.hs-fresenius.de

Website: https://drbenjamin.github.io

Published

05.11.2025 12:39

Abstract

This document provides the course material for Data Science and Data Analytics (B. A. – International Business Management). Upon successful completion of the course, students will be able to: recognize important technological and methodological advancements in data science and distinguish between descriptive, predictive, and prescriptive analytics; demonstrate proficiency in classifying data and variables, collecting and managing data, and conducting comprehensive data evaluations; utilize R for effective data manipulation, cleaning, visualization, outlier detection, and dimensionality reduction; apply sophisticated data exploration and mining techniques (including PCA, Factor Analysis, and Regression Analysis) to discover underlying patterns and inform decision-making; analyze and interpret causal relationships in data using regression analysis; evaluate and organize the implementation of a data analysis project in a business environment; and communicate the results and effects of a data analysis project in a structured way.

Scope and Nature of Data Science

Let’s start this course with some definitions and context.

Definition of Data Science:

The field of Data Science concerns techniques for extracting knowledge from diverse data, with a particular focus on ‘big’ data exhibiting ‘V’ attributes such as volume, velocity, variety, value and veracity.

Maneth & Poulovassilis (2016)

Definition of Data Analytics:

Data analytics is the systematic process of examining data using statistical, computational, and domain-specific methods to extract insights, identify patterns, and support decision-making. It combines competencies in data handling, analysis techniques, and domain knowledge to generate actionable outcomes in organizational contexts (Cuadrado-Gallego et al., 2023).

Definition of Business Analytics:

Business analytics is the science of posing and answering data questions related to business. Business analytics has rapidly expanded in the last few years to include tools drawn from statistics, data management, data visualization, and machine learning. There is increasing emphasis on big data handling to assimilate the advances made in data sciences. As is often the case with applied methodologies, business analytics has to be soundly grounded in applications in various disciplines and business verticals to be valuable. The bridge between the tools and the applications are the modeling methods used by managers and researchers in disciplines such as finance, marketing, and operations.

Pochiraju & Seshadri (2019)

There are many roles in the data science field, including (but not limited to):

Source: LinkedIn

For skills and competencies required for data science activities, see Skills Landscape.

Defining Data Science as an Academic Discipline

Data science emerges as an interdisciplinary field that synthesizes methodologies and insights from multiple academic domains to extract knowledge and actionable insights from data. As an academic discipline, data science represents a convergence of computational, statistical, and domain-specific expertise that addresses the growing need for data-driven decision-making in various sectors.

Data science draws from and interacts with multiple foundational disciplines:

  • Informatics / Information Systems:

    Informatics provides the foundational understanding of information processing, storage, and retrieval systems that underpin data science infrastructure. It encompasses database design, data modeling, information architecture, and system integration principles essential for managing large-scale data ecosystems. Information systems contribute knowledge about organizational data flows, enterprise architectures, and the sociotechnical aspects of data utilization in business contexts.

    See the Technical Applications & Data Analytics coursebook by Gross (2021) for further reading on foundations in informatics.

  • Computer Science (algorithms, data structures, systems design):

    Computer science provides the computational foundation for data science through algorithm design, complexity analysis, and efficient data structures. Core contributions include machine learning algorithms, distributed computing paradigms, database systems, and software engineering practices. System design principles enable scalable data processing architectures, while computational thinking frameworks guide algorithmic problem-solving approaches essential for data-driven solutions.

    See also: Analytical Skills for Business - 1 Introduction and take a look at the AI Universe overview graphic:

    Source: LinkedIn

    See the Overview on no-code and low-code tools for data analytics for a survey of such tools and related AI tooling.

  • Mathematics (linear algebra, calculus, optimization):

    Mathematics provides the theoretical backbone for data science through linear algebra (matrix operations, eigenvalues, vector spaces), calculus (derivatives, gradients, optimization), and discrete mathematics (graph theory, combinatorics). These mathematical foundations enable dimensionality reduction techniques, gradient-based optimization algorithms, statistical modeling, and the rigorous formulation of machine learning problems (see the figure below). Mathematical rigor ensures the validity and interpretability of analytical results.

    Source: LinkedIn
  • Statistics & Econometrics (inference, modeling, causal analysis):

    Statistics provides the methodological framework for data analysis through hypothesis testing, confidence intervals, regression analysis, and experimental design. Econometrics contributes advanced techniques for causal inference, time series analysis, and handling observational data challenges such as endogeneity and selection bias. These disciplines ensure rigorous uncertainty quantification, model validation, and the ability to draw reliable conclusions from data while understanding limitations and assumptions.

  • Social Science & Behavioral Sciences (contextual interpretation, experimental design):

    Social and behavioral sciences contribute essential understanding of human behavior, organizational dynamics, and contextual factors that influence data generation and interpretation. These disciplines provide expertise in experimental design, survey methodology, ethical considerations, and the social implications of data-driven decisions. They ensure that data science applications consider human factors, cultural context, and societal impact while maintaining ethical standards in data collection and analysis.

    Source: LinkedIn

The interdisciplinary nature of data science requires practitioners to develop competencies across these domains while maintaining awareness of how different methodological traditions complement and inform each other. This multidisciplinary foundation enables data scientists to approach complex problems with both technical rigor and contextual understanding, ensuring that analytical solutions are both technically sound and practically relevant.

For further reading on the academic foundations of data science, see the comprehensive analysis in Defining Data Science as an Academic Discipline.

Significance of Business Data Analysis for Decision-Making

Business data analysis has evolved from a supporting function to a critical strategic capability that fundamentally transforms how organizations make decisions, allocate resources, and compete in modern markets. The systematic application of analytical methods to business data enables evidence-based decision-making that reduces uncertainty, improves operational efficiency, and creates sustainable competitive advantages.

Strategic Decision-Making Framework

Business data analysis provides a structured approach to strategic decision-making through multiple analytical dimensions:

  • Evidence-Based Strategic Planning: Data analysis supports long-term strategic decisions by providing empirical evidence about market trends, competitive positioning, and organizational capabilities. Statistical analysis of historical performance data, market research, and competitive intelligence enables organizations to formulate strategies grounded in quantifiable evidence rather than intuition alone.

  • Risk Assessment and Mitigation: Advanced analytical techniques enable comprehensive risk evaluation across operational, financial, and strategic dimensions. Monte Carlo simulations, scenario analysis, and predictive modeling help organizations quantify potential risks and develop contingency plans based on probabilistic assessments of future outcomes.

  • Resource Allocation Optimization: Data-driven resource allocation models leverage optimization algorithms and statistical analysis to maximize return on investment across different business units, projects, and initiatives. Linear programming, integer optimization, and multi-criteria decision analysis provide frameworks for allocating limited resources to achieve optimal organizational outcomes.

Operational Decision Support

At the operational level, business data analysis transforms day-to-day decision-making through real-time insights and systematic performance measurement:

  • Performance Measurement and Continuous Improvement: Key Performance Indicators (KPIs) and statistical process control methods enable organizations to monitor operational efficiency, quality metrics, and customer satisfaction in real-time. Time series analysis, control charts, and regression analysis identify trends, anomalies, and improvement opportunities that drive continuous operational enhancement.

  • Forecasting and Demand Planning: Statistical forecasting models using techniques such as ARIMA, exponential smoothing, and machine learning algorithms enable accurate demand prediction for inventory management, capacity planning, and supply chain optimization. These analytical approaches reduce uncertainty in operational planning while minimizing costs associated with overstock or stockouts. A minimal forecasting sketch in R follows this list.

  • Customer Analytics and Personalization: Advanced customer analytics leverage segmentation analysis, predictive modeling, and behavioral analytics to understand customer preferences, predict churn, and optimize retention strategies. Clustering algorithms, logistic regression, and recommendation systems enable personalized customer experiences that increase satisfaction and loyalty.
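As referenced in the forecasting bullet above, here is a minimal sketch using base R's HoltWinters() exponential smoothing on a simulated monthly demand series; the data and the model choice are illustrative assumptions only, not a recommended production setup:

# Simulating 36 months of demand with trend and seasonality (invented data for illustration)
set.seed(42)
monthly_demand <- ts(
  100 + 2 * (1:36) + 15 * sin(2 * pi * (1:36) / 12) + rnorm(36, sd = 5),
  frequency = 12,
  start = c(2022, 1)
)

# Fitting a Holt-Winters exponential smoothing model (base R, additive seasonality by default)
fit <- HoltWinters(monthly_demand)

# Forecasting demand for the next 6 months
predict(fit, n.ahead = 6)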

Tactical Decision Integration

Business data analysis bridges strategic planning and operational execution through tactical decision support:

  • Pricing Strategy Optimization: Price elasticity analysis, competitive pricing models, and revenue optimization techniques enable dynamic pricing strategies that maximize profitability while maintaining market competitiveness. Regression analysis, A/B testing, and econometric modeling provide empirical foundations for pricing decisions.

  • Market Intelligence and Competitive Analysis: Data analysis transforms market research and competitive intelligence into actionable insights through statistical analysis of market trends, customer behavior, and competitive positioning. Multivariate analysis, factor analysis, and time series forecasting identify market opportunities and competitive threats.

  • Financial Performance Analysis: Financial analytics encompassing ratio analysis, variance analysis, and predictive financial modeling enable organizations to assess financial health, identify cost reduction opportunities, and optimize capital structure decisions. Statistical analysis of financial data supports both internal performance evaluation and external stakeholder communication.

Contemporary Analytical Capabilities

Modern business data analysis capabilities extend traditional analytical methods through integration of advanced technologies and methodologies:

  • Real-Time Analytics and Decision Support: Stream processing, event-driven analytics, and real-time dashboards enable immediate response to changing business conditions. Complex event processing and real-time statistical monitoring support dynamic decision-making in fast-paced business environments.

  • Predictive and Prescriptive Analytics: Machine learning algorithms, neural networks, and optimization models enable organizations to not only predict future outcomes but also recommend optimal actions. These advanced analytical capabilities support automated decision-making and strategic scenario planning.

  • Data-Driven Innovation: Analytics-driven innovation leverages data science techniques to identify new business opportunities, develop innovative products and services, and create novel revenue streams. Advanced analytics enable organizations to discover hidden patterns, correlations, and insights that drive innovation and competitive differentiation.

The significance of business data analysis for decision-making extends beyond technical capabilities to encompass organizational transformation, cultural change, and strategic competitive positioning. Organizations that successfully integrate analytical capabilities into their decision-making processes achieve superior performance outcomes, enhanced agility, and sustainable competitive advantages in increasingly data-driven markets.

For comprehensive coverage of business data analysis methodologies and applications, see Advanced Business Analytics and the analytical foundations outlined in Evans (2020).

For open access resources, visit Kaggle, a platform for data science competitions and datasets.

Types of Analytics

  • Descriptive Analytics: What happened?
  • Predictive Analytics: What is likely to happen?
  • Prescriptive Analytics: What should we do?

Source: https://datamites.com/blog/descriptive-vs-predictive-vs-prescriptive-analytics/
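To make the distinction concrete, here is a minimal R sketch on a small invented sales data frame; all numbers are made up for demonstration, and the simple linear trend and decision rule are illustrative assumptions rather than recommended models:

# Creating a small illustrative sales dataset (invented numbers)
sales <- data.frame(
  year = 2018:2024,
  revenue = c(1.9, 2.1, 1.7, 2.0, 2.3, 2.4, 2.6)  # in million EUR
)

# Descriptive analytics: what happened?
summary(sales$revenue)

# Predictive analytics: what is likely to happen? (simple linear trend)
trend_model <- lm(revenue ~ year, data = sales)
forecast_2025 <- predict(trend_model, newdata = data.frame(year = 2025))
forecast_2025

# Prescriptive analytics: what should we do? (simple illustrative decision rule)
if (forecast_2025 > mean(sales$revenue)) {
  message("Forecast above historical average: consider expanding capacity.")
} else {
  message("Forecast below historical average: consider cost control measures.")
}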

Data Analytic Competencies

Data analytic competencies encompass the ability to apply machine learning, data mining, statistical methods, and algorithmic approaches to extract meaningful patterns, insights, and predictions from complex datasets. They include proficiency in exploratory data analysis, feature engineering, model selection, evaluation, and validation. These skills ensure rigorous interpretation of data, support evidence-based decision-making, and enable the development of robust analytical solutions adaptable to diverse health, social, and technological contexts.

Types of Data

The structure and temporal dimension of data fundamentally influence analytical approaches and statistical methods. Understanding data types enables researchers to select appropriate modeling techniques and interpret results within proper contextual boundaries.

  • Cross-sectional data captures observations of multiple entities (individuals, firms, countries) at a single point in time. This structure facilitates comparative analysis across units but does not track changes over time. Cross-sectional studies are particularly valuable for examining relationships between variables at a specific moment and testing hypotheses about population characteristics.

  • Time-series data records observations of a single entity across multiple time points, enabling the analysis of temporal patterns, trends, seasonality, and cyclical behaviors. Time-series methods account for autocorrelation and temporal dependencies, supporting forecasting and dynamic modeling. This data structure is essential for economic indicators, financial markets, and environmental monitoring.

  • Panel (longitudinal) data combines both dimensions, tracking multiple entities over time. This structure offers substantial analytical advantages by controlling for unobserved heterogeneity across entities and modeling both within-entity and between-entity variation. Panel data methods support causal inference through fixed-effects and random-effects models, difference-in-differences estimation, and dynamic panel specifications.

Source: https://static.vecteezy.com
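The three structures described above can be sketched in R as small data frames (all values are invented for illustration):

# Cross-sectional data: several countries observed in a single year (invented values)
cross_sectional <- data.frame(
  country = c("Germany", "France", "Italy"),
  year = 2023,
  gdp_growth = c(0.3, 0.9, 0.7)
)

# Time-series data: one country observed over several years (invented values)
time_series <- data.frame(
  country = "Germany",
  year = 2019:2023,
  gdp_growth = c(1.1, -3.8, 3.2, 1.8, 0.3)
)

# Panel data: several countries observed over several years (invented values)
panel <- expand.grid(country = c("Germany", "France"), year = 2021:2023)
panel$gdp_growth <- c(3.2, 6.4, 1.8, 2.6, 0.3, 0.9)

# Displaying the three structures
print(cross_sectional)
print(time_series)
print(panel)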

Additional data structures:

  • Geo-referenced / spatial data is data associated with specific geographic locations, enabling spatial analysis and visualization. Techniques such as Geographic Information Systems (GIS), spatial autocorrelation, and spatial regression models are employed to analyze patterns and relationships in spatially distributed data.

Source: https://www.slingshotsimulations.com/
  • Streaming / real-time data is continuously generated data that is processed and analyzed in real-time. This data structure is crucial for applications requiring immediate insights, such as fraud detection, network monitoring, and real-time recommendation systems.

Types of Variables

  • Continuous (interval/ratio) data is measured on a scale with meaningful intervals and a true zero point (ratio) or arbitrary zero point (interval). Examples include height, weight, temperature, and income. Continuous variables support a wide range of statistical analyses, including regression and correlation.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning continuous data to a data frame and displaying it as a table
continuous_data <- data.frame(
  Height_cm = c(170, 165, 180, 175, 160, 185, 172, 168, 178, 182),
  Weight_kg = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original table
print(continuous_data)
   Height_cm Weight_kg
1        170        70
2        165        60
3        180        80
4        175        75
5        160        55
6        185        90
7        172        68
8        168        62
9        178        78
10       182        85
# Ordering the data frame by Height_cm
ordered_data <- continuous_data %>% 
  arrange(Height_cm)

# Displaying the ordered data frame
print(ordered_data)
   Height_cm Weight_kg
1        160        55
2        165        60
3        168        62
4        170        70
5        172        68
6        175        75
7        178        78
8        180        80
9        182        85
10       185        90
# Practicing:
# 1. Assign continuous data to a data frame and display it as a table
# 2. Order the data frame by a specific column
  • Count data represents the number of occurrences of an event or the frequency of a particular characteristic. Count variables are typically non-negative integers and can be analyzed using Poisson regression or negative binomial regression.
# Assigning count data to a data frame and displaying it as a table
count_data <- data.frame(
  Height = c(170, 165, 182, 175, 165, 175, 175, 168, 175, 182),
  Weight = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original data as a table
print(count_data)
   Height Weight
1     170     70
2     165     60
3     182     80
4     175     75
5     165     55
6     175     90
7     175     68
8     168     62
9     175     78
10    182     85
# Counting the number of occurrences of each Height value
ordered_count_data <- count_data %>%
  arrange(desc(Height), Weight) %>%
  count(Height)

# Displaying the ordered count data
print(ordered_count_data)
  Height n
1    165 2
2    168 1
3    170 1
4    175 4
5    182 2
# Practicing:
# 1. Assign count data to a data frame and display it as a table
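Since count outcomes are typically modeled with Poisson regression, here is a minimal, hedged sketch using base R's glm() on simulated counts; the data-generating process is invented purely for illustration:

# Simulating count data: number of daily orders depending on advertising spend (invented)
set.seed(123)
ad_spend <- runif(100, min = 0, max = 10)
orders <- rpois(100, lambda = exp(0.5 + 0.2 * ad_spend))

# Fitting a Poisson regression model
poisson_model <- glm(orders ~ ad_spend, family = poisson(link = "log"))

# Displaying the model summary (coefficients are on the log scale)
summary(poisson_model)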
  • Ordinal data represents categories with a meaningful order or ranking but no consistent interval between categories. Examples include survey responses (e.g., Likert scales) and socioeconomic status (e.g., low, medium, high). Ordinal variables can be analyzed using non-parametric tests or ordinal regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning ordinal data to a data frame and displaying it as a table
ordinal_data <- data.frame(
  Response = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
  Value = c(5, 4, 3, 2, 1)
)

# Displaying the original table
print(ordinal_data)
           Response Value
1 Strongly Disagree     5
2          Disagree     4
3           Neutral     3
4             Agree     2
5    Strongly Agree     1
# Ordering the data frame by Value
ordinal_data <- ordinal_data %>% 
  arrange(Value)

# Displaying the ordered data frame
print(ordinal_data)
           Response Value
1    Strongly Agree     1
2             Agree     2
3           Neutral     3
4          Disagree     4
5 Strongly Disagree     5
# Practicing:
# 1. Assign ordinal data to a data frame and display it as a table
# 2. Order the data frame by the ordinal value
  • Categorical (nominal / binary) data represents distinct categories without any inherent order. Nominal variables have two or more categories (e.g., gender, race, or marital status), while binary variables have only two categories (e.g., yes/no, success/failure). Categorical variables can be analyzed using chi-square tests or logistic regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning categorical data to a data frame and displaying it as a table
categorical_data <- data.frame(
  Event = c("A", "C", "B", "D", "E"),
  Category = c("Type1", "Type2", "Type1", "Type3", "Type2")
)

# Displaying the original table
print(categorical_data)
  Event Category
1     A    Type1
2     C    Type2
3     B    Type1
4     D    Type3
5     E    Type2
# Ordering the data frame by Event
categorical_data <- categorical_data %>% 
  arrange(Event)

# Displaying the table
print(categorical_data)
  Event Category
1     A    Type1
2     B    Type1
3     C    Type2
4     D    Type3
5     E    Type2
# Practicing:
# 1. Assign categorical data to a data frame and display it as a table
# 2. Order the data frame by a specific column
  • Compositional or hierarchical structures represent data with a part-to-whole relationship or nested categories. Examples include demographic data (e.g., age groups within gender) and geographical data (e.g., countries within continents). Compositional data can be analyzed using techniques such as hierarchical clustering or multilevel modeling.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning hierarchical data to a data frame and displaying it as a table
hierarchical_data <- data.frame(
  Country = c("USA", "Canada" , "USA", "Canada", "Mexico"),
  State_Province = c("California", "Ontario", "Texas", "Quebec", "Jalisco"),
  Population_Millions = c(39.5, 14.5, 29.0, 8.5, 8.3)
)

# Displaying the table
print(hierarchical_data)
  Country State_Province Population_Millions
1     USA     California                39.5
2  Canada        Ontario                14.5
3     USA          Texas                29.0
4  Canada         Quebec                 8.5
5  Mexico        Jalisco                 8.3
# Ordering the data frame by Country and then State_Province
hierarchical_data <- hierarchical_data %>% 
  arrange(Country, State_Province)

# Displaying the ordered data frame
print(hierarchical_data)
  Country State_Province Population_Millions
1  Canada        Ontario                14.5
2  Canada         Quebec                 8.5
3  Mexico        Jalisco                 8.3
4     USA     California                39.5
5     USA          Texas                29.0
# Grouping the data by Country and summarizing total population
hierarchical_data_grouped <- hierarchical_data %>% 
  group_by(Country) %>%
  summarise(Total_Population = sum(Population_Millions))

# Displaying the grouped data frame
print(hierarchical_data_grouped)
# A tibble: 3 × 2
  Country Total_Population
  <chr>              <dbl>
1 Canada              23  
2 Mexico               8.3
3 USA                 68.5
# Practicing:
# 1. Assign hierarchical data to a data frame and display it as a table
# 2. Order the data frame by multiple columns

Source: https://www.collegedisha.com/

Some small datasets to start with

A custom R package ourdata has been created to provide some small datasets (and also some helper R functions) for practice. You can install it from GitHub using the following commands:

# Installing the `ourdata` R package from GitHub
devtools::install_github("DrBenjamin/ourdata")
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/Rtmp7gKUFu/remotes3b727f7dfc9e/DrBenjamin-ourdata-3403a65/DESCRIPTION’ ... OK
* preparing ‘ourdata’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘ourdata_0.5.0.tar.gz’

Using the package and exploring its documentation:

# Loading package
library(ourdata)

# Searching the help pages of the package
??ourdata

# Showing welcome message
ourdata()
This is the R package `ourdata` used for Data Science courses at the Fresenius University of Applied Sciences.
Type `help(ourdata)` to display the content.
Type `ourdata_website()` to open the package website.
Have fun in the course!
Benjamin Gross
# Printing some datasets
print(koelsch)
  Jahr   Koelsch
1 2017 186784000
2 2018 191308000
3 2019 179322000
4 2020 169182000
print(kirche)
  Jahr Austritte
1 2017    364711
2 2018    437416
3 2019    539509
4 2020    441390
# Opening the help page for the `combine` function of the `ourdata` R package
help(combine)
Help on topic 'combine' was found in the following packages:

  Package               Library
  dplyr                 /home/runner/R-library
  ourdata               /home/runner/R-library


Using the first match ...
# Using the `combine` function from the `ourdata` R package to combine several vectors into a data frame
ourdata::combine(kirche$Jahr, koelsch$Jahr, kirche$Austritte, koelsch$Koelsch)
'data.frame':   4 obs. of  3 variables:
 $ C1: chr  "2017" "2018" "2019" "2020"
 $ C2: num  364711 437416 539509 441390
 $ C3: num  1.87e+08 1.91e+08 1.79e+08 1.69e+08
    C1     C2        C3
1 2017 364711 186784000
2 2018 437416 191308000
3 2019 539509 179322000
4 2020 441390 169182000

A messy dataset example

We collect data from the OECD (Organisation for Economic Co-operation and Development), an international organization that works to build better policies for better lives. The dataset contains many columns (variables) with non-informative data and needs to be cleaned (wrangled) before analysis. First, load the data into R from the CSV file:

# Reading the dataset from a CSV file
preventable_deaths <- read.csv(
  "./topics/data/OECD_Preventable_Deaths.csv",
  stringsAsFactors = FALSE
)

or use the dataset directly from the ourdata R package:

# Reading the dataset from `ourdata` R package
library(ourdata)
preventable_deaths <- oecd_preventable

First we explore the data:

# Viewing structure of the dataset
str(preventable_deaths)

# Viewing first few rows
head(preventable_deaths)

# Checking dimensions
dim(preventable_deaths)

# Viewing column names
colnames(preventable_deaths)

# Summary statistics
summary(preventable_deaths)

# Checking for missing values
colSums(is.na(preventable_deaths))

Now we can start cleaning the data by removing non-informative columns and rows with missing values:

# Loading necessary library
library(dplyr)

# Selecting relevant columns for analysis
df_clean <- preventable_deaths %>%
  select(
    REF_AREA,
    Reference.area,
    TIME_PERIOD,
    OBS_VALUE
  ) %>%
  rename(
    country_code = REF_AREA,
    country = Reference.area,
    year = TIME_PERIOD,
    death_rate = OBS_VALUE
  )

# Converting year to numeric
df_clean$year <- as.numeric(df_clean$year)

# Converting death_rate to numeric (handling empty strings)
df_clean$death_rate <- as.numeric(df_clean$death_rate)

# Removing rows with missing death rates
df_clean <- df_clean %>%
  filter(!is.na(death_rate))

# Viewing cleaned data structure
str(df_clean)
'data.frame':   1162 obs. of  4 variables:
 $ country_code: chr  "AUS" "AUS" "AUS" "AUS" ...
 $ country     : chr  "Australia" "Australia" "Australia" "Australia" ...
 $ year        : num  2010 2011 2012 2013 2014 ...
 $ death_rate  : num  110 109 105 105 107 108 103 103 101 104 ...
# Summary of cleaned data
summary(df_clean)
 country_code         country               year        death_rate   
 Length:1162        Length:1162        Min.   :2010   Min.   : 34.0  
 Class :character   Class :character   1st Qu.:2013   1st Qu.: 75.0  
 Mode  :character   Mode  :character   Median :2016   Median :116.0  
                                       Mean   :2016   Mean   :128.7  
                                       3rd Qu.:2019   3rd Qu.:156.0  
                                       Max.   :2023   Max.   :453.0  

Showing some basic statistics of the cleaned data:

# Loading necessary library
library(dplyr)

# Printing overall statistics
cat("Mean death rate:", mean(df_clean$death_rate, na.rm = TRUE), "\n")
Mean death rate: 128.7151 
cat("Median death rate:", median(df_clean$death_rate, na.rm = TRUE), "\n")
Median death rate: 116 
cat("Standard deviation:", sd(df_clean$death_rate, na.rm = TRUE), "\n")
Standard deviation: 69.58822 
cat("Min death rate:", min(df_clean$death_rate, na.rm = TRUE), "\n")
Min death rate: 34 
cat("Max death rate:", max(df_clean$death_rate, na.rm = TRUE), "\n")
Max death rate: 453 
# Printing statistics by country
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    median_rate = median(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE),
    min_rate = min(death_rate, na.rm = TRUE),
    max_rate = max(death_rate, na.rm = TRUE),
    n_observations = n()
  ) %>%
  arrange(desc(mean_rate))
print(country_stats)
# A tibble: 46 × 7
   country        mean_rate median_rate sd_rate min_rate max_rate n_observations
   <chr>              <dbl>       <dbl>   <dbl>    <dbl>    <dbl>          <int>
 1 South Africa        330.        338     53.1      241      438             22
 2 Latvia              230.        226.    65.8      151      364             28
 3 Lithuania           225.        204.    67.9      134      340             28
 4 Romania             223.        220.    46.5      172      303             24
 5 Hungary             218.        208     72.4      141      375             28
 6 Mexico              216.        217     75.5      155      453             26
 7 Bulgaria            193.        186     45.5      156      378             26
 8 Brazil              185.        172.    51.6      133      356             24
 9 Slovak Republ…      178.        167     43.4      130      308             24
10 Estonia             172.        167     59.7      100      265             26
# ℹ 36 more rows
# Printing statistics by year
year_stats <- df_clean %>%
  group_by(year) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    median_rate = median(death_rate, na.rm = TRUE),
    n_countries = n_distinct(country)
  ) %>%
  arrange(year)
print(year_stats)
# A tibble: 14 × 4
    year mean_rate median_rate n_countries
   <dbl>     <dbl>       <dbl>       <int>
 1  2010      141.       128            45
 2  2011      135.       124.           44
 3  2012      133.       120            45
 4  2013      130.       116            45
 5  2014      126.       117            46
 6  2015      125.       114.           45
 7  2016      124.       112.           46
 8  2017      122.       111            45
 9  2018      121.       108.           45
10  2019      118.       106.           44
11  2020      137.       118.           42
12  2021      148.       118.           40
13  2022      120.       108.           35
14  2023      112.        99.5          14

To visualize the cleaned data, we can create a distribution plot of preventable death rates:

# Loading necessary library
library(ggplot2)

# Plotting distribution of death rates
ggplot(df_clean, aes(x = death_rate)) +
  geom_histogram(bins = 30, fill = "#9B59B6", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Preventable Death Rates",
    x = "Death Rate per 100,000",
    y = "Frequency",
    caption = "Source: OECD Health Statistics"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

Conceptual Framework: Knowledge & Understanding of Data

  • Clarify analytical purpose and domain context to guide data selection and interpretation.
  • Define entities, observational units, and identifiers to ensure accurate data representation.
  • Align business concepts with data structures for meaningful analysis.

Data Collection

Data collection forms the foundational stage of any data science project, requiring systematic approaches to gather information that aligns with research objectives and analytical requirements. As outlined in modern statistical frameworks, effective data collection strategies must balance methodological rigor with practical constraints (Çetinkaya-Rundel & Hardin, 2021).

Methods of Data Collection

Core Data Collection Competencies

The competencies required for effective data collection encompass both technical proficiency and methodological understanding (see Data Collection Competencies.pdf):

  • Source Identification and Assessment: Systematically identify internal and external data sources, evaluating their relevance, quality, and accessibility for the analytical objectives.

  • Data Acquisition Methods: Implement appropriate collection techniques including APIs (for instance see Spotify API tutorial and Postman Spotify tutorial), database queries, survey instruments, sensor networks, web scraping, and third-party vendor partnerships, ensuring methodological alignment with research design. A minimal API-ingestion sketch in R follows this list.

    For the two projects below, here are some data source recommendations (automatically created by Perplexity AI Deep Research):

  • Quality and Governance Framework: Establish protocols for assessing data provenance, licensing agreements, ethical compliance, and regulatory requirements (GDPR, industry-specific standards).

  • Methodological Considerations: Apply principles from research methodology to ensure data collection approaches support valid statistical inference and minimize bias introduction during the acquisition process.
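As flagged in the data acquisition bullet above, here is a minimal, heavily hedged sketch of pulling JSON data from a web API into R. The endpoint URL is purely hypothetical; a real project would substitute an actual, documented API (for example the Spotify API referenced above) together with the required authentication:

# Loading the jsonlite package for JSON parsing
library(jsonlite)

# Hypothetical API endpoint -- replace with a real, documented API URL
api_url <- "https://api.example.com/v1/sales?year=2024"

# Downloading and parsing the JSON response
# (fromJSON() simplifies nested JSON to data frames where possible)
api_data <- fromJSON(api_url)

# Inspecting the structure of the ingested data
str(api_data)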

Contemporary Data Collection Landscape

Modern data collection operates within an increasingly complex ecosystem characterized by diverse data types, real-time requirements, and distributed sources. The integration of traditional survey methods with emerging IoT sensors, social media APIs, and automated data pipelines requires comprehensive competency frameworks that address both technical implementation and methodological validity.

For comprehensive coverage of data collection methodologies and best practices, refer to: Research Methodology - Data Collection

Data Management

Data Management in data science curricula requires a coherent, multi-faceted framework that spans data quality, FAIR stewardship, master data governance, privacy/compliance, and modern architectures. Data quality assessment and governance define objective metrics (completeness, accuracy, consistency, plausibility, conformance) and governance processes that balance automated checks with human oversight. FAIR data principles provide a practical blueprint for metadata-rich stewardship to support findability, accessibility, interoperability, and reuse through machine-actionable metadata and persistent identifiers. Master Data Management ensures clean, trusted core entities across systems via governance and harmonization. Data privacy, security, and regulatory compliance embed responsible data handling and risk management, guided by purpose limitation, data minimization, accuracy, storage limitations, integrity/confidentiality, and accountability. Emerging trends in cloud-native data platforms, ETL/ELT, data lakes/lakehouses, and broader metadata automation shape scalable storage/compute and governance, enabling reproducible analytics and ML workflows. Together, these strands underpin trustworthy, discoverable, and compliant data inputs for research and coursework (Weiskopf & Weng, 2016; Wilkinson et al., 2016; GO FAIR Foundation, n.d.; Semarchy, n.d.; IBM, n.d.).

Data Evaluation

  • Define data quality dimensions and assessment frameworks.
  • Distinguish validation from verification in data pipelines.
  • Apply statistical methods and data profiling for evaluation.
  • Balance automated and manual (human-in-the-loop) evaluation approaches.
  • Implement tools, workflows, and governance for data evaluation.

Data evaluation ensures datasets are fit-for-purpose, reliable, and trustworthy throughout the analytical lifecycle. It encompasses systematic assessment of data quality dimensions (completeness, accuracy, consistency, validity, timeliness, uniqueness), validation and verification processes, and strategic application of statistical methods and data profiling. Quality dimensions provide measurable criteria for determining whether data meets analytical requirements, while assessment frameworks translate these into actionable metrics enabling objective measurement and contextual interpretation.

Validation and verification are complementary processes essential for data integrity. Validation checks occur at entry points, preventing bad data through constraint checks, format validation, and business rule enforcement. Verification involves post-collection checks ensuring data remains accurate and consistent over time and across system boundaries, supporting reproducibility and traceability. Together, validation acts as a gatekeeper while verification provides ongoing quality assurance.

Statistical methods form the technical foundation for evaluation. Outlier detection techniques (z-score, IQR, DBSCAN) identify anomalous observations requiring investigation. Distribution checks assess whether data conforms to expected patterns, while profiling describes dataset structure, missingness patterns, and statistical properties. Regression analysis and hypothesis testing diagnose quality issues and quantify relationships between quality metrics and analytical outcomes.

Modern data evaluation balances automated and manual approaches. Automated evaluation offers speed, scalability, and consistency for large datasets through rule-based validation, statistical profiling, and machine learning-based anomaly detection. Manual evaluation contributes domain expertise, contextual understanding, and interpretative judgment that automated systems cannot replicate. Human-in-the-loop approaches combine automation’s efficiency with human interpretability, optimizing both throughput and quality.

Tools, workflows, and governance frameworks provide infrastructure for systematic evaluation across the data lifecycle. Data profiling tools (e.g., Pandas Profiling, Great Expectations, Deequ) automate quality assessment. Validation frameworks embed checks into ETL/ELT pipelines. Data lineage tracking and metadata management support traceability and impact analysis. Governance frameworks establish roles, responsibilities, and processes aligning evaluation practices with regulatory requirements and reproducibility needs.
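To make these evaluation steps concrete in R, here is a minimal sketch of automated checks on the cleaned OECD dataset (df_clean) from above; the plausibility range of 0 to 1,000 deaths per 100,000 is an assumption chosen purely for illustration:

# Loading the dplyr library for data manipulation
library(dplyr)

# Completeness: counting missing values per column
colSums(is.na(df_clean))

# Uniqueness: checking for duplicated country-year observations
sum(duplicated(df_clean[, c("country", "year")]))

# Validity: flagging death rates outside an assumed plausibility range (0 to 1000)
df_checked <- df_clean %>%
  mutate(out_of_range = death_rate < 0 | death_rate > 1000)

# Summarizing the share of implausible values
mean(df_checked$out_of_range)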

Applications in the Programming Language R

Please read the How to Use R for Data Science by Prof. Dr. Huber for any basic questions regarding R programming.

Core tidyverse Tooling

Fundamental packages:

  • dplyr for data manipulation (filter, mutate, summarize, joins).
# Loading necessary library
library(dplyr)

# Using dplyr to pipe
df_summary <- df_clean %>%
  group_by(country) %>%
  summarize(
    mean_death_rate = mean(death_rate, na.rm = TRUE),
    max_death_rate = max(death_rate, na.rm = TRUE),
    min_death_rate = min(death_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_death_rate))

# Displaying the summary statistics
print(df_summary)
# A tibble: 46 × 4
   country         mean_death_rate max_death_rate min_death_rate
   <chr>                     <dbl>          <dbl>          <dbl>
 1 South Africa               330.            438            241
 2 Latvia                     230.            364            151
 3 Lithuania                  225.            340            134
 4 Romania                    223.            303            172
 5 Hungary                    218.            375            141
 6 Mexico                     216.            453            155
 7 Bulgaria                   193.            378            156
 8 Brazil                     185.            356            133
 9 Slovak Republic            178.            308            130
10 Estonia                    172.            265            100
# ℹ 36 more rows
  • tidyr for data reshaping (pivoting, nesting, separating, unnesting).
# Loading necessary library
library(tidyr)

# Using tidyr to pivot data
df_wide <- df_clean %>%
  pivot_wider(
    names_from = year,
    values_from = death_rate
  )

# Displaying the wide format data
print(df_wide)
# A tibble: 46 × 16
   country_code country  `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017`
   <chr>        <chr>    <list> <list> <list> <list> <list> <list> <list> <list>
 1 AUS          Austral… <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 2 AUT          Austria  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 3 BEL          Belgium  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 4 CAN          Canada   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 5 CHL          Chile    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 6 COL          Colombia <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 7 CRI          Costa R… <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 8 CZE          Czechia  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 9 DNK          Denmark  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
10 EST          Estonia  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
# ℹ 36 more rows
# ℹ 6 more variables: `2018` <list>, `2019` <list>, `2020` <list>,
#   `2021` <list>, `2022` <list>, `2023` <list>
  • ggplot2 for layered grammar-based visualization. See the example above.

  • Additional R libraries for ingestion, iteration, text, and factor handling:

    • readr for ingestion of CSV and other flat files.
    • readxl for ingestion of Excel files.
    • lubridate for date-time handling.
    • purrr for functional iteration.
    • stringr for text handling.
    • forcats for factor handling.

Data Visualization Principles

Choose encodings appropriate to variable types:

  • Continuous (Quantitative) variables → Position, Length. Examples: x/y coordinates in scatter plots, bar heights, line positions.
  • Categorical (Nominal) variables → Color hue, Shape, Facets. Examples: different colors for groups, point shapes, separate panels.
  • Ordinal variables → Ordered position, Color saturation. Examples: ordered categories on an axis, gradient colors from light to dark.
  • Temporal variables → Position along the x-axis, Line connections. Examples: time on the horizontal axis, connected points showing progression.
  • Compositional (Part-to-whole) variables → Stacked position, Area. Examples: stacked bars, proportional areas.

Emphasize clarity: reduce chart junk; apply perceptual best practices:

Clarity in data visualization requires removing unnecessary elements that distract from the data while applying principles of human perception to enhance understanding.

  • Remove decorative elements (3D effects, shadows, gradients)
  • Eliminate redundant labels and gridlines
  • Minimize non-data ink (borders, background colors)
  • Avoid unnecessary legends when direct labeling is possible
  • Use position over angle for quantitative comparisons (bar charts > pie charts)
  • Maintain consistent scales across comparable charts
  • Respect aspect ratios that emphasize meaningful patterns
  • Choose colorblind-friendly palettes
  • Ensure sufficient contrast between data and background
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Creating data for demonstration
country_subset <- df_clean %>%
  filter(country %in% c("Germany", "France", "United Kingdom")) %>%
  filter(year >= 2015)

# Example: Clean, minimal visualization
ggplot(country_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Preventable Death Rates (2015-2023)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),  # Removing unnecessary gridlines
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Support comparison, trend detection, and anomaly spotting:

Effective visualizations should facilitate three key analytical tasks: comparing values across groups, identifying trends over time, and detecting unusual patterns.

Support comparison:

  • Align items on common scales for direct comparison
  • Use small multiples (facets) for comparing across categories
  • Order categorical variables meaningfully (by value, alphabetically, or logically)
  • Keep consistent ordering across related charts

Enable trend detection:

  • Use connected lines for temporal data to show continuity
  • Add trend lines (linear, loess) to highlight overall patterns
  • Display sufficient time periods to establish meaningful trends
  • Avoid over-smoothing that hides important variations

Facilitate anomaly spotting:

  • Use reference lines or bands for expected ranges
  • Highlight outliers through color or annotation
  • Include context (confidence intervals, historical ranges)
  • Maintain consistent scales to make deviations visible
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Calculating statistics for anomaly detection
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE)
  )

# Joining back to identify anomalies
df_annotated <- df_clean %>%
  left_join(country_stats, by = "country") %>%
  mutate(
    z_score = (death_rate - mean_rate) / sd_rate,
    is_anomaly = abs(z_score) > 2  # Flagging values > 2 standard deviations
  )

# Example: Visualization supporting comparison, trends, and anomaly detection
selected_countries <- c("Germany", "France", "United Kingdom", "Italy", "Spain")
df_subset <- df_annotated %>% filter(country %in% selected_countries)

ggplot(df_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 0.8) +
  geom_point(aes(size = is_anomaly, alpha = is_anomaly)) +
  scale_size_manual(values = c(1.5, 3), guide = "none") +
  scale_alpha_manual(values = c(0.6, 1), guide = "none") +
  facet_wrap(~ country, ncol = 2) +  # Small multiples for comparison
  labs(
    title = "Preventable Death Rates: Trends and Anomalies",
    subtitle = "Larger points indicate statistical anomalies (>2 SD from country mean)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold")
  )

Detecting Outliers and Anomalies

  • Rule-based methods (IQR, z-scores).
  • Robust statistics (median, MAD).
  • Model-based or multivariate detection (e.g., Mahalanobis distance, clustering residuals).
  • Distinguish errors vs. novel but valid observations.
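A minimal sketch of the rule-based and robust methods listed above, applied to the death_rate variable of df_clean; the cut-offs used (|z| > 3, 1.5 × IQR, robust |z| > 3.5) are common conventions rather than fixed rules:

# Rule-based detection with z-scores (|z| > 3 is a common convention)
z_scores <- (df_clean$death_rate - mean(df_clean$death_rate)) / sd(df_clean$death_rate)
outliers_z <- df_clean[abs(z_scores) > 3, ]

# Rule-based detection with the interquartile range (IQR)
q <- quantile(df_clean$death_rate, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers_iqr <- df_clean[df_clean$death_rate < q[1] - 1.5 * iqr |
                         df_clean$death_rate > q[2] + 1.5 * iqr, ]

# Robust detection with median and median absolute deviation (MAD)
robust_z <- (df_clean$death_rate - median(df_clean$death_rate)) /
  mad(df_clean$death_rate)
outliers_mad <- df_clean[abs(robust_z) > 3.5, ]

# Comparing how many observations each method flags
c(z_score = nrow(outliers_z), iqr = nrow(outliers_iqr), mad = nrow(outliers_mad))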

Dimensionality Reduction

  • Motivation: mitigate multicollinearity, noise, and curse of dimensionality.
  • Techniques: Principal Component Analysis (PCA), Factor Analysis, (optionally) t-SNE / UMAP (for exploration).
  • Interpretability vs. compression trade-offs.
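A minimal PCA sketch using base R's prcomp() on the built-in mtcars dataset (chosen here only because it ships with R and contains several correlated numeric variables):

# Running PCA on the standardized numeric variables of the built-in mtcars dataset
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Showing the share of variance explained by each principal component
summary(pca_result)

# Inspecting the loadings of the first two components
round(pca_result$rotation[, 1:2], 2)

# Scree plot to judge how many components to retain
screeplot(pca_result, type = "lines", main = "Scree plot of mtcars PCA")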

Data Exploration and Mining

  • Structured EDA workflow: question → visualize → quantify → refine.
  • PCA for variance structure.
  • Factor Analysis for latent constructs.
  • Regression Analysis for relationships and predictive structure.
  • Clustering (k-means, hierarchical) for pattern discovery (if included).
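As a minimal sketch of clustering-based pattern discovery, the following applies k-means to the standardized mtcars data; the choice of three clusters is an assumption for illustration and would normally be guided by, for example, the elbow method:

# Standardizing the built-in mtcars dataset so all variables are on comparable scales
mtcars_scaled <- scale(mtcars)

# Running k-means clustering with an assumed number of three clusters
set.seed(42)
km_result <- kmeans(mtcars_scaled, centers = 3, nstart = 25)

# Showing cluster sizes and cluster membership of each car
km_result$size
km_result$cluster

# Comparing average horsepower and weight across clusters
aggregate(mtcars[, c("hp", "wt")], by = list(cluster = km_result$cluster), FUN = mean)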

Causal Inference with Regression Analysis

  • Distinguish association vs. causation.
  • Model specification and confounding control.
  • Assumptions: linearity, independence, homoskedasticity, exogeneity.
  • Interpretation of coefficients and marginal effects.
  • Sensitivity and robustness checks.
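A minimal, hedged sketch of these ideas with simulated data in which the true causal effect is known by construction; omitting the confounder biases the coefficient on the treatment, while controlling for it recovers an estimate close to the true value of 2:

# Simulating data with a known causal structure (illustrative only)
set.seed(2025)
n <- 500
confounder <- rnorm(n)                       # e.g., firm size
treatment <- 0.8 * confounder + rnorm(n)     # e.g., marketing spend, partly driven by firm size
outcome <- 2 * treatment + 1.5 * confounder + rnorm(n)  # true treatment effect = 2

# Naive model: association only, confounded estimate of the treatment effect
naive_model <- lm(outcome ~ treatment)
coef(naive_model)

# Adjusted model: controlling for the confounder recovers (approximately) the true effect
adjusted_model <- lm(outcome ~ treatment + confounder)
coef(adjusted_model)

# Basic diagnostic plots for linearity and homoskedasticity checks
par(mfrow = c(2, 2))
plot(adjusted_model)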

Literature

All references for this course.

Essential Readings

Further Readings