Hochschule Fresenius - University of Applied Sciences
Email: benjamin.gross@ext.hs-fresenius.de
Website: https://drbenjamin.github.io
Published
19.12.2025 11:52
Abstract
This document provides the course material for Data Science and Data Analytics (B. A. – International Business Management). Upon successful completion of the course, students will be able to: recognize important technological and methodological advancements in data science and distinguish between descriptive, predictive, and prescriptive analytics; demonstrate proficiency in classifying data and variables, collecting and managing data, and conducting comprehensive data evaluations; utilize R for effective data manipulation, cleaning, visualization, outlier detection, and dimensionality reduction; conduct sophisticated data exploration and mining techniques (including PCA, Factor Analysis, and Regression Analysis) to discover underlying patterns and inform decision-making; analyze and interpret causal relationships in data using regression analysis; evaluate and organize the implementation of a data analysis project in a business environment; and communicate the results and effects of a data analysis project in a structured way.
Scope and Nature of Data Science
Let’s start this course with some definitions and context.
Definition of Data Science:
The field of Data Science concerns techniques for extracting knowledge from diverse data, with a particular focus on ‘big’ data exhibiting ‘V’ attributes such as volume, velocity, variety, value and veracity.
Maneth & Poulovassilis (2016)
Definition of Data Analytics:
Data analytics is the systematic process of examining data using statistical, computational, and domain-specific methods to extract insights, identify patterns, and support decision-making. It combines competencies in data handling, analysis techniques, and domain knowledge to generate actionable outcomes in organizational contexts (Cuadrado-Gallego et al., 2023).
Definition of Business Analytics:
Business analytics is the science of posing and answering data questions related to business. Business analytics has rapidly expanded in the last few years to include tools drawn from statistics, data management, data visualization, and machine learning. There is increasing emphasis on big data handling to assimilate the advances made in data sciences. As is often the case with applied methodologies, business analytics has to be soundly grounded in applications in various disciplines and business verticals to be valuable. The bridge between the tools and the applications are the modeling methods used by managers and researchers in disciplines such as finance, marketing, and operations.
Pochiraju & Seshadri (2019)
There are many roles in the data science field, including (but not limited to):
Source: LinkedIn
For skills and competencies required for data science activities, see Skills Landscape.
Defining Data Science as an Academic Discipline
Data science emerges as an interdisciplinary field that synthesizes methodologies and insights from multiple academic domains to extract knowledge and actionable insights from data. As an academic discipline, data science represents a convergence of computational, statistical, and domain-specific expertise that addresses the growing need for data-driven decision-making in various sectors.
Data science draws from and interacts with multiple foundational disciplines:
Informatics / Information Systems:
Informatics provides the foundational understanding of information processing, storage, and retrieval systems that underpin data science infrastructure. It encompasses database design, data modeling, information architecture, and system integration principles essential for managing large-scale data ecosystems. Information systems contribute knowledge about organizational data flows, enterprise architectures, and the sociotechnical aspects of data utilization in business contexts.
Computer Science (algorithms, data structures, systems design):
Computer science provides the computational foundation for data science through algorithm design, complexity analysis, and efficient data structures. Core contributions include machine learning algorithms, distributed computing paradigms, database systems, and software engineering practices. System design principles enable scalable data processing architectures, while computational thinking frameworks guide algorithmic problem-solving approaches essential for data-driven solutions.
Mathematics (linear algebra, calculus, discrete mathematics):
Mathematics provides the theoretical backbone for data science through linear algebra (matrix operations, eigenvalues, vector spaces), calculus (derivatives, gradients, optimization), and discrete mathematics (graph theory, combinatorics). These mathematical foundations enable dimensionality reduction techniques, gradient-based optimization algorithms, statistical modeling, and the rigorous formulation of machine learning problems (see the figure below). Mathematical rigor ensures the validity and interpretability of analytical results.
Statistics & Econometrics (hypothesis testing, causal inference, time series analysis):
Statistics provides the methodological framework for data analysis through hypothesis testing, confidence intervals, regression analysis, and experimental design. Econometrics contributes advanced techniques for causal inference, time series analysis, and handling observational data challenges such as endogeneity and selection bias. These disciplines ensure rigorous uncertainty quantification, model validation, and the ability to draw reliable conclusions from data while understanding limitations and assumptions.
Social Science & Behavioral Sciences (contextual interpretation, experimental design):
Social and behavioral sciences contribute essential understanding of human behavior, organizational dynamics, and contextual factors that influence data generation and interpretation. These disciplines provide expertise in experimental design, survey methodology, ethical considerations, and the social implications of data-driven decisions. They ensure that data science applications consider human factors, cultural context, and societal impact while maintaining ethical standards in data collection and analysis.
Source: LinkedIn
A recent Business Punk pilot study reveals that AI systems like ChatGPT, Claude, and Meta AI are developing genuine brand characteristics through their consistent tonality, personality, and emotional resonance with users. The research shows US-based AI models dominate in both global presence and emotional connection, while European systems remain functionally strong but lack distinctive brand identity. This transformation marks a shift where AI branding emerges in real-time through every interaction, making personality a core product quality and positioning systems like Claude and Meta AI as potential “superbrands” of the next decade.
Source: Business Punk
The interdisciplinary nature of data science requires practitioners to develop competencies across these domains while maintaining awareness of how different methodological traditions complement and inform each other. This multidisciplinary foundation enables data scientists to approach complex problems with both technical rigor and contextual understanding, ensuring that analytical solutions are both technically sound and practically relevant.
Significance of Business Data Analysis for Decision-Making
Business data analysis has evolved from a supporting function to a critical strategic capability that fundamentally transforms how organizations make decisions, allocate resources, and compete in modern markets. The systematic application of analytical methods to business data enables evidence-based decision-making that reduces uncertainty, improves operational efficiency, and creates sustainable competitive advantages.
Strategic Decision-Making Framework
Business data analysis provides a structured approach to strategic decision-making through multiple analytical dimensions:
Evidence-Based Strategic Planning: Data analysis supports long-term strategic decisions by providing empirical evidence about market trends, competitive positioning, and organizational capabilities. Statistical analysis of historical performance data, market research, and competitive intelligence enables organizations to formulate strategies grounded in quantifiable evidence rather than intuition alone.
Risk Assessment and Mitigation: Advanced analytical techniques enable comprehensive risk evaluation across operational, financial, and strategic dimensions. Monte Carlo simulations, scenario analysis, and predictive modeling help organizations quantify potential risks and develop contingency plans based on probabilistic assessments of future outcomes.
Resource Allocation Optimization: Data-driven resource allocation models leverage optimization algorithms and statistical analysis to maximize return on investment across different business units, projects, and initiatives. Linear programming, integer optimization, and multi-criteria decision analysis provide frameworks for allocating limited resources to achieve optimal organizational outcomes.
Operational Decision Support
At the operational level, business data analysis transforms day-to-day decision-making through real-time insights and systematic performance measurement:
Performance Measurement and Continuous Improvement: Key Performance Indicators (KPIs) and statistical process control methods enable organizations to monitor operational efficiency, quality metrics, and customer satisfaction in real-time. Time series analysis, control charts, and regression analysis identify trends, anomalies, and improvement opportunities that drive continuous operational enhancement.
Forecasting and Demand Planning: Statistical forecasting models using techniques such as ARIMA, exponential smoothing, and machine learning algorithms enable accurate demand prediction for inventory management, capacity planning, and supply chain optimization. These analytical approaches reduce uncertainty in operational planning while minimizing costs associated with overstock or stockouts.
Customer Analytics and Personalization: Advanced customer analytics leverage segmentation analysis, predictive modeling, and behavioral analytics to understand customer preferences, predict churn, and optimize retention strategies. Clustering algorithms, logistic regression, and recommendation systems enable personalized customer experiences that increase satisfaction and loyalty.
Tactical Decision Integration
Business data analysis bridges strategic planning and operational execution through tactical decision support:
Pricing Strategy Optimization: Price elasticity analysis, competitive pricing models, and revenue optimization techniques enable dynamic pricing strategies that maximize profitability while maintaining market competitiveness. Regression analysis, A/B testing, and econometric modeling provide empirical foundations for pricing decisions.
Market Intelligence and Competitive Analysis: Data analysis transforms market research and competitive intelligence into actionable insights through statistical analysis of market trends, customer behavior, and competitive positioning. Multivariate analysis, factor analysis, and time series forecasting identify market opportunities and competitive threats.
Financial Performance Analysis: Financial analytics encompassing ratio analysis, variance analysis, and predictive financial modeling enable organizations to assess financial health, identify cost reduction opportunities, and optimize capital structure decisions. Statistical analysis of financial data supports both internal performance evaluation and external stakeholder communication.
Contemporary Analytical Capabilities
Modern business data analysis capabilities extend traditional analytical methods through integration of advanced technologies and methodologies:
Real-Time Analytics and Decision Support: Stream processing, event-driven analytics, and real-time dashboards enable immediate response to changing business conditions. Complex event processing and real-time statistical monitoring support dynamic decision-making in fast-paced business environments.
Predictive and Prescriptive Analytics: Machine learning algorithms, neural networks, and optimization models enable organizations to not only predict future outcomes but also recommend optimal actions. These advanced analytical capabilities support automated decision-making and strategic scenario planning.
Data-Driven Innovation: Analytics-driven innovation leverages data science techniques to identify new business opportunities, develop innovative products and services, and create novel revenue streams. Advanced analytics enable organizations to discover hidden patterns, correlations, and insights that drive innovation and competitive differentiation.
The significance of business data analysis for decision-making extends beyond technical capabilities to encompass organizational transformation, cultural change, and strategic competitive positioning. Organizations that successfully integrate analytical capabilities into their decision-making processes achieve superior performance outcomes, enhanced agility, and sustainable competitive advantages in increasingly data-driven markets.
For comprehensive coverage of business data analysis methodologies and applications, see Advanced Business Analytics and the analytical foundations outlined in Evans (2020).
For open access resources, visit Kaggle, a platform for data science competitions and datasets.
Emerging Trends
Key technological and methodological developments shaping the data landscape:
Evolution of computing and data processing architectures.
Digitalization of processes and platforms.
Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Big Data ecosystems (volume, velocity, variety, veracity, value).
Source: LinkedIn
Internet of Things (IoT) and sensor-driven data generation.
Data analytic competencies encompass the ability to apply machine learning, data mining, statistical methods, and algorithmic approaches to extract meaningful patterns, insights, and predictions from complex datasets. They include proficiency in exploratory data analysis, feature engineering, model selection, evaluation, and validation. These skills ensure rigorous interpretation of data, support evidence-based decision-making, and enable the development of robust analytical solutions adaptable to diverse health, social, and technological contexts.
Types of Data
The structure and temporal dimension of data fundamentally influence analytical approaches and statistical methods. Understanding data types enables researchers to select appropriate modeling techniques and interpret results within proper contextual boundaries.
Cross-sectional data captures observations of multiple entities (individuals, firms, countries) at a single point in time. This structure facilitates comparative analysis across units but does not track changes over time. Cross-sectional studies are particularly valuable for examining relationships between variables at a specific moment and testing hypotheses about population characteristics.
Time-series data records observations of a single entity across multiple time points, enabling the analysis of temporal patterns, trends, seasonality, and cyclical behaviors. Time-series methods account for autocorrelation and temporal dependencies, supporting forecasting and dynamic modeling. This data structure is essential for economic indicators, financial markets, and environmental monitoring.
Panel (longitudinal) data combines both dimensions, tracking multiple entities over time. This structure offers substantial analytical advantages by controlling for unobserved heterogeneity across entities and modeling both within-entity and between-entity variation. Panel data methods support causal inference through fixed-effects and random-effects models, difference-in-differences estimation, and dynamic panel specifications.
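To make the contrast between these structures concrete, the following sketch builds tiny illustrative data frames in base R (the firms, years, and revenue figures are made up for demonstration only):

# Cross-sectional: several firms observed at a single point in time
cross_sectional <- data.frame(
  firm = c("A", "B", "C"),
  revenue = c(120, 95, 180)
)

# Time series: one firm observed across several years
time_series <- data.frame(
  year = 2019:2023,
  revenue = c(100, 105, 98, 110, 120)
)

# Panel (longitudinal): several firms observed across several years
panel <- expand.grid(firm = c("A", "B"), year = 2021:2023)
panel$revenue <- c(100, 80, 105, 85, 110, 90)  # values ordered firm-within-year
print(panel)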
Source: https://static.vecteezy.com
Additional data structures:
Geo-referenced / spatial data is data associated with specific geographic locations, enabling spatial analysis and visualization. Techniques such as Geographic Information Systems (GIS), spatial autocorrelation, and spatial regression models are employed to analyze patterns and relationships in spatially distributed data.
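As a minimal illustration of geo-referenced data, the sketch below converts a plain data frame with longitude/latitude columns into a spatial object, assuming the sf package is installed (the cities and attribute values are illustrative):

# Loading the sf package for spatial data handling
library(sf)

# A small data frame with geographic coordinates (WGS84 longitude/latitude)
cities <- data.frame(
  city  = c("Cologne", "Berlin", "Munich"),
  lon   = c(6.96, 13.40, 11.58),
  lat   = c(50.94, 52.52, 48.14),
  value = c(10, 20, 15)  # illustrative attribute
)

# Converting the data frame into an sf object with a coordinate reference system
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)
print(cities_sf)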
Source: https://www.slingshotsimulations.com/
Streaming / real-time data is continuously generated data that is processed and analyzed in real-time. This data structure is crucial for applications requiring immediate insights, such as fraud detection, network monitoring, and real-time recommendation systems.
Types of Variables
Continuous (interval/ratio) data is measured on a scale with meaningful intervals and a true zero point (ratio) or arbitrary zero point (interval). Examples include height, weight, temperature, and income. Continuous variables support a wide range of statistical analyses, including regression and correlation.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning continuous data to a data frame and displaying it as a table
continuous_data <- data.frame(
  Height_cm = c(170, 165, 180, 175, 160, 185, 172, 168, 178, 182),
  Weight_kg = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original table
print(continuous_data)
# Practicing:
# 1. Assign continuous data to a data frame and display it as a table
# 2. Order the data frame by a specific column
Count data represents the number of occurrences of an event or the frequency of a particular characteristic. Count variables are typically non-negative integers and can be analyzed using Poisson regression or negative binomial regression.
# Assigning count data to a data frame and displaying it as a table
count_data <- data.frame(
  Height = c(170, 165, 182, 175, 165, 175, 175, 168, 175, 182),
  Weight = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original data as a table
print(count_data)
# Ordering the data frame and counting occurrences of each Height value
ordered_count_data <- count_data %>%
  arrange(desc(Height), Weight) %>%
  count(Height)

# Displaying the ordered count data
print(ordered_count_data)
Height n
1 165 2
2 168 1
3 170 1
4 175 4
5 182 2
# Practicing:
# 1. Assign count data to a data frame and display it as a table
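As mentioned above, count outcomes are commonly modelled with Poisson regression. The following sketch fits such a model with base R's glm(); the visit counts and ages are hypothetical values for illustration only:

# Hypothetical count data: number of doctor visits and age of respondents
poisson_data <- data.frame(
  visits = c(2, 0, 3, 1, 4, 2, 5, 1, 0, 3),
  age    = c(25, 32, 41, 29, 50, 36, 47, 23, 31, 44)
)

# Fitting a Poisson regression of visit counts on age
poisson_model <- glm(visits ~ age, data = poisson_data, family = poisson())
summary(poisson_model)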
Ordinal data represents categories with a meaningful order or ranking but no consistent interval between categories. Examples include survey responses (e.g., Likert scales) and socioeconomic status (e.g., low, medium, high). Ordinal variables can be analyzed using non-parametric tests or ordinal regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning ordinal data to a data frame and displaying it as a table
ordinal_data <- data.frame(
  Response = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
  Value = c(5, 4, 3, 2, 1)
)

# Displaying the original table
print(ordinal_data)
# Practicing:
# 1. Assign ordinal data to a data frame and display it as a table
# 2. Order the data frame by the ordinal value
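Because ordinal categories carry an order, it is often useful to store them as an ordered factor so that comparisons, sorting, and ordinal models respect the scale. A minimal sketch with hypothetical survey responses:

# Defining the ordered Likert levels
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")

# Storing hypothetical responses as an ordered factor
responses <- factor(
  c("Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"),
  levels = likert_levels,
  ordered = TRUE
)

# Sorting and tabulating respect the ordering of the scale
print(sort(responses))
table(responses)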
Categorical (nominal / binary) data represents distinct categories without any inherent order. Nominal variables have two or more categories (e.g., gender, race, or marital status), while binary variables have only two categories (e.g., yes/no, success/failure). Categorical variables can be analyzed using chi-square tests or logistic regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning categorical data to a data frame and displaying it as a table
categorical_data <- data.frame(
  Event = c("A", "C", "B", "D", "E"),
  Category = c("Type1", "Type2", "Type1", "Type3", "Type2")
)

# Displaying the original table
print(categorical_data)
Event Category
1 A Type1
2 C Type2
3 B Type1
4 D Type3
5 E Type2
# Ordering the data frame by Event
categorical_data <- categorical_data %>%
  arrange(Event)

# Displaying the table
print(categorical_data)
Event Category
1 A Type1
2 B Type1
3 C Type2
4 D Type3
5 E Type2
# Practicing:
# 1. Assign categorical data to a data frame and display it as a table
# 2. Order the data frame by a specific column
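The chi-square test mentioned above checks whether two categorical variables are independent. A minimal sketch on hypothetical survey data (the counts are illustrative and far too small for a reliable test):

# Hypothetical categorical data
survey <- data.frame(
  gender   = c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M"),
  purchase = c("yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "no")
)

# Cross-tabulating the two categorical variables
tab <- table(survey$gender, survey$purchase)
print(tab)

# Chi-square test of independence (expect a warning with such small counts)
chisq.test(tab)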
Compositional or hierarchical structures represent data with a part-to-whole relationship or nested categories. Examples include demographic data (e.g., age groups within gender) and geographical data (e.g., countries within continents). Compositional data can be analyzed using techniques such as hierarchical clustering or multilevel modeling.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning hierarchical data to a data frame and displaying it as a table
hierarchical_data <- data.frame(
  Country = c("USA", "Canada", "USA", "Canada", "Mexico"),
  State_Province = c("California", "Ontario", "Texas", "Quebec", "Jalisco"),
  Population_Millions = c(39.5, 14.5, 29.0, 8.5, 8.3)
)

# Displaying the table
print(hierarchical_data)
Country State_Province Population_Millions
1 USA California 39.5
2 Canada Ontario 14.5
3 USA Texas 29.0
4 Canada Quebec 8.5
5 Mexico Jalisco 8.3
# Ordering the data frame by Country and then State_Province
hierarchical_data <- hierarchical_data %>%
  arrange(Country, State_Province)

# Displaying the ordered data frame
print(hierarchical_data)
Country State_Province Population_Millions
1 Canada Ontario 14.5
2 Canada Quebec 8.5
3 Mexico Jalisco 8.3
4 USA California 39.5
5 USA Texas 29.0
# Grouping the data by Country and summarizing total population
hierarchical_data_grouped <- hierarchical_data %>%
  group_by(Country) %>%
  summarise(Total_Population = sum(Population_Millions))

# Displaying the grouped data frame
print(hierarchical_data_grouped)
# A tibble: 3 × 2
Country Total_Population
<chr> <dbl>
1 Canada 23
2 Mexico 8.3
3 USA 68.5
# Practicing:
# 1. Assign hierarchical data to a data frame and display it as a table
# 2. Order the data frame by multiple columns
Source: https://www.collegedisha.com/
Some small datasets to start with
A custom R package ourdata has been created to provide some small datasets (and also some helper R functions) for practice. You can install it from GitHub using the following commands:
# Installing Github R packages
devtools::install_github("DrBenjamin/ourdata")
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/RtmpHOUweO/remotes4ca21964d22f/DrBenjamin-ourdata-9500135/DESCRIPTION’ ... OK
* preparing ‘ourdata’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘ourdata_0.5.0.tar.gz’
Using the package and exploring its documentation:
# Loading package
library(ourdata)

# Opening help of package
??ourdata

# Showing welcome message
ourdata()
This is the R package `ourdata` used for Data Science courses at the Fresenius University of Applied Sciences.
Type `help(ourdata)` to display the content.
Type `ourdata_website()` to open the package website.
Have fun in the course!
Benjamin Gross
# Using help function from the package for a specific function from the `ourdata` R package
help(combine)
Help on topic 'combine' was found in the following packages:
Package Library
dplyr /home/runner/R-library
ourdata /home/runner/R-library
Using the first match ...
# Using the `combine` function from the `ourdata` R package to combine vectors into a data frame
ourdata::combine(kirche$Jahr, koelsch$Jahr, kirche$Austritte, koelsch$Koelsch)
'data.frame': 4 obs. of 3 variables:
$ C1: chr "2017" "2018" "2019" "2020"
$ C2: num 364711 437416 539509 441390
$ C3: num 1.87e+08 1.91e+08 1.79e+08 1.69e+08
We collect data from the OECD (Organisation for Economic Co-operation and Development), an international organization that works to build better policies for better lives. The dataset contains many columns (variables) with non-informative data and needs to be cleaned (wrangled) before analysis. First, load the data into R from the CSV file:
# Reading the dataset from a CSV file
preventable_deaths <- read.csv("./topics/data/OECD_Preventable_Deaths.csv",
                               stringsAsFactors = FALSE)
or use the dataset directly from the ourdata R package:
# Reading the dataset from the `ourdata` R package
library(ourdata)
preventable_deaths <- oecd_preventable
First we explore the data:
# Viewing structure of the dataset
str(preventable_deaths)

# Viewing first few rows
head(preventable_deaths)

# Checking dimensions
dim(preventable_deaths)

# Viewing column names
colnames(preventable_deaths)

# Summary statistics
summary(preventable_deaths)

# Checking for missing values
colSums(is.na(preventable_deaths))
Now we can start cleaning the data by removing non-informative columns and rows with missing values:
# Loading necessary library
library(dplyr)

# Selecting relevant columns for analysis
df_clean <- preventable_deaths %>%
  select(REF_AREA, Reference.area, TIME_PERIOD, OBS_VALUE) %>%
  rename(
    country_code = REF_AREA,
    country = Reference.area,
    year = TIME_PERIOD,
    death_rate = OBS_VALUE
  )

# Converting year to numeric
df_clean$year <- as.numeric(df_clean$year)

# Converting death_rate to numeric (handling empty strings)
df_clean$death_rate <- as.numeric(df_clean$death_rate)

# Removing rows with missing death rates
df_clean <- df_clean %>%
  filter(!is.na(death_rate))

# Viewing cleaned data structure
str(df_clean)
'data.frame': 1162 obs. of 4 variables:
$ country_code: chr "AUS" "AUS" "AUS" "AUS" ...
$ country : chr "Australia" "Australia" "Australia" "Australia" ...
$ year : num 2010 2011 2012 2013 2014 ...
$ death_rate : num 110 109 105 105 107 108 103 103 101 104 ...
# Summary of cleaned data
summary(df_clean)
country_code country year death_rate
Length:1162 Length:1162 Min. :2010 Min. : 34.0
Class :character Class :character 1st Qu.:2013 1st Qu.: 75.0
Mode :character Mode :character Median :2016 Median :116.0
Mean :2016 Mean :128.7
3rd Qu.:2019 3rd Qu.:156.0
Max. :2023 Max. :453.0
The summary output above shows basic statistics of the cleaned data. To visualize the cleaned data, we can create a distribution plot of preventable death rates:
# Loading necessary library
library(ggplot2)

# Plotting distribution of death rates
ggplot(df_clean, aes(x = death_rate)) +
  geom_histogram(bins = 30, fill = "#9B59B6", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Preventable Death Rates",
    x = "Death Rate per 100,000",
    y = "Frequency",
    caption = "Source: OECD Health Statistics"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )
Conceptual Framework: Knowledge & Understanding of Data
Clarify analytical purpose and domain context to guide data selection and interpretation.
Define entities, observational units, and identifiers to ensure accurate data representation.
Align business concepts with data structures for meaningful analysis.
Data Collection
Data collection forms the foundational stage of any data science project, requiring systematic approaches to gather information that aligns with research objectives and analytical requirements. As outlined in modern statistical frameworks, effective data collection strategies must balance methodological rigor with practical constraints (M. & Hardin, 2021).
Methods of Data Collection
Core Data Collection Competencies
The competencies required for effective data collection encompass both technical proficiency and methodological understanding (see Data Collection Competencies.pdf):
Source Identification and Assessment: Systematically identify internal and external data sources, evaluating their relevance, quality, and accessibility for the analytical objectives.
Data Acquisition Methods: Implement appropriate collection techniques including APIs (for instance see Spotify API tutorial and Postman Spotify tutorial), database queries, survey instruments, sensor networks, web scraping, and third-party vendor partnerships, ensuring methodological alignment with research design.
For the two projects below, here are some data source recommendations (automatically created by Perplexity AI Deep Research):
Quality and Governance Framework: Establish protocols for assessing data provenance, licensing agreements, ethical compliance, and regulatory requirements (GDPR, industry-specific standards).
Methodological Considerations: Apply principles from research methodology to ensure data collection approaches support valid statistical inference and minimize bias introduction during the acquisition process.
Contemporary Data Collection Landscape
Modern data collection operates within an increasingly complex ecosystem characterized by diverse data types, real-time requirements, and distributed sources. The integration of traditional survey methods with emerging IoT sensors, social media APIs, and automated data pipelines requires comprehensive competency frameworks that address both technical implementation and methodological validity.
Data Management in data science curricula requires a coherent, multi-faceted framework that spans data quality, FAIR stewardship, master data governance, privacy/compliance, and modern architectures. Data quality assessment and governance define objective metrics (completeness, accuracy, consistency, plausibility, conformance) and governance processes that balance automated checks with human oversight. FAIR data principles provide a practical blueprint for metadata-rich stewardship to support findability, accessibility, interoperability, and reuse through machine-actionable metadata and persistent identifiers.
Master Data Management ensures clean, trusted core entities across systems via governance and harmonization. Data privacy, security, and regulatory compliance embed responsible data handling and risk management, guided by purpose limitation, data minimization, accuracy, storage limitations, integrity/confidentiality, and accountability. Emerging trends in cloud-native data platforms, ETL/ELT (Extract, Transform, Load or Extract, Load, Transform), data lakes/lakehouses, and broader metadata automation shape scalable storage/compute and governance, enabling reproducible analytics and ML workflows. Together, these strands underpin trustworthy, discoverable, and compliant data inputs for research and coursework (Weiskopf & Weng, 2016; Wilkinson et al., 2016; GO FAIR Foundation, n.d.; Semarchy, n.d.; IBM, n.d.).
Define data quality dimensions and assessment frameworks.
Distinguish validation from verification in data pipelines.
Apply statistical methods and data profiling for evaluation.
Balance automated and manual (human-in-the-loop) evaluation approaches.
Implement tools, workflows, and governance for data evaluation.
Data evaluation ensures datasets are fit-for-purpose, reliable, and trustworthy throughout the analytical lifecycle. It encompasses systematic assessment of data quality dimensions (completeness, accuracy, consistency, validity, timeliness, uniqueness), validation and verification processes, and strategic application of statistical methods and data profiling. Quality dimensions provide measurable criteria for determining whether data meets analytical requirements, while assessment frameworks translate these into actionable metrics enabling objective measurement and contextual interpretation.
Validation and verification are complementary processes essential for data integrity. Validation checks occur at entry points, preventing bad data through constraint checks, format validation, and business rule enforcement. Verification involves post-collection checks ensuring data remains accurate and consistent over time and across system boundaries, supporting reproducibility and traceability. Together, validation acts as a gatekeeper while verification provides ongoing quality assurance.
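A minimal sketch of how such quality metrics and validation rules can be computed in R, assuming the df_clean data frame created earlier in this section (the plausibility thresholds are illustrative conventions, not fixed standards):

# Loading necessary library
library(dplyr)

# Quality dimensions: completeness and uniqueness
cat("Completeness of death_rate:", mean(!is.na(df_clean$death_rate)), "\n")
cat("Number of duplicated rows:", sum(duplicated(df_clean)), "\n")

# Validation as a gatekeeper: flagging rows that violate simple plausibility rules
invalid_rows <- df_clean %>%
  filter(year < 1900 | year > 2030 | death_rate < 0)
cat("Rows violating validation rules:", nrow(invalid_rows), "\n")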
Statistical methods form the technical foundation for evaluation. Outlier detection techniques (z-score, IQR, DBSCAN) identify anomalous observations requiring investigation. Distribution checks assess whether data conforms to expected patterns, while profiling describes dataset structure, missingness patterns, and statistical properties. Regression analysis and hypothesis testing diagnose quality issues and quantify relationships between quality metrics and analytical outcomes.
Modern data evaluation balances automated and manual approaches. Automated evaluation offers speed, scalability, and consistency for large datasets through rule-based validation, statistical profiling, and machine learning-based anomaly detection. Manual evaluation contributes domain expertise, contextual understanding, and interpretative judgment that automated systems cannot replicate. Human-in-the-loop approaches combine automation’s efficiency with human interpretability, optimizing both throughput and quality.
Tools, workflows, and governance frameworks provide infrastructure for systematic evaluation across the data lifecycle. Data profiling tools (e.g., Pandas Profiling, Great Expectations, Deequ) automate quality assessment. Validation frameworks embed checks into ETL/ELT pipelines. Data lineage tracking and metadata management support traceability and impact analysis. Governance frameworks establish roles, responsibilities, and processes aligning evaluation practices with regulatory requirements and reproducibility needs.
ggplot2 for layered grammar-based visualization. See the example above.
Some more R libraries:
readr for ingestion of CSV and other flat files.
readxl for ingestion of Excel files.
lubridate for date-time handling.
purrr for functional iteration.
stringr for text handling.
forcats for factor handling.
Data Visualization Principles
Choose encodings appropriate to variable types:
Continuous (Quantitative) variables → Position, Length Examples: x/y coordinates in scatter plots, bar heights, line positions
Categorical (Nominal) variables → Color hue, Shape, Facets Examples: different colors for groups, point shapes, separate panels
Ordinal variables → Ordered position, Color saturation Examples: ordered categories on axis, gradient colors from light to dark
Temporal variables → Position along x-axis, Line connections Examples: time on horizontal axis, connected points showing progression
Compositional (Part-to-whole) → Stacked position, Area Examples: stacked bars, proportional areas
Emphasize clarity: reduce chart junk; apply perceptual best practices:
Clarity in data visualization requires removing unnecessary elements that distract from the data while applying principles of human perception to enhance understanding.
Remove decorative elements (3D effects, shadows, gradients)
Avoid unnecessary legends when direct labeling is possible
Use position over angle for quantitative comparisons (bar charts > pie charts)
Maintain consistent scales across comparable charts
Respect aspect ratios that emphasize meaningful patterns
Choose colorblind-friendly palettes
Ensure sufficient contrast between data and background
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Creating data for demonstration
country_subset <- df_clean %>%
  filter(country %in% c("Germany", "France", "United Kingdom")) %>%
  filter(year >= 2015)

# Example: Clean, minimal visualization
ggplot(country_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Preventable Death Rates (2015-2023)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),  # Removing unnecessary gridlines
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
Support comparison, trend detection, and anomaly spotting:
Effective visualizations should facilitate three key analytical tasks: comparing values across groups, identifying trends over time, and detecting unusual patterns.
Support comparison:
Align items on common scales for direct comparison
Use small multiples (facets) for comparing across categories
Order categorical variables meaningfully (by value, alphabetically, or logically)
Keep consistent ordering across related charts
Enable trend detection:
Use connected lines for temporal data to show continuity
Add trend lines (linear, loess) to highlight overall patterns
Display sufficient time periods to establish meaningful trends
Avoid over-smoothing that hides important variations
Facilitate anomaly spotting:
Use reference lines or bands for expected ranges
Highlight outliers through color or annotation
Include context (confidence intervals, historical ranges)
Maintain consistent scales to make deviations visible
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Calculating statistics for anomaly detection
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE)
  )

# Joining back to identify anomalies
df_annotated <- df_clean %>%
  left_join(country_stats, by = "country") %>%
  mutate(
    z_score = (death_rate - mean_rate) / sd_rate,
    is_anomaly = abs(z_score) > 2  # Flagging values > 2 standard deviations
  )

# Example: Visualization supporting comparison, trends, and anomaly detection
selected_countries <- c("Germany", "France", "United Kingdom", "Italy", "Spain")
df_subset <- df_annotated %>%
  filter(country %in% selected_countries)

ggplot(df_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 0.8) +
  geom_point(aes(size = is_anomaly, alpha = is_anomaly)) +
  scale_size_manual(values = c(1.5, 3), guide = "none") +
  scale_alpha_manual(values = c(0.6, 1), guide = "none") +
  facet_wrap(~ country, ncol = 2) +  # Small multiples for comparison
  labs(
    title = "Preventable Death Rates: Trends and Anomalies",
    subtitle = "Larger points indicate statistical anomalies (>2 SD from country mean)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold")
  )
Detecting Outliers and Anomalies
Rule-based methods (IQR, z-scores): These classical univariate approaches identify outliers by establishing statistical thresholds, where observations falling beyond predefined boundaries (e.g., 1.5×IQR or ±3 standard deviations) are flagged as potential anomalies. While computationally efficient and interpretable, these methods assume underlying distributional properties and may overlook multivariate patterns.
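The flagged observations shown below could be produced with a dplyr pipeline along the following lines; this is a sketch assuming the df_clean data frame from above, since the exact code behind the table is not shown:

# Loading necessary library
library(dplyr)

# Flagging observations more than 3 standard deviations from their country mean
outliers_z <- df_clean %>%
  group_by(country) %>%
  mutate(
    mean_rate  = mean(death_rate, na.rm = TRUE),
    sd_rate    = sd(death_rate, na.rm = TRUE),
    z_score    = (death_rate - mean_rate) / sd_rate,
    is_outlier = abs(z_score) > 3
  ) %>%
  ungroup() %>%
  filter(is_outlier)

print(outliers_z)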
# A tibble: 10 × 8
country_code country year death_rate mean_rate sd_rate z_score is_outlier
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 COL Colombia 2021 304 142. 48.2 3.35 TRUE
2 CRI Costa Rica 2021 209 114. 28.3 3.36 TRUE
3 MEX Mexico 2020 445 216. 75.5 3.03 TRUE
4 MEX Mexico 2021 453 216. 75.5 3.14 TRUE
5 SVK Slovak Re… 2021 308 178. 43.4 3.00 TRUE
6 ARG Argentina 2021 269 149. 27.1 4.42 TRUE
7 BRA Brazil 2021 356 185. 51.6 3.31 TRUE
8 BGR Bulgaria 2021 378 193. 45.5 4.07 TRUE
9 PER Peru 2020 408 124. 90.4 3.14 TRUE
10 PER Peru 2021 447 124. 90.4 3.57 TRUE
Robust statistics (median, MAD): Resistant measures such as the median and median absolute deviation (MAD) provide reliable central tendency and dispersion estimates that are less influenced by extreme values compared to mean and standard deviation. These statistics form the foundation for outlier detection in skewed or heavy-tailed distributions where parametric assumptions are violated.
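A robust flagging in the spirit of the table below can be sketched as follows, again assuming df_clean; the 0.6745 scaling gives the classic robust z-score, and the cutoff is a common convention rather than the documented value behind the output:

# Loading necessary library
library(dplyr)

# Flagging observations with a large robust z-score per country
outliers_robust <- df_clean %>%
  group_by(country) %>%
  mutate(
    median_rate = median(death_rate, na.rm = TRUE),
    mad_rate    = mad(death_rate, constant = 1, na.rm = TRUE),
    # Classic robust z-score: 0.6745 * |x - median| / MAD
    robust_z    = 0.6745 * abs(death_rate - median_rate) / mad_rate,
    is_robust_outlier = robust_z > 3.5
  ) %>%
  ungroup() %>%
  filter(is_robust_outlier)

print(outliers_robust)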
# A tibble: 10 × 8
country_code country year death_rate median_rate mad_rate robust_z
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 COL Colombia 2021 304 132 27 4.30
2 MEX Mexico 2020 445 217 34 4.52
3 MEX Mexico 2021 453 217 34 4.68
4 ARG Argentina 2020 194 144. 5.5 6.19
5 ARG Argentina 2021 269 144. 5.5 15.4
6 BRA Brazil 2021 356 172. 31 4.01
7 BGR Bulgaria 2021 378 186 20.5 6.32
8 PER Peru 2020 408 97.5 6.5 32.2
9 PER Peru 2021 447 97.5 6.5 36.3
10 PER Peru 2022 146 97.5 6.5 5.03
# ℹ 1 more variable: is_robust_outlier <lgl>
Model-based or multivariate detection (e.g., Mahalanobis distance, clustering residuals): Advanced techniques account for correlation structures and multidimensional relationships, enabling detection of outliers that appear normal in individual dimensions but are anomalous in multivariate space. Mahalanobis distance measures how many standard deviations an observation is from the distribution center, while clustering residuals identify observations that deviate from expected cluster membership patterns.
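A minimal sketch of multivariate detection with the Mahalanobis distance, assuming df_clean; the chi-square quantile used as cutoff is a common choice, not a fixed rule:

# Loading necessary library
library(dplyr)

# Selecting the numeric variables for the multivariate view
df_mv <- df_clean %>%
  select(year, death_rate) %>%
  na.omit()

# Squared Mahalanobis distance of each observation from the multivariate centre
center_vec <- colMeans(df_mv)
cov_mat    <- cov(df_mv)
md2        <- mahalanobis(df_mv, center = center_vec, cov = cov_mat)

# Under approximate normality, md2 follows a chi-square distribution with p degrees of freedom
cutoff <- qchisq(0.975, df = ncol(df_mv))
cat("Number of multivariate outliers:", sum(md2 > cutoff), "\n")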
Distinguish errors vs. novel but valid observations: Critical analytical judgment is required to differentiate between measurement errors, data entry mistakes, and legitimate extreme values that represent rare but genuine phenomena. This distinction has profound implications for data quality management and scientific discovery, as premature removal of valid outliers may obscure important patterns or emerging trends.
Dimensionality Reduction
Motivation: mitigate multicollinearity, noise, and curse of dimensionality: High-dimensional datasets present computational challenges and statistical complications, including increased sparsity, model overfitting, and reduced discriminatory power of distance-based methods. Dimensionality reduction addresses these issues by transforming data into lower-dimensional representations while preserving essential variance and structural relationships.
Techniques: Principal Component Analysis (PCA), Factor Analysis, (optionally) t-SNE / UMAP (for exploration): PCA identifies orthogonal linear combinations of variables that maximize variance, creating uncorrelated components suitable for regression and classification tasks. Factor Analysis assumes latent constructs underlying observed variables, focusing on shared variance and theoretical interpretation, while non-linear methods like t-SNE and UMAP preserve local neighborhood structures for exploratory visualization of complex data manifolds.
# Loading necessary libraries
library(dplyr)

# Preparing data for PCA (using numeric variables only)
df_pca <- df_clean %>%
  select(year, death_rate) %>%
  na.omit()

# Performing PCA
pca_result <- prcomp(df_pca, scale. = TRUE)

# Displaying summary of PCA
summary(pca_result)
Importance of components:
PC1 PC2
Standard deviation 1.0189 0.9807
Proportion of Variance 0.5191 0.4809
Cumulative Proportion 0.5191 1.0000
# Showing principal component loadings
print(pca_result$rotation)
PC1 PC2
year 0.7071068 0.7071068
death_rate -0.7071068 0.7071068
# Calculating proportion of variance explained
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cat("Proportion of variance explained by PC1:", round(var_explained[1], 3), "\n")
Proportion of variance explained by PC1: 0.519
cat("Proportion of variance explained by PC2:", round(var_explained[2], 3), "\n")
Proportion of variance explained by PC2: 0.481
Interpretability vs. compression trade-offs: Dimensionality reduction inherently balances the competing objectives of achieving parsimonious data representations and maintaining interpretable relationships to original variables. While aggressive compression maximizes computational efficiency and reduces noise, it may obscure meaningful features and complicate domain-specific interpretation of analytical results.
Data Exploration and Mining
Structured EDA workflow: question → visualize → quantify → refine: Exploratory Data Analysis follows a systematic iterative process that begins with research questions, employs visualization to generate hypotheses, quantifies patterns through statistical measures, and refines understanding through successive analytical cycles. This disciplined approach prevents data dredging while ensuring comprehensive investigation of data characteristics and relationships.
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Step 1: Question - Are death rates declining over time?

# Step 2: Visualize
ggplot(df_clean, aes(x = year, y = death_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "EDA: Death Rates Over Time",
    x = "Year",
    y = "Deaths per 100,000"
  ) +
  theme_minimal()
# Step 3: Quantify
trend_model <- lm(death_rate ~ year, data = df_clean)
summary(trend_model)
Call:
lm(formula = death_rate ~ year, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-89.80 -54.93 -13.16 27.95 327.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1544.5523 1087.6865 1.420 0.156
year -0.7023 0.5395 -1.302 0.193
Residual standard error: 69.57 on 1160 degrees of freedom
Multiple R-squared: 0.001459, Adjusted R-squared: 0.0005978
F-statistic: 1.694 on 1 and 1160 DF, p-value: 0.1933
# Step 4: Refine - Examine by country
country_trends <- df_clean %>%
  group_by(country) %>%
  summarise(
    correlation = cor(year, death_rate, use = "complete.obs")
  ) %>%
  arrange(correlation)

print(head(country_trends))
# A tibble: 6 × 2
country correlation
<chr> <dbl>
1 South Africa -0.663
2 Israel -0.473
3 Korea -0.338
4 Japan -0.324
5 Luxembourg -0.317
6 Lithuania -0.290
PCA for variance structure: Principal Component Analysis reveals the underlying variance-covariance structure of multivariate data, identifying dimensions of maximum variability and reducing redundancy among correlated variables. By examining eigenvalues and component loadings, analysts determine the effective dimensionality of the dataset and detect dominant patterns in the data.
Factor Analysis for latent constructs: This technique assumes that observed variables are manifestations of unobservable latent factors, making it particularly valuable for psychometric research and construct validation. Unlike PCA, Factor Analysis models measurement error explicitly and focuses on shared rather than total variance, facilitating theoretical interpretation of underlying psychological or economic constructs.
# Loading necessary libraries
library(dplyr)
library(tidyr)

# Creating a wider dataset for factor analysis
# First, get unique country-year combinations by averaging any duplicates
df_wide_fa <- df_clean %>%
  filter(year >= 2018) %>%
  select(country, year, death_rate) %>%
  group_by(country, year) %>%
  summarise(death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = death_rate, names_prefix = "year_") %>%
  select(-country) %>%
  na.omit()

# Performing factor analysis using base R stats package
if (nrow(df_wide_fa) > 10 && ncol(df_wide_fa) >= 3) {
  # Determining number of factors (minimum of 2 or ncol-1)
  n_factors <- min(2, ncol(df_wide_fa) - 1)

  # Using factanal from stats package
  fa_result <- factanal(df_wide_fa, factors = n_factors, rotation = "varimax")

  # Displaying factor loadings
  cat("Factor Loadings:\n")
  print(fa_result$loadings)

  # Calculating proportion of variance explained
  loadings_sq <- fa_result$loadings^2
  var_explained <- colSums(loadings_sq) / nrow(loadings_sq)
  cat("\nProportion of variance explained by each factor:\n")
  print(var_explained)

  # Displaying uniquenesses (proportion of variance not explained by factors)
  cat("\nUniqueness (1 - communality):\n")
  print(fa_result$uniquenesses)
} else {
  cat("Note: Not enough observations or variables for factor analysis.\n")
  cat("Factor analysis requires at least 3 variables and sufficient observations.\n")
}
Regression Analysis for relationships and predictive structure: Linear and non-linear regression models quantify relationships between dependent and independent variables, enabling both explanatory analysis of associations and predictive modeling of outcomes. These methods provide parameter estimates, statistical inference, and diagnostic tools to assess model adequacy and identify influential observations.
# Loading necessary library
library(dplyr)

# Preparing data with a categorical variable
df_regression <- df_clean %>%
  mutate(
    time_period = case_when(
      year < 2015 ~ "Early",
      year >= 2015 & year < 2020 ~ "Middle",
      year >= 2020 ~ "Recent"
    )
  ) %>%
  filter(!is.na(time_period))

# Simple linear regression
model_simple <- lm(death_rate ~ year, data = df_regression)
summary(model_simple)
Call:
lm(formula = death_rate ~ year, data = df_regression)
Residuals:
Min 1Q Median 3Q Max
-89.80 -54.93 -13.16 27.95 327.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1544.5523 1087.6865 1.420 0.156
year -0.7023 0.5395 -1.302 0.193
Residual standard error: 69.57 on 1160 degrees of freedom
Multiple R-squared: 0.001459, Adjusted R-squared: 0.0005978
F-statistic: 1.694 on 1 and 1160 DF, p-value: 0.1933
# Multiple regression with categorical predictor
model_multiple <- lm(death_rate ~ year + time_period, data = df_regression)
summary(model_multiple)
Call:
lm(formula = death_rate ~ year + time_period, data = df_regression)
Residuals:
Min 1Q Median 3Q Max
-98.10 -54.76 -12.67 27.19 319.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6960.638 3073.466 2.265 0.0237 *
year -3.393 1.528 -2.221 0.0265 *
time_periodMiddle 5.755 8.892 0.647 0.5177
time_periodRecent 31.218 14.975 2.085 0.0373 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 69.32 on 1158 degrees of freedom
Multiple R-squared: 0.01036, Adjusted R-squared: 0.007793
F-statistic: 4.04 on 3 and 1158 DF, p-value: 0.007176
# Extracting and interpreting coefficients
cat("\nCoefficient interpretation:\n")
cat("This means death rate changes by", round(coef(model_simple)[2], 3), "per 100,000 per year\n")
This means death rate changes by -0.702 per 100,000 per year
Clustering (k-means, hierarchical) for pattern discovery (if included): Unsupervised clustering algorithms partition observations into homogeneous groups based on similarity metrics, revealing natural taxonomies and segment structures within data. K-means optimizes within-cluster variance through iterative assignment, while hierarchical methods create nested groupings that can be visualized through dendrograms to inform cluster selection.
# Loading necessary libraries
library(dplyr)
library(ggplot2)

# Preparing data for clustering - average death rate by country
df_cluster <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_death_rate = mean(death_rate, na.rm = TRUE),
    trend = cor(year, death_rate, use = "complete.obs")
  ) %>%
  na.omit()

# K-means clustering
set.seed(123)
kmeans_result <- kmeans(df_cluster[, c("mean_death_rate", "trend")], centers = 3)

# Adding cluster assignment to data
df_cluster$cluster <- as.factor(kmeans_result$cluster)

# Visualizing clusters
ggplot(df_cluster, aes(x = mean_death_rate, y = trend, color = cluster)) +
  geom_point(size = 3) +
  labs(
    title = "K-means Clustering of Countries",
    x = "Mean Death Rate",
    y = "Time Trend (correlation)",
    color = "Cluster"
  ) +
  theme_minimal()
Distinguish association vs. causation: Statistical association measures correlation between variables without implying directionality, whereas causal inference attempts to establish that changes in one variable directly produce changes in another. Demonstrating causality requires careful consideration of temporal precedence, theoretical mechanisms, and elimination of alternative explanations through research design and statistical controls.
# Loading necessary library
library(dplyr)

# Example: Association between year and death rate
association <- cor(df_clean$year, df_clean$death_rate, use = "complete.obs")
cat("Association (correlation) between year and death rate:", round(association, 3), "\n")
Association (correlation) between year and death rate: -0.038
cat("\nThis shows association, but does NOT prove that time causes death rate changes.\n")
This shows association, but does NOT prove that time causes death rate changes.
Potential confounders: healthcare improvements, policy changes, etc.
# Demonstrating how confounders can affect interpretation
# Creating a hypothetical scenario
df_confound <- df_clean %>%
  mutate(
    developed = country %in% c("Germany", "France", "United Kingdom", "United States", "Japan")
  )

# Correlation in developed vs developing countries
cor_developed <- df_confound %>%
  filter(developed == TRUE) %>%
  summarise(cor = cor(year, death_rate, use = "complete.obs")) %>%
  pull(cor)

cor_developing <- df_confound %>%
  filter(developed == FALSE) %>%
  summarise(cor = cor(year, death_rate, use = "complete.obs")) %>%
  pull(cor)

cat("\nCorrelation in developed countries:", round(cor_developed, 3), "\n")
Correlation in developed countries: 0.002
cat("Correlation in developing countries:", round(cor_developing, 3), "\n")
Correlation in developing countries: -0.046
cat("\nDifferent correlations suggest development level may be a confounder.\n")
Different correlations suggest development level may be a confounder.
Model specification and confounding control: Proper model specification identifies relevant covariates and functional forms to isolate the causal effect of interest while controlling for confounding variables that influence both treatment and outcome. Omitted variable bias, measurement error, and incorrect functional forms threaten causal identification, necessitating theory-driven variable selection and specification testing.
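As a sketch of confounding control with the data used throughout this section (assuming df_clean), the year effect can be compared with and without country fixed effects; factor(country) absorbs stable country-level differences that might otherwise confound the time trend:

# Naive specification: death rate on year only
model_naive <- lm(death_rate ~ year, data = df_clean)

# Specification with country fixed effects as controls
model_controls <- lm(death_rate ~ year + factor(country), data = df_clean)

# Comparing the estimated year coefficient across the two specifications
cat("Year effect without controls:", round(coef(model_naive)["year"], 3), "\n")
cat("Year effect with country fixed effects:", round(coef(model_controls)["year"], 3), "\n")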
Assumptions: linearity, independence, homoskedasticity, exogeneity: Valid causal inference from regression requires that relationships are linear in parameters, observations are independent, error variance is constant across predictor levels, and explanatory variables are uncorrelated with the error term. Violations of these assumptions bias coefficient estimates, invalidate standard errors, and compromise hypothesis tests, requiring diagnostic assessment and remedial measures.
# Loading necessary libraries
library(ggplot2)
library(dplyr)
library(lmtest)

# Fitting a regression model
model <- lm(death_rate ~ year, data = df_clean)

# Checking assumptions through diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))

# Testing for homoskedasticity (Breusch-Pagan test)
bp_test <- bptest(model)
cat("\nBreusch-Pagan test for homoskedasticity:\n")
Breusch-Pagan test for homoskedasticity:
cat("p-value:", bp_test$p.value, "\n")
p-value: 0.2263175
if (bp_test$p.value < 0.05) {
  cat("Evidence of heteroskedasticity (non-constant variance)\n")
} else {
  cat("No strong evidence against homoskedasticity\n")
}
No strong evidence against homoskedasticity
# Checking for normality of residuals
shapiro_test <- shapiro.test(residuals(model)[1:5000])  # Shapiro test limited to 5000 obs
cat("\nShapiro-Wilk test for normality of residuals:\n")
Shapiro-Wilk test for normality of residuals:
cat("p-value:", shapiro_test$p.value, "\n")
p-value: 9.063082e-29
Interpretation of coefficients and marginal effects: Regression coefficients represent the expected change in the dependent variable associated with a one-unit change in the independent variable, holding other factors constant. Marginal effects extend this interpretation to non-linear models and interaction terms, quantifying how the impact of one variable varies across levels of another variable.
# Loading necessary library
library(dplyr)

# Creating interaction term
df_interaction <- df_clean %>%
  mutate(
    recent_period = ifelse(year >= 2020, 1, 0),
    year_centered = year - mean(year)
  )

# Model with interaction
model_interaction <- lm(death_rate ~ year_centered * recent_period, data = df_interaction)
summary(model_interaction)
Call:
lm(formula = death_rate ~ year_centered * recent_period, data = df_interaction)
Residuals:
Min 1Q Median 3Q Max
-104.90 -54.62 -13.11 28.45 318.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.948 2.607 47.541 < 2e-16 ***
year_centered -2.313 0.807 -2.866 0.00423 **
recent_period 56.979 22.746 2.505 0.01238 *
year_centered:recent_period -6.947 4.376 -1.588 0.11267
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 69.25 on 1158 degrees of freedom
Multiple R-squared: 0.01215, Adjusted R-squared: 0.00959
F-statistic: 4.747 on 3 and 1158 DF, p-value: 0.002694
cat(" -> In 2020+, the year effect changes by an additional", round(coefs["year_centered:recent_period"], 3), "\n")
-> In 2020+, the year effect changes by an additional -6.947
cat(" -> Total effect in 2020+:", round(coefs["year_centered"] + coefs["year_centered:recent_period"], 3), "\n")
-> Total effect in 2020+: -9.26
Sensitivity and robustness checks: Assessing the stability of causal conclusions across alternative model specifications, sample restrictions, and analytical choices strengthens inference and identifies fragile results dependent on specific assumptions. Techniques include varying control variables, testing different functional forms, examining subgroup effects, and conducting placebo tests to validate identification strategies.
# Loading necessary library
library(dplyr)

# Base model
model1 <- lm(death_rate ~ year, data = df_clean)

# Alternative specification 1: Adding squared term
df_robust <- df_clean %>%
  mutate(year_sq = year^2)
model2 <- lm(death_rate ~ year + year_sq, data = df_robust)

# Alternative specification 2: Different time periods
df_subset1 <- df_clean %>%
  filter(year >= 2015)
model3 <- lm(death_rate ~ year, data = df_subset1)

df_subset2 <- df_clean %>%
  filter(year < 2020)
model4 <- lm(death_rate ~ year, data = df_subset2)

# Comparing coefficients across models
cat("Robustness Check: Year Coefficient Across Specifications\n")
Robustness Check: Year Coefficient Across Specifications
# or combining with dplyr (assuming `imr_df` with infant mortality rates and
# `hdi_df` with HDI values have been loaded beforehand)
combined_df <- inner_join(imr_df, hdi_df, by = c("name" = "country")) %>%
  select(Country = name, IMR = `deaths/1`, HDI = HumanDevelopmentIndex_2024)

# Creating scatter plot
plot(combined_df$HDI, combined_df$IMR,
     main = "Influence of HDI (Human Development Index) on IMR (Infant Mortality Rate)",
     sub = "HDI = Independent Variable / IMR = Dependent Variable",
     ylab = "IMR", xlab = "HDI")

# Adding regression line
model <- lm(combined_df$IMR ~ combined_df$HDI)
abline(model, col = "red")
# Showing summary of the model
summary(model)
Call:
lm(formula = combined_df$IMR ~ combined_df$HDI)
Residuals:
Min 1Q Median 3Q Max
-24.387 -5.474 -0.017 4.470 54.530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.893 3.623 27.57 <2e-16 ***
combined_df$HDI -107.103 4.775 -22.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.723 on 179 degrees of freedom
Multiple R-squared: 0.7376, Adjusted R-squared: 0.7361
F-statistic: 503.2 on 1 and 179 DF, p-value: < 2.2e-16
Conclusion and interpretation of the results:
The fitted linear model (IMR ~ HDI) shows a very strong, negative association: the HDI coefficient is −107.10 (SE = 4.78, t = −22.43, p < 2e-16).
This means that a one-unit increase in HDI (note: HDI ranges roughly 0–1) is associated with a predicted decrease in infant mortality of about 107 deaths per 1,000 live births; equivalently, a more realistic 0.1-point increase in HDI corresponds to roughly 11 fewer deaths per 1,000 live births.
Practical meaning: Higher human development (better education, income and health components of HDI) is strongly associated with lower infant mortality.
Caveats and diagnostics to check: Correlation ≠ causation; this is observational data. Confounding (e.g., health spending, access to care, urbanization) could drive both HDI and IMR. Avoid causal claims without further analysis.
Summary: HDI and IMR are strongly negatively associated; higher HDI is linked to substantially lower infant mortality (HDI explains about 74% of the variation in IMR here), but confirmatory causal analysis and diagnostic checks are needed before policy attribution.
Example of a heatmap visualization:
# Loading necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)

# Preparing data for heatmap
heatmap_data <- df_clean %>%
  filter(year >= 2010 & year <= 2023) %>%
  group_by(country, year) %>%
  summarise(mean_death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop")

# Creating heatmap
ggplot(heatmap_data, aes(x = year, y = country, fill = mean_death_rate)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma", na.value = "grey50") +
  labs(
    title = "Heatmap of Mean Preventable Death Rates by Country (2010-2023)",
    x = "Year",
    y = "Country",
    fill = "Mean Death Rate"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 6),
    plot.title = element_text(face = "bold")
  )
Cuadrado-Gallego, J. J., Y. Demchenko, J. G. Pérez, et al. (2023). “Data Analytics: A Theoretical and Practical View from the EDISON Project”. In: Data Analytics: A Theoretical and Practical View from the EDISON Project, pp. 1-477. DOI: 10.1007/978-3-031-39129-3/COVER.
Kumar, U. D. (2017). “Business analytics: The science of data-driven decision making.”
Lee, K., N. G. Weiskopf, and J. Pathak (2018). “A framework for data quality assessment in clinical research datasets”. In: AMIA Annual Symposium Proceedings, pp. 1080–1089. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977591/.
Weiskopf, N. G. and C. Weng (2016). “Data quality assessment framework to assess electronic medical record data for use in research”. In: International Journal of Medical Informatics 90, pp. 40–47. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC12335082/.
Weiskopf, N. G. and C. Weng (2013). “Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research”. In: Journal of the American Medical Informatics Association 20.1, pp. 144–151. DOI: 10.1136/amiajnl-2011-000681.
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, et al. (2016). “The FAIR guiding principles for scientific data management and stewardship”. In: Scientific Data 3.160018. URL: https://www.nature.com/articles/sdata.2016.18.