Hochschule Fresenius - University of Applied Sciences
Email: benjamin.gross@ext.hs-fresenius.de
Website: https://drbenjamin.github.io
Published
05.11.2025 12:39
Abstract
This document provides the course material for Data Science and Data Analytics (B. A. – International Business Management). Upon successful completion of the course, students will be able to: recognize important technological and methodological advancements in data science and distinguish between descriptive, predictive, and prescriptive analytics; demonstrate proficiency in classifying data and variables, collecting and managing data, and conducting comprehensive data evaluations; utilize R for effective data manipulation, cleaning, visualization, outlier detection, and dimensionality reduction; conduct sophisticated data exploration and mining techniques (including PCA, Factor Analysis, and Regression Analysis) to discover underlying patterns and inform decision-making; analyze and interpret causal relationships in data using regression analysis; evaluate and organize the implementation of a data analysis project in a business environment; and communicate the results and effects of a data analysis project in a structured way.
Scope and Nature of Data Science
Let’s start this course with some definitions and context.
Definition of Data Science:
The field of Data Science concerns techniques for extracting knowledge from diverse data, with a particular focus on ‘big’ data exhibiting ‘V’ attributes such as volume, velocity, variety, value and veracity.
Maneth & Poulovassilis (2016)
Definition of Data Analytics:
Data analytics is the systematic process of examining data using statistical, computational, and domain-specific methods to extract insights, identify patterns, and support decision-making. It combines competencies in data handling, analysis techniques, and domain knowledge to generate actionable outcomes in organizational contexts (Cuadrado-Gallego et al., 2023).
Definition of Business Analytics:
Business analytics is the science of posing and answering data questions related to business. Business analytics has rapidly expanded in the last few years to include tools drawn from statistics, data management, data visualization, and machine learning. There is increasing emphasis on big data handling to assimilate the advances made in data sciences. As is often the case with applied methodologies, business analytics has to be soundly grounded in applications in various disciplines and business verticals to be valuable. The bridge between the tools and the applications are the modeling methods used by managers and researchers in disciplines such as finance, marketing, and operations.
Pochiraju & Seshadri (2019)
There are many roles in the data science field, including (but not limited to):
[Figure: overview of data science roles. Source: LinkedIn]
For skills and competencies required for data science activities, see Skills Landscape.
Defining Data Science as an Academic Discipline
Data science emerges as an interdisciplinary field that synthesizes methodologies and insights from multiple academic domains to extract knowledge and actionable insights from data. As an academic discipline, data science represents a convergence of computational, statistical, and domain-specific expertise that addresses the growing need for data-driven decision-making in various sectors.
Data science draws from and interacts with multiple foundational disciplines:
Informatics / Information Systems:
Informatics provides the foundational understanding of information processing, storage, and retrieval systems that underpin data science infrastructure. It encompasses database design, data modeling, information architecture, and system integration principles essential for managing large-scale data ecosystems. Information systems contribute knowledge about organizational data flows, enterprise architectures, and the sociotechnical aspects of data utilization in business contexts.
Computer Science (algorithms, data structures, systems design):
Computer science provides the computational foundation for data science through algorithm design, complexity analysis, and efficient data structures. Core contributions include machine learning algorithms, distributed computing paradigms, database systems, and software engineering practices. System design principles enable scalable data processing architectures, while computational thinking frameworks guide algorithmic problem-solving approaches essential for data-driven solutions.
Mathematics (linear algebra, calculus, discrete mathematics):
Mathematics provides the theoretical backbone for data science through linear algebra (matrix operations, eigenvalues, vector spaces), calculus (derivatives, gradients, optimization), and discrete mathematics (graph theory, combinatorics). These mathematical foundations enable dimensionality reduction techniques, gradient-based optimization algorithms, statistical modeling, and the rigorous formulation of machine learning problems (see the figure below). Mathematical rigor ensures the validity and interpretability of analytical results.
Statistics & Econometrics (statistical inference, causal analysis):
Statistics provides the methodological framework for data analysis through hypothesis testing, confidence intervals, regression analysis, and experimental design. Econometrics contributes advanced techniques for causal inference, time series analysis, and handling observational data challenges such as endogeneity and selection bias. These disciplines ensure rigorous uncertainty quantification, model validation, and the ability to draw reliable conclusions from data while understanding limitations and assumptions.
Social Science & Behavioral Sciences (contextual interpretation, experimental design):
Social and behavioral sciences contribute essential understanding of human behavior, organizational dynamics, and contextual factors that influence data generation and interpretation. These disciplines provide expertise in experimental design, survey methodology, ethical considerations, and the social implications of data-driven decisions. They ensure that data science applications consider human factors, cultural context, and societal impact while maintaining ethical standards in data collection and analysis.
[Figure: disciplines contributing to data science. Source: LinkedIn]
The interdisciplinary nature of data science requires practitioners to develop competencies across these domains while maintaining awareness of how different methodological traditions complement and inform each other. This multidisciplinary foundation enables data scientists to approach complex problems with both technical rigor and contextual understanding, ensuring that analytical solutions are both technically sound and practically relevant.
Significance of Business Data Analysis for Decision-Making
Business data analysis has evolved from a supporting function to a critical strategic capability that fundamentally transforms how organizations make decisions, allocate resources, and compete in modern markets. The systematic application of analytical methods to business data enables evidence-based decision-making that reduces uncertainty, improves operational efficiency, and creates sustainable competitive advantages.
Strategic Decision-Making Framework
Business data analysis provides a structured approach to strategic decision-making through multiple analytical dimensions:
Evidence-Based Strategic Planning: Data analysis supports long-term strategic decisions by providing empirical evidence about market trends, competitive positioning, and organizational capabilities. Statistical analysis of historical performance data, market research, and competitive intelligence enables organizations to formulate strategies grounded in quantifiable evidence rather than intuition alone.
Risk Assessment and Mitigation: Advanced analytical techniques enable comprehensive risk evaluation across operational, financial, and strategic dimensions. Monte Carlo simulations, scenario analysis, and predictive modeling help organizations quantify potential risks and develop contingency plans based on probabilistic assessments of future outcomes.
Resource Allocation Optimization: Data-driven resource allocation models leverage optimization algorithms and statistical analysis to maximize return on investment across different business units, projects, and initiatives. Linear programming, integer optimization, and multi-criteria decision analysis provide frameworks for allocating limited resources to achieve optimal organizational outcomes.
Operational Decision Support
At the operational level, business data analysis transforms day-to-day decision-making through real-time insights and systematic performance measurement:
Performance Measurement and Continuous Improvement: Key Performance Indicators (KPIs) and statistical process control methods enable organizations to monitor operational efficiency, quality metrics, and customer satisfaction in real-time. Time series analysis, control charts, and regression analysis identify trends, anomalies, and improvement opportunities that drive continuous operational enhancement.
Forecasting and Demand Planning: Statistical forecasting models using techniques such as ARIMA, exponential smoothing, and machine learning algorithms enable accurate demand prediction for inventory management, capacity planning, and supply chain optimization. These analytical approaches reduce uncertainty in operational planning while minimizing costs associated with overstock or stockouts.
Customer Analytics and Personalization: Advanced customer analytics leverage segmentation analysis, predictive modeling, and behavioral analytics to understand customer preferences, predict churn, and optimize retention strategies. Clustering algorithms, logistic regression, and recommendation systems enable personalized customer experiences that increase satisfaction and loyalty.
Tactical Decision Integration
Business data analysis bridges strategic planning and operational execution through tactical decision support:
Pricing Strategy Optimization: Price elasticity analysis, competitive pricing models, and revenue optimization techniques enable dynamic pricing strategies that maximize profitability while maintaining market competitiveness. Regression analysis, A/B testing, and econometric modeling provide empirical foundations for pricing decisions.
Market Intelligence and Competitive Analysis: Data analysis transforms market research and competitive intelligence into actionable insights through statistical analysis of market trends, customer behavior, and competitive positioning. Multivariate analysis, factor analysis, and time series forecasting identify market opportunities and competitive threats.
Financial Performance Analysis: Financial analytics encompassing ratio analysis, variance analysis, and predictive financial modeling enable organizations to assess financial health, identify cost reduction opportunities, and optimize capital structure decisions. Statistical analysis of financial data supports both internal performance evaluation and external stakeholder communication.
Contemporary Analytical Capabilities
Modern business data analysis capabilities extend traditional analytical methods through integration of advanced technologies and methodologies:
Real-Time Analytics and Decision Support: Stream processing, event-driven analytics, and real-time dashboards enable immediate response to changing business conditions. Complex event processing and real-time statistical monitoring support dynamic decision-making in fast-paced business environments.
Predictive and Prescriptive Analytics: Machine learning algorithms, neural networks, and optimization models enable organizations to not only predict future outcomes but also recommend optimal actions. These advanced analytical capabilities support automated decision-making and strategic scenario planning.
Data-Driven Innovation: Analytics-driven innovation leverages data science techniques to identify new business opportunities, develop innovative products and services, and create novel revenue streams. Advanced analytics enable organizations to discover hidden patterns, correlations, and insights that drive innovation and competitive differentiation.
The significance of business data analysis for decision-making extends beyond technical capabilities to encompass organizational transformation, cultural change, and strategic competitive positioning. Organizations that successfully integrate analytical capabilities into their decision-making processes achieve superior performance outcomes, enhanced agility, and sustainable competitive advantages in increasingly data-driven markets.
For comprehensive coverage of business data analysis methodologies and applications, see Advanced Business Analytics and the analytical foundations outlined in Evans (2020).
For open access resources, visit Kaggle, a platform for data science competitions and datasets.
Emerging Trends
Key technological and methodological developments shaping the data landscape:
Evolution of computing and data processing architectures.
Digitalization of processes and platforms.
Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Big Data ecosystems (volume, velocity, variety, veracity, value).
[Figure: Big Data characteristics (the five V's). Source: LinkedIn]
Internet of Things (IoT) and sensor-driven data generation.
Data analytic competencies encompass the ability to apply machine learning, data mining, statistical methods, and algorithmic approaches to extract meaningful patterns, insights, and predictions from complex datasets. They include proficiency in exploratory data analysis, feature engineering, model selection, evaluation, and validation. These skills ensure rigorous interpretation of data, support evidence-based decision-making, and enable the development of robust analytical solutions adaptable to diverse health, social, and technological contexts.
Types of Data
The structure and temporal dimension of data fundamentally influence analytical approaches and statistical methods. Understanding data types enables researchers to select appropriate modeling techniques and interpret results within proper contextual boundaries.
Cross-sectional data captures observations of multiple entities (individuals, firms, countries) at a single point in time. This structure facilitates comparative analysis across units but does not track changes over time. Cross-sectional studies are particularly valuable for examining relationships between variables at a specific moment and testing hypotheses about population characteristics.
Time-series data records observations of a single entity across multiple time points, enabling the analysis of temporal patterns, trends, seasonality, and cyclical behaviors. Time-series methods account for autocorrelation and temporal dependencies, supporting forecasting and dynamic modeling. This data structure is essential for economic indicators, financial markets, and environmental monitoring.
Panel (longitudinal) data combines both dimensions, tracking multiple entities over time. This structure offers substantial analytical advantages by controlling for unobserved heterogeneity across entities and modeling both within-entity and between-entity variation. Panel data methods support causal inference through fixed-effects and random-effects models, difference-in-differences estimation, and dynamic panel specifications.
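To make these three structures concrete, here is a minimal sketch in base R; the firm names, years, and revenue figures are invented purely for illustration.

# Cross-sectional: several firms observed in a single year (hypothetical values)
cross_sectional <- data.frame(
  firm = c("A", "B", "C"),
  revenue_2023 = c(120, 95, 150)
)

# Time series: one firm observed over several years
time_series <- data.frame(
  year = 2019:2023,
  revenue = c(80, 90, 100, 110, 120)
)

# Panel: several firms observed over several years (long format)
panel <- data.frame(
  firm = rep(c("A", "B"), each = 3),
  year = rep(2021:2023, times = 2),
  revenue = c(100, 110, 120, 85, 90, 95)
)
print(panel)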
[Figure: cross-sectional, time-series, and panel data. Source: vecteezy.com]
Additional data structures:
Geo-referenced / spatial data is data associated with specific geographic locations, enabling spatial analysis and visualization. Techniques such as Geographic Information Systems (GIS), spatial autocorrelation, and spatial regression models are employed to analyze patterns and relationships in spatially distributed data.
[Figure: geo-referenced / spatial data. Source: slingshotsimulations.com]
Streaming / real-time data is continuously generated data that is processed and analyzed in real-time. This data structure is crucial for applications requiring immediate insights, such as fraud detection, network monitoring, and real-time recommendation systems.
Types of Variables
Continuous (interval/ratio) data is measured on a scale with meaningful intervals and a true zero point (ratio) or arbitrary zero point (interval). Examples include height, weight, temperature, and income. Continuous variables support a wide range of statistical analyses, including regression and correlation.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning continuous data to a data frame and displaying it as a table
continuous_data <- data.frame(
  Height_cm = c(170, 165, 180, 175, 160, 185, 172, 168, 178, 182),
  Weight_kg = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original table
print(continuous_data)
# Practicing:
# 1. Assign continuous data to a data frame and display it as a table
# 2. Order the data frame by a specific column
Count data represents the number of occurrences of an event or the frequency of a particular characteristic. Count variables are typically non-negative integers and can be analyzed using Poisson regression or negative binomial regression.
# Assigning count data to a data frame and displaying it as a table
count_data <- data.frame(
  Height = c(170, 165, 182, 175, 165, 175, 175, 168, 175, 182),
  Weight = c(70, 60, 80, 75, 55, 90, 68, 62, 78, 85)
)

# Displaying the original data as a table
print(count_data)
# Arranging the rows and counting the occurrences of each Height
ordered_count_data <- count_data %>%
  arrange(desc(Height), Weight) %>%
  count(Height)

# Displaying the ordered count data
print(ordered_count_data)
Height n
1 165 2
2 168 1
3 170 1
4 175 4
5 182 2
# Practicing:
# 1. Assign count data to a data frame and display it as a table
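Since the text above mentions Poisson regression as a typical model for count outcomes, here is a minimal sketch on invented data (weekly complaint counts per store as a function of foot traffic); the variable names and values are hypothetical.

# Hypothetical count data: weekly complaints per store and weekly foot traffic (in 1,000s)
complaints_data <- data.frame(
  complaints = c(2, 0, 3, 1, 4, 6, 2, 5, 1, 3),
  traffic = c(5, 3, 6, 4, 8, 10, 5, 9, 3, 7)
)

# Fitting a Poisson regression of complaint counts on foot traffic
poisson_model <- glm(complaints ~ traffic, data = complaints_data, family = poisson)

# Displaying the model summary
summary(poisson_model)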
Ordinal data represents categories with a meaningful order or ranking but no consistent interval between categories. Examples include survey responses (e.g., Likert scales) and socioeconomic status (e.g., low, medium, high). Ordinal variables can be analyzed using non-parametric tests or ordinal regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning ordinal data to a data frame and displaying it as a table
ordinal_data <- data.frame(
  Response = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
  Value = c(5, 4, 3, 2, 1)
)

# Displaying the original table
print(ordinal_data)
# Practicing:
# 1. Assign ordinal data to a data frame and display it as a table
# 2. Order the data frame by the ordinal value
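A small sketch of how ordinal responses can be encoded as an ordered factor in base R so that the ranking is respected in later analyses; it reuses the Likert labels from the example above.

# Encoding the Likert responses as an ordered factor
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
ordinal_data$Response <- factor(ordinal_data$Response, levels = likert_levels, ordered = TRUE)

# Ordered comparisons now work as expected
ordinal_data$Response > "Neutral"

# Sorting the data frame by the ordinal ranking rather than alphabetically
ordinal_data[order(ordinal_data$Response), ]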
Categorical (nominal / binary) data represents distinct categories without any inherent order. Nominal variables have two or more categories (e.g., gender, race, or marital status), while binary variables have only two categories (e.g., yes/no, success/failure). Categorical variables can be analyzed using chi-square tests or logistic regression.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning categorical data to a data frame and displaying it as a table
categorical_data <- data.frame(
  Event = c("A", "C", "B", "D", "E"),
  Category = c("Type1", "Type2", "Type1", "Type3", "Type2")
)

# Displaying the original table
print(categorical_data)
Event Category
1 A Type1
2 C Type2
3 B Type1
4 D Type3
5 E Type2
# Ordering the data frame by Event
categorical_data <- categorical_data %>%
  arrange(Event)

# Displaying the table
print(categorical_data)
Event Category
1 A Type1
2 B Type1
3 C Type2
4 D Type3
5 E Type2
# Practicing:
# 1. Assign categorical data to a data frame and display it as a table
# 2. Order the data frame by a specific column
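Because nominal variables are typically analyzed with frequency-based tests, here is a minimal sketch of a chi-square test of independence on invented purchase data; the segments and outcomes are hypothetical, and with cell counts this small a real analysis would usually prefer fisher.test().

# Hypothetical nominal data: customer segment and purchase decision
segment <- c("New", "New", "Returning", "Returning", "New", "Returning", "New", "Returning")
purchased <- c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No")

# Building the contingency table
contingency <- table(segment, purchased)
print(contingency)

# Chi-square test of independence (small samples would usually call for fisher.test())
chisq.test(contingency)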
Compositional or hierarchical structures represent data with a part-to-whole relationship or nested categories. Examples include demographic data (e.g., age groups within gender) and geographical data (e.g., countries within continents). Compositional data can be analyzed using techniques such as hierarchical clustering or multilevel modeling.
# Loading the dplyr library for data manipulation
library(dplyr)

# Assigning hierarchical data to a data frame and displaying it as a table
hierarchical_data <- data.frame(
  Country = c("USA", "Canada", "USA", "Canada", "Mexico"),
  State_Province = c("California", "Ontario", "Texas", "Quebec", "Jalisco"),
  Population_Millions = c(39.5, 14.5, 29.0, 8.5, 8.3)
)

# Displaying the table
print(hierarchical_data)
Country State_Province Population_Millions
1 USA California 39.5
2 Canada Ontario 14.5
3 USA Texas 29.0
4 Canada Quebec 8.5
5 Mexico Jalisco 8.3
# Ordering the data frame by Country and then State_Province
hierarchical_data <- hierarchical_data %>%
  arrange(Country, State_Province)

# Displaying the ordered data frame
print(hierarchical_data)
Country State_Province Population_Millions
1 Canada Ontario 14.5
2 Canada Quebec 8.5
3 Mexico Jalisco 8.3
4 USA California 39.5
5 USA Texas 29.0
# Grouping the data by Country and summarizing total population
hierarchical_data_grouped <- hierarchical_data %>%
  group_by(Country) %>%
  summarise(Total_Population = sum(Population_Millions))

# Displaying the grouped data frame
print(hierarchical_data_grouped)
# A tibble: 3 × 2
Country Total_Population
<chr> <dbl>
1 Canada 23
2 Mexico 8.3
3 USA 68.5
# Practicing:
# 1. Assign hierarchical data to a data frame and display it as a table
# 2. Order the data frame by multiple columns
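To illustrate the part-to-whole aspect of compositional data, the short sketch below computes each state's or province's share of its country's total population, reusing hierarchical_data from the example above.

# Computing each State_Province's share of its country's total population
hierarchical_shares <- hierarchical_data %>%
  group_by(Country) %>%
  mutate(Share = Population_Millions / sum(Population_Millions)) %>%
  ungroup()

# Displaying the compositional view
print(hierarchical_shares)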
[Figure: types of data and variables. Source: collegedisha.com]
Some small datasets to start with
A custom R package ourdata has been created to provide some small datasets (and also some helper R functions) for practice. You can install it from GitHub using the following commands:
# Installing the `ourdata` R package from GitHub (requires the devtools package)
devtools::install_github("DrBenjamin/ourdata")
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/Rtmp7gKUFu/remotes3b727f7dfc9e/DrBenjamin-ourdata-3403a65/DESCRIPTION’ ... OK
* preparing ‘ourdata’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘ourdata_0.5.0.tar.gz’
Using the package and exploring its documentation:
# Loading the package
library(ourdata)

# Opening the help of the package
??ourdata

# Showing the welcome message
ourdata()
This is the R package `ourdata` used for Data Science courses at the Fresenius University of Applied Sciences.
Type `help(ourdata)` to display the content.
Type `ourdata_website()` to open the package website.
Have fun in the course!
Benjamin Gross
# Opening the help page for the `combine` function from the `ourdata` R package
help(combine)
Help on topic 'combine' was found in the following packages:
Package Library
dplyr /home/runner/R-library
ourdata /home/runner/R-library
Using the first match ...
# Using the `combine` function from the `ourdata` R package to combine vectors into a data frame
ourdata::combine(kirche$Jahr, koelsch$Jahr, kirche$Austritte, koelsch$Koelsch)
'data.frame': 4 obs. of 3 variables:
$ C1: chr "2017" "2018" "2019" "2020"
$ C2: num 364711 437416 539509 441390
$ C3: num 1.87e+08 1.91e+08 1.79e+08 1.69e+08
We collect data from the OECD (Organisation for Economic Co-operation and Development), an international organization that works to build better policies for better lives. The dataset contains many columns (variables) with non-informative data and needs to be cleaned (wrangled) before analysis. First, load the data into R from the CSV file:
# Reading the dataset from a CSV file
preventable_deaths <- read.csv(
  "./topics/data/OECD_Preventable_Deaths.csv",
  stringsAsFactors = FALSE
)
or use the dataset directly from the ourdata R package:
# Reading the dataset from the `ourdata` R package
library(ourdata)
preventable_deaths <- oecd_preventable
Next, we explore the data:
# Viewing the structure of the dataset
str(preventable_deaths)

# Viewing the first few rows
head(preventable_deaths)

# Checking dimensions
dim(preventable_deaths)

# Viewing column names
colnames(preventable_deaths)

# Summary statistics
summary(preventable_deaths)

# Checking for missing values
colSums(is.na(preventable_deaths))
Now we can start cleaning the data by removing non-informative columns and rows with missing values:
# Loading necessary library
library(dplyr)

# Selecting relevant columns for analysis
df_clean <- preventable_deaths %>%
  select(REF_AREA, Reference.area, TIME_PERIOD, OBS_VALUE) %>%
  rename(
    country_code = REF_AREA,
    country = Reference.area,
    year = TIME_PERIOD,
    death_rate = OBS_VALUE
  )

# Converting year to numeric
df_clean$year <- as.numeric(df_clean$year)

# Converting death_rate to numeric (handling empty strings)
df_clean$death_rate <- as.numeric(df_clean$death_rate)

# Removing rows with missing death rates
df_clean <- df_clean %>%
  filter(!is.na(death_rate))

# Viewing the cleaned data structure
str(df_clean)
'data.frame': 1162 obs. of 4 variables:
$ country_code: chr "AUS" "AUS" "AUS" "AUS" ...
$ country : chr "Australia" "Australia" "Australia" "Australia" ...
$ year : num 2010 2011 2012 2013 2014 ...
$ death_rate : num 110 109 105 105 107 108 103 103 101 104 ...
# Summary of the cleaned data
summary(df_clean)
country_code country year death_rate
Length:1162 Length:1162 Min. :2010 Min. : 34.0
Class :character Class :character 1st Qu.:2013 1st Qu.: 75.0
Mode :character Mode :character Median :2016 Median :116.0
Mean :2016 Mean :128.7
3rd Qu.:2019 3rd Qu.:156.0
Max. :2023 Max. :453.0
These summary statistics give a first overview of the cleaned data. To visualize it, we can create a distribution plot of preventable death rates:
# Loading necessary library
library(ggplot2)

# Plotting the distribution of death rates
ggplot(df_clean, aes(x = death_rate)) +
  geom_histogram(bins = 30, fill = "#9B59B6", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Preventable Death Rates",
    x = "Death Rate per 100,000",
    y = "Frequency",
    caption = "Source: OECD Health Statistics"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )
Conceptual Framework: Knowledge & Understanding of Data
Clarify analytical purpose and domain context to guide data selection and interpretation.
Define entities, observational units, and identifiers to ensure accurate data representation.
Align business concepts with data structures for meaningful analysis.
Data Collection
Data collection forms the foundational stage of any data science project, requiring systematic approaches to gather information that aligns with research objectives and analytical requirements. As outlined in modern statistical frameworks, effective data collection strategies must balance methodological rigor with practical constraints (Çetinkaya-Rundel & Hardin, 2021).
Methods of Data Collection
Core Data Collection Competencies
The competencies required for effective data collection encompass both technical proficiency and methodological understanding (see Data Collection Competencies.pdf):
Source Identification and Assessment: Systematically identify internal and external data sources, evaluating their relevance, quality, and accessibility for the analytical objectives.
Data Acquisition Methods: Implement appropriate collection techniques including APIs (for instance see Spotify API tutorial and Postman Spotify tutorial), database queries, survey instruments, sensor networks, web scraping, and third-party vendor partnerships, ensuring methodological alignment with research design.
For the two projects below, here are some data source recommendations (automatically created by Perplexity AI Deep Research):
Quality and Governance Framework: Establish protocols for assessing data provenance, licensing agreements, ethical compliance, and regulatory requirements (GDPR, industry-specific standards).
Methodological Considerations: Apply principles from research methodology to ensure data collection approaches support valid statistical inference and minimize bias introduction during the acquisition process.
Contemporary Data Collection Landscape
Modern data collection operates within an increasingly complex ecosystem characterized by diverse data types, real-time requirements, and distributed sources. The integration of traditional survey methods with emerging IoT sensors, social media APIs, and automated data pipelines requires comprehensive competency frameworks that address both technical implementation and methodological validity.
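As a minimal illustration of programmatic acquisition, the sketch below requests JSON from a REST endpoint with the httr and jsonlite packages and converts the response to a data frame; the URL is a placeholder rather than a real service, and error handling is reduced to a single status check.

# Loading packages for HTTP requests and JSON parsing
library(httr)
library(jsonlite)

# Hypothetical REST endpoint (placeholder URL, replace with a real data source)
endpoint <- "https://example.org/api/v1/indicators?country=DEU"

# Sending the GET request
response <- GET(endpoint)

# Converting the JSON body to a data frame only if the request succeeded
if (status_code(response) == 200) {
  api_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  str(api_data)
} else {
  warning("Request failed with status ", status_code(response))
}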
Data Management
Data management in data science curricula requires a coherent, multi-faceted framework that spans data quality, FAIR stewardship, master data governance, privacy/compliance, and modern architectures. Data quality assessment and governance define objective metrics (completeness, accuracy, consistency, plausibility, conformance) and governance processes that balance automated checks with human oversight. FAIR data principles provide a practical blueprint for metadata-rich stewardship to support findability, accessibility, interoperability, and reuse through machine-actionable metadata and persistent identifiers. Master Data Management ensures clean, trusted core entities across systems via governance and harmonization. Data privacy, security, and regulatory compliance embed responsible data handling and risk management, guided by purpose limitation, data minimization, accuracy, storage limitations, integrity/confidentiality, and accountability. Emerging trends in cloud-native data platforms, ETL/ELT, data lakes/lakehouses, and broader metadata automation shape scalable storage/compute and governance, enabling reproducible analytics and ML workflows. Together, these strands underpin trustworthy, discoverable, and compliant data inputs for research and coursework (Weiskopf & Weng, 2016; Wilkinson et al., 2016; GO FAIR Foundation, n.d.; Semarchy, n.d.; IBM, n.d.).
Data Evaluation
Define data quality dimensions and assessment frameworks.
Distinguish validation from verification in data pipelines.
Apply statistical methods and data profiling for evaluation.
Balance automated and manual (human-in-the-loop) evaluation approaches.
Implement tools, workflows, and governance for data evaluation.
Data evaluation ensures datasets are fit-for-purpose, reliable, and trustworthy throughout the analytical lifecycle. It encompasses systematic assessment of data quality dimensions (completeness, accuracy, consistency, validity, timeliness, uniqueness), validation and verification processes, and strategic application of statistical methods and data profiling. Quality dimensions provide measurable criteria for determining whether data meets analytical requirements, while assessment frameworks translate these into actionable metrics enabling objective measurement and contextual interpretation.
Validation and verification are complementary processes essential for data integrity. Validation checks occur at entry points, preventing bad data through constraint checks, format validation, and business rule enforcement. Verification involves post-collection checks ensuring data remains accurate and consistent over time and across system boundaries, supporting reproducibility and traceability. Together, validation acts as a gatekeeper while verification provides ongoing quality assurance.
Statistical methods form the technical foundation for evaluation. Outlier detection techniques (z-score, IQR, DBSCAN) identify anomalous observations requiring investigation. Distribution checks assess whether data conforms to expected patterns, while profiling describes dataset structure, missingness patterns, and statistical properties. Regression analysis and hypothesis testing diagnose quality issues and quantify relationships between quality metrics and analytical outcomes.
Modern data evaluation balances automated and manual approaches. Automated evaluation offers speed, scalability, and consistency for large datasets through rule-based validation, statistical profiling, and machine learning-based anomaly detection. Manual evaluation contributes domain expertise, contextual understanding, and interpretative judgment that automated systems cannot replicate. Human-in-the-loop approaches combine automation’s efficiency with human interpretability, optimizing both throughput and quality.
Tools, workflows, and governance frameworks provide infrastructure for systematic evaluation across the data lifecycle. Data profiling tools (e.g., Pandas Profiling, Great Expectations, Deequ) automate quality assessment. Validation frameworks embed checks into ETL/ELT pipelines. Data lineage tracking and metadata management support traceability and impact analysis. Governance frameworks establish roles, responsibilities, and processes aligning evaluation practices with regulatory requirements and reproducibility needs.
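A minimal sketch of rule-based quality checks in R, applied to the df_clean dataset built earlier; the plausibility thresholds for year and death rate are illustrative assumptions, not OECD definitions.

# Loading necessary library
library(dplyr)

# Completeness: share of missing values per column
colMeans(is.na(df_clean))

# Uniqueness: number of duplicated country-year observations
sum(duplicated(df_clean[, c("country_code", "year")]))

# Validity: values outside illustrative plausibility ranges (assumed thresholds)
df_clean %>%
  summarise(
    invalid_year = sum(year < 2000 | year > 2025),
    invalid_rate = sum(death_rate < 0 | death_rate > 1000)
  )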
ggplot2 for layered, grammar-based visualization (see the example above).
Some more R libraries for data ingestion, date-time handling, functional iteration, text handling, and factor handling:
readr for ingestion of CSV and other flat files.
readxl for ingestion of Excel files.
lubridate for date-time handling.
purrr for functional iteration.
stringr for text handling.
forcats for factor handling.
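A small sketch showing several of these packages together on an in-memory example; the CSV text, dates, names, and priority labels are invented for illustration.

# Loading the helper libraries listed above
library(readr)
library(lubridate)
library(stringr)
library(forcats)
library(purrr)

# readr: ingesting a small CSV from an in-memory string (invented example data)
orders <- read_csv("order_date,customer,priority\n03.01.2024,anna,low\n17.02.2024,ben,high\n05.03.2024,anna,medium")

# lubridate: parsing day-month-year dates and extracting the month
orders$order_date <- dmy(orders$order_date)
orders$order_month <- month(orders$order_date, label = TRUE)

# stringr: capitalizing customer names
orders$customer <- str_to_title(orders$customer)

# forcats: turning priority into a factor with an explicit level order
orders$priority <- fct_relevel(orders$priority, "low", "medium", "high")

# purrr: iterating over the columns to count distinct values in each
map_int(orders, ~ length(unique(.x)))

print(orders)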
Data Visualization Principles
Choose encodings appropriate to variable types:
Continuous (quantitative) variables → Position, Length. Examples: x/y coordinates in scatter plots, bar heights, line positions.
Categorical (nominal) variables → Color hue, Shape, Facets. Examples: different colors for groups, point shapes, separate panels.
Ordinal variables → Ordered position, Color saturation. Examples: ordered categories on an axis, gradient colors from light to dark.
Temporal variables → Position along the x-axis, Line connections. Examples: time on the horizontal axis, connected points showing progression.
Compositional (part-to-whole) variables → Stacked position, Area. Examples: stacked bars, proportional areas.
Emphasize clarity: reduce chart junk; apply perceptual best practices:
Clarity in data visualization requires removing unnecessary elements that distract from the data while applying principles of human perception to enhance understanding.
Remove decorative elements (3D effects, shadows, gradients)
Avoid unnecessary legends when direct labeling is possible
Use position over angle for quantitative comparisons (bar charts > pie charts)
Maintain consistent scales across comparable charts
Respect aspect ratios that emphasize meaningful patterns
Choose colorblind-friendly palettes
Ensure sufficient contrast between data and background
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Creating data for demonstration
country_subset <- df_clean %>%
  filter(country %in% c("Germany", "France", "United Kingdom")) %>%
  filter(year >= 2015)

# Example: Clean, minimal visualization
ggplot(country_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Preventable Death Rates (2015-2023)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),  # Removing unnecessary gridlines
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
Support comparison, trend detection, and anomaly spotting:
Effective visualizations should facilitate three key analytical tasks: comparing values across groups, identifying trends over time, and detecting unusual patterns.
Support comparison:
Align items on common scales for direct comparison
Use small multiples (facets) for comparing across categories
Order categorical variables meaningfully (by value, alphabetically, or logically)
Keep consistent ordering across related charts
Enable trend detection:
Use connected lines for temporal data to show continuity
Add trend lines (linear, loess) to highlight overall patterns
Display sufficient time periods to establish meaningful trends
Avoid over-smoothing that hides important variations
Facilitate anomaly spotting:
Use reference lines or bands for expected ranges
Highlight outliers through color or annotation
Include context (confidence intervals, historical ranges)
Maintain consistent scales to make deviations visible
# Loading necessary libraries
library(ggplot2)
library(dplyr)

# Calculating statistics for anomaly detection
country_stats <- df_clean %>%
  group_by(country) %>%
  summarise(
    mean_rate = mean(death_rate, na.rm = TRUE),
    sd_rate = sd(death_rate, na.rm = TRUE)
  )

# Joining back to identify anomalies
df_annotated <- df_clean %>%
  left_join(country_stats, by = "country") %>%
  mutate(
    z_score = (death_rate - mean_rate) / sd_rate,
    is_anomaly = abs(z_score) > 2  # Flagging values > 2 standard deviations
  )

# Example: Visualization supporting comparison, trends, and anomaly detection
selected_countries <- c("Germany", "France", "United Kingdom", "Italy", "Spain")
df_subset <- df_annotated %>%
  filter(country %in% selected_countries)

ggplot(df_subset, aes(x = year, y = death_rate, color = country)) +
  geom_line(linewidth = 0.8) +
  geom_point(aes(size = is_anomaly, alpha = is_anomaly)) +
  scale_size_manual(values = c(1.5, 3), guide = "none") +
  scale_alpha_manual(values = c(0.6, 1), guide = "none") +
  facet_wrap(~ country, ncol = 2) +  # Small multiples for comparison
  labs(
    title = "Preventable Death Rates: Trends and Anomalies",
    subtitle = "Larger points indicate statistical anomalies (>2 SD from country mean)",
    x = "Year",
    y = "Deaths per 100,000",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold")
  )
Detecting Outliers and Anomalies
Rule-based methods (IQR, z-scores).
Robust statistics (median, MAD).
Model-based or multivariate detection (e.g., Mahalanobis distance, clustering residuals).
Distinguish errors vs. novel but valid observations.
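A short sketch of the rule-based, robust, and multivariate approaches listed above, applied to the death_rate variable from df_clean; the 1.5 x IQR fence and the |z| > 3 cut-off are common conventions rather than fixed rules.

# Extracting the variable of interest
x <- df_clean$death_rate

# Rule-based: z-scores (flagging |z| > 3)
z_scores <- (x - mean(x)) / sd(x)
sum(abs(z_scores) > 3)

# Rule-based: IQR fences (flagging values beyond 1.5 * IQR)
q <- quantile(x, c(0.25, 0.75))
iqr_fence <- c(q[1] - 1.5 * IQR(x), q[2] + 1.5 * IQR(x))
sum(x < iqr_fence[1] | x > iqr_fence[2])

# Robust: median and MAD instead of mean and standard deviation
robust_scores <- (x - median(x)) / mad(x)
sum(abs(robust_scores) > 3)

# Multivariate: Mahalanobis distance on year and death_rate
num_vars <- df_clean[, c("year", "death_rate")]
md <- mahalanobis(num_vars, colMeans(num_vars), cov(num_vars))
sum(md > qchisq(0.999, df = 2))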
Dimensionality Reduction
Motivation: mitigate multicollinearity, noise, and curse of dimensionality.
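A minimal PCA sketch with prcomp() on the built-in mtcars data, standardizing the variables first; in practice the same steps would be applied to the project dataset once enough numeric features are available.

# Principal Component Analysis on the numeric mtcars dataset (built into R)
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca_result)

# Loadings of the first two components
round(pca_result$rotation[, 1:2], 2)

# Scores of the first observations on the reduced dimensions
head(pca_result$x[, 1:2])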
Cuadrado-Gallego, J. J., Y. Demchenko, J. G. Pérez, et al. (2023). "Data Analytics: A Theoretical and Practical View from the EDISON Project", pp. 1-477. DOI: 10.1007/978-3-031-39129-3.
Kumar, U. D. (2017). “Business analytics: The science of data-driven decision making.”
Lee, K., N. G. Weiskopf, and J. Pathak (2018). "A framework for data quality assessment in clinical research datasets". In: AMIA Annual Symposium Proceedings, pp. 1080–1089. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977591/.
Weiskopf, N. G. and C. Weng (2016). “Data quality assessment framework to assess electronic medical record data for use in research”. In: International Journal of Medical Informatics 90, pp. 40–47. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC12335082/.
Weiskopf, N. G. and C. Weng (2013). "Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research". In: Journal of the American Medical Informatics Association 20.1, pp. 144–151. DOI: 10.1136/amiajnl-2011-000681.
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, et al. (2016). “The FAIR guiding principles for scientific data management and stewardship”. In: Scientific Data 3.160018. URL: https://www.nature.com/articles/sdata.2016.18.