What is a passing score on CompTIA Data+?

You need a scaled score of 675 on a 100–900 scale. It is weighted scoring, not a simple percentage of questions correct, so don't try to convert it to a percent — focus on mastering every domain. You get your pass/fail result immediately.

What are the five Data+ DA0-002 domains and their weights?

Data Analysis (24%), Data Acquisition & Preparation (22%), Data Concepts & Environments (20%), Visualization & Reporting (20%), and Data Governance, Quality & Controls (14%). Data Analysis carries the most weight, so prioritize statistics and analytical techniques.

How should I use this Data+ study guide?

Study by weight: lead with Data Analysis (24%) and Data Acquisition & Preparation (22%) — master descriptive statistics, correlation vs. causation, and data cleansing first. Read each module, take the checkpoint quiz, then drill gaps with our free practice test and flashcards.

What changed from DA0-001 to DA0-002?

DA0-002 rebalanced the blueprint: it renamed the old 'Data Mining' domain to 'Data Acquisition & Preparation,' broadened 'Visualization' to 'Visualization & Reporting,' and added current content on cloud data environments, AI, and modern governance. Study to the DA0-002 objectives.

Do I need experience or prerequisites for Data+?

There are no required prerequisites. CompTIA recommends 18–24 months in a data analyst or similar role, with exposure to databases, analytical tools, basic statistics, and data visualization. Anyone can register, but the recommended background makes the exam much more manageable.

How long is Data+ valid and how do I renew it?

The certification is valid for three years. You renew through CompTIA's Continuing Education program — earning 30 continuing-education units (CEUs) over the three years, or by passing a higher-level CompTIA certification.

Is this Data+ study guide free?

Yes — this study guide, the module checkpoints, the glossary, the concept questions, the practice test, and the flashcards are 100% free with no account required.

How hard is the CompTIA Data+ exam?

Data+ is considered moderately challenging — the difficulty is breadth (data concepts, statistics, mining, visualization, and governance) plus performance-based questions that test applied skills. Broad, organized review and lots of practice questions are the key to passing.

What is the difference between a data warehouse and a data lake?

A data warehouse stores structured, modeled data optimized for analysis and reporting (schema-on-write). A data lake stores vast amounts of raw data in its native format — structured, semi-structured, and unstructured — and applies structure only when read (schema-on-read), making it cheaper and more flexible. Warehouses suit known, repeatable reporting; lakes suit exploratory analytics, machine learning, and big, varied data. A data lakehouse is a hybrid that adds warehouse-style management and structure on top of a lake.

What is structured vs. unstructured data?

Structured data fits a defined schema of rows and columns, like a relational database table. Unstructured data has no predefined model — text, images, audio, and video. Semi-structured data falls in between, carrying tags or markers (JSON, XML) without a rigid table structure. Most enterprise data is unstructured, which is why data lakes exist to store it cheaply in native format. Data type also describes measurement scale: nominal, ordinal, interval, and ratio, or simply qualitative vs. quantitative.

What is the difference between OLTP and OLAP?

OLTP (online transaction processing) runs day-to-day operations — many small, fast reads and writes against a normalized database. OLAP (online analytical processing) supports analysis — complex queries and aggregations over large, historical datasets, typically in a data warehouse modeled with star or snowflake schemas. OLTP prioritizes speed and data integrity for single records; OLAP prioritizes query performance over huge volumes. Data moves from OLTP systems into OLAP stores through ETL or ELT pipelines.

What is big data?

Big data refers to datasets too large or complex for traditional tools, commonly described by the five V's: volume (scale), velocity (speed of generation), variety (different formats), veracity (trustworthiness), and value (business worth). Handling it requires distributed storage and processing such as data lakes and cloud platforms. The volume, velocity, and variety of big data drive the need for schema-on-read storage and scalable compute. Veracity and value remind analysts that more data is only useful if it is trustworthy and answers a business question.

What is the difference between ETL and ELT?

ETL extracts data, transforms (cleans and shapes) it, then loads it into the target — the classic approach for structured data warehouses. ELT extracts and loads raw data first, then transforms it inside the target system, which suits cloud warehouses, data lakes, and large, varied big-data workloads. ETL transforms on a separate engine before loading; ELT uses the target platform's own compute to transform in place. Both are forms of data integration — combining data from multiple sources into a unified store for analysis.

What is data cleansing?

Data cleansing is the process of detecting and fixing errors and inconsistencies to improve data quality before analysis. It includes handling missing values, removing duplicates, correcting invalid entries, addressing outliers, standardizing formats, and converting data types so the dataset is accurate, complete, and consistent. Common techniques: imputation or deletion for missing values, deduplication, parsing, recoding, and normalization or standardization. Cleansing typically consumes most of an analyst's time — and skipping it produces misleading results (garbage in, garbage out).

What is the difference between normalization and standardization?

Normalization rescales numeric values to a fixed range, usually 0 to 1, so features are comparable. Standardization rescales values to have a mean of 0 and a standard deviation of 1 (a z-score). Both put variables on a common scale so no single feature dominates analysis or a model. Min-max normalization is especially sensitive to outliers, because a single extreme value sets the range and squeezes every other value together; standardization is usually less distorted, although the mean and standard deviation are still pulled by extremes. Choose based on the algorithm and whether the data is roughly normally distributed.
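The two rescaling methods can be compared side by side. The sketch below is illustrative only: the sample values are made up, and it uses the population standard deviation (`pstdev`), which is one of two common conventions.

```python
from statistics import mean, pstdev

values = [10, 20, 30, 40, 100]  # hypothetical feature with one large value


def min_max(xs):
    """Normalization: rescale to the 0-1 range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_scores(xs):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


normalized = min_max(values)
standardized = z_scores(values)

print([round(v, 2) for v in normalized])
print([round(v, 2) for v in standardized])
```

Notice how the single large value pushes the other normalized values toward the low end of the range.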

What are data mining techniques?

Data mining discovers patterns and relationships in large datasets. Core techniques include classification (assigning items to categories), clustering (grouping similar records), regression (modeling a numeric relationship), and association rules (finding items that occur together, e.g., the Apriori algorithm for market-basket analysis). Classification and regression are supervised (they learn from labeled data); clustering and association are unsupervised. Overfitting — a model that memorizes the training data and fails on new data — is a key risk to recognize.

What is the difference between mean, median, and mode?

The mean is the arithmetic average of all values. The median is the middle value when data is sorted. The mode is the most frequently occurring value. The mean is sensitive to outliers, while the median is robust to them — so skewed data is often better summarized by the median. These three are measures of central tendency; spread is measured by range, variance, and standard deviation. On a right-skewed distribution the mean sits above the median — a tested signal of skew.
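A quick way to see the effect of an outlier is to compute all three measures on a small, right-skewed list. The salary figures below are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical salaries in $ thousands; one executive salary skews the list right.
salaries = [40, 42, 45, 45, 48, 50, 250]

avg = mean(salaries)
mid = median(salaries)
most_common = mode(salaries)

print(f"mean={avg:.1f}, median={mid}, mode={most_common}")
# The mean lands well above the median, which signals right skew.
```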

What is standard deviation?

Standard deviation measures how spread out values are around the mean. A low standard deviation means values cluster close to the mean; a high one means they are widely dispersed. It is the square root of the variance and is expressed in the same units as the data, making it easy to interpret. It quantifies variability — two datasets can share a mean but differ greatly in standard deviation. In a normal distribution, about 68% of values fall within one standard deviation of the mean.
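To make "same mean, different spread" concrete, here is a small sketch with two made-up datasets. It uses the population formula (`pstdev`); the sample formula (`stdev`) divides by n − 1 instead and gives slightly larger values.

```python
from statistics import mean, pstdev

tight = [48, 49, 50, 51, 52]   # values cluster near the mean
wide = [10, 30, 50, 70, 90]    # values are widely dispersed

print(mean(tight), round(pstdev(tight), 2))
print(mean(wide), round(pstdev(wide), 2))
```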

What is the difference between correlation and causation?

Correlation means two variables move together in a measurable pattern; causation means one variable directly causes a change in the other. Correlation does not prove causation — two variables can be correlated by coincidence or because a third, confounding variable drives both. Establishing causation requires controlled testing. A correlation coefficient ranges from −1 (perfect negative) to +1 (perfect positive); 0 means no linear relationship. Mistaking correlation for causation is one of the most common and most-tested analytical errors.
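The coefficient described above is usually Pearson's r. A minimal sketch of the calculation follows; the function name and the ad-spend figures are hypothetical, and a high r here says nothing about whether ad spend causes the sales.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 21, 24, 30]      # rises with ad spend
returns = [30, 24, 21, 15, 12]    # falls as ad spend rises

r_up = pearson_r(ad_spend, sales)
r_down = pearson_r(ad_spend, returns)
print(round(r_up, 3), round(r_down, 3))
```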

What are the four types of analytics?

Descriptive analytics summarizes what happened (reports, KPIs). Diagnostic analytics explains why it happened (drill-down, correlation). Predictive analytics forecasts what will happen (statistical models). Prescriptive analytics recommends what action to take (optimization). Each level delivers more value but requires more sophisticated techniques. Most early-career data work is descriptive and diagnostic; predictive and prescriptive layer on modeling. Knowing which type a scenario calls for is a recurring exam pattern.

What is a p-value in hypothesis testing?

A p-value is the probability of observing results at least as extreme as the data, assuming the null hypothesis is true. A small p-value (commonly below 0.05) means the observed result would be unlikely if the null hypothesis were true, so you reject the null hypothesis. A large p-value means you fail to reject it. The threshold (significance level, often 0.05) is set before the test; a p-value below it is evidence against the null, not proof. A confidence interval complements the p-value by giving a range of plausible values for the true parameter.
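As a worked example of the definition, the sketch below computes an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The function name and the flip counts are assumptions chosen for illustration; real analyses normally use a statistics library rather than hand-rolled code.

```python
from math import comb


def fair_coin_p_value(heads, flips):
    """Probability, under a fair coin, of a head count at least as far
    from flips/2 as the one observed (two-sided)."""
    distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme_outcomes / 2 ** flips


p_60 = fair_coin_p_value(60, 100)
p_61 = fair_coin_p_value(61, 100)
print(round(p_60, 4), round(p_61, 4))
```

With a 0.05 significance level, 61 heads in 100 flips leads to rejecting the fair-coin hypothesis, while 60 heads falls just short.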

When should you use a scatter plot?

Use a scatter plot to show the relationship or correlation between two numeric variables. Each point represents one observation plotted by its x and y values. The pattern reveals whether the variables rise together, move oppositely, or have no relationship, and exposes clusters and outliers. Add a trend line to summarize the direction and strength of the relationship. A bubble chart extends a scatter plot by encoding a third variable as point size.

What is the difference between a histogram and a bar chart?

A bar chart compares values across distinct categories, with gaps between bars. A histogram shows the distribution of a single continuous variable by grouping values into ranges (bins), with bars touching. In short: bar charts compare categories; histograms reveal the shape and spread of one numeric variable. Use a histogram to spot skew, modality, and outliers; use a bar chart to rank or compare groups. A box plot is another distribution view that highlights the median, quartiles, and outliers.
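The difference shows up in how the data is prepared before plotting. In this sketch (all values invented), the bar chart counts distinct categories, while the histogram first groups a numeric variable into bins of an assumed width of 10.

```python
from collections import Counter

regions = ["East", "West", "East", "North", "East", "West"]   # categorical
ages = [22, 25, 31, 34, 35, 41, 47, 52, 58, 63]               # continuous

# Bar chart input: one count per distinct category.
bar_data = Counter(regions)

# Histogram input: one count per bin (20-29, 30-39, ...).
bin_width = 10
hist_data = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(hist_data):
    print(f"{start}-{start + bin_width - 1}: {'#' * hist_data[start]}")
```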

What makes a good dashboard?

A good dashboard presents the right key performance indicators (KPIs) for its audience clearly and at a glance. It uses appropriate chart types, consistent formatting, and interactivity such as filters and drill-downs, and avoids clutter (chartjunk) and misleading scales. The goal is fast, accurate insight that drives a decision. Match the report type to the need: ad hoc for one-off questions, recurring for monitoring, self-service for exploration. Misleading visuals — truncated axes, distorted proportions — are an ethics and accuracy issue the exam tests.

What is data governance?

Data governance is the framework of policies, roles, standards, and processes that control how data is managed across its lifecycle. It defines ownership and stewardship, sets quality and security standards, and ensures data is accurate, available, and used in compliance with regulations — so the organization can trust and rely on its data. Key roles include data owners (accountable) and data stewards (responsible for day-to-day quality). Supporting tools include a data catalog (an inventory of data assets) and data lineage (tracking data's origin and movement).

What are the dimensions of data quality?

Data quality is measured across dimensions: accuracy (values are correct), completeness (no missing required values), consistency (values agree across systems), timeliness (data is current), uniqueness (no unintended duplicates), validity (values match defined rules), and integrity (relationships are maintained). High-quality data is fit for its intended purpose. A failure in any dimension can invalidate analysis — this is why cleansing and governance matter. Organizations profile data and set quality rules to monitor these dimensions continuously.

What is Master Data Management (MDM)?

Master Data Management is the practice of creating and maintaining a single, authoritative version of an organization's core business entities — customers, products, suppliers — across all systems. This 'golden record' eliminates conflicting duplicates so reports and analysis are consistent everywhere the data is used. MDM is a governance discipline that improves consistency, uniqueness, and accuracy across the enterprise. Without it, the same customer might appear differently in sales, support, and finance systems.

What is the difference between PII and data masking?

PII (personally identifiable information) is any data that can identify an individual, such as a name, Social Security number, or email. Data masking is a control that hides or replaces sensitive values with realistic but fake data, so PII can be used for testing or analysis without exposing the real information. Other controls include anonymization, encryption (at rest and in transit), access controls, and data classification. Regulations such as GDPR, HIPAA (PHI), PCI-DSS, and CCPA dictate how PII must be protected and retained.

Structured data: Data organized into a defined schema of rows and columns, such as a relational database table.

Unstructured data: Data with no predefined model (text documents, images, audio, and video) that cannot be stored in simple rows and columns.

Semi-structured data: Data that carries tags or markers (JSON, XML) but does not fit a rigid table structure.

Qualitative data: Descriptive, categorical data (e.g., colors, names), measured on a nominal or ordinal scale.

Quantitative data: Numeric, measurable data that supports mathematical operations (interval or ratio scales).

Relational database: A structured store organizing data into related tables linked by keys; queried with SQL.

FREE Data+ Study Guide 2026 + Quizzes & Readiness Score

This free CompTIA Data+ study guide walks through every content domain the Data+ (DA0-002) exam tests, organized to the current CompTIA exam objectives.^[1]

It’s interactive, not a wall of text: every module has built-in checkpoint quizzes, flashcards, and practice questions, so you learn by doing — not just reading.

Data+ tests five official domains, and this guide teaches them as five study modules organized to the official blueprint. Read a module, test yourself at each checkpoint, then drill gaps with our free practice test and flashcards. This guide is a high-yield overview that maps the official content, not a full data-analytics textbook.

CompTIA Data+ is one of the 14 CompTIA certifications — explore our CompTIA study guides to compare and prep across the whole family.

Data+ Exam Snapshot

CompTIA Data+ (DA0-002) at a glance
Detail	Data+ Exam
Exam code	DA0-002 (V2; current — replaced DA0-001)
Questions	Maximum of 90 (multiple choice + performance-based)
Time	90 minutes
Passing score	675 on a 100–900 scale (scaled score, not a percentage)
Certifying body	CompTIA (delivered by Pearson VUE)
Cost	About $255 (voucher; ~$304 with retake assurance)
Prerequisites	None required (18–24 months in a data role recommended)
Validity	3 years
Renewal	30 CEUs over 3 years, or pass a higher CompTIA cert

Data+ covers five domains. The largest — Data Analysis — and the next, Data Acquisition & Preparation, together make up nearly half the exam (46%), so that is where to invest first.^[1] Study by weight:

Data+ DA0-002 weighting by domain (CompTIA exam objectives)

3.0 Data Analysis: 24% · Statistics + techniques

2.0 Data Acquisition & Preparation: 22% · ETL/ELT, cleansing

1.0 Data Concepts & Environments: 20% · Types, databases, lakes

4.0 Visualization & Reporting: 20% · Charts, dashboards

5.0 Data Governance, Quality & Controls: 14% · Governance, privacy

Every analysis follows the same arc — define the question, get and clean the data, analyze it, then visualize and communicate the result. Keep this lifecycle in mind as you work through the modules:

The data analytics lifecycle (in order)

1. Business question
Define the problem and the decision the analysis must support. Everything starts with the question, not the data.
2. Acquire data
Collect or extract data from sources (databases, APIs, files, surveys) and integrate it (ETL/ELT).
3. Prepare & clean
Profile, clean, and transform: handle missing values, duplicates, and outliers; recode and normalize.
4. Analyze
Apply descriptive, diagnostic, predictive, or prescriptive techniques and statistics to find patterns.
5. Visualize & report
Turn findings into the right charts, dashboards, and reports for the audience.
6. Communicate & act
Tell the data story, drive a decision, and monitor the outcome. Governance applies throughout.

Data governance and quality control wrap the entire cycle — not just the end.

Module 1 · Data Concepts & Environments

One official domain, 20% of the exam. This is the foundation — the kinds of data you work with and the systems that store it. Nail the vocabulary here and the rest of the exam reads far more clearly.

1.1 Data Types & Structures

Start by classifying data two ways. By structure: structured data fits neat rows and columns; unstructured data (text, images, audio, video) does not; and semi-structured data (JSON, XML) sits between, carrying tags without a rigid table. By measurement: qualitative data is categorical (nominal or ordinal), while quantitative data is numeric (interval or ratio) and supports math.^[1]

Data types and measurement scales
Type	Scale	Example	Math allowed?
Qualitative	Nominal (categories, no order)	Eye color, country	Count only
Qualitative	Ordinal (ordered categories)	Survey: poor/fair/good	Order, not arithmetic
Quantitative	Interval (no true zero)	Temperature in °C	Add / subtract
Quantitative	Ratio (true zero)	Sales, height, count	All arithmetic

1.2 Databases, Warehouses & Lakes

Data lives in different places for different jobs. A relational database runs operations day-to-day (OLTP), using normalized tables linked by a primary key and foreign key. For analysis (OLAP), data flows into a data warehouse (structured and modeled, often as a star schema) or a data lake (raw, any format). A data lakehouse blends both.^[1]

Where data lives: database vs. warehouse vs. lake

Operational database (OLTP)
Runs the business day-to-day. Normalized, structured, optimized for fast reads/writes of single records.
Data warehouse (OLAP)
Central analytical store. Structured and modeled (star/snowflake), schema-on-write, optimized for queries and reporting.
Data lake
Holds vast raw data in its native format (structured, semi-, and unstructured). Schema-on-read; cheap and flexible.
Data lakehouse
A hybrid that adds warehouse-style structure and management on top of a lake — one platform for both.

Warehouse = schema-on-write (structure first); lake = schema-on-read (structure when you query).
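The schema-on-write versus schema-on-read distinction can be sketched in a few lines. Everything here is hypothetical (the event records, field names, and function names), and plain Python lists stand in for real warehouse and lake storage.

```python
import json

# Hypothetical raw events, one JSON document per line.
raw_events = [
    '{"user": "a1", "amount": "19.99"}',
    '{"user": "b2", "amount": 5, "coupon": "SPRING"}',
    '{"user": "c3"}',
]


def load_to_warehouse(lines):
    """Schema-on-write: enforce the schema at load time; reject misfits."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            table.append({"user": str(record["user"]),
                          "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError):
            rejected.append(line)
    return table, rejected


def total_amount_from_lake(lines):
    """Schema-on-read: keep raw text; apply structure only when querying."""
    return sum(float(json.loads(line).get("amount", 0)) for line in lines)


warehouse_rows, rejected_rows = load_to_warehouse(raw_events)
lake = list(raw_events)  # stored exactly as received
lake_total = total_amount_from_lake(lake)
print(len(warehouse_rows), len(rejected_rows), round(lake_total, 2))
```

The warehouse path loses the record with no amount and the extra coupon field; the lake path keeps both and leaves the interpretation to each query.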

OLTP vs. OLAP
Aspect	OLTP (operations)	OLAP (analysis)
Purpose	Run the business (transactions)	Analyze the business (insight)
Workload	Many small reads/writes	Few large, complex queries
Design	Normalized for integrity	Denormalized/modeled for speed
Example	Order-entry system	Sales data warehouse

1.3 Big Data & the Analytics Lifecycle

Big data is data too large or complex for traditional tools, described by the five V’s: volume, velocity, variety, veracity, and value.^[5] Its scale and variety are exactly why data lakes and cloud platforms exist. Whatever the size, work follows the data analytics lifecycle, and the question always comes before the data.

Checkpoint · Data Concepts & Environments

In data analytics, what does the term "Data Lake" primarily refer to?

Module 2 · Data Acquisition & Preparation

One official domain, 22% of the exam. This domain — renamed from “Data Mining” in the DA0-001 era — is where raw data becomes analysis-ready. In practice it is where analysts spend most of their time, and it is heavily tested.

2.1 Acquiring & Integrating Data (ETL/ELT)

Data is acquired from databases, files, APIs, web scraping, surveys, and sensors, then combined through data integration. The two pipeline patterns to know cold are ETL (transform before loading, the classic warehouse approach) and ELT (load raw, then transform in the target, the cloud/lake/big-data approach).^[1]

ETL vs. ELT — when to transform

ETL (Extract, Transform, Load)

Extract → Transform → Load
Data is cleaned/shaped BEFORE loading
Transformation on a separate engine
Classic, structured data warehouses
Good when targets need clean, modeled data

ELT (Extract, Load, Transform)

Extract → Load → Transform
Raw data loaded FIRST, transformed in place
Transformation uses the target's compute
Cloud warehouses, data lakes, big data
Good for large, varied, fast-changing data
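The two patterns can be sketched with Python's built-in sqlite3 module standing in for the target system. The order rows, table names, and function names are hypothetical; the point is only where the transformation runs. In the ETL path Python cleans the rows before loading, and in the ELT path the raw text is loaded first and SQL inside the target does the cleaning.

```python
import sqlite3

# Hypothetical raw rows as text: (order_id, region, amount).
raw_orders = [("1001", " east ", "19.99"), ("1002", "WEST", "5.00"), ("1003", "East", "12.50")]


def clean(row):
    """Transform step: cast types, trim whitespace, standardize casing."""
    order_id, region, amount = row
    return int(order_id), region.strip().title(), float(amount)


def run_etl(rows):
    """ETL: transform outside the target, then load only clean rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [clean(r) for r in rows])
    return db


def run_elt(rows):
    """ELT: load raw text first, then transform with the target's own SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    db.execute("""
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(SUBSTR(TRIM(region), 1, 1)) || LOWER(SUBSTR(TRIM(region), 2)) AS region,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
    """)
    return db


QUERY = "SELECT region, ROUND(SUM(amount), 2) FROM orders GROUP BY region ORDER BY region"
etl_result = run_etl(raw_orders).execute(QUERY).fetchall()
elt_result = run_elt(raw_orders).execute(QUERY).fetchall()
print(etl_result)
print(elt_result)
```

Both paths end with the same analysis-ready table; ELT additionally keeps the untouched raw_orders table in the target for later reprocessing.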

When you can’t (or shouldn’t) use a whole population, you sample it. Good sampling — random, representative, large enough — keeps conclusions valid; biased sampling quietly breaks every downstream result.
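A small simulation shows why the sampling method matters. The population below is randomly generated for illustration, and the seed is fixed only so the sketch is repeatable.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 10,000 order values, sorted ascending as in an export.
population = sorted(rng.uniform(5, 500) for _ in range(10_000))

random_sample = rng.sample(population, 500)  # every record has an equal chance
biased_sample = population[:500]             # "first 500 rows" are all small orders

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(round(pop_mean, 2), round(random_mean, 2), round(biased_mean, 2))
```

The random sample's mean stays close to the population mean, while the convenience sample badly understates it.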

2.2 Cleansing & Preparing Data

Data cleansing fixes the problems that would otherwise poison analysis. Handle missing values (delete the record, or impute it with the mean, median, or a predicted value), remove duplicates, and investigate each outlier (error or genuine extreme?). Then make values comparable with normalization or standardization, and convert data types as needed.^[1]

Common data-preparation tasks
Problem	Technique
Missing values	Delete the row, or impute (mean/median/predicted)
Duplicate records	Deduplication (often via a unique key)
Outliers	Investigate; cap, transform, or remove if erroneous
Different scales	Normalize (0–1) or standardize (z-score)
Wrong data type	Type conversion / casting (e.g., text to date)
Inconsistent formats	Parsing and standardizing (dates, units, casing)
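Three of the tasks in the table (deduplication, imputation, and type conversion) can be sketched in plain Python. The records and field layout are hypothetical, and median imputation is just one of the options listed above.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records: (customer_id, signup_date as text, monthly_spend or None).
raw = [
    ("C001", "2024-01-15", 120.0),
    ("C002", "2024-02-03", None),     # missing value
    ("C001", "2024-01-15", 120.0),    # duplicate of the first record
    ("C003", "2024-03-22", 80.0),
    ("C004", "2024-04-09", 100.0),
]

# 1. Deduplicate on the unique key, keeping the first occurrence.
seen, deduped = set(), []
for customer_id, signup, spend in raw:
    if customer_id not in seen:
        seen.add(customer_id)
        deduped.append((customer_id, signup, spend))

# 2. Impute missing spend with the median of the known values.
known = [spend for _, _, spend in deduped if spend is not None]
fill_value = median(known)

# 3. Cast the date text to a real date type.
cleaned = [
    (cid, datetime.strptime(signup, "%Y-%m-%d").date(),
     fill_value if spend is None else spend)
    for cid, signup, spend in deduped
]

for row in cleaned:
    print(row)
```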

2.3 Data Mining Techniques

Data mining finds patterns in large datasets. The four techniques to know are classification (assign to known categories), clustering (group similar records with no labels), regression (model a numeric relationship), and association rules (items that co-occur, as in the Apriori algorithm and market-basket analysis). Watch for overfitting, where a model memorizes the training data and fails on new data.^[1]

Data mining techniques
Technique	Learning type	Use it to…
Classification	Supervised	Sort records into known categories (spam / not spam)
Regression	Supervised	Predict a numeric value (next month's sales)
Clustering	Unsupervised	Group similar records with no labels (customer segments)
Association rules	Unsupervised	Find items bought together (market-basket analysis)
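Association rules are scored with two standard measures, support and confidence. The sketch below computes both for one rule over a handful of invented baskets. It is not the Apriori algorithm itself, only the metrics that such algorithms rely on, and the function names are hypothetical.

```python
# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_containing(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(items):
    """Share of all baskets that contain the item set."""
    return count_containing(items) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the share that also hold the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)


rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(rule_support, rule_confidence)
```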

Checkpoint · Data Acquisition & Preparation

Which concept in data management focuses on the use of data across different domains and formats for improved decision-making?

Module 3 · Data Analysis

One official domain, 24% of the exam — the single heaviest. This is the statistical core: summarizing data, measuring relationships, and choosing the right kind of analysis. Invest the most time here.

3.1 Descriptive Statistics

Descriptive statistics summarize a dataset. Central tendency: the mean (average, outlier-sensitive), the median (middle, robust), and the mode (most frequent). Spread: the range, variance, and standard deviation (spread around the mean, in the data’s own units). A percentile marks the value below which a given share of data falls.^[1]

Descriptive statistics — what each one tells you
Measure	What it tells you	Watch out for
Mean	The arithmetic average	Distorted by outliers / skew
Median	The middle value	Best for skewed data
Mode	The most common value	Can be none or several
Range	Max minus min (total spread)	Driven by extremes
Standard deviation	Typical distance from the mean	Same units as the data
Percentile / quartile	Position within the distribution	—

3.2 Relationships & Inference

Correlation measures how two variables move together, from −1 to +1, but it never proves causation: a confounding third variable or coincidence can produce the same pattern. Hypothesis testing weighs sample evidence against a claim using a p-value (a small p-value, often below 0.05, lets you reject the null hypothesis), and a confidence interval gives a plausible range for the true value.^[1]

3.3 Types of Analytics

Match the analysis to the question. Descriptive analytics says what happened, diagnostic why, predictive what will happen, and prescriptive what to do about it. Each step up the ladder delivers more value and demands more sophisticated technique.^[1]

The four types of analytics (rising value & complexity)

1. Descriptive
What happened?
Summarizes past data (reports, KPIs).
2. Diagnostic
Why did it happen?
Finds causes (drill-down, correlation).
3. Predictive
What will happen?
Forecasts future outcomes (models).
4. Prescriptive
What should we do?
Recommends an action (optimization).

Each step delivers more business value — and demands more sophisticated data work.

Checkpoint · Data Analysis

In data mining, what is the primary purpose of the Apriori algorithm?

Module 4 · Visualization & Reporting

One official domain, 20% of the exam. Analysis only matters if it’s communicated. This domain is about choosing the right chart, building clear dashboards and reports, and not misleading your audience.

4.1 Choosing the Right Chart

The single most-tested visualization skill is matching a chart to a goal. Use a bar chart to compare categories, a line chart for a trend over time, a pie or stacked bar for parts of a whole, a scatter plot for the relationship between two variables, a histogram or box plot for distribution, and a Pareto chart to find the vital few.^[1]

Picking the right chart: goal → chart type

Your goal	Best chart
Compare values across categories	Bar / column chart
Show a trend over time	Line chart
Show parts of a whole	Pie / stacked bar
Show a relationship between two variables	Scatter plot
Show the distribution of one variable	Histogram / box plot
Find the vital few (80/20)	Pareto chart
Show data by location	Map / choropleth

Match the chart to the question — comparison, trend, part-to-whole, relationship, or distribution.

High-yield chart types and when to use them
Chart	Best for	Example
Bar / column	Comparing categories	Sales by region
Line	Trends over time	Monthly revenue
Pie / stacked bar	Parts of a whole	Market share
Scatter plot	Relationship between two variables	Ad spend vs. sales
Histogram	Distribution of one variable	Customer ages
Box plot	Distribution + outliers	Salary spread by team
Pareto chart	The vital few (80/20)	Top defect causes
Heat map	Magnitude across two dimensions	Activity by hour/day
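The data behind a Pareto chart is a ranked list with a running cumulative share. This sketch uses invented defect counts and an assumed 80% cutoff to pick out the "vital few" causes.

```python
# Hypothetical defect counts by cause.
defect_counts = {
    "Scratches": 45,
    "Dents": 30,
    "Misalignment": 12,
    "Discoloration": 8,
    "Other": 5,
}

total = sum(defect_counts.values())
ranked = sorted(defect_counts.items(), key=lambda item: item[1], reverse=True)

running = 0
vital_few = []
for cause, count in ranked:
    already_at_cutoff = running / total >= 0.80
    running += count
    if not already_at_cutoff:
        vital_few.append(cause)  # still needed to reach 80% of all defects
    print(f"{cause:<14} {count:>3}  cumulative {running / total:.0%}")

print(vital_few)
```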

4.2 Dashboards & Reports

A dashboard surfaces the right KPIs for an audience at a glance, with interactivity such as filters and drill-downs. Choose the report type for the need: ad hoc (one-off), recurring (monitoring), or self-service (exploration). Then design for clarity. Above all, never mislead: truncated axes, distorted proportions, and cherry-picked ranges are accuracy and ethics failures.^[1]

Designing honest, useful visuals
Do	Avoid
Start bar-chart axes at zero	Truncating the y-axis to exaggerate differences
Pick the chart that fits the data	Forcing a 3-D or fancy chart that distorts
Label clearly; show units and source	Clutter (chartjunk) that hides the message
Match KPIs to the audience's decision	Dumping every metric onto one screen

Checkpoint · Visualization & Reporting

What is the primary purpose of using a box plot in data analysis?

Module 5 · Data Governance, Quality & Controls

One official domain, 14% of the exam. Smaller in weight but conceptually important — this domain is about trusting your data and using it responsibly: governance, quality, privacy, and security controls.

5.1 Governance & Master Data

Data governance is the framework of policies, roles, and standards controlling data across its lifecycle. Data owners are accountable; a data steward handles day-to-day quality. Supporting tools include a data catalog (an inventory of data assets), data lineage (where data came from and how it moved), and master data management (one authoritative “golden record” per entity).^[1]

5.2 Data Quality

Data quality is measured across seven dimensions: accuracy, completeness, consistency, timeliness, uniqueness, validity, and integrity. A failure in any one can invalidate an entire analysis, which is exactly why cleansing (Module 2) and governance exist.^[1]

The dimensions of data quality

Accuracy: Values correctly reflect the real-world fact.
Completeness: No required values are missing.
Consistency: Values agree across systems and records.
Timeliness: Data is current and available when needed.
Uniqueness: No unintended duplicate records exist.
Validity: Values conform to the defined format and rules.
Integrity: Relationships between data are maintained.

High-quality data is fit for its purpose across every dimension — quality issues invalidate analysis.

5.3 Privacy, Security & Controls

Sensitive data demands controls. Identify PII (and PHI for health data), then protect it with data classification, data masking or anonymization, encryption (at rest and in transit), and access controls.^[6] Regulations dictate the rules: GDPR (EU privacy), HIPAA (U.S. health), PCI-DSS (payment cards), and CCPA (California).

Privacy and security controls
Control	What it does
Data classification	Labels data by sensitivity to apply the right controls
Data masking	Replaces sensitive values with realistic fakes for safe use
Anonymization	Removes identifiers so individuals can't be re-identified
Encryption	Protects data at rest and in transit from unauthorized reading
Access controls	Limits who can see or change data (least privilege)
Retention & disposal	Keeps data only as long as needed, then securely destroys it
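Data masking from the table above can be illustrated with a minimal sketch. The key, function names, and sample values are all hypothetical, and the keyed-hash approach shown is only one possible masking technique. Because the same input always maps to the same fake value, this is pseudonymization rather than full anonymization, and a real deployment would need a properly managed secret.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secrets manager.
MASKING_KEY = b"replace-with-a-managed-secret"


def mask_email(email):
    """Replace an email with a consistent, realistic-looking fake address."""
    digest = hmac.new(MASKING_KEY, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:10]}@example.com"


def mask_ssn(ssn):
    """Hide all but the last four digits of a U.S. Social Security number."""
    return "***-**-" + ssn[-4:]


masked_email = mask_email("Jane.Doe@corp.example")
masked_ssn = mask_ssn("123-45-6789")
print(masked_email, masked_ssn)
```

Consistent masking keeps joins and counts working in test or analysis environments, since the same customer always receives the same fake value.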

Checkpoint · Data Governance, Quality & Controls

What does the term "Data Governance" primarily refer to?

How to Use This Data+ Study Guide

This guide is built to be worked, not just read. The most efficient path to a pass:

Study by weight. Data Analysis (24%) and Data Acquisition & Preparation (22%) are nearly half the exam — master statistics, correlation vs. causation, and data cleansing first.
Check off as you go. Use the Study Guide Contents to mark each section done; it raises your exam-readiness score.
Take every checkpoint. The end-of-module quizzes show you exactly which domains need another pass.
Drill the weak domain. Send your weak area into the flashcards and a practice test until the score climbs.
Practice the PBQs. Performance-based questions reward applied skill — read a dataset, pick the right chart, and interpret a statistic until it’s automatic.

Data+ Concept Questions

Common Data+ concepts candidates search while studying — each answered briefly and backed by an official source. Test yourself, then drill them as flashcards.

Data+ Glossary

The high-yield Data+ terms in one place — hover any dotted term in the guide, or flip the whole deck here as a self-grading flashcard set.

Association rules: Finding items that frequently occur together (e.g., the Apriori algorithm for market-basket analysis).
Bar chart: A chart that compares values across distinct categories (bars have gaps).
Big data: Datasets too large or complex for traditional tools, characterized by the V's: volume, velocity, variety, veracity, value.
Box plot: A chart showing a distribution's median, quartiles, and outliers.
Causation: A relationship in which one variable directly causes a change in another; not proven by correlation alone.
Classification: A supervised technique that assigns records to predefined categories.
Clustering: An unsupervised technique that groups similar records without predefined labels.
Confidence interval: A range of plausible values for a population parameter, with a stated level of confidence.
Correlation: A measure of how two variables move together, from −1 (perfect negative) to +1 (perfect positive).
Dashboard: An interactive display of the most important metrics and visuals for an audience, at a glance.
Data analytics lifecycle: The end-to-end process: define the question, acquire, prepare, analyze, visualize, then communicate and act.
Data catalog: An organized inventory of an organization's data assets with descriptions and metadata.
Data classification: Labeling data by sensitivity (e.g., public, internal, confidential) to apply the right controls.
Data cleansing: Detecting and correcting errors and inconsistencies — missing values, duplicates, outliers — to improve data quality.
Data governance: The framework of policies, roles, and standards controlling how data is managed across its lifecycle.
Data integration: Combining data from multiple sources into a unified store for analysis.
Data lake: A repository that stores vast amounts of raw data in its native format (schema-on-read); cheap and flexible.
Data lakehouse: A hybrid architecture that adds warehouse-style structure and management on top of a data lake.
Data lineage: A record of data's origin and how it moves and transforms through systems.
Data mart: A subject-specific subset of a data warehouse serving a single department or function.
Data masking: Replacing sensitive values with realistic but fake data so it can be used without exposing the real values.
Data mining: Discovering patterns and relationships in large datasets using techniques like classification and clustering.
Data quality: The degree to which data is fit for purpose across dimensions like accuracy, completeness, and consistency.
Data steward: A person responsible for the day-to-day quality and proper use of a data domain.
Data warehouse: A central analytical store of structured, modeled data (schema-on-write) optimized for reporting and analysis.
Descriptive analytics: Analytics that summarizes what happened (reports, KPIs).
Diagnostic analytics: Analytics that explains why something happened (drill-down, correlation).
ELT: Extract, Load, Transform — load raw data first, then transform it inside the target (cloud/lake/big-data pattern).
ETL: Extract, Transform, Load — clean and shape data before loading it into the target (classic data-warehouse pattern).
Foreign key: A column that references the primary key of another table, enforcing relationships between tables.
GDPR: General Data Protection Regulation — the EU law governing personal-data privacy and protection.
Heat map: A chart that uses color intensity to show magnitude across two dimensions.
HIPAA: Health Insurance Portability and Accountability Act, the U.S. law protecting health information (PHI, Protected Health Information).
Histogram: A chart showing the distribution of one continuous variable by grouping values into bins (bars touch).
Hypothesis testing: A method for deciding whether sample evidence supports a claim about a population, using a p-value.
Imputation: Filling in missing values using a strategy such as the mean, median, or a predicted value.
KPI: Key Performance Indicator — a measurable value that shows how well a goal is being met.
Master Data Management (MDM): Maintaining a single authoritative version of core business entities (the 'golden record').
Mean: The arithmetic average of all values; sensitive to outliers.
Median: The middle value of sorted data; robust to outliers.
Mode: The most frequently occurring value in a dataset.
Normalization: Rescaling numeric values to a fixed range, typically 0 to 1, so features are comparable.
OLAP: Online Analytical Processing — systems optimized for complex queries and aggregations over large, historical datasets.
OLTP: Online Transaction Processing — systems optimized for many fast, small reads and writes that run daily operations.
Outlier: A value far outside the typical range of a dataset; may be an error or a genuine extreme to investigate.
Overfitting: When a model memorizes training data and performs poorly on new, unseen data.
p-value: The probability of seeing results at least as extreme as the data if the null hypothesis is true.
Pareto chart: A bar chart ordered largest-to-smallest with a cumulative line, highlighting the vital few (80/20).
Percentile: A value below which a given percentage of observations fall (e.g., the 90th percentile).
PII: Personally Identifiable Information — data that can identify an individual (name, SSN, email).
Predictive analytics: Analytics that forecasts what is likely to happen using models.
Prescriptive analytics: Analytics that recommends what action to take (optimization).
Primary key: A column (or set of columns) whose value uniquely identifies each row in a table.
Qualitative data: Descriptive, categorical data (e.g., colors, names), measured on nominal or ordinal scales.
Quantitative data: Numeric, measurable data that supports mathematical operations (interval or ratio scales).
Relational database: A structured store organizing data into related tables linked by keys; queried with SQL.
Scatter plot: A chart that plots two numeric variables as points to reveal their relationship or correlation.
Semi-structured data: Data that carries tags or markers (JSON, XML) but does not fit a rigid table structure.
Standard deviation: A measure of spread around the mean, in the same units as the data; the square root of variance.
Standardization: Rescaling values to a mean of 0 and standard deviation of 1 (a z-score).
Star schema: A data-warehouse design with a central fact table linked to surrounding dimension tables.
Structured data: Data organized into a defined schema of rows and columns, such as a relational database table.
Unstructured data: Data with no predefined model, such as text documents, images, audio, and video, that does not fit neatly into rows and columns.
Variance: A measure of how far values spread from the mean (the square of the standard deviation).
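
Several of the statistical terms above (mean, median, mode, variance, standard deviation, normalization, standardization) are easier to remember once you have computed them. The following is a minimal sketch using only Python's standard statistics module; the sample values are hypothetical, and it uses the population formulas (divide by n) as a simplifying assumption.

```python
import statistics

# Hypothetical sample: seven daily order counts with one extreme value (90).
values = [10, 12, 12, 13, 14, 15, 90]

mean = statistics.mean(values)      # arithmetic average, pulled up by the outlier
median = statistics.median(values)  # middle of the sorted values, robust to the outlier
mode = statistics.mode(values)      # most frequently occurring value

# Population variance and standard deviation (divide by n).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)  # square root of the variance

# Normalization (min-max): rescale every value into the 0 to 1 range.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): rescale to mean 0 and standard deviation 1.
z_scores = [(v - mean) / std_dev for v in values]

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
print("normalized:", [round(x, 2) for x in normalized])
print("z-scores:  ", [round(z, 2) for z in z_scores])
```

Here the median is 13 and the mode is 12, while the mean is about 23.7, because the single value of 90 drags the average upward. That gap is the practical meaning of "sensitive to outliers" versus "robust to outliers" in the definitions above.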

Data+ Study Guide FAQ

DA0-002 (V2) is the current version — it launched October 14, 2025, and DA0-001 (V1) retired in English on April 14, 2026. The exam has a maximum of 90 questions (multiple choice plus performance-based questions) and a 90-minute time limit.

References

1. CompTIA. “CompTIA Data+ (DA0-002) Certification: Exam Details & Objectives.” comptia.org.
2. CompTIA. “CompTIA Data+ (DA0-001): Retiring Version.” comptia.org.
3. CompTIA. “CompTIA Data+: Your Questions Answered (FAQ).” comptia.org.
4. CompTIA. “Continuing Education: Renewal Fees & CEU Requirements.” comptia.org.
5. National Institute of Standards and Technology. “NIST Big Data Program.” nist.gov.
6. National Institute of Standards and Technology. “NIST Privacy Framework.” nist.gov.
