- Structured data
- Data in a defined schema of rows and columns, like a relational database table.
- Unstructured data
- Data with no predefined model — text, images, audio, video.
- Semi-structured data
- Data with tags/markers (JSON, XML) but no rigid table structure.
- Qualitative data
- Descriptive, categorical data (nominal or ordinal).
- Quantitative data
- Numeric, measurable data (interval or ratio) that supports math.
- Nominal data
- Categorical data with no inherent order (e.g., colors, countries).
- Ordinal data
- Categorical data with a meaningful order but no consistent, measurable gap between levels (e.g., poor/fair/good).
- Interval data
- Numeric data with order and equal gaps but no true zero (e.g., °C).
- Ratio data
- Numeric data with a true zero, allowing all arithmetic (e.g., sales, height).
- Discrete data
- Countable data taking whole-number values (e.g., number of orders).
- Continuous data
- Measurable data taking any value in a range (e.g., temperature, weight).
- Relational database
- A store organizing data into related tables linked by keys; queried with SQL.
- Primary key
- A column (or combination of columns) whose value uniquely identifies each row in a table.
- Foreign key
- A column referencing another table's primary key, enforcing relationships.
- Normalization (database)
- Organizing tables to reduce redundancy and improve integrity.
- Denormalization
- Adding redundancy to a schema to speed up analytical reads.
- OLTP
- Online Transaction Processing — fast, small reads/writes that run daily operations.
- OLAP
- Online Analytical Processing — complex queries/aggregations over large historical data.
- Data warehouse
- A central analytical store of structured, modeled data (schema-on-write).
- Data mart
- A subject-specific subset of a data warehouse for one department.
- Data lake
- A repository holding vast raw data in native format (schema-on-read).
- Data lakehouse
- A hybrid adding warehouse structure/management on top of a data lake.
- Schema-on-write
- Structure is defined before loading (data warehouse).
- Schema-on-read
- Structure is applied only when the data is queried (data lake).
- Star schema
- A warehouse design: a central fact table linked to dimension tables.
- Snowflake schema
- A star schema whose dimension tables are further normalized.
- Fact table
- A warehouse table of measurable events/metrics, with keys to dimensions.
- Dimension table
- A warehouse table of descriptive attributes used to filter/group facts.
- Metadata
- Data about data — descriptions, types, source, and definitions of fields.
- Big data
- Datasets too large or complex for traditional tools, commonly characterized by the five V's: volume, velocity, variety, veracity, and value.
- Volume (big data)
- The sheer scale of data generated and stored.
- Velocity (big data)
- The speed at which data is generated and must be processed.
- Variety (big data)
- The range of data formats — structured, semi-, unstructured.
- Veracity (big data)
- The trustworthiness and quality of the data.
- Value (big data)
- The business worth extracted from the data.
- Data analytics lifecycle
- Question → acquire → prepare → analyze → visualize → communicate/act.
- Cloud data environment
- Storing and processing data on managed cloud platforms for scale/elasticity.
- On-premises data
- Data stored and processed on an organization's own hardware.
- Data source
- Any origin of data — database, file, API, sensor, survey, web.
- Record vs. field
- A record (row) is one entity; a field (column) is one attribute of it.
- Flat file
- A plain file (e.g., CSV) storing data with no relational structure.
- CSV
- Comma-Separated Values — a common flat-file format for tabular data.
- JSON
- JavaScript Object Notation — a lightweight semi-structured data format.
- XML
- Extensible Markup Language — a tag-based semi-structured data format.
- SQL
- Structured Query Language — the language for querying relational databases.
- NoSQL database
- A non-relational store (document, key-value, graph, wide-column) for flexible/large data.
- ETL
- Extract, Transform, Load — clean/shape data before loading (classic warehouse).
- ELT
- Extract, Load, Transform — load raw first, transform in the target (cloud/lake).
- Data integration
- Combining data from multiple sources into one unified store.
- Data ingestion
- Bringing data into a system from its sources (batch or streaming).
- Batch processing
- Processing data in scheduled, grouped chunks.
- Stream processing
- Processing data continuously as it arrives in real time.
- API
- Application Programming Interface — a defined way for systems to exchange data.
- Web scraping
- Extracting data from web pages programmatically.
- Data acquisition
- Gathering data from sources: databases, APIs, files, surveys, sensors.
- Sampling
- Selecting a subset of data to represent a larger population.
- Random sampling
- Each member of the population has an equal chance of selection.
- Sampling bias
- A non-representative sample that distorts conclusions.
- Data cleansing
- Detecting and fixing errors/inconsistencies to improve quality.
- Missing value
- An absent entry; handled by deletion or imputation.
- Imputation
- Filling missing values using a strategy (mean, median, predicted).
- Deduplication
- Removing duplicate records, often via a unique key.
- Outlier
- A value far outside the typical range; may be an error or true extreme.
- Sentinel value
- A placeholder (999, -1, 'N/A') that stands in for missing/invalid data.
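
The missing-value, imputation, and sentinel entries above fit together as one cleaning step. A minimal sketch, assuming the sentinel codes listed in the entry and median imputation as the chosen strategy; the function names and the sample ages are hypothetical.

```python
from statistics import median

# Placeholder codes taken from the sentinel-value entry; adjust per dataset.
SENTINELS = {999, -1, "N/A"}

def replace_sentinels(values, sentinels=SENTINELS):
    """Turn sentinel placeholders into None so they are not treated as real data."""
    return [None if v in sentinels else v for v in values]

def impute_median(values):
    """Fill None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

raw_ages = [34, 999, 29, "N/A", 41, -1, 38]  # hypothetical ages with sentinel codes
cleaned = replace_sentinels(raw_ages)
print(cleaned)                 # [34, None, 29, None, 41, None, 38]
print(impute_median(cleaned))  # [34, 36.0, 29, 36.0, 41, 36.0, 38]
```

Converting sentinels to an explicit missing marker first keeps a code like 999 from inflating the mean or median used for imputation.
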
- Normalization (scaling)
- Rescaling numeric values to a fixed range, typically 0 to 1.
- Standardization (scaling)
- Rescaling values to mean 0 and standard deviation 1 (z-score).
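
The two scaling entries above can be compared side by side. A minimal sketch, assuming the population standard deviation is used for the z-score (a sample standard deviation is an equally common choice); the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to rescale; return zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Convert to z-scores: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical sample
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(ages))        # z-scores centered on 0
```
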
- Data type conversion
- Casting a value from one type to another (e.g., text to date).
- Parsing
- Breaking raw text into structured fields (e.g., splitting a full name).
- Recoding
- Replacing values with standardized codes or categories.
- Data validation
- Checking that data conforms to defined rules and formats.
- Data profiling
- Examining data to summarize its content, structure, and quality.
- Data blending
- Combining data from different sources for a single analysis.
- Data transformation
- Converting data into a structure/format suitable for analysis.
- Aggregation
- Summarizing detailed data into totals (sum, count, average) by group.
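
Aggregation by group, as defined above, can be sketched in plain Python. This is an illustration only, roughly what a SQL GROUP BY with SUM, COUNT, and AVG would return; the `aggregate_by` helper and the order rows are hypothetical.

```python
from collections import defaultdict

def aggregate_by(rows, group_key, value_key):
    """Summarize value_key as sum, count, and average for each group_key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"sum": sum(vals), "count": len(vals), "avg": sum(vals) / len(vals)}
        for group, vals in groups.items()
    }

orders = [  # hypothetical detail rows
    {"region": "East", "amount": 120},
    {"region": "West", "amount": 80},
    {"region": "East", "amount": 60},
]
print(aggregate_by(orders, "region", "amount"))
# {'East': {'sum': 180, 'count': 2, 'avg': 90.0}, 'West': {'sum': 80, 'count': 1, 'avg': 80.0}}
```
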
- Filtering
- Keeping only rows that meet a condition.
- Sorting
- Ordering rows by one or more columns.
- Join
- Combining rows from two tables based on a related key.
- Inner join
- Returns only rows with matching keys in both tables.
- Outer join
- Returns matching rows plus unmatched rows from one or both tables.
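
The difference between the inner and outer join entries above shows up clearly on two tiny tables. A minimal sketch using lists of dictionaries as stand-ins for tables; the helper names, the `customer_id` key, and the sample rows are hypothetical, and a real database would fill unmatched columns with NULL rather than omit them.

```python
def inner_join(left, right, key):
    """Return merged rows only for key values present in both tables."""
    right_by_key = {}
    for right_row in right:
        right_by_key.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right_by_key.get(left_row[key], [])
    ]

def full_outer_join(left, right, key):
    """Return matched rows plus the unmatched rows from both tables."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    joined = inner_join(left, right, key)
    joined += [dict(row) for row in left if row[key] not in right_keys]
    joined += [dict(row) for row in right if row[key] not in left_keys]
    return joined

customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]
orders = [{"customer_id": 1, "total": 50}, {"customer_id": 3, "total": 20}]

print(inner_join(customers, orders, "customer_id"))       # only customer 1
print(full_outer_join(customers, orders, "customer_id"))  # customers 1, 2, and 3
```
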
- Pivot / transpose
- Reshaping data by turning rows into columns or vice versa.
- Data mining
- Discovering patterns/relationships in large datasets.
- Classification
- Supervised technique assigning records to known categories.
- Regression (mining)
- Supervised technique modeling a numeric relationship for prediction.
- Clustering
- Unsupervised technique grouping similar records with no labels.
- Association rules
- Finding items that frequently co-occur (market-basket analysis).
- Apriori algorithm
- A classic algorithm for mining frequent itemsets / association rules.
- Supervised learning
- Learning from labeled data (classification, regression).
- Unsupervised learning
- Finding structure in unlabeled data (clustering, association).
- Overfitting
- When a model fits its training data too closely (noise included) and performs poorly on new data.
- Training data
- The dataset used to build (fit) a model.
- Feature
- An input variable (column) used in analysis or a model.
- Descriptive statistics
- Statistics that summarize a dataset (center and spread).
- Mean
- The arithmetic average of all values; sensitive to outliers.
- Median
- The middle value of sorted data; robust to outliers.
- Mode
- The most frequently occurring value.
- Range
- The difference between the maximum and minimum values.
- Variance
- A measure of spread — the average squared distance from the mean.
- Standard deviation
- The square root of the variance: spread around the mean, expressed in the data's own units (variance has squared units).
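
The center and spread measures defined above can be computed with Python's standard `statistics` module. A minimal sketch, assuming population (not sample) variance and standard deviation; the `summarize` helper and the salary figures are hypothetical.

```python
from statistics import mean, median, mode, pvariance, pstdev

def summarize(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "variance": pvariance(values),  # average squared distance from the mean
        "std_dev": pstdev(values),      # square root of the variance
    }

salaries = [40, 42, 45, 45, 48]  # hypothetical values, in thousands
print(summarize(salaries))

# One extreme value pulls the mean far more than the median.
with_outlier = salaries + [400]
print(mean(with_outlier), median(with_outlier))
```

With the extra value of 400, the mean rises above 100 while the median stays at 45, which is the outlier sensitivity the mean and median entries describe.
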
- Percentile
- The value below which a given percentage of data falls.
- Quartile
- Values splitting sorted data into four equal parts (Q1, Q2, Q3).
- Interquartile range (IQR)
- Q3 minus Q1: the spread of the middle 50% of the data.
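
Quartiles and the IQR can be computed as follows. A minimal sketch, assuming the "inclusive" interpolation method of `statistics.quantiles` (other methods give slightly different quartiles) and the common 1.5 × IQR convention for flagging outliers, which is an added assumption rather than part of the definitions above; the helper name and data are hypothetical.

```python
from statistics import quantiles

def iqr_summary(values):
    """Return quartiles, the IQR, and values outside the 1.5 * IQR fences."""
    q1, q2, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # conventional lower cutoff (assumption)
    high_fence = q3 + 1.5 * iqr  # conventional upper cutoff (assumption)
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical values with one extreme
print(iqr_summary(data))
# {'q1': 3.0, 'median': 5.0, 'q3': 7.0, 'iqr': 4.0, 'outliers': [30]}
```
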
- Central tendency
- The center of a distribution — mean, median, or mode.
- Skewness
- The asymmetry of a distribution (left- or right-skewed).
- Right-skewed distribution
- A long tail to the right; the mean typically sits above the median.
- Left-skewed distribution
- A long tail to the left; the mean typically sits below the median.
- Normal distribution
- A symmetric bell curve; ~68% of values within 1 standard deviation.
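
The ~68% figure above can be checked numerically. A minimal sketch using `statistics.NormalDist` from the standard library; the `share_within` helper is a hypothetical name.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

print(round(share_within(1), 4))  # 0.6827
print(round(share_within(2), 4))  # 0.9545
```

The result does not depend on the particular mean or standard deviation, only on how many standard deviations wide the interval is.
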
- Frequency
- How often a value or category appears in a dataset.
- Correlation
- How two variables move together, from -1 to +1.
- Correlation coefficient
- A number from -1 to +1 measuring linear relationship strength/direction.
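
The correlation coefficient described above is most often the Pearson coefficient. A minimal sketch of that formula, written out by hand so each step is visible; the function name and sample data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (spread_x * spread_y)

hours = [1, 2, 3, 4]
print(pearson_r(hours, [2, 4, 6, 8]))  # close to +1: perfect positive relationship
print(pearson_r(hours, [8, 6, 4, 2]))  # close to -1: perfect negative relationship
```
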
- Positive correlation
- Both variables move in the same direction.
- Negative correlation
- One variable rises as the other falls.
- Causation
- One variable directly causes a change in another; not proven by correlation.
- Correlation vs. causation
- Correlation shows variables move together; it does not prove cause.
- Confounding variable
- A hidden third factor that influences both correlated variables.
- Hypothesis testing
- Deciding whether sample evidence supports a claim about a population.
- Null hypothesis
- The default claim of no effect/difference, which a test tries to reject.
- p-value
- Probability of observing results at least this extreme if the null hypothesis is true.
- Significance level (alpha)
- The threshold (often 0.05) for rejecting the null hypothesis.
- Confidence interval
- A range of plausible values for a population parameter, stated at a chosen confidence level (e.g., 95%).
- Regression analysis
- Modeling how a dependent variable changes with predictors.
- Linear regression
- Fitting a straight line to model a relationship between variables.
- Trend analysis
- Identifying the direction of data over time.
- Time-series analysis
- Analyzing data points ordered by time to find patterns/seasonality.
- Cohort analysis
- Comparing groups that share a characteristic over time.
- Exploratory data analysis (EDA)
- Initial investigation to summarize and visualize data's main features.
- Descriptive analytics
- Analytics that summarizes what happened (reports, KPIs).
- Diagnostic analytics
- Analytics that explains why something happened.
- Predictive analytics
- Analytics that forecasts what is likely to happen.
- Prescriptive analytics
- Analytics that recommends what action to take.
- Analytics maturity ladder
- Descriptive → diagnostic → predictive → prescriptive (rising value).
- Outlier (analysis)
- An extreme value that can distort the mean and other statistics.
- Variability
- How spread out the values in a dataset are.
- Weighted average
- An average where some values count more than others.
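
A weighted average multiplies each value by its weight, sums the products, and divides by the total weight. A minimal sketch; the function name, the scores, and the weights are hypothetical.

```python
def weighted_average(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

scores = [80, 90, 70]  # hypothetical exam scores
weights = [1, 1, 2]    # the final exam counts double
print(weighted_average(scores, weights))  # 77.5
print(sum(scores) / len(scores))          # 80.0, the unweighted mean
```
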
- Statistical significance
- A result unlikely to be due to chance alone.
- Sample vs. population
- A sample is a subset; a population is the entire group of interest.
- Bar chart
- Compares values across distinct categories (bars have gaps).
- Column chart
- A vertical bar chart comparing categories.
- Line chart
- Shows a trend in a value over time.
- Pie chart
- Shows parts of a whole as slices; best for few categories.
- Stacked bar chart
- Shows parts of a whole within each category bar.
- Scatter plot
- Plots two numeric variables as points to show their relationship.
- Bubble chart
- A scatter plot encoding a third variable as point size.
- Histogram
- Shows the distribution of one continuous variable using bins (bars touch).
- Box plot
- Shows a distribution's median, quartiles, and outliers.
- Pareto chart
- Bars ordered largest-to-smallest with a cumulative line (80/20).
- Heat map
- Uses color intensity to show magnitude across two dimensions.
- Tree map
- Shows hierarchical parts of a whole as nested rectangles.
- Waterfall chart
- Shows how an initial value rises/falls through sequential changes.
- Geographic / choropleth map
- Shows data by location using shading or markers.
- Gantt chart
- A bar chart of tasks over time, used for project schedules.
- Bar vs. histogram
- Bar compares categories (gaps); histogram shows distribution (bars touch).
- Chart selection
- Match the chart to the goal: compare, trend, part-of-whole, relationship, distribution.
- Dashboard
- An interactive at-a-glance display of key metrics and visuals.
- KPI
- Key Performance Indicator — a measurable value showing progress to a goal.
- Metric
- A quantifiable measure used to track performance.
- Drill-down
- Navigating from a summary into more detailed data.
- Filter (visualization)
- Limiting a view to data meeting a condition.
- Ad hoc report
- A one-off report created to answer a specific question.
- Recurring report
- A scheduled report produced regularly for monitoring.
- Self-service analytics
- Tools letting non-analysts explore data on their own.
- Data storytelling
- Communicating insight with a clear narrative around the visuals.
- Chartjunk
- Decorative clutter that distracts from a chart's message.
- Misleading axis
- Truncating or distorting an axis to exaggerate differences.
- Truncated y-axis
- A bar-chart axis not starting at zero, overstating differences.
- Annotation
- A note added to a chart to highlight or explain a point.
- Legend
- A key explaining the colors/symbols used in a chart.
- Trend line
- A line summarizing the direction of points in a scatter plot.
- Outlier (visualization)
- A point far from the others, easily spotted on scatter/box plots.
- Audience-appropriate reporting
- Tailoring detail and KPIs to who will read the report.
- Sparkline
- A tiny inline chart showing a trend without axes.
- Gauge chart
- A dial showing a single value against a target range.
- Funnel chart
- Shows values dropping through sequential stages (e.g., a sales funnel).
- Color encoding
- Using color to represent a category or magnitude in a visual.
- Accessibility (visuals)
- Designing charts readable by all, e.g., colorblind-safe palettes.
- Report vs. dashboard
- Reports are detailed/static; dashboards are summary/interactive.
- Data governance
- Policies, roles, and standards controlling data across its lifecycle.
- Data owner
- The person accountable for a data domain and its policy.
- Data steward
- The person responsible for day-to-day data quality and proper use.
- Data custodian
- The role managing the technical storage and security of data.
- Master Data Management (MDM)
- Maintaining one authoritative 'golden record' per core entity.
- Golden record
- The single, trusted version of an entity across all systems.
- Data catalog
- An organized inventory of data assets with descriptions/metadata.
- Data lineage
- A record of data's origin and how it moves/transforms through systems.
- Data dictionary
- Documentation defining each field's name, type, and meaning.
- Data quality
- The degree to which data is fit for its intended purpose.
- Accuracy (quality)
- Values correctly reflect the real-world fact.
- Completeness (quality)
- No required values are missing.
- Consistency (quality)
- Values agree across systems and records.
- Timeliness (quality)
- Data is current and available when needed.
- Uniqueness (quality)
- No unintended duplicate records exist.
- Validity (quality)
- Values conform to the defined format and rules.
- Integrity (quality)
- Relationships between data are correctly maintained.
- PII
- Personally Identifiable Information — data that can identify an individual.
- PHI
- Protected Health Information — health data covered by HIPAA.
- Data classification
- Labeling data by sensitivity to apply the right controls.
- Data masking
- Replacing sensitive values with realistic fakes for safe use.
- Anonymization
- Irreversibly removing or altering identifiers so individuals cannot be re-identified.
- Pseudonymization
- Replacing identifiers with tokens or codes that can be re-linked to the person only by using separately held information.
- Encryption at rest
- Encrypting stored data so it can't be read without the key, even if the storage itself is accessed.
- Encryption in transit
- Encrypting data as it moves across a network.
- Access control
- Limiting who can view or change data (least privilege).
- Least privilege
- Granting only the access a role actually needs.
- Role-based access control (RBAC)
- Granting permissions based on a user's role.
- Data retention
- Keeping data only as long as required by policy or law.
- Data disposal
- Securely destroying data that is no longer needed.
- GDPR
- EU General Data Protection Regulation governing personal-data privacy.
- HIPAA
- Health Insurance Portability and Accountability Act: the U.S. law protecting health information (PHI).
- PCI-DSS
- Payment Card Industry Data Security Standard: the security standard for handling payment-card data.
- CCPA
- The California Consumer Privacy Act governing consumer data rights.
- Compliance
- Adhering to laws, regulations, and internal policies for data use.
- Data privacy
- The right to control how personal data is collected and used.
- Data security
- Protecting data from unauthorized access, change, or loss.
- Audit trail
- A record of who accessed or changed data and when.
- Data breach
- Unauthorized access to or disclosure of protected data.
- Data ethics
- Responsible, fair, and transparent use of data.
- Consent
- A person's permission to collect and use their personal data.
- Data minimization
- Collecting only the personal data actually needed.
- Master data
- The core, shared entities of a business (customers, products, suppliers).
- Reference data
- Standardized lookup values used across systems (e.g., country codes).
- Database vs. data warehouse
- A database runs operations (OLTP); a warehouse is built for analysis (OLAP).
- ER diagram
- Entity-Relationship diagram — a visual model of tables and their relationships.
- Cardinality
- The nature of a relationship between tables (one-to-one, one-to-many, many-to-many).
- Index (database)
- A structure that speeds up lookups on a column at the cost of extra storage and slower writes.
- Query
- A request to retrieve or manipulate data, typically written in SQL.
- ACID properties
- Atomicity, Consistency, Isolation, Durability — guarantees of reliable transactions.
- Data type
- The kind of value a field holds (integer, string, date, boolean, float).
- Boolean data
- A value that is either true or false.
- Streaming data
- Data generated continuously in real time (e.g., sensor or clickstream data).
- Data integrity
- Maintaining accuracy and consistency of data over its lifecycle.
- Schema
- The defined structure of a database — its tables, fields, types, and relationships.
- Surrogate key
- An artificial unique key (e.g., an auto-number) with no business meaning.
- Granularity
- The level of detail stored in a dataset (e.g., daily vs. monthly).
- Dimensional modeling
- Designing a warehouse around facts and dimensions for analysis.
- Data wrangling
- The hands-on work of cleaning and reshaping raw data for analysis.
- Data pipeline
- An automated series of steps that moves and transforms data.
- Source system
- The original system where data is created or first captured.
- Staging area
- A temporary store where data lands before transformation/loading.
- Data enrichment
- Adding value to data by combining it with additional sources.
- Concatenation
- Joining text fields end to end (e.g., first + last name).
- Binning
- Grouping continuous values into ranges (bins/buckets).
- Encoding (categorical)
- Converting categories into numeric form for analysis/modeling.
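
Binning and categorical encoding, as defined above, are both small transformations. A minimal sketch, assuming right-open bins defined by ascending cut points and one-hot encoding as the encoding scheme (one of several possible schemes); the helper names, cut points, and labels are hypothetical.

```python
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Return the label of the bin holding value; edges are ascending cut points."""
    # len(labels) must be len(edges) + 1; a value equal to an edge goes to the upper bin.
    return labels[bisect_right(edges, value)]

def one_hot(category, categories):
    """Encode a category as a 0/1 vector with a single 1 at its position."""
    return [1 if category == c else 0 for c in categories]

age_edges = [18, 65]                      # hypothetical cut points
age_labels = ["minor", "adult", "senior"]
print([bin_value(a, age_edges, age_labels) for a in [17, 18, 40, 65]])
# ['minor', 'adult', 'adult', 'senior']

colors = ["blue", "green", "red"]
print(one_hot("red", colors))  # [0, 0, 1]
```
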
- Survey data
- Data collected directly from respondents via questionnaires.
- Observational data
- Data collected by watching/measuring without intervention.
- Structured query
- A precise data request using SQL filters, joins, and aggregations.
- Left join
- Returns all rows from the left table plus matches from the right.
- Union
- Stacks the rows of two result sets with matching columns (in SQL, UNION drops duplicate rows; UNION ALL keeps them).
- Cross-validation
- Repeatedly splitting data into training and validation subsets to test how well a model generalizes.
- Decision tree
- A model that splits data on feature values to classify or predict.
- Inferential statistics
- Using a sample to draw conclusions about a population.
- Descriptive vs. inferential
- Descriptive summarizes data; inferential generalizes to a population.
- t-test
- A test for whether two group means differ significantly.
- Chi-square test
- A test for association between two categorical variables.
- Dependent variable
- The outcome being measured or predicted.
- Independent variable
- An input thought to influence the outcome.
- R-squared
- The share of variance in the outcome explained by a regression model.
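
R-squared follows directly from a fitted line: one minus the ratio of unexplained to total variation in the outcome. A minimal sketch, assuming simple (one-predictor) least-squares linear regression; the function names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]  # independent variable (hypothetical)
ys = [2, 4, 5, 4, 5]  # dependent variable (hypothetical)
slope, intercept = fit_line(xs, ys)
print(slope, intercept)                      # roughly 0.6 and 2.2
print(r_squared(xs, ys, slope, intercept))   # roughly 0.6
```

Here the line explains about 60% of the variation in the outcome; the remaining 40% is left in the residuals.
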
- Forecasting
- Predicting future values from historical data.
- Seasonality
- Repeating patterns in time-series data tied to the calendar.
- Moving average
- An average over a sliding window that smooths short-term noise.
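
A simple moving average takes the mean of each consecutive window of values. A minimal sketch, assuming an unweighted window that only produces output once it is full; the function name and sales figures are hypothetical.

```python
def moving_average(values, window):
    """Mean of each consecutive run of `window` values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

daily_sales = [10, 20, 30, 40, 50]     # hypothetical series
print(moving_average(daily_sales, 3))  # [20.0, 30.0, 40.0]
```

A wider window smooths more noise but reacts more slowly to real changes in the series.
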
- Bias (statistical)
- A systematic error that skews results away from the truth.
- Margin of error
- The range of uncertainty around a survey or estimate.
- Distribution
- How values of a variable are spread across their range.
- Aggregate function
- A calculation over many rows: SUM, COUNT, AVG, MIN, MAX.
- Key driver analysis
- Identifying which variables most influence an outcome.
- Stacked area chart
- Shows the trend of parts of a whole over time.
- Combo chart
- Combines two chart types (e.g., bars + line) on one axis set.
- Dual-axis chart
- Plots two measures with different scales on separate y-axes.
- Data label
- Text on a chart showing a point's exact value.
- Axis scale
- The range and intervals of a chart axis (linear or log).
- Categorical axis
- An axis listing discrete categories (e.g., regions).
- Continuous axis
- An axis representing a numeric range.
- Visualization best practice
- Clear labels, honest scales, the right chart, minimal clutter.
- Highlighting
- Emphasizing key data points with color or annotation.
- Interactive report
- A report users can filter, sort, and drill into.
- Static report
- A fixed report that does not change after it is produced.
- Executive dashboard
- A high-level dashboard of strategic KPIs for leaders.
- Operational dashboard
- A real-time dashboard for monitoring day-to-day activity.
- 3-D chart pitfall
- Adding 3-D effects that distort proportions and mislead readers.
- Data lifecycle
- The stages of data from creation through use, storage, and disposal.
- Data quality dimension
- An attribute used to measure quality (accuracy, completeness, etc.).
- Data quality rule
- A defined check data must pass (e.g., 'email must contain @').
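
Data quality rules like the one above can be expressed as named checks run against each record. A minimal sketch; the email rule comes from the entry, while the age rule, the `check_rules` helper, and the sample records are hypothetical.

```python
def check_rules(record, rules):
    """Return the names of the rules this record fails."""
    return [name for name, rule in rules.items() if not rule(record)]

RULES = {
    # Rule from the glossary entry: a deliberately loose format check.
    "email must contain @": lambda r: "@" in (r.get("email") or ""),
    # Hypothetical second rule, added for illustration.
    "age must be between 0 and 120": lambda r: isinstance(r.get("age"), int)
    and 0 <= r["age"] <= 120,
}

good = {"email": "ana@example.com", "age": 30}
bad = {"email": "ana.example.com", "age": 130}
print(check_rules(good, RULES))  # []
print(check_rules(bad, RULES))   # both rule names
```

Counting failures per rule across a whole dataset gives a simple data quality assessment against the validity dimension.
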
- Stewardship
- Day-to-day responsibility for a data domain's quality and use.
- Regulatory compliance
- Meeting legal requirements for how data is handled.
- Sensitive data
- Data requiring extra protection (PII, PHI, financial, credentials).
- Tokenization
- Replacing sensitive data with a non-sensitive token reference.
- Data sovereignty
- The principle that data is subject to the laws of where it is stored.
- Privacy by design
- Building privacy protections into systems from the start.
- Right to be forgotten
- A GDPR right to have one's personal data erased.
- Data quality assessment
- Measuring data against quality dimensions and rules.
- Confidentiality
- Ensuring data is accessible only to authorized parties.
- Availability (data)
- Ensuring data is accessible to authorized users when needed.
- Data archiving
- Moving inactive data to long-term, lower-cost storage.