From Documentation Theater to Systematic Enforcement: Rethinking Data Classification
The Problem Nobody Wants to Admit
Your organization probably has a data governance policy. It might live in Confluence, in Alation, in a carefully maintained spreadsheet. It documents what counts as PII, which columns require masking, which regulations apply. Someone worked hard on it.
And in your actual data systems — the BigQuery tables, the Snowflake schemas, the S3 buckets — most of that policy is not enforced. Tags drift. Schemas change without review. A “PII-free” test environment gets seeded from production. A sensitive column gets duplicated into a downstream reporting table that nobody thought to classify.
The gap is not that organizations lack policies. It is that policy lives in one place and data lives in another, with nothing connecting them automatically. Governance depends on humans remembering to apply it — and human processes do not scale, do not stay current, and do not survive organizational change.
The fix is not better documentation. It is treating policy enforcement as an engineering problem: a central classification store that connects policy definitions to the systems that act on them, CI gates that make violations fail the pipeline, and drift detection that surfaces everything that slipped through before any of this existed.
The Architecture
The core idea is to replace the human-in-the-loop documentation cycle with a continuous automated loop — the system discovers what is sensitive, CI enforces policy at code-ship time, and drift detection catches what slips through.
```mermaid
flowchart TD
    PM([Policy Makers]) -->|define policies| GovTools[Governance Tools<br/>Alation · Collibra · Confluence]
    GovTools -->|sync to| F
    A([Developer]) -->|pushes code| B[CI Pipeline]
    B --> C{Policy checks pass?}
    C -- No: fail --> A
    C -- Yes --> D[Deploy to Production]
    D --> E[Scheduled Discovery]
    E --> F[(Central Classification Store)]
    F --> G[Drift Detection]
    G -- violations --> H[Alert]
    H -. refine policies .-> PM
    F -- informs checks --> B
    style F fill:#4285F4,color:#fff
    style H fill:#EA4335,color:#fff
    style C fill:#FBBC04,color:#000
    style PM fill:#34A853,color:#fff
    style GovTools fill:#34A853,color:#fff
```
Three things make this work. The central classification store is the single source of truth — everything writes to it and reads from it. CI uses accumulated classification knowledge to catch violations before production. Drift detection runs on a schedule to catch what CI cannot: manual console changes, emergency hotfixes, and newly landed data.
Policy makers sit outside this loop but feed into it: they define the rules in whatever governance tool the organization already uses, and those rules sync into the classification store where automated systems can act on them.
The Central Classification Store
A common design mistake is putting everything in one place — rules, scan results, and confirmed state all mixed together. These three concerns have different owners, different change frequencies, and different consumers.
```mermaid
flowchart TD
    PR[Policy Registry<br/>What the rules ARE<br/>Owner: Governance team<br/>Managed in Git] -->|informs checks| Gate[CI Policy Gate]
    Scanners[Discovery Scanners<br/>Structured · JSON · Unstructured] -->|append findings| Findings[Scan Findings<br/>What the system DISCOVERED<br/>Owner: Automated scanners]
    Findings -->|high-confidence auto-promote| Classification[Confirmed Classifications<br/>What each asset IS CONFIRMED to contain<br/>Owner: Human-reviewed output]
    Findings -->|medium/low → review queue| Review[Human Review]
    Review -->|approved| Classification
    Classification -->|known-sensitive list| Gate
    Classification -->|expected state| Drift[Drift Detection]
    style PR fill:#34A853,color:#fff
    style Classification fill:#4285F4,color:#fff
    style Findings fill:#FBBC04,color:#000
```
Policy Registry — the authoritative definition of what sensitive data types exist, what controls they require, and which regulation drives the requirement. Managed like infrastructure code — one update propagates to all downstream enforcement automatically.
Scan Findings — an append-only log written exclusively by automated scanners. Humans never write to it directly. Raw evidence: what was found, where, when, how, and with what confidence.
Confirmed Classifications — the reviewed state of what each asset actually contains. High-confidence findings are auto-promoted here; lower-confidence findings go to a human review queue. This is what drift detection compares against and what CI reads for its known-sensitive list.
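The promotion rule described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; the `Finding` class and the confidence labels are hypothetical stand-ins for whatever schema the store actually uses:

```python
from dataclasses import dataclass

# Hypothetical finding record, mirroring the JSON example shown later
# in this article. Field names are illustrative.
@dataclass
class Finding:
    asset: str
    pii_type: str
    confidence: str  # "HIGH" | "MEDIUM" | "LOW"
    status: str = "pending"
    reviewed_by: str = ""

def promote(finding: Finding) -> Finding:
    """Auto-promote high-confidence findings to confirmed state;
    route everything else to the human review queue."""
    if finding.confidence == "HIGH":
        finding.status = "confirmed"
        finding.reviewed_by = "auto-promoted"
    else:
        finding.status = "needs_review"
    return finding
```

The point is the asymmetry: scanners can only ever write findings, and only this promotion path (or a human reviewer) can produce a confirmed classification.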
What the store actually is
“Central classification store” is a concept, not a specific technology. What it looks like in practice depends on where you are in the adoption curve.
For many teams, it starts as a metadata-as-code repository — a set of YAML or JSON files in Git that declare what each sensitive asset contains and what controls apply. Version-controlled, diff-able, and reviewed like any other code change. Simple to start, easy to audit.
As coverage grows, it often migrates to a lightweight database (Postgres is common) that scanners write to and CI queries against via a small internal API. This makes it easier to query across large asset inventories and to build the auto-promotion logic for high-confidence findings.
Some teams skip the custom build entirely and use the API of an existing data catalog — Alation, Collibra, or a cloud-native catalog like Google Dataplex — as the store, with a thin adapter layer that translates between the catalog’s data model and what the scanners and CI checks expect.
Whichever form it takes, a single finding should carry enough context to be actionable. For example:
```json
{
  "asset": "warehouse.customers.email",
  "pii_type": "email_address",
  "detected_by": "regex",
  "confidence": "HIGH",
  "status": "confirmed",
  "reviewed_by": "auto-promoted",
  "detected_at": "2026-02-15T08:00:00Z",
  "required_tag": "projects/my-org/taxonomies/pii/policyTags/email"
}
```
The status field is what the rest of the system acts on: CI checks for confirmed findings before allowing a deployment, and drift detection alerts when a confirmed asset no longer has its required_tag applied in the live environment.
The catalog tool, if you have one, should read from confirmed classifications — not be the primary place humans write to.
Working with existing governance tools
Most enterprises already have policies documented somewhere — a Confluence page, a catalog like Alation or Collibra, a spreadsheet that one person maintains. The policy registry is not a replacement for those. It is a sync point: a machine-readable version of whatever already exists, structured so that scanners and CI gates can query it programmatically.
How you populate it depends on what you have. Some teams export from Alation via API and transform the output. Others parse policy documentation from Confluence. Some maintain it directly in Git as YAML or JSON files. What matters is that there is one reliable, queryable place downstream tooling reads from — not ten.
A word on bidirectional push. Several enterprise tools — Alation, Collibra, some cloud-native catalogs — offer the ability to push policy controls directly to Snowflake, BigQuery, or other data systems. This sounds convenient. It is not a good idea.
When a governance tool pushes directly to production systems, it bypasses the normal code review and deployment process. Changes happen outside CI, without an audit trail in version control, and without going through the checks your team built. It also splits ownership: who is responsible when a policy push conflicts with a schema change in progress?
The better model: governance tools own defining policy, engineering teams own enforcing it. Policy information flows from Alation or Collibra into the policy registry — via sync, export, or manual update in Git — and from there it is enforced through the CI pipeline like any other infrastructure change. Each team controls their own data provisioning process. The governance tool becomes an input, not an operator.
Discovery: Know What You Have
Sensitive data lives in different forms, and no single tool covers all of them well. A regex scanner is fast and cheap for structured columns but blind to what is inside a JSON blob. A cloud DLP service handles PDFs and images with OCR but may be expensive to run against every table. An LLM can reason about ambiguous cases but is overkill for a column named email. The right approach is to match the scanning method to the data type — and accept that different teams or environments may use different tools entirely.
What matters more than the choice of tool is where findings land. Every scanner, whatever it is, should write to the same central classification store. That is what makes the rest of the system work: CI and drift detection have one place to read from, regardless of how the data was discovered.
Structured data (warehouse columns) — Column names catch the obvious cases. Value sampling catches legacy or obfuscated names. An LLM call handles what neither name nor values make clear. Each layer adds confidence, and findings accumulate in the store over time.
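The layered approach for structured columns can be sketched as a short classifier. This is a minimal sketch under stated assumptions: the `NAME_PATTERNS` and `VALUE_PATTERNS` tables are hypothetical, and the LLM escalation step is left out entirely:

```python
import re

# Hypothetical detection tables; a real deployment would carry many more
# PII types and tuned patterns.
NAME_PATTERNS = {"email_address": re.compile(r"e[-_]?mail", re.I)}
VALUE_PATTERNS = {"email_address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")}

def classify_column(name, sample_values):
    """Layer 1: column name. Layer 2: sampled values.
    Returns None when neither layer fires (escalate to an LLM in practice)."""
    for pii_type, pat in NAME_PATTERNS.items():
        if pat.search(name):
            return {"pii_type": pii_type, "detected_by": "column_name",
                    "confidence": "HIGH"}
    for pii_type, pat in VALUE_PATTERNS.items():
        hits = sum(bool(pat.match(v)) for v in sample_values)
        if sample_values and hits / len(sample_values) > 0.8:
            return {"pii_type": pii_type, "detected_by": "value_sample",
                    "confidence": "MEDIUM"}
    return None
```

Note the confidence asymmetry: a name match is high confidence, a value-sample match is medium and therefore lands in the review queue rather than being auto-promoted.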
Semi-structured data (JSON blobs) — Generic columns like metadata or event_payload are a common blind spot. Scanning their contents often surfaces PII that has no schema-level signal at all. When something is found inside a blob, the right remediation is usually to extract it into a typed column — not to mask the whole blob and move on.
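Scanning a blob column can be as simple as walking the parsed payload and pattern-matching leaf values. A rough sketch, using a single email regex as the only detector; a real scanner would carry a fuller pattern set and record findings with JSON-path precision so the extraction remediation knows exactly which field to pull out:

```python
import json
import re

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def scan_blob(blob: str):
    """Recursively walk a JSON payload and report the paths of
    string values that look like PII."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")
        elif isinstance(node, str) and EMAIL.search(node):
            hits.append((path, "email_address"))

    walk(json.loads(blob), "")
    return hits
```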
Unstructured data (files, documents) — This is where most organizations have the least coverage and the most exposure. DLP tools handle a wide range of file types, but even a simple bucket-level scan that flags which storage locations contain sensitive content is far better than no visibility at all. The precision is lower than column-level tagging, but the alternative is flying blind.
CI/CD Gates: Policy as Code
Every proposed change goes through a policy gate before reaching production — the same pattern already standard for infrastructure provisioning. Infrastructure should be safe by default: S3 buckets are private unless explicitly opened, IAM policies require least-privilege review, TLS is non-negotiable. These constraints live in code, enforced automatically, not by asking someone to remember.
Data governance is heading in the same direction but is earlier in that journey. Most teams still rely on process and documentation to keep sensitive data properly controlled. CI gates are how you close that gap — making policy violations fail the pipeline the same way a misconfigured security group would.
Here is what this looks like in practice. A developer opens a PR that adds a user_id column to the orders table. The CI gate runs, queries the classification store, and finds that user_id is a confirmed sensitive field — but the new column definition has no policy tag attached. The PR gets a check failure with a message:
```
Policy gate: Column `orders.user_id` is a known sensitive field
(user_id_internal) but is missing a required classification tag.
Add the tag to `metadata.yaml` or mark this as a reviewed exception.
```
The developer adds the tag, pushes again, the check passes. The entire interaction takes two minutes and happens inside the normal PR workflow — no ticket to a governance team, no email thread, no waiting. Governance stops being someone else’s problem because the feedback arrives exactly where the developer is already working.
Three checks cover the most common patterns:
Schema check — Verifies sensitive columns have policy tags or masking rules before a schema change is deployed. Also flags generic blob columns that could be hiding nested PII without a schema-level signal.
SQL lineage check — Scans transformation definitions in the repo. Fails if a sensitive source table flows into an open reporting layer without an explicit masking or filtering step in between. This is the most technically demanding of the three checks — it requires parsing SQL to trace column-level lineage, understanding not just which tables are referenced, but whether a masked column is being passed through unmasked in a downstream view or model. A few starting points: SQLGlot handles AST traversal across SQL dialects; dbt-checkpoint enforces model-level constraints at commit time for dbt teams; SQLMesh has column-level lineage built in. For teams without any of these, a lighter fallback is to scan SQL files for known sensitive table names and flag references that lack a corresponding masking function in the same query.
Infrastructure config check — Validates that storage resources designated for sensitive content have the right access controls set before they are provisioned.
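Of the three, the schema check is the easiest to prototype. A hedged sketch, assuming the proposed schema and the known-sensitive list from the classification store have already been loaded as plain dictionaries:

```python
# Hypothetical CI schema gate: compare a proposed schema against the
# classification store's known-sensitive list and fail the check when a
# sensitive column lacks its required policy tag.
def policy_gate(schema, known_sensitive):
    """schema: {column_name: {"policy_tag": str | None, ...}}
    known_sensitive: {column_name: required_tag}, read from the store's
    confirmed classifications. Returns a list of failure messages;
    a non-empty list fails the pipeline."""
    failures = []
    for column, spec in schema.items():
        required = known_sensitive.get(column)
        if required and spec.get("policy_tag") != required:
            failures.append(
                f"Policy gate: column {column!r} requires tag {required!r} "
                f"but has {spec.get('policy_tag')!r}"
            )
    return failures

# Example: user_id is confirmed sensitive but ships without a tag.
gate_result = policy_gate(
    {"user_id": {"policy_tag": None}, "amount": {}},
    {"user_id": "taxonomies/pii/policyTags/user_id"},
)
```

In a real pipeline this runs as a PR check, with `known_sensitive` fetched from the store's API rather than passed in directly.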
Policy checks on schema content are only half the picture. The other half is controlling how schemas change in the first place. Tools like Atlas bring database migrations into the CI pipeline as versioned, reviewable changes — so schema evolution goes through the same gate as everything else, rather than being applied directly against a live database outside any review process.
Handling False Positives
Automated scanners — especially regex and LLM-based ones — will misclassify things. A column named customer_score might match a pattern for sensitive data. A JSON field called phone_model is not a phone number. If the CI gate blocks on these and developers have no recourse, the system quickly becomes an obstacle rather than a safety net. People start routing around it, which is worse than not having it at all.
The solution is an explicit exception pattern. When a developer believes a finding is a false positive, they should be able to mark it as such — not by removing the check, but by adding a structured override directly in the codebase: a metadata annotation on the column, an entry in a version-controlled exceptions file, or a suppression comment tied to a specific finding ID. The key properties are that the override is visible (it lives in the repo, not in someone’s inbox), reviewable (it goes through the normal PR process), and auditable (the classification store records it with a reason and a reviewer).
This keeps the pipeline fluid without creating a silent bypass. A legitimate exception is just another code change — traceable, reversible, and subject to the same review as anything else. Over time, the pattern of exceptions also feeds back into improving the scanner: recurring false positives on certain column names or patterns are a signal to tune the detection logic rather than accumulate suppressions.
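One way to implement the override, sketched with a hypothetical in-repo exceptions mapping keyed by finding ID; in practice this would be a version-controlled YAML or JSON file reviewed through the normal PR process:

```python
# Hypothetical exceptions file contents, loaded from the repo. Each entry
# must carry a reason and a reviewer so the override is auditable.
EXCEPTIONS = {
    "finding-123": {"reason": "customer_score is a ranking, not PII",
                    "reviewer": "alice"},
}

def triage(findings):
    """Split scanner findings into actionable ones and reviewed
    false positives, based on the exceptions mapping."""
    actionable, suppressed = [], []
    for f in findings:
        if f["id"] in EXCEPTIONS:
            suppressed.append({**f, "status": "false_positive",
                               **EXCEPTIONS[f["id"]]})
        else:
            actionable.append(f)
    return actionable, suppressed
```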
Data Contracts
Data contracts are a natural extension of this model. A contract is an explicit agreement between a data producer and its consumers about what a dataset contains — its schema, field types, quality guarantees, and sensitivity levels. When a producer declares in a contract that a field is pii_type: email_address, that declaration needs to be verifiable, not just documented.
The classification store is what makes it verifiable. The CI gate checks that the schema being shipped actually matches the sensitivity levels declared in the contract. Drift detection checks that the live data continues to match over time. A contract without this feedback loop is just another form of documentation theater — it describes intent, but nothing enforces it automatically.
Drift Detection: Catch What CI Misses
CI gates are the right long-term mechanism, but they only govern what passes through the pipeline going forward. In practice, most enterprises are only beginning to roll out CI gates — which means the bulk of existing data assets were provisioned before any gate existed. Tables created years ago, buckets set up by hand, schemas that predate any governance process: none of these ever went through a policy check. Drift detection is how you deal with that reality. Rather than requiring a full remediation effort before you can adopt the new model, you run scheduled checks against live state, surface what is out of compliance, and clean it up incrementally while CI prevents new violations from being added.
There is also a subtler gap that drift detection covers, even for teams that already have CI in place. A common assumption is that Terraform handles this — and it does, but for a different concern. Terraform detects infrastructure drift: live cloud resources diverging from what is declared in .tf files. Governance drift is a layer above that. Someone can remove a policy tag from a BigQuery column via the cloud console without changing a single line of Terraform. An emergency hotfix can land through a side channel. A new table can be populated by a migration script that bypassed review. The infrastructure is exactly as Terraform expects — but the policy controls are not. That gap is what governance drift detection is for.
```mermaid
flowchart LR
    Declared[Classification Store<br/>Expected State] --> Compare[Reconcile]
    Live[Live Cloud State] --> Compare
    Compare -->|in sync| OK[Compliant]
    Compare -->|gap found| Alert[Alert and Remediate]
```
The mechanism is straightforward: scheduled jobs compare live cloud state against what the classification store says should be true, and alert on any gap. For data assets, that means checking whether policy tags on sensitive columns are still in place, and whether storage resources designated for sensitive content still have the right access controls. Both types of check share the same pattern — expected state comes from the classification store, not from hardcoded configuration — which means there is always one authoritative source and no ambiguity about what “correct” looks like.
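Once both sides are fetched, the reconcile step itself reduces to a dictionary comparison. A sketch, with the classification store and the cloud provider's live metadata both stubbed as plain dicts; a real job would page through the provider's tagging API to build the `live` side:

```python
# Hypothetical drift check: expected tags come from confirmed
# classifications, live tags from the cloud metadata API (stubbed here).
def detect_drift(expected, live):
    """expected/live: {asset: policy_tag}. Returns assets whose live tag
    is missing or differs from the confirmed classification."""
    drifted = []
    for asset, required_tag in expected.items():
        if live.get(asset) != required_tag:
            drifted.append({
                "asset": asset,
                "expected": required_tag,
                "actual": live.get(asset),  # None when the tag was removed
            })
    return drifted
```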
Practical Sequencing
The three pillars — discovery, CI gates, and drift detection — are designed to work together, but you do not need all of them in place before you get value. Trying to instrument everything at once is a reliable way to stall. The better approach is to pick a high-risk domain, get one layer working end to end, and let the classification store grow from there. Each step builds on the previous one: discovery populates the store, CI gates use it to block new violations, and drift detection uses it to surface existing ones.
```mermaid
flowchart LR
    S1[Step 1<br/>Structured Tables<br/>Schema CI check<br/>Value scanning] --> S2[Step 2<br/>JSON Columns<br/>Blob scanning<br/>Remediation flags]
    S2 --> S3[Step 3<br/>Object Storage<br/>DLP integration<br/>Bucket policy CI]
    S1 & S2 & S3 --> Store[(Classification Store<br/>grows with each step)]
```
Step 1 — Structured tables first. Schema CI check and value scanning. Seeds the classification store and starts changing developer habits.
Step 2 — JSON columns next. Scan event and log tables — usually where the biggest surprises are. Flag findings for schema remediation and extend CI to warn on generic blob columns.
Step 3 — Object storage last. DLP handles the hard detection work; you are mostly wiring its output to the existing classification store and enforcing bucket policies in CI.
At every stage, mine warehouse query logs to understand which tables and buckets are actually accessed and by whom. Usage patterns give you a risk-ranked backlog far more reliable than any documentation exercise.
Conclusion
The gap between documented governance and actual governance is architectural, not technological. The tools exist. The problem is that classification still depends on humans filling in forms, enforcement depends on humans following procedures, and policy lives in Confluence while the data lives somewhere else entirely.
The model proposed here is different in one key way: the central classification store becomes the connective tissue. Policy makers define rules in whatever governance tool they already use — Alation, Collibra, a Git-managed registry — and those rules sync into a single queryable place. Scanners write findings to the same store. CI gates read from it to block new violations. Drift detection reads from it to surface existing ones. Everything is connected, and the connection is systematic rather than procedural.
Nor is this a big-bang transformation. Most organizations already have governance tools, existing policies, and some CI infrastructure. The work is to wire them together incrementally — structured data first, JSON and unstructured data later — letting coverage grow as the store grows. Drift detection carries its own weight from day one by exposing what the current state actually is, which is often the most useful thing to know.
Two principles worth holding onto as you go: govern by origin, not environment — data derived from production carries production-level classification regardless of where it lands. And treat direct pushes from governance tools as a red flag — when a catalog pushes policy changes directly to a data system, it bypasses the pipeline that your team built and owns. The governance tool should be an input to your process, not an operator of it.
The catalog stops being what humans maintain and starts being what the system generates. That is the shift worth making.
Working Example
A minimal implementation of all three pillars — classification store, CI gates, and drift detection — is available at github.com/tongqqiu/policy-as-code-example. Clone it and run make demo to see the checks in action locally with no credentials required. BigQuery is used as the reference warehouse, but the pattern applies to any system that exposes schema metadata.