Owner: Vadim Rudakov, rudakow
Overview¶
This document outlines the development plan for extracting the validation scripts from tools/scripts/ into a standalone, reusable Python package. See ADR 26012 for the architectural decision and rationale.
Motivation: Why Extract?¶
The Problem¶
As the number of documentation repositories grows, maintaining identical validation logic across them becomes unsustainable:
Copy-Paste Drift: Scripts copied between repos diverge over time as fixes are applied inconsistently.
Duplicated Testing: Each repo maintains its own test suite for identical logic.
Inconsistent Behavior: Different repos may have subtly different validation rules, causing confusion.
Bootstrap Overhead: New repos require significant setup effort to replicate the validation infrastructure.
The Solution¶
A standalone pip package that:
Provides a single source of truth for validation logic
Allows semantic versioning for controlled updates
Integrates seamlessly with pre-commit hooks
Supports per-repo configuration without code changes
Target Consumers¶
ai_engineering_book (this repository) - Primary development testbed
Future documentation repositories - Planned projects using MyST/Jupytext workflow
Community - Potential open-source release for MyST/Jupyter ecosystem
Architecture¶
Package Structure¶
docs-validation-engine/
├── pyproject.toml # Package metadata, dependencies, entry points
├── README.md # Usage documentation
├── LICENSE # MIT or similar
├── src/
│ └── docs_validator/
│ ├── __init__.py # Version, public API
│ ├── cli.py # Click-based CLI dispatcher
│ ├── config.py # Configuration loader (pyproject.toml)
│ ├── core/
│ │ ├── __init__.py
│ │ ├── file_finder.py # Shared file discovery logic
│ │ ├── link_extractor.py
│ │ └── reporter.py # Shared reporting/output logic
│ ├── validators/
│ │ ├── __init__.py
│ │ ├── broken_links.py
│ │ ├── link_format.py
│ │ ├── api_keys.py
│ │ ├── json_files.py
│ │ └── script_suite.py
│ └── sync/
│ ├── __init__.py
│ ├── jupytext_sync.py
│ └── jupytext_verify.py
└── tests/
├── conftest.py # Shared fixtures
├── test_config.py
├── test_broken_links.py
├── test_link_format.py
└── ...Configuration Schema¶
Consuming repositories configure the engine via pyproject.toml:
[tool.docs-validator]
# Directory exclusions (applied to all validators)
exclude_dirs = [
"drafts",
".venv",
"node_modules",
".git",
]
# File exclusions
exclude_files = [
".aider.chat.history.md",
"CHANGELOG.md",
]
# Link strings to ignore (for broken links checker)
exclude_link_strings = [
"example.com",
"placeholder",
"your-domain.com",
]
# Script suite paths (for script_suite validator)
[tool.docs-validator.script_suite]
scripts_dir = "tools/scripts"
tests_dir = "tools/tests"
docs_dir = "tools/docs/scripts_instructions"
excluded_scripts = ["paths.py", "__init__.py"]CLI Interface¶
# Individual validators
docs-validator check-broken-links [--paths PATH...] [--verbose]
docs-validator check-link-format [--paths PATH...] [--fix | --fix-all]
docs-validator check-api-keys [--paths PATH...]
docs-validator check-json-files [--paths PATH...]
docs-validator check-script-suite [--verbose]
# Jupytext operations
docs-validator jupytext-sync [--paths PATH...]
docs-validator jupytext-verify [--paths PATH...]
# Run all validators
docs-validator check-all [--paths PATH...]Pre-commit Integration¶
# .pre-commit-config.yaml in consuming repos
repos:
- repo: https://github.com/username/docs-validation-engine
rev: v0.1.0
hooks:
- id: check-broken-links
types: [markdown]
- id: check-link-format
types: [markdown]
- id: check-api-keys
types_or: [python, markdown, json]
- id: check-json-files
types: [json]
- id: jupytext-sync
types_or: [markdown, jupyter]
- id: jupytext-verify
types_or: [markdown, jupyter]
- id: check-script-suite
types: [python]Development Phases¶
Phase 1: Foundation (Week 1-2)¶
Goal: Create package skeleton with configuration system.
Tasks:
Initialize new repository with
uv initSet up package structure (
src/docs_validator/)Implement
config.py- load frompyproject.tomlCreate CLI skeleton with Click
Add CI/CD pipeline (GitHub Actions)
Write configuration tests
Deliverable: Empty package that reads configuration and provides CLI help.
Phase 2: Core Extraction (Week 3-4)¶
Goal: Extract and refactor existing validators.
Tasks:
Extract shared utilities (
file_finder.py,reporter.py,link_extractor.py)Port
check_broken_links.pywith configuration supportPort
check_link_format.pywith configuration supportPort remaining validators (
api_keys,json_files,script_suite)Migrate existing tests, adapt for new structure
Ensure 100% test coverage
Deliverable: All validators working via CLI with configuration.
Phase 3: Jupytext Integration (Week 5)¶
Goal: Extract notebook synchronization tools.
Tasks:
Port
jupytext_sync.pyPort
jupytext_verify_pair.pyAdd Jupytext as optional dependency
Write integration tests with real notebook fixtures
Deliverable: Full Jupytext support in package.
Phase 4: Pre-commit Hooks (Week 6)¶
Goal: Enable pre-commit integration.
Tasks:
Create
.pre-commit-hooks.yamlin package repoDefine hook entry points for each validator
Test integration with this repository as consumer
Document hook configuration options
Deliverable: Package usable as pre-commit repo source.
Phase 5: Migration & Documentation (Week 7-8)¶
Goal: Migrate this repository to use the package.
Tasks:
Add
docs-validation-engineas dev dependencyUpdate
.pre-commit-config.yamlto use remote hooksRemove local scripts (keep as reference/archive)
Update all documentation to reflect new workflow
Write comprehensive README for the package
Create migration guide for future repos
Deliverable: This repo fully migrated; package ready for v1.0.0.
Design Decisions¶
Why Click for CLI?¶
Industry standard for Python CLIs
Built-in help generation
Easy subcommand composition
Decorator-based, clean syntax
Why pyproject.toml for Configuration?¶
Already present in Python projects
Standard location for tool configuration (
[tool.X])No additional config files needed
Supports complex nested structures
Why Zero External Dependencies (Core)?¶
Follows SVA (Smallest Viable Architecture) principle
Faster installation
No dependency conflicts
Jupytext is optional (only needed for sync commands)
Why Semantic Versioning?¶
Clear compatibility guarantees
Consumers can pin to major versions
Breaking changes are explicit
Risk Mitigation¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Configuration complexity | Medium | Medium | Sensible defaults; minimal required config |
| Breaking changes affecting consumers | Medium | High | Strict semver; deprecation warnings before removal |
| Test coverage gaps during extraction | Low | High | Migrate tests alongside code; CI enforces coverage |
| Performance regression | Low | Medium | Benchmark before/after; optimize hot paths |
| Scope creep | Medium | Medium | Stick to existing functionality; new features after v1.0 |
Success Criteria¶
Functional Parity: All existing validation behavior preserved
Test Coverage: >= 90% line coverage
Documentation: README, migration guide, API docs
Performance: No measurable slowdown vs current scripts
Adoption: This repository successfully migrated
Usability: New repo setup < 15 minutes with package
Open Questions¶
Package Name:
docs-validation-engine,myst-docs-validator,jupytext-validator?Hosting: GitHub (public) or private registry?
License: MIT? Apache 2.0?
Minimum Python Version: 3.11? 3.12? 3.13?