Owner: Vadim Rudakov, rudakow
Version: 0.4.1
Birth: 2026-01-07
Last Modified: 2026-01-24
1. Architectural Overview: The SVA Principle
This script performs fast validation of relative file links within a directory and its subdirectories. While optimized for Markdown files (.md), it can scan any file containing Markdown-style links (e.g., .ipynb).
This tool is designed to serve as a high-quality diagnostic step in CI/CD, providing clear, parsable feedback to automate documentation maintenance.
It adheres to the Smallest Viable Architecture (SVA) principle.
Key Architectural Improvements in v0.4.0
The script handles MyST include directives:
{include} path/to/file.md
2. Key Capabilities & Logic
The script identifies and validates three distinct types of references:
A. Markdown Links
Standard syntax: [text](link) or ![alt](image).
Regex:
r"\[[^\]]*\]\(([^)]+)\)"
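As an illustration, the pattern above can be applied line by line with Python's re module. This is a minimal sketch, not the script's actual code: the function name extract_links and the sample text are assumptions made for the example.

```python
import re

# Pattern quoted from this section; group 1 captures the link target.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(([^)]+)\)")


def extract_links(text: str) -> list[tuple[int, str]]:
    """Return (line_number, target) pairs for every Markdown-style link."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in MARKDOWN_LINK.finditer(line):
            found.append((line_number, match.group(1)))
    return found


sample = "See [intro](./intro.md).\n![logo](/docs/images/logo.png)\n"
print(extract_links(sample))
# [(1, './intro.md'), (2, '/docs/images/logo.png')]
```

Because the pattern only requires the [...](...) shape, it also matches image links: the leading ! is simply not part of the match.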
B. MyST Include Directives
Used for file transclusion. The script identifies targets within MyST code blocks.
Syntax:
{include} path/to/file.md
Regex:
r"```\{include\}([^\n]+)"
Special Handling: The script automatically strips a single leading space (common in MyST formatting) to ensure the path resolves correctly.
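The directive matching and the leading-space handling described above might look like the following sketch. The helper name extract_include_target is hypothetical, and the fence is built from single backticks only so that the example can sit inside a fenced block; the resulting pattern is equivalent to the regex quoted above.

```python
import re

# Three backticks, built programmatically to keep this example fence-safe.
FENCE = "`" * 3
# Fence, then {include}, then the rest of the line as group 1.
INCLUDE_DIRECTIVE = re.compile(re.escape(FENCE) + r"\{include\}([^\n]+)")


def extract_include_target(line: str):
    """Return the include target from a directive line, or None if absent."""
    match = INCLUDE_DIRECTIVE.search(line)
    if match is None:
        return None
    target = match.group(1)
    # Strip exactly one leading space, as described in this section.
    if target.startswith(" "):
        target = target[1:]
    return target


print(extract_include_target(FENCE + "{include} path/to/file.md"))
# path/to/file.md
```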
C. Directory Resolution
If a link points to a directory (e.g., [Intro](./intro/)), the validator marks it as valid only if the directory contains an index file:
index.ipynb
README.ipynb
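A minimal sketch of this directory rule, using pathlib. The function name directory_link_is_valid is an assumption for illustration; only the two index file names come from this section.

```python
import tempfile
from pathlib import Path

# Index file names listed in this section.
INDEX_FILES = ("index.ipynb", "README.ipynb")


def directory_link_is_valid(directory: Path) -> bool:
    """A directory link is valid only if the directory holds an index file."""
    if not directory.is_dir():
        return False
    return any((directory / name).is_file() for name in INDEX_FILES)


with tempfile.TemporaryDirectory() as tmp:
    intro = Path(tmp) / "intro"
    intro.mkdir()
    print(directory_link_is_valid(intro))  # False: no index file yet
    (intro / "README.ipynb").write_text("{}")
    print(directory_link_is_valid(intro))  # True: index file present
```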
Other features:
Git Root Awareness: The script attempts to find the Git project root using git rev-parse --show-toplevel. This allows it to correctly resolve “root-absolute” links (e.g., /docs/images/logo.png) relative to the repository base.
Resolution Logic:
Relative Paths: Resolved relative to the source file.
Root-Relative Paths: Resolved starting from the Git root directory.
Directory Links: Validates that a directory exists and contains an index file (e.g., README.ipynb).
Skips: Automatically ignores external URLs (https://...), email links (mailto:), and internal document fragments (#anchor).
Directory & File Exclusion: Automatically skips common noise directories like .venv and .ipynb_checkpoints.
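The skip and resolution rules listed above can be sketched as two small functions. The names should_skip and resolve_link are hypothetical. Treating http:// like https:// and dropping a trailing #fragment before resolving are assumptions of this sketch, not documented behavior of the script.

```python
from pathlib import Path

# Prefixes treated as "not a local file". http:// is an assumption;
# the text above only names https://, mailto: and #anchor.
SKIPPED_PREFIXES = ("http://", "https://", "mailto:", "#")


def should_skip(link: str) -> bool:
    """True for external URLs, email links and in-document fragments."""
    return link.startswith(SKIPPED_PREFIXES)


def resolve_link(link: str, source_file: Path, project_root: Path) -> Path:
    """Map a local link to a filesystem path without touching the disk."""
    # Assumption: a trailing "#fragment" is not part of the file path.
    path_part = link.split("#", 1)[0]
    if path_part.startswith("/"):
        # Root-relative: anchored at the Git root.
        return project_root / path_part.lstrip("/")
    # Relative: anchored at the directory of the file containing the link.
    return source_file.parent / path_part


root = Path("/repo")
source = root / "docs" / "guide.md"
print(should_skip("https://example.com"))
print(resolve_link("/docs/images/logo.png", source, root))
print(resolve_link("intro.md", source, root))
```

Keeping resolution free of filesystem access makes the rule easy to test; an existence check such as Path.exists() would then be applied to the returned path.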
3. Technical Architecture
The script is organized into specialized classes to maintain clarity:
FileFinder: Handles recursive traversal. Implements exclusion logic for .ipynb_checkpoints and user-defined patterns.
LinkExtractor: Scans file content line by line using regex. It captures both standard Markdown and {include} patterns.
LinkValidator: The core engine. It determines whether a link is an external URL, a fragment, or a local path, then resolves it against the filesystem.
Reporter: Collects BROKEN LINK strings into a temporary file and exits with code 1 if the file is non-empty.
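To illustrate the traversal step, here is a hedged sketch of recursive discovery with directory exclusion. It is written as a plain function, find_files, rather than the FileFinder class itself, whose real interface is not shown in this document. The default exclusion tuple is likewise chosen for the example.

```python
import fnmatch
import os
import tempfile
from pathlib import Path


def find_files(root: Path, pattern: str = "*.md",
               exclude_dirs=(".venv", ".ipynb_checkpoints")) -> list[Path]:
    """Recursively collect files matching pattern, skipping excluded directories."""
    matches = []
    for current, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(Path(current) / name)
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "docs").mkdir()
    (base / ".venv").mkdir()
    (base / "docs" / "guide.md").write_text("# Guide")
    (base / ".venv" / "ignored.md").write_text("# Ignored")
    print([p.relative_to(base).as_posix() for p in find_files(base)])
    # ['docs/guide.md']
```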
4. Operational Guide
Configuration Reference
Primary Script: tools/scripts/check_broken_links.py
Exclusion Logic: Managed via tools/scripts/paths.py (e.g., ignoring .venv, in_progress/, and .ipynb_checkpoints).
Pre-commit Config: .pre-commit-config.yaml
CI Config: .github/workflows/quality.yml
Command Line Interface
check_broken_links.py [--paths PATH] [--pattern PATTERN] [options]
| Argument | Description | Default |
|---|---|---|
| --paths | One or more directories or specific file paths to scan. | . (Current Dir) |
| --pattern | Glob pattern for files to scan. | *.md |
| --exclude-dirs | List of directory names to ignore. | in_progress, pr, .venv |
| --exclude-files | List of specific filenames to ignore. | .aider.chat.history.ipynb |
| --verbose | Shows detailed logs of skipped URLs and valid links. | False |
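The table above could map onto an argparse parser roughly as follows. This is a sketch under the assumption that the script uses argparse with list-valued options; the actual parser in check_broken_links.py may be defined differently. Flag names and defaults are copied from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="check_broken_links.py")
    parser.add_argument("--paths", nargs="+", default=["."],
                        help="Directories or files to scan.")
    parser.add_argument("--pattern", default="*.md",
                        help="Glob pattern for files to scan.")
    parser.add_argument("--exclude-dirs", nargs="+",
                        default=["in_progress", "pr", ".venv"],
                        help="Directory names to ignore.")
    parser.add_argument("--exclude-files", nargs="+",
                        default=[".aider.chat.history.ipynb"],
                        help="File names to ignore.")
    parser.add_argument("--verbose", action="store_true",
                        help="Log skipped URLs and valid links.")
    return parser


args = build_parser().parse_args(["--exclude-dirs", "4_orchestration"])
print(args.exclude_dirs)  # ['4_orchestration']
print(args.paths)         # ['.']
```

With this layout, passing --exclude-dirs replaces the default list instead of extending it, which is consistent with the caution in the Examples section about overridden defaults.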
Manual Execution Commands
Run these from the repository root using uv for consistent environment resolution:
| Task | Command |
|---|---|
| Full Repo Audit (all .md) | uv run tools/scripts/check_broken_links.py |
| Scan Specific Directories | uv run tools/scripts/check_broken_links.py --paths tools/docs/ architecture/ |
| Scan Multiple Files | uv run tools/scripts/check_broken_links.py --paths file1.md file2.md |
| Notebook Audit | uv run tools/scripts/check_broken_links.py --pattern "*.ipynb" |
Examples
cd ../../../
Check all *.md files in the current directory and subdirectories:
check_broken_links.py
Using Git root as project root: ai_engineering_book
Found 72 files in: ai_engineering_book/
✅ All links are valid!
Check all *.ipynb files recursively from the tools/docs directory:
check_broken_links.py --paths tools/docs --pattern "*.ipynb"
Using Git root as project root: ai_engineering_book
Found 19 files in: tools/docs
✅ All links are valid!
Use exclusions (the default exclusions are overridden, so be careful):
check_broken_links.py --exclude-dirs 4_orchestration in_porgress --exclude-files README.ipynb | head -n 10
Using Git root as project root: ai_engineering_book
Found 806 files in: ai_engineering_book/
❌ 3008 Broken links found:
BROKEN LINK: File '.aider.chat.history.md:614' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p2.md
BROKEN LINK: File '.aider.chat.history.md:777' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p1.md
BROKEN LINK: File '.aider.chat.history.md:1610' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p1.md
BROKEN LINK: File '.aider.chat.history.md:1612' contains broken link: ./4_orchestration/patterns/llm_usage_patterns.md
BROKEN LINK: File '.aider.chat.history.md:1693' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p2.md
BROKEN LINK: File '.aider.chat.history.md:1695' contains broken link: ./2_model/selection/choosing_model_size.md
Traceback (most recent call last):
File "/home/commi/bin/check_broken_links.py", line 429, in <module>
main()
~~~~^^
File "/home/commi/bin/check_broken_links.py", line 34, in main
app.run()
~~~~~~~^^
File "/home/commi/bin/check_broken_links.py", line 185, in run
Reporter.report(temp_path, broken_links_found)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/commi/bin/check_broken_links.py", line 421, in report
print(report_content, end="")
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
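The traceback above appears because head -n 10 exits after ten lines and closes the pipe while the script is still printing. One common way for a Python CLI to exit quietly in that situation is sketched below, following the pattern described in the Python signal module documentation. The function name print_report is hypothetical, and this code is not taken from check_broken_links.py.

```python
import os
import sys


def print_report(report_content: str) -> int:
    """Print the report; return an exit code instead of raising on a closed pipe."""
    try:
        print(report_content, end="")
        # Flush inside the try block so a closed pipe is detected here.
        sys.stdout.flush()
    except BrokenPipeError:
        # Point stdout at devnull so the interpreter's final flush cannot fail again.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        return 1
    return 0
```

A caller would pass the return value to sys.exit, so piping the output through head ends without a traceback.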
Check the given file:
check_broken_links.py --paths 0_intro/00_onboarding.ipynb
Using Git root as project root: ai_engineering_book
Found 1 file in: 0_intro/00_onboarding.ipynb
✅ All links are valid!
check_broken_links.py --paths 0_intro/00_onboarding.ipynb README.md
Using Git root as project root: ai_engineering_book
Found 2 files in:
- 0_intro/00_onboarding.ipynb
- README.md
✅ All links are valid!
Use verbose mode:
check_broken_links.py --verbose
5. Validation Layers
Layer 1: Local Pre-commit Hook (Delta Validation)
The first line of defense runs automatically during the git commit process to prevent broken links from entering the history.
Scope: All .md files are validated, not only the changed ones. Renaming a file breaks the links that point to it from other files, so the developer must fix every link the rename broke.
Efficiency: Fast execution ensures no significant delay in the developer’s workflow.
Logic Tests: Includes a meta-check (test-check-broken-links) that triggers whenever the script itself or its tests change, ensuring the tool’s logic remains sound.
Layer 2: GitHub Action (Continuous Integration)
The CI pipeline in quality.yml validates ALL .md files when any documentation changes, ensuring renamed or moved files don’t break links across the repository.
Full Repository Scan: When any .md file changes, the workflow scans ALL .md files, not just the changed ones. This catches broken links in unchanged files that reference renamed/moved files.
Trigger Optimization: Uses tj-actions/changed-files to detect when docs change, but runs the full scan to ensure consistency with the pre-commit hook.
Environment Parity: Utilizes uv for high-performance dependency and environment management, mirroring the local development stack.
Failure Isolation: Separates logic tests from link validation to pinpoint exactly where a failure occurs.
Layer 3: Manual Infrastructure Checks
Used for deep repository audits or post-refactoring cleanup.
Full Scan: Can be executed manually to scan the entire repository or specific directories.
Custom Patterns: Supports custom file patterns (e.g., scanning .md or .rst files) and exclusion lists.
CI Workflow Diagram
Test Suite
The script is accompanied by a pytest test suite (test_check_broken_links.py). The suite checks that the script identifies broken local references across different file structures and link types while ignoring external URLs and specific environment-related directories. It covers both unit tests of the core logic and end-to-end CLI behavior.
Core Components Tested
Link Extraction: Verifies that Markdown-style links [text](link) and image links ![alt](image) are correctly identified, including edge cases like empty files or files with encoding issues.
Validation Logic:
Relative & Absolute Paths: Ensures links like file.ipynb and /project/root/file.ipynb resolve correctly.
Directory Indexing: Validates that links to a directory (e.g., docs/) are considered valid only if an index.ipynb or README.ipynb exists within it.
Exclusions: Confirms that external URLs (https://...) and internal fragments (#section) are safely skipped.
File Discovery:
Tests the recursive search functionality.
Ensures excluded directories (like .venv or in_progress) and auto-save folders (like .ipynb_checkpoints) are ignored.
CLI & Environment:
Git Integration: Mocks Git environments to test how the script determines the project root.
Cross-Platform Behavior: Tests case-sensitivity (critical for Linux environments).
Exit Codes: Ensures the script returns 0 for success and 1 when broken links are found, making it CI/CD friendly.
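To make the Git Integration item concrete, the sketch below shows how a Git call can be mocked with unittest.mock so that a test never touches a real repository. find_project_root is a hypothetical stand-in defined only for this example, and its fallback to the current directory is an assumption; the helpers and tests in the real suite may differ.

```python
import subprocess
from pathlib import Path
from unittest import mock


def find_project_root() -> Path:
    """Hypothetical stand-in: ask Git for the repository root, else use the cwd."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            capture_output=True, text=True, check=True,
        )
        return Path(result.stdout.strip())
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Assumption: outside a repository, fall back to the working directory.
        return Path.cwd()


def test_root_comes_from_git():
    # Pretend Git answered with "/repo" without running any process.
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="/repo\n", stderr="")
    with mock.patch("subprocess.run", return_value=fake):
        assert find_project_root() == Path("/repo")


def test_falls_back_outside_a_repository():
    # Simulate Git failing, as it does outside a repository.
    error = subprocess.CalledProcessError(128, ["git"])
    with mock.patch("subprocess.run", side_effect=error):
        assert find_project_root() == Path.cwd()
```

Both functions follow pytest's test_ naming convention, so pytest would collect them automatically if they were placed in a test module.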
Running the Tests
To run the full suite, ensure you have pytest installed and execute the following in your terminal from the repo’s root dir:
$ uv run pytest path/to/test_check_broken_links.py
env -u VIRTUAL_ENV uv run pytest tools/tests/test_check_broken_links.py -q
......................................... [100%]
41 passed in 0.08s