Instructions for the check_broken_links.py Script


Owner: Vadim Rudakov, rudakow.wadim@gmail.com
Version: 0.4.1
Birth: 2026-01-07
Last Modified: 2026-01-24


1. Architectural Overview: The SVA Principle

This script performs fast validation of relative file links within a directory and its subdirectories. While optimized for Markdown files (.md), it can scan any file containing Markdown-style links (e.g., .ipynb).

This tool is designed to serve as a high-quality diagnostic step in CI/CD, providing clear, parsable feedback to automate documentation maintenance.

It adheres to the Smallest Viable Architecture (SVA) principle.

Key Architectural Improvements in v0.4.0

The script now also validates MyST include directives:

{include} path/to/file.md

2. Key Capabilities & Logic

The script identifies and validates three distinct types of references:

A. Markdown Links

Standard syntax: [text](link) or ![alt](image).

  • Regex: r"\[[^\]]*\]\(([^)]+)\)"
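As a minimal sketch of how this pattern behaves, the snippet below applies the regex quoted above with Python's `re` module. The function name `extract_markdown_links` and the sample string are illustrative assumptions, not names taken from the script.

```python
import re

# Pattern quoted in this section: the single capture group holds the
# target inside (...), for both [text](link) and ![alt](image).
MARKDOWN_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)]+)\)")


def extract_markdown_links(text: str) -> list[str]:
    """Return every link target found in `text`, in order of appearance."""
    # With exactly one group, findall() returns the captured targets.
    return MARKDOWN_LINK_RE.findall(text)


sample = "See [the intro](./intro/README.md) and ![logo](/docs/images/logo.png)."
print(extract_markdown_links(sample))
# prints ['./intro/README.md', '/docs/images/logo.png']
```

Because the capture group is `[^)]+`, a target that itself contains a closing parenthesis would be cut at the first `)`.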

B. MyST Include Directives

Used for file transclusion. The script identifies targets within MyST code blocks.

  • Syntax: {include} path/to/file.md

  • Regex: r"```\{include\}([^\n]+)"

  • Special Handling: The script automatically strips a single leading space (common in MyST formatting) to ensure the path resolves correctly.
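The extraction and the single-space stripping described above could look roughly like the sketch below. The helper name `extract_include_targets` is hypothetical, and the pattern is assembled from a `FENCE` constant only so that this example does not contain a literal triple backtick; the resulting regex is the same as the one quoted above.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

# Equivalent to the pattern quoted in this section.
INCLUDE_RE = re.compile(FENCE + r"\{include\}([^\n]+)")


def extract_include_targets(text: str) -> list[str]:
    """Return include targets, stripping exactly one leading space from each."""
    targets = []
    for raw in INCLUDE_RE.findall(text):
        # Only a single leading space is removed, as described above.
        targets.append(raw[1:] if raw.startswith(" ") else raw)
    return targets


sample = FENCE + "{include} path/to/file.md\n" + FENCE
print(extract_include_targets(sample))
# prints ['path/to/file.md']
```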

C. Directory Resolution

If a link points to a directory (e.g., [Intro](./intro/)), the validator marks it as valid only if the directory contains an index file:

  1. index.ipynb

  2. README.ipynb
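A minimal sketch of this rule, assuming the two index filenames listed above are the only ones accepted. The function and constant names are illustrative and not taken from the script.

```python
from pathlib import Path

# Index files named in this section.
INDEX_FILENAMES = ("index.ipynb", "README.ipynb")


def directory_link_is_valid(directory: Path) -> bool:
    """A directory link is valid only if the directory holds an index file."""
    if not directory.is_dir():
        return False
    return any((directory / name).is_file() for name in INDEX_FILENAMES)
```

For a link such as `./intro/`, the caller would first resolve it to a path and then pass that path to `directory_link_is_valid`.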

Other features:

  • Git Root Awareness: The script attempts to find the Git project root using git rev-parse --show-toplevel. This allows it to correctly resolve “root-absolute” links (e.g., /docs/images/logo.png) relative to the repository base.

  • Resolution Logic:

    • Relative Paths: Resolved relative to the source file.

    • Root-Relative Paths: Resolved starting from the Git root directory.

  • Directory Links: Validates that a directory exists and contains an index file (e.g., README.ipynb).

  • Skips: Automatically ignores external URLs (https://...), email links (mailto:), and internal document fragments (#anchor).

  • Directory & File Exclusion: Automatically skips common noise directories like .venv and .ipynb_checkpoints.
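The skip and resolution rules above can be sketched as follows. This is an illustration under assumptions, not the script's code: the names `should_skip` and `resolve_link` are hypothetical, and treating plain `http://` the same as `https://` is an assumption.

```python
from pathlib import Path

# Prefixes that are never checked against the filesystem.
# Assumption: plain http:// is skipped like https://.
SKIPPED_PREFIXES = ("http://", "https://", "mailto:", "#")


def should_skip(link: str) -> bool:
    """True for external URLs, email links, and in-document fragments."""
    return link.startswith(SKIPPED_PREFIXES)


def resolve_link(link: str, source_file: Path, project_root: Path) -> Path:
    """Map a local link to the filesystem path it should point at."""
    if link.startswith("/"):
        # Root-relative: resolved from the Git root, not the filesystem root.
        return project_root / link.lstrip("/")
    # Relative: resolved from the directory of the file containing the link.
    return source_file.parent / link


root = Path("/repo")
source = root / "docs" / "guide.md"
print(resolve_link("/docs/images/logo.png", source, root))
print(resolve_link("images/logo.png", source, root))
```

Both calls in the usage lines refer to the same file, once through a root-relative link and once through a relative one.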

3. Technical Architecture

The script is organized into specialized classes to maintain clarity:

  • FileFinder: Handles recursive traversal. Implements exclusion logic for .ipynb_checkpoints and user-defined patterns.

  • LinkExtractor: Scans file content line-by-line using regex. It captures both standard Markdown and {include} patterns.

  • LinkValidator: The core engine. It determines if a link is an external URL, a fragment, or a local path, then resolves it against the filesystem.

  • Reporter: Collects BROKEN LINK strings into a temporary file and exits with code 1 if the file is non-empty.
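To make the division of responsibilities more concrete, here is a simplified sketch of the first and last stages as plain functions. It is an assumption-laden stand-in rather than the script's classes: the names `find_files` and `report` are hypothetical, the exclusion set is reduced to two entries, and the report is kept in memory instead of in a temporary file.

```python
from pathlib import Path

# Reduced exclusion set, for illustration only.
DEFAULT_EXCLUDED_DIRS = frozenset({".venv", ".ipynb_checkpoints"})


def find_files(root: Path, pattern: str = "*.md",
               exclude_dirs: frozenset = DEFAULT_EXCLUDED_DIRS) -> list[Path]:
    """Recursively collect files matching `pattern`, skipping excluded directories."""
    found = []
    for path in sorted(root.rglob(pattern)):
        # Directory components between `root` and the file itself.
        parent_parts = path.relative_to(root).parts[:-1]
        if path.is_file() and not exclude_dirs.intersection(parent_parts):
            found.append(path)
    return found


def report(broken_links: list[str]) -> int:
    """Print the collected messages and return the process exit code."""
    if not broken_links:
        print("All links are valid!")
        return 0
    print(f"{len(broken_links)} broken links found:")
    for message in broken_links:
        print(message)
    return 1
```

A caller would pass the return value of `report` to `sys.exit`, which gives the exit code of 1 described for the Reporter when anything was collected.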

4. Operational Guide

Configuration Reference

  • Primary Script: tools/scripts/check_broken_links.py

  • Exclusion Logic: Managed via tools/scripts/paths.py (e.g., ignoring .venv, in_progress/, and .ipynb_checkpoints).

  • Pre-commit Config: .pre-commit-config.yaml

  • CI Config: .github/workflows/quality.yml

Command Line Interface

check_broken_links.py [--paths PATH] [--pattern PATTERN] [options]
| Argument | Description | Default |
|---|---|---|
| --paths | One or more directories or specific file paths to scan. | . (Current Dir) |
| --pattern | Glob pattern for files to scan. | *.md |
| --exclude-dirs | List of directory names to ignore. | in_progress, pr, .venv |
| --exclude-files | List of specific filenames to ignore. | .aider.chat.history.ipynb |
| --verbose | Shows detailed logs of skipped URLs and valid links. | False |
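An `argparse` setup that mirrors the arguments and defaults listed above might look like the sketch below. It is an assumption about how such a parser could be written, not a copy of the script; in particular, the use of `nargs="+"` for the list-valued options is inferred from the multi-value examples in this guide.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options and defaults documented above."""
    parser = argparse.ArgumentParser(prog="check_broken_links.py")
    parser.add_argument("--paths", nargs="+", default=["."],
                        help="Directories or files to scan.")
    parser.add_argument("--pattern", default="*.md",
                        help="Glob pattern for files to scan.")
    parser.add_argument("--exclude-dirs", nargs="+",
                        default=["in_progress", "pr", ".venv"],
                        help="Directory names to ignore.")
    parser.add_argument("--exclude-files", nargs="+",
                        default=[".aider.chat.history.ipynb"],
                        help="Filenames to ignore.")
    parser.add_argument("--verbose", action="store_true",
                        help="Log skipped URLs and valid links.")
    return parser


args = build_parser().parse_args(["--exclude-dirs", "4_orchestration"])
print(args.exclude_dirs)
# prints ['4_orchestration']
```

With a plain `default=` list, any value passed on the command line replaces the defaults rather than extending them, so `.venv` is no longer excluded in the usage line above.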

Manual Execution Commands

Run these from the repository root using uv for consistent environment resolution:

| Task | Command |
|---|---|
| Full Repo Audit (all .md) | uv run tools/scripts/check_broken_links.py |
| Scan Specific Directories | uv run tools/scripts/check_broken_links.py --paths tools/docs/ architecture/ |
| Scan Multiple Files | uv run tools/scripts/check_broken_links.py --paths file1.md file2.md |
| Notebook Audit | uv run tools/scripts/check_broken_links.py --pattern "*.ipynb" |

Examples

cd ../../../
  1. Check all *.md files in the current directory and subdirectories:

check_broken_links.py
Using Git root as project root: ai_engineering_book
Found 72 files in: ai_engineering_book/

✅ All links are valid!
  2. Check all *.ipynb files recursively from the tools/docs directory:

check_broken_links.py --paths tools/docs --pattern "*.ipynb"
Using Git root as project root: ai_engineering_book
Found 19 files in: tools/docs

✅ All links are valid!
  3. Use exclusions (the default exclusions are overridden, so be careful):

check_broken_links.py --exclude-dirs 4_orchestration in_porgress --exclude-files README.ipynb | head -n 10
Using Git root as project root: ai_engineering_book
Found 806 files in: ai_engineering_book/

❌ 3008 Broken links found:
BROKEN LINK: File '.aider.chat.history.md:614' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p2.md
BROKEN LINK: File '.aider.chat.history.md:777' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p1.md
BROKEN LINK: File '.aider.chat.history.md:1610' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p1.md
BROKEN LINK: File '.aider.chat.history.md:1612' contains broken link: ./4_orchestration/patterns/llm_usage_patterns.md
BROKEN LINK: File '.aider.chat.history.md:1693' contains broken link: /home/commi/Yandex.Disk/it_working/projects/ai/ai_engineering_book/4_orchestration/patterns/llm_usage_patterns_p2.md
BROKEN LINK: File '.aider.chat.history.md:1695' contains broken link: ./2_model/selection/choosing_model_size.md
Traceback (most recent call last):
  File "/home/commi/bin/check_broken_links.py", line 429, in <module>
    main()
    ~~~~^^
  File "/home/commi/bin/check_broken_links.py", line 34, in main
    app.run()
    ~~~~~~~^^
  File "/home/commi/bin/check_broken_links.py", line 185, in run
    Reporter.report(temp_path, broken_links_found)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/commi/bin/check_broken_links.py", line 421, in report
    print(report_content, end="")
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
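The traceback at the end of this output is not a link-checking failure: `head -n 10` exits after ten lines and closes the pipe while the script is still printing. One common way to tolerate this in Python is sketched below. It is only an illustration; the function name `print_report` and its `stream` parameter are hypothetical and do not describe how `Reporter.report` is written.

```python
import sys


def print_report(report_content: str, stream=None) -> bool:
    """Write the report; return False if the reader closed the pipe early."""
    stream = sys.stdout if stream is None else stream
    try:
        stream.write(report_content)
        # Flush inside the try block so a closed pipe is detected here.
        stream.flush()
    except BrokenPipeError:
        return False
    return True
```

When the stream is the real `sys.stdout`, the Python documentation's note on SIGPIPE additionally recommends redirecting the stdout file descriptor to `os.devnull` in the `except` branch, so that the flush at interpreter exit does not raise a second error.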
  4. Check the given file:

check_broken_links.py --paths 0_intro/00_onboarding.ipynb
Using Git root as project root: ai_engineering_book
Found 1 file in: 0_intro/00_onboarding.ipynb

✅ All links are valid!
check_broken_links.py --paths 0_intro/00_onboarding.ipynb README.md
Using Git root as project root: ai_engineering_book
Found 2 files in:
- 0_intro/00_onboarding.ipynb
- README.md

✅ All links are valid!
  5. Use verbose mode:

    check_broken_links.py --verbose

5. Validation Layers

Layer 1: Local Pre-commit Hook (Delta Validation)

The first line of defense runs automatically during the git commit process to prevent broken links from entering the history.

  • Scope: All .md files are validated, not only the changed ones. Renaming or moving a file breaks every link that points to it from other files, so the developer must fix all the links the change has broken.

  • Efficiency: Fast execution ensures no significant delay in the developer’s workflow.

  • Logic Tests: Includes a meta-check (test-check-broken-links) that triggers whenever the script itself or its tests change, ensuring the tool’s logic remains sound.

Layer 2: GitHub Action (Continuous Integration)

The CI pipeline in quality.yml validates ALL .md files when any documentation changes, ensuring renamed or moved files don’t break links across the repository.

  • Full Repository Scan: When any .md file changes, the workflow scans ALL .md files — not just the changed ones. This catches broken links in unchanged files that reference renamed/moved files.

  • Trigger Optimization: Uses tj-actions/changed-files to detect when docs change, but runs the full scan to ensure consistency with the pre-commit hook.

  • Environment Parity: Utilizes uv for high-performance dependency and environment management, mirroring the local development stack.

  • Failure Isolation: Separates logic tests from link validation to pinpoint exactly where a failure occurs.

Layer 3: Manual Infrastructure Checks

Used for deep repository audits or post-refactoring cleanup.

  • Full Scan: Can be executed manually to scan the entire repository or specific directories.

  • Custom Patterns: Supports custom file patterns (e.g., scanning .md or .rst files) and exclusion lists.

CI Workflow Diagram

Test Suite

The script is accompanied by a pytest test suite (test_check_broken_links.py) that checks it accurately identifies broken local references while ignoring external URLs and specific environment-related directories. The suite combines unit tests of the core logic with end-to-end tests of CLI behavior across different file structures and link types.

Core Components Tested

  • Link Extraction: Verifies that Markdown-style links [text](link) and image links ![alt](image) are correctly identified, including edge cases like empty files or files with encoding issues.

  • Validation Logic:

    • Relative & Absolute Paths: Ensures links like file.ipynb and /project/root/file.ipynb resolve correctly.

    • Directory Indexing: Validates that links to a directory (e.g., docs/) are considered valid only if an index.ipynb or README.ipynb exists within it.

    • Exclusions: Confirms that external URLs (https://...) and internal fragments (#section) are safely skipped.

  • File Discovery:

    • Tests the recursive search functionality.

    • Ensures excluded directories (like .venv or in_progress) and auto-save folders (like .ipynb_checkpoints) are ignored.

  • CLI & Environment:

    • Git Integration: Mocks Git environments to test how the script determines the project root.

    • Cross-Platform Behavior: Tests case-sensitivity (critical for Linux environments).

    • Exit Codes: Ensures the script returns 0 for success and 1 when broken links are found, making it CI/CD friendly.
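As an illustration of the Git-mocking approach mentioned above, the sketch below patches `subprocess.run` so that no real repository is needed. The function `find_project_root` is a hypothetical stand-in for the script's root detection, and falling back to the starting directory when Git fails is an assumption.

```python
import subprocess
from pathlib import Path
from unittest.mock import patch


def find_project_root(start: Path) -> Path:
    """Ask Git for the repository root; fall back to `start` (assumption)."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=start, capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return start
    return Path(result.stdout.strip())


def test_uses_git_root_when_available():
    fake = subprocess.CompletedProcess(args=[], returncode=0,
                                       stdout="/repo\n", stderr="")
    # Replace subprocess.run so the test never invokes the git binary.
    with patch("subprocess.run", return_value=fake):
        assert find_project_root(Path("/repo/docs")) == Path("/repo")


def test_falls_back_outside_a_repository():
    error = subprocess.CalledProcessError(128, ["git"])
    with patch("subprocess.run", side_effect=error):
        assert find_project_root(Path("/tmp/x")) == Path("/tmp/x")
```

Both functions follow the `test_` naming convention, so pytest would collect them without further configuration.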

Running the Tests

To run the full suite, ensure you have pytest installed and execute the following in your terminal from the repo’s root dir:

$ uv run pytest path/to/test_check_broken_links.py
env -u VIRTUAL_ENV uv run pytest tools/tests/test_check_broken_links.py -q
.........................................                                [100%]
41 passed in 0.08s