Architecture¶
Pipeline¶
Crossfire follows a linear pipeline:
Module overview¶
Orchestration¶
analyzer.py— Coordinates the full pipeline. Entry point for programmatic use.cli.py— Click CLI withscan,compare,validate,generate-corpus,evaluate,evaluate-git,diffcommands.
Core pipeline¶
regex.py— Regex compilation with optional RE2 acceleration. Triesgoogle-re2first (Thompson NFA, 10-100x faster), falls back to stdlibreper-pattern for PCRE-only features (backreferences, lookahead).loader.py— Format-agnostic rule loading (JSON/YAML/CSV/TOML) with fail-fast validation. Delegates to plugins for tool-specific formats. Usesregex.pyfor compilation.generator.py— Corpus generation viarstr+ fallback. Generates positive and negative samples per rule, with per-rule timeout and deduplication. Parallelized across rules for large rule sets.evaluator.py— Parallel cross-rule regex evaluation (NxN match matrix). UsesProcessPoolExecutorwith corpus-chunked parallelism (each worker gets all rules + a corpus slice). Usesregex.pyfor compilation in workers.classifier.py— Relationship classification (duplicate/subset/superset/overlap/disjoint), clustering with keep/drop recommendations, Wilson score confidence intervals.reporter.py— Output rendering (JSON/table/CSV/summary) with quality insights.
Quality and evaluation¶
quality.py— Per-rule quality scoring: specificity, false positive potential, unique coverage, broad pattern detection, pattern complexity (via regex AST).confidence.py— Wilson score confidence intervals for overlap proportions.corpus.py— Real-world corpus loading (JSONL + git history), labeled evaluation (precision/recall/F1), differential analysis (coverage drift).
Data model¶
models.py— Core dataclasses:Rule,CorpusEntry,OverlapResult,AnalysisReport.errors.py—CrossfireError,ValidationError,LoadError,GenerationError.logging.py— Structured logging (text + JSON formats).
Plugin system¶
plugins/__init__.py— Plugin registry withRuleAdapterprotocol and entry_point discovery.plugins/gitleaks.py— GitLeaks.gitleaks.tomladapter.plugins/semgrep.py— Semgrep YAML adapter (extractspattern-regex).plugins/yara.py— YARA.yaradapter (regex strings fromstrings:section).plugins/sigma.py— Sigma YAML adapter (|remodifier patterns).plugins/snort.py— Snort/Suricata.rulesadapter (pcrepatterns).
Key design decisions¶
- Fail-fast by default: Invalid regex, empty pattern, duplicate names cause immediate failure.
--skip-invalidfor lenient mode. - Corpus-based, not structural: Generates strings from regexes and tests empirically. Handles any Python
refeature. rstrfor string generation (BSD license), not exrex (AGPL).ProcessPoolExecutorfor parallel evaluation — regexes re-compiled in workers since Pattern objects aren't serializable. Corpus-chunked (not rule-chunked) to minimize serialization overhead.- Optional RE2:
google-re2used when installed for 10-100x faster matching. Automatic per-pattern fallback to stdlibre. - Type-safe: mypy strict mode enforced across the entire codebase.