Auditing Developer Machines for Supply Chain Exposure with Bumblebee

🎯 TL;DR

Bumblebee is a single Go binary that walks on-disk package metadata and emits a structured NDJSON stream of what it finds. No package managers are invoked, no network calls are made during scans, and no credentials appear in the output. It answers one specific question: which packages and versions are installed on this developer machine right now?

🤔 Why I Looked at This

Supply-chain incidents have been a recurring pattern in 2026. Two of them landed close together and kept surfacing the same question for me. In March 2026, axios@1.14.1 and axios@0.30.4 were published with an injected dependency that ran a platform-specific remote access trojan during installation. In May 2026, 84 malicious artifacts across 42 @tanstack/* packages reached npm through a GitHub Actions compromise.

Neither incident was subtle. Both were caught relatively quickly. But in the hours between “advisory published” and “all machines confirmed clean,” the question that kept surfacing was the same: which developer machines have the affected version installed right now?

The usual tools answer different things. An SBOM tells you what shipped in a production artifact. An EDR tells you what ran or touched the network. Neither gives you a fast, read-only answer about the lockfiles, toolchain installs, and extension manifests scattered across developer laptops.

I have a ~/code directory with several projects living under it: a Hugo site, a Go CLI, a couple of Python utilities. I wanted to understand what Bumblebee could tell me about that tree, and how it works internally. Perplexity open-sourced it in May 2026 under the Apache 2.0 license.

💡 The Problem

When an advisory names a specific package and version, the gap between “we know about it” and “we know who is affected” can be surprisingly wide.

Grepping lockfiles by hand works if you know which lockfiles exist and where they are. Asking developers to check their own machines works if you have time and a way to collect the answers. Neither approach is fast or auditable.

The information exists. It is just spread across dozens of files in slightly different formats per ecosystem: package-lock.json for npm, pnpm-lock.yaml for pnpm, go.sum for Go, *.dist-info/METADATA for Python, and so on. A scanner that already knows where these files live and how to read them is the faster path.

✨ The Solution

Bumblebee is a single static binary with no dependencies outside the Go standard library. It runs, scans, emits output, and exits. There is no daemon and no persistent state between runs.

The core constraint is that it is read-only. It never invokes a package manager, never reads source files, and never makes network calls during a scan. It reads only the metadata files that package managers leave on disk. This matters because one class of supply-chain attacks relies on malicious post-install scripts. A scanner that never invokes a package manager cannot trigger them.

The output is NDJSON: one JSON record per line, written to stdout by default. Each record carries a record_type field:

package: one discovered installation
finding: a match against the exposure catalog
scan_summary: the run terminator with aggregate counts
diagnostic: written to stderr only, never to the records sink
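
Because every line is a standalone JSON object, a downstream consumer can route records by record_type without buffering the whole stream. The following is a minimal sketch of that idea, not part of Bumblebee itself; the function name count_record_types and the sample records are illustrative assumptions.

```python
import io
import json
from collections import Counter


def count_record_types(stream):
    """Count NDJSON records by their record_type field."""
    counts = Counter()
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get("record_type", "unknown")] += 1
    return counts


# Hypothetical sample stream; a real scan emits many more fields per record.
sample = io.StringIO(
    '{"record_type": "package", "package_name": "axios"}\n'
    '{"record_type": "package", "package_name": "left-pad"}\n'
    '{"record_type": "scan_summary", "status": "complete"}\n'
)
print(dict(count_record_types(sample)))
```

In practice the same function could be fed sys.stdin when piping a scan into the script.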

🚀 Quick Start

Install

Follow the installation instructions in the Bumblebee announcement to get the binary for your platform. Building from source requires Go 1.25 or later.

Bumblebee runs on macOS and Linux only. Windows is not supported in v0.1.

Several commands below pipe output to jq for readable terminal output. Install it before running those commands: on macOS run brew install jq, on Linux use your package manager (apt install jq, dnf install jq, etc.).

Run a baseline scan

The baseline profile covers global toolchain installs, editor extensions (VS Code, Cursor, Windsurf, VSCodium), MCP configuration files, and browser extension profiles. It does not walk project directories.

A developer machine with a full toolchain produces hundreds of records. The stream is designed to feed into a file or a downstream system, not to be read line by line in a terminal.

To capture everything for later analysis:

bumblebee scan --profile baseline > baseline-$(date +%Y%m%d).ndjson

To just check the summary in the terminal:

bumblebee scan --profile baseline | jq 'select(.record_type == "scan_summary")'

Run a project scan

The project profile walks a fixed set of development directories: ~/code, ~/src, ~/Developer, ~/Projects, and ~/workspace. All 11 ecosystem parsers run.

bumblebee scan --profile project

The project profile only looks in those five directory names. If your dev tree lives somewhere else (for example ~/Development), use the deep profile with an explicit root. The output is equally verbose, so save it to a file or check just the summary:

bumblebee scan --profile deep --root ~/Development > scan-$(date +%Y%m%d).ndjson
bumblebee scan --profile deep --root ~/Development | jq 'select(.record_type == "scan_summary")'

Check for a specific exposure

The Bumblebee repository includes a threat_intel/ directory with pre-built catalog files for documented supply-chain incidents. If you built from source the directory is already in your repository root. If you installed a pre-built binary, download the repository separately to get the catalogs, then point --exposure-catalog at the absolute path.

Since --findings-only suppresses the full package stream, the output is limited to matches and stays readable in a terminal without redirecting to a file:

bumblebee scan --profile deep --root ~/code \
  --exposure-catalog /path/to/bumblebee-repo/threat_intel/ \
  --findings-only \
  | jq '.'

You can also write your own catalog JSON using the format described in the project README.

🏗️ How It Works Inside

The three profiles

Profile selection is the primary way to balance scan breadth against runtime cost.

Profile	What it walks	Typical cadence	Use for
`baseline`	Global and user-level toolchain installs, editor extensions, MCP config directories, browser extension profiles. No project trees.	Every 15 min or at login	Fleet-wide toolchain and tool inventory
`project`	Configured development directories: `~/code`, `~/src`, `~/Developer`, `~/Projects`, `~/workspace`. All ecosystems apply.	Daily	Per-project lockfile and dependency inventory
`deep`	Any explicit `--root` path, including bare home directories. Requires at least one `--root` argument.	On demand	Incident-response sweeps against a specific advisory

baseline and project refuse bare home roots. If you pass $HOME or /home/alice to either of them, the CLI will reject it. Only deep accepts a bare home root.
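
To make the rule concrete, here is a small sketch of what such a guard could look like. It is my own illustration of the behavior described above, not Bumblebee's code: the function names, the error message, and the idea of passing in a list of known home directories are all assumptions.

```python
from pathlib import PurePosixPath


def is_bare_home(root, home_dirs):
    """True when root is exactly one of the given home directories."""
    target = PurePosixPath(root)
    return any(target == PurePosixPath(home) for home in home_dirs)


def check_root(profile, root, home_dirs):
    """Raise ValueError when a non-deep profile is given a bare home root."""
    if profile != "deep" and is_bare_home(root, home_dirs):
        raise ValueError(f"profile {profile!r} refuses bare home root {root}")
    return root
```

A subdirectory such as /home/alice/code passes for every profile; only the home directory itself is rejected for baseline and project.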

Filesystem walk

Before dispatching any file to a parser, the walker applies a default exclusion list:

Credential directories: .ssh, .aws, .kube, .gnupg, .docker
VCS internals: .git, .hg
macOS system caches: Library/Caches, Library/Mail, Library/Messages
Browser application data outside the enumerated extension profile paths

Symlink loops are detected via inode tracking rather than path comparison, so the walk terminates correctly on any directory tree. Permission errors (EACCES, EPERM) emit a debug-level diagnostic and continue. Missing optional roots emit an info-level diagnostic. You can add extra directories to skip with --exclude.
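
The inode-tracking idea is easy to show in a few lines. This is a sketch under assumptions, not the real walker: it follows directory symlinks, handles only single-name exclusions (multi-component entries such as Library/Caches are left out), and silently skips unreadable directories instead of emitting diagnostics. The name walk_files is hypothetical.

```python
import os

# Single-name subset of the default exclusion list described above.
DEFAULT_EXCLUDES = {".ssh", ".aws", ".kube", ".gnupg", ".docker", ".git", ".hg"}


def walk_files(root, excludes=DEFAULT_EXCLUDES):
    """Yield file paths under root, skipping excluded dirs and symlink loops."""
    seen = set()  # (device, inode) pairs of directories already visited
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        try:
            st = os.stat(dirpath)
        except OSError:
            dirnames[:] = []  # unreadable or vanished: do not descend
            continue
        key = (st.st_dev, st.st_ino)
        if key in seen:
            dirnames[:] = []  # same directory reached via another path
            continue
        seen.add(key)
        # Prune excluded directories in place so os.walk never enters them.
        dirnames[:] = [d for d in dirnames if d not in excludes]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Keying on (device, inode) rather than on the path string is what makes a link such as proj/loop pointing back to the root terminate: the second visit resolves to an identity that is already in the set.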

Ecosystem parsers

The scanner initializes 11 ecosystem-specific parser structs at startup. As the walker visits files, it matches each filename against a dispatch table and sends matching files to the appropriate parser via a worker pool. The default concurrency is 4 workers, configurable with --concurrency.

Each parser opens only the specific file it receives. It never walks a directory or calls a package manager.
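
As an example of how little a single-file parser needs, here is a rough sketch for go.sum. It relies only on the public go.sum line format (module, version with an optional /go.mod suffix, hash); how Bumblebee's own Go parser treats these lines is not something I have verified, and parse_go_sum is a hypothetical name.

```python
def parse_go_sum(text):
    """Return sorted unique (module, version) pairs from go.sum content."""
    suffix = "/go.mod"
    seen = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        module, version, _hash = fields
        if version.endswith(suffix):
            # The "/go.mod" line hashes only the module's go.mod file.
            version = version[: -len(suffix)]
        seen.add((module, version))
    return sorted(seen)
```

Note that go.sum can list versions that are no longer selected by the build, which is one reason a scanner might attach a lower confidence to records from this source.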

File	Parser	Source type
`package-lock.json`, `npm-shrinkwrap.json`	npm	`npm-lockfile`
`pnpm-lock.yaml`	pnpm	`pnpm-lockfile`
`yarn.lock`	yarn	`yarn-lockfile`
`bun.lock`	bun	`bun-lockfile`
`node_modules/<pkg>/package.json`	npm	`npm-node_modules`
`*.dist-info/METADATA`	pypi	`pypi-dist-info`
`go.sum`	go	`go-sum`
`go.mod`	go	`go-mod`
`Gemfile.lock`	rubygems	`gemfile-lock`
`composer.lock`	composer	`composer-lockfile`
`claude_desktop_config.json`, `mcp.json`, `.mcp.json`, `~/.gemini/settings.json`	mcp	`mcp-config`
`package.json` inside `.vscode/extensions/…`	editor-ext	`editor-extension`
`manifest.json` inside a Chromium extension profile	browser-ext	`chromium-extension`

The --ecosystem flag restricts which parsers are active for a run, which is useful for targeted scans or performance tuning.
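
The dispatch-table-plus-worker-pool pattern can be sketched as follows. This is an illustration in Python rather than the scanner's Go code: the DISPATCH mapping covers only the basename-matched rows of the table above, and select_parser, dispatch, and the handle callback are hypothetical names. The enabled argument plays the role of an ecosystem filter.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Basename -> parser name, a subset of the table above.
DISPATCH = {
    "package-lock.json": "npm",
    "npm-shrinkwrap.json": "npm",
    "pnpm-lock.yaml": "pnpm",
    "yarn.lock": "yarn",
    "bun.lock": "bun",
    "go.sum": "go",
    "go.mod": "go",
    "Gemfile.lock": "rubygems",
    "composer.lock": "composer",
}


def select_parser(path, enabled=None):
    """Return the parser name for path, or None if unmatched or filtered out."""
    parser = DISPATCH.get(os.path.basename(path))
    if parser is None or (enabled is not None and parser not in enabled):
        return None
    return parser


def dispatch(paths, handle, enabled=None, concurrency=4):
    """Send each matching path to handle(parser, path) on a worker pool."""
    jobs = [(select_parser(p, enabled), p) for p in paths]
    jobs = [(parser, p) for parser, p in jobs if parser is not None]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so results line up with jobs.
        return list(pool.map(lambda job: handle(*job), jobs))
```

Filtering before submitting work means a restricted run never even opens files that belong to a disabled ecosystem.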

Record types and what they carry

Every record carries a common header: record_type, record_id, schema_version, scanner_name, scanner_version, run_id, scan_time, and endpoint (hostname, OS, arch, username).

A package record represents one discovered installation:

{
    "record_type": "package",
    "ecosystem": "npm",
    "package_name": "axios",
    "normalized_name": "axios",
    "version": "1.14.1",
    "source_file": "/home/alice/code/myapp/package-lock.json",
    "source_type": "npm-lockfile",
    "confidence": "high",
    "has_lifecycle_scripts": false,
    "root_kind": "project_root",
    "profile": "deep"
}

The has_lifecycle_scripts field (npm, pnpm, and yarn only) tells you whether the package defines install hooks. It does not mean those hooks ran during the scan; it means they would run if the package were installed through a package manager. That distinction matters when triaging exposure.
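
For intuition, this is roughly how such a flag could be derived from a package.json manifest without running anything. It is a sketch based on npm's documented install hooks (preinstall, install, postinstall); whether Bumblebee uses exactly this rule or reads lockfile metadata instead is an assumption on my part, and the function below is hypothetical.

```python
import json

# npm lifecycle hooks that run when a package is installed.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def has_lifecycle_scripts(package_json_text):
    """Report whether a package.json manifest defines any install hook."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts")
    if not isinstance(scripts, dict):
        return False  # missing or malformed scripts section
    return any(hook in scripts for hook in INSTALL_HOOKS)
```

The check only inspects declared script names. It never evaluates the script bodies, which is the same read-only property the scanner depends on.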

A finding record is emitted when a package matches the exposure catalog. It carries all identifying fields from the package record plus catalog_id, catalog_name, severity, and evidence:

{
    "record_type": "finding",
    "ecosystem": "npm",
    "package_name": "axios",
    "version": "1.14.1",
    "source_file": "/home/alice/code/myapp/package-lock.json",
    "catalog_name": "axios supply chain compromise March 2026",
    "severity": "critical",
    "evidence": "exact name+version match (version=1.14.1)"
}

A scan_summary record is always emitted last. Its status field is complete, partial (if --max-duration was reached or the scan was interrupted), or error. Receivers should only promote a run to current state after status=complete. That matters for recurring scans where an interrupted run should not overwrite a valid previous result.
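
On the receiving side, that rule comes down to a single guard. The sketch below assumes a receiver that holds the current state as a plain list of records; load_run and promote are hypothetical names, not part of any Bumblebee tooling.

```python
import json


def load_run(lines):
    """Parse one run's NDJSON lines into (records, summary)."""
    records, summary = [], None
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("record_type") == "scan_summary":
            summary = record
        else:
            records.append(record)
    return records, summary


def promote(current_state, lines):
    """Return the new current state, keeping the old one unless the run completed."""
    records, summary = load_run(lines)
    if summary is None or summary.get("status") != "complete":
        return current_state  # partial, errored, or truncated run
    return records
```

A run with no scan_summary at all is treated like a partial one, which covers the case where the scanner was killed before it could write its terminator.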

Deduplication

Every package record is assigned a record_id that is a SHA-256 hash of a canonical identity tuple: ecosystem, normalized name, version, source file, profile, root kind, and a few other fields. If two parsers encounter the same logical package within the same run (for example, a package that appears in both package-lock.json and node_modules/), only the first is emitted. The record_id is stable across runs, so the same package observed identically on consecutive scans produces the same ID. Receivers can use it as a deduplication key when building current-state tables.
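
A content-derived ID of this kind can be sketched in a few lines. The field list below contains only the fields named above (the real tuple includes others), and the JSON-array canonical encoding is my assumption, so IDs produced here will not match Bumblebee's. The function names are hypothetical.

```python
import hashlib
import json

# Only the identity fields named in the text; the real tuple has more.
IDENTITY_FIELDS = (
    "ecosystem", "normalized_name", "version",
    "source_file", "profile", "root_kind",
)


def record_id(record):
    """Hash the identity fields in a fixed order to get a stable ID."""
    identity = [record.get(field, "") for field in IDENTITY_FIELDS]
    canonical = json.dumps(identity, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records):
    """Yield each distinct record once, tagged with its record_id."""
    seen = set()
    for record in records:
        rid = record_id(record)
        if rid in seen:
            continue  # same identity already emitted in this run
        seen.add(rid)
        yield {**record, "record_id": rid}
```

Because the hash excludes run-specific values such as run_id and scan_time, the same observation on two different days maps to the same key.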

Exposure catalog matching

When --exposure-catalog is provided, every accepted package record is matched against the catalog using exact (ecosystem, normalized_name, version) matching. There are no semver ranges and no fuzzy matching. A match produces one finding record per matching catalog entry. The Bumblebee repository ships a threat_intel/ directory with pre-built catalogs maintained from public supply-chain reporting. Point --exposure-catalog at that directory, or at any individual JSON catalog file you write yourself.
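
Exact matching amounts to a dictionary lookup on a three-part key. The sketch below uses a simplified, hypothetical catalog entry shape (a flat dict per affected version) purely for illustration; the real catalog format is the one described in the project README, and build_index and match are names I made up.

```python
def build_index(entries):
    """Index catalog entries by (ecosystem, normalized_name, version)."""
    index = {}
    for entry in entries:
        key = (entry["ecosystem"], entry["normalized_name"], entry["version"])
        index.setdefault(key, []).append(entry)
    return index


def match(package, index):
    """Return one finding per catalog entry that matches the package exactly."""
    key = (package["ecosystem"], package["normalized_name"], package["version"])
    return [
        {
            "record_type": "finding",
            "ecosystem": package["ecosystem"],
            "package_name": package["package_name"],
            "version": package["version"],
            "source_file": package["source_file"],
            "catalog_name": entry["catalog_name"],
            "severity": entry["severity"],
            "evidence": f"exact name+version match (version={package['version']})",
        }
        for entry in index.get(key, [])
    ]
```

The flip side of exactness is visible here: a catalog entry for 1.14.1 says nothing about 1.14.0 or 1.14.2, so every affected version has to be listed explicitly.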

📦 Scanning Multiple Projects

The deep profile accepts any directory as a root, which makes it the right choice for scanning a dev tree regardless of how it is named. Pass your development directory with --root and Bumblebee walks everything under it. Each package record carries root_kind: "project_root" and a source_file with the absolute path, so you can tell exactly which project a dependency came from.

To get a flat view of everything installed across all your projects:

bumblebee scan --profile deep --root ~/code \
  | jq -r 'select(.record_type == "package") | [.ecosystem, .package_name, .version, .source_file] | @tsv'

For large dev trees the output can be long. Redirect to a file to inspect it at leisure:

bumblebee scan --profile deep --root ~/code \
  | jq -r 'select(.record_type == "package") | [.ecosystem, .package_name, .version, .source_file] | @tsv' \
  > packages-$(date +%Y%m%d).tsv

To narrow to a specific ecosystem:

bumblebee scan --profile deep --root ~/code --ecosystem npm \
  | jq -r 'select(.record_type == "package") | [.package_name, .version, .source_file] | @tsv'
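
Because every package record carries an absolute source_file, the flat stream can also be grouped per project. This is a sketch under one assumption of mine: that each first-level directory under the scan root is one project. The function name group_by_project is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_project(records, root):
    """Map first-level directory under root -> set of (name, version)."""
    root_path = PurePosixPath(root)
    projects = defaultdict(set)
    for record in records:
        if record.get("record_type") != "package":
            continue
        source = PurePosixPath(record["source_file"])
        try:
            relative = source.relative_to(root_path)
        except ValueError:
            continue  # source file lives outside the scan root
        # Files directly in the root are grouped under ".".
        project = relative.parts[0] if len(relative.parts) > 1 else "."
        projects[project].add((record["package_name"], record["version"]))
    return dict(projects)
```

Feeding it the parsed lines of a saved scan gives a per-project view that is easier to hand to whoever owns each codebase.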

To check all projects for known-compromised versions in one pass, point the scan at the threat_intel/ directory from the Bumblebee repository. With --findings-only the stream is limited to matches, so the terminal output stays manageable without redirecting to a file:

bumblebee scan --profile deep --root ~/code \
  --exposure-catalog /path/to/bumblebee-repo/threat_intel/ \
  --findings-only \
  | jq '.'

If the summary shows findings_emitted: 0, none of the projects under your dev directory match the catalog. If findings appear, each one gives you the exact file, ecosystem, version, and severity.

Incident response with the deep profile

If you need to sweep a home directory directly, use the deep profile. It accepts bare home roots and any explicit --root path:

bumblebee scan --profile deep \
  --root /home/alice \
  --exposure-catalog axios-advisory.json \
  --findings-only

The deep profile walks the entire root path you provide, subject only to the default exclusion list and any --exclude flags you add. On a home directory, it will cover every project tree, toolchain install, and config file it can reach. This is intentional for incident response but takes longer than a project or baseline scan.

On macOS, the --all-users flag expands baseline and project scans across every /Users/<name> home without requiring a bare home root. That makes it practical for a single MDM-deployed invocation to cover all developer accounts on a machine.

⚖️ Honest Trade-offs

What you gain	What you lose
Read-only scan: no risk of triggering post-install scripts during analysis	Exact-version matching only: no semver ranges, no wildcard expressions
Works entirely from on-disk state: no registry access, no network calls	macOS and Linux only in v0.1: no Windows support
Covers MCP and AI tool configs that no other scanner currently inventories	Scheduling is the operator's responsibility: cron, launchd, or MDM (Bumblebee does not manage cadence itself)
Stable `record_id` across runs makes deduplication trivial for downstream receivers	The exposure catalog must be maintained: the scanner is only as useful as the catalog entries it ships with or that you keep current
NDJSON pipes cleanly to jq, databases, or any HTTP endpoint	`confidence: medium` or `confidence: low` records exist for partial metadata: version attribution is less certain in those cases

🔑 Key Insights

The gap Bumblebee fills is specific: not “what shipped” or “what ran,” but “what is installed on this developer machine right now.”
The read-only design is a deliberate choice. Reading metadata files is faster than asking a package manager directly, and it avoids triggering anything the packages define.
Three profiles let you match breadth to cost. Baseline covers the global toolchain and AI tool configs. Project covers all projects in your standard dev directories. Deep covers everything when you need it.
The exposure catalog decouples threat intelligence from the scanner binary. You can update the catalog and re-run without touching the binary or waiting for a tool update.
MCP config coverage matters now. AI coding tools are standard on developer machines, and their configurations are a meaningful attack surface. Bumblebee inventories them the same way it inventories npm packages.

Final Thoughts

What I find useful about Bumblebee is that it is narrow. It does not try to replace your EDR or your SBOM pipeline. It answers one question and it answers it from on-disk state without executing anything.

That narrowness is what makes it practical to run on a schedule or to drop into an incident response workflow. A scan finishes, emits a scan_summary with status=complete, and exits. Downstream tooling handles the rest.

If you have a development directory with several projects living under it, running bumblebee scan --profile deep --root ~/code and piping through jq is worth doing once just to see what the full picture looks like. You may find versions you did not expect still pinned in older lockfiles.

What ecosystems or config types would you want to see added to a future version? I am curious whether Cargo or Maven coverage would change how useful this is for teams that are primarily Rust or JVM shops. Let me know on LinkedIn.

Photo by Kai Wenzel on Unsplash

Jonathan Búcaro
