The MongoDB resource exposes MongoDB databases, collections, and documents as a virtual filesystem mounted at a prefix such as /mongodb/. For connection setup, see MongoDB Setup.

Config

import os

from mirage import MountMode, Workspace
from mirage.resource.mongodb import MongoDBConfig, MongoDBResource

config = MongoDBConfig(
    uri=os.environ["MONGODB_URI"],
    default_doc_limit=1000,
    default_search_limit=100,
    max_doc_limit=5000,
    elide_fields={"sample_mflix.movies": ["plot_embedding"]},
)
resource = MongoDBResource(config=config)
ws = Workspace({"/mongodb": resource}, mode=MountMode.READ)
| Config field | Required | Default | Description |
|---|---|---|---|
| uri | yes | | MongoDB connection URI |
| databases | no | | List of database names to mount (omit for all) |
| default_doc_limit | no | 1000 | Default doc cap for one-shot reads (not used by streaming cat) |
| default_search_limit | no | 100 | Default result cap for collection/db-level grep |
| max_doc_limit | no | 5000 | Hard cap for head -n K / tail -n K |
| elide_fields | no | {} | {"<db>.<coll>": ["field", "nested.path"]}; listed fields are dropped from documents.jsonl |

Filesystem Layout

The mount mirrors MongoDB’s cluster → database → collection / view model. The database directory always appears in the path, even when databases filters to a single entry.
/mongodb/
  <database>/
    database.json                  # collections + views + counts
    collections/
      <collection>/
        schema.json                # sampled types + indexes + validator
        documents.jsonl            # streamed BSON Extended JSON
    views/
      <view>/
        schema.json                # view-aware (no indexes / validator)
        documents.jsonl            # streamed, same format as collections
Example:
/mongodb/
  sample_mflix/
    database.json
    collections/
      movies/
        schema.json
        documents.jsonl
      comments/
        schema.json
        documents.jsonl
    views/
      top_rated_movies/
        schema.json
        documents.jsonl

documents.jsonl

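One JSON object per line, encoded as BSON Relaxed Extended JSON. Types with no native JSON equivalent, such as ObjectId and dates, are written in their $-prefixed wrapper form, while plain numbers and strings stay as ordinary JSON values: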
{"_id":{"$oid":"573a1390f29313caabcd4135"},"title":"Casablanca","released":{"$date":"1943-01-23T00:00:00Z"},"runtime":102}
cat / head / tail / grep / jq all stream from this file; nothing materializes the full collection in memory.
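To make the wrapper format concrete, here is a minimal sketch that decodes one documents.jsonl line using only the standard library. The helper name decode_wrappers is hypothetical and not part of Mirage; it handles only the $oid and $date wrappers, and it assumes dates arrive either as an ISO-8601 string or in the $numberLong millisecond form that Extended JSON also allows.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_wrappers(value):
    """Recursively unwrap {"$oid": ...} and {"$date": ...} (hypothetical helper)."""
    if isinstance(value, dict):
        if set(value) == {"$oid"}:
            return value["$oid"]  # keep the 24-character hex string
        if set(value) == {"$date"}:
            raw = value["$date"]
            if isinstance(raw, str):
                # ISO-8601 form; normalize a trailing "Z" for fromisoformat
                return datetime.fromisoformat(raw.replace("Z", "+00:00"))
            if isinstance(raw, dict) and "$numberLong" in raw:
                # Millisecond form: offset from the Unix epoch
                return EPOCH + timedelta(milliseconds=int(raw["$numberLong"]))
        return {k: decode_wrappers(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_wrappers(v) for v in value]
    return value


line = (
    '{"_id":{"$oid":"573a1390f29313caabcd4135"},"title":"Casablanca",'
    '"released":{"$date":"1943-01-23T00:00:00Z"},"runtime":102}'
)
doc = decode_wrappers(json.loads(line))
print(doc["_id"], doc["released"].year, doc["runtime"])
```

If PyMongo is installed, its bson.json_util module is the more complete option for decoding Extended JSON; the sketch above only avoids the dependency.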

schema.json

Generated from a 100-document $sample. Includes:
  • field path → observed BSON type frequencies (nested paths unioned across the sample)
  • indexes from listIndexes plus access counts from $indexStats
  • the $jsonSchema validator if one is registered
Views skip the index and validator sections.
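The "field path → type frequencies" idea can be sketched in plain Python. This is an illustration only: the resource samples server-side with $sample, whereas the code below walks in-memory dicts, and the function names and type labels (type_name, walk, summarize, "double", "object") are assumptions rather than the exact contents of schema.json.

```python
from collections import Counter, defaultdict


def type_name(value):
    """Map a Python value to a rough type label (labels are illustrative)."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    if value is None:
        return "null"
    return type(value).__name__


def walk(doc, prefix=""):
    """Yield (dotted_path, type_label) for each field, descending into sub-documents."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        yield path, type_name(value)
        if isinstance(value, dict):
            yield from walk(value, prefix=f"{path}.")


def summarize(sample):
    """Union nested paths across the sample and count the types seen at each path."""
    freq = defaultdict(Counter)
    for doc in sample:
        for path, label in walk(doc):
            freq[path][label] += 1
    return {path: dict(counts) for path, counts in freq.items()}


sample = [
    {"title": "Casablanca", "runtime": 102, "imdb": {"rating": 8.5}},
    {"title": "Heat", "runtime": "170 min", "imdb": {"rating": 8}},
]
schema = summarize(sample)
print(schema["runtime"])
```

A field that reports more than one type, like runtime here, is the kind of signal the sampled schema is meant to surface before an agent writes a query.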

database.json

Lists every collection and view under the database with their document counts. Run cat /mongodb/<db>/database.json to get an overview without recursing into each entity.

Elided fields

Fields listed under elide_fields are dropped entirely from documents.jsonl output. The type stays documented in schema.json, so heavy fields (embeddings, large binary, raw text blobs) can be hidden from agent reads without losing the schema signal:
config = MongoDBConfig(
    uri=os.environ["MONGODB_URI"],
    elide_fields={
        "sample_mflix.movies": ["plot_embedding"],
        "rag.docs": ["metadata.embedding"],
    },
)
Nested paths use dot notation. Elision applies to both cat (a single streaming pass over the collection) and tail -f (live change-stream follows).
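As a rough illustration of dot-notation elision, the sketch below removes listed paths from a single document. The elide function is hypothetical and written for this page; how the resource applies elision internally is not described here.

```python
import copy


def elide(doc, paths):
    """Return a copy of doc with each dotted path removed, if present (hypothetical helper)."""
    out = copy.deepcopy(doc)  # leave the caller's document untouched
    for path in paths:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            # Descend one level; stop if the parent is missing or not a sub-document
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out


movie = {
    "title": "Casablanca",
    "plot_embedding": [0.12, -0.4],
    "metadata": {"embedding": [1, 2, 3], "source": "imdb"},
}
slim = elide(movie, ["plot_embedding", "metadata.embedding", "missing.path"])
print(slim)
```

Paths that do not exist in a given document are skipped silently, which matters when documents in the same collection have different shapes.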

Streaming and Limits

cat, grep, head, and tail -f consume documents lazily through a batched PyMongo async cursor (or change stream); the consumer cancels to stop fetching. There is no truncation notice because nothing is forced into memory ahead of the consumer.
| Command | Behavior |
|---|---|
| cat | Streams the whole collection sorted by _id; pipe to head to cap |
| head -n K / tail -n K | K is capped at max_doc_limit; server-side sort + limit |
| tail -f | Opens a change stream; yields each new insert as a JSONL line |
| grep (file level) | Streams from documents.jsonl; supports -m for short-circuit |
| grep (collection/db level) | Server-side query, capped at default_search_limit |
| jq | Inherits the streaming cat source |
| wc | Uses countDocuments() server-side; zero download |
| stat | Metadata only (doc count, indexes; views skip the index lookup) |
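The lazy-consumption behavior can be pictured with a plain async generator. This is a simulation only: fake_cursor stands in for a batched cursor and take for a consumer such as a downstream head; neither name comes from Mirage or PyMongo.

```python
import asyncio

fetched_batches = []  # records each simulated round trip to the server


async def fake_cursor(total, batch_size):
    """Stand-in for a batched cursor: loads one batch at a time, on demand."""
    for start in range(0, total, batch_size):
        fetched_batches.append(start)
        for i in range(start, min(start + batch_size, total)):
            yield {"_id": i}


async def take(source, n):
    """Consume at most n documents, then close the source so fetching stops."""
    out = []
    if n > 0:
        async for doc in source:
            out.append(doc)
            if len(out) >= n:
                break
    await source.aclose()
    return out


docs = asyncio.run(take(fake_cursor(total=1000, batch_size=100), 5))
print(len(docs), fetched_batches)
```

Only the first batch is requested even though the simulated collection holds 1000 documents, which is why a capped read needs no truncation notice.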

Smart Commands

grep at different scopes

grep uses MongoDB’s query engine at directory scopes instead of streaming all documents through the regex pipeline:
# FILE level - streams documents.jsonl, runs the regex locally
grep "Godfather" "/mongodb/sample_mflix/collections/movies/documents.jsonl"

# COLLECTION level - server-side query against the collection
grep "Godfather" "/mongodb/sample_mflix/collections/movies/"

# DATABASE level - searches across every collection in sample_mflix
grep "Godfather" "/mongodb/sample_mflix/"

# ROOT level - fans out across every mounted database
grep "Godfather" "/mongodb/"
At collection or higher scope the resource picks the best server-side strategy from the indexes available:
  1. Text index exists → uses $text (ranked by relevance)
  2. Atlas Search index exists → uses $search (fuzzy, Lucene-based)
  3. Neither → falls back to $regex on sampled string fields
Scope detection is handled by mirage/core/mongodb/scope.py.
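The strategy order above can be sketched as a small function that returns an aggregation pipeline. The order ($text, then $search, then $regex) follows the list; everything else is an assumption made for illustration: the function name build_pipeline, the exact stage shapes, the wildcard search path, and the use of a trailing $limit for default_search_limit. It is not the code in scope.py.

```python
def build_pipeline(pattern, has_text_index, has_atlas_search_index,
                   string_fields=(), limit=100):
    """Pick a server-side search strategy and return a pipeline (illustrative only)."""
    if has_text_index:
        # 1. Text index: $text match, ranked by relevance score
        stages = [
            {"$match": {"$text": {"$search": pattern}}},
            {"$sort": {"score": {"$meta": "textScore"}}},
        ]
    elif has_atlas_search_index:
        # 2. Atlas Search index: $search with fuzzy matching over all fields
        stages = [
            {"$search": {"text": {"query": pattern,
                                  "path": {"wildcard": "*"},
                                  "fuzzy": {}}}},
        ]
    else:
        # 3. Fallback: $regex over string fields found by sampling
        if not string_fields:
            raise ValueError("regex fallback needs at least one string field")
        stages = [
            {"$match": {"$or": [{f: {"$regex": pattern}} for f in string_fields]}},
        ]
    stages.append({"$limit": limit})  # mirrors default_search_limit
    return stages


text_plan = build_pipeline("Godfather", True, False)
search_plan = build_pipeline("Godfather", False, True)
regex_plan = build_pipeline("Godfather", False, False, string_fields=["title", "plot"])
print(regex_plan)
```

The regex branch is the most expensive of the three because it cannot use a text index, which is one reason to prefer a text or Atlas Search index on collections that agents grep often.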

head / tail / tail -f

head and tail use server-side sort + limit; the requested count is capped at max_doc_limit. tail -f opens a Mongo change stream filtered to insert events and yields each new document as a JSONL line in the same format as cat:
# First 10 docs (sorted by _id ascending)
head -n 10 "/mongodb/sample_mflix/collections/movies/documents.jsonl"

# Last 10 docs (sorted by _id descending)
tail -n 10 "/mongodb/sample_mflix/collections/movies/documents.jsonl"

# Live-follow new inserts; consumer cancels to stop
tail -f "/mongodb/sample_mflix/collections/movies/documents.jsonl"
tail -f requires the cluster to be a replica set; Atlas already satisfies this. Views fall through to the non-streaming path because change streams aren’t defined on views.
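A minimal sketch of how a head or tail request might translate into query parameters, assuming the cap is a simple minimum against max_doc_limit. The plan_read function and INSERTS_ONLY constant are hypothetical names; the sort directions and the insert-only $match stage follow the description above.

```python
ASCENDING, DESCENDING = 1, -1  # the same integer values PyMongo uses for sort order


def plan_read(command, k, max_doc_limit=5000):
    """Return the sort and limit a head/tail request would map to (illustrative)."""
    if command not in ("head", "tail"):
        raise ValueError(f"unsupported command: {command}")
    if k < 1:
        raise ValueError("k must be at least 1")
    direction = ASCENDING if command == "head" else DESCENDING
    return {"sort": [("_id", direction)], "limit": min(k, max_doc_limit)}


# Change-stream pipeline that keeps only insert events, as tail -f is described to do
INSERTS_ONLY = [{"$match": {"operationType": "insert"}}]

print(plan_read("head", 10))
print(plan_read("tail", 10_000))
```

With PyMongo, a plan like this could be passed as collection.find(sort=plan["sort"], limit=plan["limit"]), and the pipeline as collection.watch(INSERTS_ONLY); whether the resource issues exactly these calls is not stated in this page.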

Cache

The MongoDB resource uses IndexCacheStore (same as RAM/S3/disk/GitHub) for listings: database names, collection names, and document counts. Document content is not cached. The resource leaves caches_reads at its default of False, so cat, grep, head, and tail always query the live collection instead of serving a stored snapshot. This keeps reads consistent with a mutable database and ensures tail -f follows the live change stream rather than replaying cached bytes.

Example

import asyncio
import os

from dotenv import load_dotenv

from mirage import MountMode, Workspace
from mirage.resource.mongodb import MongoDBConfig, MongoDBResource

load_dotenv(".env.development")

config = MongoDBConfig(uri=os.environ["MONGODB_URI"])
resource = MongoDBResource(config=config)


async def main():
    ws = Workspace({"/mongodb": resource}, mode=MountMode.READ)

    # List all databases
    r = await ws.execute("ls /mongodb/")
    print(await r.stdout_str())

    # List the entities under a database (database.json, collections/, views/)
    r = await ws.execute("ls /mongodb/sample_mflix/")
    print(await r.stdout_str())

    # List collections
    r = await ws.execute("ls /mongodb/sample_mflix/collections/")
    print(await r.stdout_str())

    # Read first 5 movies
    r = await ws.execute(
        'head -n 5 "/mongodb/sample_mflix/collections/movies/documents.jsonl"')
    print(await r.stdout_str())

    # Read last 5 movies
    r = await ws.execute(
        'tail -n 5 "/mongodb/sample_mflix/collections/movies/documents.jsonl"')
    print(await r.stdout_str())

    # Inspect the sampled schema and indexes
    r = await ws.execute(
        'cat "/mongodb/sample_mflix/collections/movies/schema.json"')
    print(await r.stdout_str())

    # Extract titles with jq
    r = await ws.execute(
        'jq -r ".[] | .title" "/mongodb/sample_mflix/collections/movies/documents.jsonl"'
    )
    print(await r.stdout_str())

    # Search across a database (uses MongoDB query engine)
    r = await ws.execute('grep "Godfather" "/mongodb/sample_mflix/"')
    print(await r.stdout_str())

    # Count documents (server-side, no download)
    r = await ws.execute(
        'wc -l "/mongodb/sample_mflix/collections/movies/documents.jsonl"')
    print(await r.stdout_str())

    # View the database overview without recursing
    r = await ws.execute('cat "/mongodb/sample_mflix/database.json"')
    print(await r.stdout_str())


if __name__ == "__main__":
    asyncio.run(main())

Runnable examples

Three working examples live under examples/python/mongodb/:
  • mongodb.py — agent-shell workflow: ls, tree, cat, head, tail, wc, stat, grep/rg at every scope, jq, find, cd + relative paths.
  • mongodb_vfs.py — in-process VFS: os.listdir and open() walk every readdir level (root, database, collections/, views/, entity) and read database.json, schema.json, documents.jsonl (collection + view).
  • mongodb_fuse.py — same coverage as the VFS example, but the tree is mounted as a real filesystem so other processes can cat/ls/head the mountpoint directly.
All three default to the mirage_test database seeded by python/scripts/seed_mongodb_test.py.

Finding IDs

_id is serialized as {"$oid": "..."} under Extended JSON:
# List the first 10 ObjectId values
jq -r '.[] | ._id["$oid"]' \
  "/mongodb/sample_mflix/collections/movies/documents.jsonl" | head -n 10

# Find a specific document by ID
grep "573a1390f29313caabcd42e8" \
  "/mongodb/sample_mflix/collections/movies/documents.jsonl"

# Extract a few fields together
jq -r '.[] | "\(._id["$oid"]) \(.title) \(.year)"' \
  "/mongodb/sample_mflix/collections/movies/documents.jsonl"

Working with Large Collections

Streaming is the default; reach for these patterns when you want to keep the round trip small:
# Document count (server-side, no download)
wc -l "/mongodb/sample_mflix/collections/comments/documents.jsonl"

# Most recent documents (sorted by _id desc)
tail -n 10 "/mongodb/sample_mflix/collections/comments/documents.jsonl"

# Server-side query at collection scope avoids streaming the whole jsonl
grep "love" "/mongodb/sample_mflix/collections/comments/"

# Drop heavy fields via elide_fields, then take the slice you want
head -n 20 "/mongodb/sample_mflix/collections/movies/documents.jsonl"
Hide embeddings or large blobs from agent reads with elide_fields; schema.json still documents the original type so the agent can decide when to ask for the raw bytes through a different path.

Shell Commands

Standard commands available on the mounted MongoDB tree:
| Command | Notes |
|---|---|
| ls | List databases, collections, views, and per-entity files |
| cat | Stream documents.jsonl (or read schema.json / database.json) |
| head / tail | Smart: server-side sort + limit (capped at max_doc_limit) |
| tail -f | Live-follow new inserts via Mongo change stream |
| grep / rg | Smart: MongoDB query engine at collection/db scope, regex at file scope |
| jq | Query JSON; use .[] prefix when iterating JSONL files |
| wc | Smart: uses countDocuments() server-side |
| stat | Metadata (doc count, indexes; views skip the index lookup) |
| find | List databases/collections/views with -name, -maxdepth |
| tree | Directory tree view |