The Paperclip resource exposes 8M+ biomedical papers from bioRxiv, medRxiv, and PubMed Central as a virtual filesystem organized by source, year, and month. For credential setup, see Paperclip Setup.

Config

from mirage import MountMode, Workspace
from mirage.resource.paperclip import PaperclipConfig, PaperclipResource

config = PaperclipConfig()
resource = PaperclipResource(config=config)
ws = Workspace({"/paperclip": resource}, mode=MountMode.READ)

Filesystem Layout

/paperclip/
  biorxiv/
    2025/
      12/
        bio_07cb291a7ce4/
          meta.json
          content.lines
          sections/
            introduction.lines
            methods.lines
            results.lines
            discussion.lines
          figures/
            fig1.tif
            fig2.gif
            fig3.jpg
          supplements/
            table_s1.csv
            methods_s1.docx
            notes.md.lines
  medrxiv/
    2025/
      11/
        med_a3f19bc2e810/
          meta.json
          content.lines
          sections/
          figures/
          supplements/
  pmc/
    2024/
      06/
        pmc_9e12d4a7b5c3/
          meta.json
          content.lines
          sections/
          figures/
          supplements/

Sources

| Directory | Description |
| --- | --- |
| biorxiv | bioRxiv preprints |
| medrxiv | medRxiv preprints |
| pmc | PubMed Central peer-reviewed |

Year / Month

Each source directory contains year directories, each containing month directories. Listing a month directory issues a SQL query against the Paperclip API for that source and time range, returning up to default_limit (500) papers.
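
As a small illustration of the source/year/month layout, a month directory path can be assembled from its three components. The helper below is hypothetical (it is not part of mirage); it assumes the `/paperclip` mount point used on this page and zero-padded two-digit months, as in the layout listing.

```python
# Hypothetical helper, not part of mirage: builds a month directory path.
SOURCES = ("biorxiv", "medrxiv", "pmc")  # source directories documented above


def month_path(source: str, year: int, month: int, mount: str = "/paperclip") -> str:
    """Return a month directory path such as /paperclip/pmc/2024/06/."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    # Assumption: months are zero-padded to two digits, as in the layout listing.
    return f"{mount}/{source}/{year:04d}/{month:02d}/"
```

The resulting string can be passed to `ls` through `ws.execute`. Since a listing returns at most default_limit papers, a month with more papers than that would presumably come back capped.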

Paper Directory

Each paper is a directory containing:

| Entry | Type | Description |
| --- | --- | --- |
| meta.json | JSON | Paper metadata (title, authors, DOI, dates) |
| content.lines | Text | Full text with L<n>: line prefixes |
| sections/ | Dir | Per-section *.lines files |
| figures/ | Dir | Figure files (.tif, .gif, .jpg) |
| supplements/ | Dir | Supplementary files (.docx, .csv, .md.lines) |
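
Because *.lines files carry an L<n>: prefix on each line, downstream code will often want to split the line number from the text. The following is a minimal sketch, assuming the prefix is literally `L`, a decimal number, a colon, and an optional space; the exact format is not specified here, and the sample text is made up for illustration.

```python
import re

# Assumed prefix shape: "L<n>:" followed by an optional space and the line text.
_LINE_RE = re.compile(r"^L(\d+): ?(.*)$")


def parse_lines(text: str) -> list[tuple[int, str]]:
    """Split .lines text into (line_number, content) pairs."""
    parsed = []
    for raw in text.splitlines():
        match = _LINE_RE.match(raw)
        if match is None:
            continue  # skip anything that does not carry the assumed prefix
        parsed.append((int(match.group(1)), match.group(2)))
    return parsed


# Made-up sample, not real Paperclip output.
sample = "L1: Introduction\nL2: CRISPR screens are widely used.\n"
print(parse_lines(sample))
```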

Command Passthrough

File-level cat, head, tail, and grep commands are passed through to the Paperclip API rather than downloading the full file first.

grep Flags

| Flag | Supported | Notes |
| --- | --- | --- |
| -i | Yes | Case-insensitive |
| -c | Yes | Count matches |
| -m | Yes | Max match count |
| -v | Yes | Invert match |
| -E | No | Fails silently |
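
Since -E fails silently rather than raising an error, it can be worth rejecting unsupported flags before a command is sent. The sketch below is a hypothetical client-side guard (`build_grep` is not a mirage function); it assumes that the Mirage shell accepts POSIX-style quoting as produced by `shlex.quote` and that -m takes its count as a separate argument.

```python
import shlex
from typing import Iterable, Optional

# Value-less flags listed as supported in the table above.
SIMPLE_GREP_FLAGS = {"-i", "-c", "-v"}


def build_grep(pattern: str, path: str, flags: Iterable[str] = (),
               max_count: Optional[int] = None) -> str:
    """Build a grep command string, refusing flags not listed as supported."""
    parts = ["grep"]
    for flag in flags:
        if flag not in SIMPLE_GREP_FLAGS:
            # -E lands here: fail loudly instead of letting it fail silently.
            raise ValueError(f"unsupported grep flag for Paperclip: {flag}")
        parts.append(flag)
    if max_count is not None:
        parts += ["-m", str(max_count)]  # -m is supported and takes a count
    parts += [shlex.quote(pattern), shlex.quote(path)]
    return " ".join(parts)
```

For example, `build_grep("CRISPR", path, flags=["-i"], max_count=5)` yields a string suitable for `ws.execute`, while passing `-E` raises `ValueError` on the client side.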

grep at Different Scopes

# PAPER level - passthrough to Paperclip API
grep "CRISPR" "/paperclip/biorxiv/2025/12/bio_07cb291a7ce4/content.lines"

# MONTH level - search pre-filter, then grep within results
grep "CRISPR" "/paperclip/biorxiv/2025/12/"

Cache

The Paperclip resource uses IndexCacheStore, the same index cache used for RAM, S3, disk, and GitHub. Index entries store source listings and paper metadata. There is no separate content cache; file content caching is handled by the workspace IOResult mechanism.

Example

import asyncio

from mirage import MountMode, Workspace
from mirage.resource.paperclip import PaperclipConfig, PaperclipResource

config = PaperclipConfig()
resource = PaperclipResource(config=config)


async def main():
    ws = Workspace({"/paperclip": resource}, mode=MountMode.READ)

    # List sources
    r = await ws.execute("ls /paperclip/")
    print(await r.stdout_str())

    # List papers in a month
    r = await ws.execute("ls /paperclip/biorxiv/2025/12/")
    print(await r.stdout_str())

    # Take the first listed paper; stdout_str() is a coroutine and must be awaited
    paper = (await r.stdout_str()).strip().splitlines()[0].strip()
    base = f"/paperclip/biorxiv/2025/12/{paper}"

    # Read paper metadata
    r = await ws.execute(f'cat "{base}/meta.json"')
    print(await r.stdout_str())

    # Read first 20 lines of content
    r = await ws.execute(f'head -n 20 "{base}/content.lines"')
    print(await r.stdout_str())

    # Search within a paper
    r = await ws.execute(f'grep "CRISPR" "{base}/content.lines"')
    print(await r.stdout_str())

    # Search across a month
    r = await ws.execute('grep "CRISPR" "/paperclip/biorxiv/2025/12/"')
    print(await r.stdout_str())

    # List sections
    r = await ws.execute(f'ls "{base}/sections/"')
    print(await r.stdout_str())

    # List figures
    r = await ws.execute(f'ls "{base}/figures/"')
    print(await r.stdout_str())


if __name__ == "__main__":
    asyncio.run(main())

Shell Commands

Standard commands available on the mounted Paperclip tree:

| Command | Notes |
| --- | --- |
| ls | List sources, years, months, papers, files |
| cat | Read full file (passthrough to API) |
| head / tail | Read N lines from start/end (passthrough to API) |
| grep / rg | Smart: passthrough at paper level, pre-filter at month level |
| wc | Line/word/byte counts |
| stat | File metadata (name, size, type) |
| find | Recursive search with -name, -maxdepth |
| tree | Directory tree view |

Resource-Specific Commands

search

Full-text search across the Paperclip corpus.
search --query "CRISPR cas9" --source biorxiv --limit 50

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| --query | Yes | | Search query string |
| --source | No | all | Restrict to a single source |
| --limit | No | 50 | Max results |
| --since | No | | Filter by date (YYYY-MM-DD) |
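
The flags above can be assembled programmatically. The following is a minimal sketch of a hypothetical helper (`build_search` is not part of mirage) that enforces the required --query and checks the --since date format before building the command string. It assumes the Mirage shell accepts POSIX-style quoting as produced by `shlex.quote`.

```python
import re
import shlex
from datetime import date
from typing import Optional


def build_search(query: str, source: Optional[str] = None, limit: int = 50,
                 since: Optional[str] = None) -> str:
    """Build a search command string from the documented flags."""
    if not query:
        raise ValueError("--query is required")
    parts = ["search", "--query", shlex.quote(query)]
    if source is not None:
        parts += ["--source", shlex.quote(source)]  # omitted means all sources
    parts += ["--limit", str(limit)]
    if since is not None:
        # Require the YYYY-MM-DD shape, then check it is a real calendar date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", since):
            raise ValueError(f"--since must be YYYY-MM-DD, got {since!r}")
        date.fromisoformat(since)
        parts += ["--since", since]
    return " ".join(parts)
```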

lookup

Retrieve a paper by DOI or Paperclip ID.
lookup --doi "10.1101/2025.12.01.123456"
lookup --id bio_07cb291a7ce4
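
The example IDs in the filesystem layout (bio_..., med_..., pmc_...) suggest that a Paperclip ID's prefix identifies its source. The sketch below relies on that inference, which is drawn only from the examples on this page and is not a documented guarantee; `source_for_id` is a hypothetical helper.

```python
# Assumption inferred from the example IDs on this page, not a documented rule.
ID_PREFIX_TO_SOURCE = {
    "bio_": "biorxiv",
    "med_": "medrxiv",
    "pmc_": "pmc",
}


def source_for_id(paper_id: str) -> str:
    """Guess the source directory for a Paperclip ID from its prefix."""
    for prefix, source in ID_PREFIX_TO_SOURCE.items():
        if paper_id.startswith(prefix):
            return source
    raise ValueError(f"unrecognized Paperclip ID prefix: {paper_id!r}")
```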

scan

Stream through all papers in a month, applying a filter expression.
scan --source biorxiv --year 2025 --month 12 --filter "authors contains 'Zhang'"

map

Run a transformation across matched papers and collect results.
map --query "CRISPR" --source pmc --extract "title,authors,doi"