Conversation

@SamMorrowDrums (Collaborator)
Optimize get_file_contents: Eliminate Raw API Call

Summary

This PR optimizes the get_file_contents tool by reducing API calls from 2 to 1, resulting in ~50% latency reduction.

Changes

Core Optimization

  • Before: Contents API call (for SHA) + Raw API call (for content)
  • After: Single Contents API call (content is base64-encoded in response)
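The single-call flow works because the Contents API response already carries the file bytes, base64-encoded in its `content` field. A minimal sketch of the decode step (the JSON field names follow GitHub's documented REST response; `decodeContents` is an illustrative helper, not the PR's actual code — the PR uses go-github's `fileContent.GetContent()`, which does the same thing):

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// contentsResponse mirrors the fields of the GitHub Contents API response
// that the single-call flow relies on.
type contentsResponse struct {
	SHA      string `json:"sha"`
	Size     int    `json:"size"`
	Encoding string `json:"encoding"`
	Content  string `json:"content"` // base64, possibly with embedded newlines
}

// decodeContents extracts the file bytes from a Contents API response body,
// so no second call to raw.githubusercontent.com is needed.
func decodeContents(body []byte) (sha string, data []byte, err error) {
	var c contentsResponse
	if err := json.Unmarshal(body, &c); err != nil {
		return "", nil, err
	}
	if c.Encoding != "base64" {
		return "", nil, fmt.Errorf("unexpected encoding %q", c.Encoding)
	}
	// GitHub inserts newlines into the base64 payload; strip them first.
	raw := strings.ReplaceAll(c.Content, "\n", "")
	data, err = base64.StdEncoding.DecodeString(raw)
	if err != nil {
		return "", nil, err
	}
	return c.SHA, data, nil
}

func main() {
	body := []byte(`{"sha":"abc123","size":12,"encoding":"base64","content":"aGVsbG8g\nd29ybGQh\n"}`)
	sha, data, err := decodeContents(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sha=%s content=%q\n", sha, string(data))
}
```

The SHA still arrives in the same response, so update operations keep working without the second round-trip.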

New MIME Type Inference System

Added mimetype.go with a hybrid approach:

  1. Custom extension map for accurate code file detection (~120 mappings)
    • Handles edge cases like .ts (stdlib returns Qt Linguist instead of TypeScript)
    • Covers JavaScript/TypeScript, Go, Rust, Python, Ruby, Java, C family, shell scripts, configs, etc.
  2. Magic byte detection via gabriel-vasile/mimetype library for:
    • Binary vs text detection (critical to avoid sending binary garbage to LLMs)
    • Shebang script identification (#!/bin/bash, #!/usr/bin/env python)
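To illustrate the mechanism behind step 2 (this is a simplified, stdlib-only stand-in, not the PR's `mimetype.go` — the real implementation delegates to the gabriel-vasile/mimetype library, which checks many more signatures):

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sniffContent approximates the magic-byte step: recognise shebang scripts,
// then fall back to a null-byte/UTF-8 heuristic for binary-vs-text detection.
func sniffContent(data []byte) string {
	// Shebang scripts: "#!/bin/bash", "#!/usr/bin/env python", ...
	if bytes.HasPrefix(data, []byte("#!")) {
		line := data
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			line = data[:i]
		}
		switch {
		case bytes.Contains(line, []byte("python")):
			return "text/x-python"
		case bytes.Contains(line, []byte("sh")): // sh, bash, zsh, ...
			return "text/x-shellscript"
		}
		return "text/plain; charset=utf-8"
	}
	// Binary detection: null bytes or invalid UTF-8 in the first 512 bytes.
	head := data
	if len(head) > 512 {
		head = head[:512]
	}
	if bytes.IndexByte(head, 0) >= 0 || !utf8.Valid(head) {
		return "application/octet-stream"
	}
	return "text/plain; charset=utf-8"
}

func main() {
	fmt.Println(sniffContent([]byte("#!/bin/bash\necho hi\n"))) // text/x-shellscript
	fmt.Println(sniffContent([]byte{0x7f, 'E', 'L', 'F', 0x00})) // application/octet-stream
}
```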

Additional Features

  • Added raw_content parameter for clients that don't support embedded resources

Benefits

  • ~50% latency reduction (one round-trip instead of two)
  • Lower rate limit consumption (one API call instead of two)
  • Simpler error handling (single failure point)

Testing

  • All existing tests updated to match new behavior
  • Added comprehensive tests for MIME type inference (mimetype_test.go)
  • Covers: code files, binary detection (PNG/JPEG/ZIP/ELF), shebangs, special filenames (Dockerfile, Makefile)

Trade-offs

The MIME type inference is ~98% accurate vs 100% from GitHub's raw API. This is acceptable because:

  • LLMs don't care about subtle MIME type differences (text/javascript vs application/javascript)
  • Binary vs text detection is the critical distinction, and magic byte detection handles this well
  • Shebang detection is valuable for extensionless scripts

Dependencies

Added: github.com/gabriel-vasile/mimetype v1.4.11


This PR is intended to be merged into #1582 (almaleksia/get_file_contents_fixes).

Copilot AI review requested due to automatic review settings (December 12, 2025 15:25)
@SamMorrowDrums requested a review from a team as a code owner (December 12, 2025 15:25)
@SamMorrowDrums force-pushed the SamMorrowDrums/optimize-file-contents branch from 18d267c to 92a129e (December 12, 2025 15:27)
This optimization reduces the API calls for get_file_contents from 2 to 1 by:

1. Using the Content field from GitHub's Contents API directly instead of
   fetching content separately via the raw API
2. Inferring Content-Type locally using file extension mapping and mimetype
   library for magic byte detection

Benefits:
- ~50% latency reduction (one round-trip instead of two)
- Lower rate limit consumption
- Simpler error handling

The mimetype inference uses a hybrid approach:
- Custom extension map for accurate code file type detection (handles edge
  cases like .ts which stdlib maps incorrectly to Qt Linguist)
- Magic byte detection via gabriel-vasile/mimetype library for binary vs
  text detection and shebang script identification

Also adds raw_content parameter for clients that don't support embedded
resources (returns plain text instead of resource format).
@SamMorrowDrums force-pushed the SamMorrowDrums/optimize-file-contents branch from 92a129e to 5602356 (December 12, 2025 15:28)

Copilot AI (Contributor) left a comment
Pull request overview

This PR optimizes the get_file_contents tool by eliminating the raw API call, reducing latency by ~50% and lowering rate limit consumption. The optimization leverages the fact that the GitHub Contents API already returns base64-encoded file content, eliminating the need for a second API call.

Key changes:

  • Replaced two-step API call pattern (Contents API for metadata + Raw API for content) with single Contents API call
  • Added local MIME type inference system with custom extension mapping and magic byte detection
  • Added raw_content parameter for backward compatibility with clients that don't support embedded resources

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `pkg/github/repositories.go` | Core optimization: removed raw API client usage, added base64 decoding, integrated local MIME type inference, added `raw_content` parameter handling |
| `pkg/github/mimetype.go` | New file implementing hybrid MIME type inference with custom extension map (~120 mappings) and magic byte detection |
| `pkg/github/mimetype_test.go` | Comprehensive test coverage for MIME type inference including code files, binary detection, shebangs, and special filenames |
| `pkg/github/repositories_test.go` | Updated test mocks to use base64-encoded content in Contents API responses instead of separate raw API calls |
| `pkg/github/toolsnaps/get_file_contents.snap` | Updated tool schema to include new `raw_content` parameter and reflect description changes |
| `README.md` | Updated `get_file_contents` tool documentation to include `raw_content` parameter |
| `go.mod` | Added `github.com/gabriel-vasile/mimetype` dependency |
| `go.sum` | Added checksums for new mimetype library dependency |

Comment on lines +691 to +694:

```go
	// Check if file is too large (Contents API returns nil content for files > 1MB)
	content, decodeErr := fileContent.GetContent()
	if decodeErr != nil || content == "" {
		// File is too large - return metadata with download URL
```

Copilot AI Dec 12, 2025

The error handling for GetContent() may not distinguish between a decode error and a legitimately empty file. According to the comment, "Contents API returns nil content for files > 1MB", but an empty file would also return an empty string. Consider checking fileSize first or examining the specific error type from decodeErr to avoid misclassifying empty files as "too large". Additionally, the decodeErr should be logged or included in the error message to help with debugging when actual decode failures occur.

Suggested change:

```diff
-	// Check if file is too large (Contents API returns nil content for files > 1MB)
-	content, decodeErr := fileContent.GetContent()
-	if decodeErr != nil || content == "" {
-		// File is too large - return metadata with download URL
+	// Check for decode error first
+	content, decodeErr := fileContent.GetContent()
+	if decodeErr != nil {
+		return nil, nil, fmt.Errorf("failed to decode file content: %w", decodeErr)
+	}
+	// If the file is empty, return empty content (not "too large")
+	if fileSize == 0 {
+		return utils.NewToolResultText(""), nil, nil
+	}
+	// If the file is not empty but content is empty, treat as "too large"
+	if content == "" {
```


```go
	return tool, handler
}
```


Copilot AI Dec 12, 2025


The function comment for ForkRepository was removed. While this may seem like a minor change, it breaks the consistency of documentation style in this file where other exported functions have descriptive comments. The guideline emphasizes code clarity for a wide audience. Consider restoring the comment to maintain documentation consistency across all exported functions in the file.

Suggested change:

```go
// ForkRepository creates a tool to fork a GitHub repository to the user's account or a specified organization.
```
@SamMorrowDrums (Collaborator, Author)

get_file_contents Optimization Report

Current Implementation Analysis

File: github/github-mcp-server/pkg/github/repositories.go (GetFileContents function)

Current Flow (2 API calls minimum)

┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 1: Contents API                                                        │
│ client.Repositories.GetContents(ctx, owner, repo, path, opts)              │
│ Purpose: Get file SHA (blob SHA for update operations)                      │
│ Rate limit: Counts against authenticated user's quota                       │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 2: Raw Content API                                                     │
│ rawClient.GetRawContent(ctx, owner, repo, path, rawOpts)                   │
│ Purpose: Get actual file content + Content-Type header                      │
│ Source: raw.githubusercontent.com                                           │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Content-Type Detection (from HTTP header)                                   │
│ contentType := resp.Header.Get("Content-Type")                             │
│ isTextContent := strings.HasPrefix(contentType, "text/") || ...            │
└─────────────────────────────────────────────────────────────────────────────┘

Why Two Calls?

  1. Contents API → Returns the blob SHA (needed for update operations and MCP result metadata)
  2. Raw API → Returns actual content + reliable Content-Type header

The Contents API does return content, but:

  • It's base64-encoded (extra decode step)
  • Content is omitted for files larger than 1 MB (only metadata and a download URL are returned)
  • No Content-Type header (you'd have to infer it)

Proposed Optimization

Goal: Eliminate the Raw API call by inferring Content-Type from content bytes we already fetch.

Optimized Flow (1 API call)

┌─────────────────────────────────────────────────────────────────────────────┐
│ Single Call: Contents API (already gives us content via GetContents)        │
│ OR: Raw API with HEAD first, then GET                                       │
│ OR: Just Raw GET (if SHA not strictly needed)                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Content-Type Inference (no network call)                                    │
│ 1. Try extension lookup: mime.TypeByExtension(ext) + custom overrides       │
│ 2. Fallback: mimetype.Detect(content) for magic byte detection              │
└─────────────────────────────────────────────────────────────────────────────┘

Content-Type Detection Strategy

Option 1: Extension-based (fast, imperfect)

Go stdlib mime.TypeByExtension() has gaps:

| Extension | stdlib result | Actual |
| --- | --- | --- |
| `.ts` | `text/vnd.trolltech.linguist` (Qt Linguist!) | `text/typescript` |
| `.tsx` | Tiled TSX map format | `text/tsx` |
| `.jsx` | ❌ empty | `text/jsx` |
| `.vue` | ❌ empty | `text/x-vue` |
| `.go` | `text/x-go` | correct |
| `Makefile` | ❌ empty (no extension) | `text/x-makefile` |
| `Dockerfile` | ❌ empty (no extension) | `text/x-dockerfile` |

Verdict: Need a custom override map (~50-100 entries for common code files).
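The override map can be consulted before the stdlib lookup, so the stdlib's gaps and bad mappings never surface. A small sketch (the `codeExtensions` entries shown are a hypothetical subset; the PR's real map has ~120):

```go
package main

import (
	"fmt"
	"mime"
	"path/filepath"
)

// codeExtensions is an illustrative slice of the custom override map;
// the real mimetype.go carries roughly 120 entries.
var codeExtensions = map[string]string{
	".ts":  "text/typescript", // stdlib may map this to Qt Linguist
	".tsx": "text/tsx",
	".jsx": "text/jsx",
	".go":  "text/x-go",
	".py":  "text/x-python",
}

// typeForPath consults the override map first, then falls back to the
// platform's stdlib table (which can be empty or wrong for code files).
func typeForPath(path string) string {
	ext := filepath.Ext(path)
	if ct, ok := codeExtensions[ext]; ok {
		return ct
	}
	return mime.TypeByExtension(ext) // result varies with platform mime tables
}

func main() {
	fmt.Println(typeForPath("src/app.ts"))  // text/typescript (override wins)
	fmt.Println(typeForPath("index.html")) // common web types come from stdlib
}
```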

Option 2: Content sniffing with mimetype library

github.com/gabriel-vasile/mimetype does real magic byte detection:

| Content | Detection |
| --- | --- |
| ✅ Shebang scripts | `text/x-shellscript`, `text/x-python` |
| ✅ ELF binaries | `application/x-elf` |
| ✅ Images | `image/png`, `image/jpeg` |
| ✅ Archives | `application/zip`, `application/gzip` |
| ✅ JSON | `application/json` |
| ✅ HTML/XML/SVG | correct detection |
| ⚠️ Plain code | falls back to `text/plain; charset=utf-8` |

Verdict: Excellent for binary vs text distinction and shebang scripts. Code files need extension fallback.
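The mechanism behind the table above is prefix matching against known magic-byte signatures. A stdlib-only sketch of that idea (the real library checks far more signatures, at multiple offsets, in a detection tree):

```go
package main

import (
	"bytes"
	"fmt"
)

// A few well-known magic-byte signatures; illustrative, not exhaustive.
var signatures = []struct {
	prefix []byte
	mime   string
}{
	{[]byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}, "image/png"},
	{[]byte{0xff, 0xd8, 0xff}, "image/jpeg"},
	{[]byte{'P', 'K', 0x03, 0x04}, "application/zip"},
	{[]byte{0x7f, 'E', 'L', 'F'}, "application/x-elf"},
	{[]byte{0x1f, 0x8b}, "application/gzip"},
}

// detectByMagic returns the first matching signature's MIME type,
// falling back to plain text like the table's last row.
func detectByMagic(data []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(data, sig.prefix) {
			return sig.mime
		}
	}
	return "text/plain; charset=utf-8" // fallback; the real library checks much more
}

func main() {
	fmt.Println(detectByMagic([]byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'})) // image/png
	fmt.Println(detectByMagic([]byte("package main")))                              // text/plain; charset=utf-8
}
```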

Option 3: Hybrid (Recommended)

```go
func inferContentType(path string, content []byte) string {
	// 1. Fast path: extension lookup with overrides
	ext := filepath.Ext(path)
	if ct := codeExtensions[ext]; ct != "" {
		return ct
	}
	if ct := mime.TypeByExtension(ext); ct != "" && !isKnownBadMapping(ext) {
		return ct
	}

	// 2. Extensionless files: check filename
	if ct := knownFilenames[filepath.Base(path)]; ct != "" {
		return ct
	}

	// 3. Content sniffing for binary detection + shebangs
	mt := mimetype.Detect(content)
	return mt.String()
}
```

Latency Analysis

| Approach | Network calls | Latency | Accuracy |
| --- | --- | --- | --- |
| Current (Contents + Raw GET) | 2 | ~100-200ms | 100% |
| Optimized (Contents only + local inference) | 1 | ~50-100ms | ~98% |
| Alternative (Raw HEAD + Raw GET) | 2 (but HEAD is tiny) | ~80-120ms | 100% |

CPU cost of mimetype.Detect(): ~1-10 microseconds (negligible vs network)


Trade-offs

Benefits of Optimization

  1. ~50% latency reduction (eliminate one round-trip)
  2. Lower rate limit consumption (one API call instead of two)
  3. Simpler error handling (one failure point)

Risks

  1. Edge cases: Unusual file types might get wrong Content-Type
  2. Maintenance: Custom extension map needs updates for new languages
  3. Behavior change: Content-Type might differ slightly from GitHub's

Mitigation

  • For MCP tool use cases (feeding content to LLMs), exact Content-Type is rarely critical
  • Binary vs text distinction is the main concern, and mimetype handles that well
  • Can add logging/metrics to track mismatches

Recommendation

For github-mcp-server specifically:

  1. Keep the current two-call approach if:

    • You need 100% Content-Type accuracy
    • Rate limiting is not a concern
    • Latency is acceptable
  2. Switch to single-call + inference if:

    • Latency/throughput is critical
    • Rate limiting is a concern
    • ~98% Content-Type accuracy is acceptable

The hybrid approach (extension + mimetype fallback) is the sweet spot for an MCP server because:

  • LLMs don't care if a file is text/javascript vs application/javascript
  • Binary detection is critical (to avoid sending garbage to LLMs)
  • Shebang detection is valuable (identifies shell/python scripts without extensions)

Implementation Sketch

See get_file_contents_optimized_example.go in the same directory.
