Conversation

@SamMorrowDrums (Collaborator)
Optimize get_file_contents: Eliminate Raw API Call

Summary

This PR optimizes the get_file_contents tool by reducing API calls from 2 to 1, resulting in ~50% latency reduction.

Changes

Core Optimization

  • Before: Contents API call (for SHA) + Raw API call (for content)
  • After: Single Contents API call (content is base64-encoded in response)
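The single-call flow works because the Contents API response already carries the file bytes, base64-encoded in its `content` field. A minimal sketch of the decode step (the JSON field names follow GitHub's documented REST response; `decodeContents` is an illustrative helper, not the PR's actual code — the PR uses go-github's `fileContent.GetContent()`, which does the same thing):

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// contentsResponse mirrors the fields of the GitHub Contents API response
// that the single-call flow relies on.
type contentsResponse struct {
	SHA      string `json:"sha"`
	Size     int    `json:"size"`
	Encoding string `json:"encoding"`
	Content  string `json:"content"` // base64, possibly with embedded newlines
}

// decodeContents extracts the file bytes from a Contents API response body,
// so no second call to raw.githubusercontent.com is needed.
func decodeContents(body []byte) (sha string, data []byte, err error) {
	var c contentsResponse
	if err := json.Unmarshal(body, &c); err != nil {
		return "", nil, err
	}
	if c.Encoding != "base64" {
		return "", nil, fmt.Errorf("unexpected encoding %q", c.Encoding)
	}
	// GitHub inserts newlines into the base64 payload; strip them first.
	raw := strings.ReplaceAll(c.Content, "\n", "")
	data, err = base64.StdEncoding.DecodeString(raw)
	if err != nil {
		return "", nil, err
	}
	return c.SHA, data, nil
}

func main() {
	body := []byte(`{"sha":"abc123","size":12,"encoding":"base64","content":"aGVsbG8g\nd29ybGQh\n"}`)
	sha, data, err := decodeContents(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sha=%s content=%q\n", sha, string(data))
}
```

The SHA still arrives in the same response, so update operations keep working without the second round-trip.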

New MIME Type Inference System

Added mimetype.go with a hybrid approach:

  1. Custom extension map for accurate code file detection (~120 mappings)
    • Handles edge cases like .ts (stdlib returns Qt Linguist instead of TypeScript)
    • Covers JavaScript/TypeScript, Go, Rust, Python, Ruby, Java, C family, shell scripts, configs, etc.
  2. Magic byte detection via gabriel-vasile/mimetype library for:
    • Binary vs text detection (critical to avoid sending binary garbage to LLMs)
    • Shebang script identification (#!/bin/bash, #!/usr/bin/env python)
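To illustrate the mechanism behind step 2 (this is a simplified, stdlib-only stand-in, not the PR's `mimetype.go` — the real implementation delegates to the gabriel-vasile/mimetype library, which checks many more signatures):

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sniffContent approximates the magic-byte step: recognise shebang scripts,
// then fall back to a null-byte/UTF-8 heuristic for binary-vs-text detection.
func sniffContent(data []byte) string {
	// Shebang scripts: "#!/bin/bash", "#!/usr/bin/env python", ...
	if bytes.HasPrefix(data, []byte("#!")) {
		line := data
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			line = data[:i]
		}
		switch {
		case bytes.Contains(line, []byte("python")):
			return "text/x-python"
		case bytes.Contains(line, []byte("sh")): // sh, bash, zsh, ...
			return "text/x-shellscript"
		}
		return "text/plain; charset=utf-8"
	}
	// Binary detection: null bytes or invalid UTF-8 in the first 512 bytes.
	head := data
	if len(head) > 512 {
		head = head[:512]
	}
	if bytes.IndexByte(head, 0) >= 0 || !utf8.Valid(head) {
		return "application/octet-stream"
	}
	return "text/plain; charset=utf-8"
}

func main() {
	fmt.Println(sniffContent([]byte("#!/bin/bash\necho hi\n"))) // text/x-shellscript
	fmt.Println(sniffContent([]byte{0x7f, 'E', 'L', 'F', 0x00})) // application/octet-stream
}
```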

Additional Features

  • Added raw_content parameter for clients that don't support embedded resources

Benefits

  • ~50% latency reduction (one round-trip instead of two)
  • Lower rate limit consumption (one API call instead of two)
  • Simpler error handling (single failure point)

Testing

  • All existing tests updated to match new behavior
  • Added comprehensive tests for MIME type inference (mimetype_test.go)
  • Covers: code files, binary detection (PNG/JPEG/ZIP/ELF), shebangs, special filenames (Dockerfile, Makefile)

Trade-offs

The MIME type inference is ~98% accurate vs 100% from GitHub's raw API. This is acceptable because:

  • LLMs don't care about subtle MIME type differences (text/javascript vs application/javascript)
  • Binary vs text detection is the critical distinction, and magic byte detection handles this well
  • Shebang detection is valuable for extensionless scripts

Dependencies

Added: github.com/gabriel-vasile/mimetype v1.4.11


This PR is intended to be merged into #1582 (almaleksia/get_file_contents_fixes).

Copilot AI review requested due to automatic review settings (December 12, 2025 15:25)
@SamMorrowDrums requested a review from a team as a code owner (December 12, 2025 15:25)
@SamMorrowDrums force-pushed the SamMorrowDrums/optimize-file-contents branch from 18d267c to 92a129e (December 12, 2025 15:27)
This optimization reduces the API calls for get_file_contents from 2 to 1 by:

1. Using the Content field from GitHub's Contents API directly instead of
   fetching content separately via the raw API
2. Inferring Content-Type locally using file extension mapping and mimetype
   library for magic byte detection

Benefits:
- ~50% latency reduction (one round-trip instead of two)
- Lower rate limit consumption
- Simpler error handling

The mimetype inference uses a hybrid approach:
- Custom extension map for accurate code file type detection (handles edge
  cases like .ts which stdlib maps incorrectly to Qt Linguist)
- Magic byte detection via gabriel-vasile/mimetype library for binary vs
  text detection and shebang script identification

Also adds raw_content parameter for clients that don't support embedded
resources (returns plain text instead of resource format).
@SamMorrowDrums force-pushed the SamMorrowDrums/optimize-file-contents branch from 92a129e to 5602356 (December 12, 2025 15:28)

Copilot AI (Contributor) left a comment
Pull request overview

This PR optimizes the get_file_contents tool by eliminating the raw API call, reducing latency by ~50% and lowering rate limit consumption. The optimization leverages the fact that the GitHub Contents API already returns base64-encoded file content, eliminating the need for a second API call.

Key changes:

  • Replaced two-step API call pattern (Contents API for metadata + Raw API for content) with single Contents API call
  • Added local MIME type inference system with custom extension mapping and magic byte detection
  • Added raw_content parameter for backward compatibility with clients that don't support embedded resources

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `pkg/github/repositories.go` | Core optimization: removed raw API client usage, added base64 decoding, integrated local MIME type inference, added `raw_content` parameter handling |
| `pkg/github/mimetype.go` | New file implementing hybrid MIME type inference with custom extension map (~120 mappings) and magic byte detection |
| `pkg/github/mimetype_test.go` | Comprehensive test coverage for MIME type inference including code files, binary detection, shebangs, and special filenames |
| `pkg/github/repositories_test.go` | Updated test mocks to use base64-encoded content in Contents API responses instead of separate raw API calls |
| `pkg/github/toolsnaps/get_file_contents.snap` | Updated tool schema to include new `raw_content` parameter and reflect description changes |
| `README.md` | Updated `get_file_contents` tool documentation to include `raw_content` parameter |
| `go.mod` | Added `github.com/gabriel-vasile/mimetype` dependency |
| `go.sum` | Added checksums for new mimetype library dependency |

Comment on lines +691 to +694:

```go
	// Check if file is too large (Contents API returns nil content for files > 1MB)
	content, decodeErr := fileContent.GetContent()
	if decodeErr != nil || content == "" {
		// File is too large - return metadata with download URL
```

Copilot AI Dec 12, 2025

The error handling for GetContent() may not distinguish between a decode error and a legitimately empty file. According to the comment, "Contents API returns nil content for files > 1MB", but an empty file would also return an empty string. Consider checking fileSize first or examining the specific error type from decodeErr to avoid misclassifying empty files as "too large". Additionally, the decodeErr should be logged or included in the error message to help with debugging when actual decode failures occur.

Suggested change:

```diff
-	// Check if file is too large (Contents API returns nil content for files > 1MB)
-	content, decodeErr := fileContent.GetContent()
-	if decodeErr != nil || content == "" {
-		// File is too large - return metadata with download URL
+	// Check for decode error first
+	content, decodeErr := fileContent.GetContent()
+	if decodeErr != nil {
+		return nil, nil, fmt.Errorf("failed to decode file content: %w", decodeErr)
+	}
+	// If the file is empty, return empty content (not "too large")
+	if fileSize == 0 {
+		return utils.NewToolResultText(""), nil, nil
+	}
+	// If the file is not empty but content is empty, treat as "too large"
+	if content == "" {
```


```go
	return tool, handler
}
```


Copilot AI Dec 12, 2025


The function comment for ForkRepository was removed. While this may seem like a minor change, it breaks the consistency of documentation style in this file where other exported functions have descriptive comments. The guideline emphasizes code clarity for a wide audience. Consider restoring the comment to maintain documentation consistency across all exported functions in the file.

Suggested change:

```go
// ForkRepository creates a tool to fork a GitHub repository to the user's account or a specified organization.
```
@SamMorrowDrums (Collaborator, Author)

get_file_contents Optimization Report

Current Implementation Analysis

File: github/github-mcp-server/pkg/github/repositories.go (GetFileContents function)

Current Flow (2 API calls minimum)

┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 1: Contents API                                                        │
│ client.Repositories.GetContents(ctx, owner, repo, path, opts)              │
│ Purpose: Get file SHA (blob SHA for update operations)                      │
│ Rate limit: Counts against authenticated user's quota                       │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Step 2: Raw Content API                                                     │
│ rawClient.GetRawContent(ctx, owner, repo, path, rawOpts)                   │
│ Purpose: Get actual file content + Content-Type header                      │
│ Source: raw.githubusercontent.com                                           │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Content-Type Detection (from HTTP header)                                   │
│ contentType := resp.Header.Get("Content-Type")                             │
│ isTextContent := strings.HasPrefix(contentType, "text/") || ...            │
└─────────────────────────────────────────────────────────────────────────────┘

Why Two Calls?

  1. Contents API → Returns the blob SHA (needed for update operations and MCP result metadata)
  2. Raw API → Returns actual content + reliable Content-Type header

The Contents API does return content, but:

  • It's base64-encoded (extra decode step)
  • Content is omitted for files larger than 1 MB (only metadata and a download URL are returned)
  • No Content-Type header (you'd have to infer it)

Proposed Optimization

Goal: Eliminate the Raw API call by inferring Content-Type from content bytes we already fetch.

Optimized Flow (1 API call)

┌─────────────────────────────────────────────────────────────────────────────┐
│ Single Call: Contents API (already gives us content via GetContents)        │
│ OR: Raw API with HEAD first, then GET                                       │
│ OR: Just Raw GET (if SHA not strictly needed)                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Content-Type Inference (no network call)                                    │
│ 1. Try extension lookup: mime.TypeByExtension(ext) + custom overrides       │
│ 2. Fallback: mimetype.Detect(content) for magic byte detection              │
└─────────────────────────────────────────────────────────────────────────────┘

Content-Type Detection Strategy

Option 1: Extension-based (fast, imperfect)

Go stdlib mime.TypeByExtension() has gaps:

| Extension | stdlib result | Actual |
| --- | --- | --- |
| `.ts` | `text/vnd.trolltech.linguist` (Qt Linguist!) | `text/typescript` |
| `.tsx` | Tiled TSX map format | `text/tsx` |
| `.jsx` | ❌ empty | `text/jsx` |
| `.vue` | ❌ empty | `text/x-vue` |
| `.go` | `text/x-go` | correct |
| `Makefile` | ❌ empty (no extension) | `text/x-makefile` |
| `Dockerfile` | ❌ empty (no extension) | `text/x-dockerfile` |

Verdict: Need a custom override map (~50-100 entries for common code files).
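The override map can be consulted before the stdlib lookup, so the stdlib's gaps and bad mappings never surface. A small sketch (the `codeExtensions` entries shown are a hypothetical subset; the PR's real map has ~120):

```go
package main

import (
	"fmt"
	"mime"
	"path/filepath"
)

// codeExtensions is an illustrative slice of the custom override map;
// the real mimetype.go carries roughly 120 entries.
var codeExtensions = map[string]string{
	".ts":  "text/typescript", // stdlib may map this to Qt Linguist
	".tsx": "text/tsx",
	".jsx": "text/jsx",
	".go":  "text/x-go",
	".py":  "text/x-python",
}

// typeForPath consults the override map first, then falls back to the
// platform's stdlib table (which can be empty or wrong for code files).
func typeForPath(path string) string {
	ext := filepath.Ext(path)
	if ct, ok := codeExtensions[ext]; ok {
		return ct
	}
	return mime.TypeByExtension(ext) // result varies with platform mime tables
}

func main() {
	fmt.Println(typeForPath("src/app.ts"))  // text/typescript (override wins)
	fmt.Println(typeForPath("index.html")) // common web types come from stdlib
}
```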

Option 2: Content sniffing with mimetype library

github.com/gabriel-vasile/mimetype does real magic byte detection:

| Content | Detection |
| --- | --- |
| ✅ Shebang scripts | `text/x-shellscript`, `text/x-python` |
| ✅ ELF binaries | `application/x-elf` |
| ✅ Images | `image/png`, `image/jpeg` |
| ✅ Archives | `application/zip`, `application/gzip` |
| ✅ JSON | `application/json` |
| ✅ HTML/XML/SVG | correct detection |
| ⚠️ Plain code | falls back to `text/plain; charset=utf-8` |

Verdict: Excellent for binary vs text distinction and shebang scripts. Code files need extension fallback.
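The mechanism behind the table above is prefix matching against known magic-byte signatures. A stdlib-only sketch of that idea (the real library checks far more signatures, at multiple offsets, in a detection tree):

```go
package main

import (
	"bytes"
	"fmt"
)

// A few well-known magic-byte signatures; illustrative, not exhaustive.
var signatures = []struct {
	prefix []byte
	mime   string
}{
	{[]byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}, "image/png"},
	{[]byte{0xff, 0xd8, 0xff}, "image/jpeg"},
	{[]byte{'P', 'K', 0x03, 0x04}, "application/zip"},
	{[]byte{0x7f, 'E', 'L', 'F'}, "application/x-elf"},
	{[]byte{0x1f, 0x8b}, "application/gzip"},
}

// detectByMagic returns the first matching signature's MIME type,
// falling back to plain text like the table's last row.
func detectByMagic(data []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(data, sig.prefix) {
			return sig.mime
		}
	}
	return "text/plain; charset=utf-8" // fallback; the real library checks much more
}

func main() {
	fmt.Println(detectByMagic([]byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'})) // image/png
	fmt.Println(detectByMagic([]byte("package main")))                              // text/plain; charset=utf-8
}
```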

Option 3: Hybrid (Recommended)

```go
func inferContentType(path string, content []byte) string {
	// 1. Fast path: extension lookup with overrides
	ext := filepath.Ext(path)
	if ct := codeExtensions[ext]; ct != "" {
		return ct
	}
	if ct := mime.TypeByExtension(ext); ct != "" && !isKnownBadMapping(ext) {
		return ct
	}

	// 2. Extensionless files: check filename
	if ct := knownFilenames[filepath.Base(path)]; ct != "" {
		return ct
	}

	// 3. Content sniffing for binary detection + shebangs
	mt := mimetype.Detect(content)
	return mt.String()
}
```

Latency Analysis

| Approach | Network calls | Latency | Accuracy |
| --- | --- | --- | --- |
| Current (Contents + Raw GET) | 2 | ~100-200ms | 100% |
| Optimized (Contents only + local inference) | 1 | ~50-100ms | ~98% |
| Alternative (Raw HEAD + Raw GET) | 2 (but HEAD is tiny) | ~80-120ms | 100% |

CPU cost of mimetype.Detect(): ~1-10 microseconds (negligible vs network)


Trade-offs

Benefits of Optimization

  1. ~50% latency reduction (eliminate one round-trip)
  2. Lower rate limit consumption (one API call instead of two)
  3. Simpler error handling (one failure point)

Risks

  1. Edge cases: Unusual file types might get wrong Content-Type
  2. Maintenance: Custom extension map needs updates for new languages
  3. Behavior change: Content-Type might differ slightly from GitHub's

Mitigation

  • For MCP tool use cases (feeding content to LLMs), exact Content-Type is rarely critical
  • Binary vs text distinction is the main concern, and mimetype handles that well
  • Can add logging/metrics to track mismatches

Recommendation

For github-mcp-server specifically:

  1. Keep the current two-call approach if:

    • You need 100% Content-Type accuracy
    • Rate limiting is not a concern
    • Latency is acceptable
  2. Switch to single-call + inference if:

    • Latency/throughput is critical
    • Rate limiting is a concern
    • ~98% Content-Type accuracy is acceptable

The hybrid approach (extension + mimetype fallback) is the sweet spot for an MCP server because:

  • LLMs don't care if a file is text/javascript vs application/javascript
  • Binary detection is critical (to avoid sending garbage to LLMs)
  • Shebang detection is valuable (identifies shell/python scripts without extensions)

Implementation Sketch

See get_file_contents_optimized_example.go in the same directory.
