Optimize get_file_contents: eliminate raw API call #1586
base: almaleksia/get_file_contents_fixes
Conversation
Force-pushed from 18d267c to 92a129e
This optimization reduces the API calls for `get_file_contents` from 2 to 1 by:

1. Using the `Content` field from GitHub's Contents API directly instead of fetching content separately via the raw API
2. Inferring Content-Type locally using file extension mapping and the mimetype library for magic byte detection

Benefits:
- ~50% latency reduction (one round-trip instead of two)
- Lower rate limit consumption
- Simpler error handling

The mimetype inference uses a hybrid approach:
- Custom extension map for accurate code file type detection (handles edge cases like `.ts`, which the stdlib maps incorrectly to Qt Linguist)
- Magic byte detection via the `gabriel-vasile/mimetype` library for binary vs text detection and shebang script identification

Also adds a `raw_content` parameter for clients that don't support embedded resources (returns plain text instead of resource format).
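The single-call flow hinges on the Contents API returning the file body as a base64-encoded `content` field. As a rough, self-contained sketch (using a hand-built JSON payload and a minimal struct rather than the server's actual types), decoding that field locally looks like this:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// contentsResponse is a minimal slice of the Contents API response;
// field names follow the documented GitHub REST schema.
type contentsResponse struct {
	Name     string `json:"name"`
	Encoding string `json:"encoding"`
	Content  string `json:"content"`
}

// decodeContents parses a Contents API payload and decodes the embedded body.
func decodeContents(raw []byte) (string, error) {
	var c contentsResponse
	if err := json.Unmarshal(raw, &c); err != nil {
		return "", err
	}
	if c.Encoding != "base64" {
		return "", fmt.Errorf("unexpected encoding %q", c.Encoding)
	}
	// The API wraps base64 across lines; strip newlines before strict decoding.
	decoded, err := base64.StdEncoding.DecodeString(strings.ReplaceAll(c.Content, "\n", ""))
	if err != nil {
		return "", err
	}
	return string(decoded), nil
}

func main() {
	payload := []byte(`{"name":"hello.txt","encoding":"base64","content":"aGVsbG8sIHdvcmxkCg=="}`)
	text, err := decodeContents(payload)
	if err != nil {
		panic(err)
	}
	fmt.Print(text) // hello, world
}
```

In the actual PR this decoding is done by go-github's `RepositoryContent.GetContent()`; the point here is only that no second network call is needed once the Contents response is in hand.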
Force-pushed from 92a129e to 5602356
Pull request overview
This PR optimizes the get_file_contents tool by eliminating the raw API call, reducing latency by ~50% and lowering rate limit consumption. The optimization leverages the fact that the GitHub Contents API already returns base64-encoded file content, eliminating the need for a second API call.
Key changes:
- Replaced two-step API call pattern (Contents API for metadata + Raw API for content) with single Contents API call
- Added local MIME type inference system with custom extension mapping and magic byte detection
- Added `raw_content` parameter for backward compatibility with clients that don't support embedded resources
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/github/repositories.go | Core optimization: removed raw API client usage, added base64 decoding, integrated local MIME type inference, added raw_content parameter handling |
| pkg/github/mimetype.go | New file implementing hybrid MIME type inference with custom extension map (~120 mappings) and magic byte detection |
| pkg/github/mimetype_test.go | Comprehensive test coverage for MIME type inference including code files, binary detection, shebangs, and special filenames |
| pkg/github/repositories_test.go | Updated test mocks to use base64-encoded content in Contents API responses instead of separate raw API calls |
| pkg/github/toolsnaps/get_file_contents.snap | Updated tool schema to include new raw_content parameter and reflect description changes |
| README.md | Updated get_file_contents tool documentation to include raw_content parameter |
| go.mod | Added github.com/gabriel-vasile/mimetype dependency |
| go.sum | Added checksums for new mimetype library dependency |
```go
// Check if file is too large (Contents API returns nil content for files > 1MB)
content, decodeErr := fileContent.GetContent()
if decodeErr != nil || content == "" {
	// File is too large - return metadata with download URL
```
Copilot AI · Dec 12, 2025
The error handling for GetContent() may not distinguish between a decode error and a legitimately empty file. According to the comment, "Contents API returns nil content for files > 1MB", but an empty file would also return an empty string. Consider checking fileSize first or examining the specific error type from decodeErr to avoid misclassifying empty files as "too large". Additionally, the decodeErr should be logged or included in the error message to help with debugging when actual decode failures occur.
Suggested change:

```diff
-// Check if file is too large (Contents API returns nil content for files > 1MB)
-content, decodeErr := fileContent.GetContent()
-if decodeErr != nil || content == "" {
-	// File is too large - return metadata with download URL
+// Check for decode error first
+content, decodeErr := fileContent.GetContent()
+if decodeErr != nil {
+	return nil, nil, fmt.Errorf("failed to decode file content: %w", decodeErr)
+}
+// If the file is empty, return empty content (not "too large")
+if fileSize == 0 {
+	return utils.NewToolResultText(""), nil, nil
+}
+// If the file is not empty but content is empty, treat as "too large"
+if content == "" {
```
```go
	return tool, handler
}
```
Copilot AI · Dec 12, 2025
The function comment for ForkRepository was removed. While this may seem like a minor change, it breaks the consistency of documentation style in this file where other exported functions have descriptive comments. The guideline emphasizes code clarity for a wide audience. Consider restoring the comment to maintain documentation consistency across all exported functions in the file.
Suggested change:

```diff
+// ForkRepository creates a tool to fork a GitHub repository to the user's account or a specified organization.
```
Option 1: Extension lookup with Go's stdlib `mime.TypeByExtension`

| Extension | stdlib Result | Actual |
|---|---|---|
| `.ts` | ❌ `text/vnd.trolltech.linguist` (Qt!) | `text/typescript` |
| `.tsx` | ❌ Tiled TSX format | `text/tsx` |
| `.jsx` | ❌ empty | `text/jsx` |
| `.vue` | ❌ empty | `text/x-vue` |
| `.go` | ✅ `text/x-go` | correct |
| `Makefile` | ❌ empty (no ext) | `text/x-makefile` |
| `Dockerfile` | ❌ empty (no ext) | `text/x-dockerfile` |
Verdict: Need a custom override map (~50-100 entries for common code files).
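A minimal sketch of such an override map, consulted before the stdlib fallback. The entries shown are an illustrative subset; the actual map in the PR is larger and its exact contents are not reproduced here:

```go
package main

import (
	"fmt"
	"mime"
)

// codeExtensions overrides mappings the stdlib gets wrong or misses.
// Illustrative subset only, not the PR's full ~120-entry map.
var codeExtensions = map[string]string{
	".ts":  "text/typescript",
	".tsx": "text/tsx",
	".jsx": "text/jsx",
	".vue": "text/x-vue",
}

// typeForExt consults the override map first, then falls back to the stdlib.
func typeForExt(ext string) string {
	if ct, ok := codeExtensions[ext]; ok {
		return ct
	}
	return mime.TypeByExtension(ext)
}

func main() {
	fmt.Println(typeForExt(".ts")) // override wins: text/typescript
	fmt.Println(typeForExt(".html")) // stdlib builtin mapping applies
}
```

The override-first ordering is the important design choice: the stdlib (and the system `mime.types` files it loads) is only trusted for extensions not on the known-bad list.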
Option 2: Content sniffing with mimetype library
`github.com/gabriel-vasile/mimetype` does real magic byte detection:

| Content | Detection |
|---|---|
| ✅ Shebang scripts | `text/x-shellscript`, `text/x-python` |
| ✅ ELF binaries | `application/x-elf` |
| ✅ Images | `image/png`, `image/jpeg` |
| ✅ Archives | `application/zip`, `application/gzip` |
| ✅ JSON | `application/json` |
| ✅ HTML/XML/SVG | Correct detection |
| Unknown text | Falls back to `text/plain; charset=utf-8` |
Verdict: Excellent for binary vs text distinction and shebang scripts. Code files need extension fallback.
Option 3: Hybrid (Recommended)
```go
func inferContentType(path string, content []byte) string {
	// 1. Fast path: extension lookup with overrides
	ext := filepath.Ext(path)
	if ct := codeExtensions[ext]; ct != "" {
		return ct
	}
	if ct := mime.TypeByExtension(ext); ct != "" && !isKnownBadMapping(ext) {
		return ct
	}
	// 2. Extensionless files: check filename
	if ct := knownFilenames[filepath.Base(path)]; ct != "" {
		return ct
	}
	// 3. Content sniffing for binary detection + shebangs
	mt := mimetype.Detect(content)
	return mt.String()
}
```

Latency Analysis
| Approach | Network Calls | Latency | Accuracy |
|---|---|---|---|
| Current (Contents + Raw GET) | 2 | ~100-200ms | 100% |
| Optimized (Contents only + local inference) | 1 | ~50-100ms | ~98% |
| Alternative (Raw HEAD + Raw GET) | 2 (but HEAD is tiny) | ~80-120ms | 100% |
CPU cost of `mimetype.Detect()`: ~1-10 microseconds (negligible vs network)
Trade-offs
Benefits of Optimization
- ~50% latency reduction (eliminate one round-trip)
- Lower rate limit consumption (one API call instead of two)
- Simpler error handling (one failure point)
Risks
- Edge cases: Unusual file types might get wrong Content-Type
- Maintenance: Custom extension map needs updates for new languages
- Behavior change: Content-Type might differ slightly from GitHub's
Mitigation
- For MCP tool use cases (feeding content to LLMs), exact Content-Type is rarely critical
- Binary vs text distinction is the main concern, and `mimetype` handles that well
- Can add logging/metrics to track mismatches
Recommendation
For github-mcp-server specifically:
- Keep the current two-call approach if:
  - You need 100% Content-Type accuracy
  - Rate limiting is not a concern
  - Latency is acceptable
- Switch to single-call + inference if:
  - Latency/throughput is critical
  - Rate limiting is a concern
  - ~98% Content-Type accuracy is acceptable
The hybrid approach (extension + mimetype fallback) is the sweet spot for an MCP server because:
- LLMs don't care if a file is `text/javascript` vs `application/javascript`
- Binary detection is critical (to avoid sending garbage to LLMs)
- Shebang detection is valuable (identifies shell/python scripts without extensions)
Implementation Sketch
See get_file_contents_optimized_example.go in the same directory.
Optimize `get_file_contents`: Eliminate Raw API Call

Summary

This PR optimizes the `get_file_contents` tool by reducing API calls from 2 to 1, resulting in ~50% latency reduction.

Changes

Core Optimization

- Replaced the two-step pattern (Contents API for metadata + raw API for content) with a single Contents API call, decoding the base64 `Content` field locally

New MIME Type Inference System

Added `mimetype.go` with a hybrid approach:
- Custom extension map for code file types, handling edge cases like `.ts` (stdlib returns Qt Linguist instead of TypeScript)
- Magic byte detection via the `gabriel-vasile/mimetype` library for binary vs text detection and shebang script identification (`#!/bin/bash`, `#!/usr/bin/env python`)

Additional Features

- `raw_content` parameter for clients that don't support embedded resources (returns plain text instead of resource format)

Benefits

- ~50% latency reduction (one round-trip instead of two)
- Lower rate limit consumption
- Simpler error handling
Testing

- Comprehensive test coverage for MIME type inference (`mimetype_test.go`)

Trade-offs

The MIME type inference is ~98% accurate vs 100% from GitHub's raw API. This is acceptable because:
- LLM use cases don't distinguish near-equivalent types (`text/javascript` vs `application/javascript`)
- Binary vs text detection, the case that matters most, is handled well by magic byte sniffing

Dependencies

Added: `github.com/gabriel-vasile/mimetype v1.4.11`

This PR is intended to be merged into #1582 (`almaleksia/get_file_contents_fixes`).