GitHub’s Anti-Spam System Is Struggling Against Persistent Abuse #191078
Replies: 7 comments 8 replies
💬 Your Product Feedback Has Been Submitted 🎉 Thank you for taking the time to share your insights with us! Your feedback is invaluable as we build a better GitHub experience for all our users. Here's what you can expect moving forward ⏩
Where to look to see what's shipping 👀
What you can do in the meantime 💻
As a member of the GitHub community, your participation is essential. While we can't promise that every suggestion will be implemented, we want to emphasize that your feedback is instrumental in guiding our decisions and priorities. Thank you once again for your contribution to making GitHub even better! We're grateful for your ongoing support and collaboration in shaping the future of our platform. ⭐
See my reply here: microsoft/WSL#40028 (comment)
It is increasing very quickly
GitHub knows how to get rid of this spam issue, but you choose not to change anything.
Nearly 500 repos, containing 80k issues, have been reported. That is crazy. https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-04-01.md
Hey @xlionjuan
Addressing GitHub's Persistent Spam Problem: Actionable Steps and Systemic Pressure

Your analysis correctly identifies a severe, recurring failure in GitHub's abuse prevention. The scale (thousands of repositories), persistence (years), and high-profile targets (Microsoft/WSL) demonstrate a prioritization gap, not just a detection gap. The AI training data contamination risk adds a critical, legally salient dimension. This is not just about annoyance; it's about platform integrity and downstream liability. Here is a concrete, multi-pronged strategy for the community to force material change.

1. Immediate Mitigation: Protect Yourself and Your Projects

While systemic change is the goal, individuals and organizations can reduce exposure.

A. Use Advanced Search Filters to Avoid Spam

Spam often follows patterns: repetitive keywords, random strings, or promotional terms. Use GitHub's search qualifiers aggressively.

    # Search for common spam indicators (adjust keywords)
    # Exclude known spammy terms in repo names/descriptions
    gh search repos "your-project" --language=python --archived=false \
      --sort=updated --order=desc | grep -v -E "(免费|代写|论文|包过|刷单|兼职|赚钱)"
    # Find potentially spammy new repos in a topic
    # (-E is required so that "|" is treated as alternation)
    gh search repos "topic:ml" --created=">2024-01-01" | grep -i -E "vip|qq|微信|telegram"

Official Reference: GitHub Search Syntax

B. Automate Local Detection with Simple Scripts

If you maintain a popular project, scan incoming issues/PRs for spam patterns.

    # Example: Simple spam keyword detector for issue/PR bodies
    import re
    import sys

    # Alternatives use (?:a|b) groups; [a|b] would be a character class
    # matching any single listed character, not whole words.
    SPAM_PATTERNS = [
        r"加(?:QQ|微信|vx)\s*\d+",  # "add me on QQ/WeChat/vx" plus an account number
        r"代(?:写|做|开发)",  # ghostwriting / done-for-you / contract development offers
        r"包(?:过|成功)",  # "guaranteed pass" / "guaranteed success"
        r"低价|优惠|折扣",  # low price / special offer / discount
        r"联系.*?[0-9]{5,}",  # "contact" followed by a long number
        r"http[s]?://(?!github\.com|stackoverflow\.com)[^\s]+",  # Suspicious non-dev links
    ]

    def is_likely_spam(text):
        # re.IGNORECASE handles case, so the text does not need lowercasing
        return any(re.search(pattern, text, re.IGNORECASE) for pattern in SPAM_PATTERNS)

    if __name__ == "__main__":
        with open(sys.argv[1], encoding="utf-8") as f:  # Pass issue/PR body as file
            if is_likely_spam(f.read()):
                print("⚠️ Potential spam detected")
                sys.exit(1)

Integrate this into CI (e.g., GitHub Actions) to auto-label spammy PRs.

C. Leverage GitHub's Built-in Tools
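Labels are one such built-in tool. As a minimal sketch of the auto-labeling step suggested for CI above, the snippet below builds a call to GitHub's REST endpoint for adding labels to an issue (pull requests share the same numbering, so it applies to them too). The function names `build_label_request` and `label_as_spam`, and the label name `spam`, are illustrative assumptions and not part of any existing tool. The token is assumed to be one your workflow passes in with permission to write to issues.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_label_request(owner, repo, issue_number, token, label="spam"):
    """Build (but do not send) a request that adds `label` to an issue or PR.

    Hypothetical helper: the "spam" label name is an assumption and the
    label should already exist in the repository's label list.
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/issues/{issue_number}/labels"
    body = json.dumps({"labels": [label]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def label_as_spam(owner, repo, issue_number, token):
    """Send the labeling request and return the HTTP status code."""
    request = build_label_request(owner, repo, issue_number, token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A workflow step could call `label_as_spam(...)` only when the detector exits non-zero. Labeling rather than closing keeps a human in the loop, which matters because the link pattern in the detector will also flag many legitimate URLs.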
2. Force Systemic Accountability: Escalation Demands

Your spam reports are a start, but they get lost in the queue. Escalate with specific, referenced evidence.

A. Submit Formal Abuse Reports with Legal & AI Risk Framing

When reporting via GitHub's form, structure the description to trigger higher-priority review:

B. Parallel Public Escalation

C. Mobilize Affected Projects

Encourage maintainers of high-profile repos (like WSL, Kubernetes, TensorFlow) to:

3. The AI Data Contamination Lever: Your Strongest Argument

GitHub's own documentation acknowledges the risk you describe.

Action: In all escalations, explicitly state:

Cite the OpenAI Codex issue as a precedent where spam directly impacted an AI product.

4. What Success Looks Like: Concrete Milestones

Demand specific, measurable actions from GitHub:

Conclusion: Shift from "Reporting" to "Demanding"

The pattern of inaction suggests cost-benefit calculations that favor spam tolerance. You must change that equation:

Start today:

The goal is not just to remove individual repos, but to force GitHub to publicly commit to specific anti-spam SLAs and transparency measures. Until the cost of spam exceeds the cost of effective moderation, the status quo will persist. Your detailed evidence is the leverage; use it systematically.
🏷️ Discussion Type
Product Feedback
💬 Feature/Topic Area
Other
Body
Summary
In recent days, GitHub has once again been flooded with large-scale Chinese spam. This is not a new problem, nor a rare occurrence—it is a recurring pattern that has been visible for years without any clearly effective resolution. Microsoft’s WSL repository is simply one of the latest high-profile examples.
The following discussions document the issue in detail and highlight the scale and persistence of the problem:
microsoft/WSL#40028
microsoft/WSL#21802
Request
GitHub and Microsoft need to acknowledge the seriousness of this issue—not in abstract terms, but in terms of concrete impact and accountability.
The ongoing presence of large-scale spam repositories calls into question whether current moderation and abuse-prevention mechanisms are functioning at an acceptable level. When such content remains widespread and long-lived, it is difficult to interpret this as anything other than a systemic failure to contain known abuse patterns.
More critically, GitHub is not just a hosting platform—it is widely used as a data source for training AI systems. Allowing large volumes of low-quality or malicious content to persist creates a foreseeable risk: contamination of training datasets. This concern is not hypothetical; it directly relates to cases already raised regarding OpenAI’s Codex.
openai/codex#11966
Framed this way, the issue extends beyond spam itself. It raises a broader question: to what extent are GitHub and Microsoft willing to take responsibility for the downstream consequences of the data they host and distribute at scale?
Using AI-assisted analysis, I have identified a significant number of spam repositories, with some cases traceable back to as early as 2023. The longevity of these repositories strongly suggests that this is not merely a detection problem, but a prioritization problem.
I've submitted the following reports to GitHub.
https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-03-30.md
https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-03-30-historical.md
At this point, continued inaction—or responses that fail to materially reduce the scale of the problem—will increasingly be seen not as oversight, but as tacit acceptance.