Recommendations for high-quality datasets for AI/Deep Learning training? #181971
Hi @Yigtwxx, thanks for being a part of the GitHub Community, we're glad you're here! If you're looking for help with this specific topic, you might want to try asking somewhere that focuses on this project, such as this one. It's possible that another GitHub user has run into this same issue and can help, but the GitHub Community on Discussions focuses primarily on topics related to GitHub itself or collaboration on project development and ideas. We want to make sure you’re getting the best support you can, but this space may not be the right place for this particular topic. Best of luck!
computer vision project
Great question! Here are my go-to sources beyond Kaggle:
For Computer Vision (Object Detection):
Roboflow Universe: thousands of labeled CV datasets, many ready for YOLO
For Text / NLP (Language Models, Next-Word Prediction):
HuggingFace Datasets: the best single source for NLP, with thousands of text corpora
For General Clean ML Datasets:
UCI ML Repository: classic, clean, well-documented datasets
Hidden gems:
Zenodo: research datasets published by universities, often of high quality
For your Next-Word Prediction project specifically, I'd start with HuggingFace Datasets. Filter by the language modeling task and you should find suitable corpora quickly.
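Whichever source the text comes from, a next-word prediction project eventually needs the corpus turned into (context, next word) training pairs. The sketch below shows one minimal way to do that using only the Python standard library. The function names (`make_next_word_pairs`, `fit_counts`, `predict_next`), the whitespace tokenizer, and the tiny inline corpus are all illustrative assumptions standing in for a real downloaded dataset, and the count-based predictor is only a baseline, not a deep learning model.

```python
from collections import Counter, defaultdict


def make_next_word_pairs(text, context_size=2):
    """Turn raw text into (context, next_word) pairs.

    Tokenization is a plain lowercase whitespace split (an assumption;
    a real project would likely use a proper tokenizer).
    """
    tokens = text.lower().split()
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        pairs.append((context, tokens[i]))
    return pairs


def fit_counts(pairs):
    """Count how often each word follows each context."""
    counts = defaultdict(Counter)
    for context, next_word in pairs:
        counts[context][next_word] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent next word for a context, or None if unseen."""
    options = counts.get(tuple(context))
    if not options:
        return None
    return options.most_common(1)[0][0]


# Hypothetical stand-in for a corpus loaded from one of the sources above.
corpus = "the cat sat on the mat . the cat sat on the sofa ."

pairs = make_next_word_pairs(corpus, context_size=2)
model = fit_counts(pairs)
print(predict_next(model, ["the", "cat"]))
```

The same pairs can later feed a neural model instead of the counter, so the counting baseline doubles as a quick sanity check on a new dataset before any training starts.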
Select Topic Area
Question
Body
I am a Software Engineering student currently working on Deep Learning projects, specifically focusing on Computer Vision and NLP (Next-Word Prediction).
Besides the well-known platforms like Kaggle, what are your go-to sources for finding high-quality and reliable datasets? I am looking for platforms that offer:
Labeled images for Object Detection
Text corpora for Language Models
Clean datasets for general Machine Learning tasks
Any recommendations or hidden gems would be greatly appreciated.