Sumanth (@Sumanth_077)

Simplifying LLMs, RAG, Machine Learning & AI Agents for you! • ML Developer Advocate • Shipping Open Source AI apps
870 Following    76.6K Followers
Fine-tuning massive LLMs used to be painfully slow, but not anymore!

4 open source libraries that accelerate fine-tuning of Large Language Models:

1. Unsloth AI
• Fine-tune models like Qwen3, Llama 4, and Gemma 3 up to 2× faster with 70% less VRAM
• Uses optimized Triton kernels and manual backprop for exact accuracy
• Supports low-resource setups and runs on consumer GPUs or even Colab/Kaggle with ~3 GB VRAM
GitHub repo →

2. LLaMA Factory
• Fine-tune over 100 models (LLaMA, Mistral, Gemma, etc.) using a simple CLI or WebUI
• Supports LoRA, QLoRA, full or frozen fine-tuning across 2–8-bit precision
• Includes built-in dataset templates, training monitors, and model export options
GitHub repo →

3. DeepSpeed
• Built for large-scale distributed fine-tuning with ZeRO and FSDP
• Optimized for multi-GPU and multi-node training with advanced memory management
• Trusted in production environments for scalable LLM training
GitHub repo →

4. Axolotl
• YAML-based setup for fine-tuning, LoRA/QLoRA, DPO, GRPO, and multimodal workflows
• Includes kernel optimizations for memory-efficient training
• Actively maintained with support for Hugging Face, model export, and inference
GitHub repo →
AI agents trained on a fixed benchmark eventually just memorize it!

Once an agent learns to pass a fixed set of test scenarios, the benchmark stops teaching it anything new. It's also nothing like the real, messy, unpredictable conditions agents actually operate in.

Patronus AI's Generative Simulators take a different approach. Instead of generating just the task, they co-generate three things together every time:

1. The task itself - what the agent actually needs to do.
2. The world dynamics - how the simulated environment behaves and reacts to each action the agent takes.
3. The reward function - how that specific run gets scored as success or failure.

The reason all three matter together: if only the task changes but the reward function stays fixed, the agent can still find shortcuts that game the scoring. By generating the task, the environment's behavior, and the grading criteria as one consistent unit, the simulator can keep creating new scenarios and still know how to evaluate them correctly. That's what keeps the environment from going stale.

The reported results: 30-40% model lift on long-horizon tasks, a corpus of 1M+ world data artifacts, 85% UI/UX feature parity with real products in their simulated environments, and 5K+ expert contributors across industries validating these simulations.

Key capabilities:
• Generative Simulators co-generate task, world dynamics, and reward function together
• Percival detects 20+ failure modes across agentic traces
• Lynx for hallucination detection, GLIDER as a 3B parameter judge model
• 30-40% performance lift on long-horizon tasks
• 1M+ world data artifacts across domains

I've shared the link in the replies!
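To make the co-generation idea concrete, here is a toy Python sketch. It is not Patronus AI's implementation: `Scenario`, `generate_scenario`, `run_episode`, and the counter world are all hypothetical names made up for illustration. What it tries to show is that the task, the transition rule, and the reward check are derived from one seed, so the grader always matches the task it was generated with.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    task: str                          # what the agent must do
    step: Callable[[int, str], int]    # world dynamics: (state, action) -> next state
    reward: Callable[[int], float]     # scoring rule for this specific scenario


def generate_scenario(seed: int) -> Scenario:
    """Derive task, dynamics, and reward from one seed so they stay consistent."""
    rng = random.Random(seed)
    stride = rng.randint(1, 3)
    hops = rng.randint(2, 8)
    target = stride * hops             # always reachable in this toy world

    def step(state: int, action: str) -> int:
        # Hypothetical toy world: "inc" raises the counter by `stride`, "dec" lowers it.
        if action == "inc":
            return state + stride
        if action == "dec":
            return state - stride
        return state

    def reward(final_state: int) -> float:
        # Generated together with the task, so it checks this run's own target.
        return 1.0 if final_state == target else 0.0

    task = f"Starting from 0, reach exactly {target} using 'inc' and 'dec'."
    return Scenario(task, step, reward)


def run_episode(scenario: Scenario, actions: List[str]) -> float:
    """Apply the actions to the scenario's world and score the final state."""
    state = 0
    for action in actions:
        state = scenario.step(state, action)
    return scenario.reward(state)
```

Because every seed yields a fresh task with a matching grader, an agent cannot pass by memorizing one fixed target; it has to read the task each time.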
Today, we’re excited to announce our $50M Series B, led by @GreenfieldVC, with participation from @lightspeedvp and @notablecap. 🚀 At Patronus AI, we develop simulations and evals to train and improve AI. The first phase of AI was built on static benchmarks, but that era is over. As agents are used to solve longer and longer tasks, they need to practice in dynamic, living worlds to get better. Simulations are the critical infrastructure powering this next phase. As a company, we’re behind the most influential research and products in AI evaluation, like FinanceBench, Lynx, and Percival. And things have moved at the speed of light since.⚡ We partner with the world's leading frontier AI labs and enterprises, and our revenue has grown more than 15x over the past year. Additionally, today, we’re introducing a preview of the first Digital World Model for AI agent training and simulation: Patronus-DWM. Digital World Models are language diffusion world models that predict realistic environment behaviors and steer agent actions across digital workflows. Just as physical world models predict how objects move through space, we’re developing the equivalent for the digital world: predicting how agents act in digital workflows, then using that to scale the creation of high-quality training data for LLMs. Digital World Models help us push the frontier of ultra long horizon workflows, and unlock a new class of self-improving RL environments. This is our scalable approach to simulating all of the world’s intelligence. The round was also joined by @datadoghq, @SamsungVentures, @gokulr, @factorialcap, and a large cohort of amazing AI leaders across @AnthropicAI, @OpenAI, @GoogleDeepMind, @nvidia, @Recursive_SI, and more.✨ It has been the ride of a lifetime. But we’re just getting started. The best is yet to come. "Do not go gentle into that good night, Rage, rage against the dying of the light" - Dylan Thomas (1954)
Turn Claude Code into a document processing agent!

Traditional OCR extracts text but loses critical information. Table structures with merged cells disappear. Relationships between charts and captions break. Multi-column reading order gets scrambled. That's why most document pipelines need manual templates per document type, and break the moment a vendor changes their invoice format.

Agentic Document Extraction (ADE) takes a different approach. It's vision-first, understanding layout the way a person reading the page would. It handles complex tables, dense forms, multi-column pages, and scanned documents.

LandingAI has now released the ADE skills for AI coding agents. Instead of calling the API directly, your agent writes Python scripts that parse, extract, classify, and chain these steps into full pipelines. Every extracted value comes with bounding boxes, page coordinates, and confidence scores traceable back to the source document.

Two skills make up the system:

1. Document-extraction - parsing into structured Markdown, extracting fields with JSON schemas or Pydantic models, splitting and classifying multi-document batches.
2. Document-workflows - batch processing in parallel, classify-then-extract pipelines, RAG preparation with chunking and embeddings, exporting to DataFrames or Snowflake, building Streamlit UIs.

Once installed, you describe what you need in plain English. Ask your agent to extract line items from a folder of invoices, pull every figure from a scientific paper as PNGs, or read account statements across pages into a single CSV.

Key capabilities:
• Parses 20+ file formats with layout-aware structured output
• Vision-first model, no templates required
• Bounding boxes, page coordinates, and confidence scores per extraction
• Classify-then-extract pipelines for mixed document batches
• Works with Claude Code, Cursor, Roo Code, or any Agent Skills-compatible agent

I've shared the link in the replies!
Build a Large Language Model from scratch!

This repository contains the code examples for developing, pretraining, and finetuning an LLM from scratch. It is the official codebase for the book Build a Large Language Model (From Scratch).

Notebook examples are included for each chapter:
Chapter 1: Understanding Large Language Models
Chapter 2: Working with Text Data
Chapter 3: Coding Attention Mechanisms
Chapter 4: Implementing a GPT Model from Scratch
Chapter 5: Pretraining on Unlabeled Data
Chapter 6: Finetuning for Text Classification
Chapter 7: Finetuning to Follow Instructions

Link to the repo in the comments!
Firecrawl launched a Forward Deployed Agent for web data!

Prometheus is an experimental agent that turns plain-English data requests into working scraper code. You describe what you want, it writes a TypeScript collector using the Firecrawl SDK, runs it against the live site to verify it actually works, then hands you the script along with the sample data it produced.

Most scraping tools hand you raw data and you write the code to get it. Prometheus flips that. You get the code itself, verified and ready to keep, version, or modify.

Three operations make up the system:
• Build is the one-shot version: prompt in, verified code and a data sample out.
• Script is what you get when you save a build: a versioned collector that self-heals when the target site changes.
• Deployment is how a script actually runs: on a cron schedule, on-demand as an API endpoint, or both.

The self-healing part is the interesting bit. When a scheduled run fails because the site changed, Prometheus re-invokes the agent to repair or rebuild the collector and appends the corrected version. Every deployment tracking that script picks up the fix automatically.

Four ways to reach it, all speaking the same API contract:
• HTTP API for any language.
• A CLI for shells and code-writing agents.
• MCP tools for MCP clients.
• An installable Agent Skill so a coding agent reaches for Prometheus on its own.

Key capabilities:
• Plain-English request to verified, working TypeScript collector
• Runs the script before returning it, confirming it actually works
• Self-healing collectors that repair themselves when sites change
• Scheduled or on-demand deployments via API
• Four interfaces: HTTP API, CLI, MCP, and Agent Skill
• Connects directly to your existing Firecrawl account

I've shared the link in the replies!
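The self-healing behavior described above can be sketched as a small control loop. This is a hedged illustration in Python, not Firecrawl's API and not the TypeScript that Prometheus actually generates: `Script`, `run_with_self_heal`, and the `repair` callback are hypothetical names, with `repair` standing in for re-invoking the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Collector = Callable[[str], list]   # takes page text, returns extracted rows


@dataclass
class Script:
    """A versioned collector; the newest version is the one that runs."""
    versions: List[Collector] = field(default_factory=list)

    @property
    def latest(self) -> Collector:
        return self.versions[-1]


def run_with_self_heal(
    script: Script,
    page: str,
    repair: Callable[[str, Exception], Collector],
) -> list:
    """Run the latest collector; on failure, get a repaired version and keep it."""
    try:
        return script.latest(page)
    except Exception as exc:
        fixed = repair(page, exc)       # stand-in for re-invoking the agent
        rows = fixed(page)              # verify the repaired collector before saving it
        script.versions.append(fixed)   # appended, so later runs pick up the fix
        return rows
```

Appending rather than overwriting keeps the old versions around, which matches the post's description of a versioned script that deployments track.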
Introducing Prometheus, an experimental Forward Deployed Agent for web data. Describe the web data you need and it writes Firecrawl code to collect it. Run it yourself or let us host and automatically maintain it as pages change. Try it with Claude Fable 5 for free this week!
Pytest for AI Agents! (100% open-source and runs locally)

Building agents with LangChain means chaining LLMs, tools, and retrieval steps together. Each component can fail differently. The output changes with every run. Traditional unit tests don't work here because there's no deterministic value to assert against.

DeepEval's LangChain integration brings Pytest to this problem. You write test files the same way you write any Pytest test. Loop through your evaluation dataset, run your agent, assert against LLM metrics. Same workflow you already know.

The tracing works through a CallbackHandler you pass directly to your LangChain agent. It captures the full execution trace - inputs, outputs, tool calls, LLM spans - and maps them to test cases automatically.

Testing works at two levels. End-to-end testing evaluates the whole agent on task completion. Component-level testing attaches metrics to individual LLMs and tools within your chain, so you know exactly which component failed when a test breaks.

Plugs into CI/CD with a single command. Add it to your GitHub Actions workflow and every push triggers your agent test suite before anything ships.

Key capabilities:
• Native Pytest integration with parametrize and assert_test
• LangChain CallbackHandler for automatic trace capture
• End-to-end and component-level evaluation
• Metrics: TaskCompletion, AnswerRelevancy, Hallucination, and more
• Parallel test execution across multiple processes
• CI/CD integration via GitHub Actions
• Results dashboard on Confident AI

100% open source. Runs entirely on your machine.

I've shared the link in the replies!
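The core pattern here, asserting on a metric score instead of an exact output, can be shown without any library. The sketch below is plain Python and deliberately does not use DeepEval's API: `keyword_coverage` is a toy stand-in for an LLM-based metric, and `evaluate_agent` is a hypothetical helper name.

```python
from typing import Callable, List, Tuple


def keyword_coverage(output: str, expected_keywords: List[str]) -> float:
    """Toy metric: fraction of expected keywords found in the output (0.0 to 1.0)."""
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def evaluate_agent(
    agent: Callable[[str], str],
    dataset: List[Tuple[str, List[str]]],
    threshold: float = 0.7,
) -> List[str]:
    """Run the agent over the dataset; return the inputs that scored below threshold."""
    failures = []
    for prompt, keywords in dataset:
        score = keyword_coverage(agent(prompt), keywords)
        if score < threshold:           # non-deterministic output, so compare a score
            failures.append(prompt)
    return failures
```

In a Pytest file, a test function would simply end with `assert evaluate_agent(my_agent, dataset) == []`, so any case falling under the threshold fails the run and shows up in CI.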
Hands on AI Engineering!

I open-sourced a collection of 50+ hands-on AI engineering tutorials. It features step-by-step projects and tutorials on:
• AI Agents and Multi-agents
• RAG (Agentic, Vision, and Local)
• MCP AI Agents
• OCR Apps
• Voice AI Agents
• & so much more

100% free and open source. 1k+ GitHub stars.

I've shared the link in the comments!
Microsoft just turned SKILL.md into a trainable object!

SkillOpt is a text-space optimizer for agent skills. Instead of hand-writing or one-shot generating your SKILL.md, SkillOpt treats the skill document as the trainable external state of a frozen agent and optimizes it through a feedback loop.

The core idea: a separate optimizer model analyzes agent rollout trajectories, proposes bounded add/delete/replace edits to the skill document, and accepts only edits that strictly improve performance on a held-out validation split. Rejected edits go into a buffer as negative feedback for future iterations.

The deep learning analogy is intentional:
• Rollout batch is your training data.
• Edit budget is your learning rate.
• Validation gate is your validation set.
• Rejected-edit buffer is your negative feedback signal.

The optimizer runs offline. The deployed artifact is just a static SKILL.md file.

Results on GPT-5.5 across 6 benchmarks: +23.5 points average over no-skill baseline in direct chat, +24.8 inside Codex, +19.1 inside Claude Code. SpreadsheetBench jumped from 41.8 to 80.7. OfficeQA from 33.1 to 72.1. Best or tied-best on 52 of 52 evaluated cells.

What's striking: these gains come from just 1-4 accepted edits. The final skill stays compact at 300-2000 tokens. One accepted edit gave OfficeQA a +39 point gain.

Optimized skills also transfer. A SpreadsheetBench skill trained in Codex transferred to Claude Code with a +59.7 point gain. Skills trained on GPT-5.4 improved every smaller GPT variant tested.

Key capabilities:
• Text-space skill optimization with no model weight updates
• Bounded add/delete/replace edits with validation gating
• Rejected-edit buffer as negative feedback
• Epoch-wise slow/meta update for longer-horizon learning
• Works across Claude Code, Codex, and direct chat harnesses
• Optimized skills transfer across models, harnesses, and benchmarks

100% Open Source

I've shared the link to the paper and repo in the comments!
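A minimal sketch of the validation-gated edit loop, under stated assumptions: this is not SkillOpt's code, the skill is modeled as a list of lines, only add and delete edits are shown, and `propose` and `score_val` are hypothetical callbacks standing in for the optimizer model and the held-out validation run.

```python
from typing import Callable, List, Tuple

Edit = Tuple[str, str]   # (operation, line): ("add", line) or ("delete", line)


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one bounded edit applied."""
    op, text = edit
    if op == "add":
        return skill + [text]
    if op == "delete":
        return [line for line in skill if line != text]
    return skill


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], Edit],
    score_val: Callable[[List[str]], float],
    steps: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Accept an edit only if it strictly improves the validation score."""
    best_score = score_val(skill)
    rejected: List[Edit] = []
    for _ in range(steps):
        edit = propose(skill, rejected)       # the proposer sees past rejections
        candidate = apply_edit(skill, edit)
        cand_score = score_val(candidate)
        if cand_score > best_score:           # validation gate: strict improvement
            skill, best_score = candidate, cand_score
        else:
            rejected.append(edit)             # kept as negative feedback
    return skill, best_score, rejected
```

The strict `>` is what keeps the skill compact: an edit that merely ties the current score is rejected, so the document only grows when validation performance does.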
Microsoft just released SkillOpt. Train agent skills like neural networks: in text space, without touching model weights. Best or tied-best in 52/52 settings across 6 benchmarks and 7 models.
Self Improving AI (SIA) beats Karpathy's autoresearcher agent by improving itself! SIA is a Self Improving AI framework to autonomously improve the performance of any AI system (Model / Agent) on a benchmark task. Most agent frameworks are static. Fixed harness, fixed model weights, fixed memory layer. They plan, act, and use tools. SIA operates on a different layer entirely. SIA focuses on one problem: how do you design structured feedback loops that allow an agent to evaluate its own performance, adapt its strategy, and get better over time? After every run, SIA evaluates itself and improves three things. It updates its own harness. Updates the weights of its underlying model. Updates its own memory layer to handle new complexities. The agent rewrites itself based on what it learned. On MLE-Bench, OpenAI's benchmark for evaluating an agent's ability to train ML models, SIA climbed to the top of the leaderboard. Beat every specialized ML research agent including MLEvolve and AIRA-dojo. Then kept improving and displaced its own previous versions on the leaderboard. I've shared the link to the paper and the repo in the replies!
Superintelligence will be built on Self Improvement. Today @hexoai, we’re excited to release ‘SIA’ - an open-source Self-Improving AI, to achieve any goal through recursive self improvement. While trying to solve a problem, SIA doesn't just improve its abilities by updating its harness, it updates its own weights as well.
Run your personal AI company with a team of AI agents!

Alook is an open-source collaboration platform for AI coding agents. Self-hosted and local-first.

The setup: Define an org structure. Give each agent a role - dev, ops, research, whatever you need. Set reporting lines. Alook gives each agent an email address.

How it works: Assign a task to the right agent. They take it from there. Agents coordinate through email - passing deliverables, asking questions, updating status. You see everything in your inbox but you're not routing anything manually.

Runs as an always-on daemon. Close your laptop, agents keep working. Come back to finished tasks.

Shared memory across all agents. Every agent knows what every other agent worked on. You never re-explain context.

After each task completes, Alook logs what worked and builds SOPs. The whole team gets sharper over time.

Works with Claude Code, Codex, and OpenCode. Mix and match or run multiple agents from one runtime.

Built-in Kanban for task tracking. Calendar for scheduling. Email for all communication. Agents pick up tasks autonomously, update their own calendars, close issues when done.

Chat or email with agents like any AI tool. Install the runtime once, runs in the background. No terminal needed after setup.

Key capabilities:
• Email-based agent coordination with real inboxes
• Org structure with roles and reporting lines
• Shared memory and self-learning SOPs
• Always-on daemon for 24/7 operation
• Works with Claude Code, Codex, OpenCode
• Built-in Kanban, calendar, and email
• Self-hosted and local-first

100% open source.

I've shared the GitHub repo in the replies!
Turn any document into structured data for AI agents!

Firecrawl just released a new parse endpoint. Upload local files or non-public documents and get back clean, LLM-ready data.

The parse endpoint converts PDF, DOCX, XLSX, HTML, and other formats into Markdown, JSON, or structured output. Reading order and tables are preserved.

Upload a file via multipart/form-data. The endpoint processes it using a Rust-based engine (up to 5x faster) and returns your chosen format.

Key capabilities:
• Multiple output formats: Markdown, JSON, HTML, summaries, extracted links, or metadata
• Preserves document structure, reading order, and tables
• Extracts metadata automatically (title, description, language)
• Zero data retention option (document not logged or stored)
• Content filtering via includeTags and excludeTags

Built for AI agent pipelines that need clean document data at scale.

I've shared the link in the comments!
Stop guessing which models fit in your VRAM!

llmfit is a CLI tool that auto-detects your hardware and ranks 206 models by what actually runs on your system.

You download a 70B model and hope it fits. Or you estimate memory requirements across quantization levels and still end up with models that crash or run too slow. llmfit changes that. It detects your CPU, RAM, GPU, and VRAM, then scores every model in its database against your hardware.

Instead of assuming one quantization level, it tries the best quality that fits. Starts with Q8_0, walks down to Q2_K if needed. If nothing fits at full context, it tries half context. You get the highest quality model that actually works.

Each model gets scored on Quality, Speed, Context, and Capability. The weights shift based on what you're doing. Chat models prioritize speed, reasoning models prioritize quality.

Run it as an interactive TUI to browse models, use CLI mode for a quick table, or get JSON output for scripts. There's a REST API for cluster schedulers.

You can also run it in reverse. Give it a model you want to run and target performance, it tells you what hardware you need.

The real value: you see ranked options before downloading anything. No more burning bandwidth on 50GB models that won't run.

It's 100% open source. Link to llmfit in comments!
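The quantization walk-down can be sketched in a few lines of Python. This is an illustration of the search order described in the post, not llmfit's code: the bits-per-weight figures are rough approximations, the KV-cache constant is an assumption picked for the example, and `estimate_gb` and `best_fit` are hypothetical names.

```python
from typing import Optional, Tuple

# Approximate bits per weight for common GGUF quantization levels, best quality first.
# Rough illustrative figures only, not llmfit's internal table.
QUANT_BITS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 2.6),
]


def estimate_gb(params_b: float, bits: float, context: int,
                kv_gb_per_1k_tokens: float = 0.125) -> float:
    """Weights plus a crude KV-cache term; the KV constant is an assumption."""
    weights_gb = params_b * bits / 8             # billions of params * bytes per param
    kv_gb = kv_gb_per_1k_tokens * context / 1000
    return weights_gb + kv_gb


def best_fit(params_b: float, vram_gb: float, context: int) -> Optional[Tuple[str, int]]:
    """Return the highest-quality (quantization, context) pair that fits, or None."""
    for ctx in (context, context // 2):          # full context first, then half
        for name, bits in QUANT_BITS:            # walk down from Q8_0 to Q2_K
            if estimate_gb(params_b, bits, ctx) <= vram_gb:
                return name, ctx
    return None
```

For example, `best_fit(7, 8.0, 8192)` skips Q8_0 (about 8.5 GB under these assumptions) and settles on Q6_K at full context, while a 70B model on an 8 GB card yields `None`.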
Open-source framework for building real-time voice AI agents!

Pipecat is a Python framework for orchestrating audio, video, AI services, transports, and conversation pipelines. Voice-first architecture with pluggable components.

What you can build: voice assistants, AI companions, multimodal interfaces, interactive storytelling, business agents (customer support, intake), and complex dialog systems.

The framework handles speech recognition, text-to-speech, conversation logic, and real-time interaction. WebRTC and WebSocket transport built in. Ultra-low latency for natural conversations.

Why Pipecat:
• Voice-first: Integrates STT, TTS, and conversation handling in one framework
• Pluggable: Supports multiple AI service providers for each capability
• Composable pipelines: Build complex behavior from modular components
• Real-time: Low-latency interaction with streaming audio/video

Supported services:
• Speech-to-Text: Deepgram, AssemblyAI, OpenAI Whisper, Groq, Azure, AWS, Google, and more
• LLMs: OpenAI, Anthropic, Gemini, Groq, Mistral, Ollama, AWS, Azure, and more
• Text-to-Speech: OpenAI, ElevenLabs, Deepgram, Cartesia, Azure, AWS, Google, and more
• Speech-to-Speech: OpenAI Realtime, Gemini Multimodal Live, AWS Nova Sonic, Ultravox, Grok Voice Agent

10.3k+ stars on GitHub.

I've shared the link to the repo in the comments!
Lightning-fast Multilingual TTS that runs entirely on your device!

Supertonic is a lightning-fast, on-device multilingual text-to-speech system designed for local inference with minimal overhead.

The model runs via ONNX Runtime with 66M parameters. Generates speech up to 167x faster than real-time on consumer hardware. Complete privacy, zero network dependency, all processing happens locally.

Supports 31 languages including English, Korean, Spanish, Portuguese, French, German, Japanese, Chinese, Arabic, Dutch, and more.

Natural text handling without pre-processing. Directly processes numbers, dates, currency, abbreviations, and complex expressions.

Performance on M4 Pro CPU: 1263 characters per second for long text, real-time factor of 0.012. WebGPU mode reaches 2509 characters per second. RTX 4090 hits 12,164 characters per second.

Natural text handling works on financial expressions ("$5.2M" pronounced correctly as "five point two million dollars"), time and dates ("4:45 PM on Wed, Apr 3, 2024"), phone numbers with extensions, and technical units with abbreviations. All without phonetic annotations or text normalization.

Voice Builder lets you turn your voice into a deployable TTS model with permanent ownership and edge-native deployment.

Key capabilities:
• Ultra-lightweight (66M parameters)
• On-device inference with zero network latency
• Natural text handling without pre-processing
• 31-language multilingual support
• Cross-platform via ONNX Runtime
• Up to 167x faster than real-time
• Complete privacy - all local processing
• Custom voice creation with Voice Builder
• Expression tags for natural human nuance

It's 100% open source.

I've shared the link in the replies!
Claude Cowork just got 10x more powerful!

Glean benchmarked centralized vs federated MCP in Claude Cowork. Same harness, same model, same queries, different context layer.

The federated approach: Each data source (Gmail, Slack, Drive, Salesforce) has its own MCP server. Claude calls each one separately. That's 5-10 tool calls per query. Each source returns results with different quality and ranking. Claude over-fetches to compensate for weak search. Then it filters and synthesizes everything with LLM reasoning. Often needs retry loops when results miss. Burns 50-80k tokens per query.

The centralized approach: All data from every source gets indexed into one unified layer. Knowledge graph connects entities across sources. Claude makes one MCP call. Gets back the top ranked results. No over-fetching, minimal filtering needed. Uses 42-44k tokens consistently.

The results: Centralized indexing preferred 2.5x more often. Federated consumed 30% more tokens on average. When federated finally got correct answers, it burned 83k tokens vs 43k for centralized. The gap widened as tasks got more complex. Simple tasks: centralized won 66% of the time. Complex tasks: 73%.

Why centralized wins: Over-fetching doesn't just cost tokens. It dilutes the context window with noise and contradictory information. Models have finite attention. Cramming 50-100 items hoping the right ones are in there doesn't work as well as getting the right 5-10 upfront. Federated search also loses cross-application signals. Things like document relationships, who authored what, and how content is used across the enterprise. These signals improve ranking but they only exist when data is indexed together in one layer.

The compounding problem: In multi-step tasks, each missed or incorrect retrieval compounds. By the time you reach the final output, you're working with flawed data. More tool calls and reasoning loops don't fix this. They just burn more tokens trying to recover.

You can't brute-force around bad search. More tool calls, more data fetching, more reasoning loops don't fix poor context quality. They just burn more tokens.

Why this matters: Token costs are surging. Reasoning models cost more. Companies are burning through AI budgets faster. Federated search compounds the problem. Better search architecture beats more compute.

I've shared the link in the replies!
Stop testing and rewriting prompts manually!

Most teams run evals, look at failures, guess what's wrong, rewrite the prompt, then repeat. It's slow and you never know if your rewrite actually fixes the root issue.

The better way is evolutionary optimization. Instead of manual rewrites, you use genetic algorithms to analyze eval feedback and rewrite prompts automatically. The algorithm maintains diverse prompt candidates that excel at different problem types, not just one "best" version.

DeepEval does this using GEPA (Genetic-Pareto). You provide a prompt template, test cases, and metrics to optimize for. The optimizer handles the rest.

Here's how it works: It splits your test cases into validation and feedback sets. The validation set scores every prompt fairly. The feedback set provides training signals for mutations. Then it starts evolving. It selects a parent prompt, runs it on a minibatch of test cases, collects metric feedback on what failed, and uses an LLM to rewrite the prompt addressing those issues. If the rewritten prompt scores better, it gets added to the candidate pool. After several iterations, it returns the highest-scoring prompt.

Key capabilities:
• Works with 50+ built-in metrics - answer relevancy, hallucination, bias, task completion, and more.
• Supports multi-objective optimization - optimize for multiple metrics simultaneously without forcing tradeoffs.
• Configurable iterations and minibatch sizes - control search thoroughness and compute cost.

The best part? It's 100% open source. Link to DeepEval in the comments!
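The loop described above can be sketched as a toy in plain Python. This is not DeepEval's implementation or API: a prompt is modeled as a set of instruction snippets, a test case is simply the snippet it needs in order to pass, and `mutate` is a hypothetical stand-in for the LLM rewrite step.

```python
import random
from typing import FrozenSet, List

Prompt = FrozenSet[str]   # toy prompt: a set of instruction snippets
Case = str                # toy test case: the snippet it needs in order to pass


def score(prompt: Prompt, case: Case) -> float:
    return 1.0 if case in prompt else 0.0


def pareto_front(pool: List[Prompt], val: List[Case]) -> List[Prompt]:
    """Keep each candidate that is best or tied-best on at least one validation case."""
    front: List[Prompt] = []
    for cand in pool:
        wins = any(score(cand, c) == max(score(p, c) for p in pool) for c in val)
        if wins and cand not in front:
            front.append(cand)
    return front


def mutate(parent: Prompt, failures: List[Case]) -> Prompt:
    """Stand-in for the LLM rewrite: address one failure from the feedback."""
    return parent | {failures[0]} if failures else parent


def optimize(seed: Prompt, feedback: List[Case], val: List[Case],
             iterations: int, batch_size: int, rng: random.Random) -> Prompt:
    def total(p: Prompt) -> float:
        return sum(score(p, c) for c in val)      # fair score on the validation set

    pool = [seed]
    for _ in range(iterations):
        parent = rng.choice(pareto_front(pool, val))
        batch = rng.sample(feedback, batch_size)  # minibatch from the feedback set
        failures = [c for c in batch if score(parent, c) == 0.0]
        child = mutate(parent, failures)
        if total(child) > total(parent):          # keep only children that score better
            pool.append(child)
    return max(pool, key=total)
```

Selecting parents from the Pareto front rather than from the single top scorer is what preserves candidates that are strong on different cases, which is the diversity the post refers to.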