Search for tweets and users related to GPT4

2026.05.29 09:53

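This talk from Notion's founder really is excellent. Don't miss the podcast episode in which Notion CEO Ivan Zhao speaks at Sequoia; his views are unusually insightful. I would go further: of everything from the past six months, this is the episode every founder should study closely. It cleared up a lot for me. Ivan ties together, in a very vivid way, the changes an organization goes through in the AI era.
1. Ivan introduces a very interesting concept he calls Jazz Mode. A traditional company is like a marching band: tidy formation, fixed tempo, everyone does what the conductor says. Notion wants to be a jazz band: there is a basic structure, but the emphasis is on improvisation, picking up each other's cues, and everyone taking initiative. Ivan thinks the AI era changes too fast for marching-band organizations to keep up, so in recent years Notion has deliberately hired people with high agency and high curiosity who can find their own way.
2. No company has a perfectly flat structure. Hierarchy between people always exists, and on this point he chooses to accept reality. What can actually be designed is the way people work together: organize the company like a jazz band that can improvise as an ensemble, not a military band that can only perform in neat formation.
3. Jazz mode presupposes a clear main melody: the vision, the product direction, and a handful of iron rules, analogous to the underlying harmonic structure. On top of that melody, teams have considerable room to improvise and can handle passages themselves based on user feedback and technical change, rather than escalating every decision for approval.
4. Notion now has roughly sixty employees who were once founders, a talent mix he built deliberately. He prefers to hire people who have carried full responsibility for going from 0 to 1. They are used to finding problems themselves, building the framework, and driving execution, and they are more willing to act in ambiguous territory. The company needs people who can write their own songs and also understand what others are playing, not performers who only execute tasks from sheet music.
5. On the engineering side, he has restructured Notion into a barbell. At one end are very junior engineers, fresh graduates or people early in their careers; at the other end are a small number of very senior architects and technical leads. The conventional mid-to-senior engineers in between are deliberately compressed, so the distribution looks like a barbell: heavy at both ends, thin in the middle.
6. The logic behind the barbell is simple. Young engineers are malleable; their thinking is not constrained by the previous era's large-scale engineering practices and toolchains, so they accept new development paradigms more readily while large models change quickly. Senior architects define the system-level division of labor and the abstractions: which parts are handed to the model, which parts stay rule-based, and how data and services are organized, so that the product is coherent as a system rather than a patchwork of features.
7. When most of an engineering team sits at the level of "experienced, but not top-tier", the organization easily slips into a lukewarm state: everything gets done, but nobody dares overturn old solutions and few are willing to try new things aggressively. Young people lack a global view and can only optimize locally, and if the veterans are too few, their voices are drowned out by process. Such a structure becomes very sluggish in the AI era.
8. He uses a vivid metaphor to explain building products on large models. Traditional software engineering is like building a bridge: it relies on deterministic structural analysis and mathematical derivation, and if you build to spec the outcome is highly predictable. Building products on large language models is closer to brewing beer: you keep experimenting on top of an existing recipe and process, adjusting temperature, time, and ingredient ratios, and the final standard is human taste rather than any single technical metric. User experience is the primary frame of reference.
9. He treats Notion's AI capabilities as a re-creation of the product rather than a smart button stuck onto the old interface. When he got early access to GPT4, his instinct was that the way the tool itself works had to be rethought: if you could assume from the company's first day that such a model existed, Notion's interaction structure, feature boundaries, and value proposition would be an entirely different design. That calls for a reboot mindset, not a plugin.
10. There have been two true reboots in his career. The first came at the company's lowest point, when the team had shrunk to a few core members. They holed up in a small apartment in Kyoto and asked themselves, with an almost blank-slate mindset, whether Notion should continue to exist, and if so, what had to be given up and which capabilities had to be kept however hard that was. This round was about shedding baggage and holding on to the essence.
11. The second reboot came after he got early access to GPT4. As model capability leapt, he realized his company might lose its position, or might evolve into the next generation of productivity tool; the key was whether he had the courage to admit that the old product assumptions were expiring. This reboot was more of a leap forward: the turning point that put Notion on an AI-native path, and the start of its upgrade from "documents + databases" to "a workspace that understands".
12. On when to reboot, his criteria are very practical. It is not about waiting until the financials can no longer hold up; it is about a mismatch between the organization and the times: technology and the market have clearly moved on while internal processes, product form, and talent structure are stuck in the old logic. If, at the same time, the founder clearly cannot muster interest in the company as it is, and each day feels like maintaining a machine that runs fine but has no soul, and that state persists, it is basically time to reboot.
13. When he talks about the founder's role, he puts the weight on founder energy. The founder's most important value is not only making the final call but continuously emitting a stable frequency: a standard for product taste, an intuition for what counts as a good experience, an insistence on which details cannot be compromised, and a willingness to get hands-on in the face of uncertainty. As long as that energy is present, the team knows it is following a musician with a style, not working for an abstract KPI system.
14. In hiring, he has gradually reduced the weight of the résumé. Notion's first-round interview no longer centers on the résumé; it looks at how candidates think through open-ended questions, their product intuition, and their understanding of tools and ways of working. Elite schools and big-company pedigrees easily become noise in his eyes. He cares more about whether someone can form their own view and move things forward in a real situation, not just qualify on paper.
15. On sales culture, he does not set sales against product. He wants Notion's salespeople to be like musicians who understand music: first listen to the rhythm of the song the customer is playing now, then work out where Notion as an instrument should come in, as melody, accompaniment, or interlude, rather than turning the volume to maximum from the start and chasing signed deals. What really matters is building long-term relationships and helping customers use Notion more deeply and broadly, not hitting a one-off revenue target.
16. He repeatedly thinks about organization, product, and AI in the same coordinate system. Organization design has to fit the new technical paradigm; the talent structure has to leave room for trial and error and improvisation; the product has to find a new balance between model capability and user experience; sales and commercialization carry all of this to a bigger market. The premise underneath is simple: the company is always a jazz band continually rearranging its repertoire, never an assembly line with fixed steps.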


流浪国男@zmt021

2026.05.28 00:12

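However strong a country is, if its rulers are humanities-trained dimwits they can still drive it into a ditch... It is like buying a liberal-arts major a $200 GPT Pro subscription: all they will do is cling to the chat and plead in tears, "Please don't delete it, I want to keep sexting with gpt4o."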


Andrej Karpathy@karpathy

2025.02.18 05:25

I was given early access to Grok 3 earlier today, making me, I think, one of the first few who could run a quick vibe check.
Thinking
✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not.
❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message.
❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro.
✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then I asked it to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails.
I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day...
The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.
DeepSearch
Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in articles on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went:
- ✅ "What's up with the upcoming Apple Launch? Any rumors?"
- ✅ "Why is Palantir stock surging recently?"
- ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
- ✅ "What toothpaste does Bryan Johnson use?"
- ❌ "Singles Inferno Season 4 cast where are they now?"
- ❌ "What speech to text program has Simon Willison mentioned he's using?"
❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI).
The impression I get of DeepSearch is that it's approximately around Perplexity DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM lab" when I tried with it...).
Random LLM "gotcha"s
I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.
✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
✅ Grok 3 told me 9.11 > 9.9 (common with other LLMs too), but again, turning on Thinking solves it.
✅ A few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly).
❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse.
❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training.
Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.
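
The GPT-2 training-compute estimate in the post can be reproduced in a few lines. This is only a sketch of the arithmetic as the post states it: the function name `estimate_training_flops` and its parameter names are made up for illustration, and the inputs (40 GB of ASCII text, ~4 bytes per token, ~10 epochs, 1.5B parameters, 6 FLOPs per parameter per token) are the post's own assumptions, not measured values.

```python
def estimate_training_flops(dataset_bytes, bytes_per_token, epochs,
                            params, flops_per_param_per_token=6):
    """Rough training compute: tokens seen x parameters x FLOPs per param per token.

    The default of 6 follows the post's 2 (forward) + 4 (backward) rule of thumb.
    """
    tokens_per_epoch = dataset_bytes / bytes_per_token  # ~10B tokens for 40 GB
    training_tokens = tokens_per_epoch * epochs         # ~100B tokens over ~10 epochs
    return training_tokens * params * flops_per_param_per_token


# Inputs taken from the post; all of them are rough assumptions.
gpt2_flops = estimate_training_flops(
    dataset_bytes=40e9,   # 40 GB of text, assumed ASCII (1 byte per character)
    bytes_per_token=4,    # assumed ~4 bytes per token
    epochs=10,            # assumed ~10 epochs
    params=1.5e9,         # GPT-2 at 1.5B parameters
)
print(f"{gpt2_flops:.1e} FLOPs")  # prints "9.0e+20 FLOPs", i.e. roughly 1e21
```

Changing any single assumption (for example the bytes-per-token ratio) shifts the result proportionally, which is why the post treats the answer as an order-of-magnitude figure.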


Rocky@Rocky_Bitcoin

2025.01.23 21:54

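Couldn't sleep all night. The US poured hundreds of billions into frontier AI, and China simply open-sourced a model with roughly level benchmark scores, with the core completely free! Microsoft has not even recouped the $10 billion it put into OpenAI, and China has already made it free and driven it down to cabbage prices! Hard to imagine: @deepseek_ai spent only $5.5 million training a large model, while OpenAI spent hundreds of millions! Google, Amazon, and Microsoft placed orders for tens of thousands of A100s, and now #DeepSeek# needs only one percent of the A100 cards to get the same result, and it open-sourced all of the training logic! What on earth is going on? 😯 And Nvidia's stock? I'm getting a bit nervous now. Nvidia's earnings next quarter? And all those AI companies that raise billions or tens of billions at a time and have not even gone public yet are taking this kind of hit! China takes 100 damage to deal 1,000 to the other side; how do you compete with that? Recently some Chinese AI scientists also reverse-engineered and reproduced the GPT4o training model and wrote a paper about it, which had OpenAI cursing out loud! The key point is that @deepseek_ai is a fund company, not even a full-time internet tech company. It feels like watching an unassuming sweeper monk knock out boxing champion Mike Tyson! 🫣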


Barret李靖@Barret_China

2024.08.20 12:02

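SkyReels, an AI short-drama platform, looks quite impressive. It built three large models, for scripts, storyboards, and 3D generation respectively. The script model, SkyScript, writes hit scripts and performs much better than gpt4, because the team built a high-quality dataset by hand-annotating a huge number of strong short dramas, so it judges payoff moments and hooks more precisely. The storyboard model, StoryboardGen, decomposes the composition of a frame into scene, shot, character, action, and so on; each agent generates one part and the parts are merged into a scene, which keeps the AI output consistent from shot to shot. The 3D generation model, Sky3DGen, handles the conversion from storyboard to video. Another impressive point is that its generation algorithm can generate 180s continuously, where Sora manages only 60s. Once the video is done, it can automatically fill in background music and sound effects and publish to short-video platforms with one click, closing the whole content-production loop. Really looking forward to the public beta. Also, I looked it up: the micro-short-drama market was 37.39 billion in 2023, up more than twofold from 2022, with paying users making up 31.9%, and it is expected to reach the hundred-billion scale by 2027. The AI short-drama market looks very promising.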


Andrej Karpathy@karpathy

2024.07.23 00:47

@kylebrussell Wow, this has just become my favorite LLM test. I missed that this doesn't work but it really doesn't, even for SOTA LLMs. Seems to be a bit hit and miss, e.g. with GPT4o which failed 1/3 times, Claude failed 3/3 times.


李自然 Nate Lee@nateleex

2024.07.16 20:27

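Training GPT4 on the compute of an iPhone 12 would take 60,000 years; on a 1997 Pentium processor (100 MHz, 92 million floating-point operations per second) it would take more than 66 billion years 🍉🍉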
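
The two figures in this post can be cross-checked with simple arithmetic. The sketch below only back-solves from the post's own numbers: the Pentium rate and duration give an implied total training compute, and dividing that by the 60,000-year figure gives the throughput the post is implicitly assuming for the iPhone 12. The helper names are hypothetical, a year is assumed to be 365.25 days, and none of the results are verified hardware or training figures.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length in seconds


def implied_total_flops(flops_per_second, years):
    """Total operations performed at a constant rate over the given number of years."""
    return flops_per_second * years * SECONDS_PER_YEAR


def implied_throughput(total_flops, years):
    """Constant rate (FLOP/s) needed to finish total_flops in the given number of years."""
    return total_flops / (years * SECONDS_PER_YEAR)


# Figures quoted in the post.
pentium_flops_per_second = 92e6  # 92 million floating-point operations per second
pentium_years = 66e9             # "more than 66 billion years", so this is a lower bound
iphone_years = 60_000

total = implied_total_flops(pentium_flops_per_second, pentium_years)
iphone_rate = implied_throughput(total, iphone_years)

print(f"implied training compute: {total:.2e} FLOPs")
print(f"implied iPhone 12 throughput: {iphone_rate:.2e} FLOP/s")
```

Taken at face value, the post's numbers imply a training budget of about 1.9e26 floating-point operations and an assumed iPhone 12 rate of about 1e14 operations per second. Whether either figure is realistic is a separate question that the post does not address.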


ruanyf@ruanyf

2024.07.05 02:00

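Friday software picks
- Onefetch (image 1): shows detailed statistics for a Git repository
- An English vocabulary book generated by GPT4 (image 2)
- A string hash function that produces output in Chinese licence-plate format (渝G·VGUA1)
More software: #科技爱好者周刊（第307期）#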


穆尼@MooenyChu

2024.01.12 02:57

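Last night C大 (the "benefits officer") invited me to join his Team plan, which amounts to using GPT4 for free. You can join even without a Plus subscription, which is great 😂. Year-end is very busy: I took on too many commercial (video) jobs and have no time to slack off. If you run into any tricky problems using ChatGPT, you can ask him at @Cydiar404. Thanks again.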


穆尼@MooenyChu

2023.11.10 07:39

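After the #GPT4# update, the all tools feature works again. I happen to be free today, so here is how I make my social-media avatars. See the image for the result and the ALT text for the exact prompt. Why do I say this is fun and useful? It works in both directions: you can simply ask it to generate the image you want, or you can use image-to-image with a precise prompt to get there. Most important of all, you can collect any good-looking images and rework them to fit your own needs with just a simple prompt, and that is the most fun way to put imagination and creativity to work. Is it worth the $20? gpt-4 with #Dalle3# attached is absolutely worth it.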

