
John Carmack (@ID_AA_Carmack)

AGI at Keen Technologies, former CTO Oculus VR, Founder Id Software and Armadillo Aerospace
285 Following    1.6M Followers
I'm a little disappointed with myself that the high school algebra identity didn't occur to me right away.
I've been coding for 40 years. Here are the top 5 things I wish I knew when I started.

1. 90% of the job is debugging and fixing, not creating new code. Which is still fun if you're good at it. I used to think programming was mostly writing fresh, clever stuff. In reality, most of your time is spent in other people's (or your own past self's) messy code, chasing down why something that "should" work doesn't. Get really good at debugging early. Learn assembly reading, call stacks, and kernel debuggers. It pays off hugely. The best engineers I saw were absolute magicians at this.

2. Manage complexity from day one (i.e., don't write slop and "fix it later" if it goes somewhere). Very early on, I'd hammer out code and refactor afterward. Big mistake. Now I start with a clean, skeletal structure (minimalism first) and flesh it out carefully, with AI or not. Messy code compounds and becomes unfixable. Upfront discipline on architecture, naming, and simplicity saves enormous pain later, especially in large systems like Windows.

3. Tools and processes matter more than you think. We suffered with basic diff/manual deltas instead of modern source control like Git. Branching, testing, and good tooling would have made porting and collaboration way smoother. Invest in your environment, automation, and reproducible builds early. Good tools amplify your output; bad ones (or none) drag everything down.

4. Understand the problem and the existing code deeply before writing. Don't jump straight to coding. Map out the problem, study what's already there (you'll inherit a lot), and plan. Low-level knowledge (hardware quirks, alignment issues on different architectures like MIPS/Alpha) was crucial. Also: assert early and often. It forces clarity.

5. People, politics, and "the right tool for the job" beat pure tech arguments. Brilliant engineers still argue endlessly. Sometimes it's about ego, not merit. Learn to spot the difference and "steer" the conversation rather than "winning" it.

Bonus from experience: side projects like Task Manager (started at home because I wanted the tool) can become your biggest hits. Ship small, useful things often.

If you're just starting, focus on fundamentals, patterns over syntax, and building resilience for the long haul. It's going to be a wild ride, but the fundamentals still matter.
My reply to someone considering starting a video game company:

The distribution of possible rewards for starting a video game company is generally not very good today. The market is well served, and gaining a foothold requires strong execution on both business and product issues, along with a substantial amount of luck. Plan to burn through seven figures with a not-great chance of making it back.

If you do go for it, some bits of advice:

Identify your customers clearly before you start. Not just a broad community, but specific people, and imagine them as you make decisions.

Initially, build the smallest, most concise game you can imagine anyone paying for. It will still take much longer than you expect.

Once something exists, hill-climb the value. Hopefully you will have some elements that clearly bring joy to people, which you can magnify. There will inevitably be tons of things that people find confusing, frustrating, or just boring that you will need to fix.
Space launch was a clear case where there was a large difference in efficiency between what was possible and what was done in practice before SpaceX. A large part of that was due to everything being locked in to what (just barely) already worked, with huge risk aversion. With national prestige or a half-billion-dollar geosync satellite on the line, speculative engineering ideas that might result in a public debacle were not welcome.

When failure is not an option, success can stay very expensive. You need to experiment to improve, and that fundamentally means being comfortable with failure. If you know it is going to work, it isn't an experiment.

I have long believed that nuclear power today is in precisely the same state as space launch two decades ago, but the even more pressing question now is if semiconductor fabrication might also be. On the one hand, Moore's Law has been a sequence of heroic miracles of technology at the wafer fabrication level, grinding out hundreds of compounding small improvements. On the other hand, fabs are "too big to fail", and there are elements of extreme conservatism at play. Intel's "Copy exactly!" fab development exemplifies that mindset – instead of every new building being an opportunity to explore and optimize processes, it was deemed more valuable to just replicate. While each individual machine may be straining against physical limits of technology, it is possible that the systems orchestrating them all together could be far from optimal.

The explore / exploit axis is fundamental to all decision making, but human risk avoidance probably biases away from optimal exploration.
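
The explore / exploit tension shows up even in a toy setting. Here is a minimal epsilon-greedy bandit sketch (the arm probabilities and epsilon values are made up) where a pure-exploit policy locks in on the arm that "just barely works" and never finds the better one:

```python
import random

# Two-armed bandit: arm 0 "just barely works" (known, mediocre), arm 1 is
# speculative but better. Pure exploitation (epsilon = 0) sticks with
# whatever worked first; a little exploration discovers the better arm.
ARMS = [0.3, 0.7]  # success probabilities (hypothetical)

def run(epsilon, steps=10_000, seed=0):
    rng = random.Random(seed)
    wins = [1, 0]   # optimistic prior on arm 0: it "already works"
    tries = [1, 1]
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(ARMS))  # explore
        else:
            arm = max(range(len(ARMS)), key=lambda a: wins[a] / tries[a])  # exploit
        reward = 1 if rng.random() < ARMS[arm] else 0
        wins[arm] += reward
        tries[arm] += 1
        total += reward
    return total / steps

for eps in (0.0, 0.05, 0.2):
    print(f"epsilon={eps:4.2f}  mean reward ~ {run(eps):.3f}")
```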
New @BeatSaber music pack is out, and I must be one of the first to play, landing a top-10 score that will surely be out of the top 100 by tomorrow.
Some people are misreading this -- 511x511 was FASTER. It looks like at 512x512 and above it falls back to another path that requires internal cudaMalloc/cudaFree calls.
GPU library performance can be very notchy -- runtime of batched torch.linalg.solve_ex() went up by over 10x going from 511x511 matrices to 512x512.
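
A minimal timing sketch to look for the notch described here (and diagnosed in the reply above); exact thresholds and ratios will vary with GPU, CUDA, and PyTorch versions:

```python
import time
import torch

# Time batched torch.linalg.solve_ex() just below and above the 512 boundary.
def bench(n, batch=64, iters=10):
    a = torch.randn(batch, n, n, device="cuda")
    b = torch.randn(batch, n, 1, device="cuda")
    torch.linalg.solve_ex(a, b)          # warmup (also triggers any alloc path)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.linalg.solve_ex(a, b)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

for n in (510, 511, 512, 513):
    print(f"{n}x{n}: {bench(n) * 1e3:.2f} ms")
```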
I was on a cruise ship last week (Star of the Seas), and they had pods of 10 elevators in a circle, where you picked your destination floor on a pad, and it directed you to the correct elevator, which was often behind you. It seemed to work efficiently, but multiple times I saw people tap their floor and just look away, conditioned for normal elevator operation, and miss the arrival of the elevator they were supposed to get on.

Addressing my normal pet peeve of interaction feedback latency would have helped: with all the fades and slides, it takes over a second for the first hint of the elevator to show up, and two seconds for it to fully stabilize. That may not seem like much in some circumstances, but it is plenty of time for people to look away. The elevator letter should appear instantaneously, maybe with some festive animation around it to hold attention that was on the button press. Even better would be to add a localized audio cue from the elevator the instant you pressed the button, which would let you immediately know where it is without having to scan for the lighted letter.

(the Starlink internet on the ship was excellent, allowing me to get some work in at sea)
It is generally frowned upon to have LLMs precisely regurgitate part of their training set, but it is an interesting question how you could use LLM training to nearly losslessly compress a huge corpus like the entirety of the Internet Archive. The Hutter Prize is for perfect compression, but only one GB. There would be different trades at the PB level, and it gets much more interesting when it doesn’t have to be bit-accurate.
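
The standard connection here is that a language model plus arithmetic coding is a lossless compressor: the achievable size is roughly the model's total cross-entropy on the data, in bits. A rough sketch of that estimate, assuming GPT-2 from Hugging Face transformers as a stand-in model (any causal LM works):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Estimate the arithmetic-coded size of a text under a model: the sum of
# -log2 p(token | context) over the sequence. GPT-2 is just a stand-in.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "It is generally frowned upon to have LLMs precisely regurgitate..."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits
# Bits needed for tokens 1..n-1 given the preceding context (nats -> bits).
logp = torch.log_softmax(logits[0, :-1], dim=-1)
bits = float(-logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]].sum() / math.log(2))
raw_bits = len(text.encode()) * 8
print(f"{bits:.0f} bits vs {raw_bits} raw bits ({raw_bits / bits:.1f}x)")
```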
A Canticle For Leibowitz is a classic early (1959) post-apocalypse novel where an order of monks preserved the last remnants of learning (the memorabilia) after a nuclear exchange turned the remains of society into book and scientist burners.

I first read it in the 80s as a mass market paperback that I somehow lost along the way. Other paperbacks from that time are yellow with age and getting brittle, but still readable. I read it again in the late 2000s on a first edition Kindle. I eventually migrated to iPads for Kindle reading, but every couple years I would come across an old Kindle in a drawer, charge it up, and check out what I had been reading on it. They eventually stopped working entirely.

I’m just finishing reading a new Folio Society edition, printed on heavy, acid-free archival quality paper. If it doesn’t get soaked or burned, it could still be in good shape for centuries.

The ephemeral nature of digital storage does give me some pause. We can still read Sumerian tablets full of administrative trivia from four thousand years ago, but there are no known copies of some important software products from just fifty years ago. I am a proud supporter of the Internet Archive!
FLOPS was originally “floating point operations per second”, specifying a rate of work for a system: A SPARCstation 2 gave 4.2 MFLOPS. Today you also see it used as “floating point operations” for an algorithm, or an amount of work: This layer takes 8 GFLOPS.
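
A quick worked example with the two numbers above, treating FLOPS as a rate and GFLOPs as an amount of work:

```python
# Rate vs. amount: a rate (FLOPS) divides an amount of floating point
# operations by time, so amount / rate gives a runtime.
work = 8e9    # "this layer takes 8 GFLOPs" -- an amount of work
rate = 4.2e6  # SPARCstation 2: 4.2 MFLOPS -- a rate
print(f"{work / rate:.0f} seconds on a SPARCstation 2")  # ~1905 s
```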
Rhymes with @RichardSSutton’s Bitter Lesson.
A computer can do anything provided you learn to tell it how. Very recently, this has become vastly easier to do. Chalk up another victory for Carmack’s Law:
Making a scatter plot of 400_000 data points, some of the plots had odd gaps in coverage. It took me a little while to realize that it was only when the data was farther from the origin -- it was the raw bfloat16 precision. Everything looks great from -1 to 1, but as you go past 2 and 4, the coverage gaps get larger. My intuition didn't have it being quite so "discretely countable" at those modest numeric values. Float32 for comparison.
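
A small sketch of why: bfloat16 is float32 with the low 16 mantissa bits dropped, so the grid of representable values doubles its spacing at each power of two. Masking float32 bit patterns (truncation rather than round-to-nearest, but the grid is identical) makes the gaps countable:

```python
import numpy as np

# bfloat16 keeps float32's 8 exponent bits but only 7 mantissa bits, so the
# gap between adjacent values is 2^-7 in [1,2), 2^-6 in [2,4), 2^-5 in [4,8).
for lo, hi in ((1.0, 2.0), (2.0, 4.0), (4.0, 8.0)):
    grid = np.linspace(lo, hi, 1_000_001, dtype=np.float32)
    # Truncate each float32 onto the bfloat16 grid by masking its bit pattern.
    vals = np.unique((grid.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32))
    print(f"[{lo}, {hi}]: {vals.size} representable values, "
          f"spacing {(hi - lo) / (vals.size - 1):.5f}")
```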
My library donation project for the LFS found homes for twenty sets of books, so I ordered another batch:
So many judging tasks could be improved by aggregating partial orderings, and in the limit, just ordering pairs.

The annual Libertarian Futurist Society novel awards discussion is starting, and while I would like to participate on some level, there is no way I have time to read an entire slate of novels. However, I will likely read at least two from the list, and I could give a relative assessment. This cries out for the use of something like Elo ranking, as in chess competition, perhaps with some suggestions to get sufficient coverage.

Peer and out-of-chain employee performance calibrations could probably also benefit from a greater quantity of sparse pairwise comparisons.
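
A minimal sketch of the Elo-style aggregation this suggests, where each judge only contributes comparisons between the items they actually read; the novel names and K-factor are hypothetical:

```python
# Aggregate sparse pairwise judgments into a single ranking via Elo updates.
K = 32  # hypothetical K-factor

def expected(ra, rb):
    # Probability the first item "wins" given current ratings.
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def update(ratings, winner, loser):
    ea = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)
    ratings[loser] -= K * (1 - ea)

ratings = {"Novel A": 1500.0, "Novel B": 1500.0, "Novel C": 1500.0}
for winner, loser in [("Novel A", "Novel B"), ("Novel B", "Novel C"),
                      ("Novel A", "Novel C"), ("Novel A", "Novel B")]:
    update(ratings, winner, loser)

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.0f}")
```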
Without getting all the way down to performance counters, GPU power from nvidia-smi is a better indicator of true utilization than job scheduling or “gpu busy”. I would love to see animated “heat maps” of the big data centers, with each pixel being an individual GPU’s power draw. I am confident that inference and frontier training at the big labs is highly efficient, but I wonder how many GPUs would be dark due to scheduling and inefficient research code. With a little calibration for base load and peak, just the power bill for the datacenter would be a pretty good first order indicator of utilization.
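
For reference, per-GPU power draw is exposed through nvidia-smi's CSV query interface; a small polling sketch (these query fields are standard, though availability varies by driver):

```python
import subprocess

# Poll per-GPU power draw alongside the coarser "gpu busy" utilization figure.
out = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,power.draw,power.limit,utilization.gpu",
    "--format=csv,noheader,nounits",
], text=True)

for line in out.strip().splitlines():
    idx, draw, limit, util = [s.strip() for s in line.split(",")]
    print(f"GPU {idx}: {draw} W / {limit} W  ('gpu busy' {util}%)")
```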
#PaperADay 10: LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

The comments on #PaperADay 3 recommended this paper as the state-of-the-art JEPA paper, and it does look much better! They acknowledge that much of the prior JEPA research is ad-hoc and full of heuristics, but here they make strong theoretical claims of optimality and provide proofs (which I did not read).

The first claim is that an isotropic gaussian is the unique optimal embedding distribution for both linear and nonlinear probing, minimizing worst-case risk across downstream tasks. I would have taken that on faith with just a “sounds good to me”, but they go into it with details and examples.

Actually getting an isotropic gaussian in high dimensions is easier said than done. They present Sketched Isotropic Gaussian Regularization (SIGReg) as a well-behaved loss function to achieve this after analyzing a number of different statistical tests, and they claim it beats the curse of dimensionality with linear scalability.

The final loss is just a blend factor to weight the JEPA prediction loss against the SIGReg isotropy loss. This is the one tunable hyperparameter for LeJEPA (see the sketch below).

Despite the P in JEPA, they don’t use predictor networks here; they just directly compare view embeddings for the JEPA loss. Predictor networks could still be useful for video sequences, especially when conditioned with action information for agents / robots.

Each training image is augmented to produce 2 global views and 6 local views with different spatial scales but the same set of color and geometric transformations. The loss is the average MSE between the average of the global view embeddings and each of the local view embeddings. I don’t have a good feel for the tradeoffs in their view transforms, which still seem very much in the ad-hoc space, but they will determine the nature of what gets filtered out of the representation. Learning what doesn’t matter is critical, but the specification of “matters” is only implicit in the view transformations.

LeJEPA itself is architecture independent – anything that digests a batch of samples from a dataset into vectors can be used: vision transformers, MLPs, ConvNets, etc. The specific augmentations for views would be input-modality specific, but the LeJEPA algorithm could work on audio, images, video, or other things.

They show that the LeJEPA loss on a large foundation model is very indicative of downstream task performance, both directly, and with a heuristic to improve the predictive power of the loss further. They also show that it can be used to train from scratch on small datasets with as few as 1000 samples and achieve better results than probing a conventional general foundation model.

I was pleased to see sample code blocks in the paper instead of Greek-laden pseudocode, as well as a github repo. Appendix D has interesting details on generating good coverage of unit hyperspheres with low-discrepancy samples by transforming Sobol sequences, but this is only for their theoretical analysis, and they show you are better off just making new random hypervectors every batch, with even 16 random vectors outperforming a fixed set of thousands.

Some questions:

In the discussion of non-linear probing, only kNN and kernel methods are mentioned, presumably for their theoretical analysis tractability, but would an MLP generally perform better?

A JEPA embedding is not fully reversible like NICE or a RevNet, so how does it react to inputs that are far outside the training set? Will novel inputs map to unique embeddings, or could they be collapsed onto the codes from the training set?

How would the embeddings evolve in a continuous learning environment, as novel inputs are added to the training mix?

Can a JEPA be overtrained – is lower training loss always better, or would there be an optimal early stopping point?
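
To make the loss structure concrete, here is a toy sketch of the blended objective as described: the JEPA MSE between each local-view embedding and the mean global-view embedding, plus a SIGReg-style isotropy penalty on random 1D projections. The isotropy statistic below just matches first and second moments to N(0, 1); it is a stand-in for the paper's actual statistical test, not their implementation.

```python
import torch
import torch.nn.functional as F

# Toy LeJEPA-shaped loss: prediction term + lambda * isotropy term.
def lejepa_loss(global_emb, local_emb, lam=0.05, n_proj=16):
    # global_emb: (G, B, D) embeddings of G global views; local_emb: (L, B, D).
    target = global_emb.mean(dim=0)                      # (B, D) mean global view
    pred_loss = F.mse_loss(local_emb, target.expand_as(local_emb))

    # SIGReg-style term: project all embeddings onto fresh random directions
    # and push each 1D marginal toward mean 0, variance 1.
    z = torch.cat([global_emb, local_emb]).reshape(-1, global_emb.shape[-1])
    dirs = F.normalize(torch.randn(n_proj, z.shape[-1], device=z.device), dim=-1)
    proj = z @ dirs.T                                    # (N, n_proj) 1D samples
    isotropy = (proj.mean(0) ** 2).mean() + ((proj.var(0) - 1) ** 2).mean()

    return pred_loss + lam * isotropy

# Hypothetical shapes: 2 global and 6 local views, batch 32, latent 192.
g = torch.randn(2, 32, 192)
l = torch.randn(6, 32, 192)
print(float(lejepa_loss(g, l)))
```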
Paper review: LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Nice clean github:

This is the application of the LeJEPA results to world models, trained offline on experience from three different robotics-style tests with one to two million steps in each dataset. Re-states the benefits of the SIGReg loss relative to prior world model approaches.

Uses ImageNet-standard 224x224 RGB pixel input images with an unmodified ViT-Tiny vision transformer from HuggingFace to generate latents. One extra post-projection step is needed to give SIGReg the necessary freedom to perturb the latents into independent gaussians, since ViT ends with a layernorm’d layer. Also tested with ResNet-18, which still performed well, but slightly worse.

Uses a 192-dimensional latent. Performance slightly dropped when doubling the latent size to 384; it would be nice to know if it was stable there, or if it continued worsening with excessive latents. There is a relationship between batch size and SIGReg; the larger latent may have improved performance if the batch size was increased.

The predictor is implemented as a ViT-S backbone – why a vision transformer when the latent is flat? Uses a history of 3 sets of latents for two of the benchmarks and 1 for the other. Performance was markedly better with the “small” ViT model than the “tiny”, but the larger “base” model degraded notably, which is interesting. Dropout of 0.1 on the predictor significantly improved performance; 0.2 was still better than 0.0, but 0.5 was worse.

Trained with a batch of 128 x 4 trajectories. I wish their training loss graphs were more zoomed in, with grid lines.

Performs planning at test time instead of building a policy by training in imagination like Dreamer / Diamond. Rolls out 300 initially random sets of actions up to a planning horizon H of 5 (at frame-skip 5). Iterates up to 30 times using the Cross Entropy Method (CEM), as sketched below. The main paper body mentions using a Model Predictive Control (MPC) strategy, where only the first K planned actions are executed before replanning, but appendix D says they execute all 5 planned actions.

After training, they probe the latent space to demonstrate that it does capture and represent physically meaningful quantities. They also implement a decoder from the latent space back to pixels – not used by the algorithms, but helpful to see what things the latent space is actually representing. They tested incorporating the reconstruction loss into training, but it hurt performance somewhat.

They wound up with a 0.1 lambda for SIGReg, as opposed to 0.05 in the LeJEPA paper. 1024 SIGReg projections, but they observe the number has negligible impact.

I like the JEPA framework, but so far my attempts to use it on Atari games with value functions have not matched my other efforts.
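
A minimal sketch of that CEM planning loop; the predictor and value function here are toy stand-ins, not the paper's learned world model:

```python
import torch

# Cross Entropy Method over action sequences: sample a population, roll each
# sequence through the latent predictor, refit a Gaussian to the top scorers,
# and repeat. Population/horizon/iteration counts match the review above.
def cem_plan(z0, predictor, value, horizon=5, pop=300, iters=30,
             elite_frac=0.1, action_dim=4):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        actions = mean + std * torch.randn(pop, horizon, action_dim)
        z = z0.expand(pop, -1)
        score = torch.zeros(pop)
        for t in range(horizon):
            z = predictor(z, actions[:, t])   # latent rollout
            score += value(z)
        elite = actions[score.topk(n_elite).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-4
    return mean                                # planned action sequence

# Toy stand-ins so the sketch runs end to end.
predictor = lambda z, a: z + 0.01 * a.sum(-1, keepdim=True)
value = lambda z: -z.squeeze(-1).abs()
plan = cem_plan(torch.zeros(1, 1), predictor, value, action_dim=4)
print(plan.shape)  # torch.Size([5, 4])
```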
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑:
In addition to having the full high-functioning autist power set, Elon also genuinely likes being around and working with other people, which is a bit rare. The correlation between deep technical ability and anti-social hermit tendencies is real, and it limits a lot of people (ahem). @Project2501_117 had to point this out to me.
Elon’s great super power is weapons-grade autism combined with 99.9th percentile conscientiousness. Most people that conscientious are risk-averse rule followers, and most people that autistic have non-existent executive function, such that they just become anti-semitic mentats incapable of building anything.