NanoChat – The best ChatGPT that $100 can buy

(github.com)

673 points | by huseyinkeles 6 hours ago

32 comments

tehnub 51 minutes ago
Interesting exchange on the use of AI coding tools:
```
    curious how much did you write the code by hand of it?

    Karpathy: Good question, it's basically entirely hand-written (with tab autocomplete). I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution.
```
https://x.com/karpathy/status/1977758204139331904
- gyomu 15 minutes ago
  > the repo is too far off the data distribution
  ah, this explains why these models have been useless to me this whole time. everything i do is just too far off the data distribution!
- dude250711 4 minutes ago
  How convenient! You know, my code is somewhat far off the data distribution too.
- oblio 30 minutes ago
  We're still not ready for ouroboros.
montebicyclelo 1 hour ago
> nanochat is also inspired by modded-nanoGPT
Nice synergy here, the lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> NanoChat
modded-nanoGPT [1] is a great project, well worth checking out; it's all about massively speeding up the training of a small GPT model.
Notably, it uses the author's Muon optimizer [2] rather than AdamW for the linear layers.
[1] https://github.com/KellerJordan/modded-nanogpt
[2] https://kellerjordan.github.io/posts/muon/
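For anyone wondering what Muon does differently: as far as I understand the write-up in [2], the core step is to approximately orthogonalize the momentum matrix with a Newton-Schulz iteration before applying it. Below is a minimal pure-Python sketch of that idea only. It uses the classic cubic iteration for clarity, not Muon's tuned quintic coefficients, and `newton_schulz`, `matmul` and the 2x2 `grad` are hypothetical names and data for illustration.

```python
# Sketch of Newton-Schulz orthogonalization, the idea behind Muon's update.
# Assumption: classic cubic iteration, not the tuned quintic that Muon uses.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=10):
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

grad = [[3.0, 1.0], [1.0, 2.0]]  # hypothetical symmetric positive-definite "gradient"
ortho = newton_schulz(grad)
print(ortho)
```

For a symmetric positive-definite input like this one, the nearest orthogonal matrix is the identity, which makes the result easy to eyeball.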
- varunneal 1 hour ago
  Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training
  - tbalsam 1 hour ago
    This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).
    Both share equal credit I feel (also, his co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.
    (Source: am experienced speedrunner who's been in these circles for a decent amount of time)
  - swyx 24 minutes ago
    sharing some useful resources for learning Muon (since I'm also just catching up on it)
    - https://x.com/leloykun/status/1846842883967692926
    - https://www.yacinemahdid.com/p/muon-optimizer-explained-to-a...
- echelon 1 hour ago
  8xH100 is pretty wild for a single inference node.
  Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?
  At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests. About $0.01 per request.
  Is my math wrong?
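  Spelling the arithmetic out as a quick sketch, assuming requests are served strictly one at a time. The $8/hr and 5 second figures are the ones above; the `concurrency` parameter and the batch of 16 are hypothetical, added only to show how the number moves if requests overlap.

```python
HOURLY_COST_USD = 8.0       # hourly price quoted above
SECONDS_PER_REQUEST = 5.0   # assumed latency quoted above

def requests_per_hour(seconds_per_request, concurrency=1):
    """Requests completed in one hour if `concurrency` requests run at once."""
    return 3600.0 / seconds_per_request * concurrency

def cost_per_request(hourly_cost, seconds_per_request, concurrency=1):
    return hourly_cost / requests_per_hour(seconds_per_request, concurrency)

sequential = cost_per_request(HOURLY_COST_USD, SECONDS_PER_REQUEST)
# Hypothetical: 16 requests processed in parallel at the same latency.
batched = cost_per_request(HOURLY_COST_USD, SECONDS_PER_REQUEST, concurrency=16)

print(f"{requests_per_hour(SECONDS_PER_REQUEST):.0f} requests/hour")
print(f"${sequential:.4f} per request sequentially, ${batched:.4f} with 16 in parallel")
```

  So the sequential estimate is 720 requests an hour at roughly a cent each, and it drops in proportion to however many requests can share the hardware.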
  - vessenes 1 hour ago
    This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.
  - Tepix 45 minutes ago
    As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.
sammyd56 2 hours ago
I'm doing a training run right now (started 20min ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij
Will share the resulting model once ready (4 hours from now) for anyone to test inference.
- Lerc 1 hour ago
  The comment beside the first chart
  >Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
  is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on tinystories. I might go back and have a look at how well my tokenizer compared to how well I imagined it compared.
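  A minimal sketch of the metric as I read the quote: sum the per-token loss in nats, convert to bits, and divide by the total bytes those tokens cover. `bits_per_byte` and the toy numbers are hypothetical, and this is not claimed to be the repo's exact code.

```python
import math

def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Total cross-entropy in bits divided by total bytes of the target text."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

# Two hypothetical tokenizers, each paying exactly 1 bit (ln 2 nats) per token.
losses = [math.log(2), math.log(2)]
coarse = bits_per_byte(losses, [1, 3])  # tokens cover 4 bytes in total
fine = bits_per_byte(losses, [1, 1])    # tokens cover 2 bytes in total

print(coarse, fine)
```

  Both tokenizers have the same per-token loss, but the one whose tokens cover more bytes scores better, which is the difference plain cross-entropy hides.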
  - typpilol 5 minutes ago
    Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?
- royosherove 2 hours ago
  Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
  - sammyd56 2 hours ago
    There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
    - royosherove 18 minutes ago
      Ah, I was missing the WANDB_RUN env var, so I did not get any logs. Thanks!
sieve 2 hours ago
Nice! His Shakespeare generator was one of the first projects I tried after ollama. The goal was to understand what LLMs were about.
I have been on an LLM binge this last week or so trying to build a from-scratch training and inference system with two back ends:
- CPU (backed by JAX)
- GPU (backed by wgpu-py). This is critical for me as I am unwilling to deal with the nonsense that is rocm/pytorch. Vulkan works for me. That is what I use with llama-cpp.
I got both back ends working last week, but the GPU back end was buggy. So the week has been about fixing bugs, refactoring the WGSL code, making things more efficient.
I am using LLMs extensively in this process and they have been a revelation. Use a nice refactoring prompt and they are able to fix things one by one resulting in something fully functional and type-checked by astral ty.
- danielmarkbruce 1 hour ago
  Unwilling to deal with pytorch? You couldn't possibly hobble yourself any more if you tried.
  - sieve 1 hour ago
    If you want to train/sample large models, then use what the rest of the industry uses.
    My use case is different. I want something that I can run quickly on one GPU without worrying about whether it is supported or not.
    I am interested in convenience, not in squeezing out the last bit of performance from a card.
    - danielmarkbruce 1 minute ago
      You wildly misunderstand pytorch.
faxmeyourcode 3 hours ago
This weekend I just cracked into nanoGPT (https://github.com/karpathy/nanoGPT), an older but fabulous learning exercise where you build and train a crappy shakespeare GPT with ~0.8M parameters on a cpu. Results are about what you'd expect from that, they suck, but you can start to feel the magic, especially if you're not a deep learning professional and you just want to poke around and hack on it.
I started writing up a blog post on my weekend with nanoGPT but it's not done yet... Would have been great to link to here lol oh well
- ACCount37 3 hours ago
  It's a useful exercise. A lot of the good ML work is first validated at small scale.
  And this new example goes even further - adds instruction following and tool use SFT, as well as RLVR. Makes for a more useful baseline.
- andrewljohnson 3 hours ago
  the shakespeare code tuned a little with different training data does a good job of generating Magic The Gathering commander decks
  - SeanAnderson 2 hours ago
    would love more details on this. this is exactly the type of project I'd like to dabble in to get more up to speed.
  - dmarcos 3 hours ago
    I like the idea of specific-purpose toy models. How did you tune the code and what dataset did you use?
swyx 4 hours ago
> Thank you to chief LLM whisperer Alec Radford for advice/guidance.
oh man an Alec x Andrej podcast would BREAK THE INTERNET... just saying... going from glory days of GPT1 to now building GPT3? in 4 hours
- codybontecou 4 hours ago
  Please oh please. This would be perfect.
flakiness 5 hours ago
Eureka Labs: https://github.com/EurekaLabsAI
What a prolific person Andrej is. It's been more than amazing to follow along!
dabockster 41 minutes ago
The title is extremely misleading - you have to rent time on an H100 cluster to get it to work. It is not on-device, and thus not truly $100.
I was really excited, too, until I looked through the readme files and the code.
- arkmm 12 minutes ago
  What's misleading about that? You rent $100 of time on an H100 to train the model.
- simonw 37 minutes ago
  It's about training a model from scratch for $100.
kragen 1 hour ago
This is really inspiring! Does anyone have some example of how well or poorly it performs on some example prompts?
CountGeek 3 hours ago
So could I in practice train it on all my psychology books, materials, reports, case studies and research papers and then run it on demand on a 1xH100 node - https://getdeploying.com/reference/cloud-gpu/nvidia-h100 whenever I have a specialised question?
- leokeba 3 hours ago
  You could do that indeed, but the performance would be abysmal. For this kind of use-case, it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
  - dmix 1 hour ago
    > it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
    I noticed NewRelic has a chat feature that does this sort of thing, it's scoped very narrowly down to their website and analytics DSL language, and generates charts/data from their db. I've always wondered how they did that (specifically in terms of set up the training/RAG + guardrails). It's super useful.
    - simonw 1 hour ago
      You might be able to figure that out just by asking it - see if you can get it to spit out a copy of the system prompt or tell you what tools it has access to.
      The most likely way of building that would be to equip it with a "search_docs" tool that lets it look up relevant information for your query. No need to train an extra model at all if you do that.
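      A toy sketch of what such a tool could look like, assuming a small in-memory corpus and plain keyword overlap for ranking. `DOCS`, `tokenize` and this `search_docs` signature are all hypothetical; a real deployment would more likely sit on a proper search index, and the model would call the tool and read the returned passages.

```python
import re

# Hypothetical documentation corpus keyed by page name.
DOCS = {
    "charts": "Use the chart builder to plot query results as a line or bar chart.",
    "alerts": "Alert policies notify you when a metric crosses a threshold.",
    "queries": "The query language supports filtering, grouping and sorting.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_docs(query, top_k=2):
    """Return up to top_k page names ranked by how many query words they share."""
    query_words = tokenize(query)
    scored = []
    for name, text in DOCS.items():
        overlap = len(query_words & tokenize(text))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

results = search_docs("how do I plot a bar chart")
print(results)
```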
- gojomo 3 hours ago
  Yes, though it's possible a more-general core model, further enhanced with some other ways to bring those texts-of-interest into the working context, might perform better.
  Those other ways to integrate the texts might be some form of RAG or other ideas like Apple's recent 'hierarchical memories' (https://arxiv.org/abs/2510.02375).
- zipy124 3 hours ago
  You could but it would be significantly worse than fine-tuning or RAG with a pre-trained model, or using a smaller model since your dataset would be so small.
- alganet 3 hours ago
  No.
karimf 6 hours ago
I've always thought about the best way to contribute to humanity: number of people you help x how much you help them. I think what Karpathy is doing is one of the highest leverage ways to achieve that.
Our current world is built on top of open source projects. This is possible because there are a lot of free resources to learn to code so anyone from anywhere in the world can learn and make a great piece of software.
I just hope the same will happen with the AI/LLM wave.
- bkettle 3 hours ago
  This free tradition in software is I think one of the things that I love so much, but I don't see how it can continue with LLMs due to the extremely high training costs and the powerful hardware required for inference. It just seems like writing software will necessarily require paying rent to the LLM hosts to keep up. I guess it's possible that we'll figure out a way to do local inference in a way that is accessible to everyone in the way that most other modern software tools are, but the high training costs make that seem unlikely to me.
  I also worry that as we rely on LLMs more and more, we will stop producing the kind of tutorials and other content aimed at beginners that makes it so easy to pick up programming the manual way.
  - levocardia 2 hours ago
    There's a Stephen Boyd quote that's something like "if your optimization problem is too computationally expensive, just go on vacation to Greece for a few weeks and by the time you get back, computers might be fast enough to solve it." With LLMs there's sort of an equivalent situation with cost: how mind-blowing would it have been to train this kind of LLM at all even just 4 years ago? And today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
    There's also a reasonable way to "leapfrog" the training cost with a pre-trained model. So if you were doing nanochat as a learning exercise and had no money, the idea would be to code it up, run one or two very slow gradient descent iterations on your slow machine to make sure it is working, then download a pre-trained version from someone who could spare the compute.
    - dingnuts 1 hour ago
      > today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
      No, it's extremely hard to imagine since I used one of Karpathy's own models to have a basic chat bot like six years ago. Yes, it spoke nonsense; so did my GPT-2 fine tune four years ago and so does this.
      And so does ChatGPT
      Improvement is linear at best. I still think it's actually a log curve and GPT3 was the peak of the "fun" part of the curve. The only evidence I've seen otherwise is bullshit benchmarks, "agents" that increase performance 2x by increasing token usage 100x, and excited salesmen proclaiming the imminence of AGI
      - simonw 1 hour ago
        Apparently 800 million weekly users are finding ChatGPT useful in its present state.
        - infinitezest 34 minutes ago
          1. According to who? OpenAI? 2. Its current state is "basically free and containing no ads". I don't think this will remain true given that, as far as I know, the product is very much not making money.
          - simonw 29 minutes ago
            Yes, that number is according to OpenAI. They released that 800m number at DevDay last week.
            The most recent leaked annualized revenue rate was $12bn/year. They're spending a lot more than that but convincing customers to hand over $12bn is still a very strong indicator of demand. https://www.theinformation.com/articles/openai-hits-12-billi...
  - DennisP 22 minutes ago
    Maybe this isn't possible for LLMs yet, but open source versions of AlphaZero have been trained on peer-to-peer networks.
    https://zero.sjeng.org/
    https://katagotraining.org/
  - hodgesrm 3 hours ago
    This. It looks like one of the keys to maintaining open source is to ensure OSS developers have access to capable models. In the best of worlds, LLM vendors would recognize that open source software is the commons that feeds their models and ensure it flourishes.
    In the real world...
- Lerc 42 minutes ago
  (This is a bit ranty, but due to a sincere desire for a better world, and being the recipient of personal attacks for believing a better world is achievable by a different path to others)
  I feel like this point of view is an ideal not shared by one of the main branches of anti-AI sentiment.
  The idea of intellectual property works against this. Rather than contributing to humanity directly, ownership of information is accumulated by individuals and then rented to humanity.
  At the same time I agree that people should be able to have a livelihood that affords them the ability to create new intellectual contributions.
  The service Karpathy is providing is also being provided by thousands of YouTube creators in a huge variety of topics. It's a little sad that so many must support their efforts with sponsorships from sources with varying degrees of ethical behaviour. Patreon is better but still not ideal. I sincerely believe this _is_ one of the best ways to contribute to society.
  A recent Daily Show had Jon Stewart describe training AI as strip mining human knowledge. Training AI is regularly described as theft as if this position is a given without any counter argument possible. It is opinion masquerading as fact. This saddens me because it suggests to me that the war to control the narrative is being won by people who want to entrench a hypercapitalistic vision of ownership where not only is a particular expression of an idea ownable but also stakes a claim to own some of any ideas that come from viewing that expression.
  I cannot see any way that this viewpoint would aid humanity as a whole, but instead assign benefits to a collection of individuals. The ability to trade intellectual property means that ownership inevitably gets passed to a smaller and smaller pool of individuals over time.
  I think we really do need a new way to consider these issues in light of the modern world. When mentioning these thoughts to others a common refrain is that it doesn't matter because the powers that be (and their lobbyists) will prevent any fix from happening. I have never been fond of that particular fatalism, especially when it inhibits discussion of what would be better.
  - oblio 22 minutes ago
    Awesome approach.
    I'm all for abolishing IP if all AIs are owned communally. I.e. ideally they're utilities or flat out coops like some Spanish businesses.
    https://en.wikipedia.org/wiki/Mondragon_Corporation
    Consum (supermarket).
    They don't get to use everything communally and then capitalist their way forward.
- viccis 3 hours ago
  I recommend his ANN/LLM from scratch videos to people a lot because not only is he a clear instructor, but his code tends to be very Pythonic and just the right balance of terse but readable (not counting the Pytorch vectorization stuff, but that's not his fault, it's just complex). So I think people benefit just from watching and imitating his code style.
- epolanski 2 hours ago
  Then a single person who's learned those skills decides to poison all of us thanks to the skills acquired.
- carlcortright 2 hours ago
  strong +1 - developers like him are heroes
- shafyy 3 hours ago
  If it only were so easy
- contingencies 2 hours ago
  While documenting a build path is nice, IMHO renting hardware nobody can afford from VC-backed cloud providers using cold hard cash to produce clones of legacy tech using toy datasets under the guise of education is propping up the AI bubble and primarily helping institutional shareholders in those AI bubble companies, particularly their hardware supplier NVidia. Personally I do not see this as helping people or humanity.
  This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc.
  - simonw 2 hours ago
    "This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc."
    This is a pretty disheartening way to respond to something like this. Someone puts a great deal of effort into giving something interesting away for free, and is told "you should have also done THIS work for free as well in order for me to value your contribution".
    - contingencies 2 hours ago
      It is an objective and transparent response based on free software world norms. Feel free to interpret differently and to be disheartened. Hell, many of us are disheartened by the AI VC political theater we are seeing right now: experienced programmers, artists, lawyers, perhaps much of humanity. Let's stick to objective elements of the discussion, not emotional opine.
  - CamperBob2 2 hours ago
    If you can't afford $100 or learn how to train it locally with more time and less money, then this isn't something you should be focusing on at all.
    - contingencies 2 hours ago
      It is amusing to note the dichotomy between the clearly compassionate, empathetic and altruistic perspective displayed here and the comically overstated framing of helping humanity.
      - CamperBob2 4 minutes ago
        (Shrug) Other sites beckon.
  - jstummbillig 2 hours ago
    I think you got your proportions slightly wrong there. This will be contributing as much to an AI bubble as a kid tinkering around with combustion is contributing to global warming.
    - contingencies 2 hours ago
      Not really. Anything that guy does sets the tone for an extended cacophony of fans and followers. It would be a sad day when nobody critically assesses the motivations, effects and framing of those moves. I question the claim this move helps humanity and stand by the assessment it's just more feeding an unfree ecosystem which equates to propping up the bubble.
- martin-t 3 hours ago
  As noble as the goal sounds, I think it's wrong.
  Software is just a tool. Much like a hammer, a knife, or ammonium nitrate, it can be used for both good or bad.
  I say this as someone who has spent almost 15 years writing software in my free time and publishing it as open source: building software and allowing anyone to use it does not automatically make other people's lives better.
  A lot of my work has been used for bad purposes or what some people would consider bad purposes - cheating on tests, cheating in games, accessing personal information without permission, and in one case my work contributed to someone's doxxing. That's because as soon as you publish it, you lose control over it.
  But at least with open source software, every person can use it to the same extent so if the majority of people are good, the result is likely to be more positive than negative.
  With what is called AI today, only the largest corporations can afford to train the models which means they are controlled by people who have entirely different incentives from the general working population and many of whom have quite obvious antisocial personality traits.
  At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
  I don't have high hopes for AI to be a force for good and teaching people how toy models work, as fun as it is, is not gonna change it.
  - simonw 2 hours ago
    "With what is called AI today, only the largest corporations can afford to train the models"
    I take it you're very positive about Andrej's new project which allows anyone to train a model for a few hundred dollars which is comparable to the state-of-the-art from just 5 years ago then.
  - oliveiracwb 2 hours ago
    I would genuinely love to think otherwise. But I've seen and grown up seeing good things being used in stupid ways (not necessarily for malice)
  - isaacremuant 2 hours ago
    > At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
    It already works like this in your precious western democracies and they didn't need AI to be authoritarian total surveillance states in spirit, with quite a lot of support from a propagandized populace that begged for or pretended to agree with the infringement of their civil rights because of terrorism, drugs, covid or protecting the poor poor children.
    You can combat tech with legislation and culture but the legislation and culture were way beyond the tech in being extremely authoritarian in the first place.
- croes 3 hours ago
  I'm afraid the technology will do more damage because many people will abuse it for fake news and misinformation.
  - IntrepidPig 3 hours ago
    Yeah it feels similar to inventing the nuke. Or it’s even more insidious because the harmful effects of the tech are not nearly as obvious or immediate as the good effects, so less restraint is applied. But also, similar to the nuke, once the knowledge on how to do it is out there, someone’s going to use it, which obligates everyone else to use it to keep up.
- Yizahi 2 hours ago
  I would adjust your formula to:
  number of people you help x how much you help them x number of people you harm x how much you harm them
  For example - harming a little bit all content creators of the world, by stealing their work without compensation or permission. How much does that cost globally every year after year? How do we even quantify long term consequences of that? Stuff like that.
daft_pink 6 hours ago
Wow, how do we sign up for the Eurekalabs course and how much does it cost?
- karpathy 4 hours ago
  Still under development, remaining work includes tuning nanochat (current state being solid v0.1) and finalizing the in-between projects so that students can "unlock" all complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops heavy aspects.
  - BrokenCogs 3 hours ago
    What's the pricing for the course/EurekaLabs? P.s. thanks for all you're doing
- huseyinkeles 6 hours ago
  Karpathy says nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.
  I guess it’s still a work in progress? Couldn’t find any other information elsewhere.
  - Schiphol 5 hours ago
    A bit more info here: https://github.com/karpathy/LLM101n
TheAceOfHearts 4 hours ago
Here's the announcement post [0] from Karpathy, which provides a bit of additional context.
[0] https://x.com/karpathy/status/1977755427569111362
- dang 4 hours ago
  Thanks - we'll put that in the toptext as well
wyldfire 1 hour ago
I would love to take an existing open-weight model and fine-tune it with specific training data along these lines. Can I do that with Qwen or GLM? Is there a ~simple recipe for doing that?
mhitza 4 hours ago
Should be "that you can train for $100"
Curious to try it someday on a set of specialized documents. Though as I understand it, the cost of running this is whatever GPU you can rent with 80GB of VRAM, which kind of leaves hobbyists and students out, unless some cloud is donating GPU compute capacity.
- Onavo 4 hours ago
  A GPU with 80GB VRAM costs around $1-3 USD an hour on commodity clouds (i.e. the non-Big 3 bare metal providers e.g. https://getdeploying.com/reference/cloud-gpu/nvidia-h100). I think it's accessible to most middle class users in first world countries.
  - antinomicus 3 hours ago
    Isn’t the whole point to run your model locally?
    - theptip 3 hours ago
      No, that’s clearly not a goal of this project.
      This is a learning tool. If you want a local model you are almost certainly better off using something trained on far more compute. (Deepseek, Qwen, etc)
    - yorwba 3 hours ago
      The 80 GB are for training with a batch size of 32 times 2048 tokens each. Since the model has only about 560M parameters, you could probably run it on CPU, if a bit slow.
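      Rough numbers behind that, as a sketch. The parameter count, batch size and sequence length are the figures above; `weight_gib` is a hypothetical helper that counts weights only and ignores activations, gradients and optimizer state, which is where most of the training memory goes.

```python
PARAMS = 560_000_000     # "about 560M parameters"
DEVICE_BATCH_SIZE = 32   # sequences per device
SEQ_LEN = 2048           # tokens per sequence

def weight_gib(num_params, bytes_per_param):
    """Memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

tokens_per_device_step = DEVICE_BATCH_SIZE * SEQ_LEN
bf16_gib = weight_gib(PARAMS, 2)  # 2 bytes per parameter
fp32_gib = weight_gib(PARAMS, 4)  # 4 bytes per parameter

print(f"{tokens_per_device_step} tokens per device step")
print(f"weights: {bf16_gib:.2f} GiB in bf16, {fp32_gib:.2f} GiB in fp32")
```

      At one to two GiB of weights, inference fits comfortably in ordinary system RAM.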
    - simonw 2 hours ago
      You can run a model locally on much less expensive hardware. It's training that requires the really big GPUs.
    - jsight 2 hours ago
      I'd guess that this will output faster than the average reader can read, even while using only CPU inferencing on a modern-ish CPU.
      The param count is small enough that even cheap (<$500) GPUs would work too.
- portaouflop 4 hours ago
  If I have let’s say 40gb RAM does it not work at all or just take twice as long to train?
  - typpilol 4 hours ago
    Won't work at all. Or if it does it'll be so slow since it'll have to go to the disk for every single calculation so it won't ever finish.
    - karpathy 2 hours ago
      It will work great with 40GB GPU, probably a bit less than twice slower. These are micro models of a few B param at most and fit easily during both training and inference.
samus 2 hours ago
Andrej Karpathy slays again by spreading knowledge about this important subject to the people!
RobGR 1 hour ago
This is an LLM trained using a $100 budget to RENT access to graphics cards. It's not about what you could do BUYING hardware for $100.
- danielmarkbruce 1 hour ago
  Nowhere does he suggest he is buying hardware.
oblio 18 minutes ago
I wonder, if something like this were trained on Wikipedia, could it become a reliable local Wikipedia search engine, basically?
- simonw 15 minutes ago
  I don't think so. Training on documents is not a great way of building a search engine for the information in those documents, because the training process mixes all of that information together in ways that detach the individual words from the source documents they came from.
  As usual, if you want an LLM to be able to help search a corpus of text the best way to achieve that is to teach it how to use a search tool against that text.
sbassi 2 hours ago
Which data does it use for training?
- simonw 2 hours ago
  karpathy/fineweb-edu-100b-shuffle: https://huggingface.co/datasets/karpathy/fineweb-edu-100b-sh...
  Which is derived from HuggingFaceFW/fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
  HuggingFaceTB/smol-smoltalk: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk
  And extra fine-tuning on portions of:
  cais/mmlu: https://huggingface.co/datasets/cais/mmlu
  openai/gsm8k: https://huggingface.co/datasets/openai/gsm8k
  allenai/ai2_arc: https://huggingface.co/datasets/allenai/ai2_arc
- eranation 2 hours ago
  I think he mentioned somewhere he used fineweb (I assume this one https://huggingface.co/datasets/HuggingFaceFW/fineweb)
tdhz77 1 hour ago
These are the type of community posts that are legendary.
Havoc 4 hours ago
>If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
That sounds like it could run on a 24gb GPU. Batch size of 8 would imply 20gb mem, no?
...presumably just takes forever
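A sketch of why it "takes forever" rather than failing: if the total batch per optimizer step is held fixed, a smaller --device_batch_size just means more forward/backward passes accumulated per step. `TOTAL_BATCH_TOKENS`, `WORLD_SIZE` and `grad_accum_steps` are my assumptions for illustration (the default of 32 on 8 GPUs with no accumulation), not values or functions taken from the repo.

```python
SEQ_LEN = 2048    # tokens per sequence, as discussed in this thread
WORLD_SIZE = 8    # assumed: one 8xH100 node
# Assumed target: what the default device batch of 32 yields with no accumulation.
TOTAL_BATCH_TOKENS = 32 * SEQ_LEN * WORLD_SIZE

def grad_accum_steps(device_batch_size, seq_len=SEQ_LEN,
                     world_size=WORLD_SIZE, total=TOTAL_BATCH_TOKENS):
    """Micro-steps to accumulate so each optimizer step sees `total` tokens."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of the micro-batch")
    return total // tokens_per_micro_step

for batch in (32, 16, 8, 4, 2, 1):
    print(batch, grad_accum_steps(batch))

# A single 24GB card at device batch 8 would need 32 micro-steps per update.
single_gpu = grad_accum_steps(8, world_size=1)
print(single_gpu)
```

Memory is unlikely to scale down quite linearly with the batch size, since weights and optimizer state are a fixed cost, so the 20GB guess is best treated as a lower bound.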
- zipy124 3 hours ago
  Yes, you can always stream data when training or doing inference on models when VRAM is lacking, but the slowdown is extremely noticeable. This is the case for CPU code too, and is why optimising for bandwidth is so critical in high-performance computing. Your ability to compute is almost always substantially larger than your bandwidth. An AVX-512 capable CPU with a suitable number of cores is easily capable of multiple teraflops of fp64 throughput, but is typically limited by memory bandwidth. GPUs with LLMs have just broadened this knowledge to more people.
  A fun consequence of CPUs getting faster more quickly than memory: lookup tables of pre-computed values used to be a common optimisation, but now it is almost always quicker to re-compute a value than to retrieve a pre-computed one from memory for common use-cases.
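  The usual back-of-envelope for this is the roofline model: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. A sketch with made-up hardware numbers (`PEAK` and `BANDWIDTH` are hypothetical, not measurements of any real chip):

```python
def attainable_flops(peak_flops, mem_bandwidth_bytes, flops_per_byte):
    """Roofline estimate: capped by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, mem_bandwidth_bytes * flops_per_byte)

PEAK = 2.0e12       # hypothetical CPU: 2 TFLOP/s of fp64
BANDWIDTH = 1.0e11  # hypothetical memory bandwidth: 100 GB/s

# Elementwise c[i] = a[i] + b[i] in fp64: 1 FLOP per 24 bytes (two loads, one store).
streaming = attainable_flops(PEAK, BANDWIDTH, 1 / 24)

# Intensity needed before the kernel stops being bandwidth-bound.
ridge = PEAK / BANDWIDTH

print(f"streaming add: {streaming / 1e9:.1f} GFLOP/s of a {PEAK / 1e12:.0f} TFLOP/s peak")
print(f"compute-bound only above {ridge:.0f} FLOPs per byte")
```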
lebimas 2 hours ago
I see Karpathy, I click
dinkblam 2 hours ago
from their promotional material:
>> Why is the sky blue?
> The sky is blue due to an optical illusion called the Rayleigh Scattering
Rayleigh Scattering is not an illusion but an effect.
> […] particles are made up of tiny blue and violet particles that cause the light to bend in a particular way.
ugh. no, there are no "tiny blue" particles in the sky.
- simonw 2 hours ago
  That was the point. That example is meant to demonstrate that the model that trained for 4 hours can imitate a conversation but isn't actually anywhere close to being useful.
- kragen 1 hour ago
  Where did you find that?
  - simonw 35 minutes ago
    It's in this screenshot: https://twitter.com/karpathy/status/1977755430093980034
computer23 2 hours ago
Has the word ChatGPT become generic? This has nothing to do with OpenAI's ChatGPT.
jackphilson 5 hours ago
[flagged]
- dang 4 hours ago
  I believe you that you just meant this as an interesting example, and in that sense were engaged in curious conversation (generally what we want here). But the amount of provocation in the comment is so high, and the amount of information so little, that it ends up on the wrong side of "Eschew flamebait. Avoid generic tangents." (https://news.ycombinator.com/newsguidelines.html). In other words: not gonna end well.
  We detached this subthread from https://news.ycombinator.com/item?id=45569878.
- nsriv 5 hours ago
  Controlling culture, yes but wild pivot to mention that criminal alongside Karpathy.
  - cultofmetatron 4 hours ago
    not a particularly ethical guy and I wouldn't hold him up as an example of morality but the guy hasn't actually been found guilty YET. Multiple courts have tried. You'd think that for a guy under as much scrutiny as him that they would have SOMETHING to pin him on by now.
    Innocent until PROVEN guilty is a foundational legal precedent for a reason.
    - portaouflop 4 hours ago
      He is definitely guilty of being a waste of human life, a massive asshole and a general detriment to society worldwide. Don’t need a court to prove that.
      There are 6 criminal cases against him in several countries, let’s see how they pan out - but regardless he is not an innocent person.
  - jackphilson 4 hours ago
    I mean just an example. He obviously wasn't the most ethical person. Depends how you do it
    - IOT_Apprentice 4 hours ago
      Neither are Stalin, Netanyahu, Pol Pot, Hitler, Charles Manson et al.
      Way to derail the conversation. Focus on the positive people and their legacy of time, sharing, positive energy and contributions to society
      - jackphilson 4 hours ago
        not derailing, just pointing out effective ways of producing good which is what i was responding to. i think its good for people to be aware of this. those people are all examples of people who have influenced culture for bad. you can do it for good: bryan johnson, civil rights leaders, leftist streamers. andrew tate was just the most effective, recent, and obvious one which is why I pointed him out.
cyanydeez 3 hours ago
if the AI bubble is anything to be compared to, how is $100 worth anything in GPT terms?
efficax 2 hours ago
Try ~300k for an 8xH100 lol
earthnail 2 hours ago
This is absolutely fantastic. I really can't wait for the final course to be live. It's in the "shut up and take my money" category. I had so much fun with the nanoGPT videos.