SANA-WM, a 2.6B open-source world model for 1-minute 720p video

(nvlabs.github.io)

155 points | by mjgil 4 hours ago

21 comments

mccoyb 2 hours ago
I struggle with these world models from the perspective of video games (so this post is a particular perspective).
I'm not a game developer myself, but some of my favorite games carry a deep sense of intentionality. For instance, there is typically not a single item misplaced in a FromSoftware game (or, for instance, Lies of P -- more recently). Almost every object is placed intentionally.
Games which lack this intentionality often feel dead in contrast. You run into experiences which break immersion, or pull you out of the experience that the developer is trying to convey to you.
It's difficult for me to imagine world models getting to a place where this sort of intentionality is captured. The best frontier LLMs fail to do this in writing (all the time), and even in code, and the surface of experiences for those mediums often feel "smaller" than the user interaction profile of a video game.
It's not clear how these world models could be used modularly by humans hoping to develop intentional experiences? I don't know much about their usage (LLMs are somewhat modular: they can produce text, humans can work on it, other LLMs can work on it). Is the same true for the video output here?
All this to say, I'm impressed with these world models, but similar to LLMs with writing, it's not really clear what it is that we are building towards? We are able to create less satisfying, less humane experiences faster? Perhaps the most immediate benefit is the ability for robotic systems to simulate actions (by conjuring a world, and imagining the implications).
In general, I have the feeling that we are hurtling towards a world with less intentionality behind all the things we experience. Everything becomes impersonal, more noisy, etc.
[-]
- duskdozer 45 minutes ago
  >We are able to create less satisfying, less humane experiences faster?
  Yes, exactly. Inundate the world with superficially plausible yet hollow content, including any desired themes. People who aren't very discerning won't complain; the others will be outmatched and find that 99/100 pieces are all noise and they will need to spend increasing amounts of time trying to find the 1, if they can.
  I think there are some good parallels with Amazon: the broken sorting and manipulated unit pricing, coupled with the avalanche of cheap clones pushes users to give up and just buy one of the top listed products (a featured listing/Amazon-clone). If you do a web search for various products and go to images, Amazon product links often take up 50-90% of the results.
  [-]
  - Aperocky 12 minutes ago
    I think you also can create satisfying, same human experiences faster, but not as fast.
    But the dopamine descent require strong discipline to stop there, and most don't .
- wongarsu 30 minutes ago
  One thing is robotics. Both for training robotics AI, and to let robots test hypothetical actions before comitting to them. I don't think world models are stable enough for either yet
  The other is creating multi-modal models with a better understanding of our world. LLMs often fail at incredibly basic spatial reasoning ("someone left a package in front of your apartment, describe going there", or the "should I drive to the car wash or go there", etc). World models excel at these kinds of things (in theory). They develop a great understanding of physical spaces, object interactions, etc. They can simulate fluids, rigid body physics etc. You "just" have to get really good at making world models, then somehow marry them with an LLM in a way that ensures the LLM can benefit from the world model's training data. Nobody has managed to really do that yet
  So lots of hopes for the future. Until then they get commercialized as video models, or ways to experience your favorite forest, or to have a really bad video game ... whatever can be sold on a short time horizon to finance the actual goals
- Lerc 1 hour ago
  There are two things here, firstly, Without AI, you can have heavily designed environments or you can have procedurally generated, people manage to make both work. Both can also fail because of reasons specific to the approach. Careless procedural generation can produce a poor variety or nonsensical outputs. Careless specific placement can violate any rules that a game has established creating an incoherent experience.
  Making a world internally consistent by explicit placement gets harder as you increase in scale. When internal consistency is a factor impacting quality, there is a scale at which generated content eventually becomes the higher quality solution.
  Secondly, when generating content with AI, the same rules around carelessness apply. There are certainly generative AI tools out there that offer few options when it comes to composing what you want, that is not a necessary part of AI, some of it is because people are wanting rudimentary interfaces, some of it is that the generators are sufficiently new that the control mechanisms are limited because they are focused upon doing something at all before doing it highly controlled, in some ways the problem is that things are new enough that it can be hard to describe what is desirable controllability, making the generator to see what people would like it to be able to do is, I think, a reasonable path to follow prior to creating the control that people want. Part of it is also that there _are_ tools that give a high level of control over what is generated but far fewer people get to see them. There are ways to control styles, object placement, camera motions, scene compositions, etc. The more specialised you get, the smaller the subset of people who need that specific control.
  I think AI can make things possible for people who could not have done so without them, but it's still going to take care to make something special.
- bee_rider 1 hour ago
  What does intentionality mean in the context of a world model generated game-world? I guess true human intention would have been throw out the window already at that point.
  One aspect of intentionality is that there’ll be a narrative payoff when you find something you find interesting. In videogames, the world is mostly pre-designed, so the designer has to predict what you’ll be interested in for the most part (In pen and paper RPGs, this is usually done better, because the human dungeon master/DM can plan ahead, but also improvise a payoff or modify the plot between sessions). If there was a world model generated game world, I guess the model would have to be “smart” though to setup and execute those payoffs.
  An advantage that the world model would have (and shares with a good human DM) is that everything is an interactable, and the players get to pick what they think is interesting. If everything is improv with a loose skeleton around it, you don’t have to predict as far out. I think world model generated games, if they even become a thing, will be quite a bit worse than conventionally designed ones for a long time (improv can be quite shallow!) but have a lot of potential if they work out.
  FromSoft is an interesting example. They make the game more believable by having extremely missable quests, just, most of them don’t block progress through the game, and you usually stumble across enough side quests naturally (although IMO the density was too low in Elden Ring, their system showed a bit of weakness in the less-guided context). The plot is pretty vague, but the vibes tell enough of a story that you don’t really mind. It’s sort of improv/pen-and-paper but the player’s imagination is doing the job of the DM.
- orbital-decay 59 minutes ago
  > some of my favorite games carry a deep sense of intentionality. For instance, there is typically not a single item misplaced in a FromSoftware game (or, for instance, Lies of P -- more recently). Almost every object is placed intentionally.
  That's a pretty specific and one-sided example. There are tons of good games that don't rely on elaborate item placement (e.g. many Bethesda games are great because most items are useless decorations, they broke that rule in recent games, giving the purpose to clutter, and it made them a lot worse). There are tons of good games not relying on this intentionality at all, they're either literally random cool ideas thrown at the wall, or even procedurally generated.
  [-]
  - mccoyb 48 minutes ago
    That’s a fair critique of my comments! The space of fun games is large and diverse.
- pigpop 1 hour ago
  Even though I doubt the main purpose of these models is to produce video games, I have the opposite view from you in that I am excited to see these put to work as components of procedural generation in video games. I don't think that is going to negatively impact story driven games that you seem to enjoy any more than the market for open world and simulation games currently does. They are separate concerns and use distinct techniques.
  Where you look for an intentionally evoked experience authored by a game designer, I am looking for an unexplored world unfolding before me filled with emergent and unique phenomena that perhaps no one and not even the game designer has seen before.
- jdironman 53 minutes ago
  I'm of a strong believer that AI just isn't (may never will be?) a strong judge and executor of "quality". Quality is a loaded term though. Are there any objectively good game designs? Even if there is, maybe only one game of 10 that use the same 'blue print' every reach critical mass (popularity).
- Glohrischi 1 hour ago
  Video games are not the initial motivation at all.
  These world models are key for robotic and coherence in video generation.
  Give a world model images of a factory, the robot now can simulate tasks and do the best result.
  Give a world model images/context etc. and it can generate a coherent world for video generation.
  What this world model system might be able to do for us in regards of gaming or virtual reality: Either simulate 'old' environments like the house of your grandparents (gaussian splatting but interactive) or potential new ones like a house, kitchen, remodeling.
  It can also be a very interesting easy to approach VR environment were you can start building your world with voice. That would be very intentional. After all world building is not necessarily connected to being able to generate 3d assets. Just because you need to go this route today, doesn't mean you have to do this tomorrow.
- b65e8bee43c2ed0 1 hour ago
  FromSoftware-quality games are <5% of the market. >50% of the market is abominable slop that very well might benefit from AI writing and design.
  for example, I am 100% certain that ANY model could write a better Dragon Age sequel than the rotting corpse of Bioware did, because only humans can despise their audience and their source material. an LLM would dutifully attempt to produce more of the thing rather than 're-imagine' the thing for 'the modern audience'.
  [-]
  - embedding-shape 1 hour ago
    So? Parent still want to know how tools like these could potentially by used in a better way. That most people don't obsess over quality when building/doing things shouldn't mean that no one should.
- robot_jesus 2 hours ago
  By and large I agree, but it doesn’t need to be either/or.
  Many of the most popular games in the past decade are procedurally generated and have nothing “intentionally” placed (apart from tuning/tweaking the balance of the seeding algorithms).
  [-]
  - swiftcoder 1 hour ago
    > have nothing “intentionally” placed (apart from tuning/tweaking the balance of the seeding algorithms).
    I think you underestimate the intentionality that goes into developing procedural generation. Something like Dwarf Fortress isn't "place objects randomly" - it is layers upon layers of carefully crafted systems that build upon each other to produce specific patterns of outcome
    [-]
    - robot_jesus 1 hour ago
      By calling it out in my comment, I was trying to not underestimate it.
      I guess what I'm saying is: Couldn't a world model with targeted training and thoughtfully tuned system prompts be directionally similar to the layered systems to produce specific patterns of outcome?
  - mccoyb 2 hours ago
    Right, and I wondered how these world models might be use in a careful way (just as agents can be used carefully to accelerate work).
    Are video game developers using these systems in their workflows? Would love to learn more!
  - danielbln 1 hour ago
    Which game would that be apart from Minecraft?
    [-]
    - fc417fc802 51 minutes ago
      Dwarf fortress, no man's sky, elite dangerous, ...
      The combination of "many", "most popular", and "nothing" is overstating it by a wide margin but for example the majority of the vegetation in games as far back as oblivion was procedurally placed.
    - robot_jesus 1 hour ago
      No Man's Sky, Terraria, Dead Cells, to name a few.
- unfitted2545 1 hour ago
  If we use world models to train AI systems, are we not essentially forcing something to live so it can gather data for us?
  Yes, we haven't gone that far with creating consciousness yet, but there is gonna be a lot of money around neural computing devices for consumers in the coming decades, so that will speed up knowing what sense data you need for consciousness.
  [-]
  - ctoth 42 minutes ago
    > are we not essentially forcing something to live so it can gather data for us?
    Wait until you learn about what we do to chickens.
jubilanti 55 minutes ago
Model weights coming "soon" == currently vaporware. So the weights aren't even open, how can this be "open-source"?
Everyone is right to be skeptical of this coming from a 2.8B model. Weights or it didn't happen.
resist_futility 3 minutes ago
warning: viewing the videos that auto play on that page shot up my downloads to 350Mbps on that page
PyWoody 14 minutes ago
I tried watching the cave video and I was immediately overcome with nausea. I've never experienced anything like that before in my life. Wild.
I can't say I'm looking forward to an AI video future.
mejutoco 2 hours ago
They all look like video games. I guess Unreal Engine is used to create synthetic data for training.
[-]
ionwake 18 minutes ago
i survived flash, jquery, svn, soap, xml, microservices and crypto now some norwegian teenager is generating netflix-quality worlds during lunch break from a jpeg of a forest
EDIT> dont ask how I came up with this quote
Incipient 2 hours ago
Outputting video of that quality/consistency at 1 minute, for a 2.6B model seems insane?
Fischgericht 3 hours ago
So, where is the download? I can't find it on Github, and on your web page the download button is disabled.
Also, will this run on RTX 4090 with 24GB memory?
Thank you!
[-]
- mjgil 3 hours ago
  Scroll down and there are more videos --- seems like models will be there "soon".
joenot443 2 hours ago
What’s the long term utility of world models?
There’s no doubt they’re technically impressive, but what does one do with it?
[-]
- modeless 19 minutes ago
  World models will be how general purpose robots finally work. They are essentially learned simulators of the world. They will replace traditional robotics simulators which are not flexible enough to enable training of general robotics policies. Robot control policies will be trained and evaluated in learned simulators, and the policies themselves will be world models in order to predict the consequences of their own actions and thus enable planning. Simulated data will scale much better than expensive real-world robot data, and will allow robot policies to reach LLM-level dataset sizes, and subsequently, LLM-level performance. It is inevitable that learned simulators will replace hand-coded simulators, as it is a straightforward application of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
  By enabling general purpose robotics, world models will be one of the most useful inventions of all time. This particular world model is likely trained mostly or only on video game data and won't be useful for such things. We'll need world models trained on real world data before they are truly useful for robotics. But for examples of what I'm talking about in current research, check:
  Dreamer 4: https://danijar.com/project/dreamer4/
  Tesla's world model: https://www.youtube.com/watch?v=LFh9GAzHg1c
  Waymo's world model: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...
- fancyfredbot 1 hour ago
  The world model is useful for planning. It can "anticipate" consequences of actions. This can be used for a kind of tree search to decide on optimal actions in robotics
  [-]
- ACCount37 2 hours ago
  They can be base models for a bunch of things. Turning text-conditioned video generation models into robotics VLAs is a fun exercise.
  This one is probably too small to be useful for that, and not diverse enough? But I could be wrong.
- iinnPP 1 hour ago
  I believe the idea is to offer simulation of ideas to test out new tasks AND something like dreaming.
- Leonard_of_Q 2 hours ago
  Games. Build campaigns in hours instead of months. Make it possible for users to create their own campaigns, move the action to different game worlds - 'gimme Mario Kart in the ${favourite_game} world', etc.
  [-]
  - AshleyGrant 1 hour ago
    Yeah, but is this really that great? Are these models going to remember the town you wandered through on your session yesterday and want to return to?
    Imagine playing Read Dead Redemption 2 and you attempt to ride your horse from Saint Denis to Valentine and Valentine no longer exists, or is a completely different town located half a mile off from where it was originally.
    I just don't see how this would work...
    [-]
    - pigpop 30 minutes ago
      If I had to use the models as they exist right now I'd use them in a procedural Myst-like where I incorporate the temporal inconsistency into the setting. The player's actions and state would affect the prompts used for conditioning the video generation. It would probably be weird and buggy but could be fun.
      You could also use these models to generate assets for a game during development whether that's simple cutscenes or assets produced through gaussian splatting or some other process.
      If these models and others can be run cost effectively on a cloud service or even locally at some point then you could do some interesting things by combining them with 3D mesh generation, img2img, vid2vid, etc. just think about even simple games like Papers Please and the whole genre it spawned that uses short episodes where you have to make a guess based on what you see, there's a lot of potential for creating new mechanics around generative imagery.
    - hackinthebochs 53 minutes ago
      It's not hard to imagine a system that combines deterministic state tracking with diffusion generated scenes.
    - dyauspitr 1 hour ago
      Yes, a lot of models don’t state this explicitly, but they can be made deterministic. Not the generation itself, but the same prompt, with a generation seed will always result in the same output.
- whynotmaybe 2 hours ago
  It's a step towards something else?
- bix6 2 hours ago
  Digital twin?
- esafak 1 hour ago
  Put them in a robot so that it can navigate the physical world like humans. Self-driving cars.
pferdone 3 hours ago
First video with the guy walking the mountain in snow has consistency issues with the cave entrance. Which is "expected" at this model size?!
[-]
- andai 2 hours ago
  My dreams have it too, which is unexpected at that model size!
- notnullorvoid 26 minutes ago
  All of the videos have rather glaring consistency issues when direction shifts back to areas previously shown.
- Leonard_of_Q 2 hours ago
  Most videos seem to have some issues like that, e.g. the book on the table in the library video takes up different shapes every now and then.
  The 'Refiner' effect seems to do the opposite if the examples are representative as in all cases the 1-st stage images look better than the 'refined' ones. Less clutter, more realistic, less 'cowbell' for those who know the phrase.
- echelon 1 hour ago
  Remember the first Will Smith spaghetti?
bobkb 3 hours ago
The trouble is the lack of training available to these models compared to the ones like Seedance and Kling who seems to be tapping into their unlimited video inventory. Many models like LTX is technically good but when it comes to slightly different camera movements or the subject interacting with objects they struggle. For a recent example we had to use sample videos generated by closed source models and then use the same for final video.
[-]
- vessenes 3 hours ago
  I tend to think of these NV Labs models as architectural demos and ‘free razor blades’ — they’re more intended to inform internal R&D, get customers something that lets them do what they want quickly, and enhance the state of the art.
  In this case, what looks interesting is the one minute coherence and the massive speedup - they claim 36x over open models with similar capabilities. You can tell they aren’t aiming for state of the art visuals — looks very SD 1.5 in terms of the output quality.
agus4nas 1 hour ago
Has anyone actually tested this for robotics simulation? Curious how it handles edge cases in physical environments.
[-]
- notnullorvoid 31 minutes ago
  Judging by the examples it wouldn't be useful for that, the environments show little physical consistency.
yieldcrv 41 minutes ago
Really great for visuals during a dj set at a festival or YouTube
CommanderData 2 hours ago
All video models are terrible at consistency. Even closed source ones.
Seedance 2.0, Kling 3 are regarded the best closed source video models we have. I have subscribed to a few AI video subreddits, consensus atm is they are good for anything but long form videos with humans.
No surprises that we're very good at spotting even the most subtle differences while looking at other people.
[-]
- adenta 29 minutes ago
  what subreddits do _you_ subscribe to?
  I've been doing some content with people at https://industrialallusions.com
  [-]
  - CommanderData 21 minutes ago
    https://www.reddit.com/r/KlingAI_Videos/
    https://www.reddit.com/r/HiggsfieldAI/
    Higgsfield have multiple models available, people use Kling usually 2.5 & 3. There are a few good examples posted right now you'll notice the subtle differences.
    I have tried to generate things myself and it's extremely hard to have more than 7-8 clips that are consistent, eventually you'll accept a compromise. I think it's why there isn't any long form content being done yet. Getting good results is sometimes just "chance" regardless of how many reference data you have.
trunkiedozer 2 hours ago
It ain’t open source until it’s released. It’s baitware.
agus4nas 1 hour ago
Increíbles resultados
sebringj 1 hour ago
i see this and think about Suno's playbook where this could go... survival of the fittest rules the boards where you have user-generated-dynamic video games, not just static ones where design is fixed, the design will be adaptive... based off several prompt input boxes for various things and adhoc while playing, higher tier design boards and the like, this is all going toward user-gen commercial / vanity / personal enjoyment.
utopiah 40 minutes ago
Nice, now instead of just reading slop you'll soon be able to experience slop Worlds, in 3D! /s
It's honestly impressive, on the surface. The visuals are gorgeous... but it's still empty. What makes a "World" a world is precisely it's coherency. It's not about how it looks but rather how it "works". The plants in an ecosystems are a certain way because of the available resources, all the way to forces like gravity. It doesn't just "look" like that. To echo Konrad Lorenz a fish doesn't just swim in the water, rather the fish IS an efficient representation of the water it lives within. Here in such "worlds" there is nothing happening. There is minimal superficial coherence, no logic, nothing.
The ultimate liminal spaces.
jaspanglia 4 hours ago
[flagged]
[-]
- carlos-menezes 2 hours ago
  Bot comment.
- rvz 4 hours ago
  Given that is where everything is going, why not just get there faster by open-sourcing Seedance 2.0, Happyhorse, Veo 3 and all the others.
mjgil 4 hours ago
[flagged]
[-]
- pferdone 3 hours ago
  Who wrote your comment?
  [-]
  - andai 2 hours ago
    At this point we should cut our losses and just give Claude an official HN account.
- semiquaver 3 hours ago
  Stop posting slop.
  [-]
  - mjgil 3 hours ago
    less security issues with slop
xyzsparetimexyz 1 hour ago
ugly slop