Author name: Mike M.

on-dwarkesh’s-podcast-with-openai’s-john-schulman

On Dwarkesh’s Podcast with OpenAI’s John Schulman

Dwarkesh Patel recorded a Podcast with John Schulman, cofounder of OpenAI and at the time their head of current model post-training. Transcript here. John’s job at the time was to make the current AIs do what OpenAI wanted them to do. That is an important task, but one that employs techniques that their at-the-time head of alignment, Jan Leike, made clear we should not expect to work on future more capable systems. I strongly agree with Leike on that.

Then Sutskever left and Leike resigned, and John Schulman was made the new head of alignment, now charged with what superalignment efforts remain at OpenAI to give us the ability to control future AGIs and ASIs.

This gives us a golden opportunity to assess where his head is at, without him knowing he was about to step into that role.

There is no question that John Schulman is a heavyweight. He executes and ships. He knows machine learning. He knows post-training and mundane alignment.

The question is, does he think well about this new job that has been thrust upon him?

Overall I was pleasantly surprised and impressed.

In particular, I was impressed by John’s willingness to accept uncertainty and not knowing things.

He does not have a good plan for alignment, but he is far less confused about this fact than most others in similar positions.

He does not know how to best navigate the situation if AGI suddenly happened ahead of schedule in multiple places within a short time frame, but I have not ever heard a good plan for that scenario, and his speculations seem about as directionally correct and helpful as one could hope for there.

Are there answers that are cause for concern, and places where he needs to fix misconceptions as quickly as possible? Oh, hell yes.

His reactions to potential scenarios involved radically insufficient amounts of slowing down, halting and catching fire, freaking out and general understanding of the stakes.

Some of that I think was about John and others at OpenAI using a very weak definition of AGI (perhaps partly because of the Microsoft deal?) but also partly he does not seem to appreciate what it would mean to have an AI doing his job, which he says he expects in a median of five years.

His answer on instrumental convergence is worrisome, as others have pointed out. He dismisses concerns that an AI given a bounded task would start doing things outside the intuitive task scope, or the dangers of an AI ‘doing a bunch of wacky things’ a human would not have expected. On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level, and he readily admits it is an issue with commands that are likely to be given in practice, such as ‘make money.’

In general, he seems willing to react to advanced capabilities by essentially scaling up various messy solutions in ways that I predict would stop working at that scale or with something that outsmarts you and that has unanticipated affordances and reason to route around typical in-distribution behaviors. He does not seem to have given sufficient thought to what happens when a lot of his assumptions start breaking all at once, exactly because the AI is now capable enough to be properly dangerous.

As with the rest of OpenAI, another load-bearing assumption is presuming gradual changes throughout all this, including assuming past techniques will not break. I worry that will not hold.

He has some common confusions about regulatory options and where we have viable intervention points within competitive dynamics and game theory, but that’s understandable, and also was at the time very much not his department.

As with many others, there seems to be a disconnect. A lot of the thinking here seems like excellent practical thinking about mundane AI in pre-transformative-AI worlds, whether or not you choose to call that thing ‘AGI.’ Indeed, much of it seems built (despite John explicitly not expecting this) upon the idea of a form of capabilities plateau, where further progress is things like modalities and making the AI more helpful via post-training and helping it maintain longer chains of actions without the AI being that much smarter.

Then he clearly says we won’t spend much time in such worlds. He expects transformative improvements, such as a median of five years before AI does his job.

Most of all, I came away with the impression that this was a person thinking and trying to figure things out and solve problems. He is making many mistakes a person in his new position cannot afford to make for long, but this was a ‘day minus one’ interview, and I presume he will be able to talk to Jan Leike and others who can help him get up to speed.

I did not think the approach of Leike and Sutskever would work either, I was hoping they would figure this out and then pivot (or, perhaps, prove me wrong, kids.) Sutskever in particular seemed to have some ideas that felt pretty off-base, but with a fierce reputation for correcting course as needed. Fresh eyes are not the worst thing.

Are there things in this interview that should freak you out, aside from where I think John is making conceptual mistakes as noted above and later in detail?

That depends on what you already knew. If you did not know the general timelines and expectations of those at OpenAI? If you did not know that their safety work is not remotely ready for AGI or on track to get there and they likely are not on track to even be ready for GPT-5, as Jan Leike warned us? If you did not know that coordination is hard and game theory and competitive dynamics are hard to overcome? Then yeah, you are going to get rather a bit blackpilled. But that was all known beforehand.

Whereas, did you expect someone at OpenAI, who was previously willing to work on their capabilities teams given everything we now know, having a much better understanding of and perspective on AI safety than the one expressed here? To be a much better thinker than this? That does not seem plausible.

Given everything that we now know has happened at OpenAI, John Schulman seems like the best case scenario to step into this role. His thinking on alignment is not where it needs to be, but it is at a place he can move down the path, and he appears to be a serious thinker. He is a co-founder and knows his stuff, and has created tons of value for OpenAI, so hopefully he can be taken seriously and fight for resources and procedures, and to if necessary raise alarm bells about models, or other kinds of alarm bells to the public or the board. Internally, he is in every sense highly credible.

Like most others, I am to put it mildly not currently optimistic about OpenAI from a safety or an ethical perspective. The superalignment team, before its top members were largely purged and its remaining members dispersed, was denied the resources they were very publicly promised, with Jan Leike raising alarm bells on the way out. The recent revelations with deceptive and coercive practices around NDAs and non-disparagement agreements are not things that arise at companies I would want handling such grave matters, and they shine new light on everything else we know. The lying and other choices around GPT-4o’s Sky voice only reinforce this pattern.

So to John Schulman, who is now stepping into one of the most important and hardest jobs under exceedingly difficult conditions, I want to say, sincerely: Good luck. We wish you all the best. If you ever want to talk, I’m here.

This follows my usual podcast analysis format. I’ll offer comments with timestamps.

To make things clearer, things said in the main notes are what Dwarkesh and John are saying, and things in secondary notes are my thoughts.

  1. (2: 40) What do we anticipate by the end of the year? The next five years? The models will get better but in what ways? In 1-2 years they will do more involved tasks like carrying out an entire coding project based on high level instructions.

  2. (4: 00) This comes from training models to do harder tasks and multi-step tasks via RL. There’s lots of low-hanging fruit. Also they will get better error recovery and ability to deal with edge cases, and more sample efficient. They will generalize better, including generalizing from examples of ‘getting back on track’ in the training data, which they will use to learn to get back on track.

    1. The interesting thing he did not say yet is ‘the models will be smarter.’

    2. Instead he says ‘stronger model’ but this vision is more that a stronger model is more robust and learns from less data. Those are different things.

  3. (6: 50) What will it take for how much robustness? Now he mentions the need for more ‘model intelligence.’ He expects clean scaling laws, with potential de facto phase transitions. John notes we plan on different timescales and complexity levels using the same mental functions and expects that to apply to AI also.

  4. (9: 20) Would greater coherence mean human-level intelligence? John gives a wise ‘I don’t know’ and expects various other deficits and issues, but thinks this going quite far is plausible.

  5. (10: 50) What other bottlenecks might remain? He speculates perhaps something like taste or ability to handle ambiguity, or other mundane barriers, which he expects not to last.

    1. This seems like a focus on the micro at the expense of the bigger picture? It seems to reinforce an underlying implicit theory that the underlying ‘raw G’ is not going to much improve, and your wins come from better utilization. It is not obvious how far John thinks you can take that.

  6. (12: 00) What will the multimodal AI UI look like? AIs should be able to use human websites via vision. Some could benefit from redesigns to make AI interactions easier via text representations, but mostly the AIs will be the ones that adapt.

    1. That seems bizarre to me, at least for websites that have very large user bases. Wouldn’t you want to build a parallel system for AIs even if they could handle the original one? It seems highly efficient and you should capture some gains.

  7. (13: 40) Any surprising generalizations? Some in post-training, such as English fine-tuning working in other languages. He also mentions a tiny amount of data (only ~30 examples) doing the trick of universally teaching the model it couldn’t do things like order an Uber or send an Email.

  8. (16: 15) Human models next year? Will these new abilities do that, if not why not? John points out coherence is far from the only issue with today’s models.

    1. This whole frame of ‘improved coherence with the same underlying capabilities otherwise’ is so weird a hypothetical to dive into this deeply, unless you have reason to expect it. Spider senses are tingling. And yet…

  9. (17: 15) Dwarkesh asks if we should expect AGI soon. John says that would be reasonable (and will later give a 5 year timeline to replace his own job.) So Dwarkesh asks: What’s the plan? John says: “Well, if it came sooner than expected, we would want to be careful. We might want to slow down a little bit on training and deployment until we’re pretty sure we can deal with it safely. We would have a good handle on what it’s going to do and what it can do. We would have to be very careful if it happened way sooner than expected. Because our understanding is still rudimentary in a lot of ways.”

    1. You keep using that word? What were we even talking about before? Slow down a little bit? Pretty sure? I am going to give the benefit of the doubt, and say that this does not sound like much of an AGI.

    2. This seems like the right answer directionally, but with insufficient caution and freaking out, even if this is a relatively weak AGI? If this happens as a surprise, I would quite deliberately freak out.

  10. (18: 05) Dwarkesh follows up. What would ‘being careful’ mean? Presumably you’re already careful, right? John says, maybe it means not training the even smarter version or being really careful when you do train it that it’s properly sandboxed ‘and everything,’ not deploying it at scale.

    1. Again, that seems directionally right, but magnitude poor and that’s assuming the AGI definition is relative weaksauce. The main adjustment for ‘we made AGI when we didn’t expect it’ is to move somewhat slower on the next model?

    2. I mean it seems like ‘what to do with the AGI we have’ here is more or less ‘deploy it to all our users and see what happens’? I mean, man, I dunno.

  11. Let’s say AGI turns out to be easier than we expect and happens next year, and you’re deploying in a ‘measured way,’ but you wait and then other companies catch up. Now what does everyone do? John notes the obvious game theory issues, says we need some coordination so people can agree on some limits to deployment to avoid race dynamics and compromises on safety.

    1. This emphasizes that we urgently need an explicit antitrust exemption for exactly this scenario. At a bare minimum, I would hope we could all agree that AI labs need to able to coordinate and agree to delay development or deployment of future frontier models to allow time for safety work. The least the government can do, in that situation, is avoid making the problem worse.

    2. Norvid Studies: The Dwarkesh Schulman conversation is one of the crazier interviews I’ve ever heard. The combination of “AGI-for-real may fall out automatically from locked-in training in 1 to 3 years” and “when it happens I guess we’ll uh, maybe labs will coordinate, we’ll try to figure that out.”

    3. I read John here as saying he does not expect this to happen, that it would be a surprise and within a year would be a very large surprise (which seems to imply not GPT-5?) but yes that it is possible. John does not pretend that this coordination would then happen, or that he’s given it a ton of thought (nor was it his job), instead correctly noting that it is what would be necessary.

    4. His failure to pretend here is virtuous. He is alerting us to the real situation of what would happen if AGI did arrive soon in many places. Which is quite bad. I would prefer a different answer but only if it was true.

    5. Justin Halford: Schulman’s body language during the portion on game theory/coordination was clear – universal coordination is not going to happen. Firms and nation states will forge the path at a blistering pace. There is not a clear incentive to do anything but compete.

    6. I saw talk about how calm he was here. To my eyes, he was nervous but indeed insufficiently freaked out as I noted above. But also he’s had a while to let such things sink in, he shouldn’t be having the kind of emotional reaction you get when you first realize this scenario might happen.

  12. (20: 15) Pause what then? Deployment, training, some types of training, set up some reasonable rules for what everyone should do.

    1. I’m fine with the vagueness here. You were surprised by the capabilities in question, you should update on that and respond accordingly. I would still prefer the baseline be ‘do not train anything past this point and keep the AGI very carefully sandboxed at minimum until safety is robustly established.’

    2. That is true even in the absence of any of the weirder scenarios. True AGI is a big freaking deal. Know what you are doing before deployment.

  13. (21: 00) OK, suppose a pause. What’s the plan? John doesn’t have a good answer, but if everyone can coordinate like that it would be an OK scenario. He does notice that maintaining the equilibrium would be difficult.

    1. I actually give this answer high marks. John is being great all around about noticing and admitting confusion and not making up answers. He also notes how fortunate we would be to be capable of this coordination at all.

    2. I presume that if we did get there, that the government would then either be amenable to enshrining the agreement and extending it, or they would actively betray us all and demand the work resume. It seems implausible they would let it play out on its own.

  14. (22: 20) Dwarkesh pushes. Why is this scenario good? John says we could then solve technical problems and coordinate to deploy smart technical AIs with safeguards in place, which would be great, prosperity, science, good things. That’s the good scenario.

    1. The issue is this assumes both even stronger coordination on deployment, which could be far harder than coordination on pausing, making a collective decision to hold back including internationally, and it supposes that we figure out how to make the AI safety work on our behalf.

    2. Again, I wish we had better answers all around, but given that we do not admitting we don’t have them is the best answer available.

  15. (23: 15) What would be proof the systems were safe to deploy? John proposes incremental deployment of smarter systems, he’d prefer to avoid the lockdown scenario. Better to continuously release incremental improvements, each of which improves safety and alignment alongside capability, with ability to slow down if things look scary. If you did have a discontinuous jump? No generic answer, but maybe a lot of testing simulated deployment and red teaming, under conditions more likely to fail than the real world, and have good monitoring. Defense in depth, good morals instilled, monitoring for trouble.

    1. Again I love the clear admission that he doesn’t know many things.

    2. Incremental deployment has its advantages, but there is an underlying assumption that alignment and safety are amenable to incremental progress as well, and that there won’t be any critical jumps or inflection points where capabilities effectively jump or alignment techniques stop working in various ways. I’d have liked to see these assumptions noted, especially since I think they are not true.

    3. We are in ‘incremental deployment’ mode right now because we went 4→Turbo→4o while others were catching up but I expect 5 to be a big jump.

  16. (26: 30) How to notice a discontinuous jump? Should we do these long-range trainings given that risk? Evals. Lots of evals. RIght now, John says, we’re safe, but in future we will need to check if they’re going to turn against us, and look for discontinuous jumps. ‘That doesn’t seem like the hardest thing to do. The way we train them with RLHF, even though the models are very smart, the model is just trying to produce something that is pleasing to a human. It has no other concerns in the world other than whether this text is approved.’ Then he notices tool use over many steps might change that, but ‘it wouldn’t have any incentive to do anything except produce a very high quality output at the end.’

    1. So this is the first answer that made me think ‘oh no.’ Eliezer has tried to explain so many times why it’s the other way. I have now tried many times to explain why it’s the other way. Or rather, why at some point in the capability curve it becomes the other way, possibly all at once, and you should not be confident you will notice.

    2. No, I’m not going to try again to explain it here. I do try a bit near the end.

  17. (29: 00) He mentions the full instrumental convergence scenario of ‘first take over the world’ and says it’s a little hard to imagine. Maybe with a task like ‘make money’ that would be different and lead to nefarious instrumental goals.

    1. So close to getting it.

    2. Feels like there’s an absurdity heuristic blocking him from quite getting there.

    3. If John really does dive deep into these questions, seems like he’ll get it.

  1. (30: 00) Psychologically what kind of thing is being changed by RLHF? John emphasizes this is an analogy, like the satisfaction you get from achieving a goal, one can metaphorically think of the models as having meaningful drives and goals.

    1. I love the balanced approach here.

  2. (31: 30) What is the best approach to get good reasoning? Train on chains of thought, or do inference in deployment? John says you could think of reasoning as tasks that require computation or deduction at test time, and that you should use a mix of both.

    1. Yep, seems right to me.

  3. (33: 45) Is there a path between in-context learning and pre-training, some kind of medium-term memory? What would ‘doing the research for the task’ or ‘looking into what matters here that you don’t know’ look like? John says this is missing from today’s systems and has been neglected. Instead we scale everything including the context window. But you’d want to supplement that through fine-tuning.

    1. This suggests a kind of lightweight, single-use automated fine-tuning regime?

    2. Currently this is done through scaffolding, chain of thought and external memory for context, as I understand this, but given how few-shot fine-tuning can be and still be effective, this does seem underexplored?

  4. (37: 30) What about long horizon tasks? You’re learning as you go so your learning and memory must update. Really long context also works but John suggests you also want fine tuning, and you might get active learning soon.

  5. (39: 30) What RL methods will carry forward to this? John says policy grading is not sample efficient, similar to motor learning in animals, so don’t use that at test time. You want in-context learning with a learned algorithm, things that look like learned search algorithms.

  6. (41: 15) Shift to personal history and experiences. Prior to ChatGPT they had ‘instruction following models’ that would at least do things like answer questions. They did a bunch of work to make the models more usable. Coding was a clear early use case. They had browsing early but they de-emphasized it. Chat orientation made it all much easier, people knew what to reinforce.

  7. (47: 30) Creating ChatGPT requires several iterations of bespoke fine-tuning.

  8. (49: 40) AI progress has been faster than John expected since GPT-2. John’s expectations pivot was after GPT-3.

  9. (50: 30) John says post-training likely will take up a larger portion of training costs over time. They’ve found a lot of gains through post-training.

  10. (51: 30) The improvement in Elo score for GPT-4o is post-training.

    1. Note: It was a 100-point Elo improvement based on the ‘gpt2’ tests prior to release, but GPT-4o itself while still on top saw only a more modest increase.

  11. (52: 40) What makes a good ML researcher? Diverse experience. Knows what to look for. Emperia and techne, rather than metis.

  12. (53: 45) Plateau? Can data enable more progress? How much cross-progress? John correctly warns us that it has not been so long since GPT-4. He does not expect us to hit the data wall right away but that we will approach it soon and this will change training. He also notes that running experiments on GPT-4 level training runs are too expensive to be practical, but you could run ablation experiments on GPT-2 level models, but John notes that transfer failure at small scale only provides weak evidence for what happens at large scale.

  13. (57: 45) Why does more parameters make a model smarter on less data? John does not think anyone understand the mechanisms of scaling laws for parameter counts. John speculates that the extra parameters allow more computations and better residual streams and doing more things in parallel. You can have a bigger library of functions you can chain together.

  1. (1: 01: 00) What other modalities and impacts should we expect over the next few years? New modalities coming soon and over time. Capabilities will improve through a combination of pre-training and post-training. Higher impact on economy over time, even if model abilities were frozen. Much more wide use and for more technically sophisticated tasks. Science analysis and progress. Hopefully humans are still in command and directing the AIs.

    1. This all seems right and very much like the things that are baked in even with disappointing AI progress. I continue to be baffled by the economists who disagree that similar changes are coming.

    2. What this does not sound like is what I would think about as AGI.

  2. (1: 05: 00) What happens on the path to when AI is better at everything? Is that gradual? Will the systems stay aligned? John says maybe not jump to AIs running whole firms, maybe have people oversee key decisions. Hopefully humans are still the drivers of what AIs end up doing.

    1. Agreed, but how do we make that happen, when incentives run against it?

  3. (1: 07: 00) In particular, Dwarkesh raises Amdahl’s law, that the slowest part of the process bottlenecks you. How do you compete with the corporation or nations that take humans out of their loops? John suggests regulation.

    1. But obviously that regulation gets de facto ignored. The human becomes at best a rubber stamp, if it would be expensive to be more than that.

    2. Thus this is not a valid bottleneck to target. Once you let the AI ‘out of the box’ in this sense, and everyone has access to it, even if the AIs are all being remarkably aligned and well-behaved this style of regulation is swimming too upstream.

    3. Even if you did institute ‘laws with teeth’ that come at great relative efficiency cost but would do the job, how are you going to enforce them? At best you are looking at a highly intrusive regime requiring international cooperation.

  4. (1: 08: 15) Dwarkesh is there. If you do this at the company level then every company must be monitored in every country. John correctly notes that the alternative is to get all the model providers onboard.

    1. Not only every company, also every individual and every computer or phone.

    2. John gets the core insight here. In my word: If capabilities advance sufficiently then even in relatively otherwise good worlds, we can either:

      1. ‘Allow nature to take its course’ in the sense of allowing everything to be run and be controlled by AIs and hope that goes well for the humans OR

      2. Use models and providers as choke points to prevent this OR

      3. Use another choke point, but that looks far worse and more intrusive.

  5. (1: 09: 45) John speculates, could AI-run companies still have weaknesses, perhaps higher tail risk? Perhaps impose stricter liability? He says if alignment is solved that even then letting AIs run the firms, or fully run firms, might be pretty far out.

    1. Tail risk to the firm, or to the world, or both?

    2. Wouldn’t a capable AI, if it had blind spots, know when to call upon a human or another AI to check for those blind spots, if it could not otherwise fix them? That does not seem so hard, relative to the rest of this.

    3. I agree there could be a period where the right play on a company level is ‘the AI is mostly running things but humans still need to supervise for real to correct errors and make macro decisions,’ and it might not only be a Tuesday.

    4. You still end up in the same place?

  6. (1: 11: 00) What does aligned mean here? User alignment? Global outcome optimization? John notes we would have to think about RLHF very differently than we do now. He refers to the Model Spec on how to settle various conflicts. Mostly be helpful to the user, but not when it impinges on others. Dwarkesh has seen the model spec, is impressed by its handling of edge cases. John notes it is meant to be actionable with examples.

    1. This is the scary stuff. At the capabilities levels being discussed and under the instructions involved in running a firm, I fully expect RLHF to importantly fail, and do so in unexpected, sudden and hard to detect and potentially catastrophic ways.

    2. I will be analyzing the Model Spec soon. Full post is coming. The Model Spec is an interesting first draft of a useful document, very glad they shared it with us, but it does not centrally address this issue.

    3. Mostly resolution of conflicts is simple at heart, as spelled out in the Model Spec? Platform > Developer > User > Tool. You can in a sense add Government at the front of that list, perhaps, as desired. With the upper levels including concern for others and more. More discussion will be in full post.

    4. I do suggest a number of marginal changes to the Model Spec, both for functionality and for clarity.

    5. I’m mostly holding onto that post because I worry no one would read it atm.

  7. (1: 15: 40) Does ML research look like p-hacking? John says it’s relatively healthy due to practicality, although everyone has complaints. He suggests using base models to do social science research via simulation.

    1. I don’t see much p-hacking either. We got 99 problems, this aint one.

    2. Using base models for simulated social science sounds awesome, especially if we have access to strong enough base models. I both hope and worry that this will be accurate enough that certain types will absolutely freak out when they see the results start coming back. Many correlations are, shall we say, unwelcome statements in polite society.

  8. (1: 19: 00) How much of big lab research is compute multipliers versus stabilizing learning versus improving infrastructure? How much algorithmic improvement in efficiency? John essentially says they trade off against each other, and there’s a lot of progress throughout.

    1. First time an answer felt like it was perhaps a dodge. Might be protecting insights, might also be not the interesting question, Dwarkesh does not press.

  9. (1: 20: 15) RLHF rapid-fire time. Are the raters causing issues like all poetry having to rhyme until recently? John says processes vary a lot, progress is being made including to make the personality more fun. He wonders about ticks like ‘delve.’ An interesting speculation is, what if there is de facto distillation because people you hire decided to use other chatbots to generate their feedback for the model via cut and paste. But people like bullet points and structure and info dumps.

    1. Everyone has different taste, but I am not a fan of the new audio personality as highlighted in the GPT-4o demos. For text it seems to still mostly have no personality at least with my instructions, but that is how I like it.

    2. It does make sense that people like bullet points and big info dumps. I notice that I used to hate it because it took forever, with GPT-4o I am largely coming around to it with the new speed, exactly as John points out in the next section. I do still often long for more brevity.

  10. (1: 23: 15) Dwarkesh notes it seems to some people too verbose perhaps due to labeling feedback. John speculates that only testing one message could be a cause of that, for example clarifying questions get feedback to be too long. And he points to the rate of output as a key factor.

  11. (1: 24: 45) For much smarter models, could we give a list of things we want that are non-trivial and non-obvious? Or are our preferences too subtle and need to be found via subliminal preferences? John agrees a lot of things models learn are hard to articulate in an instruction manual, potentially you can use a lot of examples like the Model Spec. You can do distillation, and bigger models learn a lot of concepts automatically about what people find helpful and useful and they can latch onto moral theories or styles.

    1. Lot to dig into here, and this time I will attempt it.

    2. I strongly agree, as has been pointed out many times, that trying to precisely enumerate and define what we want doesn’t work, our actual preferences are too complex and subtle.

    3. Among humans, we adjust for all that, and our laws and norms are chosen with the expectation of flexible enforcement and taking context and various considerations into account.

    4. When dealing with current LLMs, and situations that are effectively inside the distribution and that do not involve outsized capabilities, the ‘learn preferences through osmosis’ strategy should and so far does work well when combined with a set of defined principles, with some tinkering. And indeed, for now, as optimists have pointed out, making the models more capable and smarter should make them better able to do this.

    5. In my world model, this works for now because there are not new affordances, options and considerations that are not de facto already in the training data. If the AI tried to (metaphorically, non-technically) take various bizarre or complex paths through causal space, they would not work, the AI and its training are not capable enough to profitably find and implement them. Even when we try to get the AIs to act like agents and take complex paths and do strategic planning, they fall on their metaphorical faces. We are not being saved from these outcomes because the AI has a subtle understanding of human morality and philosophy and the harm principles.

    6. However, if the AIs got sufficiently capable that those things would stop failing, all bets are off. A lot of new affordances come into play, things that didn’t happen before because they wouldn’t have worked now work and therefore happen. The correspondence between what you reward and what you want will break.

    7. Even if the AIs did successfully extract all our subtle intuitions for what is good in life, and even if the AIs were attempting to follow that, those intuitions only give you reasonable answers inside the human experiential distribution. Go far enough outside it, change enough features, and they become deeply stupid and contradictory.

    8. You also have the full ‘the genie knows but does not care’ problem.

    9. We are going to need much better plans for now to deal with all this. I certainly do not have the answers.

  12. (1: 27: 20) What will be the moat? Will it be the finicky stuff versus model size? John says post training can be a strong moat in the future, it requires a lot of tacit knowledge and organizational knowledge and skilled work that accumulates over time to do good post training. It can be hard to tell because serious pre-training and post-training efforts so far have happened in lockstep. Distillation could be an issue, either copying or using the other AI as output judge, if you are willing to break terms of service and take the hit to your pride.

    1. There are other possible moats as well, including but not limited to user data and customers and social trust and two-sided markets and partnerships.

    2. And of course potentially regulatory capture. There has been a bunch of hyperbolic talk about it, but eventually this is an important consideration.

  13. (1: 29: 40) What does the median rater look like? John says it varies, but one could look on Upwork or other international remote work job sites for a baseline, although there are a decent number of Americans. For STEM you can use India or lower income countries, for writing you want Americans. Quality varies a lot.

  14. (1: 31: 30) To what extent are useful outputs closely matched to precise labelers and specific data? John says you can get a lot out of generalization.

  15. (1: 35: 40) Median timeline to replace John’s job? He says five years.

    1. I like the concreteness of the question phrasing, especially given John’s job.

    2. If the AI can do John’s job (before or after the switch), then… yeah.

    3. Much better than asking about ‘AGI’ given how unclear that term is.

I put my conclusion and overall thoughts at the top.

It has not been a good week for OpenAI, or a good week for humanity.

But given what else happened and that we know, and what we might otherwise have expected, I am glad John Schulman is the one stepping up here.

Good luck!

On Dwarkesh’s Podcast with OpenAI’s John Schulman Read More »

On Questionnaires and Briefings: Explaining the GigaOm Policy Change

A stitch in time saves nine, they say, and so can receiving information in the right order.

We at GigaOm are constantly looking to make our research processes more efficient and more effective. Vendors often tell us what it’s like to work with us—we welcome these interactions and look to address every comment (so thank you for these!). We spent a good part of 2023 on driving far-reaching improvements in our processes, and we’re building on that in the knowledge that better efficiency leads to higher quality research at lower cost, as well as happier analysts and vendors!

That’s why we’re making a small yet necessary change to our briefings process. Historically, we’ve asked vendors to complete a questionnaire and/or schedule a briefing call, and we haven’t specified the order these should take place. The small tweak is to request that vendors first complete a questionnaire, THEN participate in a call to clarify details.

In practice, this means we will enforce receiving a completed questionnaire 24 hours before a scheduled briefing call. Should we not receive it within this timeframe, we will reschedule the briefing so the questionnaire can be completed and reviewed prior to the call. Analysts need time to review vendor responses before a briefing, so getting the questionnaire five minutes before won’t cut it.

As well as fostering efficiency on both sides, the broader reasons for this change are founded in our engineering-led evaluation approach, which reflects how an end-user organization might conduct an RFP process. What we do is to set out a number of decision criteria we expect products to possess, then ask for evidence to show these features are in fact present.

Briefings are actually an inefficient mechanism for delivering that information; the questionnaire is far better at giving us what we need to know to assess whether and how a product delivers on our criteria. Briefings should supplement the questionnaire, giving analysts an opportunity to ask follow-up questions about vendor responses which will cut down on unnecessary back-and-forth during fact check.

Briefings also have their own distractions. Keep in mind that we care less about market positioning and more about product capability. General briefings (outside of the research cycle) are a great place to set out strategy, have the logo slide, run through case studies, and all that. We love those general briefings, but the research cycle is the wrong moment for the big tent stuff (which often exists as a prerecorded video that we’d be happy to review, just not as part of a report briefing call).

I’ve often told vendors we’re not looking for all the bling during briefings. In the best cases, our engineers engage with your engineers about the key features of your products. We don’t need trained spokespeople as much as an honest conversation about functionality and use cases—10 minutes on a video call can clarify something that reams of marketing material, and user documentation cannot. Hence the change.

This shouldn’t add any extra time to the process—the opposite, in fact, as briefings are more productive when the questionnaire is already in place. We can reduce costly errors, decrease back-and-forth clarifications, and minimize misinterpretation (with the consequent potential backlash on AR, “how did you let them write that?”).

So, there you have it. We’ll be rolling out this change in early June for our September reports, so nothing will happen in a rush. Any questions or concerns, please do let us know—we’re constantly adjusting timeframes based on national holidays, industry conferences, and competitor cycles, and we welcome all input on events that might impact delivery.

We are looking at other ways we can improve efficiency, notably simplifying or reformatting the questionnaire, so watch this space for details—and we welcome any thoughts you may have! We also understand that logistics can be tough: we are all juggling time, resources, and people to enable research to happen.

We absolutely recognize the symbiosis between analysts and vendors, and we thoroughly appreciate the efforts made by AR teams on our behalf, to enable these interactions to happen—from familiarization with GigaOm and explaining our value, through negotiating the minefield of operational logistics! Our door is always open if you need anyone to help support your endeavors, as we work toward a win-win for all.

On Questionnaires and Briefings: Explaining the GigaOm Policy Change Read More »

we-take-a-stab-at-decoding-spacex’s-ever-changing-plans-for-starship-in-florida

We take a stab at decoding SpaceX’s ever-changing plans for Starship in Florida

SpaceX's Starship tower (left) at Launch Complex 39A dwarfs the launch pad for the Falcon 9 rocket (right).

Enlarge / SpaceX’s Starship tower (left) at Launch Complex 39A dwarfs the launch pad for the Falcon 9 rocket (right).

There are a couple of ways to read the announcement from the Federal Aviation Administration that it’s kicking off a new environmental review of SpaceX’s plan to launch the most powerful rocket in the world from Florida.

The FAA said on May 10 that it plans to develop an Environmental Impact Statement (EIS) for SpaceX’s proposal to launch Starships from NASA’s Kennedy Space Center in Florida. The FAA ordered this review after SpaceX updated the regulatory agency on the projected Starship launch rate and the design of the ground infrastructure needed at Launch Complex 39A (LC-39A), the historic launch pad once used for Apollo and Space Shuttle missions.

Dual environmental reviews

At the same time, the US Space Force is overseeing a similar EIS for SpaceX’s proposal to take over a launch pad at Cape Canaveral Space Force Station, a few miles south of LC-39A. This launch pad, designated Space Launch Complex 37 (SLC-37), is available for use after United Launch Alliance’s last Delta rocket lifted off there in April.

On the one hand, these environmental reviews often take a while and could cloud Elon Musk’s goal of having Starship launch sites in Florida ready for service by the end of 2025. “A couple of years would not be a surprise,” said George Nield, an aerospace industry consultant and former head of the FAA’s Office of Commercial Space Transportation.

Another way to look at the recent FAA and Space Force announcements of pending environmental reviews is that SpaceX finally appears to be cementing its plans to launch Starship from Florida. These plans have changed quite a bit in the last five years.

The environmental reviews will culminate in a decision on whether to approve SpaceX’s proposals for Starship launches at LC-39A and SLC-37. The FAA will then go through a separate licensing process, similar to the framework used to license the first three Starship test launches from South Texas.

NASA has contracts with SpaceX worth more than $4 billion to develop a human-rated version of Starship to land astronauts on the Moon on the first two Artemis lunar landing flights later this decade. To do that, SpaceX must stage a fuel depot in low-Earth orbit to refuel the Starship lunar lander before it heads for the Moon. It will take a series of Starship tanker flights—perhaps 10 to 15—to fill the depot with cryogenic propellants.

Launching that many Starships over the course of a month or two will require SpaceX to alternate between at least two launch pads. NASA and SpaceX officials say the best way to do this is by launching Starships from one pad in Texas and another in Florida.

Earlier this week, Ars spoke with Lisa Watson-Morgan, who manages NASA’s human-rated lunar lander program. She was at Kennedy Space Center this week for briefings on the Starship lander and a competing lander from Blue Origin. One of the topics, she said, was the FAA’s new environmental review before Starship can launch from LC-39A.

“I would say we’re doing all we can to pull the schedule to where it needs to be, and we are working with SpaceX to make sure that their timeline, the EIS timeline, and NASA’s all work in parallel as much as we can to achieve our objectives,” she said. “When you’re writing it down on paper just as it is, it looks like there could be some tight areas, but I would say we’re collectively working through it.”

Officially, SpaceX plans to perform a dress rehearsal for the Starship lunar landing in late 2025. This will be a full demonstration, with refueling missions, an uncrewed landing of Starship on the lunar surface, then a takeoff from the Moon, before NASA commits to putting people on Starship on the Artemis III mission, currently slated for September 2026.

So you can see that schedules are already tight for the Starship lunar landing demonstration if SpaceX activates launch pads in Florida late next year.

We take a stab at decoding SpaceX’s ever-changing plans for Starship in Florida Read More »

new-research-shows-gas-stove-emissions-contribute-to-19,000-deaths-annually

New research shows gas stove emissions contribute to 19,000 deaths annually

New research shows gas stove emissions contribute to 19,000 deaths annually

Ruth Ann Norton used to look forward to seeing the blue flame that danced on the burners of her gas stove. At one time, she says, she would have sworn that preparing meals with the appliance actually made her a better cook.

But then she started learning about the toxic gasses, including carbon monoxide, formaldehyde and other harmful pollutants that are emitted by stoves into the air, even when they’re turned off.

“I’m a person who grew up cooking, and love that blue flame,” said Norton, who leads the environmental advocacy group known as the Green & Healthy Homes Initiative. “But people fear what they don’t know. And what people need to understand really strongly is the subtle and profound impact that this is having—on neurological health, on respiratory health, on reproductive health.”

In recent years, gas stoves have been an unlikely front in the nation’s culture wars, occupying space at the center of a debate over public health, consumer protection, and the commercial interests of manufacturers. Now, Norton is among the environmental advocates who wonder if a pair of recent developments around the public’s understanding of the harms of gas stoves might be the start of a broader shift to expand the use of electrical ranges.

On Monday, lawmakers in the California Assembly advanced a bill that would require any gas stoves sold in the state to bear a warning label indicating that stoves and ovens in use “can release nitrogen dioxide, carbon monoxide, and benzene inside homes at rates that lead to concentrations exceeding the standards of the Office of Environmental Health Hazard Assessment and the United States Environmental Protection Agency for outdoor air quality.”

The label would also note that breathing those pollutants “can exacerbate preexisting respiratory illnesses and increase the risk of developing leukemia and asthma, especially in children. To help reduce the risk of breathing harmful gases, allow ventilation in the area and turn on a vent hood when gas-powered stoves and ranges are in use.”

The measure, which moved the state Senate, could be considered for passage later this year.

“Just running a stove for a few minutes with poor ventilation can lead to indoor concentrations of nitrogen dioxide that exceed the EPA’s air standard for outdoors,” Gail Pellerin, the California assembly member who introduced the bill, said in an interview Wednesday. “You’re sitting there in the house drinking a glass of wine, making dinner, and you’re just inhaling a toxic level of these gases. So, we need a label to make sure people are informed.”

Pellerin’s proposal moved forward in the legislature just days after a group of Stanford researchers announced the findings of a peer-reviewed study that builds on earlier examinations of the public health toll of exposure to nitrogen dioxide pollution from gas and propane stoves.

New research shows gas stove emissions contribute to 19,000 deaths annually Read More »

the-nature-of-consciousness,-and-how-to-enjoy-it-while-you-can

The nature of consciousness, and how to enjoy it while you can

Remaining aware —

In his new book, Christof Koch views consciousness as a theorist and an aficionado.

A black background with multicolored swirls filling the shape of a human brain.

Unraveling how consciousness arises out of particular configurations of organic matter is a quest that has absorbed scientists and philosophers for ages. Now, with AI systems behaving in strikingly conscious-looking ways, it is more important than ever to get a handle on who and what is capable of experiencing life on a conscious level. As Christof Koch writes in Then I Am Myself the World, “That you are intimately acquainted with the way life feels is a brute fact about the world that cries out for an explanation.” His explanation—bounded by the limits of current research and framed through Koch’s preferred theory of consciousness—is what he eloquently attempts to deliver.

Koch, a physicist, neuroscientist, and former president of the Allen Institute for Brain Science, has spent his career hunting for the seat of consciousness, scouring the brain for physical footprints of subjective experience. It turns out that the posterior hot zone, a region in the back of the neocortex, is intricately connected to self-awareness and experiences of sound, sight, and touch. Dense networks of neocortical neurons in this area connect in a looped configuration; output signals feedback into input neurons, allowing the posterior hot zone to influence its own behavior. And herein, Koch claims, lies the key to consciousness.

In the hot zone

According to integrated information theory (IIT)—which Koch strongly favors over a multitude of contending theories of consciousness—the Rosetta Stone of subjective experience is the ability of a system to influence itself: to use its past state to affect its present state and its present state to influence its future state.

Billions of neurons exist in the cerebellum, but they are wired “with nonoverlapping inputs and outputs … in a feed-forward manner,” writes Koch. He argues that a structure designed in this way, with limited influence over its own future, is not likely to produce consciousness. Similarly, the prefrontal cortex might allow us to perform complex calculations and exhibit advanced reasoning skills, but such traits do not equate to a capacity to experience life. It is the “reverberatory, self-sustaining excitatory loops prevalent in the neocortex,” Koch tells us, that set the stage for subjective experience to arise.

This declaration matches the experimental evidence Koch presents in Chapter 6: Injuries to the cerebellum do not eliminate a person’s awareness of themselves in relation to the outside world. Consciousness remains, even in a person who can no longer move their body with ease. Yet injuries to the posterior hot zone within the neocortex significantly change a person’s perception of auditory, visual, and tactile information, altering what they subjectively experience and how they describe these experiences to themselves and others.

Does this mean that artificial computer systems, wired appropriately, can be conscious? Not necessarily, Koch says. This might one day be possible with the advent of new technology, but we are not there yet. He writes. “The high connectivity [in a human brain] is very different from that found in the central processing unit of any digital computer, where one transistor typically connects to a handful of other transistors.” For the foreseeable future, AI systems will remain unconscious despite appearances to the contrary.

Koch’s eloquent overview of IIT and the melodic ease of his neuroscientific explanations are undeniably compelling, even for die-hard physicalists who flinch at terms like “self-influence.” His impeccably written descriptions are peppered with references to philosophers, writers, musicians, and psychologists—Albert Camus, Viktor Frankl, Richard Wagner, and Lewis Carroll all make appearances, adding richness and relatability to the narrative. For example, as an introduction to phenomenology—the way an experience feels or appears—he aptly quotes Eminem: “I can’t tell you what it really is, I can only tell you what it feels like.”

The nature of consciousness, and how to enjoy it while you can Read More »

the-apple-tv-is-coming-for-the-raspberry-pi’s-retro-emulation-box-crown

The Apple TV is coming for the Raspberry Pi’s retro emulation box crown

watch out, raspberry pi —

Apple’s restrictions will still hold it back, but there’s a lot of possibility.

The RetroArch app installed in tvOS.

Enlarge / The RetroArch app installed in tvOS.

Andrew Cunningham

Apple’s initial pitch for the tvOS and the Apple TV as it currently exists was centered around apps. No longer a mere streaming box, the Apple TV would also be a destination for general-purpose software and games, piggybacking off of the iPhone’s vibrant app and game library.

That never really panned out, and the Apple TV is still mostly a box for streaming TV shows and movies. But the same App Store rule change that recently allowed Delta, PPSSPP, and other retro console emulators onto the iPhone and iPad could also make the Apple TV appeal to people who want a small, efficient, no-fuss console emulator for their TVs.

So far, few of the emulators that have made it to the iPhone have been ported to the Apple TV. But earlier this week, the streaming box got an official port of RetroArch, the sprawling collection of emulators that runs on everything from the PlayStation Portable to the Raspberry Pi. RetroArch could be sideloaded onto iOS and tvOS before this, but only using awkward workarounds that took a lot more work and know-how than downloading an app from the App Store.

Downloading and using RetroArch on the Apple TV is a lot like using it on any other platform it supports, for better or worse. ROM files can be uploaded using a browser connected to the Apple TV’s IP address or hostname, which will pop up the first time you launch the RetroArch app. From there, you’re only really limited by the list of emulators that the Apple TV version of the app supports.

The main benefit of using the Apple TV hardware for emulation is that even older models have substantially better CPU and GPU performance than any Raspberry Pi; the first-gen Apple TV 4K and its Apple A10X chip date back to 2017 and still do better than a Pi 5 released in 2023. Even these older models should be more than fast enough to support advanced video filters, like Run Ahead, to reduce wireless controller latency and higher-than-native-resolution rendering to make 3D games look a bit more modern.

Beyond the hardware, tvOS is also a surprisingly capable gaming platform. Apple has done a good job adding and maintaining support for new Bluetooth gamepads in recent releases, and even Nintendo’s official Switch Online controllers for the NES, SNES, and N64 are all officially supported as of late 2022. Apple may have added this gamepad support primarily to help support its Apple Arcade service, but all of those gamepads work equally well with RetroArch.

At the risk of stating the obvious, another upside of using the Apple TV for retro gaming is that you can also still use it as a modern 4K video streaming box when you’re finished playing your games. It has well-supported apps from just about every streaming provider, and it supports all the DRM that these providers insist on when you’re trying to stream high-quality 4K video with modern codecs. Most Pi gaming distributions offer the Kodi streaming software, but it’s frankly outside the scope of this article to talk about the long list of caveats and add-ons you’d need to use to attempt using the same streaming services the Apple TV can access.

Obviously, there are trade-offs. Pis have been running retro games for a decade, and the Apple TV is just starting to be able to do it now. Even with the loosened App Store restrictions, Apple still has other emulation limitations relative to a Raspberry Pi or a PC.

The biggest one is that emulators on Apple’s platforms can’t use just-in-time (JIT) code compilation, needed for 3D console emulators like Dolphin. These restrictions make the Apple TV a less-than-ideal option for emulating newer consoles—the Nintendo 64, Nintendo DS, Sony PlayStation, PlayStation Portable, and Sega Saturn are the newest consoles RetroArch supports on the Apple TV, cutting out newer things like the GameCube and Wii, Dreamcast, and PlayStation 2 that are all well within the capabilities of Apple’s chips. Apple also insists nebulously that emulators must be for “retro” consoles rather than modern ones, which could limit the types of emulators that are available.

With respect to RetroArch specifically, there are other limitations. Though RetroArch describes itself as a front-end for emulators, its user interface is tricky to navigate, and cluttered with tons of overlapping settings that make it easy to break things if you don’t know what you’re doing. Most Raspberry Pi gaming distros use RetroArch, but with a front-end-for-a-front-end like EmulationStation installed to make RetroArch a bit more accessible and easy to learn. A developer could release an app that included RetroArch plus a separate front-end, but Apple’s sandboxing restrictions would likely prevent anyone from releasing an app that just served as a more user-friendly front-end for the RetroArch app.

Regardless, it’s still pretty cool to be able to play retro games on an Apple TV’s more advanced hardware. As more emulators make their way to the App Store, the Apple TV’s less-fussy software and the power of its hardware could make it a compelling alternative to a more effort-intensive Raspberry Pi setup.

The Apple TV is coming for the Raspberry Pi’s retro emulation box crown Read More »

cats-playing-with-robots-proves-a-winning-combo-in-novel-art-installation

Cats playing with robots proves a winning combo in novel art installation

The feline factor —

Cat Royale project explores what it takes to trust a robot to look after beloved pets.

Cat with the robot arm in the Cat Royale installation

Enlarge / A kitty named Clover prepares to play with a robot arm in the Cat Royale “multi-species” science/art installation .

Blast Theory – Stephen Daly

Cats and robots are a winning combination, as evidenced by all those videos of kitties riding on Roombas. And now we have Cat Royale, a “multispecies” live installation in which three cats regularly “played” with a robot over 12 days, carefully monitored by human operators. Created by computer scientists from the University of Nottingham in collaboration with artists from a group called Blast Theory, the installation debuted at the World Science Festival in Brisbane, Australia, last year and is now a touring exhibit. The accompanying YouTube video series recently won a Webby Award, and a paper outlining the insights gleaned from the experience was similarly voted best paper at the recent Computer-Human Conference (CHI’24).

“At first glance, the project is about designing a robot to enrich the lives of a family of cats by playing with them,” said co-author Steve Benford of the University of Nottingham, who led the research, “Under the surface, however, it explores the question of what it takes to trust a robot to look after our loved ones and potentially ourselves.” While cats might love Roombas, not all animal encounters with robots are positive: Guide dogs for the visually impaired can get confused by delivery robots, for example, while the rise of lawn mowing robots can have a negative impact on hedgehogs, per Benford et al.

Blast Theory and the scientists first held a series of exploratory workshops to ensure the installation and robotic design would take into account the welfare of the cats. “Creating a multispecies system—where cats, robots, and humans are all accounted for—takes more than just designing the robot,” said co-author Eike Schneiders of Nottingham’s Mixed Reality Lab about the primary takeaway from the project. “We had to ensure animal well-being at all times, while simultaneously ensuring that the interactive installation engaged the (human) audiences around the world. This involved consideration of many elements, including the design of the enclosure, the robot, and its underlying systems, the various roles of the humans-in-the-loop, and, of course, the selection of the cats.”

Based on those discussions, the team set about building the installation: a bespoke enclosure that would be inhabited by three cats for six hours a day over 12 days. The lucky cats were named Ghostbuster, Clover, and Pumpkin—a parent and two offspring to ensure the cats were familiar with each other and comfortable sharing the enclosure. The enclosure was tricked out to essentially be a “utopia for cats,” per the authors, with perches, walkways, dens, a scratching post, a water fountain, several feeding stations, a ball run, and litter boxes tucked away in secluded corners.

(l-r) Clover, Pumpkin, and Ghostbuster spent six hours a day for 12 days in the installation.

Enlarge / (l-r) Clover, Pumpkin, and Ghostbuster spent six hours a day for 12 days in the installation.

E. Schneiders et al., 2024

As for the robot, the team chose the Kino Gen3 lite robot arm, and the associated software was trained on over 7,000 videos of cats. A decision engine gave the robot autonomy and proposed activities for specific cats. Then a human operator used an interface control system to instruct the robot to execute the movements. The robotic arm’s two-finger gripper was augmented with custom 3D-printed attachments so that the robot could manipulate various cat toys and accessories.

Each cat/robot interaction was evaluated for a “happiness score” based on the cat’s level of engagement, body language, and so forth. Eight cameras monitored the cat and robot activities, and that footage was subsequently remixed and edited into daily YouTube highlight videos and, eventually, an eight-hour film.

Cats playing with robots proves a winning combo in novel art installation Read More »

leaks-from-valve’s-deadlock-look-like-a-pressed-sandwich-of-every-game-around

Leaks from Valve’s Deadlock look like a pressed sandwich of every game around

Deadlock isn’t the most original name, but trademarks are hard —

Is there something new underneath a whole bunch of familiar game elements?

Shelves at Valve's offices, as seen in 2018, with a mixture of artifacts from Half-Life, Portal, Dota 2, and other games.

Enlarge / Valve has its own canon of games full of artifacts and concepts worth emulating, as seen in a 2018 tour of its offices.

Sam Machkovech

“Basically, fast-paced interesting ADHD gameplay. Combination of Dota 2, Team Fortress 2, Overwatch, Valorant, Smite, Orcs Must Die.”

That’s how notable Valve leaker “Gabe Follower” describes Deadlock, a Valve game that is seemingly in playtesting at the moment, for which a few screenshots have leaked out.

The game has been known as “Neon Prime” and “Citadel” at prior points. It’s a “Competitive third-person hero-based shooter,” with six-on-six battles across a map with four “lanes.” That allows for some of the “Tower defense mechanics” mentioned by Gabe Follower, along with “fast travel using floating rails, similar to Bioshock Infinite.” The maps reference a “modern steampunk European city (little bit like Half-Life),” after “bad feedback” about a sci-fi theme pushed the development team toward fantasy.

Since testers started sharing Deadlock screenshots all over the place, here’s ones I can verify, featuring one of the heroes called Grey Talon. pic.twitter.com/KdZSRxObSz

— ‎Gabe Follower (@gabefollower) May 17, 2024

Valve doesn’t release games often, and the games it does release are often in development for long periods. Deadlock purportedly started development in 2018, two years before Half-Life: Alyx existed. That the game has now seemingly reached a closed (though not closed enough) “alpha” playtesting phase, with players in the “hundreds,” could suggest release within a reasonable time. Longtime Valve watcher (and modder, and code examiner) Tyler McVicker suggests in a related video that Deadlock has hundreds of people playing in this closed test, and the release is “about to happen.”

McVicker adds to the descriptor pile-on by noting that it’s “team-based,” “hero-based,” “class-based,” and “personality-driven.” It’s an attempt, he says, to “bring together all of their communities under one umbrella.”

Tyler McVicker’s discussion of the leaked Deadlock content, featuring … BioShock Infinite footage.

Many of Valve’s games do something notable to push gaming technology and culture forward. Half-Life brought advanced scripting, physics, and atmosphere to the “Doom clones” field and forever changed it. Counter-Strike and Team Fortress 2 lead the way in team multiplayer dynamics. Dota 2 solidified and popularized MOBAs, and Half-Life: Alyx gave VR on PC its killer app. Yes, there are Artifact moments, but they’re more exception than rule.

Following any of those games seems like a tall order, but Valve’s track record speaks for itself. I think players like me, who never took to Valorant or Overwatch or the like, should reserve judgment until the game can be seen in its whole. I have to imagine that there’s more to Deadlock than a pile of very familiar elements.

Leaks from Valve’s Deadlock look like a pressed sandwich of every game around Read More »

“unprecedented”-google-cloud-event-wipes-out-customer-account-and-its-backups

“Unprecedented” Google Cloud event wipes out customer account and its backups

Bringing new meaning to “Killed By Google” —

UniSuper, a $135 billion pension account, details its cloud compute nightmare.

“Unprecedented” Google Cloud event wipes out customer account and its backups

Buried under the news from Google I/O this week is one of Google Cloud’s biggest blunders ever: Google’s Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper’s incident log, downtime started May 2, and a full restoration of services didn’t happen until May 15.

UniSuper’s website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled “A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian.” This statement reads, “Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription. This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”

In the next section, titled “Why did the outage last so long?” the joint statement says, “UniSuper had duplication in two geographies as a protection against outages and loss. However, when the deletion of UniSuper’s Private Cloud subscription occurred, it caused deletion across both of these geographies.” Every cloud service keeps full backups, which you would presume are meant for worst-case scenarios. Imagine some hacker takes over your server or the building your data is inside of collapses, or something like that. But no, the actual worst-case scenario is “Google deletes your account,” which means all those backups are gone, too. Google Cloud is supposed to have safeguards that don’t allow account deletion, but none of them worked apparently, and the only option was a restore from a separate cloud provider (shoutout to the hero at UniSuper who chose a multi-cloud solution).

UniSuper is an Australian “superannuation fund“—the US equivalent would be a 401(k). It’s a retirement fund that employers pay into as part of an employee paycheck; in Australia, some amount of superfund payment is required by law for all employed people. Managing $135 billion worth of funds makes UniSuper a big enough company that, if something goes wrong, it gets the Google Cloud CEO on the phone instead of customer service.

A June 2023 press release touted UniSuper’s big cloud migration to Google, with Sam Cooper, UniSuper’s Head of Architecture, saying, “With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It’s all about efficiencies that help us deliver highly competitive fees for our members.”

The many stakeholders in the service meant service restoration wasn’t just about restoring backups but also processing all the requests and payments that still needed to happen during the two weeks of downtime.

“Unprecedented” Google Cloud event wipes out customer account and its backups Read More »

using-vague-language-about-scientific-facts-misleads-readers

Using vague language about scientific facts misleads readers

Using vague language about scientific facts misleads readers

Anyone can do a simple experiment. Navigate to a search engine that offers suggested completions for what you type, and start typing “scientists believe.” When I did it, I got suggestions about the origin of whales, the evolution of animals, the root cause of narcolepsy, and more. The search results contained a long list of topics, like “How scientists believe the loss of Arctic sea ice will impact US weather patterns” or “Scientists believe Moon is 40 million years older than first thought.”

What do these all have in common? They’re misleading, at least in terms of how most people understand the word “believe.” In all these examples, scientists have become convinced via compelling evidence; these are more than just hunches or emotional compulsions. Given that difference, using “believe” isn’t really an accurate description. Yet all these examples come from searching Google News, and so are likely to come from journalistic outlets that care about accuracy.

Does the difference matter? A recent study suggests that it does. People who were shown headlines that used subjective verbs like “believe” tended to view the issue being described as a matter of opinion—even if that issue was solidly grounded in fact.

Fact vs. opinion

The new work was done by three researchers at Stanford University: Aaron Chueya, Yiwei Luob, and Ellen Markman. “Media consumption is central to how we form, maintain, and spread beliefs in the modern world,” they write. “Moreover, how content is presented may be as important as the content itself.” The presentation they’re interested in involves what they term “epistemic verbs,” or those that convey information about our certainty regarding information. To put that in concrete terms, “’Know’ presents [a statement] as a fact by presup­posing that it is true, ‘believe’ does not,” they argue.

So, while it’s accurate to say, “Scientists know the Earth is warming, and that warming is driven by human activity,” replacing “know” with “believe” presents an inaccurate picture of the state of our knowledge. Yet, as noted above, “scientists believe” is heavily used in the popular press. Chueya, Luob, and Markman decided to see whether this makes a difference.

They were interested in two related questions. One is whether the use of verbs like believe and think influences how readers view whether the concepts they’re associated with are subjective issues rather than objective, factual ones. The second is whether using that phrasing undercuts the readers’ willingness to accept something as a fact.

To answer those questions, the researchers used a subject-recruiting service called Prolific to recruit over 2,700 participants who took part in a number of individual experiments focused on these issues. In each experiment, participants were given a series of headlines and asked about what inferences they drew about the information presented in them.

Using vague language about scientific facts misleads readers Read More »

twitter-urls-redirect-to-x.com-as-musk-gets-closer-to-killing-the-twitter-name

Twitter URLs redirect to x.com as Musk gets closer to killing the Twitter name

Goodbye Twitter.com —

X.com stops redirecting to Twitter.com over a year after company name change.

An app icon and logo for Elon Musk's X service.

Getty Images | Kirill Kudryavtsev

Twitter.com links are now redirecting to the x.com domain as Elon Musk gets closer to wiping out the Twitter brand name over a year and half after buying the company.

“All core systems are now on X.com,” Musk wrote in an X post today. X also displayed a message to users that said, “We are letting you know that we are changing our URL, but your privacy and data protection settings remain the same.”

Musk bought Twitter in October 2022 and turned it into X Corp. in April 2023, but the social network continued to use Twitter.com as its primary domain for more than another year. X.com links redirected to Twitter.com during that time.

There were still remnants of Twitter after today’s change. This morning, I noticed a support link took me to a help.twitter.com page. The link subsequently redirected to a help.x.com page after I sent a message to X’s public relations email, though the timing could be coincidence. After sending that message to press@x.com, I got the standard auto-reply from press+noreply@twitter.com, just as I have in the past.

You might still encounter Twitter links that don’t redirect to x.com, depending on which browser you use. The Verge said it is “seeing a mix of results depending upon browser choice and whether you’re logged in or not.”

I had no trouble accessing x.com on desktop browsers today. But in Safari on iPhone, I received error messages when trying to access either twitter.com or x.com without first logging in. I eventually succeeded in logging in and was able to view content, but I remained at twitter.com in the iPhone browser instead of being redirected to x.com.

This will presumably be sorted out, but the awkward Twitter-to-X transition has previously been accompanied by technical problems. In early April, Musk’s service started automatically changing “twitter.com” to “x.com” in links posted by users in the iOS app. But the automatic text replacement initially applied to any URL ending in “twitter.com” even if it wasn’t actually a twitter.com link, which meant that phishers could have taken advantage by registering misleading domain names.

Twitter URLs redirect to x.com as Musk gets closer to killing the Twitter name Read More »

how-to-port-any-n64-game-to-the-pc-in-record-time

How to port any N64 game to the PC in record time

Enlarge / “N-tel (64) Inside”

Aurich Lawson | Getty Images

In recent years, we’ve reported on multiple efforts to reverse-engineer Nintendo 64 games into fully decompiled, human-readable C code that can then become the basis for full-fledged PC ports. While the results can be impressive, the decompilation process can take years of painstaking manual effort, meaning only the most popular N64 games are likely to get the requisite attention from reverse engineers.

Now, a newly released tool promises to vastly reduce the amount of human effort needed to get basic PC ports of most (if not all) N64 games. The N64 Recompiled project uses a process known as static recompilation to automate huge swaths of the labor-intensive process of drawing C code out of N64 binaries.

While human coding work is still needed to smooth out the edges, project lead Mr-Wiseguy told Ars that his recompilation tool is “the difference between weeks of work and years of work” when it comes to making a PC version of a classic N64 title. And parallel work on a powerful N64 graphic renderer means PC-enabled upgrades like smoother frame rates, resolution upscaling, and widescreen aspect ratios can be added with little effort.

Inspiration hits

Mr-Wiseguy told Ars he got his start in the N64 coding space working on various mod projects around 2020. In 2022, he started contributing to the then-new RT64 renderer project, which grew out of work on a ray-traced Super Mario 64 port into a more generalized effort to clean up the notoriously tricky process of recreating N64 graphics accurately. While working on that project, Mr-Wiseguy said he stumbled across an existing project that automates the disassembly of NES games and another that emulates an old SGI compiler to aid in the decompilation of N64 titles.

YouTuber Nerrel lays out some of the benefits of Mr-Wiseguy’s N64 recompilation tool.

“I realized it would be really easy to hook up the RT64 renderer to a game if it could be run through a similar static recompilation process,” Mr-Wiseguy told Ars. “So I put together a proof of concept to run a really simple game and then the project grew from there until it could run some of the more complex games.”

A basic proof of concept for Mr-Wiseguy’s idea took only “a couple of weeks at most” to get up and running, he said, and was ready as far back as November of 2022. Since then, months of off-and-on work have gone into rounding out the conversion code and getting a recompiled version of The Legend of Zelda: Majora’s Mask ready for public consumption.

Trust the process

At its most basic level, the N64 recompilation tool takes a raw game binary (provided by the user) and reprocesses every single instruction directly and literally into corresponding C code. The N64’s MIPS instruction set has been pretty well-documented over years of emulation work, so figuring out how to translate each individual opcode to its C equivalent isn’t too much of a hassle.

Wave Race 64.” height=”360″ src=”https://cdn.arstechnica.net/wp-content/uploads/2024/05/recomprt2-640×360.png” width=”640″>

Enlarge / An early beta of the RT64 renderer shows how ray-tracing shadows and reflections might look in a port of Wave Race 64.

The main difficulty, Mr-Wiseguy said, can be figuring out where to point the tool. “The contents of the [N64] ROM can be laid out however the developer chose to do so, which means you have to find where code is in the ROM before you can even start the static recompilation process,” he explained. And while N64 emulators automatically handle games that load and unload code throughout memory at runtime, handling those cases in a pre-compiled binary can add extra layers of complexity.

How to port any N64 game to the PC in record time Read More »