Sunday, May 10, 2020

Ferguson Covid-19 model predicting 2.2 million US deaths still not released nor peer reviewed. “Cleaned up” version provided by Microsoft/GitHub is unusable for scientific purposes-Sue Denim, GitHub...(2.2 million deaths prediction was used to justify $2 trillion taxpayer funded virus bailout)

“It isn’t the code Ferguson ran to produce his famous Report 9. What’s been released on GitHub is a heavily modified derivative of it, after having been upgraded for over a month by a team from Microsoft and others….Clearly, Imperial are too embarrassed by the state of it…which is unacceptable given that it was paid for by the taxpayer and belongs to them.…This problem makes the code unusable for scientific purposes, given that a key part of the scientific method is the ability to replicate results. Without replication, the findings might not be real at all.”…[GitHub is owned by Microsoft]…“The code should have been made available to all other Profs & top Coders and Data Scientists & Bio-Statisticians to PEER Review BEFORE the UK and USA Gvts made their decisions.” Commenter Sean Flanagan

5/6/20, “Code Review of Ferguson’s Model,” LockdownSceptics.org, by Sue Denim (not the author’s real name)

“Imperial [College London] finally released a derivative of Ferguson’s code. I figured I’d do a review of it and send you some of the things I noticed. I don’t know your background so apologies if some of this is pitched at the wrong level.

My background. I wrote software for 30 years. I worked at Google between 2006 and 2014, where I was a senior software engineer working on Maps, Gmail and account security. I spent the last five years at a US/UK firm where I designed the company’s database product, amongst other jobs and projects. I was also an independent consultant for a couple of years. Obviously I’m giving only my own professional opinion and not speaking for my current employer. 

The code. It isn’t the code Ferguson ran to produce his famous Report 9. What’s been released on GitHub is a heavily modified derivative of it, after having been upgraded for over a month by a team from Microsoft and others. This revised codebase is split into multiple files for legibility and written in C++, whereas the original program was “a single 15,000 line file that had been worked on for a decade” (this is considered extremely poor practice). A request for the original code was made 8 days ago but ignored, and it will probably take some kind of legal compulsion to make them release it. Clearly, Imperial are too embarrassed by the state of it ever to release it of their own free will, which is unacceptable given that it was paid for by the taxpayer and belongs to them.

The model. What it’s doing is best described as “SimCity without the graphics”. It attempts to simulate households, schools, offices, people and their movements, etc. I won’t go further into the underlying assumptions, since that’s well explored elsewhere. 

Non-deterministic outputs. Due to bugs, the code can produce very different results given identical inputs. They routinely act as if this is unimportant. 

This problem makes the code unusable for scientific purposes, given that a key part of the scientific method is the ability to replicate results. Without replication, the findings might not be real at all – as the field of psychology has been finding out to its cost. Even if their original code was released, it’s apparent that the same numbers as in Report 9 might not come out of it. Non-deterministic outputs may take some explanation, as it’s not something anyone previously floated as a possibility. 

The documentation says: “The model is stochastic. Multiple runs with different seeds should be undertaken to see average behaviour.” “Stochastic” is just a scientific-sounding word for “random”. That’s not a problem if the randomness is intentional pseudo-randomness, i.e. the randomness is derived from a starting “seed” which is iterated to produce the random numbers. Such randomness is often used in Monte Carlo techniques. It’s safe because the seed can be recorded and the same (pseudo-)random numbers produced from it in future. Any kid who’s played Minecraft is familiar with pseudo-randomness because Minecraft gives you the seeds it uses to generate the random worlds, so by sharing seeds you can share worlds. 
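As a minimal sketch of the seeded pseudo-randomness described above (written in Python rather than the model's C++, and using a toy `simulate` function that is purely hypothetical and not part of the Imperial code), recording the seed is enough to reproduce a stochastic run exactly:

```python
import random


def simulate(seed: int, steps: int = 5) -> list[float]:
    """Toy stochastic 'model' (hypothetical): draws `steps` numbers from a seeded generator."""
    rng = random.Random(seed)  # private generator, so no hidden global state
    return [rng.random() for _ in range(steps)]


run_a = simulate(seed=42)
run_b = simulate(seed=42)  # same seed: identical sequence
run_c = simulate(seed=43)  # different seed: a different sequence

print(run_a == run_b)  # True
print(run_a == run_c)  # False
```

A program that behaves this way can still be averaged over many seeds, but any single run can be replayed and checked.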

Clearly, the documentation wants us to think that, given a starting seed, the model will always produce the same results. Investigation reveals the truth: the code produces critically different results, even for identical starting seeds and parameters. 

I’ll illustrate with a few bugs. In issue #116 a UK “red team” at Edinburgh University reports that they tried to use a mode that stores data tables in a more efficient format for faster loading, and discovered – to their surprise – that the resulting predictions varied by around 80,000 deaths. That mode doesn’t change anything about the world being simulated, so this was obviously a bug.
 

The Imperial team’s response is that it doesn’t matter: they are “aware of some small non-determinisms”, but “this has historically been considered acceptable because of the general stochastic nature of the model”. Note the phrasing here: Imperial know their code has such bugs, but act as if it’s some inherent randomness of the universe, rather than a result of amateur coding.

Apparently, in epidemiology, a difference of 80,000 deaths is “a small non-determinism”. 

Imperial advised Edinburgh that the problem goes away if you run the model in single-threaded mode, like they do. This means they suggest using only a single CPU core rather than the many cores that any video game would successfully use. For a simulation of a country, using only a single CPU core is obviously a dire problem – as far from supercomputing as you can get. Nonetheless, that’s how Imperial use the code: they know it breaks when they try to run it faster. It’s clear from reading the code that in 2014 Imperial tried to make the code use multiple CPUs to speed it up, but never made it work reliably. This sort of programming is known to be difficult and usually requires senior, experienced engineers to get good results. Results that randomly change from run to run are a common consequence of thread-safety bugs. More colloquially, these are known as “Heisenbugs”.
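One way such run-to-run variation can arise, sketched here under stated assumptions: if all parallel tasks draw from one shared random generator, each task's numbers depend on the order in which the tasks happen to run. The sketch imitates two thread schedules with two task orderings instead of real threads, and the function names and the seed-derivation formula are hypothetical illustrations, not taken from the Imperial code.

```python
import random


def shared_rng_run(task_order: list[int], seed: int) -> dict[int, float]:
    """All tasks draw from one shared generator, so each task's draw depends on
    the order in which tasks happen to execute."""
    rng = random.Random(seed)
    return {task: rng.random() for task in task_order}


def per_task_rng_run(task_order: list[int], seed: int) -> dict[int, float]:
    """Each task gets its own generator derived from (seed, task id), so its draw
    is the same whatever the execution order. The derivation is a simple
    illustrative formula, not a recommended production scheme."""
    return {task: random.Random(seed * 1_000_003 + task).random() for task in task_order}


order_a = [0, 1, 2, 3]  # one possible schedule
order_b = [3, 1, 0, 2]  # another schedule of the same tasks

print(shared_rng_run(order_a, 7) == shared_rng_run(order_b, 7))      # False
print(per_task_rng_run(order_a, 7) == per_task_rng_run(order_b, 7))  # True
```

With the second design, the same seed gives the same per-task results regardless of how many cores are used or how the work is scheduled.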

But Edinburgh came back and reported that – even in single-threaded mode – they still see the problem. So Imperial’s understanding of the issue is wrong. Finally, Imperial admit there’s a bug by referencing a code change they’ve made that fixes it. The explanation given is “It looks like historically the second pair of seeds had been used at this point, to make the runs identical regardless of how the network was made, but that this had been changed when seed-resetting was implemented”. In other words, in the process of changing the model they made it non-replicable and never noticed.

Why didn’t they notice? Because their code is so deeply riddled with similar bugs and they struggled so much to fix them that they got into the habit of simply averaging the results of multiple runs to cover it up…and eventually this behaviour became normalised within the team. 

In issue #30, someone reports that the model produces different outputs depending on what kind of computer it’s run on (regardless of the number of CPUs). Again, the explanation is that although this new problem “will just add to the issues”…“This isn’t a problem running the model in full as it is stochastic anyway”. 

Although the academic on those threads isn’t Neil Ferguson, he is well aware that the code is filled with bugs that create random results. In change #107, which he authored, he comments: “It includes fixes to InitModel to ensure deterministic runs with holidays enabled”. In change #158 he describes the change only as “A lot of small changes, some critical to determinacy”.

Imperial are trying to have their cake and eat it. Reports of random results are dismissed with responses like “that’s not a problem, just run it a lot of times and take the average”, but at the same time, they’re fixing such bugs when they find them. They know their code can’t withstand scrutiny, so they hid it until professionals had a chance to fix it, but the damage from over a decade of amateur hobby programming is so extensive that even Microsoft were unable to make it run right. 

No tests. In the discussion of the fix for the first bug, Imperial state the code used to be deterministic in that place but they broke it without noticing when changing the code. 

Regressions like that are common when working on a complex piece of software, which is why industrial software-engineering teams write automated regression tests. These are programs that run the program with varying inputs and then check the outputs are what’s expected. Every proposed change is run against every test and if any tests fail, the change may not be made.
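A minimal sketch of the idea, assuming a toy `run_model` function and a small table of recorded outputs (both hypothetical, not taken from the Imperial code): the test reruns the program on fixed inputs and reports any case whose output no longer matches the recorded value.

```python
def run_model(initial_cases: int, days: int, growth: int = 2) -> int:
    """Hypothetical stand-in for a simulation: cases multiply by `growth` each day."""
    cases = initial_cases
    for _ in range(days):
        cases *= growth
    return cases


# Expected outputs recorded from a known-good version of the program.
GOLDEN = {
    (10, 0): 10,
    (10, 5): 320,
    (3, 4): 48,
}


def failing_cases(model) -> list[tuple[int, int]]:
    """Return the inputs whose output no longer matches the recorded value."""
    return [args for args, expected in GOLDEN.items() if model(*args) != expected]


print(failing_cases(run_model))  # [] -> the change may be merged


# A "regression": someone alters the behaviour without noticing.
def broken_model(initial_cases: int, days: int) -> int:
    return run_model(initial_cases, days, growth=3)


print(failing_cases(broken_model))  # [(10, 5), (3, 4)]
```

A check like this only works if the program gives the same output for the same inputs, which is why unintended non-determinism and missing tests tend to go together.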

The Imperial code doesn’t seem to have working regression tests. They tried, but the extent of the random behaviour in their code left them defeated. On 4th April they said: “However, we haven’t had the time to work out a scalable and maintainable way of running the regression test in a way that allows a small amount of variation, but doesn’t let the figures drift over time.” 
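One possible shape for a test that "allows a small amount of variation, but doesn’t let the figures drift over time" is to compare every new output with a fixed recorded baseline rather than with the previous version's output. The sketch below is an illustration only; the baseline figure, the 1% threshold and the sequence of outputs are all made-up assumptions.

```python
import math

BASELINE = 1000.0     # figure recorded once from a trusted run (hypothetical value)
REL_TOLERANCE = 0.01  # accept up to 1% variation (hypothetical threshold)


def within_tolerance(value: float, reference: float) -> bool:
    """True if `value` is within the relative tolerance of `reference`."""
    return math.isclose(value, reference, rel_tol=REL_TOLERANCE)


# Outputs of five successive code versions, each 0.8% above the one before.
outputs = [1000.0 * 1.008 ** n for n in range(1, 6)]

# Comparing each version only with its predecessor never fails, so drift goes unnoticed.
vs_previous = [within_tolerance(b, a) for a, b in zip([BASELINE] + outputs, outputs)]

# Comparing every version with the fixed baseline catches the accumulated drift.
vs_baseline = [within_tolerance(v, BASELINE) for v in outputs]

print(vs_previous)  # [True, True, True, True, True]
print(vs_baseline)  # [True, False, False, False, False]
```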

Beyond the apparently unsalvageable nature of this specific codebase, testing model predictions faces a fundamental problem, in that the authors don’t know what the “correct” answer is until long after the fact, and by then the code has changed again anyway, thus changing the set of bugs in it. So it’s unclear what regression tests really mean for models like this – even if they had some that worked. 

Undocumented equations. Much of the code consists of formulas for which no purpose is given. John Carmack (a legendary video-game programmer) surmised that some of the code might have been automatically translated from FORTRAN some years ago. 

For example, on line 510 of SetupModel.cpp there is a loop over all the “places” the simulation knows about. This code appears to be trying to calculate R0 for “places”. Hotels are excluded during this pass, without explanation.

This bit of code highlights an issue Caswell Bligh has discussed in your site’s comments: R0 isn’t a real characteristic of the virus. R0 is both an input to and an output of these models, and is routinely adjusted for different environments and situations. Models that consume their own outputs as inputs are a problem well known to the private sector – it can lead to rapid divergence and incorrect prediction. There’s a discussion of this problem in section 2.2 of the Google paper, “Machine learning: the high interest credit card of technical debt”.

Continuing development. Despite being aware of the severe problems in their code that they “haven’t had time” to fix, the Imperial team continue to add new features; for instance, the model attempts to simulate the impact of digital contact tracing apps. 

Adding new features to a codebase with this many quality problems will just compound them and make them worse. If I saw this in a company I was consulting for I’d immediately advise them to halt new feature development until thorough regression testing was in place and code quality had been improved. 

Conclusions. All papers based on this code should be retracted immediately. Imperial’s modelling efforts should be reset with a new team that isn’t under Professor Ferguson, and which has a commitment to replicable results with published code from day one. 

On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector. Insurers employ modellers and data scientists, but also employ managers whose job is to decide whether a model is accurate enough for real world usage and professional software engineers to ensure model software is properly tested, understandable and so on. Academic efforts don’t have these people, and the results speak for themselves.” 

“My identity. Sue Denim isn’t a real person (read it out). I’ve chosen to remain anonymous partly because of the intense fighting that surrounds lockdown, but there’s also a deeper reason. This situation has come about due to rampant credentialism and I’m tired of it. As the widespread dismay by programmers demonstrates, if anyone in SAGE or the Government had shown the code to a working software engineer they happened to know, alarm bells would have been rung immediately. Instead, the Government is dominated by academics who apparently felt unable to question anything done by a fellow professor. Meanwhile, average citizens like myself are told we should never question “expertise”. Although I’ve proven my Google employment to Toby, this mentality is damaging and needs to end: please, evaluate the claims I’ve made for yourself, or ask a programmer you know and trust to evaluate them for you.” 

………………………………………………………
Among comments:
……………………………………………………

"Simon Conway-Smith  

I had hoped Donald Trump would be a stronger leader than that, and insisted on any model being independently and repeatedly verified before making any decision." [Excerpt from full comment below.]
......
Will Jones
Devastating. Heads must roll for this, and fundamental changes be made to the way government relates to academics and the standards expected of researchers. Imperial College should be ashamed of themselves. 
Lms2
The UK government should be just as ashamed for taking their advice. And anyone in the media who repeated their nonsense.
Robert
The problem is the nature of government and politics. Politics is a systematic way of transferring the consequences of inadequate or even reckless decision-making to others without the consent or often even the knowledge of those others. Politics and science are inherently antithetical. Science is about discovering the truth, no matter how inconvenient or unwelcome it may be to particular interested parties. Politics is about accomplishing the goal of interested parties and hiding any truth that would tend to impede that goal. The problem is not that “government has been doing it wrong;” the problem is that government has been doing it.
Mimi
Thank you so much for this! This code should’ve been available from the outset.
Sean Flanagan
Amateur Hour all round! The code should have been made available to all other Profs & top Coders & Data Scientists & Bio-Statisticians to PEER Review BEFORE the UK and USA Gvts made their decisions. Imperial should be sued for such amateur work.
Caswell Bligh
This is an outstanding investigation. Many thanks for doing it – and to Toby for providing a place to publish it.
lesg
So this is ‘the science’ that the Government thinks it is following!
ChrisH29
This isn’t a piece of poor software for a computer game, it is, apparently, the useless software that has shut down the entire western economy. Not only will it have wasted staggeringly vast sums of money but every day we are hearing of the lives that will be lost as a result. We are today learning of 1.4 million avoidable deaths from TB but that is nothing compared to the UN’s own forecast of “famine on a biblical scale”. Does one think that the odious, inept, morally bankrupt hypocrite, Ferguson, will feel any shame, sorrow or remorse if, heaven forbid, the news in a couple of months’ time is dominated by the deaths of hundreds of thousands of children from starvation in the 3rd World or will his hubris protect him?
speedy
I don’t understand why governments are still going for this ridiculous policy and NGOs all pretend it is Covid 19 that will cause this devastation RATHER than our reaction to it.
Simon Conway-Smith
Why any of this isn’t obvious to our politicians says a lot about our politicians, but your summary also shows that it is ENGINEERS and not academics that should be generating the input to policy making. It is only engineers who have the discipline to make things work, properly and reliably.
Chris Martin
This kind of thing frequently happens with academic research. I’m a statistician and I hate working with academics for exactly this sort of reason.
skeptik
the global warming models are secret too (mostly) and probably the same kind of mess as this code
Jeremy Crawford
Just wonderful and sadly utterly devastating. As an IT bod myself and early days skeptic this was such a pleasure to read. Well done.
Mike Haseler
Thanks for doing the analysis. Totally agree that leaving this kind of job to amateur academics is completely nonsensical. I like your suggestion of using the insurance industry and if I were PM I would take that up immediately.
Andy Riley
Look at SetupModel.cpp from line 2060 – pages of nested conditionals and loops with nary a comment. Nightmare!
Alicat2441
Haven’t time to read the article and stopped at the portion where the data can’t be replicated. That right there is a huuuuuuge red flag and makes the “models” useless. I’ll come back tonight to finish reading. I have to ask: Is this the same with the University of Washington IHME models? Why do I have a sneaking suspicion that it is?
Laurence_R
The IHME [Bill Gates] ‘model’ is much worse – it’s just a simple exercise in curve fitting, with little or no actual modelling happening at all. I have collected screenshots of its predictions (for the US, UK, Italy, Spain, Sweden) every few days over the last few weeks, so I could track them against reality, and it is completely useless. But, according to what I’ve read, the US government trusts it!
Until a few days ago, its curves didn’t even look plausible – for countries on a downward trend (e.g. Italy and Spain), they showed the numbers falling off a cliff and going down to almost zero within days, and for countries still on an upward trend (e.g. the UK and Sweden) they were very pessimistic. However, the figures for the US were strangely optimistic – maybe that’s why the White House liked them.

They seem to have changed their model in the last few days – the curves look more plausible now. However, plausible looking curves mean nothing – any one of us could take the existing data (up to today) and ‘extrapolate’ a curve into the future. So plausibility means nothing – it’s just making stuff up based on pseudo-science. In the UK, we’re not supposed to dissent, because that implies that we don’t want to ‘save lives’ or ‘protect the NHS’, so the pessimistic model wins. In the US, it’s different, depending on people’s politics, so I’m not going to try to analyse that. 

So why do governments leap at these pseudo-models with their useless (but plausible-looking) predictions?...If there are competing crystal balls from different academics, the government will simply pick the one that matches its philosophy best, and claim that it is ‘following the science’.
Robin66
This is scary stuff. I’ve been a professional developer and researcher in the finance sector for 12 years. My background is a Physics PhD. I have seen this sort of single file code structure a lot and it is a minefield for bugs. This can be mitigated to some extent by regression tests but it’s only as good as the number of test scenarios that have been written. Randomness cannot just be dismissed like this. It is difficult to nail down non-determinism but it can be done and requires the developer to adopt some standard practices to lock down the computation path. It sounds like the team have lost control of their codebase and have their heads in the sand. I wouldn’t invest money in a fund that was so shoddily run. The fact that the future of the country depends on such code is a scandal.
dr_t
Ferguson’s code is 30 years old. This review criticizes it as though it was written today, but many of these criticisms are simply not valid when applied to code that’s 30 years old. It was normal to write code that way 30 years ago. Monolithic code was much more common, especially for programs that were not meant to produce reusable components….
It’s perfectly normal not to want to disclose 30 year old code because, as has been proven by this very review, people will look at it and criticize it as if it was modern code.
So Ferguson evidently rewrote his program to be more consistent with modern coding standards before releasing it. And probably introduced a couple of bugs in the process. Given the fact that the original code was undocumented, old, and that he was under time pressure to produce it in a hurry, it would have been strange if this didn’t introduce some bugs. This does not, per se, invalidate the model….
MFP
I read the author’s discussion of the single-thread/multi-thread issue not so much as a criticism but as a rebuttal to possible counter-arguments. I agree it probably should have been left out (or relegated to a footnote), but the rest of the author’s arguments stand independently of the multi-thread issues.
I disagree with your framing of the author’s other criticisms as amounting to criticism of stochastic models. It does not appear the author has an issue with stochastic models, but rather with models where it is impossible to determine whether the variation in outputs is a product of intended pseudo-randomness or whether the variation is a product of unintended variability in the underlying process.
Paul Penrose
dr_t, I am also a Software Engineer with over 35 years of experience, so I understand what you are saying as far as 30 year old code, however if the software is not fit for purpose because it is riddled with bugs, then it should not be used for making policy decisions. And frankly I don’t care how old the code is, if it is poorly written and documented, then it should be thrown out and rewritten, otherwise it is useless.
As a side note, I currently work on a code base that is pure C and close to 30 years old. It is properly composed of manageable sized units and reasonably organized. It also has up to date function specifications and decent regression tests. When this was written, these were probably cutting-edge ideas, but clearly weren’t unknown. Since then we’ve upgraded to using current tech compilers, source code repositories, and critical peer review of all changes.
So there really is no excuse for using software models that are so deficient. The problem is these academics are ignorant of professional standards in software development and frankly don’t care. I’ve worked with a few over the course of my career and that has been my experience every time.
skeptik
I agree 100%, I wrote C/C++ code for years and this single file atrocity reminds me of student code.
Neil
The fact it wasn’t refactored in 30 years is a sin plain and simple.
Robbo
Testing is already indicating that huge numbers of the global population have already caught it. The virus has been in Europe since December at the latest, and as more information comes to light, that date will likely be moved significantly backwards. If the R0 is to be believed, the natural peak would have been hit, with or without lockdown, in March or April. That is what we have seen. This virus will be proven to be less deadly than a bad strain of influenza, with or without a vaccinated population. Total deaths have only peaked post lockdown. That is not a coincidence.
Bumble
This model assumes first infections at least two months too late. The unsuppressed peak was supposed to be mid May (the ‘terrifying’ graph) so what we have seen in April is likely the real peak and lockdown has had no impact on the virus. Lockdown will have killed far more people.
SteveB   

Peak deaths in NHS hospitals in England were 874 on [4/08] 08/04. A week earlier, on [4/01] 01/04, there were 607 deaths. Crude Rt = 874/607 = 1.4. On average, a patient dying on [4/08] 08/04 would have been infected c. 17 days earlier on 22/03. So, by [3/22] 22/03 (before the full lockdown), Rt was (only) approx 1.4. Ok, so that doesn’t tell us too much, but if we repeat the calculation and go back a further week to [3/15] 15/03, Rt was approx 2.3. Another week back to [3/08] 08/03 and it was approximately 4.0. Propagating forward a week from [3/22] 22/03, Rt then fell to 0.8 on [3/29] 29/03.
So you can see that Rt fell from 4.0 to 1.4 over the two weeks preceding the full lockdown and then from 1.4 to 0.8 over the following week, pretty much following the same trend regardless.
So, using the data we can see that we could have predicted the peak before the lockdown occurred, simply using the trend of Rt.
In my hypothesis, this was a consequence of limited social distancing (but not full lockdown) and the virus beginning to burn itself out naturally, with very large numbers of asymptomatic infections and a degree of prior immunity.
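The arithmetic in this comment can be reproduced with a short sketch. It uses only the two death counts and the 17-day lag quoted above; the function name and the choice of 2020 as the year are assumptions for illustration, and the week-on-week ratio is the commenter's crude proxy, not a formal estimate of Rt.

```python
from datetime import date, timedelta


def crude_rt(deaths_now: int, deaths_week_earlier: int) -> float:
    """Week-on-week ratio of daily deaths, used in the comment as a crude proxy for Rt."""
    return deaths_now / deaths_week_earlier


peak = date(2020, 4, 8)                       # 08/04, 874 deaths
infection_date = peak - timedelta(days=17)    # assumed average infection-to-death lag

print(round(crude_rt(874, 607), 1))  # 1.4
print(infection_date.isoformat())    # 2020-03-22
```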
…………………………… 

silent one  

What are the numbers of those who have died FROM covid 19, how are those written on the death certificates, and how is it that those who die of a disease other than covid 19 are also included as covid 19 deaths when they were only infected by covid 19? As we know there are asymptomatic carriers, so there MUST be deaths where they had covid but it was not a factor in those deaths, yet they were included on the death certificate. The numbers of deaths that have been attributable to covid 19 have been over-inflated. Never mind that the test is for a general coronavirus and not specific to covid 19.
………………………………..

Tom Welsh 

“Flu season deaths top 80,000 last year, CDC says”
By Susan Scutti, CNN
Updated 1645 GMT (0045 HKT) September 27, 2018
https://edition.cnn.com/2018/09/26/health/flu-deaths-2017--2018-cdc-bn/index.html
Russ Nelson
Right, but you’re comparing apples to oranges. Compare Covid-19 to other pandemics, like 1917, 1957, or 1968.
Chebyshev
Maybe it is not “despite” but “because of”? If you start the lockdown as late as March, then you ensure that infection and death rates are going to be higher because of high dosage and fragile immune system that comes from lockdown.
There are plenty of countries without lockdown to compare against. So it is not an unverifiable hypothesis.
Epictetus
Yes but the manner in which they count COVID-19 deaths is flawed. Even with co-morbidity they ascribe to COVID, and in cases where they do not test but there were COVID-like symptoms, they ascribe it to COVID according to CDC.
Bazza McKenzie
Most governments are busily fudging the numbers up, to ex-post “justify” the extreme and massively damaging actions they imposed on communities and to gain financial benefit (e.g. states and hospitals which get larger payouts for Wuhan virus treatment than for treatment for other diseases).
As with “global warming”, the politicians, bureaucrats and academics are circling the wagons together to protect their interlinked interests.
David Blackall
“The virus has been in Europe since December at the latest” https://www.sciencedirect.com/science/article/pii/S1567134820301829?via%3Dihub

..................
