The phrase “biggest data leak” sounds precise until you look closely at the incidents people keep placing in that category. Yahoo’s 2013 theft eventually expanded to all 3 billion user accounts. Equifax exposed the personal information of 147 million people. Marriott’s count for the Starwood breach was revised to approximately 383 million guest records, with the company noting that the number of unique guests was lower. Meta’s 2021 Facebook episode involved more than 530 million users, but Meta said the data was scraped rather than taken through a direct hack of its systems. The count is the hook. The structure of the exposure is the real story.
That difference matters because the history of giant leaks is not one clean leaderboard. Some cases involve a single company losing a vast user base. Some involve state infrastructure. Some are sprawling compilations of old credential dumps. Some are technically “scrapes,” where attackers abuse public-facing features rather than breaking deep into a core database. Put all of them into one pile and the ranking becomes noisy. Separate them by type, and the pattern becomes much clearer.
The number that distorts the story
Raw scale is seductive because it compresses complexity into a single integer. A breach with billions of rows sounds automatically worse than one with millions. Yet a record is not always a person, a person is not always uniquely counted, and a leak can contain extensive duplication. That is why some of the most eye-catching numbers in cybersecurity history need an asterisk beside them. Rows, records, accounts, profiles, and individuals are not interchangeable units.
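The gap between rows and people can be made concrete with a small sketch. The rows below are invented for illustration and do not come from any real incident. The helper names `normalize` and `summarize` are hypothetical, and matching on a lowercased email address is an assumption that stands in for whatever identifier a real analysis would use.

```python
from collections import Counter

# Hypothetical leaked rows: (email, source) pairs, invented for illustration.
rows = [
    ("ana@example.com", "reservation-2015"),
    ("ana@example.com", "reservation-2017"),
    ("Ana@Example.com ", "loyalty-profile"),
    ("ben@example.com", "reservation-2016"),
    ("cy@example.com", "loyalty-profile"),
    ("cy@example.com", "reservation-2018"),
]


def normalize(email: str) -> str:
    """Trim whitespace and lowercase so trivially different spellings match."""
    return email.strip().lower()


def summarize(rows):
    """Return the row count, the unique-identifier count, and rows per identifier."""
    per_identifier = Counter(normalize(email) for email, _ in rows)
    total = len(rows)
    unique = len(per_identifier)
    return {"rows": total, "unique": unique, "rows_per_unique": total / unique}


summary = summarize(rows)
print(summary)  # {'rows': 6, 'unique': 3, 'rows_per_unique': 2.0}
```

Six rows collapse to three identifiers here, which is the same effect that makes a headline record count larger than the number of people behind it. A real data set would need far more careful matching than a lowercased email address.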
Scale by headline count
| Incident | Publicly cited scale | The catch |
|---|---|---|
| Yahoo 2013 | 3 billion accounts | A single provider, with all accounts said to be affected |
| Aadhaar 2018 | More than 1.1 billion users at risk | Repeated reports of exposure, but the authority denied a core database breach |
| National Public Data 2024 | 2.9 to 3 billion records claimed | The company confirmed an incident, but the biggest public corpus remains disputed |
Yahoo’s figure comes from the company’s own later notice that all accounts existing in 2013 were affected. Aadhaar belongs in any discussion of historical scale because the system covered more than 1.1 billion users and outside reporting described access to data at population scale, yet Indian authorities publicly denied that the central Aadhaar database had been breached. National Public Data confirmed a 2024 security incident and the exposure of highly sensitive data, but Have I Been Pwned classified the largest circulating breach set as unverified and said the full origin and accuracy remained in question.
Damage by data sensitivity
| Incident | Affected population | Why the fallout felt heavier |
|---|---|---|
| Equifax 2017 | 147 million people | Credit-file identity data, including Social Security numbers |
| Marriott Starwood 2018 | Up to 383 million records | Travel histories, passport data, and payment-card exposure |
| Facebook scrape 2021 | More than 530 million users | Massive platform-scale profile and phone data, widely reusable for fraud |
Equifax remains one of the defining cases because 147 million is large enough to be national in scope and the data was exceptionally useful for identity theft. Marriott’s Starwood breach involved passport numbers, payment-card data, dates of birth, and reservation details, making it more intimate than a routine email-and-password dump. The Facebook incident showed a different route to scale altogether: Meta said malicious actors gathered the data by scraping a contact importer feature, not by hacking the company’s systems, and the exposed set did not include passwords or financial data.
Yahoo turned scale into a category of its own
Yahoo still sits at the center of any honest discussion about the biggest breaches ever disclosed. In 2017, after new forensic work, Yahoo said that the 2013 theft had affected all Yahoo user accounts, not the roughly one billion first disclosed. That pushed the event into a class of its own: not merely a giant breach, but a breach that effectively reached the entire platform population of the time.
The significance of Yahoo was not just volume. It also changed the way people talked about disclosure lag, acquisition risk, and the hidden liabilities of legacy internet empires. The company’s notice said the stolen information did not include clear-text passwords, payment-card data, or bank-account information, but that did not make the incident small. Names, email addresses, phone numbers, birth dates, and security-question material are enough to fuel account takeovers, social engineering, and years of secondary attacks. Yahoo proved that a platform can survive long enough for its user base to become infrastructure, and then fail at infrastructure scale.
Aadhaar exposed the danger of population-scale identity systems
Aadhaar belongs in this history for a different reason. It is less tidy than Yahoo because the category itself is contested. Reuters reported in 2018 that another major security lapse involving a utility-linked system could expose names, Aadhaar numbers, and bank details, while also noting that Aadhaar had more than 1.1 billion users. Earlier reporting described journalists being able to buy access to personal details from the system, and more than 200 government websites had already been cited for making Aadhaar-linked information public. UIDAI, the Indian authority behind Aadhaar, repeatedly denied that the core database had been breached.
That tension is exactly why Aadhaar matters. Even where officials dispute the label “breach,” repeated exposures around a national identity architecture show how fragile the promise of centralized trust can become. A corporate account can be reset. A national identity number tied to public services is harder to rotate, harder to compartmentalize, and harder to escape. The deeper lesson is not simply that central databases are attractive targets. It is that population-scale identity systems magnify the social cost of every design mistake around access, APIs, third parties, and public-facing integrations.
The breaches that hurt differently
Equifax is the cleanest example of a breach that looks smaller on a leaderboard than it feels in real life. The FTC’s settlement page still describes the incident as exposing the personal information of 147 million people. On raw count alone, that places it far below Yahoo. On damage potential, it lives in the top tier because the exposed information mapped directly onto credit fraud and long-tail identity theft. A stolen password is dangerous. A stolen identity spine is worse.
Marriott’s Starwood incident sits in another category again. The company disclosed that unauthorized access had existed in the Starwood network since 2014 and later revised the upper limit to approximately 383 million guest records, adding that the number of unique guests was lower because many people had multiple records. What made Marriott especially corrosive was the density of personal detail: passport numbers, reservation histories, payment-card numbers, travel timelines, and communication preferences. That is not merely account data. It is movement data, and movement data tells a far richer story about a person’s life.
The Facebook scrape of 2021 sharpened a different point. Meta said the data came from abuse of a contact importer before September 2019 and emphasized that the set did not include passwords, financial information, or health information. Even so, more than 530 million users in a single publicly circulated data set is enough to industrialize phishing, impersonation, SIM-swapping attempts, and targeted fraud. A scraped platform at that scale behaves like a breach in the lives of the people exposed, even if the company prefers a narrower technical label.
The age of recycled mega-dumps
One reason modern rankings feel unstable is that giant leak culture no longer depends on one dramatic intrusion. It also depends on aggregation. Collection #1 in 2019 contained almost 2.7 billion records and about 773 million unique email addresses, but it was not a fresh compromise of one service. It was a massive collection of credential-stuffing lists assembled from data taken from many earlier breaches. That distinction matters because aggregation changes the meaning of the number. The headline can be enormous while the novelty of the data is mixed.
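A quick calculation shows how far the headline count sits from the unique count in Collection #1. The sketch below uses only the rounded figures quoted above, so the results are approximate. The function name `records_per_unique` is a hypothetical helper, not part of any published tooling.

```python
# Rounded figures quoted in the text for Collection #1 (approximate).
records = 2_700_000_000
unique_emails = 773_000_000


def records_per_unique(records: int, unique_ids: int) -> float:
    """Average number of records attached to each unique identifier."""
    return records / unique_ids


ratio = records_per_unique(records, unique_emails)
share = unique_emails / records

print(f"{ratio:.1f} records per unique email address")
print(f"{share:.1%} of the headline count is unique email addresses")
```

On these rounded inputs, each unique address appears about 3.5 times on average, and unique addresses make up a little under 29 percent of the headline figure. The ratio is only a rough indicator: it does not say how many rows are exact duplicates and how many pair the same address with different passwords.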
The 2024 National Public Data incident landed in the same uneasy landscape. The company’s own notice acknowledged a security incident involving names, email addresses, phone numbers, Social Security numbers, and mailing addresses. At the same time, Have I Been Pwned treated the biggest widely discussed breach set as unverified, noting that while billions of rows circulated publicly, the origin and accuracy of the full corpus remained in question. That is the modern problem in a sentence: the data can be dangerous even when the public claim is messy.
The so-called Mother of All Breaches pushed that logic even further. Reporting on the 26-billion-record discovery described it not as a single clean event but as a compilation likely containing substantial duplication and data drawn from many past breaches. These mega-dumps matter because they lower the cost of abuse: attackers do not need to steal from every company themselves when the work has already been consolidated. But they also muddy public understanding. A giant repository of old and new data is a serious risk, just not the same thing as a single organization freshly losing 26 billion unique records.
The real lesson sits below the headline
The biggest leaks in history are not valuable as trivia. They are valuable because they reveal the shape of digital dependence. Yahoo showed what happens when a platform reaches civilizational scale and security fails late. Aadhaar showed the special danger of identity systems woven into daily state functions. Equifax showed that less volume can still mean more personal harm. Marriott showed that travel, loyalty, and passport data can turn a hospitality company into a surveillance archive. Facebook showed that public-facing product design can become a collection mechanism for attackers. Collection #1, National Public Data, and the mega-dumps that followed showed that once data escapes, it rarely stays in its original container.
So the largest leaks in history are not just the ones with the biggest numbers. They are the ones that changed the way data behaves after exposure. The enduring damage is not the breach event itself. It is the creation of a permanent afterlife for personal information, which is copied, traded, repackaged, and reused long after the original failure has faded from the news.
Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below.
Yahoo Provides Notice to Additional Users Affected by Previously Disclosed 2013 Data Theft
Yahoo’s 2017 notice stating that all Yahoo accounts existing in 2013 were affected by the earlier theft.
https://www.sec.gov/Archives/edgar/data/732712/000073271217000003/a2017_10x3xoathxexhibitx991.htm
Equifax Data Breach Settlement
FTC overview confirming that the Equifax breach exposed the personal information of 147 million people.
https://www.ftc.gov/enforcement/refunds/equifax-data-breach-settlement
Marriott International 2018 annual report
SEC filing describing the Starwood breach, the 383 million record upper limit, and the categories of data involved.
https://www.sec.gov/Archives/edgar/data/1048286/000162828019002337/mar-q42018x10k.htm
The Facts on News Reports About Facebook Data
Meta’s explanation of the 530 million user data set, including its claim that the data was scraped rather than hacked.
https://about.fb.com/news/2021/04/facts-on-news-reports-about-facebook-data/
Collection #1 Data Breach
Have I Been Pwned summary of the 2019 credential compilation containing almost 2.7 billion records and 773 million unique email addresses.
https://haveibeenpwned.com/Breach/Collection1
The 773 Million Record Collection #1 Data Breach
Troy Hunt’s detailed analysis of Collection #1 and the scale of the exposed credential corpus.
https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/
Security Incident
National Public Data’s public notice describing the 2024 incident and the types of information involved.
https://nationalpublicdata.com/breach.html
National Public Data Breach
Have I Been Pwned entry marking the largest circulating National Public Data set as unverified and explaining the uncertainty around origin and accuracy.
https://haveibeenpwned.com/Breach/NationalPublicData
New data leak hits India’s national ID card database Aadhaar
Reuters report on a 2018 Aadhaar-related security lapse and the scale of the national ID system.
https://www.reuters.com/article/world/new-data-leak-hits-indias-national-id-card-database-aadhaar-zdnet-idUSKBN1H00K2/
Personal data of a billion Indians sold online for £6, report claims
Guardian report summarizing Tribune’s Aadhaar access investigation, UIDAI’s denial, and prior public exposures.
https://www.theguardian.com/world/2018/jan/04/india-national-id-database-data-leak-bought-online-aadhaar
Press Note on Aadhaar report denial
Official Indian government note stating UIDAI denied the Tribune report and said Aadhaar data remained secure.
https://www.pib.gov.in/PressReleasePage.aspx?PRID=1515503
Client Alert 26 Billion Records Exposed in the Mother of All Breaches
Summary explaining that the MOAB dataset was a compilation likely containing duplication and data from past breaches.
https://www.ajg.com/news-and-insights/client-alert-26-billion-records-exposed-in-the-mother-of-all-breaches/



