⚒️ breach-data-storage

Adventures in processing and storing massive amounts of data

🕵️ What this is

Investigators use any suitable public data for their OSINT investigations. For some of us, this includes breach data, i.e. information released publicly on forums, communities, and channels. A few years ago it was still possible to obtain a few of these data sets, place them on a fast storage device, and search through them with ripgrep. With daily leaks and major sets each reaching a terabyte or more, that approach no longer scales.

This repo contains my efforts to organize and optimize this type of data for fast and flexible querying.

🦹‍♂️ What this isn't

This repo does not and will not at any point contain actual data from any breach. It is not intended to encourage any illicit activity. Also, it is not the final word on database design, data transformation, or storage. We are all learning as we go. If you have a better way of doing any of this – reach out and contribute!

🧐 Why

With services like HIBP, Dehashed, IntelX, and others, you may wonder why you would ever bother doing this yourself. The answer is: it depends. It takes a lot of time to sift through the data and organize it. It will likely require money spent on fast storage devices. And it takes a mindset willing to learn and persist through challenges. In return you get a few key benefits: privacy, greater depth of data, and full control over the data.

⛰️ Goals

  • fast querying of multiple data sources on key fields such as name, username, email, password, phone, and IP
  • querying of additional data that may enrich the above, including the source of the data
  • low maintenance overhead (data sources vary in content but must co-exist; no server-level infrastructure or regular upkeep)
  • ease of backup/restore

🧰 Tools

Tools used to process the data

Note: My working environment is Debian-based Linux. Some commands will need to be adjusted for other distributions.

💪 Process

  • Analyze the file (type, contents, size, etc.)
  • If the file format is simple (e.g. delimited values), nothing more is necessary at this stage
  • If the file format is complex (e.g. SQL, JSON), it will likely need to be imported into a system that can manipulate it
    • SQL dumps most often go into MySQL, unless the dump came from Postgres, in which case restore it into Postgres
    • JSON can either be transformed with specialized JSON tools or imported into MongoDB and processed there (see the JSON sketch after this list)
  • Decide which data will be useful and which can be omitted
    • For SQL, import into a new database, drop unnecessary columns, combine any related data into unified columns, and change constraints to allow NULL values to save space (see the MySQL sketch after this list)
      • Use head -100 file.sql to preview the first 100 lines of the file
      • If a file is far too large to view or edit on your system, consider splitting it into more manageable chunks; SQLDumpSplitter can help here
      • Create a database for the incoming data, e.g. mysql -u root -p -e "CREATE DATABASE db_name;"
  • Import into MongoDB
  • Process in MongoDB (see the mongosh sketch after this list)
    • If fields have not been unified by this stage, do so now
    • If extra fields exist, move them to a sub-document (e.g. Other data)
    • Tag data source
    • Remove any unnecessary fields
    • Add indices as necessary
    • Dump out via mongodump
    • Restore via mongorestore into one unified collection (see the dump/restore sketch after this list)
    • Back up the unified collection
    • Test
  • TBD: querying from Terminal/Web
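
A minimal sketch of the MySQL path described above. The database, table, and column names (db_name, users, signature, avatar_url) are placeholders for illustration, not from any real data set; adjust them to match the actual dump.

```bash
# Preview the dump before committing to a full import
head -n 100 dump.sql

# If the dump is too large to handle in one piece, split it into chunks.
# Note: naive line splitting can cut an INSERT statement in half;
# SQLDumpSplitter is the safer choice for SQL dumps.
split -l 1000000 dump.sql dump_part_

# Create a database for the incoming data and import the dump
mysql -u root -p -e "CREATE DATABASE db_name;"
mysql -u root -p db_name < dump.sql

# Drop columns that will never be queried and relax constraints so
# missing values can be stored as NULL (placeholder column names)
mysql -u root -p db_name -e "
  ALTER TABLE users
    DROP COLUMN signature,
    DROP COLUMN avatar_url,
    MODIFY email VARCHAR(255) NULL,
    MODIFY phone VARCHAR(32) NULL;"
```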
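
For JSON sources, a sketch of the two options mentioned above: reduce the file with jq, or hand it straight to mongoimport. It assumes one JSON object per line (NDJSON) and placeholder field names (email, username, password); a file that is a single JSON array needs jq's .[] and mongoimport's --jsonArray instead.

```bash
# Option 1: reduce an NDJSON dump to its key fields as CSV
jq -r '[.email, .username, .password] | @csv' dump.ndjson > dump.csv

# Option 2: import the JSON as-is into MongoDB and prune it there
# (breach/example_breach are placeholder database/collection names)
mongoimport --db breach --collection example_breach \
  --type json --file dump.ndjson
```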
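
The mongosh sketch referenced in the process list. Everything here is illustrative: the breach database, the example_breach collection, the mail/dob/country/junk_field names, and the source tag are placeholders, and the pipeline-style update assumes MongoDB 4.2+.

```bash
mongosh breach <<'EOF'
// Unify field names across sources (e.g. "mail" -> "email")
db.example_breach.updateMany(
  { mail: { $exists: true } },
  { $rename: { mail: "email" } }
);

// Move extra fields into a sub-document, tag the data source,
// then drop the originals and any unnecessary fields
db.example_breach.updateMany({}, [
  { $set: { other: { dob: "$dob", country: "$country" }, source: "ExampleBreach" } },
  { $unset: ["dob", "country", "junk_field"] }
]);

// Index the key query fields
db.example_breach.createIndex({ email: 1 });
db.example_breach.createIndex({ username: 1 });
db.example_breach.createIndex({ password: 1 });
EOF
```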
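
A sketch of the dump/restore/backup step, again with placeholder names (breach.example_breach, breach.unified). mongorestore's --nsFrom/--nsTo remapping is one way to merge each processed per-source collection into the single unified collection.

```bash
# Dump the processed per-source collection
mongodump --db breach --collection example_breach --gzip --out /backup/example_breach

# Restore it into the single unified collection
mongorestore --nsFrom "breach.example_breach" --nsTo "breach.unified" \
  --gzip /backup/example_breach

# Back up the unified collection itself
mongodump --db breach --collection unified --gzip --out /backup/unified
```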

✍️ Examples

  • Cit0day: a massive collection of older websites and their user credentials
  • Collection1: another large collection of usernames/passwords (mostly without website sources and with more noise)
  • Collections2-5: another very large collection of usernames/passwords with a lot of noise and duplicates
  • COMB: yet another very large set of credentials
  • AndroidForums: a SQL dump of a popular forum
  • AdultFriendFinder: post-processing a collection with issues
