Why Google stores billions of lines of code in a single repository

Meta Info

Published in Communications of the ACM, 2016.

Authors: Rachel Potvin, Josh Levenberg (Google)

Understanding the paper

TL;DR

  • Google chose to keep a single central (monolithic) repository because of its advantages, such as unified versioning, extensive code sharing, and atomic changes.
  • The monolithic model of source code management is not for everyone, e.g., organizations where large parts of the codebase are private or hidden between groups.

Key systems

  • Piper: The distributed source-code repository
    • Implemented on top of standard Google infrastructure (originally Bigtable, now Spanner)
    • Relies on the Paxos algorithm to guarantee consistency across replicas
  • CitC (Clients in the Cloud): The workspace client
    • Consists of a cloud-based storage backend and a Linux-only FUSE file system
  • Critique: The code-review tool
  • Tricorder: Static analysis system
    • Surfaces data on code quality, test coverage, and test results
  • Rosie: large-scale cleanups and code changes
    1. Create a large patch, e.g., with a find-and-replace operation
    2. Split the large patch into smaller patches; test them independently; send for code review; commit them automatically once they pass tests and a code review
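
The two Rosie steps above can be sketched roughly as follows. This is a minimal illustration, not Rosie's actual implementation: it assumes a large patch is just a list of changed file paths and that shards are cut along directory lines. The function names, the `depth` parameter, and the file paths are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def shard_by_directory(changed_files, depth=2):
    """Group changed file paths into smaller patches keyed by leading directories.

    `depth` (a hypothetical knob) is how many directory components form a shard key.
    """
    shards = defaultdict(list)
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Drop the file name, then keep at most `depth` directory components.
        key = "/".join(parts[:-1][:depth]) or "."
        shards[key].append(path)
    return {key: sorted(files) for key, files in sorted(shards.items())}


def can_commit(shard_status):
    """A shard is committed only once it has passed tests AND code review."""
    return shard_status["tests_passed"] and shard_status["review_approved"]


# Hypothetical large patch touching several projects.
large_patch = [
    "ads/serving/handler.cc",
    "ads/serving/handler_test.cc",
    "maps/tiles/render.cc",
    "search/index/builder.cc",
    "search/index/merge.cc",
]
shards = shard_by_directory(large_patch)
```

Each entry in `shards` would then be tested and reviewed independently, and `can_commit` stands in for the gate that lets a shard land automatically.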

Statistics

  • Google’s monolithic software repository is used by 95% of its software developers worldwide.
  • The Google codebase includes
    • approximately 1 billion files
    • a history of 35 million commits
    • 86TB of data (excluding release branches)
  • Over 99% of files stored in Piper are visible to all full-time Google engineers.
  • Over 80% of Piper users today use CitC.

Advantages of a monolithic codebase

  • Unified versioning → a single source of truth
  • Code sharing and reuse
  • Simplified dependency management
    • Avoid diamond dependency problem
  • Atomic changes
  • Large-scale refactoring
  • Collaboration across teams
  • Flexible team boundaries and code ownership
  • Code visibility and clear tree structure → implicit team namespacing
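
To make the diamond dependency problem mentioned above concrete, here is a small sketch. It assumes a toy model in which each package pins its dependencies to explicit versions; the package names (`A`, `B`, `C`, `D`), the version strings, and the helper function are hypothetical.

```python
def find_version_conflicts(requirements):
    """Return dependencies that are required at more than one version.

    `requirements` maps package -> {dependency: required_version}.
    """
    seen = {}  # dependency -> {version: [packages that require it]}
    for package, deps in requirements.items():
        for dep, version in deps.items():
            seen.setdefault(dep, {}).setdefault(version, []).append(package)
    return {dep: versions for dep, versions in seen.items() if len(versions) > 1}


# Multi-repository world: A depends on B and C, which pin different versions of D.
multi_repo = {
    "A": {"B": "1.0", "C": "1.0"},
    "B": {"D": "1.0"},
    "C": {"D": "2.0"},
}
conflicts = find_version_conflicts(multi_repo)

# Monolithic world: everything builds against the single version at head.
monorepo = {
    package: {dep: "head" for dep in deps}
    for package, deps in multi_repo.items()
}
no_conflicts = find_version_conflicts(monorepo)
```

In the first case `D` is requested at two versions, so `A` cannot be built consistently. In the second case there is only one version of `D` to depend on, so the conflict cannot arise.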

Costs and trade-offs

  • Tooling investments for both development and execution
    • Code-indexing system
    • Automated test infrastructure
    • Build infrastructure
    • Code search and browsing tools
  • Codebase complexity
    • Unnecessary dependencies → binary size bloat
  • Efforts invested in code health
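
One way to picture the code-health work on unnecessary dependencies is a check that compares what a build target declares against what it actually uses. The sketch below is only an illustration under that assumption; the target labels and the function are hypothetical and do not describe Google's tooling.

```python
def unused_dependencies(declared, used):
    """Report, per target, declared dependencies that are never actually used.

    Both arguments map a target name to a list of dependency names.
    """
    report = {}
    for target, deps in declared.items():
        unused = sorted(set(deps) - set(used.get(target, ())))
        if unused:
            report[target] = unused
    return report


# Hypothetical build targets and their dependencies.
declared = {
    "//app:server": ["//lib:http", "//lib:json", "//lib:imaging"],
    "//app:cli": ["//lib:json"],
}
used = {
    "//app:server": ["//lib:http", "//lib:json"],
    "//app:cli": ["//lib:json"],
}
report = unused_dependencies(declared, used)
```

Here `//lib:imaging` is linked into the server without being used, which is the kind of dependency that inflates binary size.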

Alternatives

  • Git (distributed version control systems)
    • A team at Google is focused on supporting Git, which is used by Google’s Android and Chrome teams outside the main Google repository.
    • Important for these teams due to external partner and open source collaborations.
    • The Git community strongly recommends that developers keep many smaller repositories rather than one large one.
      • A git clone operation copies all content to one’s local machine, which is impractical for a repository of this size.
  • Mercurial
    • An experimental effort to see whether the Mercurial client can scale to a codebase of Google’s size