"Lessons from building GitHub code search" by Luke Francl (Strange Loop 2023)
Strange Loop Conference
36 min, 23 sec
Luke Francl at GitHub discusses the challenges and solutions in building the new GitHub code search, 'Blackbird'.
Summary
- Luke introduced GitHub's new code search engine, Blackbird, designed to overcome the limitations of the old system.
- Explained that the new system supports exact substring matches, regular expressions, code navigation, and can search billions of files quickly.
- Discussed the challenges faced with the old system, including the inability to search forks and slow indexing due to the size of GitHub.
- Outlined the technical improvements made, such as deduplication, delta compression, and cache servers to improve indexing times.
- Shared the lessons learned about the importance of building software that can adapt and scale efficiently.
Chapter 1

Luke Francl introduces the new GitHub code search and its capabilities.
- Luke Francl introduces himself and explains his role at GitHub.
- Describes the limitations of the old GitHub code search engine, such as slow performance and limited search capabilities.
- Introduces Blackbird, the new search engine built from the ground up for large-scale source code search, which supports exact substring match, regular expressions, and code navigation.
- Blackbird is fast and can search across billions of files and millions of repositories in a few hundred milliseconds.

Chapter 2

Luke details the challenges faced with the old code search system at GitHub.
- Luke worked on the old system, which was slow and limited.
- He was part of three attempts to improve the old code search, all unsuccessful.
- The old system's limitations made it difficult to improve, contributing to the delay in fixing the code search.

Chapter 3

Exploring the purpose of code search and technical considerations for indexing.
- Code search aims to help programmers understand code by quickly answering questions, necessitating an index for fast response.
- Indexing is crucial for a search engine, and it involves collecting content and building an inverted index with posting lists.
- Code indexing is unique as repositories change over time, and GitHub's system handles the computational challenges of tracking these changes.

Chapter 4

Discussion on the tokenization process in code search and the complexities of indexing forks.
- Tokenization is the process of breaking text into searchable tokens, which is challenging with code due to significant minor differences.
- GitHub's original tokenization process had limitations and couldn't handle certain programming characters, leading to loss of information.
- Forks add another layer of complexity, as they are cheap in git infrastructure but expensive for code search due to the necessity of indexing almost identical content.

Chapter 5

Luke outlines the design goals for the new code search and the importance of indexing speed.
- The new GitHub code search had end-user goals such as regular expression search, but the core goal was to make indexing fast, aiming for less than 24 hours to build a new index.
- The team achieved this by rewriting the indexer in Rust, which allowed processing of over 200,000 documents per second.
- Architectural improvements were made, separating the crawl and index processes to optimize performance.

Chapter 6

Luke discusses scaling the code search using deduplication and delta compression.
- As the system scaled, the team introduced document deduplication to reduce content indexing.
- Delta compression was used to optimize indexing by reusing previously indexed content across similar repositories.
- A novel data structure, the geometric XOR filter, was invented to accurately estimate the symmetric difference between repository sets, enabling efficient delta compression.

Chapter 7

Implementing dynamic shard assignment and a cache server to improve indexing speed.
- The team addressed the challenge of index compaction and resource management by introducing dynamic shard assignment.
- A new cache server was introduced to reuse documents from a previous index, greatly improving indexing speed and efficiency.
- These improvements allowed for a 99.9% cache hit rate for re-indexing and over 50% for incremental indexing.

Chapter 8

Luke concludes with lessons learned from the experience of building the new GitHub code search.
- Building software from scratch can be justified if it allows for better use of the data's structure to outperform off-the-shelf tools.
- Software scaling requires continual adaptation and solutions for emerging challenges.
- Efforts to make a system easily changeable are crucial, though not always visible to users, as it enables faster feature delivery.

More Strange Loop Conference summaries

"The Economics of Programming Languages" by Evan Czaplicki (Strange Loop 2023)
Strange Loop Conference
Evan discusses his journey in creating Elm, the challenges faced, and insights into the funding and economics behind programming languages.

"Making Hard Things Easy" by Julia Evans (Strange Loop 2023)
Strange Loop Conference
A detailed exploration of why systems like DNS, HTTP, and Bash can be challenging to master, even when they seem fundamental, and strategies to demystify them.

"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson
Strange Loop Conference
The talk discusses the benefits, challenges, and techniques of simulation testing in distributed systems.

"Noether: Symmetry in Programming Language Design" by Daira Hopwood (2013)
Strange Loop Conference
Dara presents the concept and design of a programming language called Neta, emphasizing the importance of symmetry in programming language design.