GitHub’s new and improved code search experience has garnered much attention and excitement since its launch. In this blog post, we will take a deep dive into the technology behind GitHub’s code search and explore the motivations, challenges, and solutions that led to its development.
Motivations for Building a New Search Engine
Building a search engine from scratch may seem counterintuitive when there are existing open-source solutions available. However, GitHub had unique requirements and scale that necessitated the creation of their own search engine, Blackbird. The goal was to provide developers with an entirely new user experience that allows them to ask questions of code and obtain answers through iterative searching, browsing, navigating, and reading code. General text search products did not align with GitHub’s needs in terms of indexing speed, user experience, and scalability.
The Limitations of Grep
Some may argue that grep, a command-line utility for searching patterns in files, could suffice as a brute-force solution. However, grep is inefficient for large-scale code search: an exhaustive regular expression query over a 13 GB file cached in memory takes several seconds even on a machine with an 8-core Intel CPU. Scaling this approach to GitHub’s codebase, which constantly changes and spans over 115 TB, would require enormous resources.
Building the Search Index
To ensure fast queries, it was crucial to pre-compute indices: maps from keys to sorted lists of document IDs (posting lists) where the key appears. For code search purposes, a special type of inverted index called an ngram index was required, where each key is a sequence of n consecutive characters. With trigrams (n = 3), for example, the content `limits` produces the keys `lim`, `imi`, `mit`, and `its`. This type of index allows for efficient lookup of substrings within the content: by intersecting the posting lists of multiple lookups, the system obtains the list of candidate documents containing the queried string.
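To make this concrete, here is a minimal sketch of a trigram index in Rust, Blackbird’s implementation language. The names and structure are illustrative, not GitHub’s actual code: each trigram maps to a sorted list of document IDs, and a substring query intersects the posting lists of its trigrams.

```rust
use std::collections::BTreeMap;

/// Illustrative trigram index: every 3-character window of a document's
/// content maps to a sorted, deduplicated list of document IDs.
struct TrigramIndex {
    postings: BTreeMap<String, Vec<u32>>,
}

impl TrigramIndex {
    fn new() -> Self {
        Self { postings: BTreeMap::new() }
    }

    /// Index a document: record every trigram of its content.
    /// Documents are assumed to arrive in increasing doc_id order,
    /// so each posting list stays sorted.
    fn add(&mut self, doc_id: u32, content: &str) {
        let chars: Vec<char> = content.chars().collect();
        for window in chars.windows(3) {
            let gram: String = window.iter().collect();
            let ids = self.postings.entry(gram).or_default();
            if ids.last() != Some(&doc_id) {
                ids.push(doc_id);
            }
        }
    }

    /// Candidate documents for a substring query: the intersection of the
    /// posting lists of all trigrams in the query. (Queries shorter than
    /// three characters would need separate handling.)
    fn search(&self, query: &str) -> Vec<u32> {
        let chars: Vec<char> = query.chars().collect();
        let mut result: Option<Vec<u32>> = None;
        for window in chars.windows(3) {
            let gram: String = window.iter().collect();
            let ids = match self.postings.get(&gram) {
                Some(ids) => ids,
                None => return Vec::new(), // a missing trigram rules out all documents
            };
            result = Some(match result {
                None => ids.clone(),
                Some(acc) => intersect_sorted(&acc, ids),
            });
        }
        result.unwrap_or_default()
    }
}

/// Linear merge of two sorted ID lists.
fn intersect_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let mut index = TrigramIndex::new();
    index.add(1, "fn limits() {}");
    index.add(2, "no match here");
    // Intersects the posting lists for "lim", "imi", "mit", and "its".
    assert_eq!(index.search("limits"), vec![1]);
}
```

Searching for `limits` intersects the posting lists for `lim`, `imi`, `mit`, and `its`; any document surviving the intersection contains all four trigrams and is then verified against the raw content.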
Indexing 45 Million Repositories
Efficiently building the index relied on two key insights: Git’s content-addressable hashing and the prevalence of duplicate content on GitHub. The index is sharded by Git blob object ID, which distributes documents evenly between shards and, because identical blobs share an ID, avoids indexing duplicates. The index itself was modeled as a tree, and delta encoding techniques were employed to reduce crawling time and keep metadata compact. Query results are consistent on a commit-level basis: users may see a slightly older state of the index, but never an inconsistent one.
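The blog post doesn’t publish the exact sharding function, but the idea of content-addressable sharding can be sketched as follows. Because a Git blob object ID is a hash of the blob’s content, identical files across forks map to the same shard and only need to be indexed once; the function name here is hypothetical.

```rust
/// A Git blob object ID is a content hash (SHA-1: 20 bytes, 40 hex chars),
/// so hashing content decides the shard: identical content always lands on
/// the same shard and is indexed once, however many repositories contain it.
fn shard_for_blob(blob_oid_hex: &str, num_shards: u64) -> u64 {
    // Interpret the first 16 hex characters (8 bytes) of the OID as an integer.
    let prefix = u64::from_str_radix(&blob_oid_hex[..16], 16)
        .expect("object ID must be hexadecimal");
    prefix % num_shards
}

fn main() {
    // The same blob, reachable from thousands of forks, maps to one shard.
    let oid = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"; // Git's empty-blob OID
    println!("blob {} -> shard {}", oid, shard_for_blob(oid, 32));
}
```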
Tracing a Query through the System
When a user enters a query, the Blackbird query service parses and processes it, then sends a request to each shard in the search cluster. On each shard, the query is converted into lookup operations on the indices. The iterators from each clause are executed, and their results are aggregated, sorted, and filtered before the requested results are returned to the user.
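As an illustration of this scatter-gather step, the sketch below fans a query out to each shard on its own thread and merges the per-shard results by score. It is a simplification under stated assumptions: an in-process `Shard` type with a placeholder `search` method stands in for GitHub’s networked search cluster.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, Clone)]
struct Hit {
    doc_id: u32,
    score: f64,
}

/// Stand-in for one shard of the search cluster: in reality each shard runs
/// the query against its own indices and returns scored documents.
struct Shard {
    id: usize,
}

impl Shard {
    fn search(&self, _query: &str) -> Vec<Hit> {
        // Placeholder: a real shard would execute index iterators here.
        vec![Hit { doc_id: self.id as u32 * 100, score: 1.0 / (self.id as f64 + 1.0) }]
    }
}

/// Scatter the query to every shard in parallel, then gather and rank.
fn scatter_gather(shards: Vec<Shard>, query: &str, top_k: usize) -> Vec<Hit> {
    let (tx, rx) = mpsc::channel();
    for shard in shards {
        let tx = tx.clone();
        let query = query.to_string();
        thread::spawn(move || {
            let _ = tx.send(shard.search(&query));
        });
    }
    drop(tx); // close the channel once all shard threads have replied

    let mut hits: Vec<Hit> = rx.iter().flatten().collect();
    hits.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    hits.truncate(top_k);
    hits
}

fn main() {
    let shards = (0..4).map(|id| Shard { id }).collect();
    for hit in scatter_gather(shards, "/arguments?/ org:rails lang:Ruby", 3) {
        println!("{hit:?}");
    }
}
```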
Ingesting and Indexing Process
The ingest and indexing side of the system involves several components. Kafka plays a vital role, providing the events that trigger indexing operations. Multiple crawlers interact with Git, extract symbols from code, and publish documents to Kafka. Each indexer shard consumes a specific Kafka partition, which is how documents are partitioned between shards. From these documents the shards build their indices: ngram indices for content, symbols, and paths, as well as other useful indices for languages, owners, and repositories.
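A sketch of the kind of document a crawler might publish and the indices a shard builds from it. The field names, the `BlobDocument` and `IndexShard` types, and the type alias are assumptions for illustration; the real schema is not public.

```rust
use std::collections::BTreeMap;

/// Stand-in for the trigram index sketched in the ngram section above.
type TrigramIndex = BTreeMap<String, Vec<u32>>;

/// Illustrative shape of a document a crawler publishes to Kafka after
/// examining a Git blob.
struct BlobDocument {
    blob_oid: String,     // content hash; also determines the Kafka partition
    repo_ids: Vec<u64>,   // every repository that contains this blob
    path: String,
    language: String,
    content: String,
    symbols: Vec<String>, // function and class names extracted by the crawler
}

/// Each indexer shard builds several indices from its partition's documents.
struct IndexShard {
    content_ngrams: TrigramIndex, // substring search over file content
    symbol_ngrams: TrigramIndex,  // substring search over extracted symbols
    path_ngrams: TrigramIndex,    // substring search over file paths
    languages: BTreeMap<u32, Vec<u32>>,    // language ID -> documents
    repositories: BTreeMap<u64, Vec<u32>>, // repository ID -> documents
}
```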
The Life of a Query
To understand the inner workings of the code search system, let’s trace a query’s journey from the user to the final results. Suppose a user searches for code written in Ruby within repositories owned by the Rails organization, using the query `/arguments?/ org:rails lang:Ruby`. The query is received by the Blackbird query service, which parses and rewrites it, resolving elements like languages to their canonical IDs and adding extra clauses for permissions and scopes.
The rewritten query is then sent to each shard in the search cluster, where it is further converted into lookups on the indices. The regular expression is translated into a series of substring queries on the ngram indices, allowing efficient retrieval of candidate documents. The iterators from each clause of the query are executed, yielding a list of documents; these are then double-checked against the actual content, scored, sorted, and filtered before being aggregated and returned to the user.
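To make the regex-to-ngram translation concrete, here is a minimal sketch for the query `/arguments?/`. The literal `argument` must appear in every match, so its trigrams become required substring lookups; candidates surviving the intersection are then double-checked with a real regex engine (the `regex` crate here, as an illustration) to weed out false positives. A real query planner handles far more regex structure than this.

```rust
use regex::Regex;

/// Required trigrams for a regex with a known literal part. In /arguments?/,
/// "argument" must occur in every match, so all of its trigrams are required.
fn required_trigrams(literal: &str) -> Vec<String> {
    let chars: Vec<char> = literal.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

fn main() {
    // Trigrams whose posting lists get intersected on the ngram index:
    // ["arg", "rgu", "gum", "ume", "men", "ent"]
    println!("{:?}", required_trigrams("argument"));

    // Both strings below contain every required trigram, but only the first
    // actually matches the pattern: the second is a false positive that the
    // double-check eliminates.
    let re = Regex::new(r"arguments?").unwrap();
    for candidate in ["fn take_arguments()", "argue over gumbo volume menu entry"] {
        println!("{candidate:?} matches: {}", re.is_match(candidate));
    }
}
```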
Summary
In conclusion, GitHub’s new code search experience is powered by Blackbird, a Rust-based search engine built from scratch. It offers developers a unique and improved user experience for searching, browsing, and navigating code. By developing their own search engine, GitHub was able to fulfill their specific requirements and scale efficiently to their massive codebase. The technology behind GitHub’s new code search demonstrates the complexities involved in providing fast and accurate search capabilities for a platform of GitHub’s scale and significance.
Tags: GitHub, code search, search engine, indexing, ngram index
[Reference Link](https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/)