Exploring the Technology Behind GitHub’s New Code Search

GitHub’s new and improved code search experience has garnered much attention and excitement since its launch. In this blog post, we will take a deep dive into the technology behind GitHub’s code search and explore the motivations, challenges, and solutions that led to its development.

Motivations for Building a New Search Engine

Building a search engine from scratch may seem counterintuitive when there are existing open-source solutions available. However, GitHub had unique requirements and scale that necessitated the creation of their own search engine, Blackbird. The goal was to provide developers with an entirely new user experience that allows them to ask questions of code and obtain answers through iterative searching, browsing, navigating, and reading code. General text search products did not align with GitHub’s needs in terms of indexing speed, user experience, and scalability.

The Limitations of Grep

Some may argue that using grep, a command-line utility for searching patterns in files, could suffice as a brute force solution. However, grep proves to be inefficient for large-scale code search. For instance, running grep on a machine with an 8-core Intel CPU can take several seconds to execute an exhaustive regular expression query on a 13 GB file cached in memory. Scaling this approach to GitHub’s codebase, which constantly changes and spans over 115 TB, would require significant resources.

Building the Search Index

To ensure fast queries, it was crucial to pre-compute indices that map keys to sorted lists of document IDs where the key appears. For code search purposes, a special type of inverted index called an ngram index was required. This type of index allows for efficient lookup operations of substrings within the content. By intersecting the results of multiple lookups, the system can obtain the list of documents where the queried string appears.
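
As a rough illustration of the idea (a minimal sketch, not GitHub's implementation), the following Rust snippet builds a tiny trigram index, answers a substring query by intersecting posting lists, and then verifies the surviving candidates against the original query string:

```rust
use std::collections::HashMap;

/// Minimal trigram index sketch: each trigram maps to a sorted list of doc IDs.
struct TrigramIndex {
    postings: HashMap<String, Vec<u32>>,
    docs: Vec<String>,
}

impl TrigramIndex {
    fn new() -> Self {
        Self { postings: HashMap::new(), docs: Vec::new() }
    }

    /// Add a document and index every trigram of its content.
    fn add(&mut self, content: &str) -> u32 {
        let id = self.docs.len() as u32;
        self.docs.push(content.to_string());
        let chars: Vec<char> = content.chars().collect();
        for window in chars.windows(3) {
            let gram: String = window.iter().collect();
            let list = self.postings.entry(gram).or_default();
            if list.last() != Some(&id) {
                list.push(id); // lists stay sorted because docs are added in ID order
            }
        }
        id
    }

    /// Candidates = intersection of the posting lists for the query's trigrams,
    /// then a containment check to weed out false positives.
    fn search(&self, query: &str) -> Vec<u32> {
        let chars: Vec<char> = query.chars().collect();
        let mut result: Option<Vec<u32>> = None;
        for window in chars.windows(3) {
            let gram: String = window.iter().collect();
            let list = self.postings.get(&gram).cloned().unwrap_or_default();
            result = Some(match result {
                None => list,
                Some(prev) => prev
                    .into_iter()
                    .filter(|id| list.binary_search(id).is_ok())
                    .collect(),
            });
        }
        result
            .unwrap_or_default()
            .into_iter()
            .filter(|&id| self.docs[id as usize].contains(query))
            .collect()
    }
}

fn main() {
    let mut idx = TrigramIndex::new();
    idx.add("fn parse_arguments(args: &[String]) {}");
    idx.add("fn main() { println!(\"hello\"); }");
    println!("{:?}", idx.search("arguments")); // -> [0]
}
```

The final containment check matters because a document can contain all of a query's trigrams without containing the query itself.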

Indexing 45 Million Repositories

Efficiently building the index relied on two key insights: Git's content-addressable hashing and the large amount of duplicate content on GitHub. The index was sharded by Git blob object ID, which spreads documents evenly across shards and ensures that identical blobs are indexed only once. The index itself was modeled as a tree, and delta encoding was used to reduce crawling time and keep index metadata compact. Query results were kept consistent on a commit-level basis: a user may see a slightly older state of the index, but always a consistent one.
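
A hedged sketch of those two ideas in Rust; the hashing scheme and helper names are illustrative, not GitHub's actual code. Routing by blob object ID means identical content always hashes to the same shard, and comparing a repository's blob set against a similar, already-crawled one means only genuinely new blobs need to be indexed:

```rust
use std::collections::HashSet;

/// Route a blob to a shard by its Git object ID (hex SHA) so that identical
/// content always lands on the same shard and is only indexed once.
fn shard_for_blob(blob_oid: &str, num_shards: u64) -> u64 {
    // Use the first 8 hex characters of the OID as a cheap, stable hash.
    let prefix = u64::from_str_radix(&blob_oid[..8], 16).unwrap_or(0);
    prefix % num_shards
}

/// Delta-encode one repository against a similar, already-crawled one:
/// only blobs that are not already present need to be indexed.
fn blobs_to_index(repo_blobs: &HashSet<String>, parent_blobs: &HashSet<String>) -> Vec<String> {
    repo_blobs.difference(parent_blobs).cloned().collect()
}

fn main() {
    let parent: HashSet<String> =
        ["aaaa1111", "bbbb2222"].iter().map(|s| s.to_string()).collect();
    let fork: HashSet<String> =
        ["aaaa1111", "bbbb2222", "cccc3333"].iter().map(|s| s.to_string()).collect();

    for oid in blobs_to_index(&fork, &parent) {
        println!("index blob {} on shard {}", oid, shard_for_blob(&oid, 32));
    }
}
```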

Tracing a Query through the System

When a user enters a query, the Blackbird query service processes and parses it before sending requests to each shard in the search cluster. On each shard, the query is further converted to perform lookup operations on the indices. The iterators from each clause are executed, and the results are aggregated, sorted, and filtered. Finally, the requested results are returned to the user.

Ingesting and Indexing Process

The ingest and indexing side of the system involves several components. Kafka provides the events that trigger indexing operations. Multiple crawlers interact with Git, extract symbols from code, and publish documents to Kafka. Each indexer shard consumes its own Kafka partition, so the data stays cleanly partitioned across shards. The indexer shards receive the documents and build their indices: ngram indices for content, symbols, and paths, along with other useful indices for languages, owners, and repositories.
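
A rough sketch of that routing, with a hypothetical publish step standing in for the real crawler-to-Kafka producer; the document fields below are illustrative, not GitHub's actual schema:

```rust
/// A document roughly as a crawler might publish it: one Git blob plus the
/// metadata the indexer shards need. The fields are illustrative.
struct BlobDocument {
    blob_oid: String,
    repo: String,
    path: String,
    language: String,
    content: String,
}

/// Derive the Kafka partition from the blob OID so that the same content
/// always goes to the same partition, and thus to the same indexer shard.
fn partition_for(doc: &BlobDocument, num_partitions: u32) -> u32 {
    let prefix = u32::from_str_radix(&doc.blob_oid[..8], 16).unwrap_or(0);
    prefix % num_partitions
}

/// Hypothetical publish step; a real crawler would hand the document to a
/// Kafka producer targeting the chosen partition.
fn publish(doc: BlobDocument, num_partitions: u32) {
    let partition = partition_for(&doc, num_partitions);
    println!(
        "publish {}:{} ({}, {} bytes) -> partition {}",
        doc.repo, doc.path, doc.language, doc.content.len(), partition
    );
}

fn main() {
    publish(
        BlobDocument {
            blob_oid: "3b18e512dba79e4c8300dd08aeb37f8e728b8dad".to_string(),
            repo: "rails/rails".to_string(),
            path: "lib/arguments.rb".to_string(),
            language: "Ruby".to_string(),
            content: "def arguments(*args); end".to_string(),
        },
        32,
    );
}
```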

The Life of a Query

To understand the inner workings of the code search system, let’s trace a query from the user to the final results. Suppose a user searches for Ruby code in repositories owned by the Rails organization with the query /arguments?/ org:rails lang:Ruby, where /arguments?/ is a regular expression. The Blackbird query service receives the query, parses and rewrites it, resolves elements like languages to their canonical IDs, and adds extra clauses for permissions and scopes, as sketched below.
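
Loosely, the rewritten query can be pictured as a tree of clauses. The variant names and the canonical language ID in this sketch are invented for illustration and are not GitHub's actual representation:

```rust
/// Illustrative query tree; the variants and the language ID are invented
/// for this example.
#[derive(Debug)]
enum Query {
    And(Vec<Query>),
    Or(Vec<Query>),
    Regex(String),         // /arguments?/
    Owner(String),         // org:rails
    LanguageId(u32),       // lang:Ruby resolved to a canonical ID
    RepoVisibleTo(String), // injected permissions/scope clause
}

fn rewrite(user: &str) -> Query {
    Query::And(vec![
        Query::Regex("arguments?".to_string()),
        Query::Owner("rails".to_string()),
        Query::LanguageId(326), // hypothetical canonical ID for Ruby
        Query::Or(vec![
            Query::RepoVisibleTo(user.to_string()),
            // ...plus clauses for public repositories, and so on.
        ]),
    ])
}

fn main() {
    println!("{:#?}", rewrite("some-user"));
}
```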

The rewritten query is then sent to each shard in the search cluster, where it is converted into lookups against the indices. The regular expression is translated into a series of substring queries on the ngram indices, which allows matching documents to be retrieved efficiently. The iterators for each clause of the query are executed, yielding a list of candidate documents. These documents are then double-checked against the original query, scored, sorted, and filtered before being aggregated and returned to the user.
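
Continuing the sketch from above: the regex arguments? always requires the literal "argument", so a shard can intersect the trigram posting lists for that literal and only run the real regular expression over the handful of candidates that survive. This example assumes the regex crate is available as a dependency, and the toy posting lists mirror the index sketch shown earlier:

```rust
use std::collections::HashMap;

use regex::Regex; // assumes the `regex` crate is listed in Cargo.toml

/// The trigrams of a literal string that any regex match must contain.
fn trigrams(literal: &str) -> Vec<String> {
    let chars: Vec<char> = literal.chars().collect();
    chars.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

fn main() {
    // Toy posting lists (trigram -> sorted doc IDs), shaped like the index sketch above.
    let mut postings: HashMap<String, Vec<u32>> = HashMap::new();
    postings.insert("arg".into(), vec![0, 2, 5]);
    postings.insert("rgu".into(), vec![0, 5]);
    postings.insert("gum".into(), vec![0, 5]);
    postings.insert("ume".into(), vec![0, 3, 5]);
    postings.insert("men".into(), vec![0, 1, 5]);
    postings.insert("ent".into(), vec![0, 1, 4, 5]);

    // /arguments?/ always requires the literal "argument", so intersect its trigrams.
    let mut candidates: Option<Vec<u32>> = None;
    for gram in trigrams("argument") {
        let list = postings.get(&gram).cloned().unwrap_or_default();
        candidates = Some(match candidates {
            None => list,
            Some(prev) => prev
                .into_iter()
                .filter(|d| list.binary_search(d).is_ok())
                .collect(),
        });
    }
    let candidates = candidates.unwrap_or_default();
    println!("candidate docs: {:?}", candidates); // [0, 5]

    // Double-check each candidate with the real regex before scoring and sorting.
    let re = Regex::new("arguments?").unwrap();
    let docs = ["def arguments(*args)", "x", "arg", "ume", "ent", "parse_argument(input)"];
    let matching: Vec<u32> = candidates
        .into_iter()
        .filter(|&d| re.is_match(docs[d as usize]))
        .collect();
    println!("matching docs: {:?}", matching); // [0, 5]
}
```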

Summary

In conclusion, GitHub’s new code search experience is powered by Blackbird, a Rust-based search engine built from scratch. It offers developers a unique and improved user experience for searching, browsing, and navigating code. By developing their own search engine, GitHub was able to fulfill their specific requirements and scale efficiently to their massive codebase. The technology behind GitHub’s new code search demonstrates the complexities involved in providing fast and accurate search capabilities for a platform of GitHub’s scale and significance.

Tags: GitHub, code search, search engine, indexing, ngram index

[Reference Link](https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/)

The Challenge of Using AI in Search Engines: Finding Cost-Effective Solutions

Introduction

Search engines play a crucial role in our digital lives, enabling us to find information quickly and easily. With advancements in artificial intelligence (AI), chatbots have become an integral part of search engine experiences. However, integrating AI-driven chatbots into search engines poses significant challenges, particularly in terms of cost. In this blog post, we will explore the problem of high computational costs associated with using AI models like ChatGPT and discuss potential solutions to make AI-powered search engines more cost-effective.

The Problem: High Computational Costs

AI models like ChatGPT are highly sophisticated and deliver impressive conversational abilities, but that sophistication comes at a significant computational cost. While a traditional keyword search is cheap to serve, an exchange with a conversational AI model is estimated to cost roughly ten times as much. By one widely cited estimate, generating 50-word answers for even half of the roughly 3.3 trillion search queries handled each year could add on the order of $6 billion in annual expenses.
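
To make the scale of that estimate concrete, here is a hedged back-of-envelope calculation; the per-query cost and the share of AI-answered queries are assumptions chosen only to reproduce the headline figure, not reported numbers:

```rust
fn main() {
    // Back-of-envelope figures; all of these are assumptions, not reported data.
    let annual_queries: f64 = 3.3e12; // ~3.3 trillion searches per year
    let ai_share = 0.5; // assume AI-generated answers on half of them
    let added_cost_per_query = 0.0036; // assumed incremental cost per AI answer, in dollars

    let annual_added_cost = annual_queries * ai_share * added_cost_per_query;
    println!("estimated added cost: ${:.1} billion per year", annual_added_cost / 1e9);
    // 3.3e12 * 0.5 * 0.0036 ≈ $5.9 billion per year
}
```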

Understanding the Causes and Seeking Solutions

To address the high computational costs of using conversational AI in search engines, researchers are exploring various strategies. Let’s examine some of the potential causes and solutions:

Inference Cost Optimization

One approach to reducing AI model costs is inference cost optimization: writing more efficient inference code so that each query makes better use of the available hardware. OpenAI’s engineers are actively researching and developing such optimizations to bring costs down.

Model Size Reduction

Another promising avenue is reducing the size of AI models while maintaining accuracy. This requires finding ways to shrink the number of model parameters by 10 to 100 times without sacrificing performance. Research in this area is ongoing, but substantial advances are still needed before reductions of that magnitude become practical.

Subscription-Based Pricing

Introducing monthly subscription fees for services like ChatGPT is another potential solution. OpenAI, for example, has implemented a subscription-based pricing model, charging users $20 per month to access enhanced features of ChatGPT. This approach allows users to have cost predictability while also supporting the continuous improvement and availability of AI models.

Applying Small Models for Simple Tasks

For less demanding tasks, employing smaller AI models is a cost-effective option. Google, for instance, is exploring the use of smaller models for simpler tasks, which reduces computational requirements while still delivering satisfactory results.

Conclusion

The integration of conversational AI into search engines introduces significant computational costs. However, ongoing research and innovative strategies are focused on finding cost-effective solutions. These include inference cost optimization, model size reduction, subscription-based pricing, and utilizing smaller models for simpler tasks. By exploring these avenues, search engine providers can strike a balance between delivering advanced conversational experiences and optimizing cost efficiency.

Tags: AI, search engines, computational costs, optimization

Reference Link