Website • Documentation • Blog • Slack • Twitter
source{d} Engine exposes powerful Universal ASTs to analyze your code and a SQL engine to analyze your git history:
- Code Retrieval: retrieve and store git repositories as a dataset.
- Language Agnostic Code Analysis: automatically identify languages, parse source code, and extract the pieces that matter in a completely language-agnostic way.
- Git Analysis: powerful SQL-based analysis on top of your git repositories.
- Querying With Familiar APIs: analyze your code through powerful, friendly APIs, such as SQL, gRPC, REST, and various client libraries.
- Quickstart
- Guides & Examples
- Architecture
- Babelfish UAST
- Clients & Connectors
- Community
- Contributing
- Credits
- License
Follow the steps below to get started with source{d} Engine.
Install Docker for your platform by following these instructions:
Debian / Ubuntu:
sudo apt-get update
sudo apt-get install docker-ce
Arch Linux:
sudo pacman -S docker
Download the latest release for MacOS (Darwin), Linux or Windows.
MacOS / Linux:
# Make it executable
chmod +x srcd
# Move it into your local bin folder to be executable from anywhere
sudo mv srcd /usr/local/bin/
Now it's time to initialize the source{d} Engine and provide it with some repositories to analyze:
# Without a path, it operates on the current folder;
# nested folders are supported.
srcd init
# You can also provide a path
srcd init /home/user/replace/path/
To launch the web client, run the following command and start executing queries:
srcd web sql
In your browser, now go to http://localhost:8080
If you prefer to stay with the command line, you can execute:
srcd sql
This will open a SQL client that allows you to execute queries against your repositories.
If you want to run a query directly, you can also execute it as such:
srcd sql "SHOW tables;"
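For scripting, the one-shot form above can be wrapped in a small helper. The following is a minimal Python sketch, assuming `srcd` is on your PATH and `srcd init` has already been run. The helper names `build_sql_command` and `run_query` are hypothetical, and the output is returned as raw text because its exact format is not described here.

```python
import shutil
import subprocess


def build_sql_command(query):
    """Build the argv list for the one-shot form: srcd sql "<query>"."""
    return ["srcd", "sql", query]


def run_query(query):
    """Run a query through the srcd CLI and return its raw stdout as text.

    Assumes srcd is installed and the engine has been initialized.
    """
    if shutil.which("srcd") is None:
        raise RuntimeError("srcd was not found on PATH")
    result = subprocess.run(
        build_sql_command(query),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Example usage (requires a working srcd installation):
#     print(run_query("SHOW tables;"))
```

Passing the query as a single list element avoids shell quoting issues, since no shell is involved in the call.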
Top 10 repositories by commit count in HEAD:
SELECT repository_id, commit_count
FROM (
    SELECT r.repository_id, COUNT(*) AS commit_count
    FROM ref_commits r
    WHERE r.ref_name = 'HEAD'
    GROUP BY r.repository_id
) AS q
ORDER BY commit_count DESC
LIMIT 10
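To see what this aggregation does without a running engine, here is a small Python sketch that runs the same query shape against an in-memory SQLite table. The table is only a toy stand-in for `ref_commits` with a reduced set of columns, and all rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for the ref_commits table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ref_commits (repository_id TEXT, commit_hash TEXT, ref_name TEXT)"
)
rows = [
    ("repo-a", "a1", "HEAD"),
    ("repo-a", "a2", "HEAD"),
    ("repo-a", "a3", "HEAD"),
    ("repo-b", "b1", "HEAD"),
    ("repo-b", "b2", "HEAD"),
    ("repo-b", "b3", "refs/heads/dev"),  # different ref, so it is filtered out
    ("repo-c", "c1", "HEAD"),
]
conn.executemany("INSERT INTO ref_commits VALUES (?, ?, ?)", rows)

# Same structure as the query above: count HEAD commits per repository,
# then keep the ten repositories with the highest counts.
top = conn.execute(
    """
    SELECT repository_id, commit_count
    FROM (
        SELECT r.repository_id, COUNT(*) AS commit_count
        FROM ref_commits r
        WHERE r.ref_name = 'HEAD'
        GROUP BY r.repository_id
    ) AS q
    ORDER BY commit_count DESC
    LIMIT 10
    """
).fetchall()

for repository_id, commit_count in top:
    print(repository_id, commit_count)
```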
Query all files from HEAD:
SELECT cf.file_path, f.blob_content
FROM ref_commits r
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.ref_name = 'HEAD'
AND r.index = 0
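The chained `NATURAL JOIN`s match rows on every column name the tables share. The Python sketch below reproduces that behavior on simplified SQLite tables. It assumes that `index = 0` marks the commit the reference currently points to, uses a reduced, hypothetical set of columns, and fills the tables with invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the real tables. "index" is quoted
    -- because it is a reserved word in SQLite.
    CREATE TABLE ref_commits (repository_id TEXT, commit_hash TEXT, ref_name TEXT, "index" INTEGER);
    CREATE TABLE commit_files (repository_id TEXT, commit_hash TEXT, file_path TEXT, blob_hash TEXT);
    CREATE TABLE files (repository_id TEXT, file_path TEXT, blob_hash TEXT, blob_content TEXT);

    INSERT INTO ref_commits VALUES ('repo-a', 'c2', 'HEAD', 0), ('repo-a', 'c1', 'HEAD', 1);
    INSERT INTO commit_files VALUES
        ('repo-a', 'c2', 'main.go', 'b2'),
        ('repo-a', 'c1', 'main.go', 'b1');
    INSERT INTO files VALUES
        ('repo-a', 'main.go', 'b2', 'package main // new'),
        ('repo-a', 'main.go', 'b1', 'package main // old');
    """
)

# ref_commits and commit_files share repository_id and commit_hash;
# the result shares repository_id, file_path, and blob_hash with files.
head_files = conn.execute(
    """
    SELECT cf.file_path, f.blob_content
    FROM ref_commits r
    NATURAL JOIN commit_files cf
    NATURAL JOIN files f
    WHERE r.ref_name = 'HEAD'
    AND r."index" = 0
    """
).fetchall()

print(head_files)
```

Only the blob attached to the commit at index 0 is returned, even though the same path exists in an older commit.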
Retrieve the UAST for all files from HEAD:
SELECT * FROM (
SELECT cf.file_path,
UAST(f.blob_content, LANGUAGE(f.file_path, f.blob_content)) as uast
FROM ref_commits r
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.ref_name = 'HEAD'
AND r.index = 0
) t WHERE ARRAY_LENGTH(uast) > 0
Query for all LICENSE & README files across history:
SELECT repository_id, blob_content
FROM files
WHERE file_path = 'LICENSE'
OR file_path = 'README.md'
Show me more queries:
Extract all functions as UAST nodes for Java files from HEAD:
SELECT
files.repository_id,
files.file_path,
UAST(files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//FunctionGroup') as functions
FROM files
NATURAL JOIN commit_files
NATURAL JOIN commits
NATURAL JOIN refs
WHERE
refs.ref_name = 'HEAD'
AND LANGUAGE(files.file_path, files.blob_content) = 'Java'
LIMIT 10;
Find all files where the 'trim' method is called:
SELECT * FROM (
SELECT
files.repository_id,
files.file_path,
UAST(files.blob_content, LANGUAGE(files.file_path, files.blob_content), '//Identifier[@roleCall and @Name="trim"]') as functionCall
FROM files
NATURAL JOIN commit_files
NATURAL JOIN commits
NATURAL JOIN refs
WHERE
refs.ref_name = 'HEAD'
) t WHERE ARRAY_LENGTH(functionCall) > 0
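The third argument to `UAST` in these queries is an XPath expression that filters nodes. As a rough illustration of how such a filter selects nodes, here is a Python sketch using the standard library's ElementTree on a hypothetical XML rendering of a tiny tree. The element and attribute layout is an assumption made for this example and does not reflect the real UAST serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a tiny syntax tree; real UAST nodes differ.
TREE = """
<File>
  <Call>
    <Identifier roleCall="true" Name="trim"/>
  </Call>
  <Call>
    <Identifier roleCall="true" Name="split"/>
  </Call>
  <Declaration>
    <Identifier Name="trim"/>
  </Declaration>
</File>
"""

root = ET.fromstring(TREE)

# ElementTree does not support `and` inside a predicate, so the two
# conditions are written as chained predicates, which select the same nodes.
matches = root.findall(".//Identifier[@roleCall][@Name='trim']")

print(len(matches))
```

Only the identifier that both carries the call role and is named `trim` matches; the declaration with the same name and the call to `split` are skipped.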
You can now run the source{d} Engine. Choose what you would like to do next:
- Analyze your git repositories
- Understand how your code has evolved
- Write your own static analysis rules
- Build a data pipeline for MLonCode
For the full list of the commands supported by srcd and those that have been planned, please read commands.md.
A collection of guides & examples using the source{d} Engine:
source{d} Engine functions as a CLI tool that provides easy access to components of the source{d} stack for Code As Data. It consists of a daemon managing all of the services (Babelfish, Enry, Gitbase, etc.), which are packaged as Docker containers.
For more details on the architecture of this project, read docs/architecture.md.
One of the most important components of the source{d} engine is the UAST.
UAST stands for Universal Abstract Syntax Tree. It is a normalized form of a programming language's AST, annotated with language-agnostic roles and transformed with language-agnostic concepts (e.g. Functions, Imports, etc.). It enables advanced static analysis of code and easy feature extraction for statistics or Machine Learning on Code.
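For readers new to syntax trees, the Python sketch below uses the standard library's `ast` module to pull functions and imports out of a snippet. This is a language-specific AST, not a UAST: it only illustrates the kind of concepts (Functions, Imports) that a UAST aims to expose uniformly across languages. The sample source is invented and is parsed, never executed.

```python
import ast

# Invented sample source; it is parsed only, never run.
SOURCE = """
import os

def greet(name):
    return "hello " + name

def main():
    print(greet(os.getlogin()))
"""

tree = ast.parse(SOURCE)

# Collect function names and imported module names from the tree.
functions = [
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
]
imports = [
    alias.name
    for node in ast.walk(tree)
    if isinstance(node, ast.Import)
    for alias in node.names
]

print(functions)
print(imports)
```

Writing this extraction once per language is the duplication that a normalized, role-annotated tree is meant to avoid.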
Parsing a file into a UAST is as easy as:
srcd parse uast --lang=LANGUAGE /path/to/file
To launch the web client for parsing, run the following command:
srcd web parse
In your browser, now go to http://localhost:8081
For connecting to the language parsing server (Babelfish) and analyzing the UAST, there are several language clients currently supported and maintained:
The Gitbase Spark connector is under development, which aims to allow for an easy integration with Spark & PySpark:
source{d} has an amazing community of developers & contributors who are interested in Code As Data and/or Machine Learning on Code. Please join us!
Contributions are welcome and very much appreciated. Please refer to our contribution guide for more details.
This software uses code from several open source packages. We'd like to thank the contributors for all their efforts: