Prerequisites • Resources • Workshop
- Who is this for: Security Engineers, Security Researchers, Developers.
- What you'll learn: Learn how to use CodeQL for code exploration and for finding security issues.
- What you'll build: Build a CodeQL query based on a security advisory to find a command injection.
You can choose between two options to run the workshop exercises:
- Option A: GitHub Codespace (Using a Browser or VS Code - CodeQL is run remotely on a Linux based GitHub Codespace in the cloud)
- Option B: Local installation (Using VS Code - CodeQL is run locally on your machine)
Use a remote GitHub Codespace to work on the workshop exercises.
- GitHub account (sign up for free)
- Browser (you can do the whole workshop in a browser - this is the fastest setup) or Visual Studio Code (VS Code) with the GitHub Codespaces extension installed on your local machine.
Note: The first 120 hours per core of Codespace usage are free per month, we use a codespace with 4 cores for this workshop since 4 cores is the current maximum for free accounts. (If you have a Pro account, we recommend switching to an 8-core machine.)
- Login to your GitHub account
- Go to the repo https://github.com/sylwia-budzynska/orangecon-2024-codeql-workshop / (short link: https://gh.io/orangecon-2024-ws)
- Click on Code -> Codespaces
- Click on the plus sign (+) to create a new codespace.
VS Code will start in your browser and a remote Codespace will be built. This may take a few minutes.
If you are asked to open the workspace vscode-codeql-starter.code-workspace
click on "Open Workspace".
- If you want to use VS Code locally, press the three lines button in the top left corner and select "Open VS Code Desktop". The option might take up to a few minutes to appear.
- Continue with Selecting a CodeQL Database
- Then Test your installation
You can see your codespaces on at github.com/codespaces.
Use a local CodeQL installation to work on the workshop exercises.
- Visual Studio Code (VS Code) and
git
installed on your local machine.
- Install VS Code extension for CodeQL
- Run in the terminal:
git clone https://github.com/sylwia-budzynska/orangecon-2024-codeql-workshop.git
cd orangecon-2024-codeql-workshop
git submodule init
git submodule update --recursive
- In VS Code: File -> Open Workspace from File...
vscode-codeql-starter.code-workspace
- Continue with Selecting a CodeQL Database
- Then Test your installation
In case you see errors such as:
Failed to run query: Could not resolve library path for [..]
Could not resolve module [..]
Could not resolve type [..]
It is very likely that you missed cloning the git submodules (namely the ql repo). To fix this run git submodule init && git submodule update --recursive
.
- Make sure you have the workspace
vscode-codeql-starter.code-workspace
open in VS Code. - Go To the CodeQL View
- Click on "Choose Database from Archive" and select the
test-app-db.zip
file in the root of the repository.
Make sure that the previously chosen CodeQL database is selected in the CodeQL view. (Click on "Select" if it's not)
When the database is selected it should look like this (note the checkmark):
- In VS Code: go to the workspace folder:
codeql-custom-queries-python
- Create a new file
test.ql
- add the following content:
select "Hello World!"
- Save file, right click in the file area and choose "CodeQL: Run Query on Selected Database"
- You should see a new tab open with the result "Hello World!"
- QL tutorials
- CodeQL for Python language guide
- CodeQL documentation
- QL language reference
- CodeQL library for Python
- Basic query for Python code
- QL classes
- CodeQL zero to hero part 1: the fundamentals of static analysis for vulnerability research
- CodeQL zero to hero part 2: getting started with CodeQL
- CodeQL zero to hero part 3: security research
Welcome to the workshop "Finding vunlnerabilities with CodeQL"!
This session will introduce fundamentals of security research and static analysis used when looking for vulnerabilities in software. We will use an example of a simple vulnerability, walk through how CodeQL could detect it, and provide examples on how the audience could use CodeQL to find vulnerabilities themselves.
Before we get started it is important that all of the prerequisites are met so you can participate in the workshop.
The workshop is divided into multiple sections and each section consists of exercises that build up to the final query. For each section we provide hints that help you finish the exercise by providing you with references to QL classes and member predicates that you can use.
In this workshop we will look for a known Command injection vulnerabilities in kohya_ss. Such vulnerabilities can occur when information that is controlled by a user makes its way to application code that insecurely constructs a command and executes it. The command insecurely constructed from user input can be rewritten to perform unintended actions such as arbitrary command execution, disclosure of sensitive information.
The command injections discussed in this workshop are CVE-2024-32022, CVE-2024-32026, CVE-2024-32025, CVE-2024-32027.
Think about one of the most well-known vulnerabilities—command injection. It happens if user input is used in functions, that allow running commands in a shell directly on the servier. It allows an attacker to execute operating system (OS) commands on the server that is running an application, and typically fully compromise the application and its data.
The main cause of injection vulnerabilities is untrusted, user-controlled input being used in sensitive or dangerous functions of the program. To represent these in static analysis, we use terms such as data flow, sources, and sinks.
User input generally comes from entry points to an application—the origin of data. These include parameters in HTTP methods, such as GET and POST, or command line arguments to a program. These are called “sources.”
Continuing with our command injection, an example of a dangerous function that should not be called with unsanitized untrusted data could be os.system
. These dangerous functions are called “sinks.” Note that just because a function is potentially dangerous, it does not mean it is immediately an exploitable vulnerability and has to be removed. Many sinks have ways of using them safely. Other exmaples of sinks, that shouldn't be used with user input are MySQLCursor.execute() from the MySQLdb library in Python (causing SQL injection) or Python’s eval() built-in function which evaluates arbitrary expressions (causing code injection).
For a vulnerability to be present, the unsafe, user-controlled input has to be used without proper sanitization or input validation in a dangerous function. In other words, there has to be a code path between the source and the sink, in which case we say that data flows from a source to a sink—there is a “data flow” from the source to the sink.
In this workshop we are going to find command injections, in which user input ends up in an os.system
call.
In the first part of the workshop, we will write CodeQL queries to find sources and sinks, os.system
calls, on an intentionally vulnerable codebase. In the second part of the workshop, we are going to use those queries to find a command injection from a source to a sink in an open source software, kohya_ss v22.6.1.
With the CodeQL query we write, we will be able to find command injections like the one below.
The user input comes from a GET parameter of a Flask (popular web framework in Python) request, which is stored in the variable files
(see #1). files
is then passed to the os.system
call and concatenated with ls
, leading to command injection (see #2).
import os
from flask import Flask, request
app = Flask(__name__)
@app.route("/command1")
def command_injection1():
files = request.args.get('files', '') #1
os.system("ls " + files) #2
We will start by gradually building a query to detect os.system
calls and afterwards a query for sources.
We can find all calls to functions from external libraries (not defined in the codebase) by using CodeQL's ApiGraphs
module.
Use the tempalate below:
import python
import semmle.python.ApiGraphs
from //TODO: fill me in. Start typing `API::` and press `Ctrl+Space` to see a list of available types. Name your variable `call`
select //TODO: fill me in
Right click in the file area and choose "CodeQL: Run Query on Selected Database" to run the query.
Hints
- In the
from
clause, start byAPI::
and pressCtrl + Space
to see the types available in the API Graphs module. - A call is represented by the
API::CallNode
type. Create a variable with that type and the namecall
. - To limit results only to calls in the root folder of the application (called
test-app
) add awhere
clause with the condtionwhere call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
.
Solution
import python
import semmle.python.ApiGraphs
from API::CallNode call
where call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call, "A call"
Hints
- In the
from
clause, create acall
variable of theAPI::CallNode
type. - In the
where
clause, use the equality operator=
to assert thatcall
is equal to theos.system
calls. Use the logical operatorand
to specify several conditions. - To find nodes corresponding to the
os
library, use theAPI::moduleImport()
method with theos
as the argument. To access thesystem
function of theos
library, use thegetMember()
predicate onAPI::moduleImport()
. At last, get anyos.system
calls with thegetACall()
predicate.
Fill out the template:
import python
import semmle.python.ApiGraphs
from API::CallNode call
where call //TODO: fill me in
and
call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call, "Call to `os.system`"
Solution
import python
import semmle.python.ApiGraphs
from API::CallNode call
where call = API::moduleImport("os").getMember("system").getACall() and
call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call, "Call to `os.system`"
We want to find the first arguments to os.system
calls, so later we can see if any user input flows into the first arguments (so into the command that will be executed).
Hints
- Fill out the template:
import python
import semmle.python.ApiGraphs
from API::CallNode call
where call = API::moduleImport("os").getMember("system").getACall() and
call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call // TODO: fill me in. Type a dot `.` right after `call` and press `Ctrl+Space` to see available predicates.
Solution
import python
import semmle.python.ApiGraphs
from API::CallNode call
where call = API::moduleImport("os").getMember("system").getACall() and
call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call.getArg(0), "First argument of an `os.system` call"
classes
in CodeQL can be used to encapsulate reusable portions of logic. Classes represent single sets of values, and they can also include operations (known as member predicates) specific to that set of values. You have already seen numerous instances of CodeQL classes (API::CallNode) and member predicates (getLocation() etc.)
Hints
- To create a new type, we have to extend a supertype, here
API::CallNode
, give it a name, and a characteristic predicate with the same name. We'll name our class,OsSystemSink
.
Fill out the template:
import python
import semmle.python.ApiGraphs
class OsSystemSink extends API::CallNode {
OsSystemSink() {
//TODO: fill me in
}
}
from API::CallNode call
where // TODO: fill me in
and call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call.getArg(0), "Call to os.system"
- Use the magic
this
keyword, that refers to the instances of the call nodes (API::CallNode
s) that we are describing in the class. Usethis
to find the calls toos.system
in the same way you did previously withAPI::moduleImport
. - Change the
where
clause to make yourcall
variable aninstanceof
your newOsSystemSink
class.
Solution
import python
import semmle.python.ApiGraphs
class OsSystemSink extends API::CallNode {
OsSystemSink() {
this = API::moduleImport("os").getMember("system").getACall()
}
}
from API::CallNode call
where call instanceof OsSystemSink
and call.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select call.getArg(0), "First argument of an `os.system` call"
Now we switch to finding sources.
Most sources are already modeled and in CodeQL, and have the RemoteFlowSource
type. We can use the type to find any sources in a codebase.
Hints
- Import
semmle.python.dataflow.new.RemoteFlowSources
to use the RemoteFlowSource type. - In the
from
clause, pressCtrl + Space
to see all available types.
Fill out the template:
import python
import semmle.python.dataflow.new.RemoteFlowSources
from //TODO: fill me in
where //TODO: fill me in
select //TODO: fill me in
Solution
import python
import semmle.python.dataflow.new.RemoteFlowSources
from RemoteFlowSource rfs
where rfs.getLocation().getFile().getRelativePath().regexpMatch("test-app/.*")
select rfs
Kohya_ss is a GUI for Kohya's Stable Diffusion scripts for training, generation and utilities for Stable Diffusion.
In the second part of the workshop, we are going to switch the codebase we are querying on to the kohya_ss
one and find the data flows from sources to sinks in kohya_ss
, which lead to command injections: CVE-2024-32022, CVE-2024-32026, CVE-2024-32025, CVE-2024-32027
Before you start with the next exercise:
- Go to the CodeQL tab in VSCode,
Databases
section, click on "Choose Database from Archive" and select thekohya_ss-db.zip
file in the root of the repository. A checkmark should appear. This will select the CodeQL database you are working on.
Hints
- Use the template below and note:
- in the
isSource
predicate, refine thesource
variable to be of theRemoteFlowSource
type. - in the
isSink
predicate, refine thesink
variable to be the first argument to anos.system
call. Do it using theexists
mechanism and yourOsSystemSink
class.exists
is a mechanism for introducing temporary variables with a restricted scope. You can think of them as their own from-where-select. In this case, useexists
to introduce the variablecall
with typeOsSystemSink
and then refinesink
to be the first argument to thecall
.
/**
* @name Command injection in os.system sink
* @kind path-problem
* @id orangecon/dataflow-query
*/
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.ApiGraphs
import MyFlow::PathGraph
import semmle.python.dataflow.new.RemoteFlowSources
//TODO: add previous class definition here
private module MyConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
// TODO: fill me in
}
predicate isSink(DataFlow::Node sink) {
// TODO: fill me in. Use the `exists` mechanism
exists(<type> <variable> |
sink = ...
)
}
}
module MyFlow = TaintTracking::Global<MyConfig>;
from MyFlow::PathNode source, MyFlow::PathNode sink
where MyFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Command injection"
Solution
/**
* @name Command injection in os.system sink
* @kind path-problem
* @id orangecon/dataflow-query
*/
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.ApiGraphs
import semmle.python.dataflow.new.RemoteFlowSources
import MyFlow::PathGraph
class OsSystemSink extends API::CallNode {
OsSystemSink() {
this = API::moduleImport("os").getMember("system").getACall()
}
}
private module MyConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
source instanceof RemoteFlowSource
}
predicate isSink(DataFlow::Node sink) {
exists(OsSystemSink call |
sink = call.getArg(0)
)
}
}
module MyFlow = TaintTracking::Global<MyConfig>;
from MyFlow::PathNode source, MyFlow::PathNode sink
where MyFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Command injection"
CodeQL queries for Python reside in the ql/python/ql/src/Security
folder. There already exist queries for the most popular vulnerabilities: SQL injection, command injection, code injection, etc. Run the SQL injection query (CWE-089) on the test database (you'll need to select it in the CodeQL extension > Databases. Note the checkmark).
💡 This is very interesting for security researchers - using the default queries, we can get a general idea of what the potential vulnerabilities might exist in a given project.
The power of CodeQL lies in being able to reuse the CodeQL queries and models to run them on any codebase in the same language. We can run CodeQL queries on up to a 1000 repositories at once using multi-repository variant analysis (MRVA). The projects have to be hosted on GitHub.
💡 This is very interesting for security researchers - if you've found a potential dangerous sink or a source, you can add it to CodeQL (or run as a query) and do your research against a thousand repositories at once.
- Follow the setup in the docs.
- Note that MRVA runs using GitHub actions workflows. Actions workflows are free on public repositories, and paid on privates ones.
- After the setup, right-click and choose "CodeQL: Run Variant Analysis"
Today, you have learned how to explore a codebase using CodeQL and how to CodeQL in your own security research workflow.
Check out these resources if you'd like to know more about:
- static analysis and how it works:
- fundamentals of using CodeQL and its query language:
- security research with CodeQL:
If you end up finding a vulnerability using CodeQL, feel free to add it to the CodeQL Wall of Fame.