Tai-e Reference Documentation

This documentation is also available as multiple HTML pages.

1. Setup Tai-e in IntelliJ IDEA

Given the Gradle build script, setting up Tai-e in IntelliJ IDEA is easy, as explained below.

1.1. Step 0

Download IntelliJ IDEA from JetBrains and install it. We recommend installing a recent version (2021.3 or newer) of IntelliJ IDEA for better support of Java 17.

1.2. Step 1

Start opening a project.

Note: If you have already used IntelliJ IDEA and opened some projects, you can choose File > Open… to open the same dialog for the next step.

1.3. Step 2

Select the root directory of Tai-e and click "Open".

1.4. Step 3

IntelliJ IDEA may pop up a dialog asking if you trust the Gradle project. Just click "Trust Project" (Don’t worry. Tai-e is benign 😃).

You may need to wait a moment while IntelliJ IDEA imports Tai-e.

1.5. Step 4

Go to File > Project Structure…, click "Project SDK", select JDK 17. Next, expand "Language level", select "SDK default" (if the default is just 17) or "17 - Sealed types, always-strict floating-point semantics":

Note: If you have not installed JDK 17 yet, select Add SDK > Download JDK…, choose "17" for "Version", any "Vendor" (usually "Oracle OpenJDK"), and an installation "Location" (the default is fine), and then click "Download" to start downloading in the background.

1.6. Step 5

As Tai-e is a Gradle project, IntelliJ IDEA always builds and runs it by delegating to Gradle. However, the JVM used by Gradle may differ from the JVM used by the project on some machines. To ensure consistency, go to File > Settings > …, and change the Gradle JVM to "Project SDK".

1.7. Step 6

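To run Tai-e in IntelliJ IDEA, first choose the main class of Tai-e (pascal.taie.Main) and open "Run Configuration", then configure the program arguments there. The available arguments are described in How to Run Tai-e (command-line options)?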

That’s it! If you finished the above steps without any problems, you have successfully set up Tai-e in IntelliJ IDEA. ヽ(｡◕‿◕｡)ﾉﾟ

2. How to Run Tai-e (command-line options)?

2.1. Prerequisites

Before running Tai-e, please finish the following steps:

Install Java 17 (or higher version) on your system (Tai-e is developed in Java, and it runs on all major operating systems including Windows/Linux/macOS).
Clone submodule java-benchmarks (this repo contains the Java libraries used by the analysis; it is large and may take a while to clone):

git submodule update --init --recursive

The main class (entry) of Tai-e is pascal.taie.Main, and we classified its options into three categories:

Program options: specifying the program to analyze.
Analysis options: specifying the analyses to execute.
Other options

Below we introduce these options.

2.2. Program Options

These options specify the Java program (say P) and library to be analyzed.

Currently, Tai-e leverages the Soot frontend to parse Java programs and help build Tai-e’s IR. Soot contains two frontends, one for parsing Java source files (.java) and the other for bytecode files (.class). The former is outdated (it only partially supports Java versions up to 7), while the latter, though quite robust (it works properly for .class files compiled by up to Java 17), cannot fully satisfy our requirements. Hence, we plan to develop our own frontend for Tai-e to address these issues. For now, we advise using Tai-e to analyze bytecode instead of source code, if possible.

Class paths (-cp, --class-path): -cp <path>[ -cp <path>…]
- Class paths for Tai-e to locate the classes of P. This option can be repeated to specify multiple paths. Currently, Tai-e supports the following types of paths:
  - Relative/Absolute path to a jar file
  - Relative/Absolute path to a directory which contains .class (or .java) files
Application class paths (-acp, --app-class-path): -acp <path>[ -acp <path>…]
- Class paths for Tai-e to locate the application classes of P. The usage of this option is exactly the same as -cp.
- The difference between -cp and -acp is that for the classes in -cp, only the ones referenced by the application/main/input classes are added to the closed world of P; but all classes in -acp will be added to the closed world.
Main class (-m, --main-class): -m <main-class>
- The main class (entry) of P. This class must declare a method with signature public static void main(String[]).
Input classes (--input-classes): --input-classes=<inputClass>[,<inputClass>…]
- Add classes to the closed world of P. Some Java programs use dynamic class loading, so Tai-e cannot reach the relevant classes by following references from the main class. Such classes can be added to the closed world by this option.
- The <inputClass> should follow the format of fully-qualified name in Java, e.g., org.package.MyClass.
Java version (-java): -java <version>
- Default value: 6
- Specify the version of Java library used in the analyses. When this option is given, Tai-e will locate the corresponding Java library in submodule java-benchmarks and add it to the class paths. Currently, we provide libraries for Java versions 3, 4, 5, 6, 7, and 8. Support for newer Java versions is under development.
Prepend JVM Class Path (-pp, --prepend-JVM)
- Prepend the class path of the JVM (which runs Tai-e) to the analysis class path. This means that if you run Tai-e with Java 17, then you can use Tai-e to analyze the library of Java 17. Note that this option will disable -java option.
Allow phantom references (-ap, --allow-phantom)
- Allow Tai-e to process phantom references, i.e., the referenced classes that are not found in the class paths.

2.3. Analysis Options

These options decide the analyses to be executed and their behaviors. We divided these options into two groups: general analysis options which affect multiple analyses, and specific analysis options which are relevant to individual analysis.

2.3.1. General Analysis Options

Build IR in advance (--pre-build-ir)
- Build IRs for all available methods before starting any analyses.
Analysis scope (-scope): -scope <scope>
- Default value: APP
- Specify the analysis scope for class and method analyses. There are three valid choices:
  - APP: application classes only
  - ALL: all classes
  - REACHABLE: classes that are reachable in the call graph (this scope requires analysis cg, i.e., call graph construction)

2.3.2. Specific Analysis Options

To execute an analysis, you need to specify its id and options (if necessary). All available analyses in Tai-e and their information (e.g., id and available options) are listed in the analysis configuration file src/main/resources/tai-e-analyses.yml.

There are two mutually-exclusive approaches to specify the analyses, by command-line options or by file, as described below.

Analysis option (-a, --analysis): -a <id>[=<key>:<value>;…]

Specify analyses by command-line options. To run the analysis with id A, just give -a A. To specify options for A, append them to the analysis id (connected by =) and separate them with ;, for example:

-a A=enableX:true;threshold:100;log-level:info

Note that on Unix-like systems (e.g., Linux), you may need to quote the option values when they include ;, for example:

-a "A=enableX:true;threshold:100;log-level:info"

The option system is expressive, and it supports various types of option values, such as boolean, string, integer, and list.

Option -a is repeatable, so that if you need to execute multiple analyses in a single run of Tai-e, say A1 and A2, just repeat -a like: -a A1 -a A2.
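When Tai-e is launched from another Java program or a test harness, the value of an -a option can be assembled from an analysis id and a map of options. The following is a minimal sketch of the format described above; the class AnalysisOptionSketch and its method format are hypothetical helpers for illustration and are not part of Tai-e.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Tai-e): builds the value of one -a option.
class AnalysisOptionSketch {

    /** Returns "id" if there are no options, otherwise "id=k1:v1;k2:v2". */
    static String format(String id, Map<String, ?> options) {
        if (options.isEmpty()) {
            return id;
        }
        String joined = options.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(";"));
        return id + "=" + joined;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("enableX", true);
        options.put("threshold", 100);
        options.put("log-level", "info");
        System.out.println("-a " + format("A", options));
    }
}
```

The sketch only covers scalar option values; list-valued options are not handled here.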

Plan file (-p, --plan-file): -p <file-path>

Alternatively, you can specify the analyses to be executed (called an analysis plan) in a plan file, and use -p to process the file. Similar to -a, you need to specify the id and options (if necessary) for each analysis in the file. The plan file should be written in YAML.

Note that options -a and -p are mutually-exclusive, thus you cannot specify them simultaneously. See Analysis Management for more information about these two options.

Keep results of specific analyses (-kr, --keep-result): -kr <id>[,<id>…]

By default, Tai-e keeps the results of all executed analyses in memory. If you run multiple analyses and care about the results of only some of them, use this option to specify those analyses. Then, each time Tai-e executes an analysis, it automatically detects and clears the analysis results that are not used by subsequent analyses, which saves memory.

2.4. Other Options

Help (-h, --help)
- Print help information for all available options. This option will disable all other given options.
Options file (--options-file): --options-file <optionsFile>
- You can specify the command-line options in a file and use --options-file to process the file. When this option is given, Tai-e ignores all other command-line options, and only processes the options in the file. The options file should be written in YAML.
- Tai-e will output all options to output/options.yml at each run.
Generate plan file (-g, --gen-plan-file)
- Merely generate analysis plan file (the plan will not be executed) to output/tai-e-plan.yml.
- This option works only when the analysis plan is specified by option -a, and it is provided to help the user compose analysis plan file.
World cache mode (-wc, --world-cache-mode)
- Enable world cache mode to save build time by caching the completely built world on disk.
- When enabled, Tai-e attempts to load the cached world instead of rebuilding it from scratch, which substantially accelerates the world-building process. This applies as long as the analyzed program (i.e., classPath, mainClass, and so on) remains unchanged. This option is particularly useful during analysis development, when the analyzed program stays the same but the analyzer code is modified and run repeatedly.
Specify output directory (--output-dir): --output-dir <outputDir>
- By default, Tai-e stores all outputs, such as logs, IR, and various analysis results, in the output folder within the current working directory. If you prefer to save outputs to a different directory, simply use this option.

2.5. A Usage Example of Command-Line Options

We give an example of how to analyze a program with Tai-e. Suppose we want to analyze a program P as described below:

P consists of two files: foo.jar (a JAR file) and my program/dir/bar.class (a class file).
P's main class is baz.Main
P is analyzed together with Java 8
we run 2-type-sensitive pointer analysis and limit the execution time of pointer analysis to 60 seconds

Then the options would be:

java -jar tai-e-all.jar -cp foo.jar -cp "my program/dir/" -m baz.Main -java 8 -a "pta=cs:2-type;time-limit:60;"

Note again that you need to enclose command-line parameters in quotes if they contain semicolons (;) or spaces.
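Quoting is a concern of the shell, not of Tai-e. If the same run is started from Java with ProcessBuilder, each list element is passed as exactly one argument, so the path with a space and the option value with semicolons need no extra quotes. The sketch below reuses the example command as is; the class name RunTaieSketch is hypothetical, and it assumes tai-e-all.jar is in the working directory.

```java
import java.util.List;

// Sketch: the usage example expressed as an argument list for ProcessBuilder.
class RunTaieSketch {

    /** Arguments of the usage example; each element is passed as one argument. */
    static List<String> command() {
        return List.of(
                "java", "-jar", "tai-e-all.jar",
                "-cp", "foo.jar",
                "-cp", "my program/dir/",   // contains a space, no quoting needed
                "-m", "baz.Main",
                "-java", "8",
                "-a", "pta=cs:2-type;time-limit:60;");
    }

    public static void main(String[] args) throws Exception {
        // inheritIO() forwards the child's console output to this process.
        Process process = new ProcessBuilder(command()).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```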

3. How to Specify and Access Types, Classes, and Class Members (Methods and Fields)

Java programs are built using types and classes, which consist of class members such as methods and fields. Tai-e assigns a unique identifier, known as a signature, to each type, class, and class member. These signatures enable users to easily configure and specify the behavior of program analyzers for specific elements, such as in taint configuration (see How to Use Taint Analysis?). Additionally, they allow analysis developers to easily retrieve and manipulate program elements through Tai-e’s convenient APIs.

In some cases, it may be necessary to specify a large number of related classes or class members within a configuration or when implementing a particular program analysis. To streamline this process, we have designed and implemented various signature patterns and matchers for classes, methods, and fields, enabling you to specify and retrieve multiple elements using a single signature pattern.

This documentation will guide you through the format of signatures for types, classes, and class members, as well as the APIs for accessing these program elements via their signatures.

Since generic types are erased in Java, type signatures, along with class and class member signatures, do not include type parameters.

3.1. Type Signatures

In this section, we introduce the signatures for various Java types, including primitive types, reference types, and the void type.

3.1.1. Primitive Types

The signatures for the eight Java primitive types are simply their names: byte, short, int, long, float, double, char, and boolean.

3.1.2. Reference Types

Java reference types include class types (encompassing interfaces and enums) and array types. The signature formats for these types are outlined below.

Class Types (Including Interfaces and Enums)

The signature for a class type is its fully-qualified class name, which includes the package name. For an inner class, insert a $ between the outer class name and the inner class name. Here are some examples:

java.lang.String
pascal.taie.Main
org.example.MyClass
java.util.Map$Entry

Array Types

An array type signature consists of its base type followed by one or more [], with the number of [] indicating the array’s dimensions. Here are some examples:

java.lang.String[]
org.example.MyClass[][]
char[]

3.1.3. Void Type

The signature for the void type is simply void. This appears in Method Signatures for methods that do not return a value.
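For readers who want to produce such signatures from ordinary Java reflection objects, the following sketch converts a java.lang.Class into the textual format described above. It relies only on the standard Class API; the class name TypeSignatureSketch is hypothetical and the code is not taken from Tai-e.

```java
import java.util.Map;

// Sketch: derives a type signature in the format above from a java.lang.Class.
class TypeSignatureSketch {

    static String of(Class<?> type) {
        StringBuilder brackets = new StringBuilder();
        // Peel off array dimensions first: getName() on an array class
        // returns a descriptor-like form such as "[Ljava.lang.String;".
        while (type.isArray()) {
            brackets.append("[]");
            type = type.getComponentType();
        }
        // For primitives, void, and classes, getName() already yields
        // "int", "void", or "java.util.Map$Entry" ('$' for nested classes).
        return type.getName() + brackets;
    }

    public static void main(String[] args) {
        System.out.println(of(Map.Entry.class));
        System.out.println(of(String[][].class));
        System.out.println(of(char[].class));
        System.out.println(of(void.class));
    }
}
```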

3.1.4. Programmatically Accessing a Type via Signature

For analysis developers, Tai-e provides convenient APIs to access various types. All the classes related to types, mentioned below, are located in the pascal.taie.language.type package.

In Tai-e, the TypeSystem class (accessible via World.get().getTypeSystem()) offers APIs to retrieve all types (except void, which is discussed later):

TypeSystem.getPrimitiveType(String): Retrieves a primitive type by its signature.
TypeSystem.getClassType(String): Retrieves a class type by its signature.
TypeSystem.getArrayType(Type,int): Retrieves an array type by its base type and the number of dimensions.
TypeSystem.getType(String): Retrieves a primitive type, class type, or array type by its signature.

Additionally, primitive types and the void type are implemented as enums in Tai-e, and can be directly accessed through their respective classes, such as IntType.INT and VoidType.VOID.

3.2. Class and Class Member Signatures

In this section, we introduce the signatures for classes and their members, specifically methods and fields. While constructors are typically considered class members, in Tai-e, they are treated as methods with a special name <init>, as explained in Method Signatures.

3.2.1. Class Signatures

Unsurprisingly, the format for class signatures is identical to that of class types, so we won’t repeat the details here.

3.2.2. Method Signatures

The format of a method signature is as follows:

<CLASS_TYPE: RETURN_TYPE METHOD_NAME(PARAMETER_TYPES)>

CLASS_TYPE: The signature of the class in which the method is declared.
RETURN_TYPE: The signature of the method’s return type.
METHOD_NAME: The name of the method.
PARAMETER_TYPES: A ,-separated list of parameter type signatures (Do not insert spaces around the ,!). If the method has no parameters, use ().

Here are some examples of method signatures:

<java.lang.Object: java.lang.String toString()>
<java.lang.Object: boolean equals(java.lang.Object)>
<java.util.Map: java.lang.Object put(java.lang.Object,java.lang.Object)>

As mentioned earlier, constructors are treated as methods in Tai-e. Each constructor has the name <init>, and its return type is always void. For example, the constructor signatures for ArrayList are:

<java.util.ArrayList: void <init>()>
<java.util.ArrayList: void <init>(int)>
<java.util.ArrayList: void <init>(java.util.Collection)>

Another special class member is the static initializer (also known as the class initializer), which is treated as a method with no arguments and no return value in Tai-e. The method name for a static initializer is <clinit>. For example, the signature of static initializer for Object is <java.lang.Object: void <clinit>()>.
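Writing method signatures by hand is error-prone (for instance, a stray space after a comma). As an illustration of the format, the sketch below builds signatures from java.lang.reflect objects, whose parameter and return types are already erased. MethodSignatureSketch and all of its methods are hypothetical helpers, not part of Tai-e.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.stream.Collectors;

// Sketch: builds method and constructor signatures via reflection.
class MethodSignatureSketch {

    /** Type signature: element type name followed by one [] per dimension. */
    static String typeName(Class<?> type) {
        String brackets = "";
        while (type.isArray()) {
            brackets += "[]";
            type = type.getComponentType();
        }
        return type.getName() + brackets;
    }

    /** Joins parameter types with "," and no surrounding spaces. */
    static String params(Class<?>[] types) {
        return Arrays.stream(types)
                .map(MethodSignatureSketch::typeName)
                .collect(Collectors.joining(","));
    }

    static String of(Method m) {
        return "<" + typeName(m.getDeclaringClass()) + ": "
                + typeName(m.getReturnType()) + " " + m.getName()
                + "(" + params(m.getParameterTypes()) + ")>";
    }

    /** Constructors use the name <init> and the return type void. */
    static String of(Constructor<?> c) {
        return "<" + typeName(c.getDeclaringClass()) + ": void <init>("
                + params(c.getParameterTypes()) + ")>";
    }

    static String ofMethod(Class<?> owner, String name, Class<?>... paramTypes) {
        try {
            return of(owner.getMethod(name, paramTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String ofConstructor(Class<?> owner, Class<?>... paramTypes) {
        try {
            return of(owner.getConstructor(paramTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ofMethod(Object.class, "equals", Object.class));
        System.out.println(ofConstructor(java.util.ArrayList.class, int.class));
    }
}
```

Note that getMethod only finds public methods, and that the class part of the result is the declaring class, which may be a superclass of the class passed in.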

3.2.3. Field Signatures

Like methods, field signatures uniquely identify fields within a Java program. The format of a field signature is as follows:

<CLASS_TYPE: FIELD_TYPE FIELD_NAME>

CLASS_TYPE: The signature of the class where the field is declared.
FIELD_TYPE: The signature of the field’s type.
FIELD_NAME: The name of the field.

For example, the signature for the field info in the following code:

package org.example;

class MyClass {
    String info;
}

is:

<org.example.MyClass: java.lang.String info>

3.2.4. Programmatically Accessing a Class or Member via Signature

Tai-e offers convenient APIs through the pascal.taie.language.classes.ClassHierarchy class, allowing analysis developers to access a class or member by its signature. The available methods are:

ClassHierarchy.getClass(String): Retrieves a class (JClass) by its signature.
ClassHierarchy.getMethod(String): Retrieves a method (JMethod) by its signature.
ClassHierarchy.getField(String): Retrieves a field (JField) by its signature.

3.3. Signature Patterns

Sometimes, users need to specify multiple related classes or members in a configuration, such as in taint analysis. To simplify this process, we have designed and implemented the signature pattern mechanism, similar to regular expressions but specifically tailored for classes and members. This allows users to conveniently specify multiple related classes or members using a single signature pattern.

In this section, we will introduce the formats of signature patterns and explain how to use them in analysis development.

3.3.1. Name Wildcards

Signatures are composed of various names, including class names, method names, field names, and type names within method and field signatures. To simplify specifying these names, we introduce the concept of name wildcards, which form the foundation of signature patterns. A name wildcard is any name that contains zero or more * characters, where each * can match any sequence of characters.

Here are some examples:

java.util.* matches all classes in the java.util package and its sub-packages (like java.util.regex)
get* matches all method names that start with get (like getName or getKey)
Names without any * characters match exactly (like toString only matches the toString methods)
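The matching rule can be made concrete with a small sketch that translates a name wildcard into a regular expression: every * becomes ".*" and everything else is matched literally. This is one plausible reading of the rule stated above, not Tai-e's own matcher; the class NameWildcardSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of name-wildcard matching (an illustration, not Tai-e's implementation).
class NameWildcardSketch {

    /** Translates a name wildcard into a regex: '*' becomes ".*", the rest is literal. */
    static Pattern compile(String wildcard) {
        // Limit -1 keeps trailing empty parts, so "get*" splits into ["get", ""].
        String regex = Arrays.stream(wildcard.split("\\*", -1))
                .map(part -> part.isEmpty() ? "" : Pattern.quote(part))
                .collect(Collectors.joining(".*"));
        return Pattern.compile(regex);
    }

    static boolean matches(String wildcard, String name) {
        return compile(wildcard).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("java.util.*", "java.util.regex.Pattern"));
        System.out.println(matches("get*", "getName"));
        System.out.println(matches("toString", "toStringHelper"));
    }
}
```

Quoting the literal parts matters because characters such as . and $ are common in class names and have a special meaning in regular expressions.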

3.3.2. Class Signature Pattern

Class signature patterns come in two forms:

Basic Pattern: A name wildcard that directly matches class names.
- Example: java.util.* matches all classes in the java.util package
- Example: java.util.HashMap matches exactly that class
Subclass Pattern: A name wildcard followed by ^ that matches both the specified classes and all their subclasses.
- Example: java.util.List^ matches List and all classes that extend or implement it
- Example: java.lang.*Exception^ matches all exception classes in the java.lang package and their subclasses, including classes like RuntimeException, IllegalArgumentException, and any custom exceptions that extend these classes

The subclass pattern is particularly useful when you need to capture an entire class hierarchy without listing each class individually.
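To illustrate how the two forms differ, the sketch below evaluates a class signature pattern against classes loaded in the running JVM: a basic pattern tests only the class's own name, while a pattern ending in ^ also tests every superclass and superinterface. It works on java.lang.Class rather than on Tai-e's JClass, and ClassPatternSketch is a hypothetical name; treat it as a model of the semantics, not as Tai-e's implementation.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch: basic vs. subclass (^) class patterns, evaluated with reflection.
class ClassPatternSketch {

    static boolean matches(String pattern, Class<?> candidate) {
        boolean withSubclasses = pattern.endsWith("^");
        String wildcard = withSubclasses
                ? pattern.substring(0, pattern.length() - 1)
                : pattern;
        Pattern regex = toRegex(wildcard);
        return withSubclasses
                ? anySupertypeMatches(regex, candidate)
                : regex.matcher(candidate.getName()).matches();
    }

    /** True if the type itself or any of its supertypes has a matching name. */
    private static boolean anySupertypeMatches(Pattern regex, Class<?> type) {
        if (type == null) {
            return false;
        }
        if (regex.matcher(type.getName()).matches()) {
            return true;
        }
        if (anySupertypeMatches(regex, type.getSuperclass())) {
            return true;
        }
        for (Class<?> iface : type.getInterfaces()) {
            if (anySupertypeMatches(regex, iface)) {
                return true;
            }
        }
        return false;
    }

    /** Name wildcard to regex: '*' becomes ".*", everything else is literal. */
    private static Pattern toRegex(String wildcard) {
        return Pattern.compile(Arrays.stream(wildcard.split("\\*", -1))
                .map(part -> part.isEmpty() ? "" : Pattern.quote(part))
                .collect(Collectors.joining(".*")));
    }

    public static void main(String[] args) {
        System.out.println(matches("java.util.List^", java.util.ArrayList.class));
        System.out.println(matches("java.util.List", java.util.ArrayList.class));
    }
}
```

For example, java.io.UncheckedIOException is not in java.lang, yet it satisfies java.lang.*Exception^ because its superclass java.lang.RuntimeException matches the wildcard.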

3.3.3. Method Signature Pattern

Method signature patterns follow a format similar to method signatures but with added flexibility to match multiple methods. The general format is:

<CLASS_PATTERN: RETURN_TYPE_PATTERN METHOD_NAME_PATTERN(PARAMETER_TYPE_PATTERNS)>

Each component of the method signature pattern supports different matching mechanisms:

CLASS_PATTERN: Can be a class signature pattern (basic or subclass pattern).
RETURN_TYPE_PATTERN: A type signature pattern.
METHOD_NAME_PATTERN: Can be a name wildcard.
PARAMETER_TYPE_PATTERNS: A ,-separated list of type signature patterns (no spaces around ,), which also supports parameter wildcards.

Type Signature Patterns:

For class types, they are equivalent to class patterns.
For other types, they use simple name wildcard matching.

Parameter Wildcards: Method signature patterns support parameter wildcards, allowing you to specify repetition of type signature patterns. There are three types of repetition:

Repeat exactly N times: TYPE_PATTERN{N}
Repeat at least N times: TYPE_PATTERN{N+}
Repeat between M and N times: TYPE_PATTERN{M-N}

Here are some examples of method signature patterns:

<java.util.List^: * get*(*)>

This pattern matches all methods in List and its implementations that start with get and have one parameter of any type.

<java.lang.*: void set*(java.lang.String,*)>

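This pattern matches all methods that start with set, return void, and have two parameters (a String followed by a parameter of any type), in classes whose names match java.lang.*. Because * matches any sequence of characters, this covers classes in the java.lang package and, as with java.util.* above, in its sub-packages.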

<*: java.lang.String toString()>

This pattern matches toString methods that return String and have no parameters, in any class.

<java.util.Map^: * *(java.lang.Object^,*)>

This pattern matches all methods in Map and its implementations that have two parameters: the first being Object or any of its subclasses, and the second being any type.

<java.lang.String: * format(java.lang.String,java.lang.Object^{0+})>

This pattern matches format methods in the String class that take a String parameter followed by zero or more Object (or subclass) parameters.

<java.util.Arrays: * asList(java.lang.Object{1-5})>

This pattern matches asList methods in the Arrays class that take between 1 and 5 Object parameters.

Method signature patterns provide a powerful way to specify groups of related methods across multiple classes, greatly simplifying configuration in various analyses. The addition of parameter wildcards further enhances this flexibility, allowing for precise matching of methods with varying numbers of parameters.
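The three repetition forms of parameter wildcards can be read as a (minimum, maximum) pair attached to a type pattern. The following sketch parses a single entry of PARAMETER_TYPE_PATTERNS into that pair; an entry without a {…} suffix is treated as exactly one occurrence. ParamWildcardSketch is a hypothetical class written for this explanation, not Tai-e's parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: parses TYPE_PATTERN{N}, TYPE_PATTERN{N+}, and TYPE_PATTERN{M-N}.
final class ParamWildcardSketch {

    final String typePattern;
    final int min;
    final int max; // Integer.MAX_VALUE means "no upper bound"

    private ParamWildcardSketch(String typePattern, int min, int max) {
        this.typePattern = typePattern;
        this.min = min;
        this.max = max;
    }

    // group 1: type pattern, group 2: first number,
    // group 3: "+" or "-N" (optional), group 4: N of "-N"
    private static final Pattern SUFFIX =
            Pattern.compile("(.*)\\{(\\d+)(\\+|-(\\d+))?\\}");

    static ParamWildcardSketch parse(String text) {
        Matcher m = SUFFIX.matcher(text);
        if (!m.matches()) {
            return new ParamWildcardSketch(text, 1, 1); // no repetition suffix
        }
        int first = Integer.parseInt(m.group(2));
        int last;
        if (m.group(3) == null) {
            last = first;                        // {N}
        } else if (m.group(3).equals("+")) {
            last = Integer.MAX_VALUE;            // {N+}
        } else {
            last = Integer.parseInt(m.group(4)); // {M-N}
        }
        return new ParamWildcardSketch(m.group(1), first, last);
    }

    /** True if 'count' consecutive parameters are allowed by this entry. */
    boolean accepts(int count) {
        return count >= min && count <= max;
    }

    public static void main(String[] args) {
        ParamWildcardSketch p = parse("java.lang.Object^{0+}");
        System.out.println(p.typePattern + " min=" + p.min + " max=" + p.max);
    }
}
```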

3.3.4. Field Signature Pattern

Field signature patterns follow a format similar to field signatures but with added flexibility to match multiple fields. The format of a field signature pattern is:

<CLASS_PATTERN: FIELD_TYPE_PATTERN FIELD_NAME_PATTERN>

This format is simpler than the method signature pattern, as field signatures do not include a parameter list. Each component (CLASS_PATTERN, FIELD_TYPE_PATTERN, and FIELD_NAME_PATTERN) supports the same matching mechanisms as in method signature patterns.

Example:

<java.util.List^: * size>

This pattern matches the size field in java.util.List and its subclasses, regardless of the field’s type.

3.3.5. Programmatically Accessing Multiple Classes or Members via Signature Pattern

Tai-e provides convenient APIs for analysis developers to retrieve multiple classes or members using signature patterns. To use these, developers first create a pascal.taie.language.classes.SignatureMatcher object, passing a ClassHierarchy as an argument. They can then use the following APIs:

SignatureMatcher.getClasses(String): Retrieves classes (JClass) based on the specified class signature pattern.
SignatureMatcher.getMethods(String): Retrieves methods (JMethod) based on the specified method signature pattern.
SignatureMatcher.getFields(String): Retrieves fields (JField) based on the specified field signature pattern.

4. How to Use Taint Analysis?

Tai-e provides a configurable and powerful taint analysis for detecting security vulnerabilities. We develop taint analysis based on the pointer analysis framework, enabling it to leverage advanced techniques (including various context sensitivity and heap abstraction techniques) and implementations (including the handling of complex language features such as reflection and lambda functions) provided by the pointer analysis framework. This documentation is dedicated to providing guidance on using our taint analysis.

4.1. Enabling Taint Analysis

Taint analysis can be enabled in one of two ways, or both approaches together:

using the YAML configuration file.
using the programmatic configuration provider.

4.1.1. YAML Configuration File

In Tai-e, taint analysis is designed and implemented as a plugin of pointer analysis framework. To enable taint analysis with the YAML configuration file, simply start pointer analysis with option taint-config, for example:

-a pta=...;taint-config:<path/to/config>;...

then Tai-e will run taint analysis (together with pointer analysis) using a configuration file specified by <path/to/config> (if you need to specify multiple configuration files, please refer to Multiple Configuration Files). In the upcoming section, we will provide a comprehensive guide on crafting a configuration file.

You could use various pointer analysis techniques to obtain different precision/efficiency tradeoffs. For additional details, please refer to Pointer Analysis Framework.

Interactive Mode

Interactive mode enables users to modify the taint configuration file(s) and re-run taint analysis without needing to re-run the whole program analysis.

This feature significantly speeds up both taint configuration development/debugging and production scenarios that run multiple configuration sets.

To enable interactive mode, append the additional option taint-interactive-mode:true when starting the taint analysis, for example:

-a pta=...;taint-config:<path/to/config>;taint-interactive-mode:true;...

Once the taint analysis completes, Tai-e will enter an interactive state where you can:

Modify the taint configuration file(s) and press r in the console to re-run the taint analysis with your updated configuration.
Press e in the console to exit interactive mode.

4.1.2. Programmatic Taint Configuration Provider

In addition to the YAML configuration file, Tai-e also supports programmatic taint configuration.

To enable it, start pointer analysis with option taint-config-providers, for example:

-a pta=...;taint-config-providers:[my.example.MyTaintConfigProvider];...

The class my.example.MyTaintConfigProvider should extend the class pascal.taie.analysis.pta.plugin.taint.TaintConfigProvider, for example:

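package my.example;

// Source and Sink are assumed to live in the same package as TaintConfigProvider.
import pascal.taie.analysis.pta.plugin.taint.Sink;
import pascal.taie.analysis.pta.plugin.taint.Source;
import pascal.taie.analysis.pta.plugin.taint.TaintConfigProvider;
import pascal.taie.language.classes.ClassHierarchy;
import pascal.taie.language.type.TypeSystem;

import java.util.List;

public class MyTaintConfigProvider extends TaintConfigProvider {
    public MyTaintConfigProvider(ClassHierarchy hierarchy, TypeSystem typeSystem) {
        super(hierarchy, typeSystem);
    }

    // Return the sources of this configuration (empty in this skeleton).
    @Override
    protected List<Source> sources() { return List.of(); }

    // Return the sinks of this configuration (empty in this skeleton).
    @Override
    protected List<Sink> sinks() { return List.of(); }
// ...
}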

4.2. Configuring Taint Analysis

In this section, we present instructions on configuring sources, sinks, taint transfers, and sanitizers for the taint analysis using a YAML configuration file. To get a broad understanding, you can start by examining the taint-config.yml file from our test cases as an illustrative example.

Certain configuration values include special characters, such as spaces, [, and ]. To ensure these values are correctly interpreted by the YAML parser, please make sure to enclose them within quotation marks.

4.2.1. Basic Concepts

We first present several basic concepts employed in the configuration.

Type, Method, and Field Signatures

In taint configuration, you’ll need to specify types, methods, and fields within the program. This is done using their signatures, as detailed in How to Specify and Access Types, Classes, and Class Members (Methods and Fields).

To simplify the configuration process, our taint analysis also supports Signature Patterns. These patterns provide a more flexible way to specify program elements. For example, instead of listing every method in a class, you might use a pattern to match all methods with a certain return type or parameter list.

This approach reduces the amount of configuration needed and makes it easier to maintain and update your taint analysis settings.

Index Reference

In taint analysis configuration, it’s often necessary to specify:

A variable
A field of an object referenced by a variable
Elements of an array referenced by a variable

These specifications may be required at a call site or within a method. To facilitate this, we introduce the concept of index reference.

An index reference consists of two parts:

Index: This refers to the specified variable (also called the variable index).
Reference: This indicates whether we’re referring to:
- The variable itself
- A field of the object referenced by the variable
- Elements of the array referenced by the variable

This combination of variable indexes and references provides a flexible way to pinpoint exactly which program element you want to include in your taint analysis configuration. Let’s break this down.

Variable Index of A Call Site

We classify variables at a call site into several kinds, and provide their corresponding indexes below:

Kind	Description	Index
Result variable	The variable receiving the method call result (i.e., the left-hand side or LHS variable)	`result`
Base variable	The variable pointing to the receiver object of the method call (absent in static method calls)	`base`
Arguments	The arguments of the call site, indexed starting from 0	`0`, `1`, `2`, …

For example, for a method call

r = o.foo(p, q);

The index of variable r is result.
The index of variable o is base.
The indexes of variables p and q are 0 and 1.

Variable Index of A Method

Within a method, we currently support indexing for method parameters. Similar to call site arguments, the parameters are indexed starting from 0. For example, the indexes of parameters t, s, and o of method foo below are 0, 1, and 2.

package org.example;

class MyClass {
    void foo(T t, String s, Object o) {
        ...
    }
}

Reference

The reference part is optional and specifies which aspect of the indexed variable we’re interested in:

No reference: Refers to the variable itself as specified by the index.
Field reference: Append .<field name> to the index (e.g., 0.f refers to field f of the object pointed to by the variable with index 0).
Array element reference: Append [*] to the index (e.g., result[*] refers to all elements of the array pointed to by the result variable).

[ and ] are special characters in YAML, so you need to enclose them in quotes like "result[*]".

This flexible system allows for precise specification of variables, object fields, and array elements in various contexts within your taint analysis configuration.
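As a compact summary of the syntax, the sketch below splits an index reference string into its index part and its reference part. It follows only the rules stated in this section ("result", "base", or a number, optionally followed by .<field name> or [*]); IndexRefSketch is a hypothetical class and does not reproduce Tai-e's own parsing code.

```java
// Hypothetical parser for index references (an illustration, not Tai-e's code).
final class IndexRefSketch {

    enum Kind { VAR, FIELD, ARRAY }

    final String index; // "result", "base", or an argument/parameter number
    final Kind kind;
    final String field; // field name for Kind.FIELD, otherwise null

    private IndexRefSketch(String index, Kind kind, String field) {
        boolean validIndex = index.equals("result") || index.equals("base")
                || index.matches("\\d+");
        if (!validIndex || (kind == Kind.FIELD && field.isEmpty())) {
            throw new IllegalArgumentException("Invalid index reference");
        }
        this.index = index;
        this.kind = kind;
        this.field = field;
    }

    static IndexRefSketch parse(String text) {
        if (text.endsWith("[*]")) { // array element reference
            String index = text.substring(0, text.length() - 3);
            return new IndexRefSketch(index, Kind.ARRAY, null);
        }
        int dot = text.indexOf('.');
        if (dot >= 0) {             // field reference
            return new IndexRefSketch(
                    text.substring(0, dot), Kind.FIELD, text.substring(dot + 1));
        }
        return new IndexRefSketch(text, Kind.VAR, null); // the variable itself
    }

    public static void main(String[] args) {
        for (String text : new String[] {"result", "base", "0.f", "result[*]"}) {
            IndexRefSketch ref = parse(text);
            System.out.println(text + " -> index=" + ref.index
                    + ", kind=" + ref.kind + ", field=" + ref.field);
        }
    }
}
```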

4.2.2. Sources

Taint objects are generated by sources. In the configuration file, sources are specified as a list of source entries following key sources, for example:

sources:
  - { kind: call, method: "<javax.servlet.ServletRequestWrapper: java.lang.String getParameter(java.lang.String)>", index: result }
  - { kind: param, method: "<com.example.Controller: java.lang.String index(javax.servlet.http.HttpServletRequest)>", index: 0 }
  - { kind: field, field: "<SourceSink: java.lang.String info>" }

Our taint analysis supports several kinds of sources, as introduced in the next sections.

Call Sources

This is probably the most commonly used source kind, for cases where taint objects are generated at call sites. The format of this kind of source is:

- { kind: call, method: METHOD_SIGNATURE, index: INDEX_REF, type: TYPE }

If you write such a source in the configuration, then when the taint analysis finds that method METHOD_SIGNATURE is invoked at call site l, it will generate a taint object of type TYPE for the reference indicated by INDEX_REF at call site l. For how to specify METHOD_SIGNATURE and INDEX_REF, please refer to Type, Method, and Field Signatures and Variable Index of A Call Site.

We use underlining to indicate that type: TYPE is optional in a call source. When it is not specified, the taint analysis uses the corresponding declared type from the method as the type of the generated taint object: the return type for the result variable, the declaring class type for the base variable, and the parameter types for arguments.

You may wonder why type: TYPE is needed at all when the declared type can already be obtained from the method. The reason is that the type of a taint object should match the type of the corresponding actual object, and in some situations the actual object type is a subclass of the declared type. In such cases, type: TYPE specifies the precise object type. As an illustration, consider the code snippet below. The source method Z.source() declares its return type as X, but it actually returns an object of type Y, which is a subclass of X. Therefore, we can define type: Y for the taint object generated by the Z.source() method.

class X {...}

class Y extends X { ... }

class Z {
    X source() {
        ...
        return new Y();
    }
}

Throughout the rest of this documentation, we will also use underlining to indicate optional elements. The reasons for specifying type: TYPE in other cases are similar to those for call sources. In these situations, the type of generated taint object may be a subclass of the corresponding declared type.
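The difference between the declared type and the actual type in the X/Y/Z snippet above can be observed directly at run time. The sketch below fills in the elided class bodies trivially; the helper class DeclaredVsActualType is hypothetical and exists only for this illustration.

```java
/** Mirrors the X/Y/Z snippet above, with the elided bodies left empty. */
class X { }

class Y extends X { }

class Z {
    // The declared return type is X ...
    X source() {
        // ... but the object actually returned is a Y.
        return new Y();
    }
}

/** Hypothetical helper that compares the declared and the actual type. */
final class DeclaredVsActualType {

    /** Simple name of the return type declared by Z.source(). */
    static String declaredReturnType() {
        try {
            return Z.class.getDeclaredMethod("source").getReturnType().getSimpleName();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Simple name of the class of the object that Z.source() returns. */
    static String actualReturnType() {
        return new Z().source().getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println("declared: " + declaredReturnType());
        System.out.println("actual:   " + actualReturnType());
    }
}
```

Without type: Y, the taint object for the result of Z.source() would get the declared type X; with type: Y, it matches the object that is really returned.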

Parameter Sources

Certain methods, such as entry methods, do not have explicit call sites within the program, making it impossible to generate taint objects for variables at their call sites. Nevertheless, there are situations where generating taint objects for their parameters can be useful. To address this requirement, our taint analysis provides the capability to configure parameter sources:

- { kind: param, method: METHOD_SIGNATURE, index: INDEX_REF, type: TYPE }

If you include this type of source in the configuration, when the taint analysis determines that the method METHOD_SIGNATURE is reachable, it will create a taint object of TYPE for the reference indicated by INDEX_REF. For guidance on specifying METHOD_SIGNATURE and INDEX_REF, please refer to the Type, Method, and Field Signatures and Variable Index of A Method.

Field Sources

Our taint analysis also enables users to designate fields as taint sources using the following format:

- { kind: field, field: FIELD_SIGNATURE, type: TYPE }

When you include this type of source in the configuration, if the taint analysis identifies that the field FIELD_SIGNATURE is loaded into a variable v (e.g., v = o.f), it will generate a taint object of TYPE for v. For instructions on specifying FIELD_SIGNATURE, please refer to Type, Method, and Field Signatures.

4.2.3. Sinks

At present, our taint analysis supports designating variables at call sites of sink methods as sinks. In the configuration file, sinks are defined as a list of sink entries under the key sinks:

sinks:
  - { method: METHOD_SIGNATURE, index: INDEX_REF }
  - ...

If you include this type of sink in the configuration, when the taint analysis identifies that the method METHOD_SIGNATURE is invoked at call site l and the reference at l, as indicated by INDEX_REF, points to any taint objects, it will generate reports for the detected taint flows.

For guidance on specifying METHOD_SIGNATURE and INDEX_REF, please refer to Type, Method, and Field Signatures and Variable Index of A Call Site.

4.2.4. Taint Transfers

In taint analysis, taint is associated with data content and can move between objects. This process, known as taint transfer, is common in real-world code. Effectively managing these transfers is crucial for detecting potential security vulnerabilities.

Introduction

Here, we utilize an example to demonstrate the concept of taint transfer and its impact on taint analysis.


1  String taint = getSecret(); // source
2  StringBuilder sb = new StringBuilder();
3  sb.append("abc");
4  sb.append(taint); // taint is transferred to sb
5  sb.append("xyz");
6  String s = sb.toString(); // taint is transferred to s
7  leak(s); // sink

Suppose we consider getSecret() as the source and leak() as the sink. In this scenario, the code at line 1 acquires secret data in the form of a string and stores it in the variable taint. This secret data eventually flows to the sink at line 7 through two taint transfers:

The method call to append() at line 4 adds the contents of taint to sb, resulting in the StringBuilder object pointed to by sb containing the secret data. Therefore, it should also be regarded as tainted data. In essence, the append() call at line 4 transfers taint from taint to sb.
The method call to toString() at line 6 converts the StringBuilder to a String, which holds the same content as the StringBuilder, including the secret data. In essence, toString() transfers taint from sb to s.

In this example, if the taint analysis fails to propagate taint from taint to sb and from sb to s, it will be unable to detect the privacy leakage. To address such scenarios, our taint analysis allows users to specify which methods trigger taint transfers, facilitating the appropriate propagation of taint flow.

Configuration

In this section, we provide instructions on configuring taint transfers. A taint transfer is essentially the propagation of taint from specific references (e.g., variables or fields) to other references, triggered by a method call at a call site. We refer to the source of a taint transfer as the from-ref and the target as the to-ref. For example, in the case of sb.append(taint) from the previous example, taint serves as the from-ref, and sb acts as the to-ref.

In the configuration file, taint transfers are defined as a list of transfer entries under the key transfers, as shown in the example below:

transfers:
  - { method: "<java.lang.StringBuilder: java.lang.StringBuilder append(java.lang.String)>", from: 0, to: base }
  - { method: "<java.lang.StringBuilder: java.lang.String toString()>", from: base, to: result }

These two entries handle the taint transfers of the example in Introduction. Each transfer entry follows this format:

- { method: METHOD_SIGNATURE, from: INDEX_REF, to: INDEX_REF, type: TYPE }

Here, METHOD_SIGNATURE represents the method that triggers taint transfer, from and to specify the from-ref and to-ref at the call site. TYPE denotes the type of the transferred taint object, which is also optional.
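To illustrate how the from and to parts of a transfer entry act at a call site, here is a deliberately simplified model. It is not how Tai-e implements taint transfer (Tai-e works on pointer analysis results, not on a single pass over named variables); the classes Transfer and Call, and the use of bare method names instead of full signatures, are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of transfer entries; all names here are hypothetical. */
final class TransferSketch {

    /** One transfer entry; from/to are "base", "result", or an argument number. */
    static final class Transfer {
        final String method;
        final String from;
        final String to;

        Transfer(String method, String from, String to) {
            this.method = method;
            this.from = from;
            this.to = to;
        }
    }

    /** One call site of the form: result = base.method(args). */
    static final class Call {
        final String result; // null if the return value is unused
        final String base;
        final String method;
        final List<String> args;

        Call(String result, String base, String method, String... args) {
            this.result = result;
            this.base = base;
            this.method = method;
            this.args = Arrays.asList(args);
        }

        /** Maps an index to the variable it denotes at this call site. */
        String resolve(String index) {
            if (index.equals("base")) return base;
            if (index.equals("result")) return result;
            return args.get(Integer.parseInt(index));
        }
    }

    /** Processes the calls in program order and returns all tainted variables. */
    static Set<String> propagate(Set<String> sources, List<Call> calls,
                                 List<Transfer> transfers) {
        Set<String> tainted = new HashSet<>(sources);
        for (Call call : calls) {
            for (Transfer t : transfers) {
                if (!t.method.equals(call.method)) continue;
                String from = call.resolve(t.from);
                String to = call.resolve(t.to);
                if (from != null && to != null && tainted.contains(from)) {
                    tainted.add(to); // taint moves from the from-ref to the to-ref
                }
            }
        }
        return tainted;
    }

    public static void main(String[] args) {
        List<Transfer> transfers = Arrays.asList(
                new Transfer("append", "0", "base"),
                new Transfer("toString", "base", "result"));
        List<Call> calls = Arrays.asList(
                new Call(null, "sb", "append", "taint"), // sb.append(taint);
                new Call("s", "sb", "toString"));        // s = sb.toString();
        Set<String> sources = new HashSet<>(Arrays.asList("taint"));
        System.out.println(propagate(sources, calls, transfers).contains("s")); // prints "true"
    }
}
```

With an empty transfer list, the same calls leave s untainted, which corresponds to the missed leak described in the Introduction.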

Taint transfer can be intricate in real-world programs. To detect a broader range of security vulnerabilities, our taint analysis supports various types of taint transfers using Index Reference. You can use different expressions for from and to in transfer entries to enable different types of taint transfers, as outlined below:

Transfer	From	To
variable → variable	`INDEX`	`INDEX`
variable → array	`INDEX`	`INDEX[*]`
variable → field	`INDEX`	`INDEX.FIELD_NAME`
array → variable	`INDEX[*]`	`INDEX`
field → variable	`INDEX.FIELD_NAME`	`INDEX`

As an illustration, the following example shows why the array → variable transfer is useful.


1  String cmd = request.getParameter("cmd"); // source
2  Object[] cmds = new Object[]{cmd};
3  Expression expr = Factory.newExpression(cmds); // taint transfer: cmds[0] -> expr
4  execute(expr); // sink

Here, assuming we consider getParameter() as the source and execute() as the sink, the code retrieves a value from an HTTP request at line 1 (the program has no control over this value, so it is treated as a source) and stores it in cmd. At line 2, cmd is stored in an Object array, which is then used to create an Expression at line 3. Finally, the Expression is passed to execute() at line 4, which might lead to a command injection.

To detect this injection, we need to propagate taint from cmd to expr when analyzing method call expr = Factory.newExpression(cmds). At this call, the taint stored in array cmds is transferred to expr, and we can capture this behavior by specifying the following taint transfer entry:

- { method: "<Factory: Expression newExpression(java.lang.Object[])>", from: "0[*]", to: result }

Here, from: "0[*]" indicates that the taint analysis will examine all elements of the array pointed to by the 0-th parameter (i.e., cmds), and if it detects any taint objects, it will propagate them to the variable specified by to: result (i.e., expr).

4.2.5. Sanitizers

Our taint analysis allows users to define sanitizers in order to reduce false positives. This can be accomplished by writing a list of sanitizer entries under the key sanitizers in the configuration, as demonstrated below:

sanitizers:
  - { kind: param, method: METHOD_SIGNATURE, index: INDEX }
  - ...

Subsequently, the taint analysis will prevent the propagation of taint objects to the parameter specified by INDEX in the method METHOD_SIGNATURE.

Currently, sanitizers do not support index references. You can only specify variables using the INDEX parameter.

4.2.6. Multiple Configuration Files

The taint analysis supports the loading of multiple configuration files, eliminating the need for users to consolidate all configurations into a single extensive file. Users can simply place all relevant configuration files within a designated directory and then provide the path to this directory (<path/to/config>) when enabling the taint analysis.

The taint analysis traverses the directory recursively when loading configuration files. Therefore, you have the flexibility to organize the configuration files as you see fit, including placing them in multiple subdirectories if desired.
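A directory traversal that also visits subdirectories, as described above, could look roughly like the following sketch. It is not Tai-e's loader: the class ConfigFileCollector is hypothetical, and the restriction to files ending in .yml is an assumption made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: gathers configuration files below a directory. */
final class ConfigFileCollector {

    /** Returns all regular files ending in .yml under root, subdirectories included. */
    static List<Path> collect(Path root) throws IOException {
        // Files.walk visits root and everything below it; close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".yml")) // assumed extension
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        collect(root).forEach(System.out::println);
    }
}
```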

4.3. Output of Taint Analysis

Currently, the output of the taint analysis consists of two parts: console output and taint flow graph.

4.3.1. Console Output

In console output, the taint analysis reports the detected taint flows using the following format:

Detected n taint flow(s):
TaintFlow{SOURCE_POINT -> SINK_POINT}
...

Each taint flow is a pair of source point and sink point. A source point refers to a variable that points to a newly-generated taint object, while a sink point designates a variable pointing to taint objects that have flowed from the source point.

Given that there are several kinds of Sources, each kind has a corresponding source point representation with a specific format:

Source	Source Point Description	Source Point Format	Explanation
Call source	A variable at a call site of the source method.	`METHOD_SIGNATURE[i@Ln] CALL_STMT/INDEX_REF`	`METHOD_SIGNATURE`: The method containing the call site. `[i@Ln]`: Position of the call site. `CALL_STMT`: The call statement (site). `INDEX_REF`: Index Reference of the source point.
Parameter source	A parameter of the source method.	`METHOD_SIGNATURE/INDEX_REF`	`METHOD_SIGNATURE`: The source method. `INDEX_REF`: Index Reference of the source point.
Field source	A variable that receives loaded value from the source field.	`METHOD_SIGNATURE[i@Ln] LOAD_STMT`	`METHOD_SIGNATURE`: The method containing the load statement. `[i@Ln]`: Position of the load statement. `LOAD_STMT`: The load statement.

The [i@Ln] represents the position of a statement, where i is the index of the statement in the IR and n is the line number of the statement in the source code, which can help you locate the statement.

Here are some examples of source points for each kind:

Call source: <Main: void main(java.lang.String[])>[3@L7] pw = invokestatic Data.getPassword()/result
Parameter source: <Controller: void doGet(javax.servlet.http.HttpServletRequest,javax.servlet.http.HttpServletResponse)>/0
Field source: <Main: void main(java.lang.String[])>[29@L24] name = p.<Person: java.lang.String name>

The format of the sink point is exactly the same as call source point, so we won’t repeat the explanation here.
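If you want to post-process the console output, a call source point (or a sink point, which shares the format) can be split into its parts with a regular expression. The sketch below is one possible way to do so; the class CallPointSketch and the exact pattern are assumptions for illustration and are not provided by Tai-e.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for points of the form METHOD_SIGNATURE[i@Ln] CALL_STMT/INDEX_REF. */
final class CallPointSketch {

    // Group 1: method signature, 2: statement index i, 3: line number n,
    // 4: call statement, 5: index reference (everything after the last '/').
    private static final Pattern FORMAT =
            Pattern.compile("(<.*?>)\\[(\\d+)@L(\\d+)\\] (.*)/([^/]+)");

    final String method;
    final int stmtIndex;
    final int line;
    final String stmt;
    final String indexRef;

    private CallPointSketch(String method, int stmtIndex, int line,
                            String stmt, String indexRef) {
        this.method = method;
        this.stmtIndex = stmtIndex;
        this.line = line;
        this.stmt = stmt;
        this.indexRef = indexRef;
    }

    static CallPointSketch parse(String text) {
        Matcher m = FORMAT.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a call source/sink point: " + text);
        }
        return new CallPointSketch(m.group(1), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), m.group(4), m.group(5));
    }

    public static void main(String[] args) {
        CallPointSketch p = parse("<Main: void main(java.lang.String[])>[3@L7] "
                + "pw = invokestatic Data.getPassword()/result");
        System.out.println("line " + p.line + ", index ref " + p.indexRef);
    }
}
```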

4.3.2. Taint Flow Graph

The console output only provides the starting and ending points of the taint flows. However, to validate the reported taint flows and the associated security vulnerabilities, users need to investigate the detailed propagation paths of taint objects.

To address this requirement, we introduce the concept of taint flow graph (TFG). In a TFG, nodes represent program pointers (such as variables and fields) that point to taint objects, while edges illustrate how taint objects move between these pointers. This allows users to review taint flows by analyzing the TFG.

Tai-e will output the path of the dumped TFG:

Dumping ...\tai-e\output\taint-flow-graph.dot

TFG is dumped as a DOT graph. For a better experience, we recommend installing Graphviz and using it to convert DOT to SVG with the following command:

$ dot -Tsvg taint-flow-graph.dot -o taint-flow-graph.svg

Then you can open the resulting SVG file with your web browser and examine the TFG.

We plan to develop more user-friendly mechanisms for examining taint analysis results in the future.

4.4. Pre-prepared Commonly Used Taint Configuration

Manually collecting and writing taint analysis configurations for different vulnerability types can be time-consuming and challenging, especially for developers and security researchers with limited experience. To help users streamline this process and improve the efficiency and accuracy of vulnerability detection, we have curated Commonly Used Taint Configuration. When creating or modifying your own taint analysis configuration, you can refer to this configuration for guidance in your process.

Commonly Used Taint Configuration is a comprehensive collection of source, sink, and transfer rules tailored for various common vulnerability types. Currently, this collection contains 327 source rules, 920 sink rules, and 138 transfer rules, enabling users to adapt and extend them to detect 13 types of vulnerabilities.

To further enhance the user experience, we have also carefully organized the project structure by packages and vulnerability types to ensure clarity and ease of understanding of the rules, allowing users to quickly locate and apply the relevant rules.

4.4.1. Organizational structure

The structure of this project is as follows:

Tai-e/src/main/resources/commonly-used-taint-config
├── sink
│   ├── infoleak              # contains 141 sinks
│   │   └── java-io
│   └── injection             # contains 779 sinks
│       ├── android
│       │   └── sql-injection
│       ├── java
│       │   ├── crlf
│       │   ├── path-traversal
│       │   ├── rce
│       │   └── ...
│       └── ...
├── source
│   ├── infoleak              # contains 158 sources
│   │   └── java
│   └── injection             # contains 169 sources
│       ├── apache-struts2
│       ├── javax
│       │   ├── javax-portlet
│       │   ├── javax-servlet
│       │   └── javax-swing
│       └── ...
└── transfer                  # contains 138 transfers about String

Specifically, this project first divides the configuration files into three main categories: sink, source, and transfer.

sink category: Contains sink configuration files related to information leakage and injection vulnerabilities, further subdivided into two subdirectories:
- infoleak: Categorized by package name.
- injection: Categorized by vulnerability type.
source category: Contains source configuration files related to information leakage and injection vulnerabilities, further subdivided into two subdirectories:
- infoleak: Categorized by package name.
- injection: Categorized by package name.
transfer category: Contains transfers.

Additionally, each subdirectory contains a corresponding README file that provides a brief overview of the relevant vulnerability types.

4.4.2. How to Use it? (An Example)

Users can directly integrate the configuration files from this collection into the Configuration File for the Tai-e taint analysis, or modify and extend them as needed to better meet specific analysis requirements.

Here is an example of how to use the configuration files from this collection. If the user needs to detect an RCE (Remote Code Execution) injection vulnerability in a Java project using the Jetty software library, the following steps can be taken to modify the taint configuration file:

Add the source rules related to the Jetty software library from the file source/injection/jetty/jetty-http/jetty-http.yml.
Add the sink rules related to the RCE type injection vulnerability from the file sink/injection/java/rce/command.yml.
Add the transfer rules related to String type from the file transfer/string-transfers.yml.

After these steps, the taint configuration file will be as follows:

sources:
  - { kind: call, method: "<org.eclipse.jetty.http.HttpCookie: java.lang.String getName()>", index: result, type: "java.lang.String" }
  - { kind: call, method: "<org.eclipse.jetty.http.HttpCookie: java.lang.String getValue()>", index: result, type: "java.lang.String" }
  - { kind: call, method: "<org.eclipse.jetty.http.HttpCookie: java.lang.String asString()>", index: result, type: "java.lang.String" }
#...

sinks:
  - { method: "<java.lang.Runtime: java.lang.Process exec(java.lang.String)>", index: 0 }
  - { method: "<java.lang.Runtime: java.lang.Process exec(java.lang.String[])>", index: 0 }
  - { method: "<java.lang.Runtime: java.lang.Process exec(java.lang.String,java.lang.String[])>", index: 0 }
#...

transfers:
  - { method: "<java.lang.String: java.lang.String substring(int)>", from: base, to: result }
  - { method: "<java.lang.String: java.lang.String substring(int,int)>", from: base, to: result }
#...

5. How to Develop A New Analysis on Tai-e?

Tai-e is highly extensible. To develop a new analysis and make it available in Tai-e, you just need to follow the two steps below.

5.1. Step 1. Develop An Analysis

First, you need to implement your analysis class, which should extend MethodAnalysis, ClassAnalysis, or ProgramAnalysis (all in package pascal.taie.analysis), depending on whether the analysis runs at method, class, or program level. When writing the analysis class, you need to:

Declare a public static field ID of type String, whose value is identical to the analysis id in the configuration file.
Implement constructor with argument AnalysisConfig, and pass it to the constructor of parent class.
Implement the analysis logic in analyze() method.
- For MethodAnalysis, you need to implement method analyze(IR), which at each time takes the IR of a method as input.
- For ClassAnalysis, you need to implement method analyze(JClass), which at each time takes a class as input.
- For ProgramAnalysis, you need to implement method analyze(). Inter-procedural analyses typically require whole-program information, which can be accessed via the static methods of World, thus we do not pass argument to the analyze() method.

Note that the *Analysis classes above are generic, and their type parameter is the type of the analysis result, i.e., the return type of the corresponding analyze method. Tai-e assumes that the return value of analyze is the analysis result and manages results based on this assumption. Below we give some tips that may be useful for developing a new analysis.

Get familiar with Tai-e: See Program Abstraction in Tai-e for more information about Tai-e, such as the important classes that you might use when writing new analysis.
Obtain options: Global options are available at World.get().getOptions(); options with respect to each analysis are dispatched to each Analysis object, and can be accessed by getOptions() within the analysis class.
Obtain results of dependent analyses: If your analysis requires the results of some other previously-executed analyses, you can obtain them by calling ir.getResult(id), jclass.getResult(id), or World.get().getResult(id) for method/class/program-level results.

5.2. Step 2. Register the Analysis

To make an analysis available in Tai-e, you need to register it by adding its information (such as analysis id, analysis class, etc.) to the configuration file src/main/resources/tai-e-analyses.yml ("config file" for short), which contains the information of all available analyses. Please refer to Analysis Management for details about analysis registration.

After adding analysis information to config file, your analysis is now available in Tai-e.

5.3. An Example

We give a simple example to illustrate how to add a new analysis to Tai-e.

Suppose that we are going to implement an intra-procedural dead code detection, which requires CFG and the analysis results of live variable analysis and constant propagation. We choose to extend MethodAnalysis, and complete the required tasks as explained in Step 1 (we omit concrete analysis logic for simplicity):

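package my.example;

import pascal.taie.analysis.MethodAnalysis;
import pascal.taie.analysis.dataflow.analysis.LiveVariable;
import pascal.taie.analysis.dataflow.analysis.constprop.CPFact;
import pascal.taie.analysis.dataflow.analysis.constprop.ConstantPropagation;
import pascal.taie.analysis.dataflow.fact.NodeResult;
import pascal.taie.analysis.dataflow.fact.SetFact;
import pascal.taie.analysis.graph.cfg.CFG;
import pascal.taie.analysis.graph.cfg.CFGBuilder;
import pascal.taie.config.AnalysisConfig;
import pascal.taie.ir.IR;
import pascal.taie.ir.exp.Var;
import pascal.taie.ir.stmt.Stmt;

import java.util.Set;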

public class DeadCodeDetection extends MethodAnalysis<Set<Stmt>> {

    // declare field ID
    public static final String ID = "my-deadcode";

    // implement constructor
    public DeadCodeDetection(AnalysisConfig config) {
        super(config);
    }

    // implement analyze(IR) method
    @Override
    public Set<Stmt> analyze(IR ir) {
        // obtain results of dependent analyses
        CFG<Stmt> cfg = ir.getResult(CFGBuilder.ID);
        NodeResult<Stmt, CPFact> constants = ir.getResult(ConstantPropagation.ID);
        NodeResult<Stmt, SetFact<Var>> liveVars = ir.getResult(LiveVariable.ID);
        // analysis logic
        Set<Stmt> deadCode;
        ...
        return deadCode;
    }
}

Then we register the analysis by adding its information to src/main/resources/tai-e-analyses.yml (The analysis does not have options, thus we can ignore item options):

- description: dead code detection
  analysisClass: my.example.DeadCodeDetection
  id: my-deadcode
  requires: [ cfg,constprop,livevar ]

That’s it! Now you can run the dead code detection via option -a my-deadcode.

6. Program Abstraction in Tai-e (core classes and IR)

This document introduces Tai-e’s abstraction of the Java program being analyzed. You will likely need to use the classes introduced in this document when developing analyses on top of Tai-e. See Section 2 of Tai-e’s paper for more discussions.

6.1. Core Classes

JClass (in pascal.taie.language.classes): represents classes in the program. Each instance contains various information about a class, such as class name, modifiers, declared methods and fields, etc.
JMethod and JField (in pascal.taie.language.classes): represent class members, i.e., methods and fields in the program. Each JMethod/JField instance contains various information about a method/field, such as declaring class, name, etc.
ClassHierarchy (in pascal.taie.language.classes): manages all the classes of the program. It offers APIs to query class hierarchy information, such as method dispatching, subclass checking, etc.
Type (in pascal.taie.language.type): represents types in the program. It has several subclasses, e.g., PrimitiveType, ClassType, and ArrayType, representing different kinds of Java types.
TypeSystem (in pascal.taie.language.type): provides APIs for retrieving specific types and subtype checking.
World (in pascal.taie): manages the whole-program information of the program. By using its getters, you can access this information, e.g., ClassHierarchy and TypeSystem. World is essentially a singleton class, and you can obtain the instance by calling World.get().

6.2. Tai-e IR

Tai-e IR is a typed, 3-address, statement- and expression-based representation of Java method bodies.

You can dump the IR for the classes of the input program to .tir files via option -a ir-dumper. By default, Tai-e dumps IR to its default output directory output/. If you want to dump IR to a specific directory, just use option -a ir-dumper=dump-dir:path/to/dir. ir-dumper is implemented as a class analysis, thus the scope of the classes it dumps is affected by option -scope.

The IR classes reside in package pascal.taie.ir and its sub-packages.

There are three core classes in Tai-e IR:

IR is the central data structure of the intermediate representation in Tai-e, and each IR instance can be seen as a container of the information for the body of a particular method, such as variables, parameters, statements, etc. You can easily obtain the IR instance of a method via JMethod.getIR() (provided the method is not abstract).
Stmt represents all statements in the program. This interface has a dozen subclasses, corresponding to various statements. Stmts are stored in IR, and you can obtain them via IR.getStmts().
Exp represents all expressions in the program. This interface has dozens of subclasses, corresponding to various expressions. Exps are associated with Stmts, and you can obtain them via specific APIs of Stmt.

We believe that the API of IR is self-documenting and easy to use. To make IR more intelligible, we present a formal definition (i.e., a context-free grammar) below that illustrates all kinds of expressions and statements in the IR, and how Stmts are formed from Exps. Most non-terminals in the grammar correspond to classes in pascal.taie.ir.

6.2.1. Grammar of Expressions

Var → Identifier
Literal → IntLiteral | LongLiteral | FloatLiteral | DoubleLiteral | StringLiteral | ClassLiteral | NullLiteral | MethodHandle | MethodType
FieldAccess → InstanceFieldAccess | StaticFieldAccess
- InstanceFieldAccess → Var.FieldRef
- StaticFieldAccess → FieldRef
- FieldRef → <ClassType: Type FieldName>
- FieldName → Identifier
ArrayAccess → Var[Var]
NewExp → NewInstance | NewArray | NewMultiArray
- NewInstance → new ClassType
- NewArray → new Type[Var]
- NewMultiArray → new Type LengthList EmptyList
- LengthList → [Var] | [Var]LengthList
- EmptyList → ε | []EmptyList
InvokeExp → InvokeVirtual | InvokeInterface | InvokeSpecial | InvokeStatic | InvokeDynamic
- InvokeVirtual → invokevirtual Var.MethodRef(ArgList)
- InvokeInterface → invokeinterface Var.MethodRef(ArgList)
- InvokeSpecial → invokespecial Var.MethodRef(ArgList)
- InvokeStatic → invokestatic MethodRef(ArgList)
- InvokeDynamic → invokedynamic BootstrapMethodRef MethodName MethodType [BootstrapArgList] (ArgList)
- MethodRef → <ClassType: Type MethodName(TypeList)>
- MethodName → Identifier
- TypeList → ε | Type TypeList'
- TypeList' → ε | , Type TypeList'
- ArgList → ε | Var ArgList'
- ArgList' → ε | , Var ArgList'
- BootstrapMethodRef → MethodRef
- BootstrapArgList → ε | Literal BootstrapArgList'
- BootstrapArgList' → ε | , Literal BootstrapArgList'
UnaryExp → NegExp | ArrayLengthExp
- NegExp → !Var
- ArrayLengthExp → Var.length
BinaryExp → ArithmeticExp | BitwiseExp | ComparisonExp | ConditionExp | ShiftExp
- ArithmeticExp → Var ArithmeticOp Var
- ArithmeticOp → + | - | * | / | %
- BitwiseExp → Var BitwiseOp Var
- BitwiseOp → "|" | & | ^
- ComparisonExp → Var ComparisonOp Var
- ComparisonOp → cmp | cmpl | cmpg
- ConditionExp → Var ConditionOp Var
- ConditionOp → == | != | < | > | <= | >=
- ShiftExp → Var ShiftOp Var
- ShiftOp → << | >> | >>>
InstanceOfExp → Var instanceof Type
CastExp → (Type) Var

6.2.2. Grammar of Statements

AssignStmt → New | AssignLiteral | Copy | LoadArray | StoreArray | LoadField | StoreField | Unary | Binary | InstanceOf | Cast
- New → Var = NewExp;
- AssignLiteral → Var = Literal;
- Copy → Var = Var;
- LoadArray → Var = ArrayAccess;
- StoreArray → ArrayAccess = Var;
- LoadField → Var = FieldAccess;
- StoreField → FieldAccess = Var;
- Unary → Var = UnaryExp;
- Binary → Var = BinaryExp;
- InstanceOf → Var = InstanceOfExp;
- Cast → Var = CastExp;
JumpStmt → Goto | If | Switch
- Goto → goto Label;
- If → if ConditionExp goto Label;
- Switch → TableSwitch | LookupSwitch
- TableSwitch → tableswitch (Var) { CaseList default: goto Label; }
- LookupSwitch → lookupswitch (Var) { CaseList default: goto Label; }
- Label → IntLiteral
- CaseList → ε | case IntLiteral: goto Label; CaseList
Invoke → InvokeExp; | Var = InvokeExp;
Return → return; | return Var;
Throw → throw Var;
Catch → catch Var;
Monitor → monitorenter Var; | monitorexit Var;
Nop → nop;
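As a rough illustration of how statements are built from expressions according to the grammar, the sketch below models three productions (Var, ArithmeticExp, and the Binary assignment) as small Java classes whose toString() follows the textual form of the rules. These classes are simplified stand-ins written for this example; they are not Tai-e's actual IR classes, and Binary is restricted here to arithmetic expressions.

```java
/** Simplified, hypothetical model of a few grammar rules; not Tai-e's IR classes. */
final class IrGrammarSketch {

    interface Exp { }

    /** Var → Identifier */
    static final class Var implements Exp {
        final String name;

        Var(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /** ArithmeticExp → Var ArithmeticOp Var */
    static final class ArithmeticExp implements Exp {
        final Var left;
        final String op; // one of + - * / %
        final Var right;

        ArithmeticExp(Var left, String op, Var right) {
            this.left = left;
            this.op = op;
            this.right = right;
        }

        @Override
        public String toString() {
            return left + " " + op + " " + right;
        }
    }

    /** Binary → Var = BinaryExp; (only ArithmeticExp in this sketch) */
    static final class Binary {
        final Var lhs;
        final ArithmeticExp rhs;

        Binary(Var lhs, ArithmeticExp rhs) {
            this.lhs = lhs;
            this.rhs = rhs;
        }

        @Override
        public String toString() {
            return lhs + " = " + rhs + ";";
        }
    }

    public static void main(String[] args) {
        Binary stmt = new Binary(new Var("x"),
                new ArithmeticExp(new Var("a"), "+", new Var("b")));
        System.out.println(stmt); // prints "x = a + b;"
    }
}
```

Because both operands must be Vars, a source expression such as x = a + b * c needs a temporary variable, which is what makes the representation 3-address.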

7. Analysis Management

It is very common for an analysis framework to conduct multiple analyses in a single run, e.g., a user wants to run many bug detectors to find more bugs, or an analysis depends on the outcomes of other analyses. By design, Tai-e supports these scenarios via systematic analysis management, as explained in this document.

7.1. Analysis Information Registration

As mentioned in Develop A New Analysis, to add a new analysis to Tai-e, one needs to register its information in the analysis configuration file src/main/resources/tai-e-analyses.yml. Each analysis entry consists of five (or fewer) attributes:

description: a description of the analysis

This attribute is only for documentation purposes.

analysisClass: fully-qualified name of the analysis class

Tai-e loads the analysis classes based on this attribute.

id: a short and unique identifier of an analysis

Tai-e relies on this attribute to identify each analysis, so each id must be unique.

requires (optional): a list of dependent analyses

If an analysis requires the results of any other analyses, then we can specify the ids of the dependent analyses in this attribute. At runtime, Tai-e automatically resolves analysis dependencies according to this attribute, ensuring the correct execution order for all dependent analyses. This approach also frees developers to concentrate on the specification of their own analysis, and saves them the effort of writing command options when running an analysis.

Each item in the requires attribute consists of two parts:
- Analysis id, e.g., A, whose result is required by this analysis.
- A boolean expression in parentheses (optional), e.g., (x=y), which indicates that the specified analysis is required only when the expression value is true. The expression value is determined by the runtime values of the specified options, for example:
  - requires: [A(x=y)]: requires A when runtime value of option x is y
  - requires: [A(x=y&a=b)]: requires A when runtime value of option x is y and runtime value of option a is b
  - requires: [A(x=a|b|c)]: requires A when runtime value of option x is a, b, or c
This feature makes Tai-e more flexible in resolving analysis dependencies. You don’t need to write this attribute for an independent analysis.

options (optional): a map of default option values

This attribute allows you to specify default values for all options of the analysis. These values can be overwritten by runtime-specified option values. You don’t need to write this attribute if your analysis has no options.

You can see examples about analysis registration in Section 5.1 of our technical report and tai-e-analyses.yml.
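The conditional form of a requires item can be read as a conjunction (&) of option tests, each of which lists one or more allowed values separated by |. The following sketch evaluates such an item against a map of runtime option values. It only mirrors the three examples given above; the class RequiresConditionSketch is hypothetical, option values are assumed to be plain strings, and Tai-e's own evaluation may differ in details.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical evaluator for requires items such as "A(x=y&a=b)". */
final class RequiresConditionSketch {

    /** Returns the analysis id, i.e., the part before the optional parentheses. */
    static String idOf(String item) {
        int open = item.indexOf('(');
        return (open < 0 ? item : item.substring(0, open)).trim();
    }

    /** Returns true if the item has no condition or its condition holds. */
    static boolean isRequired(String item, Map<String, String> options) {
        int open = item.indexOf('(');
        if (open < 0) {
            return true; // unconditional requirement
        }
        int close = item.lastIndexOf(')');
        if (close < open) {
            throw new IllegalArgumentException("Missing ')' in: " + item);
        }
        return holds(item.substring(open + 1, close), options);
    }

    /** Evaluates "k1=v1|v2&k2=v3": every '&'-separated test must match. */
    static boolean holds(String condition, Map<String, String> options) {
        for (String test : condition.split("&")) {
            String[] keyAndValues = test.split("=", 2);
            if (keyAndValues.length != 2) {
                throw new IllegalArgumentException("Malformed test: " + test);
            }
            String actual = options.get(keyAndValues[0].trim());
            boolean matched = false;
            for (String allowed : keyAndValues[1].split("\\|")) {
                if (allowed.trim().equals(actual)) {
                    matched = true;
                }
            }
            if (!matched) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("x", "b");
        System.out.println(isRequired("A(x=a|b|c)", options)); // prints "true"
    }
}
```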

7.2. Analysis Plan

At runtime, Tai-e first generates an analysis plan (essentially a list of analyses to be executed) based on tai-e-analyses.yml and runtime-provided option values, and then runs analyses in order according to the plan.

As described in Command-Line Options, there are two approaches to specify the analyses to execute. Next, we will explain how they affect the generated analysis plan.

7.2.1. By Command-Line Options (Option `-a`)

If you specify analyses, say A1,…,An, via option -a, Tai-e will resolve all analyses directly/indirectly required by A1,…,An, and generate an analysis plan (including all these analyses) by topological sorting.
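The resolution step described above can be pictured as a depth-first topological sort over the requires relation. The sketch below is a minimal version under simplifying assumptions: requirements are unconditional, the class AnalysisPlanSketch is hypothetical, and the dependency data used in main is made up for illustration rather than taken from tai-e-analyses.yml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of plan generation by depth-first topological sorting. */
final class AnalysisPlanSketch {

    /** Returns the requested analyses and all their requirements, requirements first. */
    static List<String> plan(List<String> requested, Map<String, List<String>> requires) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String id : requested) {
            visit(id, requires, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String id, Map<String, List<String>> requires,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(id)) {
            return; // already placed in the plan
        }
        if (!inProgress.add(id)) {
            throw new IllegalStateException("Cyclic dependency involving " + id);
        }
        for (String dep : requires.getOrDefault(id, Collections.emptyList())) {
            visit(dep, requires, done, inProgress, order);
        }
        inProgress.remove(id);
        done.add(id);
        order.add(id); // appended only after everything it requires
    }

    public static void main(String[] args) {
        Map<String, List<String>> requires = new HashMap<>();
        // Made-up dependency data for illustration.
        requires.put("my-deadcode", Arrays.asList("cfg", "constprop", "livevar"));
        requires.put("constprop", Arrays.asList("cfg"));
        requires.put("livevar", Arrays.asList("cfg"));
        System.out.println(plan(Arrays.asList("my-deadcode"), requires));
    }
}
```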

7.2.2. By Plan File (Option `-p`)

Alternatively, you can specify analyses by a plan file, which is a YAML file consisting of a list of analysis entries. Each entry has two attributes:

id: the analysis to be executed.
options: runtime option values for the analysis.

When using option -p, Tai-e executes the analyses in strict accordance with the plan file, i.e., it neither resolves analysis dependencies nor sorts the analyses. The file should therefore include all required analyses, and each analysis should be placed before all the other analyses that require it; otherwise, Tai-e will raise an alert.
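The ordering constraint on a plan file can be checked with a single pass over the entries. The sketch below shows one way to do so; PlanOrderCheck and firstViolation are hypothetical names, and the check Tai-e actually performs may differ.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical check (not Tai-e's implementation) of plan-file ordering. */
final class PlanOrderCheck {

    /**
     * @param plan     analysis ids in the order they appear in the plan file
     * @param requires analysis id -> ids of the analyses it requires
     * @return a description of the first violation, or empty if the plan is valid
     */
    static Optional<String> firstViolation(List<String> plan,
                                           Map<String, List<String>> requires) {
        Set<String> seen = new HashSet<>(); // analyses placed earlier in the plan
        for (String id : plan) {
            for (String dep : requires.getOrDefault(id, List.of())) {
                if (!seen.contains(dep)) {
                    return Optional.of(id + " requires " + dep
                            + ", which does not appear before it");
                }
            }
            seen.add(id);
        }
        return Optional.empty();
    }
}
```

A required analysis that is missing from the plan and one that appears too late are reported in the same way, since in both cases it has not been seen when the requiring analysis is reached.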

Composing a plan file from scratch might be tedious. To ease this task, Tai-e always generates a plan file output/tai-e-plan.yml each time you specify analyses with option -a, so that you can easily obtain a plan file and then edit it as needed. In addition, we provide an auxiliary option -g (--gen-plan-file): when you use it together with -a, Tai-e merely generates the plan file without actually running the analyses.

7.3. Analysis Result Management

Result management matters whenever an analysis requires the results of other analyses, which happens frequently. Depending on the type of analysis, Tai-e automatically stores the results in different locations:

For a method-level analysis, Tai-e stores its results in the IR, i.e., argument of MethodAnalysis.analyze(IR).
For a class-level analysis, Tai-e stores its results in the JClass, i.e., argument of ClassAnalysis.analyze(JClass).
For a program-level analysis, Tai-e stores its results in World.

Thanks to this result management, developers only need to remember one API, getResult(id) (where id is the identifier of the analysis), to obtain the results of any type of analysis, e.g., ir.getResult(id) for method-level analysis, jclass.getResult(id) for class-level analysis, and world.getResult(id) for program-level analysis.
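The pattern behind getResult(id) can be pictured as a map from analysis ids to results attached to each holder object. The sketch below is a simplified stand-in, not Tai-e's code: ResultStore is a hypothetical class playing the role of IR, JClass, or World, and the id used in the usage note is made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical stand-in for an object that holds analysis results. */
class ResultStore {

    private final Map<String, Object> results = new HashMap<>();

    /** Stores the result of the analysis with the given id. */
    void storeResult(String id, Object result) {
        results.put(id, result);
    }

    /** Returns the stored result; the caller chooses the expected type R. */
    @SuppressWarnings("unchecked")
    <R> R getResult(String id) {
        if (!results.containsKey(id)) {
            throw new IllegalArgumentException("no result for analysis " + id);
        }
        return (R) results.get(id);
    }
}
```

With such a holder, a framework would call storeResult("some-analysis", result) after running an analysis, and a later analysis would retrieve it with getResult("some-analysis"), without needing to know where or how it was stored.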

With the aforementioned mechanisms, it is fairly simple to coordinate multiple analyses in Tai-e.

8. Pointer Analysis Framework

Pointer analysis is one of the most important fundamental static analyses. Tai-e provides a versatile, efficient and extensible pointer analysis framework, which supports different kinds of heap abstractions and context sensitivity variants. It is able to produce more sound and faster pointer analyses than other pointer analysis frameworks, under both context-insensitive and context-sensitive settings (see Tai-e’s paper for more details).

A distinguishing feature of our pointer analysis framework is its analysis plugin system, which makes it convenient to develop new analyses (that need to interact with pointer analysis) and add them to the framework in a modular manner, so that the framework is easier to maintain and extend. Currently, many analyses in Tai-e have been implemented as plugins of our pointer analysis framework, such as reflection analysis, lambda analysis, exception analysis, and taint analysis.

Below we introduce key options of pointer analysis and the analysis plugin system.

8.1. Options

The analysis id of pointer analysis is pta, and here we list its key options:

Context sensitivity: cs:ci|k-[obj|type|call][-k'h]
- Default value: ci (context insensitivity)
- Specify the context sensitivity variant of the pointer analysis. It supports context insensitivity and k-limiting object/type/call-site sensitivity, e.g., 1-obj and 2-call. By default, the limit for heap contexts is k-1 (the recommended one). If you want to specify another limit for heap contexts, say k', just append -k'h, e.g., 2-type-2h.
Only analyze application code: only-app:[true|false]
- Default value: false
- When set to true, the pointer analysis only analyzes application code (and ignores library code).
Implicit entries: implicit-entries:[true|false]
- Default value: true
- Specify whether to consider the methods that are called implicitly by the JVM as entry points of the pointer analysis. When it is false, these methods are not considered as entry points, leading to a possibly unsound points-to result.
String constants: distinguish-string-constants:<strategy>
- Default value: reflection
- Specify which string constants to distinguish. Currently, the following strategies are supported:
  - reflection: only distinguish reflection-relevant string constants, i.e., class, method, and field names.
  - null: do not distinguish any string constants, i.e., merge all of them.
  - all: distinguish all string constants.
  - <predicate-class>: You can implement your own strategy to distinguish string constants. In this case, just give the fully-qualified name of your predicate class here. See IsReflectionString as an example.
Object merging: merge-string-objects/merge-string-builder/merge-exception-objects:[true|false]
- Default value: true.
- Specify whether to merge corresponding objects.
Advanced analysis: advanced:<analysis>
- Default value: null
- Enable an advanced pointer analysis technique. Currently, we have integrated the following techniques:
  - Zipper-e (option value: zipper-e): introduced in our TOPLAS'20 paper.
  - Zipper (option value: zipper): introduced in our OOPSLA'18 paper.
  - Scaler (option value: scaler): introduced in our FSE'18 paper.
  - Mahjong (option value: mahjong): introduced in our PLDI'17 paper.
Reflection log: reflection-log:<path/to/log>
- Default value: null
- Specify the path to the reflection log file. For the reflective calls specified in the log file, pointer analysis will resolve them by their targets in the log file. Currently, the output format of TamiFlex is supported; see ReflectiveAction.log as an example.
Reflection inference: reflection-inference:<strategy>
- Default value: string-constant
- Specify the strategy for static reflection inference. This option can work together with reflection-log: if the targets of a reflective call are given in the log, reflection inference will ignore the call. Currently, the following strategies are supported:
  - String constant based inference (option value: string-constant): resolve reflective calls by string constants.
  - Solar (option value: solar): introduced in our TOSEM'19 paper.
  - No inference (option value: null): disable reflection inference.
Taint analysis: taint-config:<path/to/config>
- Default value: null
- Specify the path to configuration file for taint analysis, which defines sources, sinks, and taint transfers. Taint analysis will be enabled when this file is given. See Taint Analysis for more details.
Plugins: plugins:[<pluginClass>,…]
- Default value: []
- Activate plugins. To enable a plugin, just add the fully-qualified name of the plugin class to this list.
Dump points-to results (without context information): dump-ci:[true|false]
- Default value: false
- Specify whether to dump points-to results without context information.
Dump points-to results (with context information): dump:[true|false]
- Default value: false
- Specify whether to dump points-to results with context information.
Time limit: time-limit:<time-limit>
- Default value: -1
- Specify a time limit for pointer analysis (unit: second). When it is -1, there is no time limit.
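As an illustration of the cs option syntax listed above, the following sketch parses values such as ci, 2-obj, and 2-type-2h, applying the default heap context limit of k-1 when no -k'h suffix is given. The class CsOption is hypothetical and only mirrors the documented syntax; it is not how Tai-e itself parses the option.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser (not Tai-e's code) for values of the cs option. */
final class CsOption {

    // k-[obj|type|call] with an optional -k'h suffix
    private static final Pattern VARIANT =
            Pattern.compile("(\\d+)-(obj|type|call)(?:-(\\d+)h)?");

    final String kind;   // "ci", "obj", "type", or "call"
    final int limit;     // k (0 for ci)
    final int heapLimit; // k' (defaults to k-1; 0 for ci)

    private CsOption(String kind, int limit, int heapLimit) {
        this.kind = kind;
        this.limit = limit;
        this.heapLimit = heapLimit;
    }

    static CsOption parse(String value) {
        if (value.equals("ci")) {
            return new CsOption("ci", 0, 0);
        }
        Matcher m = VARIANT.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("illegal cs value: " + value);
        }
        int k = Integer.parseInt(m.group(1));
        // without an explicit -k'h suffix, the heap context limit is k-1
        int heapK = m.group(3) != null ? Integer.parseInt(m.group(3)) : k - 1;
        return new CsOption(m.group(2), k, heapK);
    }
}
```

For instance, parse("2-obj") yields kind obj with limit 2 and heap limit 1, whereas parse("2-type-2h") yields kind type with both limits equal to 2.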

8.2. Analysis Plugin System

This section explains how the analysis plugin system works, as shown in the figure below:

The analysis plugin system includes a pointer analysis solver (pascal.taie.analysis.pta.core.solver.Solver) and a number of analyses that communicate with it. Each of these analyses is referred to as an analysis plugin that needs to implement interface pascal.taie.analysis.pta.plugin.Plugin. The interactions between the pointer analysis solver and an analysis plugin are carried out by calling each other’s APIs of Solver and Plugin, which are highlighted in blue and red, respectively. The Solver APIs have been implemented in the framework, and developers only need to implement the related APIs of Plugin, which are invoked by Solver at different stages (e.g., initialization and finishing) or on different events (e.g., discovery of new points-to relations and call edges). The additional auxiliary APIs, e.g., Solver.addStmts() and Plugin.onNewMethod(), are optional and designed to make it easier to implement specific analysis logic.

Let us briefly illustrate the basic working mechanism that drives these core APIs. Suppose you are implementing the onNewPointsToSet() method of an analysis Plugin. Whenever the points-to set (parameter PointsToSet) of a variable of interest (parameter CSVar) changes (i.e., the variable points to more objects), your logic needs to reflect the side effect of this change. From the perspective of pointer analysis, the final consequence of such an effect is to modify the points-to sets of related pointers or to add call graph edges at pertinent call sites. Accordingly, you should call Solver.addPointsTo() or Solver.addCallEdge() to inform the solver of these modifications. Conversely, during each analysis iteration, the solver calls Plugin.onNewPointsToSet() and Plugin.onNewCallEdge() of every plugin to notify them of changes to variables’ points-to sets or to call graph edges, respectively. As a result, to add a new analysis that interacts with pointer analysis, developers only need to implement a few methods of Plugin according to their requirements, as described above.
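The mutual notification described above can be reproduced in a toy model. The sketch below is deliberately not Tai-e's Solver or Plugin: ToySolver, ToyPlugin, and CopyPlugin are hypothetical, variables and objects are plain strings, and only the points-to callback is modelled. It shows a plugin reacting to a new points-to fact by reporting another fact back to the solver, and the solver iterating until no new facts arise.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy plugin interface; it only mirrors the callback idea of Plugin. */
interface ToyPlugin {
    default void setSolver(ToySolver solver) {}
    default void onNewPointsToSet(String var, Set<String> newObjs) {}
    default void onFinish() {}
}

/** Toy solver in which variables and objects are plain strings. */
final class ToySolver {
    private final Map<String, Set<String>> pointsTo = new HashMap<>();
    private final Deque<String[]> workList = new ArrayDeque<>(); // {var, obj}
    private final List<ToyPlugin> plugins = new ArrayList<>();

    void addPlugin(ToyPlugin plugin) {
        plugin.setSolver(this);
        plugins.add(plugin);
    }

    /** Called by clients or plugins to report a points-to fact. */
    void addPointsTo(String var, String obj) {
        workList.add(new String[]{var, obj});
    }

    Set<String> getPointsTo(String var) {
        return pointsTo.getOrDefault(var, Set.of());
    }

    void solve() {
        while (!workList.isEmpty()) {
            String[] entry = workList.poll();
            // only genuinely new facts trigger callbacks, so the loop terminates
            if (pointsTo.computeIfAbsent(entry[0], v -> new HashSet<>())
                    .add(entry[1])) {
                for (ToyPlugin plugin : plugins) {
                    plugin.onNewPointsToSet(entry[0], Set.of(entry[1]));
                }
            }
        }
        plugins.forEach(ToyPlugin::onFinish);
    }
}

/** Example plugin: whatever "src" points to, "dst" points to as well. */
final class CopyPlugin implements ToyPlugin {
    private ToySolver solver;

    @Override
    public void setSolver(ToySolver solver) {
        this.solver = solver;
    }

    @Override
    public void onNewPointsToSet(String var, Set<String> newObjs) {
        if (var.equals("src")) {
            newObjs.forEach(obj -> solver.addPointsTo("dst", obj));
        }
    }
}
```

After registering a CopyPlugin, calling addPointsTo("src", "o1") and then solve() leaves o1 in the points-to set of dst, even though the solver itself knows nothing about the relation between src and dst.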

This analysis plugin system is currently being used by a number of ongoing internal projects implemented by different developers (these projects will be released when finished), and the feedback from developers is very promising: everyone agrees that it can fulfill their practical needs and is simple to understand and apply. For more details of the analysis plugin system, please see Section 4.1 of Tai-e’s paper and the source code (specifically, the interfaces Plugin and Solver, which are self-documenting).

8.3. An Example of Plugin

We use an example to illustrate how to develop a new analysis plugin and add it to the pointer analysis framework. For simplicity, we omit the concrete analysis logic in the example.

Suppose we are implementing a taint analysis that interacts with pointer analysis. It requires the following steps.

Create a plugin class that implements the Plugin interface.

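package my.example;

import pascal.taie.analysis.graph.callgraph.Edge;
import pascal.taie.analysis.pta.core.cs.element.CSCallSite;
import pascal.taie.analysis.pta.core.cs.element.CSMethod;
import pascal.taie.analysis.pta.core.heap.Obj;
import pascal.taie.analysis.pta.core.solver.Solver;
import pascal.taie.analysis.pta.plugin.Plugin;

public class TaintAnalysis implements Plugin {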

Implement the necessary APIs of Plugin with the analysis logic.

    private Solver solver;

    @Override
    public void setSolver(Solver solver) {
        this.solver = solver;
    }

    @Override
    public void onNewCallEdge(Edge<CSCallSite, CSMethod> edge) {
        if (/* edge target is a taint source method */) {
            Obj taint = ... // generate taint object
            // add it to points-to set of LHS variable of the call site
            solver.addPointsTo(context, lhs, heapContext, taint);
        }
    }

    @Override
    public void onFinish() {
        // collect detected taint flows and report them
    }
}

Activate your analysis plugin.

Analysis plugins are loaded via reflection, so that you do not need to modify existing code to integrate the plugin. Simply add the plugin class name to the plugins option of pointer analysis to turn it on:

... -a pta=plugins:[my.example.TaintAnalysis];...

That’s it! Your taint analysis will run together with the pointer analysis.
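For readers unfamiliar with reflective loading, the sketch below shows the general Java mechanism by which classes named in a list can be instantiated without the loading code referring to them at compile time. It is not Tai-e's loading code: ReflectiveLoader is a hypothetical class, and it assumes each named class has a public no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical loader illustrating reflection-based instantiation. */
final class ReflectiveLoader {

    /**
     * Instantiates each named class through its no-argument constructor
     * and checks that the instance has the expected type.
     */
    static <T> List<T> load(List<String> classNames, Class<T> type) {
        List<T> instances = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> clazz = Class.forName(name);
                Object instance = clazz.getDeclaredConstructor().newInstance();
                instances.add(type.cast(instance));
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("cannot load " + name, e);
            }
        }
        return instances;
    }
}
```

Because the class names are ordinary strings, a new plugin only needs to be on the classpath and named in the option value; no existing code has to be edited to reference it.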

9. Publications

Tian Tan and Yue Li. Tai-e: A Developer-Friendly Static Analysis Framework for Java by Harnessing the Good Designs of Classics. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, Seattle, WA, USA, July 17-21, 2023 (ISSTA'23).
Wenjie Ma, Shengyuan Yang, Tian Tan, Xiaoxing Ma, Chang Xu and Yue Li. Context Sensitivity without Context: A Cut-Shortcut Approach to Fast and Precise Pointer Analysis. In Proceedings of the ACM on Programming Languages, 2023 (PLDI'23).
Tian Tan, Yue Li, Xiaoxing Ma, Chang Xu, and Yannis Smaragdakis. Making Pointer Analysis More Precise by Unleashing the Power of Selective Context Sensitivity. In Proceedings of the ACM on Programming Languages, 2021 (OOPSLA'21).
Yue Li, Tian Tan, Anders Møller, and Yannis Smaragdakis. A Principled Approach to Selective Context Sensitivity for Pointer Analysis. ACM Transactions on Programming Languages and Systems, 2020 (TOPLAS'20).
Yue Li, Tian Tan, and Jingling Xue. Understanding and Analyzing Java Reflection. ACM Transactions on Software Engineering and Methodology, 2019 (TOSEM'19).
Yue Li, Tian Tan, Anders Møller, and Yannis Smaragdakis. Scalability-First Pointer Analysis with Self-Tuning Context-Sensitivity. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL, USA, November 04-09, 2018 (ESEC/FSE'18).
Yue Li, Tian Tan, Anders Møller, and Yannis Smaragdakis. Precision-Guided Context Sensitivity for Pointer Analysis. Proceedings of the ACM on Programming Languages, 2018 (OOPSLA'18).
Tian Tan, Yue Li, and Jingling Xue. Efficient and Precise Points-to Analysis: Modeling the Heap by Merging Equivalent Automata. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, Barcelona, Spain, June 18-23, 2017 (PLDI'17).
Tian Tan, Yue Li, and Jingling Xue. Making k-Object-Sensitive Pointer Analysis More Precise with Still k-Limiting. In 23rd International Static Analysis Symposium, Edinburgh, UK, September 8-10, 2016, Proceedings (SAS'16).
Yue Li, Tian Tan, Yifei Zhang, and Jingling Xue. Program Tailoring: Slicing by Sequential Criteria. In Proceedings of the 30th European Conference on Object-Oriented Programming, July 18-22, 2016, Rome, Italy (ECOOP'16).
Yue Li, Tian Tan, and Jingling Xue. Effective Soundness-Guided Reflection Analysis. In 22nd International Static Analysis Symposium, Saint-Malo, France, September 9-11, 2015, Proceedings (SAS'15).
Yue Li, Tian Tan, Yulei Sui, and Jingling Xue. Self-Inferencing Reflection Resolution for Java. In 28th European Conference on Object-Oriented Programming, Uppsala, Sweden, July 28 - August 1, 2014, Proceedings (ECOOP'14).
