This repository has been archived by the owner on Feb 16, 2021. It is now read-only.

DataFrames

Phil Gaiser edited this page Mar 5, 2019 · 4 revisions

A DataFrame is a two-dimensional, size-mutable data structure with labeled columns, similar to a matrix. Unlike a matrix, a DataFrame can hold columns of different types. Because the JDK has no built-in DataFrame, the Claymore library provides two core implementations:

  • DefaultDataFrame
  • NullableDataFrame

DefaultDataFrame is the implementation used by default. It works with primitives, which means that it does not support null values. Passing a null value to a DefaultDataFrame at any time will cause a runtime exception.
NullableDataFrame is a more flexible implementation that can work with null values. However, since it uses the wrapper objects of all primitives as the underlying structure of its columns, it is generally less efficient than a DefaultDataFrame. When you don't need to work with null values, you should always use a DefaultDataFrame.
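
The difference between the two implementations comes down to how Java treats primitives and their wrapper types. The following is a minimal plain-Java sketch of that distinction; it does not use Claymore at all, and the class and method names (NullSupportSketch, countNulls, toPrimitive) are made up for illustration:

```java
class NullSupportSketch {

    // Counts null entries. This only works because Integer is a reference type.
    static int countNulls(Integer[] column) {
        int nulls = 0;
        for (Integer value : column) {
            if (value == null) {
                nulls++;
            }
        }
        return nulls;
    }

    // Copies boxed values into a primitive array. A null entry cannot be
    // unboxed to an int, so this throws a NullPointerException on the first null.
    static int[] toPrimitive(Integer[] column) {
        int[] result = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = column[i]; // auto-unboxing happens here
        }
        return result;
    }

    public static void main(String[] args) {
        Integer[] boxed = {21, null, 23};
        System.out.println("nulls: " + countNulls(boxed));
        try {
            toPrimitive(boxed);
        } catch (NullPointerException e) {
            System.out.println("a primitive array cannot represent null");
        }
    }
}
```

An int[] stores the values directly, while an Integer[] stores references to objects, which is the extra indirection behind the efficiency difference mentioned above.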

Although you can use DataFrames by working with column indices alone, it is recommended to use the column name features and to always address columns by name. The main reason is that it makes your code more readable, and in the vast majority of cases the performance cost is negligible. It also allows you to change the internal order of columns (by switching, inserting or removing them) without further changes to your code.
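
To illustrate why name-based access survives a change in column order, here is a small plain-Java sketch. It is not how Claymore stores its columns; NamedColumnsSketch and its methods are hypothetical stand-ins that only model the idea of a name lookup in front of a positional index:

```java
import java.util.ArrayList;
import java.util.List;

class NamedColumnsSketch {

    private final List<String> names = new ArrayList<>();
    private final List<int[]> data = new ArrayList<>();

    // Appends a column at the end.
    void addColumn(String name, int[] values) {
        names.add(name);
        data.add(values);
    }

    // Inserts a column at the given position, shifting later columns to the right.
    void insertColumn(int index, String name, int[] values) {
        names.add(index, name);
        data.add(index, values);
    }

    // Positional access: the meaning of 'col' changes when columns are reordered.
    int getInt(int col, int row) {
        return data.get(col)[row];
    }

    // Name-based access: one extra lookup, but stable across reordering.
    int getInt(String name, int row) {
        int col = names.indexOf(name);
        if (col < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return data.get(col)[row];
    }

    public static void main(String[] args) {
        NamedColumnsSketch table = new NamedColumnsSketch();
        table.addColumn("id", new int[]{100, 101});
        table.addColumn("age", new int[]{31, 39});
        table.insertColumn(0, "score", new int[]{7, 8});
        System.out.println(table.getInt("age", 1)); // still refers to the age column
        System.out.println(table.getInt(1, 1));     // index 1 is now the id column
    }
}
```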

All DataFrame versions override the toString() method from Object. For learning and debugging purposes you can always do System.out.println(df) to show the content of your DataFrame in the console.

For serializing and deserializing DataFrames you should use a DataFrameSerializer.

Let's take a look at a quick example:

DataFrame df = new DefaultDataFrame();

The above code will construct an empty non-nullable DataFrame. It has no columns. You could do the following to add a couple of labeled columns:

df.addColumn("col1", new IntColumn());
df.addColumn("col2", new StringColumn());
df.addColumn("col3", new FloatColumn());

The first argument to addColumn() is the name of the column to add, and the second argument is the concrete column type.
Our DataFrame is still empty, though, and since an empty DataFrame is not particularly useful, let's add a couple of rows to it:

df.addRow(new Object[]{222, "bla", 12.78f});
df.addRow(new Object[]{333, "blub", 74.21f});
df.addRow(new Object[]{444, "Sandwich", 10.6847f});
df.addRow(new Object[]{555, "some text", 3.14159f});

Now your DataFrame actually holds some data. You can go ahead and perform a query. For example:

System.out.println(df.getString("col2", 2));

This should print "Sandwich" to your console. We had to call getString() because the column named "col2" is a StringColumn. There are getters and setters for the primitive types as well, and they follow the same pattern. For example, let's change the value 555 in the above DataFrame to something else:

df.setInt("col1", 3, 10);

This will set the integer value of the "col1" column at row index 3 to the value 10. You can verify the change by printing out the entire DataFrame with System.out.println(df). The output should look like this:

_| col1 col2      col3
0| 222  bla       12.78
1| 333  blub      74.21
2| 444  Sandwich  10.6847 
3| 10   some text 3.14159 

Sometimes you need to know how many columns or rows your DataFrame currently has. The following example will simply print out the number of rows and columns of your DataFrame:

System.out.println(df.rows());
System.out.println(df.columns());

Because a column is just a vector of things, we could have created the exact same DataFrame through a constructor like this:

DataFrame df = new DefaultDataFrame(new String[]{"col1", "col2", "col3"}, 
	new IntColumn(new int[]{222, 333, 444, 555}),
	new StringColumn(new String[]{"bla", "blub", "Sandwich", "some text"}),
	new FloatColumn(new float[]{12.78f, 74.21f, 10.6847f, 3.14159f}));

As you can see, all *Column constructors can take an array of their corresponding type as an argument which will construct a column comprised of the data in that array.


Using Null Values

If you ever require your DataFrame to hold null values, you can simply change the used implementation to a NullableDataFrame. If you constructed your DataFrame manually, you must also change the concrete implementation of all columns used. For example, imagine you create a DefaultDataFrame like this:

int[] ints = new int[]{21, 22, 23};
float[] floats = new float[]{57.6f, 58.9f, 59.42f};
char[] chars = new char[]{'a', 'b', 'c'};

Column col1 = new IntColumn(ints);
Column col2 = new FloatColumn(floats);
Column col3 = new CharColumn(chars);

DataFrame df = new DefaultDataFrame(new String[]{"col1","col2","col3"}, col1, col2, col3);

Using a NullableDataFrame instead would require you to change your code to look like this:

int[] ints = new int[]{21, 22, 23};
float[] floats = new float[]{57.6f, 58.9f, 59.42f};
char[] chars = new char[]{'a', 'b', 'c'};

Column col1 = new NullableIntColumn(ints);
Column col2 = new NullableFloatColumn(floats);
Column col3 = new NullableCharColumn(chars);

DataFrame df = new NullableDataFrame(new String[]{"col1","col2","col3"}, col1, col2, col3);

You can probably spot the pattern here: add the prefix Nullable to the concrete implementations you want to use in your code. All Nullable*Column classes also provide a constructor that expects an array of primitive wrapper objects, so you can initialize your columns with null values.

NOTE: You can also change the type of your DataFrame at runtime by calling the static utility method DataFrame.convert(). See Javadocs for details.

Searching in DataFrames

The DataFrame interface specifies various ways to search for specific content in your DataFrame. Use the indexOf() method to find the row index of the specified element. You can also use a regular expression as the search term. The following example will return the row index of the first e-mail in a column named "email" that ends with "@gmail.com":

int index = df.indexOf("email", "^[\\w.]+@gmail\\.com$");

The indexOf() method searches the rows in order, from index 0 to n-1, where n is the number of rows. The complexity of this operation is therefore O(n). You can also specify an index from which to start searching. For example:

int index = df.indexOf("email", 15, "^[\\w.]+@gmail\\.com$");

will search from index 15 until the end. Alternatively, you can get the indices of all occurrences of your search term in one go:

int[] indices = df.indexOfAll("email", "^[\\w.]+@gmail\\.com$");

Additionally, there is the option to get a sub-DataFrame that holds all columns of the original but only those rows which have the specified search term in the specified column. For example, the following code will return a sub-DataFrame holding all rows that have an e-mail address ending with "@gmail.com" in the "email" column:

DataFrame gmail = df.filter("email", "^[\\w.]+@gmail\\.com$");
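
To make the search semantics described in this section concrete, here is a rough plain-Java sketch of a linear regex search over a single String column. It is not Claymore's implementation: the class RegexSearchSketch is hypothetical, returning -1 for "not found" is an assumption made for this sketch, and the methods only mirror the names used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexSearchSketch {

    // Returns the first index >= start whose value fully matches the regex,
    // or -1 if there is none (the -1 convention is an assumption of this sketch).
    static int indexOf(String[] column, int start, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (int i = start; i < column.length; i++) {
            if (column[i] != null && pattern.matcher(column[i]).matches()) {
                return i;
            }
        }
        return -1;
    }

    // Searches the whole column, starting at index 0.
    static int indexOf(String[] column, String regex) {
        return indexOf(column, 0, regex);
    }

    // Collects the indices of all matching rows in one O(n) pass.
    static int[] indexOfAll(String[] column, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < column.length; i++) {
            if (column[i] != null && pattern.matcher(column[i]).matches()) {
                hits.add(i);
            }
        }
        int[] result = new int[hits.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = hits.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] emails = {"a@web.de", "bob@gmail.com", "c@x.org", "d.e@gmail.com"};
        String regex = "^[\\w.]+@gmail\\.com$";
        System.out.println(indexOf(emails, regex));    // first match
        System.out.println(indexOf(emails, 2, regex)); // first match at or after index 2
        System.out.println(indexOfAll(emails, regex).length);
    }
}
```

A filter operation can be thought of as indexOfAll() followed by copying the matching rows of every column into a new structure.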

Minimum, Maximum and Average

For any column holding numerical data, you can compute the average, the minimum and the maximum of all values within that column. For example, let's say that you have a DataFrame with a column named "age". You could get the average age by calling:

double avg = df.average("age");

In order to support computation of columns holding double values, minimum(), maximum() and average() methods always return a double. If the "age" column in the above example was holding int values, you could safely cast the value returned by minimum() and maximum() to an int:

int min = (int) df.minimum("age");
int max = (int) df.maximum("age");
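
The cast is safe because every int value can be represented exactly as a double, so converting to double and back loses nothing. The plain-Java sketch below shows the same idea on a bare int array. ColumnStatsSketch is a hypothetical helper written for this page, not part of Claymore, and rejecting empty input with an exception is an assumption of the sketch.

```java
class ColumnStatsSketch {

    private static void requireNonEmpty(int[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("column must not be empty");
        }
    }

    // Smallest value, widened to double.
    static double minimum(int[] values) {
        requireNonEmpty(values);
        double min = values[0];
        for (int v : values) {
            if (v < min) {
                min = v;
            }
        }
        return min;
    }

    // Largest value, widened to double.
    static double maximum(int[] values) {
        requireNonEmpty(values);
        double max = values[0];
        for (int v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    // Arithmetic mean. Unlike minimum and maximum, the result is generally
    // not a whole number, so it should not be cast back to int.
    static double average(int[] values) {
        requireNonEmpty(values);
        double sum = 0.0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        int[] ages = {31, 39, 23, 43, 34};
        int min = (int) minimum(ages); // lossless: the double holds an exact int
        int max = (int) maximum(ages);
        System.out.println(min + " " + max + " " + average(ages));
    }
}
```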

Sorting

You can sort the content of a DataFrame by a specific column. The sorting algorithm is a relatively simple QuickSort implementation with an average time complexity of O(n log n). It should be performant enough for most use cases.

Example:
Let's assume you have the following DataFrame:

_| id  name           age
0| 100 Seth McFarlane 31
1| 101 Peter Griffin  39
2| 102 Adam West      23
3| 103 Joe Swanson    43
4| 104 Glenn Quagmire 34

You may want to sort all people by their age. Simply call sortBy() and specify the column:

df.sortBy("age");

After the sort operation, the above DataFrame will look like this:

_| id  name           age
0| 102 Adam West      23
1| 100 Seth McFarlane 31
2| 104 Glenn Quagmire 34
3| 101 Peter Griffin  39
4| 103 Joe Swanson    43 
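
Sorting by one column means that every other column has to be rearranged in the same way, so that rows stay intact. The plain-Java sketch below shows that idea on parallel arrays. It is only an illustration: it relies on Arrays.sort from the JDK rather than the QuickSort implementation mentioned above, and SortByColumnSketch with its methods is a hypothetical helper, not Claymore code.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortByColumnSketch {

    // Returns the row indices in the order that sorts 'keys' ascending.
    static Integer[] sortedOrder(int[] keys) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt((Integer i) -> keys[i]));
        return order;
    }

    // Applies a row order to an int column.
    static int[] reorder(int[] column, Integer[] order) {
        int[] result = new int[column.length];
        for (int i = 0; i < order.length; i++) {
            result[i] = column[order[i]];
        }
        return result;
    }

    // Applies a row order to a String column.
    static String[] reorder(String[] column, Integer[] order) {
        String[] result = new String[column.length];
        for (int i = 0; i < order.length; i++) {
            result[i] = column[order[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] id = {100, 101, 102, 103, 104};
        String[] name = {"Seth McFarlane", "Peter Griffin", "Adam West", "Joe Swanson", "Glenn Quagmire"};
        int[] age = {31, 39, 23, 43, 34};

        // Compute the order once from the key column, then apply it to every column.
        Integer[] order = sortedOrder(age);
        System.out.println(Arrays.toString(reorder(id, order)));
        System.out.println(Arrays.toString(reorder(name, order)));
        System.out.println(Arrays.toString(reorder(age, order)));
    }
}
```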

Cloning

Every DataFrame implements the Cloneable interface. Because of that you can copy an entire DataFrame by doing:

DataFrame clone = (DataFrame) df.clone();

Alternatively, you can call the static utility method DataFrame.copyOf() which performs the same operation:

DataFrame clone = DataFrame.copyOf(df);

Iterating over Columns in a DataFrame

Because DataFrame implements the Iterable interface, you can use any DataFrame in a for-each-loop. For example, this is how to iterate over all columns in a DataFrame:

for(Column col : df){
	//do stuff
}
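
For readers unfamiliar with how a class becomes usable in a for-each loop, the following plain-Java sketch shows the mechanism. ColumnTableSketch is a hypothetical stand-in that uses int arrays in place of Claymore's Column type; the only requirement is an iterator() method that yields the columns one by one.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ColumnTableSketch implements Iterable<int[]> {

    private final List<int[]> columns;

    ColumnTableSketch(int[]... columns) {
        this.columns = Arrays.asList(columns);
    }

    // Implementing Iterable is what enables the for-each syntax.
    @Override
    public Iterator<int[]> iterator() {
        return columns.iterator();
    }

    public static void main(String[] args) {
        ColumnTableSketch table = new ColumnTableSketch(
                new int[]{1, 2}, new int[]{3, 4}, new int[]{5, 6});
        for (int[] column : table) {
            System.out.println(Arrays.toString(column));
        }
    }
}
```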

Using static DataFrames

Sometimes you may want to use a DataFrame with a static column structure. Static in this context means that the column structure of your DataFrame does not change at runtime. If that is the case, you can use the row annotation feature introduced in Claymore version 2.0.0.

Basically, with this feature you can declare your own class that should represent a row for your static DataFrame. This makes working with rows a bit neater because you won't have to deal with arrays of Objects anymore. Additionally, you can also construct a new DataFrame based on the structure of your custom class, which overall reduces code.

Let's take a look at a quick example. Imagine you want a DataFrame that holds data about different people, as in the Sorting example. If you don't plan to add, remove or insert any columns at runtime, your DataFrame is considered static and you can write a custom class representing a row in that particular DataFrame. That class may look like this:

public class FamilyGuy implements Row {

	@RowItem("id")
	private int id;
	
	@RowItem("name")
	private String name;
	
	@RowItem("age")
	private byte age;
	
	public FamilyGuy(){ }

	//getters and setters ...

}

All you have to do is let your custom class implement Row and annotate every field that corresponds to a column with RowItem. Row is just a marker interface, so you don't have to implement any methods. The RowItem annotation has a single attribute which specifies the name of the column the annotated field belongs to. When this attribute is omitted, the name of the annotated field is used instead. That would work in the above code, but it is recommended to always specify the column name in the annotation. Note that your class must always provide a no-argument constructor. In most cases it also makes sense to write getters and setters for your members, but this feature does not require them.
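
The requirement for a no-argument constructor is typical of reflection-based mapping: the mapper first creates an empty instance and then fills the annotated fields. The sketch below illustrates that general technique in plain Java. It is not Claymore's implementation; the ColumnName annotation, the Person class and the fromRow() method are hypothetical names invented for this example, and a Map stands in for a DataFrame row.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

class RowMappingSketch {

    // Hypothetical annotation; an empty value means "use the field name".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ColumnName {
        String value() default "";
    }

    static class Person {
        @ColumnName("id")
        private int id;

        @ColumnName // no value given, so the field name "name" is used
        private String name;

        private String notAColumn; // not annotated, therefore ignored

        Person() { } // needed so the mapper can create an empty instance

        int getId() { return id; }
        String getName() { return name; }
    }

    // Creates an instance of 'type' and copies row values into annotated fields.
    static <T> T fromRow(Map<String, Object> row, Class<T> type) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                ColumnName annotation = field.getAnnotation(ColumnName.class);
                if (annotation == null) {
                    continue;
                }
                String column = annotation.value().isEmpty()
                        ? field.getName()
                        : annotation.value();
                field.setAccessible(true);
                field.set(instance, row.get(column)); // unboxes for primitive fields
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map row to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 101);
        row.put("name", "Peter Griffin");
        Person person = fromRow(row, Person.class);
        System.out.println(person.getId() + " " + person.getName());
    }
}
```

Without the no-argument constructor, the getDeclaredConstructor() call in this sketch would fail, which is the kind of failure the rule above guards against.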

Now you can for example add a row to your DataFrame like this:

df.addRow(myGuy);

where myGuy is an instance of the FamilyGuy class we just wrote. Getting a row from our DataFrame will now also return an instance of FamilyGuy with all annotated fields already set.

FamilyGuy guy = df.getRowAt(0, FamilyGuy.class);

will return the row at index 0 as a FamilyGuy object.

If you want a DataFrame with the appropriate column structure created for you, then you can use a specific constructor which infers that column structure from the annotated fields in your custom class. In the above example we could have created the DataFrame like this:

DataFrame df = new DefaultDataFrame(FamilyGuy.class);

As you can see, this is much simpler than having to construct all columns and their labels manually.