After completing this lesson, learners should be able to:
Understand the pros and cons of a number of table file formats
Motivation
Next to images, tables are the most prevalent data format in bioimage analysis. For instance, segmenting cells in an image very often results in a cells-by-features table, containing, e.g., cell shape or intensity measurements. To effectively share such measurements, it is important to understand how to write and read tabular data.
Concept map
graph TD
M("Measurements") --> T("Table")
T --> F("On disk representation")
Parquet is a column-based table format. This enables efficient compression and allows for efficiently reading subsets of columns. In addition, thanks to the concept of row groups, it also allows for efficiently reading subsets of rows. These features make Parquet interesting for handling very large tabular data.
Create a table in memory
Save the table as Parquet
Read the stored Parquet table
If the library supports it, explore
Only reading a subset of columns
Only reading a subset of rows
Show activity for:
Java TableSaw
/*
# Sources
- https://github.com/mobie/mobie-viewer-fiji/blob/main/src/test/java/develop/ExploreTableSawParquet.java
- https://github.com/mobie/mobie-viewer-fiji/blob/main/pom.xml
# Requirements
<properties>
    <tablesaw-core.version>0.43.1</tablesaw-core.version>
    <tablesaw-parquet.version>0.10.0</tablesaw-parquet.version>
</properties>
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>${tablesaw-core.version}</version>
</dependency>
<dependency>
    <groupId>net.tlabs-data</groupId>
    <artifactId>tablesaw_${tablesaw-core.version}-parquet</artifactId>
    <version>${tablesaw-parquet.version}</version>
</dependency>
*/
import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.Table;
public class ParquetTableSaw
{
    public static void main( String[] args )
    {
        String tableFilePath = "/Users/tischer/Desktop/table.parquet";

        // Create a Tablesaw table
        IntColumn labelId = IntColumn.create( "label_id", 1, 2, 3 );
        DoubleColumn area = DoubleColumn.create( "area_um2", 100.0, 123.5, 115.3 );
        DoubleColumn circularity = DoubleColumn.create( "circularity", 0.95, 0.43, 0.77 );
        Table table = Table.create( "cells", labelId, area, circularity );

        // Write as Parquet
        new TablesawParquetWriter()
            .write( table,
                TablesawParquetWriteOptions
                    .builder( tableFilePath )
                    .withOverwrite( true )
                    .build() );

        // Read only a subset of the columns,
        // which is possible due to the Parquet format
        table = new TablesawParquetReader()
            .read( TablesawParquetReadOptions
                .builder( tableFilePath )
                .withOnlyTheseColumns( "label_id", "area_um2" )
                .build() );

        System.out.println( "Read columns: " + String.join( ", ", table.columnNames() ) );
    }
}
Python PyArrow
# Code from: https://arrow.apache.org/docs/python/parquet.html
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# %%
# create a dataframe
df = pd.DataFrame({
    'label_id': [1, 2, 3],
    'area_um2': [100.0, 151.3, 121.3],
    'circularity': [0.92, 0.73, 0.55]})

# %%
# save df as parquet
table = pa.Table.from_pandas(df)
table_file_path = "cells.parquet"
pq.write_table(table, table_file_path)

# %%
# read a column subset
table = pq.read_table(table_file_path, columns=['label_id', 'area_um2'])
print(table)

# %%
# convert to pandas dataframe
df = table.to_pandas()
print(df)
# TODO:
# - Write and read "row groups"