Template
Learning Objectives
After completing this lesson, learners should be able to:
Understand the pros and cons of a number of table file formats
Motivation
Next to images, tables are the most prevalent data format in bioimage analysis. For instance, segmenting cells in an image very often results in a cells-by-features table that includes cell shape or intensity measurements. To share these measurements effectively, it is important to understand how to write and read tabular data.
Concept map
graph TD
M("Measurements") --> T("Table")
T --> F("On disk representation")
Figure
Activities
Parquet
Parquet is a column-based table format. This enables efficient compression and allows efficient reading of column subsets. In addition, because the data is organised into row groups, row subsets can also be read efficiently. These features make Parquet well suited to handling very large tabular data.
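The column-based layout and the row groups can be pictured with a toy model. The sketch below is not real Parquet and uses no Parquet library. The helper names `to_row_groups` and `read_column` are hypothetical and exist only for this illustration. It splits a small cell table into row groups and stores each group column by column, so that reading one column, or one row group, never touches the rest of the data.

```python
# Toy model of a column-based layout with row groups (NOT the Parquet format).
# Helper names are hypothetical and only serve this illustration.

rows = [
    {"label_id": 1, "area_um2": 100.0, "circularity": 0.95},
    {"label_id": 2, "area_um2": 123.5, "circularity": 0.43},
    {"label_id": 3, "area_um2": 115.3, "circularity": 0.77},
]


def to_row_groups(rows, group_size):
    """Split the rows into groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


def read_column(groups, name):
    """Collect a single column; the other columns are never accessed."""
    return [value for group in groups for value in group[name]]


groups = to_row_groups(rows, group_size=2)

# Column subset: only the "area_um2" values are visited
areas = read_column(groups, "area_um2")
print(areas)

# Row subset: only the second row group is visited
second_group = groups[1]
print(second_group)
```

In a real Parquet file the values of each column within a row group are stored contiguously and compressed together, which is what makes both kinds of partial read cheap.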
- Create a table in memory
- Save the table as Parquet
- Read the stored Parquet table
- If the library supports it, explore:
  - Only reading a subset of columns
  - Only reading a subset of rows
Show activity for:
Java TableSaw
/*
# Sources
- https://github.com/mobie/mobie-viewer-fiji/blob/main/src/test/java/develop/ExploreTableSawParquet.java
- https://github.com/mobie/mobie-viewer-fiji/blob/main/pom.xml

# Requirements (Maven)
<properties>
    <tablesaw-core.version>0.43.1</tablesaw-core.version>
    <tablesaw-parquet.version>0.10.0</tablesaw-parquet.version>
</properties>
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>${tablesaw-core.version}</version>
</dependency>
<dependency>
    <groupId>net.tlabs-data</groupId>
    <artifactId>tablesaw_${tablesaw-core.version}-parquet</artifactId>
    <version>${tablesaw-parquet.version}</version>
</dependency>
*/

import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.Table;

public class ParquetTableSaw
{
    public static void main( String[] args )
    {
        // Output location; adapt this path to your system
        String tableFilePath = "/Users/tischer/Desktop/table.parquet";

        // Create a Tablesaw table in memory
        IntColumn labelId = IntColumn.create( "label_id", 1, 2, 3 );
        DoubleColumn area = DoubleColumn.create( "area_um2", 100.0, 123.5, 115.3 );
        DoubleColumn circularity = DoubleColumn.create( "circularity", 0.95, 0.43, 0.77 );
        Table table = Table.create( "cells", labelId, area, circularity );

        // Write the table as Parquet, overwriting any existing file
        new TablesawParquetWriter().write(
                table,
                TablesawParquetWriteOptions
                        .builder( tableFilePath )
                        .withOverwrite( true )
                        .build() );

        // Read only a subset of the columns,
        // which is possible due to the column-based Parquet format
        table = new TablesawParquetReader().read(
                TablesawParquetReadOptions
                        .builder( tableFilePath )
                        .withOnlyTheseColumns( "label_id", "area_um2" )
                        .build() );

        System.out.println( "Read columns: " + String.join( ", ", table.columnNames() ) );
    }
}
Python PyArrow
Assessment
Fill in the blanks
- TODO ___ .
- TODO ___ .
Solution
- TODO
- TODO
Follow-up material
Recommended follow-up modules:
Learn more: