After completing this lesson, learners should be able to:
Understand the pros and cons of a number of table file formats
Motivation
Next to images, tables are the most prevalent data format in bioimage analysis. For instance, segmenting cells in an image very often results in a cells-by-features table, containing, e.g., cell shape or intensity measurements. To effectively share such measurements, it is important to understand how to write and read tabular data.
Concept map
graph TD
M("Measurements") --> T("Table")
T --> F("On disk representation")
Parquet is a column-based table format. This enables efficient compression and allows for efficiently reading subsets of columns. In addition, thanks to the concept of row groups, it also allows for efficiently reading subsets of rows. These features make Parquet interesting for handling very large tabular data.
Create a table in memory
Save the table as Parquet
Read the stored Parquet table
If the library supports it, explore
Only reading a subset of columns
Only reading a subset of rows
Show activity for:
Java TableSaw
/*
# Sources
- https://github.com/mobie/mobie-viewer-fiji/blob/main/src/test/java/develop/ExploreTableSawParquet.java
- https://github.com/mobie/mobie-viewer-fiji/blob/main/pom.xml
# Requirements
<properties>
    <tablesaw-core.version>0.43.1</tablesaw-core.version>
    <tablesaw-parquet.version>0.10.0</tablesaw-parquet.version>
</properties>
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>${tablesaw-core.version}</version>
</dependency>
<dependency>
    <groupId>net.tlabs-data</groupId>
    <artifactId>tablesaw_${tablesaw-core.version}-parquet</artifactId>
    <version>${tablesaw-parquet.version}</version>
</dependency>
*/
import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.Table;
public class ParquetTableSaw
{
    public static void main( String[] args )
    {
        String tableFilePath = "/Users/tischer/Desktop/table.parquet";

        // Create a Tablesaw table
        IntColumn labelId = IntColumn.create( "label_id", 1, 2, 3 );
        DoubleColumn area = DoubleColumn.create( "area_um2", 100.0, 123.5, 115.3 );
        DoubleColumn circularity = DoubleColumn.create( "circularity", 0.95, 0.43, 0.77 );
        Table table = Table.create( "cells", labelId, area, circularity );

        // Write as Parquet
        new TablesawParquetWriter()
            .write( table,
                TablesawParquetWriteOptions
                    .builder( tableFilePath )
                    .withOverwrite( true )
                    .build() );

        // Read only a subset of the columns,
        // which is possible due to the Parquet format
        table = new TablesawParquetReader()
            .read( TablesawParquetReadOptions
                .builder( tableFilePath )
                .withOnlyTheseColumns( "label_id", "area_um2" )
                .build() );

        System.out.println( "Read columns: " + String.join( ", ", table.columnNames() ) );
    }
}
Python PyArrow
# Code from: https://arrow.apache.org/docs/python/parquet.html
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# %%
# create a dataframe
df = pd.DataFrame({
    'label_id': [1, 2, 3],
    'area_um2': [100.0, 151.3, 121.3],
    'circularity': [0.92, 0.73, 0.55]})

# %%
# save df as parquet
table = pa.Table.from_pandas(df)
table_file_path = "cells.parquet"
pq.write_table(table, table_file_path)

# %%
# read a column subset
table = pq.read_table(table_file_path, columns=['label_id', 'area_um2'])
print(table)

# %%
# convert to pandas dataframe
df = table.to_pandas()
print(df)
# TODO:
# - Write and read "row groups"