CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.
Dario Radečić
Sep 20, 2021
Everyone and their grandmother knows what a CSV file is. But is it the optimal way to store data? Heck no. It's probably the worst storage format if you don't plan to view or edit the data on the fly.
If you’re storing large volumes of data, opting for CSVs will cost you both time and money.
Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.
Let’s set up the environment before going over the file formats.
Getting started – Environment setup
You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:
```shell
conda create --name file_formats python=3.8
conda activate file_formats
conda install -c conda-forge numpy pandas fastavro pyarrow feather-format jupyter jupyterlab
```

Once the installation finishes, you can execute the following command to start a JupyterLab session:
```shell
jupyter lab
```

The next step is to import the libraries and create an arbitrary dataset. You'll make one with 5 columns and 10M rows:
```python
import numpy as np
import pandas as pd
import feather
import pickle
import pyarrow as pa
import pyarrow.orc as orc
from fastavro import writer, reader, parse_schema

np.random.seed(42)

df_size = 10_000_000
df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})
df.head()
```

Here's how it looks:
Image 1 – Random dummy dataset (image by author)
You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.
ORC
ORC stands for Optimized Row Columnar. It's a data format optimized for reads and writes in Hive. As Hive is painfully slow, folks at Hortonworks decided to develop the ORC file format to speed it up.
In Python, you can use the read_orc() function from Pandas to read ORC files. Unfortunately, there's no equivalent Pandas function for writing ORC files, so you'll have to use PyArrow.
Here’s an example of writing Pandas DataFrames:
```python
table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '10M.orc')
```

And here's the command for reading ORC files:
```python
df = pd.read_orc('10M.orc')
```

You can learn more about ORC here:
Stop Using CSVs for Storage – This File Format is Faster and Lighter
Avro
Avro is an open-source project which provides services of data serialization and exchange for Apache Hadoop. It stores a JSON-like schema with the data, so the correct data types are known in advance. That’s where the compression happens.
Avro has an API for every major programming language, but it doesn’t support Pandas by default.
Here’s the set of commands for saving a Pandas DataFrame to an Avro file:
```python
# 1. Define the schema
schema = {
    'doc': 'Float data',
    'name': 'Data',
    'namespace': 'data',
    'type': 'record',
    'fields': [
        {'name': 'a', 'type': 'float'},
        {'name': 'b', 'type': 'float'},
        {'name': 'c', 'type': 'float'},
        {'name': 'd', 'type': 'float'},
        {'name': 'e', 'type': 'float'},
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - list of dictionaries
records = df.to_dict('records')

# 3. Write to Avro file
with open('10M.avro', 'wb') as out:
    writer(out, parsed_schema, records)
```

Reading Avro files is no picnic either:
```python
# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('10M.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df = pd.DataFrame(avro_records)
```

You can learn more about Avro here:
CSV Files for Storage? Absolutely Not. Use Apache Avro Instead
Parquet
Apache Parquet is a data storage format designed for efficiency. Its efficiency comes from the columnar storage architecture, which lets you quickly skip over data that isn't relevant. This makes both queries and aggregations faster, resulting in hardware savings.
The best news is – Pandas has full support for Parquet files.
Here’s the command for writing a Pandas DataFrame to a Parquet file:
```python
df.to_parquet('10M.parquet')
```

And here's the equivalent for reading:
```python
df = pd.read_parquet('10M.parquet')
```

You can learn more about Parquet here:
CSV Files for Storage? No Thanks. There’s a Better Option
Pickle
You can use the pickle module to serialize objects and save them to a file. Likewise, you can then deserialize the file to load the objects back when needed. Pickle has one major advantage over other formats - you can use it to store any Python object. One of the most widely used applications is saving machine learning models after training completes.
The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.
Here’s how to write a Pandas DataFrame to a Pickle file:
```python
with open('10M.pkl', 'wb') as f:
    pickle.dump(df, f)
```

You'll only have to change the file mode when reading a Pickle file:
```python
with open('10M.pkl', 'rb') as f:
    df = pickle.load(f)
```

You can learn more about Pickle here:
Stop Using CSVs for Storage – Pickle is an 80 Times Faster Alternative
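As a side note, Pandas also ships a thin wrapper around pickle that saves you the manual file handling. A minimal sketch (demo file name is a placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000, 3), columns=['a', 'b', 'c'])

# Pandas' built-in pickle wrapper - no explicit open()/close() needed
df.to_pickle('demo.pkl')
df2 = pd.read_pickle('demo.pkl')
print(df.equals(df2))  # True
```

The result is identical to using the pickle module directly; it's purely a convenience.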
Feather
Feather is a data format for storing data frames. It’s designed around a simple premise – to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.
You can use the feather library to work with Feather files in Python. It's currently the fastest option available.
Here’s the command for saving Pandas DataFrames to a Feather file:
```python
feather.write_dataframe(df, '10M.feather')
```

And here's the command for reading:
```python
df = feather.read_dataframe('10M.feather')
```

You can learn more about Feather here:
Stop Using CSVs for Storage – This File Format Is 150 Times Faster
Comparison time – Which data format should you use?
Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.
Below is a comparison between CSV and every other mentioned data format in write time. The goal was to save previously created 10Mx5 dataset locally:
Image 2 – Write time comparison in seconds (CSV: 34.7; ORC: 9.66; Avro: 9.58; Parquet: 2.06; Pickle: 0.5; Feather: 0.304) (image by author)
The differences are astronomical. Feather is about 115 times faster than CSV for storing an identical dataset. Even if you decide to go with something more widely compatible, such as Parquet, you still get a roughly 17x reduction in write time.
Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:
Image 3 – Read time comparison in seconds (CSV: 3.83; ORC: 3.02, Avro: 27.6; Parquet: 1.23; Pickle: 0.193; Feather: 1.16) (image by author)
CSVs aren't so terrible here. Apache Avro is the absolute worst due to the required parsing. Pickle is the fastest, so it looks like the most promising option if you're working only in Python.
And finally, let’s compare the file sizes on the disk:
Image 4 – File size comparison in MB (CSV: 963.5; ORC: 379.5; Avro: 200.3; Parquet: 401.4; Pickle: 400; Feather: 400.1) (image by author)
Things don’t look good for CSVs. The file size reduction goes from 2.4x to 4.8x, depending on the file format.
To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any other mentioned format.
What are your thoughts on these CSV alternatives? Which one(s) do you use when viewing and editing data on the fly isn't required? Let me know in the comments below.