Stop Using CSVs for Storage – Here Are the Top 5 Alternatives

CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.

Dario Radečić

Sep 20, 2021

7 min read

Photo by jaikishan patel on Unsplash

Everyone and their grandmother knows what a CSV file is. But is it the optimal way to store data? Heck no. It's probably the worst storage format if you don't plan to view or edit data on the fly.

If you’re storing large volumes of data, opting for CSVs will cost you both time and money.

Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.

Let’s set up the environment before going over the file formats.


Getting started – Environment setup

You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:

conda create --name file_formats python=3.8
conda activate file_formats
conda install -c conda-forge numpy pandas fastavro pyarrow feather-format jupyter jupyterlab

Once the installation finishes, you can execute the following command to start a JupyterLab session:

jupyter lab

The next step is to import the libraries and create an arbitrary dataset. You’ll make one with 5 columns and 10M rows:

import numpy as np
import pandas as pd
import feather
import pickle
import pyarrow as pa
import pyarrow.orc as orc
from fastavro import writer, reader, parse_schema

np.random.seed(42)

df_size = 10_000_000
df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})
df.head()

Here’s how it looks:

Image 1 – Random dummy dataset (image by author)

You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.


ORC

ORC stands for Optimized Row Columnar. It’s a data format optimized for reads and writes in Hive. As Hive is painfully slow, folks at Hortonworks decided to develop the ORC file format to speed it up.

In Python, you can use the read_orc() function from Pandas to read ORC files. Unfortunately, there’s no alternative function for writing ORC files, so you’ll have to use PyArrow.

Here’s an example of writing Pandas DataFrames:

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '10M.orc')

And here’s the command for reading ORC files:

df = pd.read_orc('10M.orc')

You can learn more about ORC here:

Stop Using CSVs for Storage – This File Format is Faster and Lighter


Avro

Avro is an open-source project that provides data serialization and exchange services for Apache Hadoop. It stores a JSON-like schema with the data, so the correct data types are known in advance. That’s where the compression happens.

Avro has an API for every major programming language, but it doesn’t support Pandas by default.

Here’s the set of commands for saving a Pandas DataFrame to an Avro file:

# 1. Define the schema
schema = {
    'doc': 'Float data',
    'name': 'Data',
    'namespace': 'data',
    'type': 'record',
    'fields': [
        {'name': 'a', 'type': 'float'},
        {'name': 'b', 'type': 'float'},
        {'name': 'c', 'type': 'float'},
        {'name': 'd', 'type': 'float'},
        {'name': 'e', 'type': 'float'},
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - a list of dictionaries
records = df.to_dict('records')

# 3. Write to an Avro file
with open('10M.avro', 'wb') as out:
    writer(out, parsed_schema, records)

Reading Avro files is no picnic either:

# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('10M.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df = pd.DataFrame(avro_records)

You can learn more about Avro here:

CSV Files for Storage? Absolutely Not. Use Apache Avro Instead


Parquet

Apache Parquet is a data storage format designed for efficiency. Its columnar storage architecture lets you quickly skip data that isn’t relevant, so both queries and aggregations run faster, which translates into hardware savings.

The best news is – Pandas has full support for Parquet files.

Here’s the command for writing a Pandas DataFrame to a Parquet file:

df.to_parquet('10M.parquet')

And here’s the equivalent for reading:

df = pd.read_parquet('10M.parquet')

You can learn more about Parquet here:

CSV Files for Storage? No Thanks. There’s a Better Option


Pickle

You can use the pickle module to serialize objects and save them to a file. Likewise, you can then deserialize the serialized file to load them back when needed. Pickle has one major advantage over other formats – you can use it to store any Python object. One of the most widely used functionalities is saving machine learning models after the training is complete.
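The “any Python object” part is easy to demonstrate. Here’s a small sketch (names are made up for illustration) that round-trips a nested structure no tabular format could represent directly:

```python
import pickle

# An arbitrary nested object: a dict holding a list, a set, and a tuple
payload = {
    'weights': [0.1, 0.2, 0.7],
    'labels': {'cat', 'dog'},
    'meta': ('v1', 42),
}

with open('payload.pkl', 'wb') as f:
    pickle.dump(payload, f)

with open('payload.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == payload)  # True
```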

The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.

Here’s how to write a Pandas DataFrame to a Pickle file:

with open('10M.pkl', 'wb') as f:
    pickle.dump(df, f)

You’ll only have to change the file mode when reading a Pickle file:

with open('10M.pkl', 'rb') as f:
    df = pickle.load(f)

You can learn more about Pickle here:

Stop Using CSVs for Storage – Pickle is an 80 Times Faster Alternative


Feather

Feather is a data format for storing data frames. It’s designed around a simple premise – to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

You can use the feather library to work with Feather files in Python. It’s currently the fastest option available.

Here’s the command for saving Pandas DataFrames to a Feather file:

feather.write_dataframe(df, '10M.feather')

And here’s the command for reading:

df = feather.read_dataframe('10M.feather')

You can learn more about Feather here:

Stop Using CSVs for Storage – This File Format Is 150 Times Faster


Comparison time – Which data format should you use?

Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.

Below is a comparison between CSV and every other mentioned data format in write time. The goal was to save previously created 10Mx5 dataset locally:
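Your numbers will vary with hardware and library versions, but a write-time comparison like this is easy to reproduce with time.perf_counter. Here’s a minimal sketch on a smaller frame, timing just CSV against Pickle (the formats and file names are illustrative, not the article’s exact benchmark):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({c: np.random.rand(100_000) for c in 'abcde'})

def timed(label, fn):
    # Time a single write and report it in seconds
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f'{label}: {elapsed:.3f}s')
    return elapsed

csv_time = timed('CSV', lambda: df.to_csv('bench.csv', index=False))
pkl_time = timed('Pickle', lambda: df.to_pickle('bench.pkl'))
```

Even at this scale, the binary format should come out well ahead of the text-based one, since CSV writing has to convert every float to a string.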

Image 2 – Write time comparison in seconds (CSV: 34.7; ORC: 9.66; Avro: 9.58; Parquet: 2.06; Pickle: 0.5; Feather: 0.304) (image by author)

The differences are astronomical. Feather is roughly 114 times faster than CSV for storing an identical dataset. Even if you decide to go with something more compatible, such as Parquet, writes are still about 17 times faster.

Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:

Image 3 – Read time comparison in seconds (CSV: 3.83; ORC: 3.02, Avro: 27.6; Parquet: 1.23; Pickle: 0.193; Feather: 1.16) (image by author)

CSVs aren’t so terrible here. Apache Avro is the absolute worst because every record must be parsed into a Python dictionary first. Pickle is the fastest, so it looks like the most promising option if you’re working only in Python.

And finally, let’s compare the file sizes on the disk:

Image 4 – File size comparison in MB (CSV: 963.5; ORC: 379.5; Avro: 200.3; Parquet: 401.4; Pickle: 400; Feather: 400.1) (image by author)

Things don’t look good for CSVs. The file size reduction goes from 2.4x to 4.8x, depending on the file format.

To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any of the other formats mentioned.


What are your thoughts on these CSV alternatives? Which one(s) do you use if viewing and editing data on the fly isn’t required? Let me know in the comments below.
