fastavro¶

The current Python avro package is dog slow.

On a test case of about 10K records, it takes about 14sec to iterate over all of them. In comparison the JAVA avro SDK does it in about 1.9sec.

fastavro is an alternative implementation that is much faster. It iterates over the same 10K records in 2.9sec, and if you use it with PyPy it’ll do it in 1.5sec (to be fair, the JAVA benchmark is doing some extra JSON encoding/decoding).

If the optional C extension (generated by Cython) is available, then fastavro will be even faster. For the same 10K records it’ll run in about 1.7sec.

Supported Features¶

File Writer
File Reader (iterating via records or blocks)
Schemaless Writer
Schemaless Reader
JSON Writer
JSON Reader
Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)
Schema resolution
Aliases
Logical Types
Parsing schemas into the canonical form
Schema fingerprinting

Missing Features¶

Anything involving Avro’s RPC features

Example¶

from fastavro import writer, reader, parse_schema

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}
parsed_schema = parse_schema(schema)

# 'records' can be an iterable (including generator)
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
]

# Writing
with open('weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)

# Reading
with open('weather.avro', 'rb') as fo:
    for record in reader(fo):
        print(record)

Documentation¶

fastavro¶

fastavro.io¶

fastavro.io

fastavro.repository¶

fastavro.repository