fastavro.write

writer(fo: Union[IO, fastavro.io.json_encoder.AvroJSONEncoder], schema: Union[str, List[T], Dict[KT, VT]], records: Iterable[Any], codec: str = 'null', sync_interval: int = 16000, metadata: Optional[Dict[str, str]] = None, validator: bool = False, sync_marker: bytes = b'', codec_compression_level: Optional[int] = None, *, strict: bool = False, strict_allow_default: bool = False, disable_tuple_notation: bool = False)

Write records to fo (stream) according to schema

Parameters:
  • fo – Output stream
  • schema – Writer schema
  • records – Records to write. This is commonly a list of the dictionary representation of the records, but it can be any iterable
  • codec – Compression codec, can be ‘null’, ‘deflate’ or ‘snappy’ (if installed)
  • sync_interval – Approximate size, in bytes, of the data written between sync markers
  • metadata – Header metadata
  • validator – If true, validation will be done on the records
  • sync_marker – A byte string used as the avro sync marker. If not provided, a random byte string will be used.
  • codec_compression_level – Compression level to use with the specified codec (if the codec supports it)
  • strict – If set to True, an error will be raised if records do not contain exactly the same fields that the schema states
  • strict_allow_default – If set to True, an error will be raised if records do not contain exactly the same fields that the schema states unless it is a missing field that has a default value in the schema
  • disable_tuple_notation – If set to True, tuples will not be treated as a special case. Therefore, using a tuple to indicate the type of a record will not work

Example:

from fastavro import writer, parse_schema

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}
parsed_schema = parse_schema(schema)

records = [
    {'station': '011990-99999', 'temp': 0, 'time': 1433269388},
    {'station': '011990-99999', 'temp': 22, 'time': 1433270389},
    {'station': '011990-99999', 'temp': -11, 'time': 1433273379},
    {'station': '012650-99999', 'temp': 111, 'time': 1433275478},
]

with open('weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)

Because fo can be any file-like object, another common pattern is to write to an in-memory io.BytesIO buffer:

from io import BytesIO
from fastavro import writer

fo = BytesIO()
writer(fo, schema, records)

Given an existing Avro file, you can append to it by re-opening the file in a+b mode. If the file is opened in ab mode only, the writer cannot read the existing header information and an error will be raised. For example:

# Write initial records
with open('weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)

# Write some more records
# (more_records is any iterable of records that match the schema already in the file)
with open('weather.avro', 'a+b') as out:
    writer(out, None, more_records)

Note: When appending, any schema provided will be ignored since the schema in the avro file will be re-used. Therefore it is convenient to just use None as the schema.

schemaless_writer(fo: IO, schema: Union[str, List[T], Dict[KT, VT]], record: Any, *, strict: bool = False, strict_allow_default: bool = False, disable_tuple_notation: bool = False)

Write a single record without the schema or header information

Parameters:
  • fo – Output file
  • schema – Schema
  • record – Record to write
  • strict – If set to True, an error will be raised if the record does not contain exactly the same fields that the schema states
  • strict_allow_default – If set to True, an error will be raised if the record does not contain exactly the same fields that the schema states unless it is a missing field that has a default value in the schema
  • disable_tuple_notation – If set to True, tuples will not be treated as a special case. Therefore, using a tuple to indicate the type of a record will not work

Example:

import fastavro

# schema is a schema dictionary; record is a single dictionary that matches it
parsed_schema = fastavro.parse_schema(schema)
with open('file', 'wb') as fp:
    fastavro.schemaless_writer(fp, parsed_schema, record)

Note: The schemaless_writer can only write a single record.

Using the tuple notation to specify which branch of a union to take

Since this library uses plain dictionaries to represent a record, it is possible for that dictionary to fit the definition of two different records.

For example, given a dictionary like this:

{"name": "My Name"}

It would be valid against both of these records:

child_schema = {
    "name": "Child",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_color", "type": ["null", "string"]},
    ]
}

pet_schema = {
    "name": "Pet",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_toy", "type": ["null", "string"]},
    ]
}

This becomes a problem when a schema contains a union of these two similar records, because it is not clear which record the dictionary represents. For example, if you used the previous dictionary with the following schema, it would not be clear whether the record should be serialized as a Child or a Pet:

household_schema = {
    "name": "Household",
    "type": "record",
    "fields": [
        {"name": "address", "type": "string"},
        {
            "name": "family_members",
            "type": {
                "type": "array", "items": [
                    {
                        "name": "Child",
                        "type": "record",
                        "fields": [
                            {"name": "name", "type": "string"},
                            {"name": "favorite_color", "type": ["null", "string"]},
                        ]
                    }, {
                        "name": "Pet",
                        "type": "record",
                        "fields": [
                            {"name": "name", "type": "string"},
                            {"name": "favorite_toy", "type": ["null", "string"]},
                        ]
                    }
                ]
            }
        },
    ]
}

To resolve this, you can use a tuple notation where the first value of the tuple is the fully namespaced record name and the second value is the dictionary. For example:

records = [
    {
        "address": "123 Drive Street",
        "family_members": [
            ("Child", {"name": "Son"}),
            ("Child", {"name": "Daughter"}),
            ("Pet", {"name": "Dog"}),
        ]
    }
]

Using the record hint to specify which branch of a union to take

In addition to the tuple notation for specifying the name of a record, you can include a special -type attribute on a record to do the same thing (note that this attribute is -type, not type). So the example above, which looked like this:

records = [
    {
        "address": "123 Drive Street",
        "family_members": [
            ("Child", {"name": "Son"}),
            ("Child", {"name": "Daughter"}),
            ("Pet", {"name": "Dog"}),
        ]
    }
]

Would now look like this:

records = [
    {
        "address": "123 Drive Street",
        "family_members": [
            {"-type": "Child", "name": "Son"},
            {"-type": "Child", "name": "Daughter"},
            {"-type": "Pet", "name": "Dog"},
        ]
    }
]

Unlike the tuple notation, which can be used with any Avro type in a union, the -type hint can only be used with records. However, it is useful if you want a single record dictionary that can be used both inside and outside of unions.