I do not know with what weapons World War III will be fought, but World War IV will be fought with sticks and stones
— someone, probably
This is a post-modern Python library for parsing/validating unstructured data, such as JSON returned by an HTTP server or a YAML configuration.
- Make sure you're using Python >= 3.9
- Copy the `simpleparser.py` file from this repository into your project
💡 Make sure to read the tutorial first. But I'm not going to stop you :^)
What gives?
This library stems from my general dissatisfaction with popular existing Python solutions to the very common problem of parsing unstructured data.
- Parsing type annotations is... complicated. Python doesn't provide a nice framework to do that, and it's generally a mess. How do you automatically generate a parser for a generic class, respecting its variance? I'll sleep better at night without such knowledge.
- ...but type checkers are kinda nice. Unfortunately, in Python there's no way to make a nice declarative tool like zod from TypeScript where types are inferred from the schema, not the other way around.
- Implicit coercions. That's a bad default. The good default is rejecting invalid data.
  I want a string, but you send me an integer. I am not going to guess what you meant; there's something wrong on your side.
  If you want the rules to be more relaxed, specify where and how lax you want to be explicitly.
  Explicit is better than implicit.
  Errors should never pass silently.
  Unless explicitly silenced.
- Simple cases and complex cases. It's easy to optimize for the simple case of needing to map a JSON with 5 fields to a `dataclass` of 5 fields with the same names. However, the real world is often more complicated.

  Data can be more complicated. Maybe your data uses `camelCase` for names. Or maybe `PascalCase`. If it's using `PascalCase`, should `HTTPClient` be `h_t_t_p_client` or `http_client`, and what about `IAmAMD` ('I am a M.D.')?

  The data might contain flat data that you want to be nested. It's pretty reasonable to group `{"x", "y"}` to a single `pos` attribute. (Or vice versa: flatten something that's nested in the raw representation.)

  There's no standard way to represent tagged unions (sum types, variant record, whatever) in JSON/YAML. In fact, Telegram has at least two ways of doing so. Some developers apparently don't believe in tagged unions, and instead model their data as a record with 30 optional fields 🤦 . The Rust library `serde` has some solutions to this, but I haven't seen anything similar in Python.
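The "reject by default, relax explicitly" rule from the list above can be sketched in a few lines. The names here (`ParseError`, `is_int_strict`, `is_int_or_numeric_str`) are made up for the illustration and are not claimed to match what `simpleparser.py` actually contains.

```python
class ParseError(Exception):
    """Hypothetical stand-in for the library's error type."""


def is_int_strict(source: object) -> int:
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(source, bool) or not isinstance(source, int):
        raise ParseError(f"expected an integer, got {type(source)}")
    return source


def is_int_or_numeric_str(source: object) -> int:
    # Opt-in leniency: the caller picks this parser only where "42" is acceptable.
    if isinstance(source, str):
        try:
            return int(source)
        except ValueError:
            raise ParseError(f"expected a numeric string, got {source!r}") from None
    return is_int_strict(source)
```

The strict parser stays the default; the lax one has to be chosen by name at the exact field where leniency is wanted.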
This is the kind of philosophy I like:
Here's the recipe to solve 90% of your problems. It's a bit more wordy than just slapping on a decorator or inheriting from a base class, but it's simple code. If you want something more complicated, use the Turing-complete language we already have to express your custom bits.
So this project is not much of a library, it's mostly a suggestion to take an alternative approach to parsing untrusted data using simpler tools that you already have.
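As a rough illustration of "simpler tools that you already have", here is a sketch that renames a `camelCase` key and groups flat `x`/`y` fields into a nested `pos` using nothing but plain functions. `Player`, `Position`, `_field` and the `displayName` key are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


class ParseError(Exception):
    """Hypothetical stand-in for the library's error type."""


@dataclass(frozen=True)
class Position:
    x: int
    y: int


@dataclass(frozen=True)
class Player:
    display_name: str
    pos: Position


def _field(source: dict, key: str, expected_type: type):
    # Look up a required key and check its type, without any coercion.
    if key not in source:
        raise ParseError(f"Key {key!r} not found")
    value = source[key]
    if isinstance(value, bool) or not isinstance(value, expected_type):
        raise ParseError(f"at key {key!r}: expected {expected_type.__name__}, got {type(value)}")
    return value


def is_player(source: object) -> Player:
    if not isinstance(source, dict):
        raise ParseError(f"expected a dict, got {type(source)}")
    return Player(
        # The raw key is camelCase, the attribute is snake_case.
        display_name=_field(source, "displayName", str),
        # Flat "x" and "y" in the raw data become one nested attribute.
        pos=Position(x=_field(source, "x", int), y=_field(source, "y", int)),
    )
```

Because the mapping is ordinary code, there is no naming convention to configure: the raw key and the attribute name are simply written next to each other.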
For an introduction, we are going to implement a module that works with a small part of Telegram's Bot API, namely the `Update` object.
First, we need to decide how to model this thing. For our humble bot, we will only need two update types:
- `message`: "New incoming message of any kind - text, photo, sticker, etc."
- `edited_message`: "New version of a message that is known to the bot and was edited"
Would this be a good model?
@dataclass(frozen=True)
class Update:
    update_id: int
    message: Union[Message, None] = None
    edited_message: Union[Message, None] = None
I don't think that's going to serve us well. It's going to be hard to work with, because there are invalid and otherwise awkward states this `Update` can be in.
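To make the "awkward states" concrete, here is a small sketch of that flat model in use. `Message` is reduced to a single field and `describe` is a hypothetical consumer; both exist only for the illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Message:
    text: str  # reduced stand-in for the real message model


@dataclass(frozen=True)
class Update:
    update_id: int
    message: Union[Message, None] = None
    edited_message: Union[Message, None] = None


# Both of these construct without complaint, yet neither describes a real update.
empty = Update(update_id=1)
both = Update(update_id=2, message=Message("hi"), edited_message=Message("hi!"))


def describe(update: Update) -> str:
    # Every consumer has to repeat these None checks and decide
    # what the "impossible" combinations are supposed to mean.
    if update.message is not None and update.edited_message is not None:
        return "two events at once?"
    if update.message is not None:
        return "new message"
    if update.edited_message is not None:
        return "edited message"
    return "no event at all?"
```

Two of the four branches in `describe` exist only because the type permits states the domain does not.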
I would use something like this as our model:
@dataclass(frozen=True)
class NewMessage:
    message: Message

@dataclass(frozen=True)
class MessageEdited:
    message: Message

@dataclass(frozen=True)
class UnsupportedUpdate:
    raw: object

UpdateBody = Union[
    NewMessage,
    MessageEdited,
    UnsupportedUpdate,
]

@dataclass(frozen=True)
class Update:
    update_id: int
    body: UpdateBody
This describes our domain pretty well:
- we don't support every possible update (hence `UnsupportedUpdate`)
- there is exactly one "event" in an update
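One consequence of this model is that consuming code becomes a plain dispatch on the body's type. The sketch below repeats the classes with a reduced `Message` so that it runs on its own; `summarize` is a hypothetical consumer, and `isinstance` is used instead of `match` to stay compatible with Python 3.9.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Message:
    text: str  # reduced stand-in for the real message model


@dataclass(frozen=True)
class NewMessage:
    message: Message


@dataclass(frozen=True)
class MessageEdited:
    message: Message


@dataclass(frozen=True)
class UnsupportedUpdate:
    raw: object


UpdateBody = Union[NewMessage, MessageEdited, UnsupportedUpdate]


def summarize(body: UpdateBody) -> str:
    if isinstance(body, NewMessage):
        return f"new: {body.message.text}"
    if isinstance(body, MessageEdited):
        return f"edited: {body.message.text}"
    # Only UnsupportedUpdate is left; there is no "both" or "neither" case.
    return "ignored an unsupported update"
```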
For now, we'll have a very simple model for a message, because we only need a few things from it:
from __future__ import annotations

from typing import Union
from datetime import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    message_id: int
    sent_at: datetime
    author: Union[User, Chat]
    text: Union[str, None] = None

@dataclass(frozen=True)
class User:
    user_id: int
    first_name: str
    username: Union[str, None] = None

@dataclass(frozen=True)
class Chat:
    chat_id: int
    title: str
And here's how you parse a `Message`:
from simpleparser import (
    is_any_of,
    is_int,
    is_str,
    has_field,
    has_optional_field,
    ParseError,
    Verbose,
)

def is_message(source: object) -> Message:
    return Message(
        message_id=has_field("message_id", is_int)(source),
        sent_at=has_field("date", _is_timestamp)(source),
        author=is_any_of(
            has_field("sender_chat", _is_chat),
            has_field("from", _is_user),
        )(source),
        text=has_optional_field("text", is_str)(source),
    )

def _is_chat(source: object) -> Chat:
    return Chat(
        chat_id=has_field("id", is_int)(source),
        # A single required field needs no is_any_of wrapper.
        title=has_field("title", is_str)(source),
    )

def _is_user(source: object) -> User:
    return User(
        user_id=has_field("id", is_int)(source),
        first_name=has_field("first_name", is_str)(source),
        username=has_optional_field("username", is_str)(source),
    )

def _is_timestamp(source: object) -> datetime:
    timestamp = is_int(source)
    try:
        return datetime.fromtimestamp(timestamp)
    # fromtimestamp can also raise OSError when the platform's localtime() fails
    except (ValueError, OverflowError, OSError):
        raise ParseError(Verbose("Timestamp is too big"))
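If you are wondering what a combinator like `has_field` does under the hood, the idea fits in a few lines. This is a guess at the general shape, not the code from `simpleparser.py`: the `ParseError` here carries a plain string instead of a structured error, and treating a missing optional key as `None` is an assumption based on the outputs shown in this tutorial.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ParseError(Exception):
    """Hypothetical stand-in; carries a plain message instead of a structured error."""


def is_str(source: object) -> str:
    if not isinstance(source, str):
        raise ParseError(f"expected a string, got {type(source)}")
    return source


def has_field(key: str, parser: Callable[[object], T]) -> Callable[[object], T]:
    def parse(source: object) -> T:
        if not isinstance(source, dict):
            raise ParseError(f"expected a dict, got {type(source)}")
        if key not in source:
            raise ParseError(f"Key {key!r} not found")
        try:
            return parser(source[key])
        except ParseError as exc:
            # Prefix the path so that nested failures stay locatable.
            raise ParseError(f"at key {key!r}: {exc}") from None

    return parse


def has_optional_field(key: str, parser: Callable[[object], T]) -> Callable[[object], Optional[T]]:
    def parse(source: object) -> Optional[T]:
        # Assumption: a missing key yields None; a present key must still be valid.
        if isinstance(source, dict) and key not in source:
            return None
        return has_field(key, parser)(source)

    return parse
```

A combinator takes a parser and returns a parser, which is why calls like `has_field("from", has_field("first_name", is_str))` nest without any extra machinery.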
Let's try our parser on some example messages.
message_from_chat = {
"message_id": 100,
"date": 1676769964,
"sender_chat": {"id": 666, "title": "Some Chat"},
}
print(is_message(message_from_chat))
>>> Message(message_id=100, sent_at=datetime.datetime(2023, 2, 19, 4, 26, 4), author=Chat(chat_id=666, title='Some Chat'), text=None)
message_from_user = {
"message_id": 25045,
"date": 1676769966,
"from": {"id": 11111, "first_name": "Bob"},
"text": "Hello there!",
}
print(is_message(message_from_user))
>>> Message(message_id=25045, sent_at=datetime.datetime(2023, 2, 19, 4, 26, 6), author=User(user_id=11111, first_name='Bob', username=None), text='Hello there!')
bad_message = {
"message_id": 25045,
"date": 1676769966,
"from": {"id": 11111, "first_name": 42},
"text": "Hello there!",
}
is_message(bad_message)
...
Traceback (most recent call last):
File "/.../tutorial.py", line 95, in <module>
is_message(bad_message)
File "/.../tutorial.py", line 43, in is_message
author=is_any_of(
^^^^^^^^^^
File "/.../simpleparser.py", line 289, in _is_any_of
raise ParseError(MultipleErrors(tuple(errors)))
simpleparser.ParseError: all possibilities failed:
- at key 'sender_chat': Key 'sender_chat' not found
- at key 'from': at key 'first_name': expected a string, got <class 'int'>
from simpleparser import map_parser, is_always

def is_update_body(source: object) -> UpdateBody:
    return is_any_of(
        map_parser(NewMessage, has_field("message", is_message)),
        map_parser(MessageEdited, has_field("message_edited", is_message)),
        is_always(UnsupportedUpdate(source)),
    )(source)
Hm... actually, we're not doing anything with the source besides passing it to other parsers. Let's refactor our code slightly:
from simpleparser import is_anything

is_update_body = is_any_of(
    map_parser(NewMessage, has_field("message", is_message)),
    map_parser(MessageEdited, has_field("message_edited", is_message)),
    map_parser(UnsupportedUpdate, is_anything),
)
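The refactoring works because a parser is just a function, so combining parsers yields another function that can be bound to a name directly. Below is a minimal sketch of how `map_parser`, `is_anything` and `is_any_of` could behave; the bodies are assumptions made for illustration, with a string-based `ParseError` standing in for the real one.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


class ParseError(Exception):
    """Hypothetical stand-in for the library's error type."""


def is_int(source: object) -> int:
    if isinstance(source, bool) or not isinstance(source, int):
        raise ParseError(f"expected integer, got {type(source)}")
    return source


def is_anything(source: object) -> object:
    # Accepts every input unchanged.
    return source


def map_parser(func: Callable[[A], B], parser: Callable[[object], A]) -> Callable[[object], B]:
    def parse(source: object) -> B:
        # Parse first, then transform the successful result.
        return func(parser(source))

    return parse


def is_any_of(*parsers: Callable[[object], object]) -> Callable[[object], object]:
    def parse(source: object) -> object:
        errors = []
        for parser in parsers:
            try:
                return parser(source)  # the first success wins
            except ParseError as exc:
                errors.append(str(exc))
        raise ParseError("all possibilities failed: " + "; ".join(errors))

    return parse


# No `def` needed: the combination is already a parser.
is_int_or_raw = is_any_of(
    map_parser(lambda n: ("int", n), is_int),
    map_parser(lambda raw: ("raw", raw), is_anything),
)
```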
Better error messages
This `is_any_of` is useful when you have few options, but the error message will not be very clear with 10 variants. We can give each "branch" a name:
from simpleparser import is_any_of_described

is_update_body = is_any_of_described(
    (
        "New message",
        map_parser(NewMessage, has_field("message", is_message)),
    ),
    (
        "Message edited",
        map_parser(MessageEdited, has_field("message_edited", is_message)),
    ),
    (
        "Unsupported update",
        map_parser(UnsupportedUpdate, is_anything),
    ),
)

def is_update(source: object) -> Update:
    return Update(
        update_id=has_field("update_id", is_int)(source),
        body=is_update_body(source),
    )
Let's see our parser in action:
>>> is_update({
... "update_id": 257,
... "message": {
... "message_id": 100,
... "date": 1676769964,
... "sender_chat": {"id": 666, "title": "Some Chat"},
... },
... })
...
Update(
update_id=257,
body=NewMessage(
message=Message(
message_id=100,
sent_at=datetime.datetime(2023, 2, 19, 4, 26, 4),
author=Chat(chat_id=666, title='Some Chat'),
text=None,
),
),
)
>>> is_update({
... "update_id": 258,
... "unknown_update": {
... "duckies": 666,
... },
... })
...
Update(update_id=258, body=UnsupportedUpdate(raw={'update_id': 258, 'unknown_update': {'duckies': 666}}))
>>> is_update({"update_id": "yes!"})
Traceback (most recent call last):
...
simpleparser.ParseError: at key 'update_id': expected integer, got <class 'str'>
What we ended up with isn't bad, but there are some issues, especially as we're going to scale to accept more updates:
- Performance. The way `is_any_of` works is: it tries all the given options one by one until it finds an option that matches. This makes it very flexible, but it also means that if there are 100 options, the parser will potentially have to go through all 100 options on every message.

  In our case, we can optimize this because we know what update we want to parse based on the second key present in the `Update` object.
- Error handling and unknown updates. What happens if Telegram gives us a `message_edited` update with a body that doesn't match our expectations? Right now the parser will classify that as an `UnsupportedUpdate`, and we'll probably ignore it. That's very bad! We want to get an error in that case.
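The second issue is easy to reproduce in isolation. In this sketch, `parse_edit`, `parse_unsupported` and `first_match` are hypothetical stand-ins that mimic the try-in-order strategy described above; they are not the library's functions.

```python
class ParseError(Exception):
    """Hypothetical stand-in for the library's error type."""


def parse_edit(source: object) -> tuple:
    # Stand-in for the "message_edited" branch.
    if not isinstance(source, dict) or "message_edited" not in source:
        raise ParseError("Key 'message_edited' not found")
    payload = source["message_edited"]
    if not isinstance(payload, dict) or not isinstance(payload.get("message_id"), int):
        raise ParseError("at key 'message_edited': malformed message")
    return ("edited", payload["message_id"])


def parse_unsupported(source: object) -> tuple:
    # Catch-all branch: accepts anything.
    return ("unsupported", source)


def first_match(source: object, *parsers):
    # Try each option in order and return the first success.
    for parser in parsers:
        try:
            return parser(source)
        except ParseError:
            continue  # the reason for the failure is dropped here
    raise ParseError("all possibilities failed")


# The key is recognised, but the payload is broken.
malformed = {"update_id": 1, "message_edited": {"message_id": "oops"}}
kind, _ = first_match(malformed, parse_edit, parse_unsupported)
```

Here `kind` ends up as `"unsupported"`: the catch-all branch turns a genuine parsing failure into something that looks like a harmless unknown update.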
Here's one way you can solve the second problem:
from simpleparser import is_dict

def is_update_body(source: object) -> UpdateBody:
    raw_dict = is_dict(source)
    if "message" in raw_dict:
        return NewMessage(is_message(raw_dict["message"]))
    elif "message_edited" in raw_dict:
        return MessageEdited(is_message(raw_dict["message_edited"]))
    else:
        return UnsupportedUpdate(raw_dict)
This is still not perfect: we're going to accept updates which have both a `message` and a `message_edited`. And we still have a time complexity of `O(update_kinds)`.
We can solve both of these problems with a dictionary lookup:
from simpleparser import Expectation

_known_events = {
    "message": map_parser(NewMessage, is_message),
    "message_edited": map_parser(MessageEdited, is_message),
}

def is_update_body(source: object) -> UpdateBody:
    raw_dict = is_dict(source)
    keys = raw_dict.keys() - {"update_id"}
    if len(keys) != 1:
        raise ParseError(Expectation(expected="one key", actual=str(list(keys))))
    [event_type] = keys
    if event_type in _known_events:
        return _known_events[event_type](raw_dict[event_type])
    else:
        return UnsupportedUpdate(raw_dict)
Do we want to raise an exception on an invalid update from Telegram?
When we poll Telegram, we must specify what update ID we want the updates to start with.
When we get update `#100`, we tell Telegram to send updates starting with `#101` next time.
So our "main loop" will look something like this:
import time

import requests

# api_root, logger and process_update are assumed to be defined elsewhere.
last_update = 0
while True:
    # requests takes query-string arguments via `params`
    response = requests.get(f"{api_root}/getUpdates", params={"offset": last_update, "timeout": 2}).json()
    if not response["ok"]:
        logger.error(f"Oh no! We're not OK: {response!r}")
        time.sleep(5)
        continue
    raw_updates = response["result"]
    for raw_update in raw_updates:
        try:
            update = is_update(raw_update)
        except ParseError as exc:
            logger.error("Wow, telegram sent us something stupid. ", exc_info=exc)
        else:
            last_update = max(last_update, update.update_id + 1)
            process_update(update)
Do you see the problem? If we get an invalid update, we ignore its ID! If that was the only update in a while, on the next iteration we're going to ask for the same update again, with no delay in between. Telegram will be very mad and will put us in the dreaded 429 Jail.
Another point is that we might want to still process updates that weren't quite right. Perhaps we want to keep track of update statistics in `process_update`, or something else.
+ from simpleparser import ErrorValue
+
+ @dataclass(frozen=True)
+ class InvalidUpdateReceived:
+     error: ErrorValue
+     raw: object

  UpdateBody = Union[
      NewMessage,
      MessageEdited,
      UnsupportedUpdate,
+     InvalidUpdateReceived,
  ]
An `ErrorValue` is a structured description of what went wrong during parsing and where.
Source code for `ErrorValue`
@dataclass(frozen=True)
class Verbose:
    message: str

@dataclass(frozen=True)
class Expectation:
    expected: str
    actual: str

@dataclass(frozen=True)
class MultipleErrors:
    errors: tuple[ErrorValue, ...]

    def __post_init__(self) -> None:
        if len(self.errors) < 2:
            raise RuntimeError("Expected at least two errors for `MultipleErrors`")

@dataclass(frozen=True)
class AtIndex:
    index: int
    error: ErrorValue

@dataclass(frozen=True)
class AtKey:
    key: str
    error: ErrorValue

@dataclass
class Note:
    note: str
    original: ErrorValue

ErrorValue = Union[
    Verbose,
    Expectation,
    MultipleErrors,
    AtIndex,
    AtKey,
    Note,
]
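Since an `ErrorValue` is plain data, turning it into a message is a small recursive function. The `render` function below is a hypothetical sketch, not the library's formatter: it handles only four of the six variants (redefined here so the block runs on its own), and its output format is modelled on the error messages shown earlier in this tutorial.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Verbose:
    message: str


@dataclass(frozen=True)
class Expectation:
    expected: str
    actual: str


@dataclass(frozen=True)
class AtKey:
    key: str
    error: "ErrorValue"


@dataclass(frozen=True)
class MultipleErrors:
    errors: "tuple[ErrorValue, ...]"


ErrorValue = Union[Verbose, Expectation, AtKey, MultipleErrors]


def render(error: ErrorValue) -> str:
    if isinstance(error, Verbose):
        return error.message
    if isinstance(error, Expectation):
        return f"expected {error.expected}, got {error.actual}"
    if isinstance(error, AtKey):
        # Each AtKey layer adds one step of the path to the failure.
        return f"at key {error.key!r}: {render(error.error)}"
    # MultipleErrors: one line per failed alternative.
    lines = [f" - {render(inner)}" for inner in error.errors]
    return "all possibilities failed:\n" + "\n".join(lines)
```

The same tree could just as well be rendered as JSON for an API response or attached to a log record, which is the practical benefit of keeping the error structured until the last moment.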
Here's how we can adjust the `is_update_body` parser to accommodate this design:
_known_events = {
    "message": map_parser(NewMessage, is_message),
    "message_edited": map_parser(MessageEdited, is_message),
}

def is_update_body(source: object) -> UpdateBody:
    raw_dict = is_dict(source)
    keys = raw_dict.keys() - {"update_id"}
    if len(keys) != 1:
        error = Expectation(expected="one key", actual=str(list(keys)))
        return InvalidUpdateReceived(error, source)
    [event_type] = keys
    event_payload = raw_dict[event_type]
    if event_type in _known_events:
        try:
            return _known_events[event_type](event_payload)
        except ParseError as exc:
            return InvalidUpdateReceived(exc.error, event_payload)
    else:
        return UnsupportedUpdate(raw_dict)
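With a body parser that reports problems as values, the polling loop can advance the offset for every update that has a readable ID. A minimal sketch of that bookkeeping follows; `handle_batch` is a hypothetical helper, and the classes are reduced stand-ins (the error is a plain string here) so that the block runs without the network or the library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class InvalidUpdateReceived:
    error: str  # an ErrorValue in the tutorial's code; a plain string in this sketch
    raw: object


@dataclass(frozen=True)
class Update:
    update_id: int
    body: object  # stand-in for UpdateBody


def handle_batch(
    last_update: int,
    updates: Iterable[Update],
    process_update: Callable[[Update], None],
) -> int:
    """Process parsed updates and return the offset to request next."""
    for update in updates:
        # The offset moves forward whether or not the body was valid,
        # so a broken update is never requested a second time.
        last_update = max(last_update, update.update_id + 1)
        process_update(update)
    return last_update
```

An update whose `update_id` itself is missing or malformed would still make `is_update` raise, so the `try`/`except` around it in the main loop remains useful for that case.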
A short recap on `simpleparser`:
- A parser is a function that accepts an object and either returns its parsed version, or raises `ParseError`
- To parse a dictionary with known fields, use `has_field`
- If the field can be missing, use `has_optional_field` instead
- To try several options in order, use `is_any_of`
- To adjust the output of an already existing parser, use `map_parser`
- To accept any object at all, use `is_anything`
- If you don't see how to combine existing parsers together in a nice way, write your own from scratch.