Q&A: Should stores transform data?

Thor Whalen edited this page Oct 4, 2019 · 8 revisions

Bob: I've used QuickStore. Very nice. It's very simple to use! But I saw it uses pickle to store data. That's fine, but I don't like pickle because if my class definition changes, I can't get my data back!

Alice: Yes, pickle isn't appropriate. Not only for the reasons you mention though. If you need a different system/platform/language to use the data, you might want to use a more platform-independent serialization. Or if your data has standard formats (such as wav, mp3, etc. for audio), you may want to use those standard formats.

Bob: Yes, I see you have QuickJsonStore, QuickBinaryStore, QuickTextStore.

Alice: Yes, you can use those if they fit your context. They're there to get people started quickly. But know that, beyond offering easy-to-use storage objects, the goal of py2store is to give you the tools to make the store your specific context needs: objects that do whatever YOU need them to do, with that complexity hidden away. There are three main aspects you can change more or less independently:

  • Persistence: where/how the data is stored (files, a DB, etc.)
  • Serialization: i.e. "data transformation"
  • Indexing: or rather "key mapping" or "key transformation" (e.g. choosing whether keys will be absolute file paths, relative file paths, tuples, namedtuples, dictionaries (yes, dictionaries, I said it), or custom "key" objects you made for a specific purpose)
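To make the "key mapping" aspect concrete, here's a minimal sketch (the class and method names are my own, for illustration; py2store offers more general tooling for this): a wrapper that lets you address a string-keyed store with tuple keys, translating between the two on the fly.

```python
from collections.abc import MutableMapping

class TupleKeyStore(MutableMapping):
    """Wrap a str-keyed store so users can address it with tuple keys.
    (Illustrative sketch, not py2store's actual API.)"""
    def __init__(self, store):
        self._store = store  # underlying store with string keys

    def _id_of_key(self, k):    # user-facing tuple -> internal string key
        return '/'.join(k)

    def _key_of_id(self, _id):  # internal string key -> user-facing tuple
        return tuple(_id.split('/'))

    def __getitem__(self, k):
        return self._store[self._id_of_key(k)]

    def __setitem__(self, k, v):
        self._store[self._id_of_key(k)] = v

    def __delitem__(self, k):
        del self._store[self._id_of_key(k)]

    def __iter__(self):
        return map(self._key_of_id, self._store)

    def __len__(self):
        return len(self._store)

backend = {}  # could be any str-keyed MutableMapping (e.g. a file store)
s = TupleKeyStore(backend)
s[('audio', 'track1')] = b'some data'
```

Note that the key-mapping logic lives in two small methods, so changing how users address the store never touches the persistence code.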

Alice: You can use JsonStore as is, if the data you're storing is a python dict whose values are all already jsonizable. If either of those conditions isn't met, you write code to bring it to that level, and the store object has a place for that code (_obj_to_data), so you can put it where it belongs: attached to (but not entangled with) the store object. And above all, removed from where it DEFINITELY DOES NOT BELONG: the business logic code.
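As a sketch of that idea (method names follow the text's _obj_to_data; check py2store's docs for its actual hook names), here's a store whose serialization lives in overridable methods, with a subclass adapting it to a type that isn't jsonizable as-is:

```python
import json
from collections.abc import MutableMapping

class JsonDictStore(MutableMapping):
    """In-memory store that keeps values as JSON strings.
    Override _obj_to_data / _data_to_obj to adapt your own objects."""
    def __init__(self):
        self._d = {}

    def _obj_to_data(self, obj):   # serialization hook
        return json.dumps(obj)

    def _data_to_obj(self, data):  # deserialization hook
        return json.loads(data)

    def __setitem__(self, k, v):
        self._d[k] = self._obj_to_data(v)

    def __getitem__(self, k):
        return self._data_to_obj(self._d[k])

    def __delitem__(self, k):
        del self._d[k]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

class ComplexStore(JsonDictStore):
    """Complex numbers aren't jsonizable, so we bring them to that level here,
    not in the business logic."""
    def _obj_to_data(self, z):
        return json.dumps([z.real, z.imag])

    def _data_to_obj(self, data):
        re, im = json.loads(data)
        return complex(re, im)

s = ComplexStore()
s['p'] = 1 + 2j
```

The business logic only ever sees complex numbers going in and coming out; the JSON plumbing stays attached to the store.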

Bob: I see. But storage should only store and read, not transform the data!

Alice: When does "storing data" NOT transform data?

Bob: I mean not to decode, transcode or compress!

Alice: Storage transforms data. Almost always. But some of these transformations are taken for granted, because they so often attach to (i.e. co-occur with) the raw process of storing. And that is precisely why I emphasize this: attached to (but not entangled with) the store object.

Bob: I am not talking about what happens at the byte level. I am talking about whether you get the data back as you gave it to storage.

Alice: What does it mean to "get back THE data AS you gave it to storage"? There's an unspecified definition of equality there. Note that I am talking about what happens at the byte level, but also about what happens at any other level. Why would something be fine to do at the byte level, but not fine to do at a higher level? Are you saying what the file system does is fine, but what ORMs do is not?

I see this as a great (but common) conceptual limitation. Some technique is perceived as okay, or even beneficial, for one thing, but not for another.

Bob: Well, yeah: you don't cut a tree with a kitchen knife, or a banana with an axe. Different tools apply to different situations.

Alice: No doubt. But it's easy to explain things away with such a sound bite. This reasoning, if applied without really looking at the differences and similarities of each context, can lead to serious limitations. Machine language people resisted assembly. Assembly people resisted C. Nowadays, such resistance sounds ridiculous. More contemporarily, you'll find C programmers who sneer at anything object oriented, or anything that is not typed, or not static. And if you're modern and accept dynamic and untyped languages, you might sneer at frameworks because they "do too much, too easily, and I don't have enough control".

Bob: But there is a difference between "machine language vs assembly" and "C vs interpreted languages".

Alice: No doubt. And these differences make a difference. All I'm pointing out is that there's also a lot more in common than meets the eye at first, and that these similarities can be harnessed if detected.

I thank the engineers of recent computing history for making programming what it is today. But I'm also of the opinion that coding carries too heavy a legacy from its close-to-hardware engineering origins. It limits what coding could become. The term "coding" says it all. I'd like to see something resembling "telling a machine what you'd like it to do, or how you'd like it to behave", possibly even "conversing (two-way) with a machine to come up with a desired behavior".

But back to our subject: "Should stores transform data?" I say "storage always does that". You say "you need to get the same thing back". I say "you never (or rarely) do". Let's say I have x = some_byte_array and I store it. Then later, I read it out again: y = the_retrieved_byte_array. What's the same, what's different? I can give you the answer, but you're going to say "but that doesn't matter", and you'll be right. Are x and y the same?

  • No, they don't have the same name: But that doesn't matter. But let's say they had the same name... Are they the same now?
  • No, they don't have the same identity (what Python's built-in id() reports for every object). But that doesn't matter. But let's say they had the same identity... Are they the same now?
  • No, the place in memory where the data is actually stored isn't the same as it was when you stored it. But that doesn't matter.

It doesn't matter because all that matters is that I can operate on it as I need to. "Same" is what mathematicians call an "equivalence relation", and there are many functional/pragmatic ways to define what it means.

To illustrate, do you have a problem with this (try it out!):

my_bytearray = bytearray(b'bytes')
# store this data
with open('test', 'wb') as fp:
    fp.write(my_bytearray)
# read it back
with open('test', 'rb') as fp:
    the_bytearray_i_stored = fp.read()
# the types are different
assert type(my_bytearray) == bytearray
assert type(the_bytearray_i_stored) == bytes
# yet python considers them equal
assert my_bytearray == the_bytearray_i_stored
# is python wrong?!?! (perhaps a C fanatic might say so, but that's another story...)

Personally... I don't care what anyone says about bytes or what they learned in school. I don't work with bytes (directly), and most people don't. I work with information, encapsulated in language constructs (objects, functions, etc.).

At the end of the day, it's your choice. You can find solace in the Zen of Python and invoke "explicit is better than implicit", and with faith and zeal, write this:

data_to_store = my_serializer(data)
store[here] = data_to_store

and when you need the data, write this

data_from_store = store[here]
data = my_deserializer(data_from_store)

And if you remember to serializer-and-write and read-and-deserialize every time, good for you. But personally, I prefer to do this:

my_store = mk_store(store, my_serializer, my_deserializer)

and then just do this

my_store[here] = data

and that

data = my_store[here]

It's DRY (Don't Repeat Yourself), it's COC (Convention Over Configuration), and it's still SOC (Separation Of Concerns): serialization and storage were only coupled when making my_store.
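For the curious, a minimal mk_store along those lines could look like this (a sketch under my own assumptions; py2store provides richer wrapping tools):

```python
import json
from collections.abc import MutableMapping

def mk_store(store, serializer, deserializer):
    """Wrap a store so values are (de)serialized transparently on the way
    in and out. (Illustrative sketch, not py2store's actual API.)"""
    class _WrappedStore(MutableMapping):
        def __getitem__(self, k):
            return deserializer(store[k])

        def __setitem__(self, k, v):
            store[k] = serializer(v)

        def __delitem__(self, k):
            del store[k]

        def __iter__(self):
            return iter(store)

        def __len__(self):
            return len(store)
    return _WrappedStore()

# usage: a plain dict as the backend, JSON as the serialization
backend = {}
my_store = mk_store(backend, json.dumps, json.loads)
my_store['here'] = {'a': [1, 2, 3]}
data = my_store['here']
```

The caller reads and writes python objects; only the wrapped backend ever sees the serialized form.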