Stores user data across applications

Eli Ribble 090f55e225 Temporary WIP for adding commands		2023-12-12 13:41:05 -07:00
datajack	Add the abliity to send a schema write request	2023-12-12 13:41:05 -07:00
.gitignore	Add gitignore for Python directories	2023-12-12 13:40:40 -07:00
LICENSE	Initial commit	2023-12-11 21:37:58 -07:00
README.md	Working protobuf passing.	2023-12-12 13:41:05 -07:00
client.py	Add the abliity to send a schema write request	2023-12-12 13:41:05 -07:00
control.proto	Temporary WIP for adding commands	2023-12-12 13:41:05 -07:00
main.go	Working protobuf passing.	2023-12-12 13:41:05 -07:00
server.py	Temporary WIP for adding commands	2023-12-12 13:41:05 -07:00

README.md

gibtar

The point: can we create something that is database-like without being a traditional database? Here's what we are looking for:

Applications are given access to the fortress on a limited basis. Think "I'm using your webapp, but I keep the data"
Single user, but multi-application/agent by nature.
- My fortress is mine. Just me. One human
- I can have an unlimited number of computational agents read/write the data.
- When I share data with another human, you get a copy, not a link.
No usage of a query language like SQL. All access is programmatic.
ACID compliant transactions
Browser-friendly API
- non-password crypto-auth
- AJAX/HTML/fetch friendly requests
- pipelining
- JSON, probably >:P
Data is never deleted. Time-travel is built-in
- Kinda like Datomic?
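
To make "data is never deleted, time-travel is built-in" concrete, here is a minimal sketch of an append-only store where every write adds a version and reads can ask for the value as of an earlier transaction. The class and method names (`VersionedStore`, `write`, `read`, `as_of`) are made up for illustration and are not part of this repo; a real implementation would also need persistence and transactions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class VersionedStore:
    """Append-only key/value store: writes add versions, nothing is removed."""
    # key -> list of (transaction number, writing app, value), oldest first
    history: dict = field(default_factory=dict)
    tx: int = 0

    def write(self, app: str, key: str, value: Any) -> int:
        """Record a new version of `key` and return its transaction number."""
        self.tx += 1
        self.history.setdefault(key, []).append((self.tx, app, value))
        return self.tx

    def read(self, key: str, as_of: Optional[int] = None) -> Any:
        """Return the latest value of `key`, or its value as of transaction `as_of`."""
        for tx, _app, value in reversed(self.history.get(key, [])):
            if as_of is None or tx <= as_of:
                return value
        return None  # key did not exist (yet)


store = VersionedStore()
t1 = store.write("app-a", "todo/1", {"title": "buy milk"})
t2 = store.write("app-b", "todo/1", {"title": "buy oat milk"})
current = store.read("todo/1")            # latest version
past = store.read("todo/1", as_of=t1)     # time-travel to before app-b's write
```

Because each version records the writing app, the same history also gives us the per-application audit trail for free.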

Incantations

protoc --python_out proto control.proto

Protocol concerns

Built-in behavior for dealing with a cluster
Network address negotiation (can we use a shorter/closer address? Is a local socket available?)
Key exchange. Every client is uniquely identified.
Schema validation since we will support code generation.
Schema download
Schema upgrade/manipulation

Permissions concerns

2 apps are sharing data.
- Badly-behaving app B can't destroy data because we never delete data, we just get a new version
  - It could be very badly behaving and could make tons of hostile changes for the user to sift through
  - User-requested rollback based on the client app that made the change? Seems very doable.
- Locking semantics?
Behaviors that need permissions
- Reading data
  - By application-level concept
  - No reason to separate out schema reading, you need the schema if you're getting the data.
  - No reason to separate out querying from iterating, it's just a question of speed.
  - No reason to separate subscribing to updates vs
- Writing data
  - By application-level concept
  - All writes are tagged by application for auditing.
- Altering schema
  - Needs heavy warnings, apps need to really know what they are doing if they alter schema from another app
  - Adding new concepts seems like it should be safe, apps share a namespace but don't share tables.
- Querying the set of available namespaces - can leak data of other applications used. Useful to select plugins on the app's side.
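
The "rollback based on the client app that made the change" idea can be sketched roughly as follows, assuming every write is logged with a transaction number and the app that made it. `Change`, `rollback_app`, and the `"user-rollback"` tag are hypothetical names. Note the rollback itself only appends new versions, so it stays consistent with never deleting data. This sketch only undoes keys whose latest write came from the bad app; anything a well-behaved app wrote afterwards is left alone.

```python
from typing import Any, NamedTuple


class Change(NamedTuple):
    tx: int      # transaction number, increasing over time
    app: str     # application that made the write (the audit tag)
    key: str
    value: Any


def rollback_app(log: list, bad_app: str) -> list:
    """Return new changes that undo every key last written by `bad_app`.

    For each affected key we append a change restoring the most recent value
    written by any other app (or None if no other app ever wrote it).
    """
    next_tx = max((c.tx for c in log), default=0) + 1
    latest = {}      # key -> most recent change overall
    last_good = {}   # key -> most recent change not made by bad_app
    for change in sorted(log, key=lambda c: c.tx):
        latest[change.key] = change
        if change.app != bad_app:
            last_good[change.key] = change
    undo = []
    for key, change in latest.items():
        if change.app == bad_app:
            good = last_good.get(key)
            restored = good.value if good else None
            undo.append(Change(next_tx, "user-rollback", key, restored))
            next_tx += 1
    return undo


log = [
    Change(1, "app-a", "todo/1", "buy milk"),
    Change(2, "app-b", "todo/1", "garbage"),
    Change(3, "app-b", "todo/2", "spam"),
    Change(4, "app-a", "todo/3", "call mom"),
]
undo = rollback_app(log, "app-b")
```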

Design Questions

When defining a schema we could force the schema definer to not just work at the layer of "tables, indices, relationships" but also "permissions/logical units". The idea here is that if the application is highly denormalized we don't want the end user to have to understand that it takes 15 tables to contain "user data" but instead we want there to be a concept of a User within the schema and all the information the User contains so a communicating app can request "Access to read users".
- How burdensome would this be on developers? What are the downsides if the developers are lazy or get it wrong?
Can we defend against nefarious data exfiltration?
- Hard problem. Sandboxing is probably the only valid technical solution here: apps can exfiltrate via crafted DNS queries, so the only way to prevent exfiltration is to sandbox.
At-rest encryption by client key?
- Seems like it may really hurt performance... and for questionable gain.
At what point do we handle permissions?
- We want it done all up-front so we can ask for the user's consent over some channel and do that once.
  - We need to enumerate the app-level concepts for the connection.
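
As a rough sketch of the "permissions/logical units" layer from the first question above: the schema maps each app-level concept to the tables behind it, and a request like "access to read users" is expanded on the server side so the user never sees the 15 tables. The unit and table names here are invented for the example.

```python
# Hypothetical schema fragment: each logical unit names the tables behind it.
SCHEMA_UNITS = {
    "User": ["users", "user_emails", "user_preferences"],
    "TodoItem": ["todo_items", "todo_tags"],
}


def tables_for_grants(grants: dict) -> dict:
    """Expand unit-level grants like {"User": "read"} into per-table access."""
    access = {}
    for unit, mode in grants.items():
        if unit not in SCHEMA_UNITS:
            # Refuse anything the schema definer did not declare as a unit.
            raise KeyError(f"unknown logical unit: {unit}")
        for table in SCHEMA_UNITS[unit]:
            access[table] = mode
    return access


access = tables_for_grants({"User": "read"})
```

This also shows one downside if developers get it wrong: a table left out of every unit can never be granted to another app at all.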

Handshake

We want to reduce the amount of back-and-forth for latency reasons. Performance matters.
We want to achieve the following in the handshake:
- Establish the validity of the client
- Determine what data format (protobuf, json) to use when speaking with the client
- Set the permissions scope of the connection
- Set the namespace for the purposes of sharing data between applications
- Store information about the client itself for auditing purposes (name, version, cert)
- Determine if the data schema is one the client recognizes and can work with
  - We should not assume schema version semantics: there is no v1.1 or v1.2, and no assumption that the client can work with anything below v2.0. That's silly. However, we vastly prefer to be explicit about what schema the client can handle, because otherwise we are relying on implicit assumptions codified in the imperative code of the client. That's dangerous too. Can we use protobuf-style "I ignore fields I don't understand"?
- Figure out the most efficient path between the client and server to avoid unnecessary network hops.
  - Is this necessary, or does the operating system do this for us?
  - This includes redirecting to another process or server in load-balanced applications.
I think we don't need to tell the client that we are waiting for user authorization if we don't want to; we can just let the client queue up requests and hold them until the user confirms.
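
One way to keep the back-and-forth down is to have the client send everything in the list above in a single opening message. The sketch below is only an illustration: `ClientHello`, its field names, and the JSON encoding are assumptions, not the messages defined in control.proto.

```python
import json
from dataclasses import asdict, dataclass

# Formats this hypothetical server can speak.
SERVER_FORMATS = ("protobuf", "json")


@dataclass
class ClientHello:
    """Everything the client sends up-front, in a single message."""
    name: str          # client identity, kept for auditing
    version: str
    cert: str          # placeholder for the client's certificate or public key
    formats: list      # data formats in order of preference
    namespace: str     # namespace used to share data between applications
    scopes: dict       # app-level concept -> "read" or "write"
    schema_id: str     # the exact schema the client was built against


def choose_format(hello: ClientHello) -> str:
    """Pick the first client-preferred format the server also supports."""
    for fmt in hello.formats:
        if fmt in SERVER_FORMATS:
            return fmt
    raise ValueError("no data format in common")


hello = ClientHello(
    name="todo-app", version="0.1", cert="<client cert>",
    formats=["msgpack", "json"], namespace="todo",
    scopes={"TodoItem": "write"}, schema_id="todo-schema-abc123",
)
wire = json.dumps(asdict(hello))             # what would go over the connection
received = ClientHello(**json.loads(wire))   # server side
chosen = choose_format(received)
```

Naming one exact `schema_id` rather than a version range keeps the "be explicit about what the client can handle" preference without assuming any version semantics.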

Client features

Query data lots of neat ways
Subscribe to updates of a particular query, get pushed data on those updates
Transactions
Explicit ordering and unordering of reads/writes
Multiplexed comms to allow parallel reads/writes over a single connection.

Server features

Triggers?

2 Apps, 1 Schema

You have App A and App B. They are sharing data with each other about todo list items. At some point T they both fully agree on the schema.

Do both apps have to fully agree on the schema for all time T?

If the answer to this is "yes" then the two must be upgraded in lock-step with each other in order to continue using both. If A upgrades first it keeps working and B stops working until it upgrades. Syncing between the app developers is super hard. Let's not make the answer "yes". Since the answer is "no" we need to spell out the conditions and the behaviors during different types of disagreement. What are all the kinds of disagreement?

A knows of a table B does not.

This is fine, does not hurt B.

A knows of a field B does not.

This is fine too.
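
It stays fine only if B's writes don't clobber the field it can't see. A minimal sketch of the protobuf-style "ignore fields I don't understand" behavior, assuming records are plain dicts and the server merges updates instead of replacing whole records (`project` and `merge_update` are hypothetical helpers):

```python
def project(record: dict, known_fields: set) -> dict:
    """What an app sees: only the fields its schema knows about."""
    return {k: v for k, v in record.items() if k in known_fields}


def merge_update(stored: dict, update: dict) -> dict:
    """Apply an app's update on top of the stored record.

    Fields the writing app does not know about are carried over untouched.
    """
    merged = dict(stored)
    merged.update(update)
    return merged


stored = {"title": "buy milk", "done": False, "priority": 2}  # written by A
b_fields = {"title", "done"}                                   # B has no "priority"

b_view = project(stored, b_fields)
after = merge_update(stored, {**b_view, "done": True})         # B marks it done
```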

A and B disagree where a piece of data should be kept

This is an interesting case because each app may have a field the other does not know about, and each believes the data belongs in its own field. That is most likely the case, since if B wants to put the data in field Z and A knows about Z, then A should know to put the data there.

There is likely not an automated way to deal with this - the storage engine can't understand the semantics of a given data field without something like AGI. Conceivably we could let users fix this themselves by giving them tools to remap data labels, but that seems crazy for most users.

Ultimately what you need is an authority on what the correct representation of the data is. This leads to the idea that every schema should have one-and-only-one owning application, and zero-or-more integrating applications. If there's a conflict, the owner is right and the integrator is wrong.

No, that's bad, because ultimately we want the user to own the data, and conceivably if they stopped using application A they could just use B and the data would keep working.

Well, now I guess we need a standards body or something.

A believes that a field has a different type than what B believes

For this case we can thankfully detect the conflict and inform the app. But then what? It should abort, I guess...
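
Detecting the conflict is the easy part. A sketch, assuming each app declares its view of the schema as a flat mapping of field name to type name (the field and type names are invented):

```python
def type_conflicts(schema_a: dict, schema_b: dict) -> dict:
    """Return {field: (type in A, type in B)} for fields both apps know
    about but declare with different types."""
    return {
        name: (schema_a[name], schema_b[name])
        for name in schema_a.keys() & schema_b.keys()
        if schema_a[name] != schema_b[name]
    }


schema_a = {"title": "string", "due": "string", "priority": "int"}
schema_b = {"title": "string", "due": "timestamp"}

conflicts = type_conflicts(schema_a, schema_b)
```

Fields only one app knows about (`priority` here) are not reported, which matches the "this is fine" cases above. What the app does with a non-empty result is still the open question.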