https://unum.cloud/post/2021-12-31-dbms-startups/. Databases in 2021 https://news.ycombinator.com/item?id=29749541 https://ottertune.com/blog/2021-databases-retrospective/ https://news.ycombinator.com/item?id=29731885 https://news.ycombinator.com/item?id=28736405 https://engineering.fb.com/2013/06/25/core-data/tao-the-power-of-the-graph/ . TAO DB (Facebook) https://www.relational.ai/ . written in Julia https://twitter.com/RelationalAI https://www.graphistry.com/ https://docs.datasette.io/en/stable/. Datasette https://www.reddit.com/r/javascript/comments/dmixm5/show_reddit_rxdb_a_reactive_database_where_you/
https://dbdiagram.io/home . DB schema diargrams
https://habr.com/ru/post/456116/  yEd Graph Editor построить красивую схему базы данных

https://www.databass.dev/ BOOK
http://www.redbook.io/
https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U   Advanced database systems
https://queue.acm.org/detail.cfm?id=3220266
https://www.infoq.com/presentations/storage-algorithms/
https://www.youtube.com/watch?v=2SUBRE6wGiA&list=PLSE8ODhjZXjY0GMWN4X8FIkYNfiu8_Wl9 time series databases


https://cstack.github.io/db_tutorial/parts/part1.html .  Let build SQLite clone from scratch
 
https://blog.pankajtanwar.in/how-database-indexing-actually-works-internally 
 
https://news.ycombinator.com/item?id=21033951
https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html
https://architecture-database.blogspot.com/2019/03/database-internel-architecture-sqlite.html
http://coding-geek.com/how-databases-work/
https://yunpengn.github.io/blog/
https://dbdb.io/

database in Go from scrach
https://news.ycombinator.com/item?id=21893009

https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-sql-engine/

https://pingcap.com/blog/10x-performance-improvement-for-expression-evaluation-made-possible-by-vectorized-execution/

RDF

https://news.ycombinator.com/item?id=20806916 RDF triplestore

SQL compiler

https://habr.com/ru/company/badoo/blog/461699/ . SQL compiler https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod18.pdf . Query Compiler https://habr.com/ru/company/badoo/blog/461699/ . SQL parsing http://blog.felipe.rs/2019/01/29/demystifying-join-algorithms/ https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ https://numeracy.co/blog/life-of-a-sql-query http://www.sommarskog.se/query-plan-mysteries.html

OrientDB

https://orientdb.com/ . OrientDB https://habr.com/ru/post/324012/ . OrientDB https://habr.com/ru/company/orienteer/blog/324758/ . OrientDB https://habr.com/company/psb/blog/434730/ . In memory DBs https://habr.com/ru/company/oleg-bunin/blog/319968/ . Tarantool, NoSQL https://habr.com/ru/company/oleg-bunin/blog/413557/ . NoSQL https://blog.acolyer.org/2019/08/30/choosing-a-cloud-dbms/ . Choosing a cloud DBMS: architectures and tradeoffs Athena could not run 4 of the 22 TPC-H queries, and Spectrum could not run 2 of them) https://habr.com/ru/post/440306/ Проблемы масштабирования БД в высоконагруженных системах https://habr.com/ru/company/otus/blog/451042/ . Postgres vs Cassandra vs Mongo

MVCC (MultiVersion Concurrency Control)

https://habr.com/ru/post/469415/ https://en.wikipedia.org/wiki/Multiversion_concurrency_control The simplest way is to make all readers wait until the writer is done, which is known as a read-write lock. Locks are known to create contention especially between long read transactions and update transactions. MVCC aims at solving the problem by keeping multiple copies of each data item. In this way, each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the changes have been completed (or, in database terms: until the transaction has been committed.) When an MVCC database needs to update a piece of data, it will not overwrite the original data item with new data, but instead creates a newer version of the data item. Thus there are multiple versions stored. The version that each transaction sees depends on the isolation level implemented. The most common isolation level implemented with MVCC is snapshot isolation. With snapshot isolation, a transaction observes a state of the data as when the transaction started. MVCC introduces the challenge of how to remove versions that become obsolete and will never be read. In some cases, a process to periodically sweep through and delete the obsolete versions is implemented. This is often a stop-the-world process that traverses a whole table and rewrites it with the last version of each data item. https://en.wikipedia.org/wiki/Multiversion_concurrency_control http://elliot.land/post/sql-transaction-isolation-levels-explained If someone is reading from a database at the same time as someone else is writing to it, it is possible that the reader will see a half-written or inconsistent piece of data. There are several ways of solving this problem, known as concurrency control methods. The simplest way is to make all readers wait until the writer is done, which is known as a lock. This can be very slow, so MVCC takes a different approach: each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the changes have been completed (or, in database terms: until the transaction has been committed.) The SQL standard tried to define four isolation levels (read uncommitted, read committed, repeatable read and serializable) - Read uncommitted permits dirty reads, non repeatable reads and phantom reads. - Read committed permits non repeatable reads and phantom reads. - Repeatable read permits only phantom reads. - Serializable does not permit any read errors. Client perform transactions in serial.

Sharding

https://www.citusdata.com/blog/2017/08/09/principles-of-sharding-for-relational-databases/ https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6 https://habr.com/company/oleg-bunin/blog/433370/ https://www.quora.com/What-is-the-difference-between-replication-partitioning-clustering-and-sharding If there is a small table which is part of SQL join with a sharded table then make a replica of small table to all nodes. Consistent hashing. Rendezvous hashing, Highest Random Weight : shard_id = arg max hash(object_id, shard_id). https://habr.com/company/gridgain/blog/430852/ . ACID consistency distributed DBs https://medium.com/capital-one-tech/batch-and-streaming-in-the-world-of-data-science-and-data-engineering-2cc029cdf554 https://www.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA https://harelba.github.io/q/ q allows performing SQL-like statements on tabular text data. https://news.ycombinator.com/item?id=18453133 SQL on CSV

Apache Flink

https://www.infoq.com/presentations/sql-streaming-apache-flink

DB Mongo

https://www.scaleapi.com/blog/athena#asdf https://news.ycombinator.com/item?id=17438516

HANA

https://habr.com/ru/article/436462/

DB Internals

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database . LMDB https://habr.com/company/oleg-bunin/blog/358984/ https://news.ycombinator.com/item?id=17190947 database internals 2-phase commit https://developers.redhat.com/blog/2018/10/01/patterns-for-distributed-transactions-within-a-microservices-architecture/ https://thecuriousdev.org/logstash-elk-stack-tutorial-part-1/ Logstash https://movio.co/en/blog/improving-with-sql-and-charts/ SQL to plot

bitmap index

http://richardstartin.uk/how-a-bitmap-index-works/ https://www.pilosa.com/docs/latest/introduction/ distributed bitmap index

datasette

https://simonwillison.net/2018/Jan/17/datasette-publish/ datasette https://news.ycombinator.com/item?id=16170892 datasette https://www.youtube.com/watch?v=lmP75mp3-Rg https://habrahabr.ru/post/346884/ in memory https://habrahabr.ru/company/ruvds/blog/350310/ Caching https://habrahabr.ru/company/mailru/blog/267469/ type-ahead hints in dropdown http://neerc.ifmo.ru/wiki/index.php?title=%D0%91%D0%BE%D1%80 https://thomaswdinsmore.com/2017/02/01/year-in-sql-engines/ https://ordepdev.me/posts/what-you-should-know-about-database-storage-and-retrieval

ORACLE

https://livesql.oracle.com https://blogs.oracle.com/sql/how-to-convert-rows-to-columns-and-back-again-with-sql-aka-pivot-and-unpivot select TABLE_NAME from user_tables select * from sometable where rownum <= 10 order by name select * from ( select * from emp order by sal desc ) where ROWNUM <= 5;

Links

http://www.olapcube.com/mdxhelp/ http://knexjs.org/

Embedded DBs

https://github.com/pmwkaa/engine.so

SQLite

https://habr.com/ru/post/528882/ http://tech.marksblogg.com/sqlite3-tutorial-and-guide.html http://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ https://github.com/simonw/csvs-to-sqlite https://simonwillison.net/2017/Nov/13/datasette/

DuckDB

Embedded Analytical database https://www.duckdb.org https://shekhargulati.com/2019/12/15/the-5-minute-introduction-to-duckdb-the-sqlite-for-analytics/ https://fosdem.org/2020/schedule/event/duckdb/

MySQL

https://habrahabr.ru/post/351740/ schema performance https://news.ycombinator.com/item?id=15613856 http://www.mysqltutorial.org/mysql-row_number/ MySQL variables: set @n:=-1; select (select @n:= @n+1) n from MyTable limit 60

Cassandra

https://habr.com/ru/company/sberbank/blog/484096/ . Cassandra vs HBase https://habr.com/ru/post/465333/ https://blog.softwaremill.com/7-mistakes-when-using-apache-cassandra-51d2cf6df519 https://blog.emumba.com/apache-cassandra-part-1-introduction-and-key-features-18d02ba0b8cc https://medium.com/techlogs/using-apache-cassandra-a-few-things-before-you-start-ac599926e4b8 http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html https://vadimtropashko.wordpress.com/%e2%80%9csql-design-patterns%e2%80%9d-book/about/ Issue with following table - the partotion become very big very fast. CREATE TABLE raw_data ( sensor text, ts timeuuid, readint int, primary key(sensor, ts) ) WITH CLUSTERING ORDER BY (ts DESC) AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': 1, 'compaction_window_unit': 'DAYS'}; Let change primary key: one partition per sensor per day would be a good choice if we’re storing 50-75MB of data per day. We could just as easily use week (starting from some epoch), or month and year as long as the partitions stay under 100MB. CREATE TABLE raw_data_by_day ( sensor text, day text, ts timeuuid, reading int, primary key((sensor, day), ts) ) WITH CLUSTERING ORDER BY (ts DESC) AND COMPACTION = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': 1}; append-only store compaction tumbstone (deleted marker) Bloom filter garanteed true negatives TimeWindowCompaction SizeTieredCompaction LeveledCompaction materialed view keyspace: bag of tables with replication factor, similar to schema class: SimpleStategy or NetworkStrategy Cassandra: up to 10Tb per node, HDFS has better density rack - aware, region - aware partition key is subset of Primary Key Columns and tables support an optional expiration period called TTL (time-to-live); TTL is not supported on counter columns. Define the TTL value in seconds. Data expires once it exceeds the TTL period and is then marked with a tombstone. Expired data continues to be available for read requests during the grace period, see gc_grace_seconds. Normal compaction and repair processes automatically remove the tombstone data. mirmur3_hash(primary_key) -> token (range of keys) Keyspace attributes: - replication factor - replica placement strategy - column family for keyspace cqlsh> describe keyspaces select columnfamily_name from system.schema_columnfamilies where keyspace_name = 'test'; or cqlsh> use products; cqlsh:products> describe tables; or DESCRIBE COLUMNFAMILIES; if nodes are non in sync then node with latest data wins http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html Consistecy levels (per statement read/write) for reads and writes: ALL - all replicas ack QUORUME > 51% LOCAL_QUORUM in local DC ONE TWO write: commit log and to memtable -> SSTable cassandra.yml fsynced every 10s http://abiasforaction.net/cassandra-query-language-cql-tutorial/ https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/ https://www.instaclustr.com/resource/cassandra-nosql-data-model-design/ https://www.instaclustr.com/cassandra-nosql-data-model-design-2/ https://www.instaclustr.com/apache-cassandra-3-x-and-materialized-views/ https://tekslate.com/cassandra-interview-questions-and-answers https://opencredo.com/how-not-to-use-cassandra-like-an-rdbms-and-what-will-happen-if-you-do/ https://opencredo.com/cassandra-data-modelling-patterns/ https://opencredo.com/training/cassandra-fundamentals-data-modelling/ https://medium.com/walmartlabs/avoid-pitfalls-in-scaling-your-cassandra-cluster-lessons-and-remedies-a71ca01f8c04 https://vitobotta.com/2017/04/02/getting-started-apache-cassandra/ http://cassandra.alteroot.org/ http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html http://lucasmanual.com/blog/quick-intro-to-cassandra-vs-mongodb-with-python/ http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/ https://opencredo.com/spark-testing/ https://www.youtube.com/watch?v=Mvsy2DZhKSw https://www.youtube.com/watch?v=HuDJBTPdaOA https://www.reddit.com/r/cassandra/comments/6l01iu/wide_row_append_only_timeseries_data_no_ttl_what/ https://www.instaclustr.com/resource/apache-spark-apache-cassandra-powering-intelligent-applications/ https://www.instaclustr.com/resource/6-step-guide-to-apache-cassandra-data-modelling-white-paper/ log-structured merge trees (LSMT) LSM trees just dumping stuff at the end of an array. However this means you have to scan that array to do lookups (as opposed to something easier like binary search). Then as your array gets big, you merge and flush it down to a lower "layer" of the lsm tree which is slightly bigger and sorted. And when that one fills, you flush further. And these merge-flushes are nice big sequential writes so that's nice too. http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/ https://www.goetas.com/blog/how-i-approached-software-development-and-why-i-prefer-postgresql-to-mysql/

TimeSeries DB

https://habr.com/ru/company/oleg-bunin/blog/464303/ TimeScaleDB and PipelineDB https://medium.com/netflix-techblog/scaling-time-series-data-storage-part-i-ec2b6d44ba39 http://devconnected.com/4-best-time-series-databases-to-watch-in-2019/ https://blog.outlyer.com/top10-open-source-time-series-databases https://news.ycombinator.com/item?id=20760324 TimeScaleDB and others https://news.ycombinator.com/item?id=18553336 https://news.ycombinator.com/item?id=18402890 https://medium.com/netflix-techblog/scaling-time-series-data-storage-part-i-ec2b6d44ba39 https://medium.com/@Pinterest_Engineering/goku-building-a-scalable-and-high-performant-time-series-database-system-a8ff5758a181 http://db.cs.cmu.edu/seminar2017/ http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html https://db.cs.cmu.edu/seminar2017/ https://oss.redislabs.com/redistimeseries/ https://github.com/sirixdb/sirix . SirixDB https://news.ycombinator.com/item?id=19825566 https://news.ycombinator.com/item?id=15649405

Analytics

https://www.analyticsvidhya.com/blog/2014/09/commonly-asked-puzzles-analytics-interviews/ https://www.reddit.com/r/predictiveanalytics/ https://dzone.com/big-data-analytics-tutorials-tools-news https://dzone.com/articles/streamline-the-machine-learning-process-using-apac https://medium.com/rv-data/mleap-providing-near-real-time-data-science-with-apache-spark-c34e7df093ca http://quasar-analytics.org/ Scala engine for analytics