Skip to content →

Notes on Databases and Data Warehouses

  • The most simple database in the world
#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
  • Log refers to an append-only data file
  • When logs get too big that can be broken into segment files and consolidated using compaction
  • Appending to a Log is O(1), but fetching a value is O(n). That’s why we need indexes
  • Database indexes are an additional data structure that speeds up fetches
  • Hash indexes store a key: byte offset in RAM, for quick lookups
  • Log-structure storage segments are not sorted, so SSTables (sorted string tables) were created.
  • SSTables are implemented using Red-Black Trees or AVL Trees that maintain the order when a new key is inserted into the database. This is called a memtable.
  • The most widely-used indexing structure is a B-Tree
  • Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries
  • B-Trees store data on disk, not RAM
  • Log-structured indexes created segments that are megabytes in size; b-trees use blocks / pages that are usually 4Kb in size, in alignment with hardware that also uses fixed-size blocks
  • Pages allow spare space for new entries
  • “B” is balanced.
  • B-Trees use a write-ahead log that temporarily stores any modification before actually updating the B-Tree. This improves reliability and data integrity by handling situations where the database crashes in the middle of a write operation. This is also stored on disk.
  • Write amplification refers the to multiple write operations for the same piece of data throughout the lifetime of a database. For example, writing to a B-Tree requires two writes (one to the write-ahead log, and one to the tree).
  • Databases can have multiple indexes
  • Secondary indexes are crucial for making joins efficient
  • The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file
  • There are many types of indexes
    • Concatenated creates an index on concatenated fields, ex: first_name, last_name -> first_name,last_name
    • Multidimensional creates an index using more than one field, ex: latitude > 1.001 and latitude < 2.0 and longitude > 5.5 and longitude < 10.9
  • The term “transaction” comes from when databases were used for business transactions, like tracking purchases, paying employees, etc.
  • Data Analytics is basically scanning over a huge number of records, only reading a few columns per record, and calculating aggregate statistics (such as count, sum, or average)
  • A Data Warehouse is a database dedicated to analytics queries (as opposed to transactional processes)
  • OLTP (online transaction processing) databases need to be highly available and have low-latency reads and writes. Doing a massive analytics query on an OLTP database can increase latency, so that’s why they are separated into “data warehouses”.
  • The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
  • Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL)
  • Data Warehouses
    • Amazon Redshift
    • Apache Hive
    • Facebook Presto
  • A Fact Table is a table in a data warehouse that contains (usually petabytes of) rows of transactional data.
  • Star schema is a way of storing data in a warehouse. The fact table is in the center and it consists of columns that are foreign keys to other tables (dimension tables) that represent the who, what, where, when and why of the event.
  • A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
  • Column-Oriented Storage: don’t store all the values from one row together, but store all the values from each column together instead. Each column is stored in a separate file.
  • Column files are often repetitive, which makes them good candidates for compression using bitmap encoding

Published in Today I Learned