- RDBMS = relational database management system
- data stored in tables of related data entries (rows, columns)
- each row is a record
- each column is a field
- actions are performed by SQL statements (keywords are not case sensitive)
- `SELECT` - extracts data
- `UPDATE` - updates data (beware of missing `WHERE` clauses)
- `DELETE` - deletes data
- `INSERT INTO` - inserts data
- `CREATE DATABASE`
- `ALTER DATABASE`
- `CREATE TABLE`
- `ALTER TABLE`
- `DROP TABLE`
- `CREATE INDEX` (an index is a search key)
- `DROP INDEX`
- `COUNT`, e.g. `SELECT COUNT(DISTINCT id) FROM Customers;`
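The statements above can be sketched with Python's built-in `sqlite3` (the `Customers` table and its data are invented for illustration):

```python
import sqlite3

# In-memory database; table/column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Customers (id INTEGER, name TEXT)")
cur.execute("INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Bob'), (2, 'Bob')")
cur.execute("UPDATE Customers SET name = 'Alan' WHERE id = 1")  # WHERE guards the update
cur.execute("DELETE FROM Customers WHERE id = 99")              # matches nothing, deletes nothing

cur.execute("SELECT COUNT(DISTINCT id) FROM Customers")
print(cur.fetchone()[0])  # → 2 (ids 1 and 2)
```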
- `WHERE` specifies conditions
- Operators: `=`, `>`, `<`, `>=`, `<=`, `<>`, `BETWEEN`, `IN` (for many possible values), `AND`, `OR`, `NOT`
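A quick sketch of `BETWEEN`, `IN` and `<>` using `sqlite3` (the `Products` table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Products (id INTEGER, price REAL)")
cur.executemany("INSERT INTO Products VALUES (?, ?)",
                [(1, 5.0), (2, 15.0), (3, 25.0), (4, 35.0)])

# BETWEEN is inclusive on both ends
cur.execute("SELECT id FROM Products WHERE price BETWEEN 15 AND 35")
print([r[0] for r in cur.fetchall()])  # → [2, 3, 4]

# IN matches any of several possible values; <> is "not equal"
cur.execute("SELECT id FROM Products WHERE id IN (1, 3) AND price <> 5.0")
print([r[0] for r in cur.fetchall()])  # → [3]
```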
- `LIKE` for pattern search
  - `'a%'`: starts with `a`
  - `'%a'`: ends with `a`
  - `'%a%'`: contains `a`
- Wildcards:
  - `%`: any sequence of characters
  - `_`: exactly one character
  - MS Access variants: `*` (any sequence), `?` (one character), `#` (single digit), `[]` (character set, e.g. `h[oa]t` returns `hot`, `hat`), `!` (negation inside `[]`), `-` (range inside `[]`)
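The standard `%` and `_` wildcards in action via `sqlite3` (the `Words` table is invented; the Access-style wildcards above are not supported by `LIKE` in most engines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Words (w TEXT)")
cur.executemany("INSERT INTO Words VALUES (?)",
                [("hot",), ("hat",), ("heat",), ("shot",)])

q = lambda pat: [r[0] for r in cur.execute("SELECT w FROM Words WHERE w LIKE ?", (pat,))]

print(q("h%"))   # starts with h → ['hot', 'hat', 'heat']
print(q("%t"))   # ends with t → all four words
print(q("h_t"))  # _ matches exactly one character → ['hot', 'hat']
```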
- `ORDER BY`: default = `ASC`, else `DESC` (need column name to specify)
- `NULL`: no value, e.g. for optional fields; does NOT mean 0. Check like so: `IS NULL` / `IS NOT NULL`
- `TOP` / `LIMIT`
- `MIN()` / `MAX()`
- `AVG()`
- `SUM()`
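A sketch of the `NULL` pitfall and the aggregates, again with `sqlite3` (the `Orders` table is invented; `ORDER BY ... DESC LIMIT 1` stands in for `TOP`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Orders (id INTEGER, amount REAL)")  # amount is optional
cur.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 30.0)])

# "= NULL" never matches; IS NULL is the correct check
print(cur.execute("SELECT COUNT(*) FROM Orders WHERE amount = NULL").fetchone()[0])   # → 0
print(cur.execute("SELECT COUNT(*) FROM Orders WHERE amount IS NULL").fetchone()[0])  # → 1

# aggregates skip NULLs: AVG is (10+30)/2, not (10+0+30)/3
print(cur.execute("SELECT MIN(amount), MAX(amount), AVG(amount), SUM(amount) "
                  "FROM Orders").fetchone())  # → (10.0, 30.0, 20.0, 40.0)
print(cur.execute("SELECT id FROM Orders ORDER BY amount DESC LIMIT 1").fetchone()[0])  # → 3
```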
- Views: virtual tables based on the result set of an SQL statement
- Performance: use indexes; be mindful of `IN` subqueries
Entity | MySQL | SQL Server |
---|---|---|
DATE | YYYY-MM-DD | YYYY-MM-DD |
DATETIME | YYYY-MM-DD HH:mm:ss | YYYY-MM-DD HH:mm:ss |
TIMESTAMP | YYYY-MM-DD HH:mm:ss | a unique number |
YEAR | YYYY or YY | N/A |
SMALLDATETIME | N/A | YYYY-MM-DD HH:mm:ss |
- Install PostgreSQL (& specify a password)
- Verify installation: `psql --version` (should return something like `psql (PostgreSQL) 14.1`)
- Connect to a db like so: `psql -d database_name -U username`, where `-d` is short for `--dbname` and `-U` (note the caps) is short for `--username`. The default user & default db when installing PostgreSQL are named `postgres`, so the command is `psql -d postgres -U postgres`
- Enter the password specified in step (1).

Alternatively one can use the SQL Shell (psql). There is a prompt for a server, a database, a port and a username. Pressing enter selects the default values, which are `localhost`, `postgres`, `5432`, and `postgres`. Again the password is required.
The environment looks like so:
postgres=#
Commands:
\? # list all commands
\h COMMAND # documentation of a specific command
\q # quit
\l # list all dbs
\l+ # list all dbs & extra info
\c db_name # change selected db
\dt # list all tables from selected db
\d table_name # describe a table (\d+ yields more info about the table)
\e # open editor
\i path_to_file/file_name.sql # read input from a file as if you had typed it into the terminal
\timing # measure time a query takes
\copy films(title, year, running_time) FROM 'path_to_file' DELIMITER ',' CSV HEADER; # import data from csv file
pg_dump -U username database_name > path_to_file/filename.sql # backup a db (can be into an .sql or even a .tar file)
psql -U username -d database_name -f filename.sql # restore db from .sql file
pg_restore -U username -d database_name path_to_file/filename.tar # restore db from .tar file
# create tables, delete dbs, alter tables, insert data, select like SQL
Types of indexes:
- B-tree Index: the most commonly used type of index to efficiently store and retrieve data in PostgreSQL. It's the default index type. Whenever we use the CREATE INDEX command without specifying the type of index we want, PostgreSQL will create a B-tree index for the table or column.
CREATE INDEX index_name ON table_name (column_name); -- B-tree is the default index type
- Hash Indexes: designed for fast key-value lookups. When a query condition requires equality checks on indexed columns, hash indexes can provide extremely fast retrieval, as the hash function directly determines the location of the desired data. Hash indexes are most suitable for equality comparisons, such as = or IN operations.
CREATE INDEX hash_name ON table_name USING HASH (column_name);
- GiST and SP-GiST Indexes: GiST (Generalized Search Tree) and SP-GiST (Space-Partitioned Generalized Search Tree) indexes are advanced index types in PostgreSQL that support a wide range of data types and search operations. They are particularly useful for handling complex data structures and spatial data; GiST indexes can also speed up full-text searches.
CREATE INDEX index_geometry ON table_name USING GIST (geometry_column); -- GiST index on a geometry column
CREATE INDEX index_points ON table_name USING SPGIST (point_column); -- SP-GiST index on a point column
- BRIN Indexes: Block Range Index (BRIN) is an index type in PostgreSQL designed to provide efficient indexing for large tables whose data correlates with its physical storage order. A BRIN index stores the minimum and maximum values for each range of database pages.
CREATE INDEX timestamp ON table_name USING BRIN (column);
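The PostgreSQL-specific types (hash, GiST, SP-GiST, BRIN) can't be demonstrated without a running server, but the default B-tree behavior can be sketched with `sqlite3`, which also builds B-tree indexes; the table and index names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
cur.execute("CREATE INDEX idx_created ON events (created_at)")  # B-tree by default

# the query planner picks up the index for an equality lookup
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE created_at = '2024-01-01'"
).fetchall()
print(plan)  # the plan's detail column mentions idx_created
```

In PostgreSQL the analogous check is `EXPLAIN SELECT ...`, which reports an Index Scan once the planner decides the index is worthwhile.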
\l : shows all dbs (equivalent: SELECT datname FROM pg_database;)
\c dbname : use this db
\dt : show tables of public schema
\dn+ : list schemas with their access privileges
\dt *.* : show tables of all schemas
\dt schema.* : show tables of specific schema
\d table_name : info of table
\d+ table_name : same as above, with extra info
\dt *.table_name* : same as above
Relational DB Management System | MongoDB |
---|---|
Database | Database |
Table | Collection |
Tuple/Row | Document |
Column | Field |
Table Join | Embedded Documents |
Primary Key | Primary Key (**) |
(server) mysqld / oracle | mongod |
(client) mysql / sqlplus | mongo |
(**) Mongo's PK:
- Default key `_id` provided by MongoDB itself: 12 bytes (24 hex characters), unique
  - 4 bytes: timestamp
  - 3 bytes: machine ID
  - 2 bytes: process ID
  - 3 bytes: incremented value
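The 4-byte timestamp can be read straight out of the hex string, as in this sketch (the sample ObjectId is a made-up example value):

```python
from datetime import datetime, timezone

oid = "507f1f77bcf86cd799439011"  # example 24-hex-char ObjectId (12 bytes)

seconds = int(oid[:8], 16)        # first 4 bytes: creation timestamp (seconds since epoch)
created = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(created.year)  # → 2012 (this sample id encodes an October 2012 timestamp)
```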
- schema-less (is a document db which holds documents which may differ)
- structure of a single object is clear
- no complex joins
- deep query-ability
- tuning
- easy to scale
- mapping of app objects to db objects not needed
- use of internal memory for storing (windowed) working set → faster data access
$ sudo service mongodb start
$ sudo service mongodb stop
$ sudo service mongodb restart
$ mongo # → access db
- embedded data model (denormalized): you can embed all related data in a single document
e.g.:
{ _id, personal_details: { fname, lname, dob }, contact: { email, phone }, address: { city, area, state } }
- normalized data model: refer to subdocuments of original document using references
e.g.
Employee {_id} Personal_Details {id, fname, lname, dob} Contact {id, email, phone} Address {id, city, area, state}
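The two shapes side by side as plain Python dicts (all field values are invented for illustration):

```python
# Embedded (denormalized): one document holds everything
embedded = {
    "_id": 1,
    "personal_details": {"fname": "Jane", "lname": "Doe", "dob": "1990-01-01"},
    "contact": {"email": "jane@example.com", "phone": "555-0100"},
    "address": {"city": "Athens", "area": "Center", "state": "Attica"},
}

# Normalized: subdocuments live in their own collections, linked by _id references
employee = {"_id": 1, "personal_details_id": 10, "contact_id": 20, "address_id": 30}
personal_details = {"_id": 10, "fname": "Jane", "lname": "Doe", "dob": "1990-01-01"}

# Embedded reads need no extra lookups; normalized reads follow the reference first
print(embedded["contact"]["email"])  # → jane@example.com
```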
use DB_NAME # create DB if doesn't exist, else return existing db
db.help() # help
db # check currently selected db
show dbs # list dbs - must have at least 1 document in db in order to display
db.dropDatabase() # deletes selected db
db.createCollection(collName, optionalConfigOptions) # options:
# capped: boolean (if true, fixed size => deletes older records - we need to specify size param)
# autoIndexId (boolean: for _id field - default false)
# size
# max (max number of documents)
db.collection.insert({data data})
show collections
db.collection.drop()
db.collection.insert({data data})
db.collection.insertOne({data data})
db.collection.insertMany([{data data}, {data data}])
db.collection.find() # returns all fields of this collection
db.collection.find().pretty() # returns all fields of this collection in formatted fashion
db.collection.find().limit(NUM) # find until NUM
db.collection.find().skip(NUM) # ignore NUM first documents
# these can be chained
db.collection.findOne()
db.collection.update(CRITERIA, DATA)
db.collection.save({_id: ObjectId(), NEW_DATA}) # replace existing document
db.collection.findOneAndUpdate(CRITERIA, NEW_DATA) # update values of existing document
db.collection.updateOne(filters, update) # update single document based on filter
db.collection.updateMany(filters, update) # update all documents based on filter
db.collection.remove(CRITERIA) # delete matching documents
db.collection.remove(CRITERIA, 1) # delete only 1 document
db.collection.remove({}) # delete all documents
- String
- Integer
- Boolean
- Double
- Min/Max Keys
- Arrays
- Timestamp
- Object
- Null
- Symbol
- Date
- ObjectId
- BinaryData
- Code
- Regex
Name | Description |
---|---|
$eq | Matches values that are equal to a specified value. |
$gt | Matches values that are greater than a specified value. |
$gte | Matches values that are greater than or equal to a specified value. |
$in | Matches any of the values specified in an array. |
$lt | Matches values that are less than a specified value. |
$lte | Matches values that are less than or equal to a specified value. |
$ne | Matches all values that are not equal to a specified value. |
$nin | Matches none of the values specified in an array. |
Name | Syntax |
---|---|
AND | {"likes": {$gt: 10}, $and: [{"comments": {$lt: 25}}]} |
OR | {$or: [{"name": "John"}, {"name": "JANE"}]} |
NOR | {$nor: [{"name": "John"}, {"name": "JANE"}]} |
NOT | {"name": {$not: {$eq: "JANE"}}} |
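What the AND/OR filters mean, written as plain Python predicates over a sample document (the document's values are invented):

```python
doc = {"likes": 12, "comments": 30, "name": "John"}

# {"likes": {$gt: 10}, $and: [{"comments": {$lt: 25}}]}
and_match = doc["likes"] > 10 and doc["comments"] < 25   # comments is 30, so False

# {$or: [{"name": "John"}, {"name": "JANE"}]}
or_match = doc["name"] == "John" or doc["name"] == "JANE"

print(and_match, or_match)  # → False True
```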
A MongoDB schema is created based on workload, relationships & patterns.
- Entity = what exists in our db & is unique & independent of each other
- Attribute = individual properties that describe an entity
- Based on the project's specs, we can come up with the independent entities we will need (e.g. ebooks, audio books, printed books, users, reviews, authors, publishers for a bookstore db)
- Their quantities are approximations based on internal business stakeholder data (e.g. expect 200,000 audiobooks, 25,000,000 users, 500 publishers etc). Paying extra attention to the entities largest in number is important.
Identify the read & write operations for all user types (e.g. user, admin) in a table like so (we quantify the operations' (expected) rate based on stakeholder data/estimations):
Entities | Operations | Information Needed | Type (Read/Write) | Rate |
---|---|---|---|---|
Books | Fetch info | Book details + Rating | Read | 10/s |
Reviews | Fetch 10 reviews of a book | Reviews & Review rating | Read | 20/s |
Users | Fetch user details | User details | Read | 5/s |
Books | Add/Update | Book details | Write | 1/s |
Printed books | Sell copy | Stock level | Write | 2/s |
Reviews | Add review | Review & Book rating | Write | 10/s |
Printed Books | (Admin) Fetch low stock printed books | Book details + Stock level | Read | 2/day |
- 1-1
- 1-N
- N-N
We can identify which relationship we need each time by asking these two questions:
- Can entity A be related to more than one entity B?
- Can entity B be related to more than one entity A?
- Reference: separate entities, linked using a key
- Embedding: both entities in a single document
Data that is accessed together should be stored together
Guideline | Question | Embed | Reference |
---|---|---|---|
Simplicity | Would keeping the pieces of info together lead to simpler data model & code? | Y | N |
Go Together | Do the pieces of info have a has-a / contains or similar relationship? | Y | N |
Query Atomicity | Does the application query the pieces of info together? | Y | N |
Update Complexity | Are the pieces of info updated together? | Y | N |
Archival | Should the pieces of info be archived at the same time? | Y | N |
Cardinality | Is there a high cardinality (current or growing) in the child size of the relationship? | N | Y |
Data Duplication | Would data duplication be too complicated to manage and undesired? | N | Y |
Document Size | Would the combined size of the pieces of info take too much memory or transfer bandwidth for the app? | N | Y |
Document Growth | Would the embedded piece grow without bound? | N | Y |
Workload | Are the pieces of info written at different times in a write-heavy workload? | N | Y |
Individuality | For the children size of the relationship, can the pieces exist by themselves without a parent? | N | Y |
Select based on these guidelines. Sometimes a guideline may be the most important, so it's not always a mere matter of which method gets the most "points".
- Embedding:
- include fields for the child entity into the parent entity
- It's better to group related fields in a subdocument within the parent entity (better organization, covers for duplicate names)
- Referencing
- add key field in one of the two entities (whichever entity is fetched primarily should be the one with the key)
- or both (bidirectional)
- Embedding:
  - different ways:
    - embed the `many` side as an array of subdocuments in the `one` side
    - create one subdocument within the parent document
  - good for no duplicate info
  - good for entities (children) that cannot exist without the other (parent)
- Referencing:
  - different ways:
    - include an array of references in the parent document (preferred)
    - include a reference to the parent in each child (if the array would be unbounded, it's best to reference the other way around, i.e. from each child to the parent)
    - bidirectional references
- Embedding:
  - embed the documents from the child side using an array in the parent side in each parent document
    - this can lead to data duplication (not always bad - better query performance)
  - or, use a single subdocument with key-value pairs
- Referencing:
  - use an array of references (in either direction or both)
    - this can lead to data duplication
    - bidirectional is more expensive to manage
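Resolving an array of references is an application-side "join", sketched here with plain dicts standing in for collections (all names and ids are invented):

```python
# Parent documents hold arrays of child _ids
books = {1: {"_id": 1, "title": "Dune", "author_ids": [10, 11]}}
authors = {10: {"_id": 10, "name": "Frank"}, 11: {"_id": 11, "name": "Brian"}}

# Follow the references to rebuild the relationship
book = books[1]
book_authors = [authors[a]["name"] for a in book["author_ids"]]
print(book_authors)  # → ['Frank', 'Brian']
```

In MongoDB proper this is what `$lookup` (or a second `find`) does; embedding avoids the extra step at the cost of duplication.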
- Different forms of similar entities
- Shared fields or unique?
- Keep polymorphic documents under the same collection; differentiate by a field which indicates what (extra) fields are to be expected in this document & its shape
- Use the aggregation framework to apply operations
- Run ops when data changes
- Store results for quick access
- Operations
- Mathematical (store calculated value in a document instead of constantly recalculating with each fetching)
- Roll-up (merging data)
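A minimal sketch of the mathematical case: keep the computed rating on the document and update it incrementally instead of re-averaging every review on each read (field names and values are invented):

```python
# Computed value stored on the book document itself
book = {"_id": 1, "rating": 4.0, "num_reviews": 4}

def add_review(book, new_rating):
    # incremental mean: (old_mean * n + x) / (n + 1), updated at write time
    n = book["num_reviews"]
    book["rating"] = (book["rating"] * n + new_rating) / (n + 1)
    book["num_reviews"] = n + 1

add_review(book, 5.0)
print(book["rating"], book["num_reviews"])  # → 4.2 5
```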
- generate statistically valid number that is not exact
- used to reduce resource usage for data that does not need to be exact
- use random number generator to extrapolate the new value in the application level
- trade accuracy for better db performance
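One way to sketch this: instead of writing a counter to the db on every event, write roughly 1 out of every N events and scale the increment by N (N = 100 is an arbitrary choice for the sketch):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N = 100          # write to the db only ~1/N of the time
db_counter = 0   # value persisted in the database

for _ in range(10_000):          # simulate 10,000 events, e.g. page views
    if random.random() < 1 / N:  # random draw in the application layer
        db_counter += N          # each write stands in for ~N events

print(db_counter)  # ≈ 10,000: accurate enough, with ~100x fewer db writes
```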
- (relational) `JOIN` ↔ (MongoDB) `$lookup`, but both are expensive
- To avoid them we use this pattern: reduce or eliminate `JOIN`s by embedding data from multiple documents into the main document
- also leads to faster reads
- be mindful of how to minimize duplication (use only the fields you need, prefer fields that don't change)
- be mindful of how to keep duplicate data updated based on source ("What is the list of dependent extended references?", "Do they need to be updated immediately?")
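A dict sketch of the pattern: the order embeds only the publisher fields it reads together with, preferring fields that rarely change (all names and values are invented):

```python
# Source document in its own collection
publisher = {"_id": 7, "name": "Acme Press", "address": "1 Main St", "phone": "555-0101"}

# Extended reference: a subset of publisher fields copied into the order
order = {
    "_id": 100,
    "publisher": {"_id": 7, "name": "Acme Press"},  # phone/address change more often, so left out
}

# Reading the order needs no second lookup for the common case
print(order["publisher"]["name"])  # → Acme Press
```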
- Mongo assists with schema versioning with no downtime:
  - add an extra field to documents (`"schema_version": 2`; if it's missing, implicitly v1)
  - use some application support
  - existing documents:
    - update via application
    - update via background task
  - before updating the schema we need to update the app
- due to the flexible document model, documents with different shapes can coexist in the same collection
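The application-support part can be sketched as a reader that branches on `schema_version`, defaulting to v1 when the field is absent (the v1/v2 field shapes here are assumptions for the sketch):

```python
def read_user(doc):
    # a missing field means the document predates versioning: implicit v1
    version = doc.get("schema_version", 1)
    if version == 1:
        # assumed v1 shape: a single "name" field
        return {"full_name": doc["name"]}
    # assumed v2 shape: name split into fname/lname
    return {"full_name": f'{doc["fname"]} {doc["lname"]}'}

print(read_user({"name": "Jane Doe"}))                                    # → {'full_name': 'Jane Doe'}
print(read_user({"schema_version": 2, "fname": "Jane", "lname": "Doe"}))  # → {'full_name': 'Jane Doe'}
```

Both shapes coexist in the same collection; a background task can migrate old documents at leisure.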