Block 1

Intro

  1. Data:

    facts and statistics collected together for reference or analysis

    e.g. sounds, images, text, values

  2. Database:

    • Collection of logically related data
    • Description of this data
    • Meets information needs of organisation
    • A shared collection of logically related data and its description, designed to meet the information needs of an organization.
    • e.g. student management database, library database
  3. DBMS:

    • A software system which enables users to define, create, maintain, and control access to the database.
    • e.g. Oracle, MySQL, Access
  4. Data model:

    • A model is a representation of ‘real world’ objects and events, and their associations.
    • A data model is a graphical description of the components of a database.
    • An integrated collection of concepts for describing and manipulating data, relationships between data, and constraints on the data in an organization.(textbook)
    • a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.(dbs concept)
    • e.g. ER model, relational model
  5. Schema:

    • The description of the database is the database schema.
  6. The three-level ANSI-SPARC architecture:

    • Objective of three level ANSI-SPARC architecture: Separate each user’s view of the database from the way the database is physically represented.
  7. ADV of DBMS:

    • Control of data redundancy
    • Data consistency
    • More information from the same amount of data
    • Sharing of data
    • Improved data integrity
    • Improved security
    • Enforcement of standards
    • Economy of scale
    • Balance of conflicting requirements
    • Improved data accessibility and responsiveness
    • Increased productivity
    • Improved maintenance through data independence
    • Increased concurrency
    • Improved backup and recovery services

  8. DISV of DBMS:

    • Complexity
    • Size
    • Cost of DBMS
    • Additional hardware costs
    • Cost of conversion
    • Performance
    • Higher impact of a failure

Relational model

  1. Relational model:

    In the relational model, all data is logically structured within relations (tables).

  2. Relation:

    A relation is a table with columns and rows.

  3. Attribute:

    • An attribute is a named column of a relation.
    • a property that describes some aspect of the object that we wish to record.
  4. Domain:

    the set of allowable values for one or more attributes.

  5. Tuple:

    A tuple is a row of a relation.

  6. Degree:

    the number of attributes in a relation.

  7. Cardinality:

    the number of tuples in a relation.

  8. Relational database:

    A collection of normalized relations with distinct relation names.

  9. Candidate Key:

    A set of attributes that uniquely identifies a tuple within a relation.

  10. Primary Key:

    Candidate key selected to identify tuples uniquely within relation.

  11. Foreign Key:

    Attribute, or set of attributes, within one relation that matches candidate key of some other (possibly same) relation.

  12. NULLs:

    Represents a value for an attribute that is currently unknown or is not applicable for this tuple.

  13. Entity Integrity:

    In a base relation, no attribute of a primary key can be null.

  14. Referential Integrity:

    If foreign key exists in a relation, either foreign key value must match a candidate key value of some tuple in its home relation or foreign key value must be null.

  15. General Constraints:

    additional rules on attribute values, specified by users or database administrators

  16. Entity-relationship (E-R) modelling :

    E-R modelling is a high-level conceptual modelling technique for DB applications

  17. Entity:

    • An entity is a “thing” about which data should be stored.
    • Group of objects with same properties, identified by enterprise as having an independent existence.
    • a distinct object in the organization that is to be represented in the database.
  18. Relationship:

    A relationship is a meaningful association between entities

  19. Multiplicity:

    number (or range) of possible occurrences of an entity type that may relate to a single occurrence of an associated entity type through a particular relationship.

  20. Cardinality:

    Describes maximum number of possible relationship occurrences for an entity participating in a given relationship type.

  21. Participation:

    • Determines whether all or only some entity occurrences participate in a relationship.
    • Gives the minimum number for an entity occurrences participating in a given relationship type.
  22. Degree of a Relationship:

    Number of participating entities in a relationship.

  23. Ternary relationship:

    three entities are simultaneously involved.
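The integrity rules above (points 13 and 14) can be sketched in code. This is a hypothetical illustration, not part of the notes: relations are modelled as lists of dicts, NULL is Python's `None`, and the `student`/`dept` names are invented.

```python
# Sketch of entity and referential integrity over in-memory "relations"
# (lists of dicts). All relation and attribute names are made up.

def entity_integrity(relation, pk_attrs):
    """Entity integrity: no attribute of the primary key may be NULL (None)."""
    return all(t[a] is not None for t in relation for a in pk_attrs)

def referential_integrity(child, fk_attr, parent, ck_attr):
    """Referential integrity: each FK value is NULL or matches a CK value."""
    parent_keys = {t[ck_attr] for t in parent}
    return all(t[fk_attr] is None or t[fk_attr] in parent_keys for t in child)

dept = [{"dept_id": "D1"}, {"dept_id": "D2"}]
student = [{"sid": 1, "dept_id": "D1"},
           {"sid": 2, "dept_id": None}]   # NULL FK is allowed

print(entity_integrity(student, ["sid"]))                          # True
print(referential_integrity(student, "dept_id", dept, "dept_id"))  # True
```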

Block 2

EER

  1. EER model(enhanced entity-relationship model)
  2. specialization/ generalization
  3. participation constraint
    • mandatory
    • optional
  4. disjoint constraint
    • disjoint(or)
    • nondisjoint(and)

Logical Database Design

  1. each entity is a relation, break composite attributes
  2. one-to-many: post PK of one-side entity into the many-side entity which acts as a foreign key
  3. one-to-one: put the foreign keys in one or both entities
    • both mandatory: combine
    • one side mandatory: from optional to mandatory
    • both optional: new relation like many to many
  4. one-to-one recursive:
    • both mandatory: same relation with a foreign key
    • both optional: new relation with PK and foreign key
  5. many-to-many: new relation
  6. mandatory and: one relation
  7. optional and: two relations for superclass and subclasses
  8. mandatory or: each combination of superclass and subclass
  9. optional or: one relation for the superclass and one relation for each subclass
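As a rough illustration of rules 2 and 5 above (hypothetical example relations with invented names like `branch`, `staff` and `enrols`):

```python
# Rule 2 (one-to-many): the PK of the one-side (branch_no) is posted into
# the many-side relation, where it acts as a foreign key.
branch = [{"branch_no": "B1", "city": "Glasgow"}]                 # one side
staff = [{"staff_no": "S1", "name": "Ann", "branch_no": "B1"}]    # many side, FK posted

# Rule 5 (many-to-many): create a new relation whose PK is the
# combination of the two posted foreign keys.
enrols = [{"student_id": 1, "course_id": "C1"}]  # PK = (student_id, course_id)

print(staff[0]["branch_no"])  # B1 -- links the staff tuple to its branch
```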

Database design

main phases:

  • Gather requirements
  • Conceptual database design
  • Logical database design
  • Physical database design

Block 3

Normalization

  1. Data Redundancy

    • the details of some data repeat in one relation
  2. Update Anomalies

    • Insertion anomalies

      • insert some data but data in the same tuple is absent
    • Deletion anomalies

      • delete some data but other important data is deleted as well
    • Modification anomalies

      • you must update the data many times
  3. Normalization

    produce a suitable set of relations that support the data requirements of an enterprise

    • minimal number of necessary attributes
    • close logical relationship
    • minimal redundancy (deals with update anomalies ← insertion, deletion, and modification anomalies)
    • benefits: access and maintain easily, minimal storage space
  4. Decomposition

    • Lossless-join property
    • Dependency preservation property
  5. Functional Dependencies

    • if each value of A in R is associated with exactly one value of B in R.
    • determinant: the left-side attributes set
    • must hold for all possible values, not just those in the current instance
  6. Full Functional Dependency

    • functional dependency, but not on proper subsets of determinant
    • partial dependency: not full dependency, some attributes can be removed from determinant
  7. Transitive Dependencies

    If A → B and B → C, then C is transitively dependent on A via B

  8. find FDs

    • relationships between attributes
    • requirements & assumptions
    • common experience
  9. find CKs

    • attributes that appear only on the left-hand sides of FDs must be part of every CK
    • attributes that appear only on the right-hand sides cannot be part of any CK
    • attributes that appear on neither side must be part of every CK
  10. UNF

    a table that contains one or more repeating groups (i.e. a cell can hold more than one value)

  11. 1NF

    • A relation in which the intersection of each row and column contains one and only one value (i.e. one cell, one value).
    • UNF → 1NF: find the repeating groups, then either flatten them or move them into new relations
  12. 2NF

    • in 1NF, every non-primary-key attribute is fully functionally dependent on the PK
    • 1NF → 2NF:
      • identify FDs
      • identify PK
      • if PFD exists, remove those attributes and place them into a new relation with their determinants
  13. 3NF

    • in 2NF, and no non-PK is transitively dependent on the PK
    • 2NF → 3NF:
      • identify FDs
      • identify PK
      • if TD exists, remove those attributes and place them into a new relation with their determinants
  14. General 2NF & 3NF

    • define on non-CK and CK
  15. BCNF

    • A relation is in BCNF if and only if every determinant is a candidate key

    • 3NF → BCNF:

      • find FDs that violate BCNF
      • decompose
      • find FDs
      • find CKs
      • repeat until all relations are in BCNF
    • BCNF problem

      with AB → C and C → B, decomposing into BCNF cannot preserve both FDs (AB → C is lost)

  16. MVD(Multi-valued Dependency)

    • Dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the set of values for B and C are independent of each other.
    • A ↠ B, A ↠ C
    • trivial MVD: B is a subset of A, or A ∪ B = R
  17. 4NF

    • a relation is in 4NF if and only if, for every nontrivial multi-valued dependency A ↠ B, A is a candidate key of the relation.
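The FD definition above (point 5) can be checked mechanically on a relation instance. A minimal sketch with an invented `fd_holds` helper and made-up data; note that a real FD must hold for all possible instances, so a check on one instance can only refute an FD, never prove it:

```python
# Check whether A -> B holds in a given relation instance
# (relation = list of dicts). Helper and data are invented examples.

def fd_holds(relation, lhs, rhs):
    """A -> B holds if equal LHS values always imply equal RHS values."""
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False   # same determinant value, different dependent value
    return True

r = [
    {"staff": "S1", "branch": "B1", "city": "Glasgow"},
    {"staff": "S2", "branch": "B1", "city": "Glasgow"},
    {"staff": "S3", "branch": "B2", "city": "London"},
]
print(fd_holds(r, ["branch"], ["city"]))  # True: branch -> city
print(fd_holds(r, ["city"], ["staff"]))   # False: same city, different staff
```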

Transaction Management

Transaction

series of actions that read or update the contents of the database (actions: read, write, etc.)

  • ACID property

    • Atomicity: all actions are completed or nothing is completed
    • Consistency: a transaction must transform the database from one consistent state to another consistent state.
    • Isolation: each transaction is unaware of other transactions executing concurrently in the system(Independence)
    • Durability: change is permanent
  • outcomes

    • commit
    • roll back

Concurrency

lots of transactions executed at the same time.

  1. problems(violates ACID property, especially Isolation)

    • Lost Update Problem

      one transaction's update is overwritten by another's

      | time | T1       | T2      | X      |
      |------|----------|---------|--------|
      | t1   |          | begin   | 100    |
      | t2   | begin    | read(X) | 100    |
      | t3   | read(X)  | X=X+100 | 100    |
      | t4   | X=X-10   | write(X)| 200    |
      | t5   | write(X) | commit  | 90/190 |
      | t6   | commit   |         | 90     |
    • Uncommitted Dependency Problem

      another transaction reads uncommitted data

      | time | T3       | T4       | X      |
      |------|----------|----------|--------|
      | t1   |          | begin    | 100    |
      | t2   |          | read(X)  | 100    |
      | t3   |          | X=X+100  | 100    |
      | t4   | begin    | write(X) | 200    |
      | t5   | read(X)  |          | 200    |
      | t6   | X=X-10   | rollback | 100    |
      | t7   | write(X) |          | 190/90 |
      | t8   | commit   |          | 190    |
    • Inconsistent Analysis Problem

      a transaction reads values that are being updated by another transaction, producing a wrong analysis

      | time | T5       | T6           | X   | Y  | Z  | sum    |
      |------|----------|--------------|-----|----|----|--------|
      | t1   | begin    |              | 100 | 50 | 25 |        |
      | t2   |          | begin, sum=0 | 100 | 50 | 25 | 0      |
      | t3   | read(X)  | read(X)      | 100 | 50 | 25 | 0      |
      | t4   | X=X-10   | sum=sum+X    | 100 | 50 | 25 | 100/90 |
      | t5   | write(X) | read(Y)      | 90  | 50 | 25 | 100    |
      | t6   | read(Z)  | sum=sum+Y    | 90  | 50 | 25 | 150    |
      | t7   | Z=Z+10   |              | 90  | 50 | 25 | 150    |
      | t8   | write(Z) |              | 90  | 50 | 35 | 150    |
      | t9   | commit   | read(Z)      | 90  | 50 | 35 | 150    |
      | t10  |          | sum=sum+Z    | 90  | 50 | 35 | 180    |
      | t11  |          | commit       | 90  | 50 | 35 | 180    |
  2. schedule

    sequence of operations

    • serial schedule: no interleaved operations
    • serialisability: a nonserial schedule is serializable if it is equivalent to some serial schedule
    • conflict: two operations use the same resource and at least one of them is a write
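The lost-update interleaving in the first table above can be replayed in plain Python. This is a sketch, not a real DBMS: local variables stand in for each transaction's private buffer.

```python
# Replay of the lost-update schedule: T2 adds 100, T1 subtracts 10,
# but T1 writes a value it read *before* T2's write, so the +100 is lost.

X = 100
t2_local = X        # t2: T2 read(X)
t1_local = X        # t3: T1 read(X)
t2_local += 100     # t3: T2 X = X + 100
X = t2_local        # t4: T2 write(X) -> 200
t1_local -= 10      # t4: T1 X = X - 10 (uses the stale value 100)
X = t1_local        # t5: T1 write(X) -> 90; T2's update is lost
print(X)            # 90, not the correct serial result 190
```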

Concurrency Control Techniques

  1. locking

    • shared lock (read) -- other transactions may read, but not change, the item simultaneously
    • exclusive lock (write) -- no other transaction can read or change the item simultaneously
  2. 2-phase locking

    • growing phase-- acquire locks, no releasing
    • shrinking phase-- release locks, no acquiring
  3. preventing problems with 2PL

  4. Cascading Rollback

    One transaction’s failure causes many to fail

    • how to deal with it? Postpone the release of all locks until the end of the transaction.
  5. Deadlock

    two (or more) transactions are each waiting for locks held by the other to be released(crossed locks)

    • how to deal with?

      • Timeouts

        A transaction that requests a lock waits for only a system-defined period of time; if the lock is not granted within this period, the request times out and the transaction is rolled back.

      • Deadlock prevention

        Conservative 2PL: All data items have to be locked at the beginning of a transaction

  6. Timestamping

  7. Granularity

  8. Database Recovery

    • Log File

    • Checkpointing

      undo transactions that were still active at t_failed

      redo transactions that committed between t_checkpoint and t_failed
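The shared/exclusive lock rules from point 1 can be condensed into a small compatibility check. A sketch with an invented `can_grant` helper, not a real lock manager:

```python
# Lock compatibility: shared (S) locks are compatible with each other;
# an exclusive (X) lock is compatible with nothing. Names are made up.

def can_grant(requested, held):
    """held: list of lock modes already granted on the data item."""
    if requested == "S":
        return all(m == "S" for m in held)  # read OK unless a writer holds X
    return not held                         # X requires the item to be unlocked

print(can_grant("S", ["S", "S"]))  # True: many readers can share
print(can_grant("X", ["S"]))       # False: a reader blocks the writer
print(can_grant("X", []))          # True: unlocked item
```

Under 2PL, a transaction acquires all such locks in its growing phase and releases them only in its shrinking phase.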

Block 4

DDBMS

Architecture

  1. 2-tier architecture
    Client, database server
  2. 3-tier architecture
    Client (user interface), application server (data processing, business logic), database server (data validation, database access)
  3. Multi-tier
    Extends from 3-tier

Distributed Database

  1. Distributed Database
    Interrelated collection of data, physically spread over a computer network
  2. Distributed DBMS
    Software system that permits management of the distributed database and makes the distribution transparent to users
    Single logical database split into fragments that are distributed over a computer network
  3. Distributed processing
    Centralized database
    Can be accessed by lots of computers over computer network

Types

  1. Homogeneous DDBMS
    all sites use same DBMS product
  2. Heterogeneous DDBMS
    Different DBMS products
    Solution: gateways

Architecture for DDBMS

  1. Expected functionality
    Communication
    Data dictionary
    Query processing
    Concurrency control
    Recovery
  2. Reference architecture
    Global external schemas
    Global conceptual schema
    Fragmentation schema & allocation schema
    Schema for each local DBMS
  3. Components
    Local DBMS
    Data communication
    Global system catalogue
    DBMS

Design – key issues

  1. Fragmentation

    relation may be divided into a number of sub-relations, which are distributed

    • Dis:

      • Performance --slower
      • Integrity --difficult
    • Correctness

      • Completeness
      • Reconstruction
      • Disjointness
    • Types

      • Horizontal
      • Vertical
      • Mixed

      Use relational algebra to obtain fragmentation of different type

  2. Allocation

    each fragment is stored at the site with "optimal" distribution

  3. Replication

    copy of fragment may be maintained at several sites
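The relational-algebra view of fragmentation (selection for horizontal fragments, projection for vertical ones) can be mimicked on in-memory data. An illustrative sketch with made-up relation contents; note the vertical fragment keeps the PK so the relation can be reconstructed by join:

```python
# Horizontal and vertical fragmentation sketched with list/dict comprehensions.
# All relation contents and names are invented.

staff = [
    {"sno": "S1", "name": "Ann", "branch": "B1"},
    {"sno": "S2", "name": "Bob", "branch": "B2"},
]

# Horizontal: sigma_{branch=B1}(Staff) -- whole tuples for one site's data
frag_b1 = [t for t in staff if t["branch"] == "B1"]

# Vertical: pi_{sno,name}(Staff) -- keep the PK in every fragment
# so the original relation can be rebuilt (reconstruction correctness rule)
frag_names = [{"sno": t["sno"], "name": t["name"]} for t in staff]

print(frag_b1)     # [{'sno': 'S1', 'name': 'Ann', 'branch': 'B1'}]
print(frag_names)  # [{'sno': 'S1', 'name': 'Ann'}, {'sno': 'S2', 'name': 'Bob'}]
```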

Strategy for data allocation
  1. Centralized -- a single database at one site (equivalent to distributed processing)
  2. Fragmented
  3. Complete Replication
  4. Selective Replication

Transparency

  1. Distribution transparency
    Feels like a single logical entity
    • fragmentation transparency
    • location transparency
    • replication transparency
    • naming transparency
      DDBMS must ensure that no two sites create a database object with the same name

      Solutions:

      • create a central name server
      • add a site prefix to the name of each object
  2. Transaction transparency

    • distributed transactions maintain DDB’s integrity and consistency
    • access data stored in more than one site
    • divided into subtransactions
    • DDBMS ensures synchronization
  3. Performance transparency

    • like a centralized DBMS
    • DQP (Distributed Query Processor): generates an optimized execution strategy, accounting for costs of different types (e.g. communication cost, which is usually the most significant)
    • strategies: move which data to which site
    • communication time = C0 + (no_of_bits_in_message / transmission_rate)
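A worked example of the communication-time formula above; C0 (the fixed cost of initiating a message) and the other numbers are invented for illustration:

```python
# communication time = C0 + (no_of_bits_in_message / transmission_rate)

def comm_time(c0, bits, rate):
    """c0: fixed message-initiation cost (s); rate: bits per second."""
    return c0 + bits / rate

# e.g. 1 s setup, 10^8-bit message over a 10^6 bit/s link:
print(comm_time(1, 10**8, 10**6))  # 101.0 seconds -- transfer dominates
```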

ADV & DISADV

  • ADV
    • Reflects organizational structure
    • Improved shareability and local autonomy
    • Improved availability
    • Improved reliability
    • Improved performance
    • Economics
    • Modular growth
  • DISADV
    • Complexity
    • Cost
    • Security
    • Integrity control more difficult
    • Lack of standards
    • Lack of experience
    • Database design more complex

XML(eXtensible Markup Language)

  • is a markup specification language

  • Standard for data representation and exchange(data storage)

  • enables designers to create their own customised tags

<!-- Bookstore with no DTD -->
<Bookstore>
  <Book ISBN="0590353403" price="20" Edition="1st">
    <Title>Harry Potter</Title>
    <Author>
      <FirstName>J.K.</FirstName>
      <LastName>Rowling</LastName>
    </Author>
  </Book>
</Bookstore>
  1. Elements

    • must be properly nested
    • <name>Mike</name> is an element
    • <EMPTYELEMENT/> is the abbreviation of empty element
  2. Attributes

    • Attribute is placed inside start-tag
    • vs elements
      • a given attribute name may appear only once per tag
      • attributes cannot describe nested structure
  3. declaration

    • <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
  4. comments

    • <!-- This is a comment -->
  5. well-formed XML

    • adheres to basic structural requirements
      • single root element
      • matched tags, proper nesting
      • unique attributes within elements
  6. Displaying: Use rule-based languages to translate to HTML

  7. DTD(Document Type Definitions)

    • Defines the valid syntax of an XML document.
    • Specifies elements, attributes, nesting, ordering, number of occurrences, also special attribute types ID and IDREF(S).
  8. XSD(XML Schema Definition)

    • XML schema is the definition (both in terms of its organization and its data types) of a specific XML structure.
    • covers: types, key declarations, references, occurrence constraints
  9. Relational Model vs XML

    |                | Relational              | XML                 |
    |----------------|-------------------------|---------------------|
    | structure      | tables (rows & columns) | hierarchical (tree) |
    | schema         | fixed                   | flexible            |
    | queries        | SQL                     | not simple          |
    | ordering       | none                    | ordered             |
    | implementation | mature                  | add-on              |
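Well-formedness (point 5) can be checked by any XML parser; a sketch using Python's standard library on a document like the bookstore example above (a parse error would mean the document is not well-formed):

```python
# Parse a small XML document; ET.fromstring raises ParseError on
# mismatched tags, multiple roots, duplicate attributes, etc.
import xml.etree.ElementTree as ET

doc = """<Bookstore>
  <Book ISBN="0590353403" price="20" Edition="1st">
    <Title>Harry Potter</Title>
    <Author><FirstName>J.K.</FirstName><LastName>Rowling</LastName></Author>
  </Book>
</Bookstore>"""

root = ET.fromstring(doc)             # succeeds: the document is well-formed
print(root.tag)                       # Bookstore (single root element)
print(root.find("Book").get("ISBN"))  # 0590353403 (attribute access)
print(root.find("Book/Title").text)   # Harry Potter (element content)
```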

Data Mining

  1. Data Mining

    • The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
    • Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
  2. application

    • Retail / Marketing
    • Banking
    • Insurance
    • Medicine
  3. Operations

    • Predictive modelling
    • Database segmentation
    • Link analysis
    • Deviation detection
  4. Techniques

    specific implementations of the data mining operations.

    • predictive modelling

      • Classification
        • tree induction
        • neural induction
      • supervised learning
      • linear regression

      • non-linear regression

    • Database segmentation

      • Unsupervised learning

      • Less precise

    • Link analysis

      • Establishing links between sets of records
    • Deviation detection

  5. Supervised learning vs unsupervised learning

    • Supervised learning
      • Learning patterns from labelled datasets
      • 2 phases: training and testing
    • Unsupervised learning
      • From unlabelled datasets
  6. Stages

    • Data selection

    • Data cleaning

    • Data mining/transformation

    • Data evaluation

  7. Data warehouse

    • Well equipped for mining
    • Store the data for mining
    • Populated with clean, consistent data
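As a toy illustration of supervised learning and predictive modelling (points 4 and 5), a hand-rolled nearest-neighbour classifier on invented labelled data; real techniques such as tree or neural induction are far more involved:

```python
# Supervised learning in miniature: "train" on labelled examples,
# then "test" by classifying unseen values. Data and names are made up.

def nearest_neighbour(train, x):
    """train: list of (feature, label); predict the label of the nearest feature."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# training phase: patterns learned from a labelled dataset
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

# testing phase: classify values the model has not seen
print(nearest_neighbour(train, 1.5))  # low
print(nearest_neighbour(train, 8.5))  # high
```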

NoSQL

  1. Big data: the amount of data being created has skyrocketed

  2. Not only SQL / Not relational

  3. Features

    • flexible schema
    • quicker and cheaper to set up
    • massive scalability
    • relaxed consistency, high performance, availability(client accessible)
    • negative: more programming
    • negative: fewer guarantees
  4. BASE
    Relaxation of ACID properties(No ACID constraints)

    basically available, soft state, eventually consistent

  5. CAP theorem

    assumption: many nodes, nodes contain replicas of partitions of data

    • consistency: data has the same value at the same time
    • availability: available even if a server is down
    • partition tolerance: a query still gets an answer even though the system is partitioned

    you can guarantee at most two of the three CAP properties at once

  6. MapReduce

    • map: filtering and sorting
    • reduce: summary
  7. Key-value Stores: efficiency, scalability, fault-tolerance

    • model: key-value pairs
  8. Document Stores

    • model: key-document pairs
  9. Graph database system

    • node-edge
  10. semi-structured

    • best for massive data that is mostly read
    • not for frequently updated data
  11. Not used for

    • applications that require frequent updates
    • applications that require ACID guarantees (≈ integrity and atomicity)
    • applications that require more structured data
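The MapReduce idea from point 6 can be sketched without any framework: map emits key-value pairs, the pairs are sorted by key (the shuffle step), and reduce summarises each group. Word count is the classic example; the helper names here are invented:

```python
# Word count as MapReduce: map = emit (word, 1) pairs, reduce = sum per word.
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Map: split each document and emit a (word, 1) pair per occurrence."""
    return [(w, 1) for d in docs for w in d.split()]

def reduce_phase(pairs):
    """Shuffle (sort by key), then reduce: sum the counts for each word."""
    pairs.sort(key=itemgetter(0))
    return {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}

print(reduce_phase(map_phase(["to be", "or not to be"])))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```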