Database Revision
Block 1
Intro
-
Data:
facts and statistics collected together for reference or analysis
eg. sounds, images, text, values
-
Database:
- Collection of logically related data
- Description of this data
- Meets information needs of organisation
- A shared collection of logically related data and its description, designed to meet the information needs of an organization.
- eg. student management database, library database
-
DBMS:
- A software system which enables users to define, create, maintain, and control access to the database.
- eg. Oracle, MySQL, ACCESS
-
Data model:
- A model is a representation of ‘real world’ objects and events, and their associations.
- A data model is a graphical description of the components of database.
- An integrated collection of concepts for describing and manipulating data, relationships between data, and constraints on the data in an organization.(textbook)
- a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.(dbs concept)
- eg. er model, relational model
-
Schema:
- The description of the database is the database schema.
-
The three-level ANSI-SPARC architecture:
- Objective of three level ANSI-SPARC architecture: Separate each user’s view of the database from the way the database is physically represented.
-
ADV of DBMS:
• Control of data redundancy • Data consistency • More information from the same amount of data • Sharing of data • Improved data integrity • Improved security • Enforcement of standards• Economy of scale • Balance conflicting requirements • Improved data accessibility and responsiveness • Increased productivity • Improved maintenance through data independence • Increased concurrency • Improved backup and recovery services
-
DISV of DBMS:
• Complexity • Size • Cost of DBMS • Additional hardware costs • Cost of conversion • Performance • Higher impact of a failure
Relational model
-
Relational model:
In the relational model, all data is logically structured within relations (tables).
-
Relation:
A relation is a table with columns and rows.
-
Attribute:
- An attribute is a named column of a relation.
- a property that describes some aspect of the object that we wish to record.
-
Domain:
the set of allowable values for one or more attributes.
-
Tuple:
A tuple is row of a relation.
-
Degree:
the number of attributes in a relation.
-
Cardinality:
the number of tuples in a relation.
-
Relational database:
A collection of normalized relations with distinct relation names.
-
Candidate Key:
A set of attributes that uniquely identifies a tuple within a relation.
-
Primary Key:
Candidate key selected to identify tuples uniquely within relation.
-
Foreign Key:
Attribute, or set of attributes, within one relation that matches candidate key of some other (possibly same) relation.
-
NULLs:
Represents a value for an attribute that is currently unknown or is not applicable for this tuple.
-
Entity Integrity:
In a base relation, no attribute of a primary key can be null.
-
Referential Integrity:
If foreign key exists in a relation, either foreign key value must match a candidate key value of some tuple in its home relation or foreign key value must be null.
-
General Constraints:
specific constraints on values
-
Entity-relationship (E-R) modelling :
E-R modelling is a high-level conceptual modelling technique for DB applications
-
Entity:
- An entity is a “thing” about which data should be stored.
- Group of objects with same properties, identified by enterprise as having an independent existence.
- a distinct object in the organization that is to be represented in the database.
-
Relationship:
Relationship describes a linkage between two entities
-
Multiplicity:
number (or range) of possible occurrences of an entity type that may relate to a single occurrence of an associated entity type through a particular relationship.
-
Cardinality:
Describes maximum number of possible relationship occurrences for an entity participating in a given relationship type.
-
Participation:
- Determines whether all or only some entity occurrences participate in a relationship.
- Gives the minimum number for an entity occurrences participating in a given relationship type.
-
Degree of a Relationship:
Number of participating entities in a relationship.
-
Ternary relationship:
three entities are simultaneously involved.
Block 2
EER
- EER model(enhanced entity-relationship model)
- specialization/ generalization
- participation constraint
- mandatory
- optional
- disjoint constraint
- disjoint(or)
- nondisjoint(and)
Logical Database Design
- each entity is a relation, break composite attributes
- one-to-many: post PK of one-side entity into the many-side entity which acts as a foreign key
- one-to-one: put the foreign keys in one or both entities
- both mandatory: combine
- one side mandatory: from optional to mandatory
- both optional: new relation like many to many
- one-to-one recursive:
- both mandatory: same relation with a foreign key
- both optional: new relation with PK and foreign key
- many-to-many: new relation
- mandatory and: one relation
- optional and: two relations for superclass and subclasses
- mandatory or: each combination of superclass and subclass
- optional or: one for each relation
Database design
main phases:
- Gather requirements
- Conceptual database design
- Logical database design
- Physical database design
Block 3
Normalization
-
Data Redundancy
- the details of some data repeat in one relation
-
Update Anomalies
-
Insertion anomalies
- insert some data but data in the same tuple is absent
-
Deletion anomalies
- delete some data but other important data is deleted as well
-
Modification anomalies
- you must update the data many times
-
-
Normalization
produce a suitable set of relations that support the data requirements of an enterprise
- minimal number of necessary attributes
- close logical relationship
- minimal redundancy( deal with Update Anomalies Insertion anomalies, Deletion anomalies, Modification anomalies)
- benefits: access and maintain easily, minimal storage space
-
Decomposition
- Lossless-join property
- Dependency preservation property
-
Functional Dependencies
- if each value of A in R is associated with exactly one value of B in R.
- determinant: the left-side attributes set
- holds all the time
-
Full Functional Dependency
- functional dependency, but not on proper subsets of determinant
- partial dependency: not full dependency, some attributes can be removed from determinant
-
Transitive Dependencies
and , then C is transitively dependent on A via B
-
find FDs
- relationships between attributes
- requirements & assumptions
- common experience
-
find CKs
- only appear on the left, must be part
- only appear on the right, cannot be part
- don’t appear on both sides, must be part
-
UNF
a table contains one or more repeating groups(i.e. one grid has not only one value)
-
1NF
- A relation in which the intersection of each row and column contains one and only one value.(i.e. one grid one value)
- UNF 1NF: find repeating groups, then flatten them or create new relations to place them
-
2NF
- in 1NF, every non-primary-key attribute is fully functionally dependent on the PK
- 1NF 2NF:
- identify FDs
- identify PK
- if PFD exists, remove those attributes and place them into a new relation with their determinants
-
3NF
- in 2NF, and no non-PK is transitively dependent on the PK
- 2NF 3NF:
- identify FDs
- identify PK
- if TD exists, remove those attributes and place them into a new relation with their determinants
-
General 2NF & 3NF
- define on non-CK and CK
-
BCNF
-
A relation is in BCNF if and only if every determinant is a candidate key
-
3NF BCNF:
- find FDs that violate BCNF
- decompose
- find FDs
- find CKs
- repeat until all relations are in BCNF
-
BCNF problem
AB C, C B When decomposing, we lost both FDs
-
-
MVD(Multi-valued Dependency)
- Dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the set of values for B and C are independent of each other.
- A B, A C
- trivial MVD: B is subset of A, AB=R
-
4NF
- a relation is in 4NF if and only if for every nontrivial multi-valued dependency A B, A is a candidate key of the relation.
Transaction Management
Transaction
series of actions that reads or updates contents of database(action: read, write, change, etc.)
-
ACID property
- Atomicity: all actions are completed or nothing is completed
- Consistency: valid state cannot be changed, all definitions must be preserved.
- Isolation: each transaction is unaware of other transactions executing concurrently in the system(Independence)
- Durability: change is permanent
-
outcomes
- commit
- roll back
Concurrency
lots of transactions executed at the same time.
-
problems(violates ACID property, especially Isolation)
-
Lost Update Problem
update is overridden
time T1 T2 X t1 begin 100 t2 begin read(X) 100 t3 read(X) X=X+100 100 t4 X=X-10 write(X) 200 t5 write(X) commit 90/190 t6 commit 90 -
Uncommitted Dependency Problem
another transaction get uncommitted data
Time T3 T4 X t1 begin 100 t2 read(X) 100 t3 X = X+100 100 t4 begin write(X) 200 t5 read(X) … 200 t6 X=X-10 rollback 100 t7 write(X) 190/90 t8 commit 190 -
Inconsistent Analysis Problem
transaction read values that updated by other transactions that make wrong analysis
time T5 T6 X Y X sum t1 begin 100 50 25 t2 begin Sum = 0 100 50 25 0 t3 read(X) read(X) 100 50 25 0 t4 X = X-10 sum = sum + X 100 50 25 100/90 t5 write(X) read(Y) 90 50 25 100 t6 read(Z) sum = sum + Y 90 50 25 150 t7 Z = Z + 10 90 50 25 150 t8 write(Z) 90 50 35 150 t9 commit read(Z) 90 50 35 150 t10 sum = sum + Z 90 50 35 180 t11 commit 90 50 35 180
-
-
schedule
sequence of operations
- serial schedule: no interleaved operations
- Serialisability & serializable: nonserial schedule is equivalent to serial schedule
- conflict: write and use same resource
Concurrency Control Techniques
-
locking
- shared lock(read)-- no transaction can change it simultaneously
- exclusive lock(write)-- no transaction can read it simultaneously
-
2-phase locking
- growing phase-- acquire locks, no releasing
- shrinking phase-- release locks, no acquiring
-
preventing problems with 2PL
-
Cascading Rollback
One transaction’s failure causes many to fail
- how to deal with? ->Postpone the release of all locks until end of the transaction.
-
Deadlock
two (or more) transactions are each waiting for locks held by the other to be released(crossed locks)
-
how to deal with?
-
Timeouts
Transaction that requests lock will only wait for a system-defined period of time. Out of this period, locks time out.
-
Deadlock prevention
Conservative 2PL: All data items have to be locked at the beginning of a transaction
-
-
-
Timestamping
-
Granularity
-
Database Recovery
-
Log File
-
Checkpointing
undo transactions at
redo transactions between and
-
Block 4
DDBMS
Architecture
- 2-tier architecture
Client, database sever - 3-tier architecture
Client(User Interface), application sever(data processing, business logic), database sever (data validation, database access ) - Multi-tier
Extends from 3-tier
Distributed Database
- Distributed Database
Interrelated collection of data, physically spread over a computer network - Distributed DBMS
Software system allows management, but makes distribution transparent
Single logical database split into fragments that distributed over a computer network - Distributed processing
Centralized database
Can be accessed by lots of computers over computer network
Types
- Homogeneous DDBMS
all sites use same DBMS product - Heterogeneous DDBMS
Different DBMS products
Sol. Gateways
Architecture for DDBMS
- Excepted functionality
Communication
Data dictionary
Query processing
Concurrency control
Recovery - Reference architecture
Global external schemas
Global conceptual schema
Fragmentation schema & allocation schema
Schema for each local DBMS - Components
Local DBMS
Data communication
Global system catalogue
DBMS
Design – key issues
-
Fragmentation
relation may be divided into a number of sub-relations, which are distributed
-
Dis:
- Performance --slower
- Integrity --difficult
-
Correctness
- Completeness
- Reconstruction
- Disjointness
-
Types
- Horizontal
- Vertical
- Mixed
Use relational algebra to obtain fragmentation of different type
-
-
Allocation
each relation is stored at the site with optimal distribution
-
Replication
copy of fragment may be maintained at several sites
Strategy for data allocation
- Centralized --distributed processing
- Fragmented
- Complete Replication
- Selective Replication
Transparency
- Distribution transparency
Feels like a single logical entity- fragmentation transparency
- location transparency
- replication transparency
-
Naming transparency
DDBMS must ensure two sites don’t have same name objectsSolution
- create central name sever
- add prefix to the name of each object
-
Transaction transparency
- distributed transactions maintain DDB’s integrity and consistency
- access data stored in more than one site
- divided into subtransactions
- DDBMS ensures synchronization
-
Performance transparency
- like a centralized DBMS
- DQP(Distributed Query Processor): generate optimized execution strategy, deal with cost of different types(like communication cost, which is most significant)
- strategies: move which data to which site
- communication time=C0 + (no_of_bits_in_message/transmission_rate)
ADV & DISADV
- ADV
- Reflects organizational structure
- Improved shareability and local autonomy
- Improved availability
- Improved reliability
- Improved performance
- Economics
- Modular growth
- DISADV
- Complexity
- Cost
- Security
- Integrity control more difficult
- Lack of standards
- Lack of experience
- Database design more complex
XML(eXtensible Markup Language)
-
is a markup specification language
-
Standard for data representation and exchange(data storage)
-
enables designers to create their own customised tags
1 | <!--Bookstore with no DTD --> |
-
Elements
- must be properly nested
- <name>Mike</name> is an element
- <EMPTYELEMENT/> is the abbreviation of empty element
-
Attributes
- Attribute is placed inside start-tag
- vs Elements
- only one in a tag
- cannot describe the structure
-
declaration
- <?xml version=“1.0” encoding=“UTF-8” standalone=“no” ?>
-
comments
- <!-- This is a comment–>
-
well-formed XML
- adheres to basic structural requirements
- single root element
- matched tags, proper nesting
- unique attributes within elements
- adheres to basic structural requirements
-
Displaying: Use rule-based languages to translate to HTML
-
DTD(Document Type Definitions)
- Defines the valid syntax of an XML document.
- Specifies elements, attributes, nesting, ordering, number of occurrences, also special attribute types ID and IDREF(S).
-
XSD(XML Schema Definition)
- XML schema is the definition (both in terms of its organization and its data types) of a specific XML structure.
- • Type • Key declaration • References • Occurrence constraints
-
Relational Model vs XML
Relation XML structure tables(rows & columns) hierarchical(tree) schema fixed flexible queries SQL not simple ordering none ordered implementation mature add-on
Data Mining
-
Data Mining
- The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
- Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
-
application
- Retail / Marketing
- Banking
- Insurance
- Medicine
-
Operations
- Predictive modelling
- Database segmentation
- Link analysis
- Deviation detection
-
Techniques
specific implementations of the data mining operations.
-
predictive modelling
- Classification
- tree induction
- neural induction
- supervised learning
-
linear regression
-
non-linear regression
- Classification
-
Database segmentation
-
Unsupervised learning
-
Less precise
-
-
Link analysis
- Establishing links between sets of records
-
Deviation detection
-
-
Supervised learning vs unsupervised learning
- Supervised learning
- Learning patterns from labelled datasets
- 2 phases: training and testing
- Unsupervised learning
- From unlabelled datasets
- Supervised learning
-
Stages
-
Data selection
-
Data cleaning
-
Data mining/transformation
-
Data evaluation
-
-
Data warehouse
- Well equipped for mining
- Store the data for mining
- Populated with clean, consistent data
NoSQL
-
Big data: the amount data being created has skyrocketed
-
Not only SQL / Not relational
-
Features
- flexible schema
- quicker cheaper to set up
- massive scalability
- relaxed consistency, high performance, availability(client accessible)
- N: more programming
- N: fewer guarantees
-
BASE
Relaxation of ACID properties(No ACID constraints)basically available, soft state, eventually consistent
-
CAP theorem
assumption: many nodes, nodes contain replicas of partitions of data
- consistency: data has the same value at the same time
- availability: available even if a sever is down
- partition tolerance: A query has a answer though system is partitioned
you can pick 2 from CAP to satisfy the requirement
-
MapReduce
- map: filtering and sorting
- reduce: summary
-
Key-value Stores: efficiency, scalability, fault-tolerance
- model: key-value: pairs
-
Document Stores
- model: key-document: pairs
-
Graph database system
- node-edge
-
semi-structured
- best for massive data that is used for read
- not for frequently updating
-
Not used for
- require frequent updates
- require ACID(~= integrity and atomicity)
- require more structured