Block 1

Intro

  1. Data:

    facts and statistics collected together for reference or analysis

    e.g. sounds, images, text, values

  2. Database:

    • Collection of logically related data
    • Description of this data
    • Meets information needs of organisation
    • A shared collection of logically related data and its description, designed to meet the information needs of an organization.
    • e.g. student management database, library database
  3. DBMS:

    • A software system which enables users to define, create, maintain, and control access to the database.
    • e.g. Oracle, MySQL, Access
  4. Data model:

    • A model is a representation of ‘real world’ objects and events, and their associations.
    • A data model is a graphical description of the components of a database.
    • An integrated collection of concepts for describing and manipulating data, relationships between data, and constraints on the data in an organization.(textbook)
    • a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.(dbs concept)
    • e.g. ER model, relational model
  5. Schema:

    • The description of the database is the database schema.
  6. The three-level ANSI-SPARC architecture:

    • Objective of three level ANSI-SPARC architecture: Separate each user’s view of the database from the way the database is physically represented.
  7. ADV of DBMS:

    • Control of data redundancy
    • Data consistency
    • More information from the same amount of data
    • Sharing of data
    • Improved data integrity
    • Improved security
    • Enforcement of standards
    • Economy of scale
    • Balance of conflicting requirements
    • Improved data accessibility and responsiveness
    • Increased productivity
    • Improved maintenance through data independence
    • Increased concurrency
    • Improved backup and recovery services

  8. DISV of DBMS:

    • Complexity
    • Size
    • Cost of DBMS
    • Additional hardware costs
    • Cost of conversion
    • Performance
    • Higher impact of a failure

Relational model

  1. Relational model:

    In the relational model, all data is logically structured within relations (tables).

  2. Relation:

    A relation is a table with columns and rows.

  3. Attribute:

    • An attribute is a named column of a relation.
    • a property that describes some aspect of the object that we wish to record.
  4. Domain:

    the set of allowable values for one or more attributes.

  5. Tuple:

    A tuple is a row of a relation.

  6. Degree:

    the number of attributes in a relation.

  7. Cardinality:

    the number of tuples in a relation.

  8. Relational database:

    A collection of normalized relations with distinct relation names.

  9. Candidate Key:

    A set of attributes that uniquely identifies a tuple within a relation.

  10. Primary Key:

    Candidate key selected to identify tuples uniquely within relation.

  11. Foreign Key:

    Attribute, or set of attributes, within one relation that matches candidate key of some other (possibly same) relation.

  12. NULLs:

    Represents a value for an attribute that is currently unknown or is not applicable for this tuple.

  13. Entity Integrity:

    In a base relation, no attribute of a primary key can be null.

  14. Referential Integrity:

    If foreign key exists in a relation, either foreign key value must match a candidate key value of some tuple in its home relation or foreign key value must be null.

  15. General Constraints:

    additional rules on attribute values, specified by users or database administrators

  16. Entity-relationship (E-R) modelling :

    E-R modelling is a high-level conceptual modelling technique for DB applications

  17. Entity:

    • An entity is a “thing” about which data should be stored.
    • Group of objects with same properties, identified by enterprise as having an independent existence.
    • a distinct object in the organization that is to be represented in the database.
  18. Relationship:

    A relationship is a meaningful association between entities

  19. Multiplicity:

    number (or range) of possible occurrences of an entity type that may relate to a single occurrence of an associated entity type through a particular relationship.

  20. Cardinality:

    Describes maximum number of possible relationship occurrences for an entity participating in a given relationship type.

  21. Participation:

    • Determines whether all or only some entity occurrences participate in a relationship.
    • Gives the minimum number for an entity occurrences participating in a given relationship type.
  22. Degree of a Relationship:

    Number of participating entities in a relationship.

  23. Ternary relationship:

    three entities are simultaneously involved.
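The integrity rules above (points 13 and 14) can be sketched in code. This is a hypothetical illustration, not part of the notes: relations are modelled as lists of dicts, NULL is Python's `None`, and the `student`/`dept` names are invented.

```python
# Sketch of entity and referential integrity over in-memory "relations"
# (lists of dicts). All relation and attribute names are made up.

def entity_integrity(relation, pk_attrs):
    """Entity integrity: no attribute of the primary key may be NULL (None)."""
    return all(t[a] is not None for t in relation for a in pk_attrs)

def referential_integrity(child, fk_attr, parent, ck_attr):
    """Referential integrity: each FK value is NULL or matches a CK value."""
    parent_keys = {t[ck_attr] for t in parent}
    return all(t[fk_attr] is None or t[fk_attr] in parent_keys for t in child)

dept = [{"dept_id": "D1"}, {"dept_id": "D2"}]
student = [{"sid": 1, "dept_id": "D1"},
           {"sid": 2, "dept_id": None}]   # NULL FK is allowed

print(entity_integrity(student, ["sid"]))                          # True
print(referential_integrity(student, "dept_id", dept, "dept_id"))  # True
```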

Block 2

EER

  1. EER model(enhanced entity-relationship model)
  2. specialization/ generalization
  3. participation constraint
    • mandatory
    • optional
  4. disjoint constraint
    • disjoint(or)
    • nondisjoint(and)

Logical Database Design

  1. each entity is a relation, break composite attributes
  2. one-to-many: post PK of one-side entity into the many-side entity which acts as a foreign key
  3. one-to-one: put the foreign keys in one or both entities
    • both mandatory: combine
    • one side mandatory: from optional to mandatory
    • both optional: new relation like many to many
  4. one-to-one recursive:
    • both mandatory: same relation with a foreign key
    • both optional: new relation with PK and foreign key
  5. many-to-many: new relation
  6. mandatory and: one relation
  7. optional and: two relations for superclass and subclasses
  8. mandatory or: each combination of superclass and subclass
  9. optional or: one relation for the superclass and one relation for each subclass
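As a rough illustration of rules 2 and 5 above (hypothetical example relations with invented names like `branch`, `staff` and `enrols`):

```python
# Rule 2 (one-to-many): the PK of the one-side (branch_no) is posted into
# the many-side relation, where it acts as a foreign key.
branch = [{"branch_no": "B1", "city": "Glasgow"}]                 # one side
staff = [{"staff_no": "S1", "name": "Ann", "branch_no": "B1"}]    # many side, FK posted

# Rule 5 (many-to-many): create a new relation whose PK is the
# combination of the two posted foreign keys.
enrols = [{"student_id": 1, "course_id": "C1"}]  # PK = (student_id, course_id)

print(staff[0]["branch_no"])  # B1 -- links the staff tuple to its branch
```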

Database design

main phases:

  • Gather requirements
  • Conceptual database design
  • Logical database design
  • Physical database design

Block 3

Normalization

  1. Data Redundancy

    • the details of some data repeat in one relation
  2. Update Anomalies

    • Insertion anomalies

      • insert some data but data in the same tuple is absent
    • Deletion anomalies

      • delete some data but other important data is deleted as well
    • Modification anomalies

      • you must update the data many times
  3. Normalization

    produce a suitable set of relations that support the data requirements of an enterprise

    • minimal number of necessary attributes
    • close logical relationship
    • minimal redundancy (deals with update anomalies ← insertion, deletion, and modification anomalies)
    • benefits: access and maintain easily, minimal storage space
  4. Decomposition

    • Lossless-join property
    • Dependency preservation property
  5. Functional Dependencies

    • if each value of A in R is associated with exactly one value of B in R.
    • determinant: the left-side attributes set
    • must hold for all possible values, not just those in the current instance
  6. Full Functional Dependency

    • functional dependency, but not on proper subsets of determinant
    • partial dependency: not full dependency, some attributes can be removed from determinant
  7. Transitive Dependencies

    If A → B and B → C, then C is transitively dependent on A via B

  8. find FDs

    • relationships between attributes
    • requirements & assumptions
    • common experience
  9. find CKs

    • attributes that appear only on the left-hand sides of FDs must be part of every CK
    • attributes that appear only on the right-hand sides cannot be part of any CK
    • attributes that appear on neither side must be part of every CK
  10. UNF

    a table that contains one or more repeating groups (i.e. a cell can hold more than one value)

  11. 1NF

    • A relation in which the intersection of each row and column contains one and only one value (i.e. one cell, one value).
    • UNF → 1NF: find the repeating groups, then either flatten them or move them into new relations
  12. 2NF

    • in 1NF, every non-primary-key attribute is fully functionally dependent on the PK
    • 1NF → 2NF:
      • identify FDs
      • identify PK
      • if PFD exists, remove those attributes and place them into a new relation with their determinants
  13. 3NF

    • in 2NF, and no non-PK is transitively dependent on the PK
    • 2NF → 3NF:
      • identify FDs
      • identify PK
      • if TD exists, remove those attributes and place them into a new relation with their determinants
  14. General 2NF & 3NF

    • define on non-CK and CK
  15. BCNF

    • A relation is in BCNF if and only if every determinant is a candidate key

    • 3NF → BCNF:

      • find FDs that violate BCNF
      • decompose
      • find FDs
      • find CKs
      • repeat until all relations are in BCNF
    • BCNF problem

      with AB → C and C → B, decomposing into BCNF cannot preserve both FDs (AB → C is lost)

  16. MVD(Multi-valued Dependency)

    • Dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the set of values for B and C are independent of each other.
    • A ↠ B, A ↠ C
    • trivial MVD: B is a subset of A, or A ∪ B = R
  17. 4NF

    • a relation is in 4NF if and only if, for every nontrivial multi-valued dependency A ↠ B, A is a candidate key of the relation.
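The FD definition above (point 5) can be checked mechanically on a relation instance. A minimal sketch with an invented `fd_holds` helper and made-up data; note that a real FD must hold for all possible instances, so a check on one instance can only refute an FD, never prove it:

```python
# Check whether A -> B holds in a given relation instance
# (relation = list of dicts). Helper and data are invented examples.

def fd_holds(relation, lhs, rhs):
    """A -> B holds if equal LHS values always imply equal RHS values."""
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False   # same determinant value, different dependent value
    return True

r = [
    {"staff": "S1", "branch": "B1", "city": "Glasgow"},
    {"staff": "S2", "branch": "B1", "city": "Glasgow"},
    {"staff": "S3", "branch": "B2", "city": "London"},
]
print(fd_holds(r, ["branch"], ["city"]))  # True: branch -> city
print(fd_holds(r, ["city"], ["staff"]))   # False: same city, different staff
```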

Transaction Management

Transaction

series of actions that read or update the contents of the database (actions: read, write, etc.)

  • ACID property

    • Atomicity: all actions are completed or nothing is completed
    • Consistency: a transaction must transform the database from one consistent state to another consistent state.
    • Isolation: each transaction is unaware of other transactions executing concurrently in the system(Independence)
    • Durability: change is permanent
  • outcomes

    • commit
    • roll back

Concurrency

lots of transactions executed at the same time.

  1. problems(violates ACID property, especially Isolation)

    • Lost Update Problem

      one transaction's update is overwritten by another's

      | time | T1       | T2      | X      |
      |------|----------|---------|--------|
      | t1   |          | begin   | 100    |
      | t2   | begin    | read(X) | 100    |
      | t3   | read(X)  | X=X+100 | 100    |
      | t4   | X=X-10   | write(X)| 200    |
      | t5   | write(X) | commit  | 90/190 |
      | t6   | commit   |         | 90     |
    • Uncommitted Dependency Problem

      another transaction reads uncommitted data

      | time | T3       | T4       | X      |
      |------|----------|----------|--------|
      | t1   |          | begin    | 100    |
      | t2   |          | read(X)  | 100    |
      | t3   |          | X=X+100  | 100    |
      | t4   | begin    | write(X) | 200    |
      | t5   | read(X)  |          | 200    |
      | t6   | X=X-10   | rollback | 100    |
      | t7   | write(X) |          | 190/90 |
      | t8   | commit   |          | 190    |
    • Inconsistent Analysis Problem

      a transaction reads values that are being updated by another transaction, producing a wrong analysis

      | time | T5       | T6           | X   | Y  | Z  | sum    |
      |------|----------|--------------|-----|----|----|--------|
      | t1   | begin    |              | 100 | 50 | 25 |        |
      | t2   |          | begin, sum=0 | 100 | 50 | 25 | 0      |
      | t3   | read(X)  | read(X)      | 100 | 50 | 25 | 0      |
      | t4   | X=X-10   | sum=sum+X    | 100 | 50 | 25 | 100/90 |
      | t5   | write(X) | read(Y)      | 90  | 50 | 25 | 100    |
      | t6   | read(Z)  | sum=sum+Y    | 90  | 50 | 25 | 150    |
      | t7   | Z=Z+10   |              | 90  | 50 | 25 | 150    |
      | t8   | write(Z) |              | 90  | 50 | 35 | 150    |
      | t9   | commit   | read(Z)      | 90  | 50 | 35 | 150    |
      | t10  |          | sum=sum+Z    | 90  | 50 | 35 | 180    |
      | t11  |          | commit       | 90  | 50 | 35 | 180    |
  2. schedule

    sequence of operations

    • serial schedule: no interleaved operations
    • serialisability: a nonserial schedule is serializable if it is equivalent to some serial schedule
    • conflict: two operations use the same resource and at least one of them is a write
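The lost-update interleaving in the first table above can be replayed in plain Python. This is a sketch, not a real DBMS: local variables stand in for each transaction's private buffer.

```python
# Replay of the lost-update schedule: T2 adds 100, T1 subtracts 10,
# but T1 writes a value it read *before* T2's write, so the +100 is lost.

X = 100
t2_local = X        # t2: T2 read(X)
t1_local = X        # t3: T1 read(X)
t2_local += 100     # t3: T2 X = X + 100
X = t2_local        # t4: T2 write(X) -> 200
t1_local -= 10      # t4: T1 X = X - 10 (uses the stale value 100)
X = t1_local        # t5: T1 write(X) -> 90; T2's update is lost
print(X)            # 90, not the correct serial result 190
```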

Concurrency Control Techniques

  1. locking

    • shared lock (read) -- other transactions may read, but not change, the item simultaneously
    • exclusive lock (write) -- no other transaction can read or change the item simultaneously
  2. 2-phase locking

    • growing phase-- acquire locks, no releasing
    • shrinking phase-- release locks, no acquiring
  3. preventing problems with 2PL

  4. Cascading Rollback

    One transaction’s failure causes many to fail

    • how to deal with it? Postpone the release of all locks until the end of the transaction.
  5. Deadlock

    two (or more) transactions are each waiting for locks held by the other to be released(crossed locks)

    • how to deal with?

      • Timeouts

        A transaction that requests a lock waits for only a system-defined period of time; if the lock is not granted within this period, the request times out and the transaction is rolled back.

      • Deadlock prevention

        Conservative 2PL: All data items have to be locked at the beginning of a transaction

  6. Timestamping

  7. Granularity

  8. Database Recovery

    • Log File

    • Checkpointing

      undo transactions that were still active at t_failed

      redo transactions that committed between t_checkpoint and t_failed
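The shared/exclusive lock rules from point 1 can be condensed into a small compatibility check. A sketch with an invented `can_grant` helper, not a real lock manager:

```python
# Lock compatibility: shared (S) locks are compatible with each other;
# an exclusive (X) lock is compatible with nothing. Names are made up.

def can_grant(requested, held):
    """held: list of lock modes already granted on the data item."""
    if requested == "S":
        return all(m == "S" for m in held)  # read OK unless a writer holds X
    return not held                         # X requires the item to be unlocked

print(can_grant("S", ["S", "S"]))  # True: many readers can share
print(can_grant("X", ["S"]))       # False: a reader blocks the writer
print(can_grant("X", []))          # True: unlocked item
```

Under 2PL, a transaction acquires all such locks in its growing phase and releases them only in its shrinking phase.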

Block 4

DDBMS

Architecture

  1. 2-tier architecture
    Client, database server
  2. 3-tier architecture
    Client (user interface), application server (data processing, business logic), database server (data validation, database access)
  3. Multi-tier
    Extends from 3-tier

Distributed Database

  1. Distributed Database
    Interrelated collection of data, physically spread over a computer network
  2. Distributed DBMS
    Software system that permits management of the distributed database and makes the distribution transparent to users
    Single logical database split into fragments that are distributed over a computer network
  3. Distributed processing
    Centralized database
    Can be accessed by lots of computers over computer network

Types

  1. Homogeneous DDBMS
    all sites use same DBMS product
  2. Heterogeneous DDBMS
    Different DBMS products
    Solution: gateways

Architecture for DDBMS

  1. Expected functionality
    Communication
    Data dictionary
    Query processing
    Concurrency control
    Recovery
  2. Reference architecture
    Global external schemas
    Global conceptual schema
    Fragmentation schema & allocation schema
    Schema for each local DBMS
  3. Components
    Local DBMS
    Data communication
    Global system catalogue
    DBMS

Design – key issues

  1. Fragmentation

    relation may be divided into a number of sub-relations, which are distributed

    • Dis:

      • Performance --slower
      • Integrity --difficult
    • Correctness

      • Completeness
      • Reconstruction
      • Disjointness
    • Types

      • Horizontal
      • Vertical
      • Mixed

      Use relational algebra to obtain fragmentation of different type

  2. Allocation

    each fragment is stored at the site with "optimal" distribution

  3. Replication

    copy of fragment may be maintained at several sites
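The relational-algebra view of fragmentation (selection for horizontal fragments, projection for vertical ones) can be mimicked on in-memory data. An illustrative sketch with made-up relation contents; note the vertical fragment keeps the PK so the relation can be reconstructed by join:

```python
# Horizontal and vertical fragmentation sketched with list/dict comprehensions.
# All relation contents and names are invented.

staff = [
    {"sno": "S1", "name": "Ann", "branch": "B1"},
    {"sno": "S2", "name": "Bob", "branch": "B2"},
]

# Horizontal: sigma_{branch=B1}(Staff) -- whole tuples for one site's data
frag_b1 = [t for t in staff if t["branch"] == "B1"]

# Vertical: pi_{sno,name}(Staff) -- keep the PK in every fragment
# so the original relation can be rebuilt (reconstruction correctness rule)
frag_names = [{"sno": t["sno"], "name": t["name"]} for t in staff]

print(frag_b1)     # [{'sno': 'S1', 'name': 'Ann', 'branch': 'B1'}]
print(frag_names)  # [{'sno': 'S1', 'name': 'Ann'}, {'sno': 'S2', 'name': 'Bob'}]
```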

Strategy for data allocation
  1. Centralized -- a single database at one site (equivalent to distributed processing)
  2. Fragmented
  3. Complete Replication
  4. Selective Replication

Transparency

  1. Distribution transparency
    Feels like a single logical entity
    • fragmentation transparency
    • location transparency
    • replication transparency
    • naming transparency
      DDBMS must ensure that no two sites create a database object with the same name

      Solutions:

      • create a central name server
      • add a site prefix to the name of each object
  2. Transaction transparency

    • distributed transactions maintain DDB’s integrity and consistency
    • access data stored in more than one site
    • divided into subtransactions
    • DDBMS ensures synchronization
  3. Performance transparency

    • like a centralized DBMS
    • DQP (Distributed Query Processor): generates an optimized execution strategy, accounting for costs of different types (e.g. communication cost, which is usually the most significant)
    • strategies: move which data to which site
    • communication time = C0 + (no_of_bits_in_message / transmission_rate)
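A worked example of the communication-time formula above; C0 (the fixed cost of initiating a message) and the other numbers are invented for illustration:

```python
# communication time = C0 + (no_of_bits_in_message / transmission_rate)

def comm_time(c0, bits, rate):
    """c0: fixed message-initiation cost (s); rate: bits per second."""
    return c0 + bits / rate

# e.g. 1 s setup, 10^8-bit message over a 10^6 bit/s link:
print(comm_time(1, 10**8, 10**6))  # 101.0 seconds -- transfer dominates
```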

ADV & DISADV

  • ADV
    • Reflects organizational structure
    • Improved shareability and local autonomy
    • Improved availability
    • Improved reliability
    • Improved performance
    • Economics
    • Modular growth
  • DISADV
    • Complexity
    • Cost
    • Security
    • Integrity control more difficult
    • Lack of standards
    • Lack of experience
    • Database design more complex

XML(eXtensible Markup Language)

  • is a markup specification language

  • Standard for data representation and exchange(data storage)

  • enables designers to create their own customised tags

<!-- Bookstore with no DTD -->
<Bookstore>
  <Book ISBN="0590353403" price="20" Edition="1st">
    <Title>Harry Potter</Title>
    <Author>
      <FirstName>J.K.</FirstName>
      <LastName>Rowling</LastName>
    </Author>
  </Book>
</Bookstore>
  1. Elements

    • must be properly nested
    • <name>Mike</name> is an element
    • <EMPTYELEMENT/> is the abbreviation of empty element
  2. Attributes

    • Attribute is placed inside start-tag
    • vs elements
      • a given attribute name may appear only once per tag
      • attributes cannot describe nested structure
  3. declaration

    • <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
  4. comments

    • <!-- This is a comment -->
  5. well-formed XML

    • adheres to basic structural requirements
      • single root element
      • matched tags, proper nesting
      • unique attributes within elements
  6. Displaying: Use rule-based languages to translate to HTML

  7. DTD(Document Type Definitions)

    • Defines the valid syntax of an XML document.
    • Specifies elements, attributes, nesting, ordering, number of occurrences, also special attribute types ID and IDREF(S).
  8. XSD(XML Schema Definition)

    • XML schema is the definition (both in terms of its organization and its data types) of a specific XML structure.
    • covers: types, key declarations, references, occurrence constraints
  9. Relational Model vs XML

    |                | Relational              | XML                 |
    |----------------|-------------------------|---------------------|
    | structure      | tables (rows & columns) | hierarchical (tree) |
    | schema         | fixed                   | flexible            |
    | queries        | SQL                     | not simple          |
    | ordering       | none                    | ordered             |
    | implementation | mature                  | add-on              |
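Well-formedness (point 5) can be checked by any XML parser; a sketch using Python's standard library on a document like the bookstore example above (a parse error would mean the document is not well-formed):

```python
# Parse a small XML document; ET.fromstring raises ParseError on
# mismatched tags, multiple roots, duplicate attributes, etc.
import xml.etree.ElementTree as ET

doc = """<Bookstore>
  <Book ISBN="0590353403" price="20" Edition="1st">
    <Title>Harry Potter</Title>
    <Author><FirstName>J.K.</FirstName><LastName>Rowling</LastName></Author>
  </Book>
</Bookstore>"""

root = ET.fromstring(doc)             # succeeds: the document is well-formed
print(root.tag)                       # Bookstore (single root element)
print(root.find("Book").get("ISBN"))  # 0590353403 (attribute access)
print(root.find("Book/Title").text)   # Harry Potter (element content)
```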

Data Mining

  1. Data Mining

    • The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
    • Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
  2. application

    • Retail / Marketing
    • Banking
    • Insurance
    • Medicine
  3. Operations

    • Predictive modelling
    • Database segmentation
    • Link analysis
    • Deviation detection
  4. Techniques

    specific implementations of the data mining operations.

    • predictive modelling

      • Classification
        • tree induction
        • neural induction
      • supervised learning
      • linear regression

      • non-linear regression

    • Database segmentation

      • Unsupervised learning

      • Less precise

    • Link analysis

      • Establishing links between sets of records
    • Deviation detection

  5. Supervised learning vs unsupervised learning

    • Supervised learning
      • Learning patterns from labelled datasets
      • 2 phases: training and testing
    • Unsupervised learning
      • From unlabelled datasets
  6. Stages

    • Data selection

    • Data cleaning

    • Data mining/transformation

    • Data evaluation

  7. Data warehouse

    • Well equipped for mining
    • Store the data for mining
    • Populated with clean, consistent data
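As a toy illustration of supervised learning and predictive modelling (points 4 and 5), a hand-rolled nearest-neighbour classifier on invented labelled data; real techniques such as tree or neural induction are far more involved:

```python
# Supervised learning in miniature: "train" on labelled examples,
# then "test" by classifying unseen values. Data and names are made up.

def nearest_neighbour(train, x):
    """train: list of (feature, label); predict the label of the nearest feature."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# training phase: patterns learned from a labelled dataset
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

# testing phase: classify values the model has not seen
print(nearest_neighbour(train, 1.5))  # low
print(nearest_neighbour(train, 8.5))  # high
```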

NoSQL

  1. Big data: the amount of data being created has skyrocketed

  2. Not only SQL / Not relational

  3. Features

    • flexible schema
    • quicker and cheaper to set up
    • massive scalability
    • relaxed consistency, high performance, availability(client accessible)
    • negative: more programming
    • negative: fewer guarantees
  4. BASE
    Relaxation of ACID properties(No ACID constraints)

    basically available, soft state, eventually consistent

  5. CAP theorem

    assumption: many nodes, nodes contain replicas of partitions of data

    • consistency: data has the same value at the same time
    • availability: available even if a server is down
    • partition tolerance: a query still gets an answer even though the system is partitioned

    you can guarantee at most two of the three CAP properties at once

  6. MapReduce

    • map: filtering and sorting
    • reduce: summary
  7. Key-value Stores: efficiency, scalability, fault-tolerance

    • model: key-value pairs
  8. Document Stores

    • model: key-document pairs
  9. Graph database system

    • node-edge
  10. semi-structured

    • best for massive data that is mostly read
    • not for frequently updated data
  11. Not used for

    • applications that require frequent updates
    • applications that require ACID guarantees (≈ integrity and atomicity)
    • applications that require more structured data
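The MapReduce idea from point 6 can be sketched without any framework: map emits key-value pairs, the pairs are sorted by key (the shuffle step), and reduce summarises each group. Word count is the classic example; the helper names here are invented:

```python
# Word count as MapReduce: map = emit (word, 1) pairs, reduce = sum per word.
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Map: split each document and emit a (word, 1) pair per occurrence."""
    return [(w, 1) for d in docs for w in d.split()]

def reduce_phase(pairs):
    """Shuffle (sort by key), then reduce: sum the counts for each word."""
    pairs.sort(key=itemgetter(0))
    return {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}

print(reduce_phase(map_phase(["to be", "or not to be"])))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```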