Using the example query from Figure 2.1 in Basic Syntax (Single-table Query): Clauses and Language Elements we established a basic understanding of SQL syntax (the first “s”) using a single-table query to illustrate the SELECT statement, the binding order of its clauses, and common SQL language elements. The natural next step for understanding SQL syntax is to introduce the ON/JOIN subclauses for performing multi-table SELECT statements — one of, if not the most critical DBI concepts. Before delving into ON/JOIN syntax, though, we’ll cover some foundational knowledge about the basic structure (or schema) of relational database management systems (RDBMS) — the second “s.” This chapter will focus on this concept of schema (and how to navigate it) before syntactically evaluating a multi-table example query in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses.
Relational database management systems (RDBMS’s) range considerably in their complexity and size, but consist of many themed tables (formally called relations) with schematic uniqueness and relatedness through the assignment of primary keys and foreign keys, respectively. Within tables there are rows (formally called tuples) and columns (formally called attributes). The importance of these concepts lies in understanding table relationships, which includes the mapping between tables, the participation (required or optional), and the multiplicity (e.g., one-to-one, one-to-many, many-to-many). Internalizing those concepts allows the DBI to proceed with more certainty about their expected query results, the structure of their output, and whether it matches the desired level of analysis (more about this later).
Tables are the functional units within an RDBMS schema and the primary objects of interest when performing JOIN operations. Understanding how an RDBMS system organizes its data into entities, and potentially further into types or hierarchies, is the first step in choosing tables for an analysis. As discussed in the Introduction the Epic Clarity Data Dictionary can be searched in various ways to help identify potentially relevant tables and their contents and relationships.
Each row from an RDBMS table is unique and represents an atomic unit of relevance. For example, in HFHS Epic that could be patient encounters, medication administration record actions, laboratory component results, flowsheet row documentation, and billing claims. Put more simply, if you perform a single table query that results in 1000 rows, the relevant question should be, A thousand what? Furthermore, incorporating table joins into queries can directly affect the atomic unit (or granularity) of each row by changing the cardinality which in turn can drastically affect interpretation of results. As we’ll see in the next section, the uniqueness of a table’s rows is schematically determined by its columns.
Columns serve multiple functions within a table, the most important of which for DBIs are the enforcement of each row’s uniqueness, linkage to other tables, indexing, and — the most practical aspect — information about each row or its raw values/measurements.
Ensuring the uniqueness of table rows is part of columns’ schematic function. But at the end of the day users are more interested in the practical information that comes from table columns, regardless of their schematic role — the columns help tell the story. This chapter is dedicated more towards syntax and schema, but later chapters will apply these principles towards real use cases, which will draw upon more practical knowledge about the Clarity schema.
Uniqueness is a fundamental constraint for rows in a relational database table. The reason for this (besides sound database design and function) is practical and goes back to the key question users should ask themselves whenever they envision their ultimate query results — what is the meaning of each row and how do we work with that analytically. Below are some important operational definitions for ascertaining a table’s schematic uniqueness and thus understanding the ultimate meaning of rows from source tables and query results:
Candidate Key or Minimal Superkey: The minimal set of columns (≥ 1) that is necessary to uniquely identify the rows within a table. The value or combination of values within this minimal set of columns appear only once in the table.
Primary Key: The candidate key selected as the official unique identifier for rows in a given table. Candidate keys not selected as the primary key are sometimes referred to as alternate keys.
So far the only keys we have discussed are candidate keys and their associated entities, which conferred table rows their schematic uniqueness. However, candidates keys represent a fraction of the table columns that could be used for performing join operations. Table joins are simply the process of combining one (i.e., self-joining) or more tables based on reference columns from each table and a set of conditions — any column can participate in a join operation (including columns that should never be joined together — e.g., two tables with “comments” columns whose contents will seldom match up). The syntax will be covered in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses. Additionally, table joins can be reductive, neutral, or additive and thus have the potential to change the granularity, or atomic unit of analysis, for a result set — this is a critical schematic concept called join cardinality and will be discussed in greater detail in Schema and Multi-table Queries.
The concept of join columns is distinct from foreign keys; which determines a table’s schematic relatedness, or its referential integrity within the schema. Foreign keys are simply source table columns that refer to the primary key of another table. As mentioned earlier, any column can participate in a table join, regardless if the source column is referencing a primary key or non-primary key in the destination table. Nonetheless, DBIs can identify potentially useful join prospects by consulting with the “Foreign Key Information” section of the Clarity Data Dictionary — using column names (regardless if they’re foreign keys or not) as search terms in the data dictionary can also help accomplish the same thing. Cross-referencing column names to their join prospects is another essential workflow for learning a database schema and deriving practical knowledge for query design.
Indexing refers to the physical storage of table data; which implicates the structure and speed with which queries access that data. Database engineering is outside the scope of this manual (and its author’s knowledge) but table indexes are useful to DBIs for authoring more performant SQL code. In the Clarity Data Dictionary near the top after the “Primary Key” section is the “Index Information” section. There users can determine if a table column influences the storage structure of that table’s data.
For the sake of simple discussion, the most common query performance benefit of indexed columns is when they are part of predicates in the ON/JOIN subclauses (as we’ll discuss in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses or within the join or search condition of an ON subclause (which is a predicate) or WHERE clause. Indexed columns as part of expressions in the SELECT list may also help optimize query performance. Ultimately, of all the Table Columns (attributes) subsections this one is by far the least important, but could be of use for those DBIs who have progressed to the point where query execution time becomes important. Obviously, intended results that take longer are more desirable than unintended results that execute quickly.