While my articles have been well received, there is always something new to cover. Every week I get requests for new sample code and articles explaining some new aspect of using Visual Basic and MySQL. One of the recurring requests has been for a full tutorial covering the creation of an application, from design to deployment, using Visual Basic and MySQL. In response to these requests I have written this article as the first of a series of articles that will provide a full tutorial for the creation of an application using Visual Basic.NET and MySQL. I will cover application design, database design, VB.NET database programming, and deployment of the finished application.
This tutorial will be very hands-on, documenting the creation of a real application that will then be available for download in both source and binary form for future reference. Intermediate files will be available after each tutorial section that can be used to familiarize yourself with the project as it progresses.
In this first installment of the series I will be covering the design of our application’s database. A well-designed database can be critical to the success of your database project, allowing for increased expandability and scalability, not to mention it makes for easier development. In this tutorial I will describe the steps needed to begin your application design, then move on to the design of your database, finishing up with the writing of your table creation statements.
Choosing an Application
The very first thing required when creating an application is a need. Computer programs are created to fulfill needs that the application users have. In my office we have many staff members, each of whom may or may not be in the office at any given time. We currently make use of a simple application that tracks which employees are in the office and displays a short message for those who are out of the office indicating when they will return.
This type of in/out tracking software is fairly common, but this type of application often comes with two limitations; many of these applications are limited to only a certain number of users, usually in the area of 50-100. In addition, these applications often use flat files to store information; requiring that each user have access to a shared network directory in order to use the system. Both of these limitations can be overcome by using an SQL (pronounced “ess-que-ell”, not “sequel”) database. If the application we use at the office had its source code available, modifying it would have been an option (freedom to modify an application to suit your needs is one of the great benefits of using Open Source software and tools). Since the application in question is a relatively simple one, I have chosen to rewrite it using Visual Basic and MySQL.
The idea for your application may come from a desire to “build a better mousetrap” – developing a better version of an existing application. On the other hand you may have a need for an application but cannot find a suitable existing project. Finally, the idea or need may be that of someone else; either a client or your employer. In any case, the first step when developing an application is deciding which application to develop.
Listing Requirements and Features
Once you have determined what application you will be developing, it is important to then decide what specifically the application will do. You will want to define its basic functionality, along with certain features that the software will implement. If you are developing this software for yourself, this stage in the development process can be fairly informal, but it is nevertheless very important. If you are developing software for a client or an employer then creating a list of features and requirements is vital. The list will determine what your responsibilities are as a software developer and will give a clear definition of what is required for your project to be considered complete. It should also be remembered that clients and employers have a tendency to request additional features as the project progresses while not wanting to pay any additional funds to implement them. If you have a list of features in hand you can inform your customer that your job is to implement the listed features and that any additional features will carry an additional cost.
In addition to listing features and requirements, you will also want to develop a timeline and a budget. You would want to schedule significant milestones such as project completion and major phases of work being done. For example, you may wish to schedule a milestone to signify the delivery of application and database designs to the customer. You may also want a milestone for delivery of a working demonstration of the project. Often these milestones coincide with payments from the client when you work as a consultant, with the payments being defined in your budget.
The Development Triangle
On the subject of money and features and timelines, it is important to understand the balance that must exist between the three. Your client will undoubtedly want you to deliver as many features as possible within a small budget and a short time period. It will be important for you to realize that you may sometimes need to reject a requested feature or timeline in order to keep everything in balance. Imagine that time, features, and money are three corners of a triangle:
While you can stretch the corners of this triangle, you cannot change the area it occupies, as the area of the triangle represents your total resources. The impact of this is that every section can only increase at the expense of the other two: If you want a project to have more features you will have to either take more time or spend more money for additional tools or developers. If you want a project to cost less money, you either need to decrease the number of features or allow the project to take more time (this is because either you cut back on the number of developers or allow them to only work on the project in their spare time). Finally, if you want the project done faster you either have to decrease the number of features or spend more money for more tools and/or developers (or at least pay overtime for your existing developers).
When negotiating a feature list and timeline with your client or employer be sure to keep the triangle in mind, and even show it to your customer so they can understand the relationship between features, money, and time. If you are developing the project for your own use you benefit from having the perfect client, but still keep in mind that there is only so much you can do with your limited time.
let’s take a look at the feature and requirements list for our in/out tracking software:
- Login should require a username and password
- User should be prompted for a status message when logged in
- User can choose between no status message, a pre-set status message, or specify their own message when logging in or out
- Last two user-defined status messages should be available in a list along with pre-defined ones
- User should be able to login from any PC with an Internet connection, with an encrypted connection available
- User should be able to flag a fellow user and be notified when they check in
- Users should be divided into separate groups (Sales, Accounting, Programming, etc.)
- Main screen should show a list of all staff (or selected groups) with name, in/out status, status message, and custom fields
- Application should minimize to system tray
- User should be able to customize appearance, window size, font size, date format, name format, colors
- Standard users can change their status, display format, and send other users messages
- Receptionist users can change group users’ status (receptionists can be assigned to multiple groups)
- Managers can add/delete users within their groups, change their status (for their groups), and assign receptionists
- Administrators can create users, assign managers and administrators, and change group memberships. They have no restrictions
- Users should be able to choose between having the application start minimized or maximized
- Users should be able to choose refresh interval for open window
Your features and requirements list should be as detailed as needed for you and your client, and more complex applications should have more detailed documentation. Once your list of features, your budget, and your timeline is approved you can then move on to designing your application.
One good approach to application design is to look at the different entities that will be involved in the application. An entity is simply an object we want to store information about. Designing an application based on its component entities is a basic principle of object oriented programming. In our application the main entity is our users. There are four different kinds of user: Regular User, Receptionist, Manager, and Administrator. There are two approaches we can take we can take to represent our different types of user. We could describe four different entities, one for each type of user, or we could just have one user entity with an attribute that defines its type.
When different entities have a common base but still have major differences it can be beneficial to think of a base entity that is inherited by the other entities. let’s take vehicles as an example. While all vehicles have a common foundation in that they all have wheels and move forwards and backwards, they can still be very unique:
As you can see, each of the vehicles have properties common to all vehicles, such as tires, steering wheels, and radios. At the same time each vehicle is also quite unique. Cars are smaller and usually carry more passengers than trucks. Trucks carry fewer passengers but have more cargo space. Taxis are similar to cars but carry fare-paying passengers and have CB radios and fare calculating equipment. We could even have shown a taxi as being a sub-entity of a car, inheriting all the traits of a car and then adding special traits of its own.
The only unique properties for our user entity is whether they are a manager, administrator, or receptionist. With such a small difference between the different users it is better to create only one entity with a property that designates the user’s administrative responsibilities. Other entities in our application include groups and events. let’s look at each entity in detail.
A user is any user in our system. Users can login to our application, set their status, and view the status of others. A user has the following properties:
- Phone Number
- Administrator (Yes/No)
- Creation Date
- Deleted (Yes/No)
- Custom Status Messages
- Groups user manages
- Groups user is a receptionist for
The deleted flag is used to determine whether the user should appear in lists. While we could simply delete a user from our database when they leave the company, it is better to simply mark them as deleted. While this will occupy more space in the database, it is a good record-keeping practice. You may notice that there is no In/Out status associated with the user; this is because their status will be recorded in the Events entity.
A group is a set of users who share something in common. In most businesses staff members would be grouped by department; such as Accounting, Sales, and Support. It may be useful to allow users to be members of multiple groups so that a user can be in his department group and perhaps also be listed in a special group set up for an inter-departmental project. The following properties will apply to groups:
- Creation Date
- Deleted (Yes/No)
A private group is one that cannot be viewed by those outside the group. One example would be a group of executives who do not wish their status to be known by non-executives.
An event is simply a change in a user’s status. When a user signs out, an event will be generated that will record which user the event refers to, whether the user signed in or out, their status message, and who caused the event. While normally a user will cause his/her own event, receptionists are allowed to sign other users in and out. We will want to be able to track who signed a given user in or out. The properties of events are as follows:
- Status Message
Once we have our entities defined, we can add them to a diagram:
In reality there are a few more entities but these three main entities are sufficient for the purposes of this tutorial.
Now that we have our entities defined we need to look at the relationships between them. MySQL is a relational database and as such the data within it is defined by its relations. One way to define entity relationships is in terms of the quantities on each side of the relationship between two given entities. For example, a user can belong to one group (accounting), or many groups (accounting, year end team, Christmas party committee). This is called a one-to-many relationship. Each group will contain many users. The relationship between the user and group entities is a many-to-many relationship (many users belonging to many groups).
The relationship between users and events is different. An event will refer to one user, no more and no less. A user, on the other hand, will have many events that refer to him or her. This means that a one-to-many relationship will exist between the events and users.
These two types of relationship (one-to-many and many-to-many) will be taken into consideration when designing our database tables. For now let’s just draw some arrows to identify a relationship, with the arrowheads identifying the ‘many’ side of the relationship:
Relationships can be optional when a relationship can be defined as zero-to-many, as is actually the case between the user and event entities. A new user may have never signed in or out. In this case there would be no events related to the user.
Our entities will influence our software in two ways. The entities that we have designed will be used later on to lay the foundation for our database tables. In addition, these entities will also help form classes, a building block of our final application (the use of classes in application programming is referred to as object oriented programming and will be covered in an upcoming tutorial section).
There are a wide variety of ways to define data entities and their relationships. The description above is simply a general overview to get you started thinking in terms of entities, and does not represent any particular established method for describing and diagramming entities. Do you need to learn one of the established methods? That depends.
When working on your own at developing an application it is often enough to quickly sketch out the entities and their properties and relationships using a system that you understand and are comfortable with. If you find yourself managing a large project and working with other developers it becomes important to adopt a more formalized approach.
One good system that is becoming increasingly popular is UML, or the Unified Modeling Language. There are a myriad of books on the subject of UML, some of which are even specifically directed at using UML to design Visual Basic 6 applications. An example of the latter would be Developing Applications with Visual Basic and UML by Paul Reed. A good portal to UML resources is available from IBM at http://www-306.ibm.com/software/rational/uml/. Pay particular attention to the document “Introduction to the Unified Modeling Language“.
Designing The Database
Once we have our entities and their relationships defined, we can move on to designing the database schema itself. When designing a database, we will want to draw our tables, with one icon per table, and include the table name and columns that make up the table. I will be using the Visio design tool by Microsoft to produce the database diagrams, but you can use any drawing tool you have available (including the decidedly low-tech pencil). let’s start with the following:
Choosing A Primary Key
We now have all our entities listed with the fields that will make up their tables. The first thing we need to add to our tables is a PRIMARY KEY. A primary key is a column in your table that will uniquely identify every row in the table. In the User table you may be tempted to use Name or perhaps Phone Number as a primary key, but your company may have more than one John Smith, or your employees may have to share phones.
While it is possible to have more than one column together as a composite primary key (such as making the primary key the combination of Name and Phone Number), you have to be certain that every entry will be unique (what happens when the two John Smiths share the same phone?). So what do we use as a primary key for the user table? In this case it is probably best to introduce a new column, called User_ID. This column will be an integer and will be incremented automatically by MySQL through the use of the AUTO_INCREMENT keyword (more on AUTO_INCREMENT later).
For the Group table we can use the Name column as a primary key, as each group will be required to have a different name so as to avoid confusion when choosing which group to place a user in. For the Event table we could use the Timestamp column as a primary key, but there is one caveat with this approach: the TIMESTAMP data type is only accurate down to the second, which means that if two users record an event within the same second (not unthinkable on a busy system with hundreds of users), the primary key’s requirement for uniqueness will be violated. In this case we have one of two choices: we could use a combination of the Timestamp and User columns as a primary key, safely assuming that a single user is unlikely to cause two events within the same second. By using the primary key of another table (the User_ID from the User table) as part of a table’s primary key, we would create what is known as an Identifying Relationship.
An identifying relationship is one where the child table or entity cannot exist without its parent. User and Group have a Non-Identifying Relationship because a User can exist without belonging to a Group, and a Group can exist without any members. An Event, on the other hand, is meaningless without having a user to refer to.
Another option is to create a new column (Event_ID), which would once again be an AUTO_INCREMENT integer value. Typically either approach would be valid, but there is one important fact we should keep in mind: MySQL and Visual Basic use different date formats. The MySQL format of YYYY-MM-DD HH:MM:SS is often misinterpreted by Visual Basic, which uses a date format of M/D/YYYY HH:MM:SS. let’s look at the differences, first with a MySQL query and then with a simple VB example:
mysql> select now(); +---------------------+ | now() | +---------------------+ | 2004-01-05 13:29:03 | +---------------------+ 1 row in set (0.02 sec)
And now in the VB Immediate window:
?now 1/5/2004 1:25:38 PM
As you can see there are some big differences in date handling. Aside from the order of the values, MySQL uses leading zeros when representing numbers smaller than ten, and uses a 24-hour clock. These differences in how dates are handled can produce serious problems for Visual Basic as VB will be expecting one format while MySQL will be expecting another. For this reason it is recommended that you NEVER use any date type columns as part of a table’s primary key. Therefore we will use an Event_ID column as the primary key of our event table. While this does not produce an identifying relationship, we shall still treat it as one within our application.
let’s take a look at our diagram now that it has primary keys:
You may have noticed that I changed the Name column of the Group table to Group_Name. The reason for this is that primary keys are often referred to in database queries that involve multiple tables, and it is easier to say SELECT Group_Name FROM Group, User; rather than SELECT Group.Name FROM Group, User; trust me, its a lot easier this way. As you can see our primary key columns get moved into the center box of our table icons and have been highlighted in bold. With a Visio database diagram, all columns that form part of the primary key are displayed in the center box and have the PK identifier to their left. The boldface highlighting indicates that this is a required field and cannot be left blank (a field which does not have a value assigned is actually assigned the value of NULL, meaning undefined or no value).
Choosing Field Names and Required Fields
Next we will want to clean up our field names and designate the rest of our required fields. A note on field names: while you can use spaces in a field name with MySQL, it will make your life harder as you will need to enclose the name in back ticks (“) every time you want to use it. For example, to access the Creation Date column you would need to query it like this: SELECT `Creation Date` FROM Group; I personally do not like having to use back ticks and always name my columns using underscore characters to represent spaces when referring to primary keys (User_ID), and use capitalization of distinguish names for regular columns (CreationDate). When setting field names you will want to keep readability in mind; you want column names to be short enough to type quickly when using them repeatedly, but you also want to make sure that future developers (including yourself) will be able to understand their meaning:
Now that we have our columns named, let’s define the relationships between the tables. First let’s look at the one-to-many relationship between an Event and a User. When dealing with a one-to-many relationship, we place the primary key of the “many” table as a field in the “one” table (I should note that column, field, and property are synonyms). In this case we already have a User column in the Event table, but let’s make it clearer by renaming it to match the definition in the User table by calling it User_ID. In addition, we can use the relationship arrow to draw our relationship (once again the arrowhead points to the “many” table):
As you can see, an FK1 identifier has been added to the left column; FK stands for Foreign Key. A foreign key is simply a column that refers to the primary key of another table. The InnoDB table handler built into MySQL handles what is known as relational integrity. Relational integrity is the requirement that foreign keys represent real values. For example, if you tried to insert a row into the Event table that had a User_ID of 10, InnoDB would check if a User_ID of 10 existed in the User table and return an error if such a parent row did not exist. Conversely, you would not be allowed to delete a row from the User table if there were rows in the Event table that referred to the row in question. The use of InnoDB is beyond the scope of a tutorial on basic MySQL usage and will receive no further coverage in this tutorial but will be the subject of a future article. Further information of InnoDB can be found at http://www.mysql.com/doc/en/InnoDB.html.
Now that we have established the relationship between Event and User, we can move on to defining the relationship between User and Group. As you will recall, the relationship between User and Group is a many-to-many relationship. This kind of relationship is not as simple to implement as a one-to-many relationship. In a one-to-many relationship we can place the primary key of the “many” table in a column of the “one” table, but this does not translate effectively into our many-to-many scenario: which table will hold the primary key of the other? The answer is that neither table will hold the primary key of the other. Instead, we will create a third table to join the two:
In this case the User_Group table has two foreign keys: the Group_Name and the User_ID. These two foreign keys form a Composite Primary Key, ensuring that there is only one entry in the table for each unique User/Group combination. In addition, two fields have been added that will flag whether the user is a manager or receptionist for the group, allowing us to remove the fields from both tables referring to managers and receptionists. This design allows us to have multiple managers and/or receptionists for each group and also allows a single user to manage multiple groups.
Normalizing The Database
Database normalization is covered in detail in another article I wrote at http://mikehillyer.com/articles/an-introduction-to-database-normalization and I would recommend reading it for more information on database normalization. In this tutorial I will cover the basics as they apply to our example design. Normalization is the process of removing redundant data from your tables in order to improve storage efficiency, data integrity and scalability. This improvement is balanced against an increase in complexity and potential performance losses from the joining together of the normalized tables at query-time.
First Normal Form
The normalization process involves getting our data to conform to three progressive normal forms, and a higher level of normalization cannot be achieved until the previous levels have also been achieved (there are actually five normal forms, but the last two are mainly academic and will not be discussed). The First Normal Form (or 1NF) involves removal of redundant data from horizontal rows. We want to ensure that there is no duplication of data in a given row, and that every column stores the least amount of information possible (making the field atomic).
We have two violations of First Normal Form (1NF) in our database model. The first is that our Name column of the User table holds both the first and last names of our user. This does not represent the smallest amount of data possible. In addition, it makes it difficult to sort users by last name. Finally, our users will not be able to customize their lists to show names in order of first name/last name or last name/first name easily. To solve this we simply split our Name column into two columns:
The second violation of 1NF is the horizontal redundancy of the Status columns (note that while the columns appear vertically in this diagram the columns are actually horizontally arrayed when in table form. Imagine the table as a page on a ledger; with column headings across the top of the page and individual rows below). Not only will it be hard to query horizontally redundant columns (You would need a query like SELECT * FROM User WHERE Status1 = ‘My Status’ OR Status2 = ‘My Status';), but you have just locked your users into only having two custom status messages. If a customer ever demands the ability to store three status messages you will need to add another column and rewrite the sample query I just gave. Horizontally redundant columns can be dealt with by adding another table to store the status messages. This presents us with an opportunity to also add the predefined status messages to the system:
Our new status table will hold every status message ever entered, whether the message is custom or predefined. Our event table then links to the Status table to indicate which Status message the event uses. If a message is pre-defined then the User_ID field will not be populated. If the status message is a custom message then the User_ID field will be populated, associating the status message with its author for later retrieval (remember that one of our features on the list is the retrieval of previous custom status messages for reuse).
Before we declare our table to be in first normal form, there is one other adjustment we can make. Rather than store a flag for Manager and a separate one for Receptionist we can combine the two columns into one Administrative Level column. Since the two positions are mutually exclusive there is not much point in storing both fields separately, as no user will ever be a Manager AND a Receptionist for the same group:
Since a user could have no administrative privileges over the group I have made the field one that is not required. This way the field can be populated with NULL to indicate that the user has no administrative privileges.
Second Normal Form
Where the First Normal Form deals with redundancy of data across a horizontal row, Second Normal Form (or 2NF) deals with redundancy of data in vertical columns (where the same data appears in multiple rows). The normal forms are progressive, so to achieve Second Normal Form, your tables must already be in First Normal Form. In this case, the titles given for the administrative levels (receptionist, manager) could potentially appear in many rows, as a large company could have a lot of receptionists. We deal with vertical redundancy the same way we deal with horizontal redundancy, by adding tables to our schema:
Our new table will allow us to change the title of an administrative level (say from Manager to Supervisor) without having to make the change for every user that is a manager, as would be the case if you had the word manager in every row that it applied to. This improves ease of maintenance. In addition we may find that we eventually need to add more data regarding the rights of the administrative level. With this approach we could easily add flags to the table representing different privileges the user possesses, then allow administrators to create new administrative levels as they see fit; granting various permissions as they go.
Third Normal Form
In Third Normal Form we are looking for data in our tables that is not fully dependent on the primary key, but dependent on another value in the table. Take an address for example: your City and State are not really dependent on you, but on your Postal or Zip code. As such we could create a separate table with Zip Code, City, and State and only store the zip code in the User table. While such an approach may ease maintenance it can also introduce complexity when trying to assemble an address. As such, data should be moved into third normal form when deemed necessary for maintenance of information. In our case there is not really any information in the schema that would be a candidate for further normalization into third normal form.
Choosing Column Types and Writing CREATE TABLE Statements
A CREATE TABLE statement is a special query we pass to MySQL to instruct it to create a new table to store our data. The syntax for CREATE TABLE can be found in the MySQL reference manual at http://www.mysql.com/doc/en/CREATE_TABLE.html. While the syntax may look complicated, it is actually quite simple. To create tables in MySQL, we mainly need a table name, column names, and the type of data that each column will hold. let’s look at the User table first.
The first component of our CREATE TABLE query is the CREATE TABLE statement itself, which indicates the name of our table:
CREATE TABLE User (
The opening round bracket indicates that the lines that follow define columns in our table. When defining a field we need a few pieces of information: first we need the name of the field, taken directly from the diagram above. Second we need the data type of the field. This is where we determine whether the field will store numeric information, alphanumeric information, date and time related information, or other special data types. The full list of data types is available in the MySQL reference manual at http://www.mysql.com/doc/en/Column_types.html, and I recommend referring to the VB/MySQL data type conversion table at http://www.vbmysql.com/articles/visual-basic_mysql/datatypes.html.When choosing data types it is important to keep a balance between row size and usability. Every field will occupy a certain amount of space in your row, and will have a certain capacity for holding data. We want to choose the smallest data type that can reasonably hold our information. let’s take User_ID as an example; our User_ID field will be holding an integer value, and in theory we need one for each employee in our organization. Our choices for data type include TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT. The largest numbers that can be stored in these various data types are 127, 32767, 8388607, 2147483647, and 9223372036854775807 respectively.
In this case we can look at TINYINT with its capacity of 127 and consider it to be too small for our purposes. SMALLINT, with its capacity of 32767 should be more than adequate for even the largest organizations. The reason for choosing the smallest data type possible for our data is that larger data types result in longer rows, and longer rows take more time to search through when performing queries. Therefore it is always in our best interests to keep our rows as small as possible. By choosing the 16 bit SMALLINT data type over, say, the 32 bit INT data type, we have saved almost 2 gigabytes of storage over a million rows. Thus another benefit of choosing smaller data types is better storage efficiency. You can always redefine a column to larger data type later on, even once the server is in production use.
The remainder of our row description is devoted to extra information about our field. First of all, when dealing with numeric data types we have the option of specifying whether the field should be allowed to handle negative numbers. The actual range of numbers that can be stored in a SMALLINT field is -32768 to 32767. In our particular application it does not make sense to have a negative User_ID value. As such it would be beneficial to declare the field UNSIGNED. Not only does this prevent the insertion of negative numbers into our table, it also increases the maximum value we can hold in the field to 65535 (this is because in binary form the leftmost bit of a number is reserved to represent the sign of the number. MySQL will use the leftmost bit to increase the maximum size of the number when the column is defined UNSIGNED).
NOT NULL, DEFAULT, AUTO_INCREMENT, and PRIMARY KEY
Another piece of information we can designate for this field is whether the field can hold NULL values or not. As you recall, a field that is defined as required in our design is one that must have a value when the row is created or later updated, and NULL is a keyword that indicates the field has no value. As such, any field we earlier defined as required must now have the NOT NULL keyword attached to it. NOT NULL will prevent the insertion of NULLs and make the field mandatory.
In combination with NOT NULL comes the DEFAULT keyword. Just because a field is defined NOT NULL does not mean that we have to specify a value when inserting records. Any NOT NULL field that is not specified when the row is inserted will receive the value pre-defined by the DEFAULT keyword. Numeric values will automatically get a default value of zero, while strings will default to an empty string (”). You can use the DEFAULT keyword to override this behavior.
There are two final keywords that are commonly used when creating tables, but which can only be defined once. These are the AUTO_INCREMENT and PRIMARY KEY keywords. AUTO_INCREMENT indicates that if the field is not specified when inserting rows, the database will automatically generate a unique incrementing value and place it in the field defined AUTO_INCREMENT. This is very useful when using integers to identify rows, as the programmer does not have to worry about creating a unique number for each row he creates. The PRIMARY KEY keyword is used to define the field as the primary key of the table. let’s look at the definition of our first column:
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
CHAR vs. VARCHAR
Now that our primary key is defined, let’s look at the name columns of our User table. Our name columns will be handled as strings, which are allowed to hold alphanumeric values. The two basic fields for handling strings are CHAR and VARCHAR, each of which is defined in terms of the longest string it can hold (defined in characters, between 1 and 255). The difference between CHAR and VARCHAR is how they deal with unused space. let’s say you consider the longest possible last name to be 40 characters long, and a user has a name of Winstead (8 characters). In a CHAR field, the field would still be 40 bytes long (1 character = 1 byte), and the last 32 characters would be blank spaces used to pad the string to the full 40 characters (the trailing spaces will be removed when you retrieve the value). This means that no matter how long the actual string is, it will still take a full 40 characters worth of space in the table. VARCHAR on the other hand will store only the 8 characters and will only occupy 8 bytes of space in the table.
Your first instinct may be to choose VARCHAR and benefit from the decreased storage space required in the table, in an effort to increase the speed of searches as I described above, but VARCHAR is an exception to this rule. When you use VARCHAR in a table (as well as certain columns intended for large text and binary values), the width of each row becomes inconsistent. One row may be 40 bytes long, another only 8. MySQL can handle this by recording the length of the row in its file system, but the server must then read the length of each row before searching it and moving on to the next row, where it must check the length again, ad nauseam. By defining text fields as CHAR we will occupy more space on the hard-drive, but there is a speed increase involved as the server will know that each and every row is exactly a given size (such as 120 bytes), allowing it to search through rows without checking their length first. As such I recommend defining all your text fields as CHAR to improve query performance, unless storage space is a strong consideration (and with the prices of today’s hard-drives this is not really that great of a concern, especially with our application).
let’s look at the name columns:
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, LastName CHAR(40) NOT NULL, FirstName CHAR(40) NOT NULL,
As you can see, each field will be a CHAR field that will hold up to 40 characters. The fields are defined NOT NULL to make them mandatory, with the standard default value of a blank string being used since we have not defined it otherwise.
Storing Phone Numbers
There is no one sure way to store a phone number, as the data type used can be dependant on what you plan to do with your phone numbers. If you wish to do searches based on area code you may want to store the different parts of a phone number separately, with a field for area code, a field for the first three digits, a field for the last four digits, and perhaps even an optional field for the extension. This will make it easier to search and sort phone numbers. In our case we are only storing the number for the reference of a user’s fellow staff members, so we will be storing it in a simple CHAR column:
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, LastName CHAR(40) NOT NULL, FirstName CHAR(40) NOT NULL, Phone CHAR(10) NOT NULL,
Our Username and Password fields will also be declared as CHAR columns:
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, LastName CHAR(40) NOT NULL, FirstName CHAR(40) NOT NULL, Phone CHAR(10) NOT NULL, Username CHAR(16) NOT NULL, Password CHAR(40) BINARY NOT NULL,
Handling user logins will be discussed in a future article. The BINARY keyword used with the Password field indicates that we want all comparisons with the Password field to be done in a case-sensitive manner (So that PaSSwoRD and password are treated as different strings when validating users). Strings in MySQL are treated in a case-insensitive manner unless the BINARY keyword is used.
For our Administrator field we will use an ENUM column. An ENUM column can be assigned only one value, chosen from a list of pre-defined values that we declare when creating the field. ENUM works well for representing value sets that will not increase in size as the database matures (for that we use a separate table as with our AdminLevels table). In this case we wish to represent a scenario that can only have two values: true or false (either you are an administrator or you are not). Assigning an ENUM involves created a comma-separated set of strings listing all possible values (you can define up to 65535 different potential values to an ENUM column, but if you need more than five or six you are probably better off creating a table to store the values):
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, LastName CHAR(40) NOT NULL, FirstName CHAR(40) NOT NULL, Phone CHAR(10) NOT NULL, Username CHAR(16) NOT NULL, Password CHAR(20) BINARY NOT NULL, Administrator ENUM('TRUE', 'FALSE') NOT NULL,
The last two columns of our table are the Created and Deleted fields. The Deleted field is once again a true/false column and will be treated the same as the Administrator field. The Created field represents the date on which the record was created. MySQL provides four columns for handling date information, each of which is fairly self-explanatory: DATE, TIME, DATETIME, and YEAR (There is a fifth TIMESTAMP column that we will discuss later). When choosing a data type to use consider what kind of information is needed. By far the most commonly used formats are DATE and DATETIME, as there is not as much use for just the time of day or year when recording information (consider that most people do not search for time information regardless of day). For our purposes the DATE column is adequate as it is not essential to know the time of day that the row was created:
CREATE TABLE User ( User_ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, LastName CHAR(40) NOT NULL, FirstName CHAR(40) NOT NULL, Phone CHAR(10) NOT NULL, Username CHAR(16) NOT NULL, Password CHAR(20) BINARY NOT NULL, Administrator ENUM('TRUE', 'FALSE') NOT NULL, Deleted ENUM('TRUE', 'FALSE') NOT NULL, Created DATETIME NOT NULL )TYPE = MyISAM;
The closing round bracket indicates the end of the column definitions. The TYPE keyword is optional because the default table type is MyISAM, but we can use the keyword later to indicate that we wish to use one of the other MySQL table handlers, such as InnoDB, HEAP, or BDB to name a few. More information on table handlers can be found at http://www.mysql.com/doc/en/Table_types.html, but for the purposes of this article we will stick to the default MyISAM table handler. The semicolon character (;) tells the MySQL server that we have finished our query and that it can go ahead and execute it. Execution of queries will be covered in a future tutorial, for now we will simply be preparing the queries for future use. I recommend you create your queries in a text editor such as notepad and save them using a filename such as table_creation_statements.sql. The .sql extension will indicate that the file holds SQL queries.
One special data type available with many databases is TIMESTAMP. A timestamp column is very similar to the DATETIME field but has certain special properties. If your table has a timestamp column then MySQL will automatically populate the field with the current time when a row is inserted without specifying a value for the timestamp column (or if the timestamp column is specified as being NULL). The timestamp column will also be updated to the current time whenever a row is updated with new information (once again as long as the column is either unspecified or set to NULL).
One useful property of timestamp columns is their behavior when you have more than one timestamp column in a table. When you have multiple timestamp columns in a table, all timestamp fields will be assigned the current time when a new row is inserted, but only the leftmost timestamp column will be updated during subsequent updates of the table. This means that you can place two timestamp fields on a table, with the right one recording the creation date of the row and the left one tracking the last update on the row. We will use the TIMESTAMP data type to record the creation time of events in our Event table, using the name Timestamp. Typically you cannot use a data type as a column name as it is a reserved word, but the use of Timestamp as a column name is an exception in MySQL:
CREATE TABLE Event ( Event_ID MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, User_ID SMALLINT UNSIGNED NOT NULL, Message_ID MEDIUMINT UNSIGNED NOT NULL, Timestamp TIMESTAMP, Status ENUM('In','Out') NOT NULL, Creator SMALLINT UNSIGNED )TYPE = MyISAM;
A couple of things to note is that the In/Out column was renamed Status to better reflect its meaning (as In/Out tells you what the column can hold but not what it means). In addition, the Creator column was changed from a required field to a non-required one. The logic behind this is that a NULL creator can signify that the user the event describes is the creator of the event, which could help later on when determining how often events are created by users other than the one the event describes. It is important to note that design can be an iterative process, and that changes like this are regularly made. The important thing is to remember to update your documentation when making changes:
Composite Primary Keys
In the User_Group table we have a composite primary key. We can’t define both columns with the PRIMARY KEY keyword and must instead use a different syntax:
CREATE TABLE User_Group ( Group_Name CHAR(20) NOT NULL, User_ID SMALLINT UNSIGNED NOT NULL, Level_ID TINYINT UNSIGNED, PRIMARY KEY (User_ID, Group_Name) )TYPE = MyISAM;
When creating a composite primary key we define the primary key after we finish defining our columns.
Defining Our Remaining Tables
The remaining tables in our application will follow the same principles that have been described so far:
CREATE TABLE Status ( Message_ID MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, User_ID SMALLINT UNSIGNED, Message CHAR(255) NOT NULL, Deleted ENUM('True','False') )TYPE = MyISAM; CREATE TABLE Groups ( Group_Name CHAR(20) NOT NULL PRIMARY KEY, Created DATETIME NOT NULL, Scope ENUM('Public','Private'), Deleted ENUM('True','False') )TYPE = MyISAM; CREATE TABLE AdminLevels ( Level_ID TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, Title CHAR(20) NOT NULL )TYPE = MyISAM;
The full set of table definitions is available for download HERE.
It should be noted that once again the issue of reserved words has come up, as the Group table used a reserved word. It is possible to use a reserved word as a column or table name by wrapping it in back ticks (`Group`) every time you use it but I prefer to avoid the hassle. Instead we’ll rename the table to Groups (generally you should avoid using plural forms in your table names):
It should also be noted that just as you should use the same name for a foreign key as you use for the primary key (use User_ID in all table referring to it), you should also make sure to use the same datatype as well. So make certain that all User_ID fields are SMALLINT UNSIGNED, and that all Group_Name fields are CHAR(20). Ensure that all CHAR and VARCHAR columns are the same width (in this case 20 characters) as you do not want one table to be able to store a longer group name than another.
We have created a list of features and requirements for our application that will not only help us establish when our project is complete, but which also helps us define our application’s entities. Our entity design will be used to define our application objects, and is also used to design our database tables. Now that our tables have been defined and designed, we can use the CREATE TABLE statements to create the tables in our MySQL server.
In my next article I will cover installation of the MySQL server under Windows and then cover the basics of creating a database and its component tables.