Databases are designed to create better organization when storing and retrieving data. In order for the records to be properly organized, a primary key should be chosen. This key is a unique identifier for each record.  In order to understand how to choose a primary key for your database, it is important to understand exactly how a database is structured and what the purpose is for choosing a primary key.

A database is comprised of tables and within each table are columns and rows. A column could specify an item such as "customer name" where the row would contain the actual customer's name (i.e. "John Doe"). A primary key is the identifier which uniquely distinguishes each individual record from others in a column. This is important because it's possible that there could be several people named "John Doe" located within a table. 

Primary keys are not required in designing a database, but it is beneficial to choose one to save headaches and confusion later on. Choosing a strong key ensures future data manipulation can be easily managed and a database will typically perform better too.

You can theoretically choose any column to be selected as the primary key, but you'll want it to meet certain conditions. If your database does not meet these conditions, it would make a poor choice for a key.

keyboard question
Credit: Geralt/Pixabay CC0 Public Domain

Not choosing a strong primary key could result in a lot of confusion when trying to distinguish unique records.

Conditions for Choosing a Primary Key

No two rows may have the same key value: Whatever key is chosen, it must be able to act as a standalone as an identifier. In other words, in a customer table, choosing the zip code as a primary key wouldn't be acceptable because many customers share the same zip code. There is nothing unique about this value. On the other hand, a driver's license number would be unique. (But it is important to consider not everyone may have a license, which leads to the next condition).

Every row in the column must have an entry: Primary keys should not have NULL values, so be sure it's an item that every record will include. For instance, phone numbers is one example. While theoretically a phone number can be used to identify a record, it's not a good idea to use as a unique identifier unless you are certain that each customer has a phone number to use as the key (there are other reasons why a phone number may be a bad choice which will be discussed below). If there is no phone number and the space is left blank, it is referred to as a NULL value.

Credit: ClkerFreeVectorImages/Pixabay CC0 Public Domain

When choosing a key, it is important to choose wisely in order to avoid errors.

Primary key values remain the same: In general, primary keys cannot be easily changed or edited. Once the primary key is chosen, it's best for it to stand as a unique identifier (another good reason not to use a phone number since these are not static and people often change their numbers). Although, there are some exceptions, it is recommended to pick a strong primary key when the database is built and stick with this identifier.

Primary keys are not recyclable: For instance, if you lose a customer you cannot assign their identifier to any new customers at a later date.

Ways to Choose a Primary Key

Automated key value: You can design your database so an automated number is generated (i.e. "customer number") to numerically assign each record a number as it is entered into the database. A benefit to this approach is the numbers are disposable and there is no reason why one could need to be changed at a later date.

Phone numbers: This can work, but is not really reliable because people frequently change their phone numbers or may not have a phone. If this happens, this could create problems down the road, especially if the person who later gets the number also happens to be a customer (this happened to me with a retailer - someone else has the number now, but my name is linked to it). Using a phone number typically brings record keepers back to the duplication problem.

Social security number (SSN): While unless someone has stolen the number and is using it illegally, the SNN is a unique identifier, which is why it was so popular for so many years. However, these days, due to privacy and identity theft issues, it's not really a good choice. As laws emerge to counteract issues associated with data breaches, this also makes SSNs bad choices. It's best to design your database without using a SSN (or equivalent in countries outside of the United States) so you do not have issues in the future. Many organizations have already done away with the practice of using SSN as a primary key.

Names and birthdays: These are not good identifiers because it wouldn't be uncommon to have several people with the same name or birthday (using them in combination is a possibility, as the next step covers).

Use a combination of columns: Using several columns as a primary key is an option, however is also not always the best approach, depending upon your organization and what data you are storing. I once built a database that was originally intended to be a simple mailing list for the agency I was working for; basic record keeping and to print labels.

Since the database was initially only to have minimal function, I didn't auto-generate a number or try to find some other personal identifier associated with the people on the list. What I did was use a combination of columns to make each record unique. In this case the "First Name", "Last Name" and "Address" was a primary key. It was a rare instance that two people with the same name lived at the same address. This was in Microsoft Access and, for the most part, it worked. Unfortunately, years later a problem presented itself when two people with the same name lived at that address and both became customers.

If you decide to use more than one column as a key, try to make it as unique as possible and try to visualize if there is potential for duplication. This was a valuable lesson learned for me and a factor to definitely consider when establishing a primary key. In hindsight, I should have chosen a better key, such as an automated number or other unique identifier. 

Other unique identifiers: Customer ID numbers, employee ID numbers or sales transaction numbers (depending on the organization, there are other types of identifiers likely associated with the business' needs), are other options to choose that would be unique.

Build With Flexibility

When building a database, it's best to design with flexibility in mind. When the aforementioned mailing list was built, it was intended to be simple. However, its use expanded over the course of time and was looked at for other purposes too; over the years it had outgrown its original design. It was a lot of work to rebuild as a more robust relational database. It would have been far easier to choose a better unique identifier in the first place.

Man typing
Credit: StartupStockPhotos/Pixabay CC0 Public Domain

It is always best to design a database with flexibility in mind. While you might have very minimal needs in the beginning, over time as the business and/or data grows, it's likely things will change.

When designing a database and selecting a primary key, try and envision any potential uses that the database will be used for - even way down the road; this will ensure you choose the best key possible and your database will be able to grow and expand for other purposes, if needed.


Additional reference: Forta, Ben, SQL in 10 minutes, 3rd Edition