Data Integrity: How to Achieve and Maintain It
At the time of writing, I'm working on a corporate Java-based application that stores vast amounts of sensitive data. That data feeds a wide variety of computations and processes, driven by dozens of variables, that generate all sorts of reports and forecasts. With that in mind, you can probably see where I'm going with this: the accuracy of the output depends on the accuracy of the input, which is why we must ensure the integrity of the data we store.
So, what is data integrity? The term refers to the accuracy and consistency (validity) of data. Compromised data is of no use to an enterprise, which is why integrity is a core focus for this type of solution. Data integrity should ensure accuracy, completeness, searchability, and traceability, among other aspects. In simpler terms, data integrity means that the data was recorded as intended and was not unintentionally changed at any point in its lifecycle.
The concept is simple, but the practice is not. Imagine making a crucial business decision based on data that is entirely, or even partially, inaccurate. Organizations routinely make data-driven business decisions, and if the underlying data lacks integrity, those decisions can have a dramatic effect on the company's bottom line.
Here are some of what I consider to be the best practices when working with essential data in the context of web applications:
Validation & Constraints
Data validation is the process of analyzing a dataset to establish certain aspects of data quality and decide on possible remediation steps.
There is a wide range of validation checks. Field-specific checks look at a single field: its presence, uniqueness, format, and bounds, or even whether it contains an XSS payload. Cross-field checks verify that values that depend on each other are consistent at a given point in time.
These are not the only kinds, but they are among the most common, and each business has its own rules that need to be enforced.
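Most importantly, the validations should be present on both the front end and the back end. Front-end validation gives users quick feedback, but it can be bypassed, so the back end must never assume its input has already been checked.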
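To make the two kinds of checks concrete, here is a minimal back-end sketch in plain Java. Everything in it is hypothetical and only for illustration: the `BookingValidator` class, the booking fields, the 1 to 10 guest limit, and the deliberately simple e-mail pattern. A real project would more likely rely on a validation framework, but the idea stays the same.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical example: validates a booking request on the back end.
final class BookingValidator {

    // Deliberately simple e-mail pattern for illustration; it is not RFC-complete.
    private static final Pattern EMAIL =
            Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    private BookingValidator() {}

    /** Returns the list of violations; an empty list means the input is valid. */
    static List<String> validate(String email, int guests,
                                 LocalDate checkIn, LocalDate checkOut) {
        List<String> errors = new ArrayList<>();

        // Field-specific checks: presence and format of a single field.
        if (email == null || email.isBlank()) {
            errors.add("email is required");
        } else if (!EMAIL.matcher(email).matches()) {
            errors.add("email is malformed");
        }

        // Field-specific check: bounds (the 1..10 limit is an assumed business rule).
        if (guests < 1 || guests > 10) {
            errors.add("guests must be between 1 and 10");
        }

        // Cross-field check: the two dates must be consistent with each other.
        if (checkIn == null || checkOut == null) {
            errors.add("checkIn and checkOut are required");
        } else if (!checkOut.isAfter(checkIn)) {
            errors.add("checkOut must be after checkIn");
        }

        return errors;
    }
}
```

Returning every violation at once, rather than failing on the first one, lets the caller report all problems to the user in a single round trip.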
For an extra layer of protection in case the validations are not enough, you can also use database constraints. Whether they are declared at the application level through an ORM framework or defined directly in the database, constraints prevent corrupt data from being stored and so help maintain data integrity.
Audit trails
According to Martin Fowler, audit trails (also called audit logs) are "one of the simplest, yet one of the most effective forms of tracking temporal information." An audit trail tracks how database records are used and how they change.
When you audit a database, each operation on the data can be monitored and logged to an audit trail, including information about which database object or data record was touched, what account performed the action and when the activity occurred.
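As a rough illustration of what such a log captures, here is a minimal in-memory sketch in plain Java. The `AuditTrail` class and its field names are hypothetical, and a real implementation would persist the entries (for example in a dedicated table) instead of keeping them in a list.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical append-only audit trail, kept in memory for illustration only.
final class AuditTrail {

    /** One immutable entry: what was touched, how, by whom, and when. */
    static final class Entry {
        final String entity;     // which database object, e.g. a table name
        final String recordId;   // which record was touched
        final String action;     // e.g. "INSERT", "UPDATE", "DELETE"
        final String account;    // which account performed the action
        final Instant timestamp; // when the activity occurred

        Entry(String entity, String recordId, String action,
              String account, Instant timestamp) {
            this.entity = entity;
            this.recordId = recordId;
            this.action = action;
            this.account = account;
            this.timestamp = timestamp;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Appends an entry; existing entries are never modified or removed. */
    void record(String entity, String recordId, String action, String account) {
        entries.add(new Entry(entity, recordId, action, account, Instant.now()));
    }

    /** Returns the entries for one record, in the order they were recorded. */
    List<Entry> historyOf(String entity, String recordId) {
        List<Entry> history = new ArrayList<>();
        for (Entry e : entries) {
            if (e.entity.equals(entity) && e.recordId.equals(recordId)) {
                history.add(e);
            }
        }
        return Collections.unmodifiableList(history);
    }
}
```

Calling `record("invoice", "42", "UPDATE", "alice")` after each write and `historyOf("invoice", "42")` when investigating is enough to answer who changed a record and when, which is the core of the traceability requirement.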
The solutions range from a custom implementation based on design patterns such as Temporal Object or Temporal Property, to logging events with a framework like Spring Data or Hibernate, to more complex options that capture more detailed data. The choice should be based on the business requirements for traceability and performance.
Sure, it might add a certain level of overhead to the database, but in the long run, it just might save you a lot of time and headaches when you need to trace the changes made within the database.
Tests
Finally, I think testing is one of the most important aspects of any application, and it can determine the application's fate. Good testing catches critical bugs early on, while poor testing can (and will, if not corrected in time) lead to failures and downtime.
Functional testing has its own significance in the process, and I don't want to diminish it, but the purpose of this article is to provide guidelines at the development level. The following lines therefore focus on the tests that developers write, more precisely unit and integration tests.
Java offers plenty of frameworks that come in handy for testing. One of the most popular and easiest to use is JUnit. Not only is it easy to set up and run, but it also supports annotations and allows specific tests to be ignored or grouped and executed together. Don't let the name fool you: it can be used for both unit and integration testing and, combined with a mocking framework like Mockito, it becomes a powerful tool. You can even add hooks to your CI/CD pipeline that run the tests before the code is deployed, reducing the chances of a bug reaching the application.
But having tests is not enough. The way they are written, and more precisely what they actually test, matters much more. Tests only catch bugs in the flows you write them to exercise, so you should test both positive and negative scenarios. The way the system responds to invalid data or to a system failure is just as important as the happy flow. Moreover, the tests should cover boundary and corner cases, so that when a new feature is implemented, you know you didn't break anything.
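The three kinds of scenarios can be sketched as follows. To keep the example self-contained it uses plain Java checks instead of JUnit; in a JUnit project each static method below would typically be a method annotated with `@Test`. The `PriceCalculator` class and its discount rule are hypothetical and exist only to give the checks something to exercise.

```java
// Hypothetical code under test: applies a percentage discount to a price in cents.
final class PriceCalculator {
    private PriceCalculator() {}

    static long applyDiscount(long priceInCents, int percent) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("price must not be negative");
        }
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        // Integer arithmetic on cents avoids floating-point rounding surprises.
        return priceInCents - (priceInCents * percent) / 100;
    }
}

// Plain-Java checks standing in for a JUnit test class.
final class PriceCalculatorChecks {
    private PriceCalculatorChecks() {}

    // Positive scenario: the happy flow with ordinary input.
    static void happyPath() {
        check(PriceCalculator.applyDiscount(1000, 25) == 750, "25% off 1000");
    }

    // Boundary cases: the edges of the accepted ranges.
    static void boundaryValues() {
        check(PriceCalculator.applyDiscount(1000, 0) == 1000, "0% leaves the price unchanged");
        check(PriceCalculator.applyDiscount(1000, 100) == 0, "100% makes the item free");
        check(PriceCalculator.applyDiscount(0, 50) == 0, "a zero price stays zero");
    }

    // Negative scenarios: invalid data must be rejected, not silently accepted.
    static void invalidInputIsRejected() {
        expectIllegalArgument(() -> PriceCalculator.applyDiscount(1000, -1), "percent below 0");
        expectIllegalArgument(() -> PriceCalculator.applyDiscount(1000, 101), "percent above 100");
        expectIllegalArgument(() -> PriceCalculator.applyDiscount(-1, 10), "negative price");
    }

    static void runAll() {
        happyPath();
        boundaryValues();
        invalidInputIsRejected();
    }

    private static void check(boolean condition, String description) {
        if (!condition) {
            throw new AssertionError("failed: " + description);
        }
    }

    private static void expectIllegalArgument(Runnable call, String description) {
        try {
            call.run();
        } catch (IllegalArgumentException expected) {
            return; // the invalid input was rejected as required
        }
        throw new AssertionError("expected IllegalArgumentException for " + description);
    }
}
```

Calling `PriceCalculatorChecks.runAll()` runs all three groups and throws an `AssertionError` on the first failing check.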
Speaking of coverage, you should measure the code coverage of your tests. It's a metric showing, as a percentage, how much of the written code is executed when the tests run. Generally speaking, the tests should cover all the branches of the code (if/else and switch statements), and a dedicated coverage tool can even point out the areas that are not tested.
You might think I forgot about TDD. Test-Driven Development can be a good approach, but unlike the practices described above, it might not be suitable in every situation. Still, when you write new functionality, writing its tests should be the next thing you do.
These guidelines not only bring long-term value to the project, they also provide a level of safety that you may not feel is necessary yet, but that you will eventually appreciate. I say this from my own experience: there were times when I wondered why I was writing this or that test, but I have since come to appreciate its importance.
PS: As a side note, I have two more bits of advice that affect the quality of the product, and of the data along with it, and that I think developers should always keep in mind. First, follow the KISS principle (Keep It Simple, Stupid). Second, don't be afraid to refactor when it's called for, because workarounds do more harm in the long term. But that's something we can discuss in a future article.