By now, the questions of what data is and how it is managed and organized in a database are no longer new. But questions about how best to manage and model data are still open, so we will discuss some common database management systems and show some modeling concepts.
Hierarchical Database: Data is organized into levels joined by connection links, forming a parent-child relationship in which each child has a single parent. This form does not scale well: whenever a record is needed, one has to traverse the database from the top, which is expensive.
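To make the parent-child idea concrete, here is a minimal sketch in plain Python. It is not tied to any real hierarchical DBMS; the `Node` class, the `find` helper, and the sample categories are hypothetical names chosen for illustration. It shows why a lookup has to walk down from the root.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One record in a toy hierarchy; each node hangs under one parent."""
    name: str
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def find(root: Node, name: str) -> tuple[Optional[Node], int]:
    """Depth-first search from the root; also counts the nodes visited."""
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.name == name:
            return node, visited
        stack.extend(node.children)
    return None, visited


# Hypothetical sample data: expenses -> categories -> items
root = Node("expenses")
groceries = root.add(Node("groceries"))
transport = root.add(Node("transport"))
groceries.add(Node("milk"))
transport.add(Node("bus ticket"))

node, visited = find(root, "milk")
```

Even in this tiny tree, reaching one leaf means visiting several other nodes first, and that cost grows with the size of the hierarchy.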
Network Database: This is similar to the hierarchical model, with the difference that a child node can be connected to multiple parents. It still carries over most of the drawbacks of the hierarchical model, though it is less expensive.
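As a rough illustration of a child with more than one parent, the sketch below keeps two lookup tables in plain Python. The `link` helper and the record names are assumptions made up for this example, not part of any network DBMS.

```python
from collections import defaultdict

parents_of = defaultdict(set)   # child -> its parents
children_of = defaultdict(set)  # parent -> its children


def link(parent: str, child: str) -> None:
    """Record a parent-child link in both directions."""
    parents_of[child].add(parent)
    children_of[parent].add(child)


# Hypothetical records: an item belongs to a merchant and to a category.
link("Merchant: Corner Shop", "Item: milk")
link("Category: groceries", "Item: milk")
link("Category: groceries", "Item: bread")

# The same child is reachable from either parent without being duplicated.
milk_parents = sorted(parents_of["Item: milk"])
```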
Object-Oriented Database: Information is organized in the form of objects, with classes and inheritance as in object-oriented programming. When you have complex data relationships, an ODBMS is a good route, and it is used quite a lot in building applications. Heard of object-relational mapping (ORM)? It is a related idea: a means of accessing stored data as objects within the code, usually on top of a relational database.
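For readers less familiar with classes and inheritance, here is a small Python sketch of the kind of object such a database would store. The `Expense` and `GroceryExpense` classes and their fields are hypothetical; no actual ODBMS or ORM library is used.

```python
from dataclasses import dataclass


@dataclass
class Expense:
    """Base class: fields shared by every expense."""
    merchant: str
    amount: float

    def describe(self) -> str:
        return f"{self.merchant}: {self.amount:.2f}"


@dataclass
class GroceryExpense(Expense):
    """Subclass: inherits merchant and amount, and adds its own field."""
    item_count: int = 0

    def describe(self) -> str:
        return f"{super().describe()} ({self.item_count} items)"


expense = GroceryExpense(merchant="Corner Shop", amount=12.5, item_count=3)
summary = expense.describe()
```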
Relational Database: Also called a SQL database, and one of the most popular types. Tables are used to hold the data. Yes, the familiar rows-and-columns database, in which each row groups related values, as implied by the name.
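A quick way to see rows, columns, and relationships in action is Python's built-in `sqlite3` module. The sketch below is only an illustration: the `merchant` and `expense` tables and their columns are assumed names, not the schema of our application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE merchant (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE expense (
        id          INTEGER PRIMARY KEY,
        merchant_id INTEGER NOT NULL REFERENCES merchant(id),
        item        TEXT NOT NULL,
        amount      REAL NOT NULL
    );
    """
)
conn.execute("INSERT INTO merchant (id, name) VALUES (1, 'Corner Shop')")
conn.executemany(
    "INSERT INTO expense (merchant_id, item, amount) VALUES (?, ?, ?)",
    [(1, "milk", 1.5), (1, "bread", 2.0)],
)

# Join the two tables through the merchant_id key and total the spend.
rows = conn.execute(
    """
    SELECT m.name, SUM(e.amount)
    FROM expense AS e
    JOIN merchant AS m ON m.id = e.merchant_id
    GROUP BY m.name
    """
).fetchall()
conn.close()
```

The `merchant_id` column is what ties each expense row back to a merchant row.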
NoSQL Database: Quite different from the relational database. The name is often read as "not only SQL", and it comes in a variety of data models such as document, graph, key-value, etc. Its popularity has grown as we look to handle more varied data types. In an RDBMS, if a change to the data model affects existing data, the database developer might need to make some expensive modifications; with NoSQL this is often not the case.
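To illustrate the flexible data model, the sketch below imitates a document store with plain Python dictionaries. No real NoSQL client is used; the `receipts` list, the field names, and the `total_spend` helper are assumptions for illustration only.

```python
import json

# A "collection" is modelled here as a plain list of dict documents.
receipts = [
    {"merchant": "Corner Shop", "total": 3.5},
    # A later document adds fields without altering the earlier one.
    {
        "merchant": "Book Store",
        "total": 20.0,
        "image_path": "receipts/0002.jpg",
        "items": [{"name": "notebook", "price": 20.0}],
    },
]


def total_spend(docs: list[dict]) -> float:
    """Sum the totals, tolerating documents that lack the field."""
    return sum(doc.get("total", 0.0) for doc in docs)


with_images = [doc["merchant"] for doc in receipts if "image_path" in doc]
serialized = json.dumps(receipts[1])  # documents map naturally to JSON
```

The second document carries an image path and a nested item list while the first does not, and nothing had to be migrated to allow that.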
NoSQL vs SQL: which should we be using for our application? It is a very common question. I think most of the different DBMSs are connected: data is ever-changing, and with that comes the need to combine different systems to increase the performance and efficiency of our applications.
The idea is that once presented with a project, you need an initial sketch to help understand the data and its requirements, taking into consideration the business process and input from the stakeholders.
There is a lot of content on the different models and techniques, so I will not bore you with too many details.
Expense Tracker Application: The initial goal of this application is to capture receipts from shopping to help understand monthly expenses and items that were purchased.
We will be designing a data architecture that can handle different data types, as the organization plans to use some form of ML to identify the pictures that are sent.
Database Design and Architecture
We must understand the application workflow, which will aid in the initial design of the database. This will further help with data architecture and orchestration.
- Understand the data|business flow process
- High-level view of the proposed application.
- Data format: The data will come in different forms, such as raw text, images, emails, etc.
- Data Source: Manual entry from individual shopping merchants.
Understanding the application workflow gives some idea of how the backend/database will be designed.
This application will have other parts; for now, we will be focusing on the expense tracker.
- Expense Tracker
- What are the required fields?
- What does a receipt usually contain?
- In what format will the receipts be entered into the database?
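One way to start answering these questions is to write the fields down as a small data model. The sketch below is a guess at what a receipt usually contains; `LineItem`, `Receipt`, and every field name are assumptions for discussion, not the final schema.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


@dataclass
class LineItem:
    """One purchased item as printed on the receipt."""
    description: str
    quantity: int
    unit_price: Decimal

    @property
    def subtotal(self) -> Decimal:
        return self.quantity * self.unit_price


@dataclass
class Receipt:
    """Assumed required fields for a single receipt."""
    merchant: str
    purchased_on: date
    items: list[LineItem] = field(default_factory=list)
    source_format: str = "text"  # e.g. "text", "image", "email"

    @property
    def total(self) -> Decimal:
        return sum((item.subtotal for item in self.items), Decimal("0"))


# Hypothetical sample receipt
receipt = Receipt(
    merchant="Corner Shop",
    purchased_on=date(2024, 1, 5),
    items=[
        LineItem("milk", 2, Decimal("1.50")),
        LineItem("bread", 1, Decimal("2.00")),
    ],
)
```

`Decimal` is used for money here to avoid floating-point rounding surprises; that is a design choice of this sketch rather than a requirement of the application.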
Data Architecture
The different components we will be looking at are:
- Data source
- Data Extraction
- Data storage
- Analysis and Visualization
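The four components can be pictured as a chain of small steps. The sketch below uses plain Python functions as stand-ins for each stage; the function names, the comma-separated input format, and the sample lines are all assumptions, and no real extraction or storage tool is involved.

```python
from collections import defaultdict


def data_source() -> list[str]:
    """Stand-in for manual entry: each string is one raw receipt line."""
    return [
        "2024-01-05,Corner Shop,milk,1.50",
        "2024-01-09,Corner Shop,bread,2.00",
        "2024-02-02,Book Store,notebook,20.00",
    ]


def extract(raw_lines: list[str]) -> list[dict]:
    """Extraction: turn raw text into structured records."""
    records = []
    for line in raw_lines:
        day, merchant, item, amount = line.split(",")
        records.append(
            {"date": day, "merchant": merchant, "item": item, "amount": float(amount)}
        )
    return records


def store(records: list[dict], storage: list[dict]) -> None:
    """Storage: an in-memory list standing in for a real database."""
    storage.extend(records)


def analyze(storage: list[dict]) -> dict[str, float]:
    """Analysis: total spend per month, keyed by YYYY-MM."""
    totals = defaultdict(float)
    for record in storage:
        totals[record["date"][:7]] += record["amount"]
    return dict(totals)


storage: list[dict] = []
store(extract(data_source()), storage)
monthly = analyze(storage)
```

Keeping each stage behind its own function makes it easier to swap in a real tool later without touching the other stages.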
Tools and Languages
- spaCy NLP: This will come in handy as we work on ML models.
- Tika: Handles data extraction.
- Airflow: Handles automation and the data pipeline.
- MongoDB|Postgres: Storage for the extracted data.
- ELK: Helps connect all of our applications and provides searching and visualization.
PS: Follow our GitHub repo for more details. We will be concluding this series with a detailed description of the database design, using the application design/workflow as a guide. Here is a little introduction to how we will be concluding this series in the next few weeks.
Simple UML Design
Expense App: The MVP is an app that can read data scanned from a receipt and drop it all into the initial (aka raw) schema. The next step will be to read through that raw data and pick out each item into a semi-clean schema.
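As a rough sketch of that raw to semi-clean step, the Python below keeps the scanned text untouched in a raw record and then picks out the item lines with a regular expression. The record layout, the `ITEM_LINE` pattern, the `to_semi_clean` helper, and the sample receipt text are all assumptions; real receipts will need a more forgiving parser.

```python
import re

# Hypothetical raw-schema row: the scanned text is stored as-is.
raw_row = {
    "receipt_id": 1,
    "raw_text": "CORNER SHOP\nMilk 1L   1.50\nBread     2.00\nTOTAL     3.50\n",
}

# Assumed line layout: item name, whitespace, price with two decimals.
ITEM_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})$")


def to_semi_clean(row: dict) -> list[dict]:
    """Pick out one semi-clean record per item line, skipping the total."""
    items = []
    for line in row["raw_text"].splitlines():
        match = ITEM_LINE.match(line.strip())
        if match and match["name"].upper() != "TOTAL":
            items.append(
                {
                    "receipt_id": row["receipt_id"],
                    "item": match["name"],
                    "price": float(match["price"]),
                }
            )
    return items


semi_clean = to_semi_clean(raw_row)
```

Keeping the raw text alongside the parsed items means the parsing rules can be improved later and re-run without rescanning the receipt.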