Part 1 — Data Structures and Algorithms in Plain English
First, you need to know that this being a series of posts doesn’t mean we’ll be able to touch on everything there is to know about Data Structures and Algorithms. What you should understand is that this series was born out of an observation: many developers/engineers tend to skip this part of the process, content with the gratification of having built what they might call a real-life project with the help of a random video they found on YouTube or an article they discovered on Google. Get this and get it now…
Algorithms and data structures form the pivotal part of any computation which makes understanding them… essential.
Sure, you’ve been able to replicate Netflix… I mean, there are plenty of APIs out there that give you access to video streaming functionality — but do you actually know that much about the file system? Or how a file can be read into data streams? Unfortunately, the Netflix you cloned has been cloned by quite a number of others, since a simple search for the phrase “How to Clone Netflix” on YouTube will return thousands of video tutorials in a matter of seconds — unless your network decides to… you understand 😎.
But think about it for a second… YouTube receives millions of uploads daily, and when you work out how many videos fall under the same category with nearly identical titles, you see that, out of these millions of uploads, tens of thousands of videos delivering the same utility (information) are uploaded every day. With a bit of statistics, one can safely assume that YouTube hosts over 15 billion videos, with that number growing geometrically by the day.
Imagine the engineers at YouTube decided to be like me and made use of a regular for loop. I mean, it does the job, doesn’t it? Yeah! Sure, while the data is still a hundred items or fewer. Beyond that, a loop just doesn’t cut it, because loops are linear, and linear solutions shouldn’t always be your first choice when you’re dealing with a dataset that will scale over time. As the data grows, the number of operations your computer has to perform grows in direct proportion. Now, I know that’s a lot. What I simply mean is that if your computer performed 5 operations for 5 items, then when the data becomes 100 items, it will have to perform 100 operations — that’s just linearity!
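To make that linearity concrete, here’s a quick sketch (in Python, purely for illustration) of the for-loop approach — a linear search that checks every item one by one:

```python
def linear_search(items, target):
    """Scan the list front to back: one comparison per item."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1  # not found after checking every single item

# The work grows in step with the data: searching 100 items
# can take up to 100 comparisons if the target sits at the end.
print(linear_search(list(range(100)), 99))  # worst case: 100 comparisons
```

Scale that list up to 15 billion entries and you can see why nobody at YouTube is reaching for this first.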
It is important that we get a lot of concepts off the ground before actually proceeding to talk about what an Algorithm is and what Data Structures are.
So, if YouTube isn’t making use of a for loop (and a full scan wouldn’t play well with HTTP anyway… I mean, most browsers or clients would time out, right?), how are they able to search 15 billion records in seconds? Now that’s where data structures come in… not even the algorithm part yet.
You see, one approach is that YouTube doesn’t go looking through a library containing all 15 billion videos as you or I might do… rather, they take advantage of classification and labeling, such that each video is classified based on certain properties (entertainment, lifestyle, health, fitness, tech, etc.). With this classification in place, the next thing for YouTube to do is figure out what category your search query falls into — and that’s where the algorithm comes in. YouTube uses NLP (Natural Language Processing) to understand your query and then run your search. Now, since my search query might fall under the tech label, YouTube only has to search among the, I’m guessing, 600 million tech videos in its database. But this is still a very large amount of data!
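The classification idea is easy to sketch. Here’s a toy version (the categories, titles, and the `search` helper are all hypothetical, just to show the principle): bucket the videos by label so a query only ever scans one bucket instead of the whole library.

```python
# Hypothetical sketch: videos grouped by category label.
videos_by_category = {
    "tech": ["Intro to Rust", "Binary Search Explained"],
    "fitness": ["10-Minute Workout"],
    "entertainment": ["Movie Night Recap"],
}

def search(category, query):
    """Scan only the bucket for one category, not the whole platform."""
    return [
        title
        for title in videos_by_category.get(category, [])
        if query.lower() in title.lower()
    ]

print(search("tech", "binary"))  # ['Binary Search Explained']
```

That’s a data-structure decision, not an algorithm: by choosing how the data is organized up front, the amount of work each search has to do shrinks before any clever searching even begins.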
Yes! It is. However, remember that what matters in a computer program is reducing the number of operations the CPU has to perform before returning a result. What YouTube has to focus on now is cutting that number down from the 15 billion operations a for loop would take to something in the range of 50 or 100 operations. Fortunately, this isn’t so difficult to do.
Ever heard of Binary Search? Or the phrase Decrease and Conquer? Well, Binary Search, in plain English, is simply a quick and efficient way of searching an ordered dataset from its midpoint, such that the data gets halved (divided into two) on every operation (iteration) performed. So, if we start with 100 items, the search range shrinks step by step: 100, 50, 25, 12, 6, 3, 1… and then it stops. This means we only have to run a maximum of 7 operations before finding what we’re looking for. Amazing, right? It definitely is! And this is why we all need to understand algorithms, along with the proper way of structuring data for the use case.
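Here’s what that halving looks like in code — a minimal Binary Search sketch in Python that also counts its halvings, so you can watch the “maximum of 7 operations” claim hold up for 100 items:

```python
def binary_search(sorted_items, target):
    """Repeatedly halve the search range of a sorted list.

    Returns (index, steps); index is -1 if the target is absent.
    """
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid, steps
        elif sorted_items[mid] < target:
            low = mid + 1   # discard the lower half
        else:
            high = mid - 1  # discard the upper half
    return -1, steps

index, steps = binary_search(list(range(100)), 99)
print(index, steps)  # found at index 99 after just 7 halvings
```

Note the one precondition the prose mentioned: the data must be ordered. Binary Search is useless on an unsorted pile, which is exactly why how you structure the data matters as much as the algorithm itself.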
Now, back to YouTube: to search the whole 15 billion video records, YouTube would only have to perform a maximum of 34 operations, and with NLP narrowing the data down to 600 million, YouTube only has to perform a maximum of about 30 operations before giving you a search result. Note that I keep using the word maximum, as this refers to worst-case scenarios where what you’re looking for happens to be the last thing found. In many instances it takes far fewer iterations (operations) than that. With a little arithmetic, one can generalize that the number of iterations is given by log₂ N (the base-2 logarithm of N, where N is the number of items).
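You don’t have to take those counts on faith — the worst-case number of halvings for N sorted items is just log₂ N rounded up, which Python can check directly:

```python
import math

# Worst-case halvings needed to binary-search N sorted items: ceil(log2(N)).
for n in (15_000_000_000, 600_000_000, 100):
    print(n, math.ceil(math.log2(n)))
# 15 billion  -> 34 halvings
# 600 million -> 30 halvings
# 100         -> 7 halvings
```

Going from 15,000,000,000 comparisons to 34 is the entire argument of this series in one number.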
And here comes the Big O notation — the concept of space and time complexity, i.e. how a program consumes what rightfully belongs to the CPU… MEMORY (and time)!
Putting the logarithm in a generally accepted notation, we say that the complexity of a Binary Search algorithm is given by:
O(log N) — Worst-case scenario, where the final halving returns what we need.
O(1) — Best-case scenario, where the very first comparison returns what we need.
And here’s one last thing to keep you psyched: a plain Binary Search would have to take on some modifications to be useful in a YouTube-style search, where multiple results need to be returned.
Well, now that we have the whole scene set, I think it’s time for us to dive into the definitions while taking a look at a few code samples.
But, eh eh… No rush. That will be the second part of this series. Interested in getting informed about the next part of this series while being able to communicate with other devs in one community? Then you might want to check out https://abbrefy.xyz/techmovers
PS: I seriously don’t know how long this series will be.