Mathematics is the foundation on which machine learning is built. We will refresh a few basic concepts to get started.


A vector is a construct that represents both a direction and a magnitude.

Algebraically, a vector is the collection of coordinates that a point has in a given space. Geometrically, a vector is a directed line segment that connects the origin to the point.

The following figure shows examples of vectors in two-dimensional Euclidean space.

Image for post

The L2 norm calculates the distance of the vector's coordinates from the origin of the vector space. It is also known as the Euclidean norm as it is calculated as…
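As a rough illustration of the L2 norm described above, here is a minimal Python sketch. It assumes the standard definition of the Euclidean norm (the square root of the sum of the squared coordinates); the function name `l2_norm` is made up for this example.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) norm of a sequence of coordinates.

    Assumes the standard definition: the square root of the sum of
    the squared coordinates, i.e. the distance from the origin.
    """
    return math.sqrt(sum(x * x for x in vector))


# The point (3, 4) lies 5 units away from the origin.
print(l2_norm([3.0, 4.0]))  # 5.0
```

The same function works for any number of dimensions, since it only iterates over the coordinates it is given.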

Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

An Apache Storm cluster runs topologies, which process messages forever.

Image for post

Components of Storm Cluster

A Storm cluster consists of two kinds of nodes (master nodes and worker nodes) along with a coordination service (ZooKeeper).

  • master node — runs a daemon called “Nimbus”, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
  • worker node — runs a daemon called “Supervisor”, which listens for work assigned…

Machine learning (ML) trains a machine so that it can make decisions for us. Automated decision-making of this kind can be achieved with either an expert system or machine learning.

An expert system is a computer system that emulates the decision-making ability of a human expert.

Expert systems are also known as rule-based systems. They emulate how a human makes a decision: humans look at inputs (features) and then, based on previous experience, decide the output. For example, an expert system for disease diagnosis uses a fact database and if-then statements to infer the disease from symptoms.
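The if-then idea above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the rule base, the symptom names, and the `diagnose` function are hypothetical and chosen for illustration, not taken from any real diagnostic system.

```python
# Hypothetical rule base: each rule pairs the symptoms it requires
# with the conclusion it infers (illustrative values only).
RULES = [
    ({"fever", "cough", "body ache"}, "flu"),
    ({"sneezing", "runny nose"}, "common cold"),
]


def diagnose(symptoms):
    """Return every conclusion whose required symptoms are all present.

    Each rule reads as: IF all required symptoms are observed
    THEN infer the conclusion.
    """
    observed = set(symptoms)
    return [conclusion for required, conclusion in RULES if required <= observed]


print(diagnose(["fever", "cough", "body ache", "sneezing"]))  # ['flu']
```

Every rule here is written by hand, which is what separates this approach from machine learning, where the mapping from inputs to outputs is learned from data.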

For many well…

EC2 is one of the most popular AWS offerings. It provides the following capabilities:

  • Renting Virtual Machines (EC2)
  • Storing data on the virtual drives (EBS)
  • Distributing load across machines (ELB)
  • Scaling a service using an Auto Scaling group (ASG)

ssh to the instance

Note: Restrict the certificate file to owner read-only access (chmod 400 certificate_file); ssh refuses to use a private key file that other users can access.

Using certificate location with CLI

ssh -i ~/certificates/aws.pem ec2-user@ec2-instance-url

Alternatively, add the configuration to ~/.ssh/config

Host my-aws-instance
    HostName ec2-instance-url
    User ec2-user
    IdentityFile ~/certificates/aws.pem

Now use the Host alias to ssh

ssh my-aws-instance

User Data

It is possible to bootstrap the instance using a user data script. Bootstrapping means launching commands when a machine starts. This script…

Big data initially meant collecting huge volumes of data and processing them in small, regular batches using distributed computing frameworks such as Apache Spark. Changing business requirements then called for results within minutes or even seconds.

This requirement is met by running the jobs at smaller intervals (micro-batches), sized to how quickly results are needed. Smaller intervals raise several problems: for example, whether to reprocess all the data to produce the result, which is inefficient, or to process only the new data incrementally and combine the new result with the earlier results. How to ensure that records within an interval are available…
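The two options mentioned above (reprocessing everything versus incremental processing) can be contrasted with a small Python sketch. It assumes the result is a simple running sum; the names `full_recompute` and `IncrementalSum` are hypothetical and only illustrate the idea.

```python
def full_recompute(all_batches):
    """Reprocess every record seen so far to produce the result."""
    return sum(sum(batch) for batch in all_batches)


class IncrementalSum:
    """Keep the earlier result and fold in only the newest micro-batch."""

    def __init__(self):
        self.total = 0

    def update(self, batch):
        # Only the new batch is scanned; earlier work is reused.
        self.total += sum(batch)
        return self.total


batches = [[1, 2, 3], [4, 5], [6]]
running = IncrementalSum()
for i, batch in enumerate(batches, start=1):
    # Both strategies agree, but the incremental one touches less data.
    assert running.update(batch) == full_recompute(batches[:i])

print(running.total)  # 21
```

A sum is easy to update incrementally because the new result depends only on the old result and the new records; aggregations without that property need more state to be kept between intervals.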

Initially, software applications were small enough to be deployed on a single computer. Over time, the volume of data processed by these applications grew, and so did the requirements for storage and computing power. These requirements were met by rapid advances in storage and compute hardware: larger disks and faster CPUs. This way of scaling is termed vertical scaling, and it soon became too costly once applications started being consumed over the Internet.

At Internet scale, data processing and compute needs started growing exponentially, which could no longer be solved with vertical scaling. Google solved these problems…

An overview of using Golang modules with CircleCI, which is a continuous integration and delivery platform.

Golang Modules

Golang modules are Golang's dependency management system. They make dependency version information explicit and easier to manage. They let you work from any directory and not just from GOPATH. They allow you to install specific version(s) of a dependency package to avoid breaking changes. The go.mod file lists the dependencies of the project so that the dependencies need not all be distributed with the package.

A module is like a package that you can share with other people. …

Virtualization solutions allow multiple operating systems and applications to run in independent partitions on a single computer. Using virtualization capabilities, one physical computer system can function as multiple “virtual” systems.

Virtualizing a platform implies more than partitioning the processor: it also includes other important components that make up a platform, e.g. storage, networking, and other hardware resources. In this article we will look at the virtualization of some of these components, specific to the Intel platform.

PCI Overview

ISA (Industry Standard Architecture) was the first standard used for buses connecting peripheral devices to the CPU. ISA was developed for 16-bit machines and did its job pretty well…

Here is a typical architecture with Sources, Sinks, a Connect Cluster, a Kafka Cluster and Kafka Streams Applications.

Image for post
Kafka Connect and Streams Architecture

The Kafka Cluster is made up of multiple brokers. There are Source(s) from which we want to get data and put it into the Kafka Cluster. In between sits the Connect Cluster, which is made up of multiple workers. Workers pull data from the Sources [1]: each worker is given a Connector and its corresponding configuration, and uses the logic embedded in the connector. After getting the data, the worker pushes it to the Kafka Cluster [2].

The data may then need to be processed through transformations, aggregations, joins, etc. This is done by using…

Apache Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. Kafka stores streams of records (messages) in topics. Each record consists of a key, a value, and a timestamp. Producers write data to topics and consumers read from topics.

Kafka Overview

Topics, Logs, Partitions and Offsets

A topic refers to a particular stream of data. It is similar to a table in a database. A topic is identified by its name.

For each topic, the Kafka cluster maintains a partitioned log. Topics are split into partitions. …
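To make the idea of a partitioned log more concrete, here is a toy in-memory model in Python. It is not the Kafka client API: the `Topic` class and its `append` method are hypothetical, and the CRC32 hash is only a stand-in for whatever partitioner a real client uses. It assumes that records with the same key go to the same partition and that each record's offset is its position in that partition's log.

```python
import zlib


class Topic:
    """Toy model of a topic: a name plus a list of append-only logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record and return (partition, offset)."""
        # Assumption: choose the partition by hashing the key, so the
        # same key always maps to the same partition. CRC32 is a
        # stand-in, not the hash a real Kafka client uses.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is the record's position within its partition.
        return partition, len(log) - 1


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.append("user-1", "created")
p2, o2 = topic.append("user-1", "paid")
print(p1 == p2, o1, o2)  # True 0 1
```

In this model, ordering holds only within a single partition, since each partition keeps its own independent sequence of offsets.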

Manoj Gupta

Software Developer
