Integrating LVM with Hadoop

Akshit Modi
6 min read · Mar 14, 2021

Before integrating LVM with Hadoop, let's first talk about LVM.

What is LVM?

LVM, or Logical Volume Management, is a storage device management technology that gives users the power to pool and abstract the physical layout of component storage devices for easier and more flexible administration.

Using LVM we can make our storage dynamic. We can combine multiple physical volumes into one single logical hard disk and operate it like a normal hard disk.

Suppose you have two hard disks of 50 GB each, but you have a single file of around 60 GB. You cannot store this file on either hard disk alone. Using LVM we can create a virtual hard disk of 100 GB (or 80–90 GB) by combining these two hard disks. Now we can easily store the file on this combined disk: some part of the data goes to the first physical hard disk and some part to the other.

Behind the scenes, LVM maintains its own metadata (much like a filesystem's inode table) that keeps track of which data is stored on which hard disk.

I know this may sound hard, but once we actually do it, you will see it is exciting and as simple as ABC.

Let's get back to our main concern.

If you are new to Hadoop, you can refer to my earlier blog where I have covered some basics of Hadoop.

We know that the Hadoop master node running the NameNode process takes care of all the filesystem namespace of HDFS, while the slave nodes provide actual storage to store the files and folders.

In this blog we are focusing on how to integrate LVM with Hadoop.

For simplicity, we are going to take a small setup.

Suppose that you have two hard disks inserted into your machine. The first hard disk is 20 GB and the other is 40 GB. While configuring this machine as the slave system, you decide to donate space from a partition made on the second hard disk (40 GB). For the sake of simplicity, assume that there is only one partition on this hard disk, and it is almost the size of the disk, a little less because of the space reserved for storing metadata. Also assume that the cluster currently consists of only one machine, i.e. it is a single-node cluster.

Here you are contributing 40 GB to the Hadoop cluster, but a client wants to upload a 43 GB file. As you know, in this case the upload will fail. Then you remember the other hard disk of 20 GB, which is not being used at all. But how do you use it, when there is one single 43 GB file? This is where LVM comes in.

So, let’s get started.

Here we have two hard disks and we want to create one single logical hard disk out of them.

Note: I am running this practical on Red Hat Linux (RHEL) and have already attached two hard disks to it.

Step 1: Create a physical volume

You can see that I have two devices here, /dev/sdb and /dev/sdc.

For demo purposes I have used two hard disks of 4 GB each.
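
If you want to see the attached disks on your own machine, lsblk (or fdisk -l) lists all block devices; the device names below are from my setup and may differ on yours.

[root@localhost ~]# lsblk
[root@localhost ~]# fdisk -l /dev/sdb /dev/sdc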

Let's create physical volumes out of them.

[root@localhost ~]# pvcreate <hd_name>
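
For example, with the two demo disks shown above (device names taken from my setup, adjust them to yours), and pvdisplay to verify the result:

[root@localhost ~]# pvcreate /dev/sdb /dev/sdc
[root@localhost ~]# pvdisplay /dev/sdb /dev/sdc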

Step 2: Create a volume group.

A volume group basically combines the two hard disks into one single pool of storage. So at the end of this step we have one single virtual hard disk of 8 GB.

[root@localhost ~]# vgcreate <vg_name> <hd1_name> <hd2_name>

Here we must use the same hard disks on which we created the physical volumes above.
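
In my setup the volume group is named datastore (the name used in the rest of this post), so the concrete command looks like this, again assuming the same two demo disks:

[root@localhost ~]# vgcreate datastore /dev/sdb /dev/sdc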

Using vgdisplay we can check all the volume groups that we (or the system) have created. For a specific VG, pass the VG name.

[root@localhost ~]# vgdisplay datastore

Step 3: Create logical volume.

In this step we are going to carve a partition (a logical volume) out of the volume group.

[root@localhost ~]# lvcreate --size <size_of_partition> --name <name_of_partition> <vgname>
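
For example, to create the 6 GB logical volume named store1 inside the datastore volume group that the later commands use, and to verify it with lvdisplay:

[root@localhost ~]# lvcreate --size 6G --name store1 datastore
[root@localhost ~]# lvdisplay /dev/datastore/store1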

Step 4: Format partition.

As you all know, we have to format a partition before we can use it. Here we are going to format it with the ext4 filesystem.

[root@localhost ~]# mkfs.ext4 /dev/<vgname>/<lvname>
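
With the names used in this demo, the command becomes:

[root@localhost ~]# mkfs.ext4 /dev/datastore/store1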

Step 5: Mount to the folder.

As you know, we have to mount it, since we cannot interact with the device directly.

The steps above are basic LVM concepts, but now we want to donate this storage to the datanode, so we have to mount it accordingly.

I have already created the directory /hadoop/hadoopdata/hdfs/datanode and then mounted the partition there.

[root@localhost ~]# mount /dev/datastore/store1 /hadoop/hadoopdata/hdfs/datanode

Let's check whether it is working or not.
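
A quick way to verify is df -h on the mount point, which should show the partition mounted with roughly 6 GB of space:

[root@localhost ~]# df -h /hadoop/hadoopdata/hdfs/datanode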

Yeah!! We got our 6 GB hard disk, and now we can donate it to the Hadoop cluster.

To summarize the concept visually: the two physical volumes are pooled into the datastore volume group, and the store1 logical volume carved out of it is mounted on the datanode directory.

Let's donate this storage to the Hadoop cluster.

For this, you have to add this mount point as the datanode's data directory in the hdfs-site.xml file (in our single-node setup, the master and slave are the same machine).
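
A minimal sketch of the relevant entry, assuming a Hadoop 2.x/3.x setup where the datanode data directory is configured with the dfs.datanode.data.dir property; the path is the mount point we created above, so adjust it to your own directory:

<property>
  <!-- data directory backed by the LVM mount created above -->
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hadoopdata/hdfs/datanode</value>
</property>

After editing hdfs-site.xml, restart the datanode daemon so that it picks up the new data directory.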

Let's check the HDFS report.

[root@localhost ~]# hdfs dfsadmin -report

You can see we have got around 6 GB of capacity in the Hadoop cluster.

Well, the most important question is: why are we using LVM and following this long approach?

The most obvious reason is that it lets you derive space from more than one physical hard disk. But, another significant advantage of using Logical Volumes is that they create dynamic partitions.

What does that mean?

Let's say we want to increase our hard disk (logical volume) size from 6 GB to 7 GB.

With LVM we can extend the size of the logical volume on the fly, and it will not affect the data already stored on it.

When we extend the partition size, storage is derived from the volume group. Also, one can reduce logical volumes, in which case the extra storage is returned to the Volume Group.

Let’s go for it…

Step 1: Extend the logical volume.

[root@localhost ~]# lvextend --size +1G /dev/datastore/store1

Step 2: Resize the filesystem.

Unlike a fresh format, resize2fs grows the existing ext4 filesystem in place, so it will not delete the data already stored on the partition.

[root@localhost ~]# resize2fs /dev/datastore/store1
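
To confirm that the filesystem really grew, check the mount point again; it should now show roughly 7 GB:

[root@localhost ~]# df -h /hadoop/hadoopdata/hdfs/datanode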

And now let's check the cluster capacity again from the master node…
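
The same report command as before shows the updated configured capacity (assuming the datanode is still running):

[root@localhost ~]# hdfs dfsadmin -report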

So now the HDFS slave sharing a logical volume is truly a master of its own storage: it can decide when it wants to donate more space and when to reclaim unused space from the LV.

Also, did you notice that the root volume of our RHEL system is itself a logical volume? So we can use this same approach to extend our local virtual machine's storage. It can save us a lot of time!

Anyway, I hope you enjoyed reading this blog!
