
Hadoop Notes

I just started a new job and need to learn about the “big data” ecosystem. I’m writing this post to reinforce what I learn, improve my retention, and create a reference for myself (and others?) to use in the future.

What is Hadoop?

  • Tries to bring compute as close to the data as possible, which reduces the amount of data moved over the network and improves overall throughput
  • Consists of two parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce

HDFS

The Hadoop Distributed File System is a file system spread across a cluster of machines. What makes it special is that it splits input files into smaller blocks and replicates those blocks across the cluster to improve fault tolerance.
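To make the chunk-and-replicate idea concrete, here is a toy sketch in plain Python. It is not how HDFS is implemented: the function names, the DataNode names, and the round-robin placement are all assumptions made up for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
# Toy model of HDFS-style chunking and replication (illustration only).
# All names here are hypothetical; none of this is the Hadoop API.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin stands in for a real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement: dict[int, list[str]], failed_node: str) -> dict[int, list[str]]:
    """Show which replicas of each block remain after one node fails."""
    return {
        block_id: [n for n in holders if n != failed_node]
        for block_id, holders in placement.items()
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
after_failure = surviving_replicas(placement, failed_node="dn2")
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block, which is the fault-tolerance property described above.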

MapReduce

MapReduce is a programming model that applies, or “maps”, a function across a number of records, and then consolidates, or “reduces”, the results. In Hadoop, it runs on top of HDFS, reading its input from and writing its output to the file system.
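The classic way to see map and reduce in action is a word count. The sketch below runs on a single machine in plain Python and only mimics the shape of the model; the function names are my own, and it does not use Hadoop's actual MapReduce API.

```python
# Single-machine sketch of the map -> shuffle -> reduce flow (illustration only).
from collections import defaultdict


def map_phase(record: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: consolidate all values for one key into a single result."""
    return key, sum(values)


def word_count(records: list[str]) -> dict[str, int]:
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the cat", "The dog"])
```

On a real cluster the map calls would run in parallel on the nodes that hold each block, and the shuffle step would move the grouped pairs over the network to the reducers.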

Terminology

  • NameNode – A master server that manages the file system namespace and regulates access to files. Executes file system namespace operations like opening, closing, and renaming files and directories. Also determines the mapping of blocks to DataNodes.
  • DataNode – A worker that manages the storage (blocks of chunked-up files) attached to the node it runs on. Responsible for serving read and write requests from the file system’s clients.
  • NameNode and DataNode are pieces of software that can run on a variety of machines; they are not the machines themselves.
  • YARN – Yet Another Resource Negotiator. Manages the cluster’s resources: takes the jobs submitted to Hadoop and schedules their work across the cluster.

CLI Commands

Important Files

  • core-default.xml
    • Read-only default Hadoop core settings.
  • core-site.xml
    • Site-specific core Hadoop settings that override core-default.xml. You can add your own parameters here.
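To illustrate the override relationship between the two files, here is a small Python sketch that merges two configuration documents in Hadoop's `<configuration>`/`<property>` XML layout. The helper names and the NameNode host are hypothetical, the XML snippets are trimmed-down examples rather than the real files, and the merge ignores anything beyond simple name/value overrides.

```python
# Sketch of "site settings override defaults" (illustration only, not Hadoop code).
import xml.etree.ElementTree as ET

# Trimmed example in the style of core-default.xml.
CORE_DEFAULT = """<configuration>
  <property><name>fs.defaultFS</name><value>file:///</value></property>
  <property><name>io.file.buffer.size</name><value>4096</value></property>
</configuration>"""

# Example core-site.xml; the host name below is made up.
CORE_SITE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://namenode.example.com:8020</value></property>
</configuration>"""


def parse_properties(xml_text: str) -> dict[str, str]:
    """Read every <property> into a name -> value dictionary."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def effective_config(default_xml: str, site_xml: str) -> dict[str, str]:
    """Start from the defaults, then let site-specific values win."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))
    return merged


config = effective_config(CORE_DEFAULT, CORE_SITE)
```

Here `fs.defaultFS` ends up with the site value, while `io.file.buffer.size` keeps its default because core-site.xml does not mention it.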

Good Resources

Published in Today I Learned