Primer on Security in a Data System

Any system has multiple aspects to its security. Hadoop-related projects address these concerns differently, using different frameworks and libraries, but they all try to cover the same common concerns to varying degrees. Some of these aspects are normally handled externally, while others need solutions within the system.

Perimeter security

Perimeter security is mostly an external concern: access to the system is restricted by rules enforced with firewalls or iptables. Proxies and gateways such as Apache Knox also belong in this category. These measures make sure that the system is accessible only from authorized machines. Perimeter security also involves authentication and activity monitoring on the client machines to prevent unauthorized access and unwanted activity. The level of perimeter security is normally dictated by corporate policy and the sensitivity of the data.
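As a rough illustration of the "accessible only from authorized machines" idea, here is a minimal sketch of an allowlist check of the kind a firewall or gateway rule expresses. The network ranges and the function name are made up for the example; real deployments would enforce this in firewall or iptables rules rather than in application code.

```python
import ipaddress

# Hypothetical allowlist of networks permitted to reach the cluster.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.5.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowlisted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("10.20.3.7", "203.0.113.9"):
        print(ip, "allowed" if is_allowed(ip) else "blocked")
```

Anything not explicitly listed is rejected, which mirrors the default-deny stance that perimeter rules usually take.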



Authentication

Authentication is the process of identifying a user and verifying that identity. Depending on the number of factors used, authentication can be single-factor or multi-factor. In the Hadoop world, Kerberos is the most common form of authentication. For a detailed list of how different systems authenticate users, please refer to Authentication Methods across Hadoop projects.


Authorization

Authorization is the process of determining whether a user can perform an activity on the system. The activity could be writing to a file, submitting a job, accessing a column, and so on. In Hadoop projects, authorization is achieved by keeping a map between operations and users/groups. The user/group information is normally obtained using LDAP.
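The map between operations and groups can be sketched in a few lines. This is only an illustration: the operation names, group names, and the in-memory USER_GROUPS table (standing in for an LDAP lookup) are all assumptions made for the example, not the format any particular Hadoop project uses.

```python
# Hypothetical map from an operation to the groups allowed to perform it.
PERMISSIONS = {
    "write_file": {"etl", "admins"},
    "submit_job": {"analysts", "etl", "admins"},
    "read_column:salary": {"hr"},
}

# Stand-in for a directory lookup; a real system would query LDAP here.
USER_GROUPS = {
    "alice": {"analysts"},
    "bob": {"etl"},
}


def is_authorized(user: str, operation: str) -> bool:
    """Allow the operation only if the user shares a group with it."""
    groups = USER_GROUPS.get(user, set())
    allowed = PERMISSIONS.get(operation, set())
    # Unknown users and unknown operations both yield empty sets: deny.
    return bool(groups & allowed)


if __name__ == "__main__":
    print(is_authorized("alice", "submit_job"))
    print(is_authorized("alice", "write_file"))
```

Keeping permissions on groups rather than on individual users means the map stays small, and membership changes happen in the directory instead of in the data system.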


Auditing

Auditing normally refers to keeping a record of who did what and when. Activity monitoring can also be considered a function of auditing. Most Hadoop projects record audit logs separately from the regular operational logs. There are many products that can perform activity monitoring. I personally worked on Apache Eagle, a project designed to consume audit logs and generate alerts based on predefined rules.
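To make "who did what and when" and rule-based alerting concrete, here is a small sketch. The AuditEvent fields, the sensitive path prefix, and the single rule are invented for illustration; this is not Apache Eagle's rule format or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    user: str            # who
    action: str          # what
    resource: str        # on which resource
    timestamp: datetime  # when


# Hypothetical predefined rule: flag deletes under a sensitive path.
SENSITIVE_PREFIX = "/data/finance/"


def find_alerts(events):
    """Return the events that match the predefined rule."""
    return [
        event
        for event in events
        if event.action == "delete"
        and event.resource.startswith(SENSITIVE_PREFIX)
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    log = [
        AuditEvent("alice", "open", "/data/finance/q1.csv", now),
        AuditEvent("bob", "delete", "/data/finance/q1.csv", now),
        AuditEvent("bob", "delete", "/tmp/scratch", now),
    ]
    for event in find_alerts(log):
        print(f"ALERT: {event.user} {event.action} {event.resource}")
```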

Data Protection


Data protection refers to protecting data both in transit and when stored. The metadata and operational data also need to be protected. Protection consists of both integrity and confidentiality (encryption). The Hadoop project itself supports data protection almost completely, but the related projects support it to varying levels.
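The integrity half of data protection can be illustrated with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function names and the key are assumptions for the example, and this is not a description of how Hadoop itself implements integrity checks. Confidentiality would additionally require encryption, which is not shown here.

```python
import hashlib
import hmac


def sign(key: bytes, data: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the data."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(key: bytes, data: bytes, tag: str) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, data), tag)


if __name__ == "__main__":
    key = b"example-shared-secret"  # hypothetical key for the demo
    block = b"row-1,row-2,row-3"
    tag = sign(key, block)
    print(verify(key, block, tag))           # untouched data
    print(verify(key, block + b"!", tag))    # tampered data
```

The receiver recomputes the tag with the shared key, so any modification of the data in transit or at rest causes verification to fail.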


About the Author: Benoy Antony

I am an Apache Hadoop committer and have been working as an engineer/architect at companies like eBay and PayPal. Please check my LinkedIn profile for the full details.
