
It’s good that Spark Security is turned off by default

Security in Spark is OFF by default, which means you are fully responsible for security from day one. Spark supports a variety of deployment types, each with its own security considerations. Not every deployment type is safe in every scenario, and none is secure by default. Take the time to analyze your situation, what Spark can do, and how you can secure your Spark installation.

Know your Algorithms

AES (Advanced Encryption Standard) is a symmetric-key encryption algorithm that is used to encrypt and decrypt data.

RSA (Rivest-Shamir-Adleman) is an asymmetric-key encryption algorithm used to encrypt and decrypt data.

HMAC (Hash-based Message Authentication Code) is a message integrity method that creates a checksum of the data using a cryptographic hash function. The checksum is then used to verify the integrity of the data.

Symmetric-key encryption uses the same key to encrypt and decrypt data. Asymmetric-key encryption uses different keys to encrypt and decrypt data. Hashes are one-way algorithms that take an input of any length and produce a fixed-size output.
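
To make the distinction concrete, here is a minimal JVM sketch using the standard javax.crypto API (this is plain Java crypto, not a Spark API; the object name and sample message are illustrative only):

    import java.nio.charset.StandardCharsets
    import javax.crypto.spec.SecretKeySpec
    import javax.crypto.{Cipher, KeyGenerator, Mac}

    object CryptoBasics {
      def main(args: Array[String]): Unit = {
        val message = "example payload".getBytes(StandardCharsets.UTF_8)

        // Symmetric (AES): one secret key both encrypts and decrypts.
        val aesKey = KeyGenerator.getInstance("AES").generateKey()
        val cipher = Cipher.getInstance("AES")
        cipher.init(Cipher.ENCRYPT_MODE, aesKey)
        val encrypted = cipher.doFinal(message)
        cipher.init(Cipher.DECRYPT_MODE, aesKey)
        val decrypted = cipher.doFinal(encrypted)
        println(new String(decrypted, StandardCharsets.UTF_8)) // round-trips to the original text

        // HMAC: a keyed, one-way checksum of fixed size, used to verify integrity.
        val mac = Mac.getInstance("HmacSHA1")
        mac.init(new SecretKeySpec(aesKey.getEncoded, "HmacSHA1"))
        println(s"HMAC is always ${mac.doFinal(message).length} bytes for HmacSHA1") // 20 bytes
      }
    }

Note how the HMAC output length is fixed regardless of the input size, while the AES ciphertext can be decrypted back to the original only with the same key.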

Know your protocols

Kerberos is a network authentication protocol that uses secret-key cryptography to authenticate users and services on a network. It is intended to prevent unauthorized access to network resources.

SSL (Secure Sockets Layer) is an encryption security protocol. With SSL, the server has a certificate which it presents to the client for authentication.

TLS (Transport Layer Security) is the successor to SSL. The server still presents its certificate to the client; optionally, the client can also present its own certificate to the server for mutual authentication.

Keystores are used by the server side of a TLS/SSL client-server connection. A keystore typically contains one private key for the host system.

Truststores are used by the client side of a TLS/SSL client-server connection. A truststore contains no private keys, but it may contain the root certificates of public certificate authorities.

SSL Configuration

SSL is configured hierarchically. The user can define default settings that apply to all supported protocols, and protocol-specific settings then override those defaults. Set ${ns}.enabled to true and select a TLS protocol with ${ns}.protocol, where ${ns} is the configuration namespace (spark.ssl for the defaults). Consider how you will manage your keystore and truststore, and whether or not to change ${ns}.needClientAuth from false to true.
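
As a rough sketch, the defaults namespace might be populated like this in application code (the keystore and truststore paths and passwords are placeholders; in practice these properties are more commonly set in spark-defaults.conf):

    import org.apache.spark.SparkConf

    // ${ns} here is spark.ssl, the shared defaults; protocol-specific namespaces
    // such as spark.ssl.ui can override any of these values.
    val sslConf = new SparkConf()
      .set("spark.ssl.enabled", "true")
      .set("spark.ssl.protocol", "TLSv1.2")
      // Server side of the connection: the private key lives in the keystore.
      .set("spark.ssl.keyStore", "/path/to/keystore.jks")         // placeholder path
      .set("spark.ssl.keyStorePassword", "keystore-password")     // placeholder secret
      // Client side of the connection: trusted CA certificates live in the truststore.
      .set("spark.ssl.trustStore", "/path/to/truststore.jks")     // placeholder path
      .set("spark.ssl.trustStorePassword", "truststore-password") // placeholder secret
      // Set to true to require clients to authenticate with their own certificates.
      .set("spark.ssl.needClientAuth", "false")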

Spark RPC

RPC refers to the communication protocols used between Spark processes. For authentication, set the spark.authenticate configuration parameter to true. Distribution of the shared secret depends on the deployment. YARN uses YARN RPC encryption to distribute the shared secret. Kubernetes automatically generates a secret for each application and propagates it to the executor pods.

Spark also supports AES-based encryption for RPC connections so that data remains protected in transit between nodes.
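
A minimal sketch of turning both features on follows; the secret value is a placeholder, and on YARN and Kubernetes the secret is generated and distributed for you, so spark.authenticate.secret is usually unnecessary there:

    import org.apache.spark.SparkConf

    val rpcConf = new SparkConf()
      // Require the shared secret before Spark processes will talk to each other.
      .set("spark.authenticate", "true")
      // Only needed outside YARN/Kubernetes, e.g. standalone mode; value is a placeholder.
      .set("spark.authenticate.secret", "change-me")
      // Enable AES-based encryption for RPC traffic between nodes.
      .set("spark.network.crypto.enabled", "true")
      .set("spark.network.crypto.keyLength", "128")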

Local Storage

Spark supports encrypting temporary data written to local disks. This covers shuffle files, shuffle spills, and data blocks saved on disk. Set the spark.io.encryption.enabled property to true (it is false by default). The default spark.io.encryption.keygen.algorithm value is HmacSHA1.
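
As a sketch, local disk I/O encryption can be enabled alongside the key-size and key-generation defaults, shown here explicitly:

    import org.apache.spark.SparkConf

    val ioConf = new SparkConf()
      // Encrypt shuffle files, shuffle spills, and cached blocks written to local disk.
      .set("spark.io.encryption.enabled", "true")
      // Defaults shown explicitly; the encryption key is generated per application.
      .set("spark.io.encryption.keySizeBits", "128")
      .set("spark.io.encryption.keygen.algorithm", "HmacSHA1")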

Conclusion

Authentication, confidentiality, and data integrity/authenticity are the three fundamental components of Spark security. Authentication ensures that only authorized users and services can connect. Confidentiality ensures that only parties holding the correct decryption keys can read the data. Data integrity and authenticity ensure that a message has not been altered in transit from source to destination(s) and that it really came from the claimed sender. In any Spark-based security scenario, every one of these components is vital.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
