Security features in Spark, such as authentication, are OFF by default, which means you are responsible for securing your cluster from Day One. Spark supports a variety of deployment types, and each supports a different level of security. Not every deployment type is safe in every environment, and none is secure by default. Take the time to evaluate your environment, what Spark supports, and the measures needed to secure your Spark deployment.
Know your Algorithms
AES (Advanced Encryption Standard) is a symmetric-key encryption algorithm that is used to encrypt and decrypt data.
RSA (Rivest-Shamir-Adleman) is an asymmetric-key encryption algorithm used to encrypt and decrypt data.
HMAC (Hash-based Message Authentication Code) is a message integrity method that combines a secret key with a cryptographic hash function to produce an authentication tag for the data. The receiver recomputes the tag with the same key to verify that the data has not been altered and that it comes from someone who holds the key.
Symmetric-key encryption uses the same key to encrypt and decrypt data. Asymmetric-key encryption uses a pair of different keys, one to encrypt and one to decrypt. Hashes are one-way functions that take an input of any length and produce a fixed-size output.
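As a minimal sketch of the HMAC and hashing ideas above, the following uses only the Python standard library. The key and message values are hypothetical, and SHA-256 is chosen here purely for illustration.

```python
import hashlib
import hmac

# Hypothetical key and message, for illustration only.
secret_key = b"shared-secret"
message = b"shuffle block 42"


def sign(key: bytes, data: bytes) -> bytes:
    """Return the HMAC-SHA256 tag for data under key."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(key, data), tag)


tag = sign(secret_key, message)

# A hash has a fixed output size regardless of the input length.
short_digest = hashlib.sha256(b"a").digest()
long_digest = hashlib.sha256(b"a" * 10_000).digest()
```

Calling verify with the original message returns True, while any change to the message or the key makes it return False.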
Know your Protocols
Kerberos is a network authentication protocol that uses secret-key cryptography to authenticate users and services on a network. It is intended to prevent unauthorized access to network resources.
SSL (Secure Sockets Layer) is an encryption protocol for securing network connections. It is now deprecated in favor of TLS, although the name SSL is still widely used for both.
TLS (Transport Layer Security) is the successor to SSL. In both protocols, the server presents a certificate to the client to prove its identity. The client presents a certificate of its own only when the server requests client authentication.
Keystores are used by the server side of a TLS/SSL client-server connection. A keystore typically contains one private key, with its certificate, for the host system.
Truststores are used by the client side of a TLS/SSL client-server connection. A truststore contains no private keys, but it may contain root certificates for public certificate authorities, which the client uses to decide whether to trust the server's certificate.
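The client and server roles described above can be sketched with Python's standard ssl module. This is an illustration of the TLS concepts only, not how Spark (which runs on the JVM and uses Java keystores) is configured. The function names and file path parameters are hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    # create_default_context() loads the system's trusted CA certificates,
    # which play the role of a truststore on the client side.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    # Refuse the obsolete SSL and early TLS protocol versions.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def make_server_context(cert_file: str, key_file: str,
                        require_client_cert: bool = False) -> ssl.SSLContext:
    # The certificate and private key play the role of the keystore entry.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if require_client_cert:
        # Ask the client for a certificate too (mutual authentication).
        context.verify_mode = ssl.CERT_REQUIRED
    return context


client_context = make_client_context()
```

The client context verifies the server's certificate and host name by default. The server context only asks for a client certificate when require_client_cert is set.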
SSL Configuration
SSL configuration is organized hierarchically. You can define default SSL settings that apply to all supported protocols unless protocol-specific settings override them. In the property names below, ${ns} stands for the configuration namespace, such as spark.ssl for the defaults. Set ${ns}.enabled to true and choose a TLS protocol version with ${ns}.protocol. Consider how you will provision your keystore and truststore, and whether or not to change ${ns}.needClientAuth from false to true.
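The hierarchical lookup can be pictured with a short sketch. This is not Spark's own code. The helper resolve_ssl_setting and the sample values are hypothetical, and the sketch only mimics the rule that a protocol-specific namespace (here spark.ssl.ui) falls back to the spark.ssl defaults.

```python
from typing import Dict, Optional

DEFAULT_NS = "spark.ssl"


def resolve_ssl_setting(conf: Dict[str, str], ns: str, key: str) -> Optional[str]:
    """Look up ns.key, falling back to the spark.ssl default for key."""
    specific = f"{ns}.{key}"
    if specific in conf:
        return conf[specific]
    return conf.get(f"{DEFAULT_NS}.{key}")


# Hypothetical settings: defaults for every protocol, plus one override.
conf = {
    "spark.ssl.enabled": "true",
    "spark.ssl.protocol": "TLSv1.2",
    "spark.ssl.needClientAuth": "false",
    "spark.ssl.ui.needClientAuth": "true",  # override for the web UI only
}

ui_protocol = resolve_ssl_setting(conf, "spark.ssl.ui", "protocol")
ui_client_auth = resolve_ssl_setting(conf, "spark.ssl.ui", "needClientAuth")
```

In this sample, the UI inherits the protocol from the defaults but uses its own needClientAuth value.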
Spark RPC
RPC refers to the communication protocols used between Spark processes. To enable authentication, set the spark.authenticate configuration parameter to true. How the shared secret is distributed depends on the deployment. On YARN, distribution of the shared secret relies on YARN RPC encryption being enabled. On Kubernetes, Spark automatically generates a secret for each application and propagates it to the executor pods.
Spark supports AES-based encryption for RPC connections, which protects data in transit between nodes. RPC encryption requires authentication to be enabled as well.
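A minimal sketch of how these RPC settings might be passed on the command line follows. The helper names are hypothetical. The property spark.network.crypto.enabled is the switch for AES-based RPC encryption in the Spark configuration reference, so check it against the documentation for your Spark version.

```python
from typing import Dict, List


def rpc_security_conf(encrypt: bool = True) -> Dict[str, str]:
    """Build the RPC authentication (and optional encryption) settings."""
    conf = {"spark.authenticate": "true"}
    if encrypt:
        # AES-based RPC encryption; it requires authentication to be on.
        conf["spark.network.crypto.enabled"] = "true"
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn settings into spark-submit style --conf key=value arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(rpc_security_conf())
```

The resulting list can be appended to a spark-submit invocation, for example through subprocess, alongside the application's own arguments.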
Local Storage
Spark can encrypt temporary data written to local disks. This includes shuffle files, shuffle spills, and data blocks stored on disk. To enable it, set the spark.io.encryption.enabled property to true (it is false by default). The default spark.io.encryption.keygen.algorithm value is HmacSHA1.
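The local storage settings could be collected into spark-defaults.conf entries along these lines. The helper names are hypothetical. The spark.io.encryption.keySizeBits property and its allowed values (128, 192, 256) are taken from the Spark configuration reference rather than from the text above, so verify them for your version.

```python
from typing import Dict

VALID_KEY_SIZES = (128, 192, 256)


def local_encryption_conf(key_size_bits: int = 128,
                          keygen_algorithm: str = "HmacSHA1") -> Dict[str, str]:
    """Build the settings that turn on local disk I/O encryption."""
    if key_size_bits not in VALID_KEY_SIZES:
        raise ValueError(f"key size must be one of {VALID_KEY_SIZES}")
    return {
        "spark.io.encryption.enabled": "true",
        "spark.io.encryption.keySizeBits": str(key_size_bits),
        "spark.io.encryption.keygen.algorithm": keygen_algorithm,
    }


def render_defaults(conf: Dict[str, str]) -> str:
    """Render settings as 'key value' lines, the spark-defaults.conf layout."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


defaults_text = render_defaults(local_encryption_conf())
```

Printing defaults_text gives three lines that can be pasted into spark-defaults.conf.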
Conclusion
Authentication, confidentiality, and data integrity/authenticity are the three fundamental components of Spark security. Authentication ensures that only verified users and services can connect to your cluster. Confidentiality, provided by encryption, ensures that only parties holding the correct decryption keys can read the data. Integrity and authenticity ensure that a message arrives unaltered and really comes from the claimed source. In any security scenario based on Spark, every component is vital.