Technical Content / Procedure Example
Technical Content + Procedures
This content was originally created in markdown, stored in git, then rendered through Gatsby. The company may not use this content today. The content is available via web output at the links below, and I have copied/pasted part of the configuration content below as well. Please note that the copied/pasted content does not retain formatting.
● Who is the target audience? Network administrators.
● How much of the content did you write? At least 75%.
● Does the document represent original writing or existing content that you revised? Original writing, mostly. There may have been some pieces pulled out of previous content or internal engineering content.
● How did you source the content? Because configuration can be complex, I met directly with the engineers to ensure I got this right and didn't miss anything. I also used specs, Jira tickets, Confluence pages, and the customer success team to help with making sure I got it right.
● Did you follow a style guide? Yes. Absolutely. I created one for this company when I arrived and used it for all content.
● Was the document edited? No.
● How did you obtain any code samples? Usually directly from the engineers, at least initially. Then I verified it on my own.
● How did you verify the content? I successfully tested it myself. I double-checked with the engineers and triple-checked with the solutions consultants to ensure there wasn't anything I was missing.
Configure Standalone Spark
Standalone Spark is bundled with Incorta, so no installation is necessary. You can configure standalone Spark to meet your requirements.
By default, Spark is installed on the same node as Incorta; however, in most cases, you must install Spark on a separate server.
1. Verify that the following files use the correct machine name, CNAME, localhost, or IP address (see the example after this procedure). These files are found in the /IncortaNode directory:
● spark/conf/spark-env.sh - the SPARK_MASTER_IP= value
● spark/conf/spark-defaults.conf - the spark.master value: spark://<host>:7077
2. Verify that the Spark master URL is set in the cluster management
console (CMC) under System Configuration > Server Configs >
Spark Integration > Spark master URL.
3. Create a Spark home environment variable. In Linux, set the environment variable in your .bash_profile file. For example: export SPARK_HOME=/home/incorta/IncortaAnalytics/IncortaNode/spark
4. Configure options in Spark. For details, see Configuration Options.
5. Start and validate Spark. For details, see Start and Validate Standalone Spark.
6. Test Spark. For details, see Test Spark.
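For example, if the Spark master runs on a host named spark-master.example.com (a hypothetical name; substitute your own machine name, CNAME, or IP address), the entries in the two files from step 1 would look like this:

# /IncortaNode/spark/conf/spark-env.sh
SPARK_MASTER_IP=spark-master.example.com

# /IncortaNode/spark/conf/spark-defaults.conf
spark.master spark://spark-master.example.com:7077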
Configuration Options
Set the following configuration option in the
/IncortaNode/spark/conf/spark-env.sh file:
SPARK_WORKER_MEMORY : For standalone Spark, this sets the total
memory that can be used by the Spark instance on this machine.
Set the following configuration option in the
/IncortaNode/spark/conf/spark-defaults.conf file:
spark.cores.max : The maximum total number of cores a Spark application can use to run a job.
spark.executor.cores : The number of cores assigned to each executor. This number must be equal to or less than spark.cores.max, and the total number of executors allowed to run is floor(spark.cores.max / spark.executor.cores).
spark.sql.shuffle.partitions : Defaults to 200. The best setting varies depending on the materialized view; you can find it by testing. If this value is set too low and the data volume is high, the executor may not have enough memory to complete the job.
spark.driver.memory : The memory for the driver that Incorta services use to communicate with Spark. You may need to increase this value from the default of 4GB for higher data volumes.
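As an illustrative sketch, these options are plain key-value pairs in /IncortaNode/spark/conf/spark-defaults.conf; the values below are assumptions for a small deployment, not recommendations:

# Cap the job at 8 cores, 2 cores per executor: floor(8 / 2) = 4 executors
spark.cores.max 8
spark.executor.cores 2
# The default; tune per materialized view
spark.sql.shuffle.partitions 200
# Raised from the 4GB default to handle higher data volumes
spark.driver.memory 8g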
Set the following optional configuration options (see the sketch after this list):
spark.worker.cleanup.enabled : Enables periodic cleanup of worker and application directories. Disabled by default. Set to true to enable it.
spark.worker.cleanup.interval : The frequency, in seconds, at which the worker cleans up old application work directories. The default is 1800 (30 minutes). Modify the value as you deem appropriate.
spark.worker.cleanup.appDataTtl : Controls how long, in seconds, to retain application work directories. The default is 7 days, which is generally inadequate if Spark jobs are run frequently. Modify the value as you deem appropriate.
Spark tmp directory : Incorta works best when you use a local disk instead of a shared mounted NFS disk to store temporary files. Ensure that the tmp folder has at least 5GB of free space.
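In stock Apache Spark, the worker cleanup properties above are typically passed to the worker process through SPARK_WORKER_OPTS in spark-env.sh. A minimal sketch, assuming a 30-minute cleanup interval and one-day retention (adjust both to your environment):

# /IncortaNode/spark/conf/spark-env.sh
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=86400"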
Start and Validate Standalone Spark
After you change the Spark master URL in the CMC under
System Configuration > Server Configs > Spark Integration >
Spark master URL:
1. Restart all Loader and Analytics services.
2. Use the startSpark.sh script to start the Master and Worker Spark processes.
3. Validate the Spark Master by opening the Master Web UI in a browser at http://<spark-master-host>:9091 and confirming that the page loads and a worker is running.
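A minimal sketch of steps 2 and 3 from a shell, assuming startSpark.sh resides in the IncortaNode directory used earlier in this article (both assumptions; adjust for your installation):

cd /home/incorta/IncortaAnalytics/IncortaNode
./startSpark.sh
# Then browse to http://<spark-master-host>:9091 and confirm a worker shows a state of ALIVE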
Test Spark
Once you have Spark up and running, create a materialized view in
Incorta that uses a pySpark script to test that Spark is working. You
must have at least one schema and a table defined to perform this
test.
1. Navigate to the Schemas tab of the Schema page.
2. Select the schema where you want to add a materialized view.
3. Select + New, then Materialized View.
4. Select Python from the Language drop-down list.
5. Paste the following script in the Script field. The most basic Spark Python script reads a table in Incorta and saves it, duplicating the table.

df = read("SCHEMA.TABLE")
save(df)
6. Select Add Property to provide the "key" and "value" in the respective fields.
7. Select Add.
8. Perform a full load on the materialized view to start the Spark job and load
the materialized view table (Load Table).
9. Verify that the materialized view contains the same number of rows/columns
as the source SCHEMA.TABLE that you duplicated.
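As a slightly richer test, the script can transform the data before saving. The sketch below assumes a hypothetical SALES.ORDERS table with a STATUS column; read() and save() are the same Incorta materialized view helpers used above, and the filter step is standard PySpark DataFrame API:

# Read an Incorta table into a Spark DataFrame (hypothetical schema and table)
df = read("SALES.ORDERS")
# Keep only open orders; any valid PySpark transformation works here
open_orders = df.filter(df["STATUS"] == "OPEN")
# Persist the result as the materialized view table
save(open_orders)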
Configure Standalone Spark on Separate Server
Standalone Spark performs optimally on a dedicated server. To move or install
the bundled Spark version on a new node, perform the following steps:
1. Set up Incorta and Spark nodes with a shared disk accessible by both nodes.
2. Install Incorta on Incorta-Node.
3. Create a tenant in Incorta and specify the tenant folder path to be on the shared disk.
4. Copy SPARK_HOME from Incorta-Node to the shared disk on Spark-Node.
5. Verify the values in the Spark configuration files the same way you verify the values for a single-node installation.
6. Start the Spark master and worker servers from Spark-Node, not from Incorta-Node.
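A sketch of step 4 from a shell, assuming the SPARK_HOME path used earlier, a shared disk mounted at /shared on Spark-Node, and rsync available (all hypothetical; substitute your own paths and host names):

# Run from Incorta-Node: copy the bundled Spark directory to the shared disk on Spark-Node
rsync -a /home/incorta/IncortaAnalytics/IncortaNode/spark/ spark-node:/shared/IncortaNode/spark/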
Port Requirements
By default, Spark is installed on the same node as Incorta. However, for improved performance, Incorta and Spark can be installed on separate nodes.
The following table outlines the ports you can use for Spark. All are incoming/outgoing with no installation parameter.
Port # | Purpose | Notes
7077 | Master port | If Spark exists on a remote server, this port must be open for remote access.
7078 | Worker port | If Spark exists on a remote server, this port must be open for remote access.
8080 | Master Web UI port |
8081 | Worker Web UI port |
9091 | Spark Master Web UI port | For the Spark master, this port must be open for remote access to monitor Spark jobs and logs.
9092 | Spark Worker Web UI port | For the Spark worker, this port must be open for remote access to monitor Spark jobs and logs.
Minimum Configuration
Configuring Spark to deliver optimum performance depends heavily on the environment it runs in, that is, the resources available to use: cores, memory, disk, etc. To avoid the most common problems, make sure that at minimum you have the following configurations set.
Disk Space for Spark Working Directory
Spark relies on disk to shuffle data among executors while executing a query, so the larger the dataset, the more disk space will likely be required to accommodate those needs. It is also highly recommended that this disk be an SSD for faster performance, and we recommend a minimum of 500GB of free space for a standard production deployment.
Worker Memory
Worker memory is the entire memory space available for Spark executors to allocate from upon initiation. It needs to be large enough to accommodate all required executors, so this amount should be larger than the collective amount of memory allocated for the application and driver.
If Spark and Incorta are running on the same machine, make sure that the memory allocated to Incorta, plus the memory allocated to Spark (worker memory), plus other necessary memory spaces (for the OS, the Spark driver application, and the MV's driver application) does not exceed the physical memory of the machine. Both configurations can be set in SPARK_HOME/conf/spark-env.sh; you can start by setting them like this:
SPARK_WORKER_MEMORY=1024g
SPARK_LOCAL_DIRS=/path/to/directory
Where /path/to/directory is any path to an empty directory on a disk that has at least 500GB of free disk space. Check the available disk space using the following command:

df -lh

which returns output that looks like this:

Filesystem Size Used Avail Use% Mounted on
devtmpfs   961G 96K  961G  1%   /dev
tmpfs      961G 0    961G  0%   /dev/shm
/dev/xvda1 50G  42G  7.1G  86%  /
/dev/xvdh  246G 0    600G  1%   /disk-1

So, you can set the configuration as follows: SPARK_LOCAL_DIRS=/disk-1
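As a worked example of the memory budget described above, consider a hypothetical machine with 512GB of physical RAM (all figures are illustrative assumptions): 128GB allocated to Incorta, 320GB for SPARK_WORKER_MEMORY, 32GB for the Spark and MV driver applications, and 32GB reserved for the OS adds up to exactly 512GB. Setting worker memory any higher would risk exceeding physical memory and swapping.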
Incorta SQL App Configuration Parameters
As explained in the Detailed Configurations section below, Incorta lets you control a lot of configuration. Here's a safe starting point for those configs:
SQL App Cores: 32 (the number of cores allocated to all executors on all Spark nodes)
SQL App Executors: 16 (minimum 2 cores per executor)
SQL App Shuffle Partitions: 32 (the same as the number of cores, or double it; recommended not to exceed 200)
SQL App Memory: 320 (in GB; minimum 4GB per executor)
SQL App Driver Memory: 32 (between 4GB and 32GB; set to a large value if you are expecting ETL queries that return large datasets)
N.B.: Driver memory is allocated on the machine the Spark app was submitted from, i.e., the Incorta machine, so make sure that machine has enough memory to accommodate the driver app.
These values are just guidelines for production environments; you may want to change them according to the resources available, the datasets in question, and the nature of the queries.
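As a quick sanity check on these starting values: 32 cores divided across 16 executors gives 2 cores per executor (the stated minimum), and 320GB of SQL App memory divided across 16 executors gives 20GB per executor, comfortably above the 4GB-per-executor minimum.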
Detailed Configurations
The Cluster Management Console (CMC) provides a number of configurations to control the resources allocated to Spark and the communication with Spark. Spark configurations are found under Server Configurations > SQL Interface and Spark Integration.
Some configurations can be modified and take effect without the need to restart the server; others can be modified but won't take effect until the Incorta server has been restarted. Configurations that require a server restart are marked with an asterisk (*) in the tables below.
Advantages of Using Spark
Using Spark as your canonical data source, aggregating data from multiple, heterogeneous data sources, and serving data through a unified API leverages Incorta's powerful features, for example:
● Incorta's capacity to handle large volumes of data (versus accessing your original data sources directly)
● Incorta's capability to model joins across heterogeneous data coming from different sources
● Boosted query times, harnessing the power of Incorta's fast engine; you can send SQL queries directly to Spark, which will hit Incorta data files to gather result sets
● The simplicity of Incorta business schemas, where users can define virtual schemas of tables and query using those schemas
● The convenience of Incorta security filters, where you can restrict access to certain records in tables according to the user running the queries
Execution Modes
Incorta ships with Spark in case you don't have a running Spark cluster or a standalone Spark installation; you can use the bundled version in Standalone mode. That said, Incorta can operate with external Spark installations, whether they're Standalone or Cluster.
Note that when using Cluster mode, you should make sure the parquet files are stored in a shared medium, such as NFS, HDFS, or a similar medium. Incorta can be set up to use a remote managed Spark cluster.
Communication with Spark
How you communicate with Spark determines its behavior. Spark exposes two ports for clients to bind to, and the query is routed according to which port the client uses:
● One port directs the query to a decision point, where it is executed against either the Incorta engine or Spark, according to the query structure.
● The other port pushes queries directly to Spark.
Typically, queries running against the Incorta engine execute faster than those directed to Spark.
Possible Query Paths
Below are the possible paths taken by different queries during execution.
Case #1: Query Will Be Fulfilled by Incorta engine
Case #2: Query Sent to Incorta But Will Be Delegated to Spark
Case #3: Query Sent Directly to Spark
Caching
To accelerate recurring queries, Spark caches the results of queries matching criteria set in the configuration. If a query's result set is cached, and the data hasn't changed since the cached results were captured, the query won't be run and the cached result set will be returned immediately.
There is another option to refresh the cached result sets periodically. If set, users will always get data from the cache, while the system refreshes the cache based on the defined schedule.