http://gethue.com/  SQL client

### ICEBERG and other formats  
  https://developer.sh/posts/delta-lake-and-iceberg

  https://www.youtube.com/watch?v=LiC9vZATv0o&t=8s

  https://www.dremio.com/subsurface/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake/
  

https://mlwhiz.com/blog/2015/05/09/hadoop_mapreduce_streaming_tricks_and_technique/

https://www.tutorialandexample.com/hadoop-interview-questions

https://thegurus.tech/posts/2019/05/hadoop-python/

https://blog.codecentric.de/en/2018/10/window-functions-in-stream-analytics/
 
https://sujithjay.com/hadoopdb 
 

Apache NiFi

https://habr.com/ru/post/465299/ https://medium.com/@abdelkrim.hadjidj/best-practices-for-using-apache-nifi-in-real-world-projects-3-takeaways-1fe6912101db https://dzone.com/articles/kylo-self-service-data-ingestion-cleansing-and-val Apache NiFi, Kylo

Vespa

https://vespa.ai/

Pinot

https://pinot.apache.org/ Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Apache Hadoop and flat files) as well as online sources (such as Apache Kafka).

IoT MQTT

DeviceHive, Intel MQTT http://highscalability.com/blog/2018/4/9/give-meaning-to-100-billion-events-a-day-the-analytics-pipel.html

Apache Druid

https://habr.com/company/odnoklassniki/blog/420469/ https://towardsdatascience.com/introduction-to-druid-4bf285b92b5a Druid is unique among related technologies: it is at once an OLAP database (comparable to Vertica, Redshift, and Snowflake), a distributed query processor, and a time-series DB; it also has stream processing features and UI visualization that supports pivots. Druid is at its best with numerical and time-series data, and its ability to continuously ingest real-time business event streams makes it well suited for real-time queries. Long-running BI queries that need to touch a lot of historical data have to go through cold-storage upload and take longer to process.

Apache Drill, Impala, SparkSQL, Presto, Kylin

https://mapr.com/blog/how-guide-getting-started-apache-drill/

Drill: Apache Drill is a highly scalable, open source application framework that includes a SQL query engine. It can fetch data from a variety of mostly non-relational data stores, such as NoSQL databases. It is based on a schema-less JSON document model for data, so it is more flexible but slower than engines built on schema-based columnar data formats.

Impala: Apache Impala is a highly scalable, open source, distributed SQL query engine for big data, primarily oriented toward data on Hadoop clusters. It trades fault tolerance for speed, keeping intermediate results in memory, and by some metrics is the fastest interactive query engine. It is optimized for the Parquet columnar data format with files on the order of 256 MB; for the same amount of data it can perform poorly with a large number of small files.

SparkSQL: Apache SparkSQL is a highly scalable, open source, distributed SQL query engine for big data, with connectors to many data stores. It can deliver very high throughput for schema-based columnar data formats. For very large queries, running hours to days on many processors, it is a good choice, since it captures intermediate results in temporary files and can restart failed parts with a low time penalty. On the other hand, the minimum time for a very small query is relatively high, and the resource usage and service time for small to medium queries grow with the cost of saving intermediate results whenever the query plan cannot keep them in memory.

Presto: Presto is an open source, distributed query engine for big data with mature SQL support. It bypasses MapReduce and uses SQL-specific distributed operations in memory. The architecture pipelines all stages, so there is no wait time, no intermediate writes to disk, no need to fit all data into memory, and no disk I/O delays. Presto delegates some operations to the underlying data stores it queries, leveraging their inherent analytics capabilities, and it can perform cross-platform joins, providing centralized support for querying historical data across disparate data sources. Presto has advanced SQL support, including dynamic filtering, dynamically resolved functions, and SQL-defined functions (CREATE FUNCTION). Like Impala, Presto sacrifices fault tolerance for speed.

Kylin: Apache Kylin is built to manage OLAP cubes in HBase to support fast SQL queries. OLAP cubes store many secondary indexes (one per dimension) and use fast random access to retrieve records, whereas data warehouse files are oriented toward full table scans ("for each item in the haystack, add it to the result set if it looks like a needle"). Kylin is best suited to lower-cardinality data and can be a much more expensive option for very large datasets.
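A minimal sketch of issuing a query to Presto from Python, assuming the PyHive client (`pip install 'pyhive[presto]'`); the coordinator host, catalog, schema, and table name below are made-up placeholders:

```python
# Hypothetical host/catalog/schema/table; any SQL supported by Presto can be sent this way.
from pyhive import presto

conn = presto.connect(host="presto-coordinator", port=8080,
                      catalog="hive", schema="default")
cur = conn.cursor()

# With several catalogs configured (hive, kafka, mysql, ...), cross-source joins are also possible.
cur.execute("SELECT event_type, count(*) AS cnt "
            "FROM events GROUP BY event_type ORDER BY cnt DESC LIMIT 10")
for row in cur.fetchall():
    print(row)
```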

Apache Flink

https://www.ververica.com/blog/real-time-experiment-analytics-at-pinterest-using-apache-flink
https://www.infoq.com/presentations/sql-streaming-apache-flink
https://www.ververica.com/blog/flink-forward-preview-event-time-partitioning-with-apache-flink-apache-iceberg-netflix
http://tech.marksblogg.com/presto-connectors-kafka-mongodb-mysql-postgresql-redis.html Presto + Kafka connectors

Hadoop

Config files: conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml
Split-size tuning: mapreduce.input.fileinputformat.split.maxsize, mapreduce.input.fileinputformat.split.minsize
My notes: performance tuning, MapR
https://www.java-success.com/01-hadoop-bigdata-overview-interview-questions-answers/ Hadoop / Big Data interview questions, Java notes

Hadoop data formats

https://habr.com/ru/company/mailru/blog/504952/ https://habr.com/ru/company/otus/blog/465069/ https://habr.com/ru/company/alfastrah/blog/458552/ https://www.youtube.com/watch?v=NZLrJmjoXw8 ORC vs Parquet: predicate pushdown (min/max stats plus a bloom filter per 10,000 rows) is better in ORC.
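As a rough illustration of the point above, a PySpark sketch that writes ORC with a bloom filter and reads it back with a pushed-down predicate; the paths, the `user_id` column, and the choice of the `orc.bloom.filter.*` writer options are assumptions for the example, not taken from the links:

```python
# Hypothetical paths and column names; assumes a local Spark installation (pip install pyspark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-bloom-demo").getOrCreate()

df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# Write ORC with a bloom filter on user_id; ORC also records min/max stats per 10,000-row group.
(df.write
   .option("orc.bloom.filter.columns", "user_id")
   .option("orc.bloom.filter.fpp", "0.05")
   .mode("overwrite")
   .orc("/data/events_orc"))

# On read, the equality predicate can be answered from min/max stats and the bloom filter,
# skipping stripes / row groups that cannot contain the value (spark.sql.orc.filterPushdown).
hits = spark.read.orc("/data/events_orc").filter("user_id = 12345")
print(hits.count())
```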

CBOR (Concise Binary Object Representation)

https://habr.com/ru/post/208690/ IETF RFC 7049. JSON-like; the base types are int, float, UTF-8 string, byte string, array, map, and the JSON primitives. The base types are extended with standardized tag types, and also ad hoc by simple declaration. The standard extended types number a few dozen and include the ones you typically need: Decimal, UUID, date-time as an RFC 3339 string, and Unix timestamp. But there is no validation and no schemas.
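A small sketch of CBOR round-tripping with the `cbor2` package (`pip install cbor2`), showing the extended types mentioned above (UUID, date-time) riding on top of the base types; the record contents are invented:

```python
import uuid
from datetime import datetime, timezone

import cbor2

record = {
    "id": uuid.uuid4(),                # serialized with the standard UUID tag
    "ts": datetime.now(timezone.utc),  # serialized with a date-time tag (RFC 3339 string or timestamp)
    "payload": b"\x00\x01\x02",        # byte string, a native CBOR type
    "values": [1, 2.5, "text", None],  # JSON-like primitives
}

blob = cbor2.dumps(record)             # compact binary; no schema and no validation
decoded = cbor2.loads(blob)
print(len(blob), decoded["id"], decoded["ts"])
```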

Avro - row-major format

https://habr.com/ru/post/346698/

Avro's primary design goal was schema evolution. In the Avro format the schema is stored separately from the data; generally an Avro schema file (.avsc) is maintained.

For comparison: like Thrift, Protocol Buffers structures are defined via an IDL, which is used to generate stub code for multiple languages. Also like Thrift, Protocol Buffers do not support internal compression of records, are not splittable, and have no native MapReduce support; the Elephant Bird project can be used to encode protobuf records, adding support for MapReduce, compression, and splittability.

Avro is a language-neutral data serialization system designed to address the major downside of Hadoop Writables: lack of language portability. Like Thrift and Protocol Buffers, Avro data is described through a language-independent schema. Unlike Thrift and Protocol Buffers, code generation is optional with Avro. Since Avro stores the schema in the header of each file, it is self-describing, and Avro files can easily be read later, even from a different language than the one used to write the file. Avro also provides better native support for MapReduce, since Avro data files are compressible and splittable. Another important feature that makes Avro superior to SequenceFiles for Hadoop applications is support for schema evolution: the schema used to read a file does not need to match the schema used to write it, which makes it possible to add new fields to a schema as requirements change.

Avro schemas are usually written in JSON, but may also be written in Avro IDL, a C-like language. As just noted, the schema is stored as part of the file metadata in the file header. In addition to metadata, the file header contains a unique sync marker; just as with SequenceFiles, this sync marker separates blocks in the file, making Avro files splittable. Following the header, an Avro file contains a series of blocks of serialized Avro objects. These blocks can optionally be compressed, and within those blocks types are stored in their native format, providing an additional boost to compression. At the time of writing, Avro supports Snappy and Deflate compression.
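A minimal sketch of Avro schema evolution with the `fastavro` package (`pip install fastavro`); the file and field names are made up. A file written with the old schema stays readable with a newer reader schema that adds a defaulted field:

```python
from fastavro import writer, reader, parse_schema

writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})

# Newer reader schema adds a field with a default, so old files remain readable.
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"},
               {"name": "email", "type": ["null", "string"], "default": None}],
})

with open("users.avro", "wb") as out:
    writer(out, writer_schema, [{"id": 1, "name": "alice"}], codec="deflate")

with open("users.avro", "rb") as fo:
    for record in reader(fo, reader_schema=reader_schema):
        print(record)   # {'id': 1, 'name': 'alice', 'email': None}
```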

Columnar file formats for Hadoop

https://orc.apache.org/
https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
https://www.youtube.com/watch?v=6I5Ia_u4c6E
https://www.mungingdata.com/aws/athena-spark-best-friends
https://eng.uber.com/presto/
http://www.hydrogen18.com/blog/writing-parquet-records.html
https://habr.com/ru/search/?q=parquet#h

RCFile, ORC (Optimized Row Columnar): the schema travels with the data as part of the footer. Data is stored as row groups and stripes, and each stripe maintains indexes and statistics about the data it stores.

Parquet: similar to ORC, based on Google Dremel. The schema is stored in the footer; it is a column-oriented storage format with integrated compression and indexes. From CSV to Parquet in Spark: val df = spark.read.csv("/mnt/my-bucket/csv-lake/"); df.write.parquet("/mnt/my-bucket/parquet-lake/")

Only ORC and Parquet have the following features: predicate pushdown, where a condition is checked against the metadata to decide whether the rows need to be read at all, and column projection, to read only the bytes of the necessary columns. ORC can use predicate pushdown based on either min/max statistics for each column or an optional bloom filter for looking up particular values; Parquet only has min/max. ORC can filter at the file level, the stripe level, or the 10k-row level; Parquet can only filter at the file level or the row-group level.

ORC and Parquet also share several of Avro's properties: they are language neutral with C++, Java, and other language implementations; they are self-describing; they are splittable when compressed; and they support schema evolution. The MapReduce framework and other ecosystem projects provide RecordReader implementations for many file formats: text delimited, SequenceFile, Avro, Parquet, and more.
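A small sketch of column projection and predicate pushdown when reading Parquet with `pyarrow` (`pip install pyarrow`); the path and column names are placeholders:

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "/mnt/my-bucket/parquet-lake/",
    columns=["user_id", "event_ts"],                # column projection: read only these columns' bytes
    filters=[("event_date", ">=", "2020-01-01")],   # predicate pushdown: skip row groups via min/max stats
)
print(table.num_rows)
```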

HBASE

https://habr.com/company/sberbank/blog/420425/

HBase records can have an unlimited number of columns, but only a single row key. This is different from relational databases, where the primary key can be a composite of multiple columns; to create a unique row key you may need to combine multiple pieces of information in a single key. In HBase, all row keys are sorted, and each region stores a range of these sorted row keys. Each region is pinned to a region server (i.e., a node in the cluster). A well-known anti-pattern is to use a timestamp as the row key, because it focuses most put and get requests on a single region, and hence a single region server, which largely defeats the purpose of having a distributed system. It is usually best to choose row keys so the load is fairly distributed across the cluster; one way to achieve this is to salt the keys.

A table can have one or more column families. Each column family has its own set of HFiles and gets compacted independently of other column families in the same table.

Basic operations:
a) Get reads information from an HBase table; you can read a specific column from a column family.
b) Put writes information to an HBase table; you can write to a specific column in a column family. For an insert, you just specify the table name and the column data to fill. For an update, you first Scan the row using the column and cell values, then write the new value along with the timestamp.
c) Delete takes the table name and the version / column / column family to delete.
d) Scan is used to access the entire table or a set of records.
e) Increment atomically increments a cell value.

How do you read and write data using HBase? Data is written into HBase using the following steps: the WAL stores the information to be written, for log purposes; this data is copied into the MemStore, a temporary in-memory store (the MemStore is what makes HBase writes fast); from the MemStore, the data is flushed into an HFile, which lives in HDFS; if the MemStore cache is full, the data is written directly to an HFile. Once data is written successfully to HDFS, an acknowledgement is passed on to the client.
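A minimal sketch of the operations above (Put, Get, Scan, Increment, Delete) plus key salting, using the `happybase` client over the HBase Thrift gateway (`pip install happybase`); the host, table, and column family names are invented:

```python
import hashlib
import happybase

def salted_key(user_id: str, buckets: int = 16) -> bytes:
    """Prefix the key with a hash-derived salt so sequential ids spread across regions."""
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}|{user_id}".encode()

connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift server host
table = connection.table("events")                       # hypothetical table with family 'cf'

row_key = salted_key("user-12345")
table.put(row_key, {b"cf:country": b"US"})               # Put
print(table.row(row_key, columns=[b"cf:country"]))       # Get a single column
table.counter_inc(row_key, b"cf:clicks")                 # Increment an atomic counter cell

for key, data in table.scan(row_prefix=b"00|", limit=10):  # Scan one salt bucket
    print(key, data)

table.delete(row_key, columns=[b"cf:country"])           # Delete a column
```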

Hive

Metastore (usually MySQL) stores the table definitions. Config: /etc/hive/conf/hive-site.xml. Run DROP TABLE a before DROP DATABASE x.
https://www.youtube.com/watch?v=vwac18EzGGs Hive table dissected
https://www.youtube.com/playlist?list=PLOaKckrtCtNvLuuSkDdx71hAhPyNSqf66
https://www.youtube.com/watch?v=dwd9m1Zl04Q Hive join optimization: shuffling is expensive; hints /*+ STREAMTABLE */ and /*+ MAPJOIN */; SMB (sort-merge-bucketed) map join.
https://community.hortonworks.com/articles/149894/llap-a-one-page-architecture-overview.html

Hive metastore: to enable use of the Hive metastore outside of Hive, a separate project called HCatalog was started. HCatalog is part of Hive and serves the very important purpose of allowing other tools (like Pig and MapReduce) to integrate with the Hive metastore.

Physically, a partition in Hive is nothing but a sub-directory in the table directory: CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type, ...); Partitioning works better when the cardinality of the partitioning field is not too high.
https://stackoverflow.com/questions/19128940/what-is-the-difference-between-partitioning-and-bucketing-a-table-in-hive
http://www.hadooptpoint.org/difference-between-partitioning-and-bucketing-in-hive/

Clustering, a.k.a. bucketing, on the other hand, results in a fixed number of files, since you specify the number of buckets: Hive takes the field, calculates a hash, and assigns each record to a bucket. But what happens if you use, say, 256 buckets and the field you bucket on has low cardinality (for instance a US state, so only 50 distinct values)? You get 50 buckets with data and 206 empty buckets. CREATE TABLE table_name PARTITIONED BY (partition1 data_type, partition2 data_type, ...) CLUSTERED BY (column_name1, column_name2, ...) SORTED BY (column_name [ASC|DESC], ...) INTO num_buckets BUCKETS;

Partitions can dramatically cut the amount of data you query: if you only want data from a certain date forward, partitioning by year/month/day dramatically cuts the IO. Bucketing can speed up joins with other tables that use exactly the same bucketing: if two tables are joined on the same employee_id, Hive can do the join bucket by bucket (even better if they are already sorted by employee_id, since that becomes a merge sort, which runs in linear time). So bucketing works well when the field has high cardinality and the data is evenly distributed among buckets, while partitioning works best when the cardinality of the partitioning field is not too high. Also, you can partition on multiple fields with an order (year/month/day is a good example), while you can bucket on only one field.
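A minimal sketch of creating a partitioned and bucketed table from Python via PyHive (`pip install 'pyhive[hive]'`); the HiveServer2 host, table, and column names are placeholders chosen to match the partitioning/bucketing advice above:

```python
from pyhive import hive

conn = hive.connect(host="hiveserver2-host", port=10000, database="default")
cur = conn.cursor()

# Partition by a low-cardinality field (dt), bucket by a high-cardinality field (employee_id).
cur.execute("""
    CREATE TABLE IF NOT EXISTS employees (
        employee_id BIGINT,
        name        STRING,
        salary      DOUBLE
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (employee_id) SORTED BY (employee_id ASC) INTO 32 BUCKETS
    STORED AS ORC
""")

# Partition pruning: only the dt='2020-01-01' sub-directory is scanned.
cur.execute("SELECT count(*) FROM employees WHERE dt = '2020-01-01'")
print(cur.fetchall())
```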

HDFS

- rename / move a file or directory: hdfs dfs -mv /old/file /new/file
- get admin report / cluster status: hdfs dfsadmin -report
- get rack awareness and number of under-replicated blocks: hadoop fsck / -locations -blocks -files | grep -i -C6 miss
- print a compressed file (uncompressed): hdfs dfs -text /path/to/compressed_file
- merge all files in a folder into one file: hdfs dfs -getmerge /path/to/folder /path/to/file
- remove a file: hdfs dfs -rm /user/hive/warehouse/blah/blah
- kill a yarn job: hadoop job -kill job_id
- empty trash: hdfs dfs -expunge
- stream the result of a pipe (stdin) to a hadoop file: cat somefile.tsv | hdfs dfs -put - /file/on/cluster.tsv
- put (copy) a file from local to the cluster: hdfs dfs -put localfile /user/hadoop/hadoopfile
- create (touch) a file: hdfs dfs -touchz /path/to/myfile.txt
- tail a file: hdfs dfs -tail /my_file
- return stat information of a path (basic file info): hdfs dfs -stat /my_file
- delete a folder: hdfs dfs -rmr /folder/to/delete
- delete a file: hdfs dfs -rm /file/to/delete
- list all files in a directory: hdfs dfs -ls /user/dir1
- get access control list (ACL) of files and directories: hdfs dfs -getfacl /file
- display size of files and directories: hdfs dfs -du /user/hadoop/dir1
- copy a file within the cluster: hdfs dfs -cp /source_file /dest_file
- count number of directories, files and bytes: hdfs dfs -count /hdfs/folder/file
- copy from local machine to cluster with overwrite: hdfs dfs -copyFromLocal -f /local/file.csv /hdfs/folder/
- copy from cluster to local machine: hdfs dfs -copyToLocal /hdfs/folder/file.csv /local/folder
- copy from local machine to cluster: hdfs dfs -copyFromLocal /local/file.csv /hdfs/folder/
- stream-append to a file using a unix pipe (stdin): cat somefile | hdfs dfs -appendToFile - /hdfs/my_file.txt
- append to a file: hdfs dfs -appendToFile /local_file.txt /hdfs/my_file.txt
- print a file: hdfs dfs -cat /my_file
- change group of a file: hdfs dfs -chgrp group_name /my_file
- change file permissions: hdfs dfs -chmod 1755 /my_file
- change owner of a file/directory: hdfs dfs -chown user:group /my_hdfs/file
- set replication factor recursively: hdfs dfs -setrep -R -w 1 /my_hdfs/folder
- change replication factor of a file: hdfs dfs -setrep -w 2 /my_hdfs/file
- copy files from one cluster to another: hadoop distcp -pb -overwrite hftp://hdfssource:50070/file/to/copy hdfs://hdfsdest:8020/user/test
- print a sequence file: hdfs dfs -text /path/to/file/hdfs
- create a directory: hdfs dfs -mkdir /path/to/directory
- hadoop streaming (see the Python mapper sketch after this list): hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper myPythonScript.py -reducer /bin/wc -file myPythonScript.py
- set the sticky bit (prevents anyone except the superuser and the owner from deleting or moving files within the directory): sudo -u hdfs hadoop fs -chmod 1777 /tmp
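For the hadoop streaming entry above, a minimal sketch of what `myPythonScript.py` could look like as a word-count mapper (paired with /bin/wc or a similar reducer); it simply follows the streaming contract of reading lines from stdin and writing tab-separated key/value pairs to stdout:

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper sketch: emits "word<TAB>1" per token; the framework
# sorts by key before handing records to the reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```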