DATALAKE

Web Scrapy

User agent: Mobile devices browsing the web often see a pared-down version of sites, lacking banner ads, Flash, and other distractions. If you try changing your User-Agent to something like the following, you might find that sites get a little easier to scrape!

User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

Scrapy architecture: The data flow in Scrapy is controlled by the execution engine, and goes like this:
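As a rough illustration of the User-Agent tip earlier in this excerpt, the sketch below attaches the quoted iPhone User-Agent string to a request using only Python's standard library. The helper name `build_mobile_request` and the URL `https://example.com/` are placeholders introduced here for illustration; nothing is actually sent over the network.

```python
import urllib.request

# User-Agent string quoted in the note above (an old iPhone Safari build).
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) "
    "AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 "
    "Mobile/11D257 Safari/9537.53"
)


def build_mobile_request(url: str) -> urllib.request.Request:
    """Hypothetical helper: build a Request that presents a mobile User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": MOBILE_UA})


# Placeholder URL; the request is only constructed here, not sent.
req = build_mobile_request("https://example.com/")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would perform the fetch. In a Scrapy project, the comparable knob is the `USER_AGENT` setting in `settings.py`.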

Lakehouse Notes

Replication Method: The replication method to use for extracting data from the database. STANDARD replication requires no setup on the database side but cannot represent deletions incrementally. CDC uses the binlog to detect inserts, updates, and deletes; it must be configured on the source database itself.

S3 Support in Apache Hadoop: Apache Hadoop ships with a connector to S3 called "S3A", with the URL prefix "s3a:"; its previous connectors, "s3" and "s3n", are deprecated and/or deleted from recent Hadoop versions.
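Since the older "s3" and "s3n" prefixes are deprecated, path strings in existing job configs may need rewriting to "s3a". The following is a minimal sketch of that rewrite using only the standard library; the function name `to_s3a` and the bucket and key in the example are hypothetical, and it assumes the paths are plain `scheme://bucket/key` URIs.

```python
from urllib.parse import urlsplit, urlunsplit

# Connector prefixes described above as deprecated and/or removed.
LEGACY_SCHEMES = {"s3", "s3n"}


def to_s3a(uri: str) -> str:
    """Hypothetical helper: rewrite s3:// or s3n:// URIs to the s3a:// prefix."""
    parts = urlsplit(uri)
    if parts.scheme in LEGACY_SCHEMES:
        parts = parts._replace(scheme="s3a")
    # URIs with any other scheme (including s3a itself) pass through unchanged.
    return urlunsplit(parts)


# Example with a made-up bucket and key.
print(to_s3a("s3n://example-bucket/raw/events.parquet"))
```

Only the scheme is touched; the bucket and object key are left as they were, so the rewritten URI still points at the same object.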

AWS S3 Data Lake

https://medium.com/people-ai-engineering/building-a-data-lake-in-aws-9c1fb3876e23
https://towardsdatascience.com/building-a-data-pipeline-from-scratch-on-aws-35f139420ebc

Data Lake

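Design goals: efficient storage and access (ingestion and analysis); save storage space; estimate the yearly storage cost of a single device based on its sampling frequency. http://mysql.rjweb.org/doc.php/datawarehouse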
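The goal of estimating a single device's yearly storage cost from its sampling frequency comes down to simple arithmetic, sketched below. Every number in the example (one sample per second, 100 bytes per sample, a price of 0.023 per GB-month, no compression) is an assumption chosen for illustration, not a figure from these notes, and the function name is hypothetical. The cost assumes a full year of data is held for twelve months, so it is an upper bound for the first year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def yearly_storage_estimate(samples_per_second: float,
                            bytes_per_sample: float,
                            price_per_gb_month: float,
                            compression_ratio: float = 1.0) -> tuple[float, float]:
    """Hypothetical helper: return (stored GB per year, yearly cost) for one device.

    compression_ratio is raw size divided by stored size (1.0 = uncompressed).
    """
    raw_bytes = samples_per_second * SECONDS_PER_YEAR * bytes_per_sample
    stored_gb = raw_bytes / compression_ratio / 1e9  # decimal gigabytes
    # Assumes one year's worth of data is kept for 12 months.
    yearly_cost = stored_gb * price_per_gb_month * 12
    return stored_gb, yearly_cost


# Assumed inputs: 1 sample/s, 100 bytes/sample, 0.023 per GB-month.
gb, cost = yearly_storage_estimate(1, 100, 0.023)
print(f"{gb:.4f} GB/year, cost {cost:.4f} per device per year")
```

Multiplying the per-device result by the fleet size gives a first-order budget, and raising `compression_ratio` shows how much the space-saving goal is worth.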