TY - BOOK
AU - Marz, Nathan
AU - Warren, James
TI - Big data: principles and best practices of scalable real-time data systems
SN - 1617290343
U1 - 658.4038 23
PY - 2015///
CY - Shelter Island, NY
PB - Manning
KW - Big data
KW - Business intelligence
KW - Data mining
N1 - Includes index; 1. A new paradigm for Big Data -- Part 1. Batch layer : -- 2. Data model for Big Data -- 3. Data model for Big Data: Illustration -- 4. Data storage on the batch layer -- 5. Data storage on the batch layer: Illustration -- 6. Batch layer -- 7. Batch layer: Illustration -- 8. An example batch layer: Architecture and algorithms -- 9. An example batch layer: Implementation -- Part 2. Serving layer : -- 10. Serving layer -- 11. Serving layer: Illustration -- Part 3. Speed layer : -- 12. Realtime views -- 13. Realtime views: Illustration -- 14. Queuing and stream processing -- 15. Queuing and stream processing: Illustration -- 16. Micro-batch stream processing -- 17. Micro-batch stream processing: Illustration -- 18. Lambda Architecture in depth -- --; 1; A new paradigm for Big Data --; 1.1; How this book is structured --; 1.2; Scaling with a traditional database --; 1.3; NoSQL is not a panacea --; 1.4; First principles --; 1.5; Desired properties of a Big Data system --; 1.6; The problems with fully incremental architectures --; 1.7; Lambda Architecture --; 1.8; Recent trends in technology --; 1.9; Example application: SuperWebAnalytics.com --; 1.10; Summary -- --; Part 1; Batch layer : -- --; 2; Data model for Big Data --; 2.1; The properties of data --; 2.2; The fact-based model for representing data --; 2.3; Graph schemas --; 2.4; A complete data model for SuperWebAnalytics.com --; 2.5; Summary -- --; 3; Data model for Big Data: Illustration --; 3.1; Why a serialization framework?
--; 3.2; Apache Thrift --; 3.3; Limitations of serialization frameworks --; 3.4; Summary -- --; 4; Data storage on the batch layer --; 4.1; Storage requirements for the master dataset --; 4.2; Choosing a storage solution for the batch layer --; 4.3; How distributed filesystems work --; 4.4; Storing a master dataset with a distributed filesystem --; 4.5; Vertical partitioning --; 4.6; Low-level nature of distributed filesystems --; 4.7; Storing the SuperWebAnalytics.com master dataset on a distributed filesystem --; 4.8; Summary -- --; 5; Data storage on the batch layer: Illustration --; 5.1; Using the Hadoop Distributed File System --; 5.2; Data storage in the batch layer with Pail --; 5.3; Storing the master dataset for SuperWebAnalytics.com --; 5.4; Summary -- --; 6; Batch layer --; 6.1; Motivating examples --; 6.2; Computing on the batch layer --; 6.3; Recomputation algorithms vs. incremental algorithms --; 6.4; Scalability in the batch layer --; 6.5; MapReduce: a paradigm for Big Data computing --; 6.6; Low-level nature of MapReduce --; 6.7; Pipe diagrams: a higher-level way of thinking about batch computation --; 6.8; Summary -- --; 7; Batch layer: Illustration --; 7.1; An illustrative example --; 7.2; Common pitfalls of data-processing tools --; 7.3; An introduction to JCascalog --; 7.4; Composition --; 7.5; Summary -- --; 8; An example batch layer: Architecture and algorithms --; 8.1; Design of the SuperWebAnalytics.com batch layer --; 8.2; Workflow overview --; 8.3; Ingesting new data --; 8.4; URL normalization --; 8.5; User-identifier normalization --; 8.6; Deduplicate pageviews --; 8.7; Computing batch views --; 8.8; Summary -- --; 9; An example batch layer: Implementation --; 9.1; Starting point --; 9.2; Preparing the workflow --; 9.3; Ingesting new data --; 9.4; URL normalization --; 9.5; User-identifier normalization --; 9.6; Deduplicate pageviews --; 9.7; Computing batch views --; 9.8; Summary -- --; Part 2; Serving layer : -- --; 10; Serving layer 
--; 10.1; Performance metrics for the serving layer --; 10.2; The serving layer solution to the normalization/denormalization problem --; 10.3; Requirements for a serving layer database --; 10.4; Designing a serving layer for SuperWebAnalytics.com --; 10.5; Contrasting with a fully incremental solution --; 10.6; Summary -- --; 11; Serving layer: Illustration --; 11.1; Basics of ElephantDB --; 11.2; Building the serving layer for SuperWebAnalytics.com --; 11.3; Summary -- --; Part 3; Speed layer : -- --; 12; Realtime views --; 12.1; Computing realtime views --; 12.2; Storing realtime views --; 12.3; Challenges of incremental computation --; 12.4; Asynchronous versus synchronous updates --; 12.5; Expiring realtime views --; 12.6; Summary -- --; 13; Realtime views: Illustration --; 13.1; Cassandra's data model --; 13.2; Using Cassandra --; 13.3; Summary -- --; 14; Queuing and stream processing --; 14.1; Queuing --; 14.2; Stream processing --; 14.3; Higher-level, one-at-a-time stream processing --; 14.4; SuperWebAnalytics.com speed layer --; 14.5; Summary -- --; 15; Queuing and stream processing: Illustration --; 15.1; Defining topologies with Apache Storm --; 15.2; Apache Storm clusters and deployment --; 15.3; Guaranteeing message processing --; 15.4; Implementing the SuperWebAnalytics.com uniques-over-time speed layer --; 15.5; Summary -- --; 16; Micro-batch stream processing --; 16.1; Achieving exactly-once semantics --; 16.2; Core concepts of micro-batch stream processing --; 16.3; Extending pipe diagrams for micro-batch processing --; 16.4; Finishing the speed layer for SuperWebAnalytics.com --; 16.5; Pageviews over time, Bounce-rate analysis --; 16.6; Another look at the bounce-rate-analysis example --; 16.7; Summary -- --; 17; Micro-batch stream processing: Illustration --; 17.1; Using Trident --; 17.2; Finishing the SuperWebAnalytics.com speed layer --; 17.3; Fully fault-tolerant, in-memory, micro-batch processing --; 17.4; Summary -- --; 18; Lambda 
Architecture in depth --; 18.1; Defining data systems --; 18.2; Batch and serving layers --; 18.3; Speed layer --; 18.4; Query layer --; 18.5; Summary
N2 - "Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built."--Publisher's website
ER -