Paperback
About The Book
Learn essential techniques from data warehouse legend Bill Inmon on how to build the reporting environment your business needs now! Answers for many valuable business questions hide in text. How well can your existing reporting environment extract the necessary text from email, spreadsheets, and documents, and put it in a useful format for analytics and reporting? Transforming the traditional data warehouse into an efficient unstructured data warehouse requires additional skills from the analyst, architect, designer, and developer. This book will prepare you to successfully implement an unstructured data warehouse. Through clear explanations, examples, and case studies, you will learn new techniques and tips to successfully obtain and analyze text.

Master these ten objectives:

- Build an unstructured data warehouse using the 11-step approach
- Integrate text and describe it in terms of homogeneity, relevance, medium, volume, and structure
- Overcome challenges including blather, the Tower of Babel, and lack of natural relationships
- Avoid the Data Junkyard and combat the Spider's Web
- Reuse techniques perfected in the traditional data warehouse and Data Warehouse 2.0, including iterative development
- Apply essential techniques for textual Extract, Transform, and Load (ETL) such as phrase recognition, stop word filtering, and synonym replacement
- Design the Document Inventory system and link unstructured text to structured data
- Leverage indexes for efficient text analysis and taxonomies for useful external categorization
- Manage large volumes of data using advanced techniques such as backward pointers
- Evaluate technology choices suitable for unstructured data processing, such as data warehouse appliances

The following outline briefly describes each chapter's content:

Chapter 1 defines unstructured data and explains why text is the main focus of this book. The sources for text, including documents, email, and spreadsheets, are described in terms of factors such as homogeneity, relevance, and structure.

Chapter 2 addresses the challenges one faces when managing unstructured data. These challenges include volume, blather, the Tower of Babel, spelling, and lack of natural relationships. Learn how to avoid a data junkyard, which occurs when unstructured data is not properly integrated into the data warehouse. This chapter emphasizes the importance of storing integrated unstructured data in a relational structure. It also cautions about both the prevalence and the dangers of paper-based text.

Chapter 3 begins with a timeline of applications, highlighting their evolution over the decades. Eventually, powerful yet siloed applications created a spider's web environment. This chapter describes how data warehouses solved many problems, including the creation of corporate data, the ability to get out of the maintenance backlog conundrum, and greater data integrity and data accessibility. There were problems with the data warehouse, however, such as the inevitable data lifecycle, that were addressed in Data Warehouse 2.0 (DW 2.0). This chapter discusses the DW 2.0 architecture, which leads into the role of the unstructured data warehouse. The unstructured data warehouse is defined and its benefits are given. Several features of the conventional data warehouse can be leveraged for the unstructured data warehouse, including ETL processing, textual integration, and iterative development.

Chapter 4 focuses on the heart of the unstructured data warehouse: textual Extract, Transform, and Load (ETL). This chapter has separate sections on extracting text, transforming text, and loading text. The chapter emphasizes the issues around source data. There is a wide variety of sources, and each source has its own set of considerations. Pointers on extraction are provided, such as reading documents only once and recognizing common and different file types.
Transforming text requires addressing many considerations discussed in this chapter, including phrase recognition, stop word filtering, and synonym replacement. Loading text is the final step. This chapter also explains important points about loading, such as the importance of the thematic approach and knowing how to handle large volumes of data. Two ETL examples are provided, one on email and one on spreadsheets.

Chapter 5 describes the 11 steps required to develop the unstructured data warehouse. The methodology explained in this chapter is a combination of the traditional system development lifecycle and spiral approaches.

Chapter 6 describes how to inventory documents for maximum analysis value, as well as how to link the unstructured text to structured data for even greater value. The chapter discusses the Document Inventory, a system similar to a library card catalog that is used for organizing corporate documents. This chapter explores ways of linking unstructured text to structured data. The emphasis is on taking unstructured data and reducing it into a form of data that is structured. Concepts related to linking, such as probabilistic linkages and dynamic linkages, are also discussed.

Chapter 7 goes through each of the different types of indexes necessary to make text analysis efficient. Indexes range from simple indexes, which are fast to create and work well if the analyst really knows what needs to be analyzed before the indexing process begins, to complex combined indexes, which can be made up of any and all of the other kinds of indexes.

Chapter 8 explains taxonomies and how they can be used within the unstructured data warehouse. Both simple and complicated taxonomies are discussed. Techniques to help the reader leverage taxonomies, including using preferred taxonomies, external categorization, and cluster analysis, are described. Real-world problems are raised, including the possibilities of encountering hierarchies, multiple types, and recursion.
The chapter ends with a discussion comparing a taxonomy with a data model.

Chapter 9 explains ways of coping with large amounts of unstructured data. Techniques such as keeping the unstructured data at its source and using backward pointers are discussed. The chapter explains why iterative development is so important. Ways of reducing the amount of data are presented, including screening and removing extraneous data, as well as parallelizing the workload.

Chapter 10 focuses on challenges and some technology choices that are suitable for unstructured data processing. The traditional data warehouse processing technology is reviewed. In addition, the data warehouse appliance is discussed.

Chapters 11, 12, and 13 put all of the previously discussed techniques and approaches in context through three case studies: the Ablatz Medical Group, the Eastern Hills Oil Company, and the Amber Oil Company.