
Big Data

WakaraNai 2022. 11. 15. 16:39

Three V's of Big Data

- volume : amount of data

- variety : range of data types and sources

- velocity : speed of data in/out

4th : veracity (accuracy of the data)

5th : value

Database

DBMS : Database Management System

- Relational Model : SQL

- Query Processing

- Transaction Management : a transaction is a collection of operations that performs a single logical function

    - Concurrency-control manager : coordinates transactions that occur at the same time so that the database stays consistent

    - Recovery : Redo & Undo
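The "single logical function" idea can be sketched with Python's built-in sqlite3 module. This is only an illustration, not part of the lecture material: the account table, the names, and the transfer helper are all hypothetical. A transfer either applies both updates or, on failure, is rolled back (undone) as a whole.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Raising here undoes the first UPDATE as well.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and is rolled back
print(ok, failed, balances(conn))
```

The failed transfer leaves no partial update behind, which is the property the transaction manager is responsible for.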

 

 

Big Data Technologies

NoSQL = Not Only SQL

 

Apache Hadoop : 4th big data

allows for the distributed processing of large datasets

across clusters of computers using simple programming models

Hadoop = HDFS(store) + MapReduce(process)

- Redundant, fault-tolerant data storage

- Parallel computation framework

- job coordination

 

 

 

Hadoop is Not ...

... relational database

... OLTP (online transaction processing)

... structured data store of any kind

 

 

Hadoop used for ...

- recommendation system

- Natural Language Processing

- Data warehousing

- Market research / forecasting

- Financial analysis

- Correlation engines

- Image/video processing

- log analysis

- social networking

- health

- government

- telecommunication

 

 

Hadoop : HDFS

a distributed file system designed to run on commodity hardware

 

Goals

- hardware failure -> detect fault quickly, automatic recovery

- streaming data access -> batch processing (not interactive use by users)

- large data sets

- simple coherency model: write-once-read-many access

- "Moving Computation is Cheaper than Moving Data"
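The hardware-failure goal can be pictured with a toy simulation. This is a rough sketch under simplifying assumptions: the round-robin placement, the tiny block size, and every function name here are invented for illustration, and real HDFS uses far larger blocks and a rack-aware placement policy. The point is only that a file split into replicated blocks stays readable while at least one replica of every block is on a live node.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication):
    """Toy round-robin placement: block i goes to `replication` distinct nodes.

    Assumes replication <= len(nodes).
    """
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(num_blocks)
    }

def read_file(blocks, placement, live_nodes):
    """Reassemble the file, failing if some block has no live replica."""
    out = []
    for idx, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[idx]):
            raise IOError(f"block {idx} has no live replica")
        out.append(block)
    return b"".join(out)

data = b"write once, read many times"
nodes = ["n1", "n2", "n3", "n4"]
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(len(blocks), nodes, replication=3)

# One node down: every block still has a live replica.
recovered = read_file(blocks, placement, live_nodes={"n2", "n3", "n4"})
print(recovered == data)
```

With three of the four nodes down, block 0 (stored on n1, n2, n3) would be lost and read_file would raise an error.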

Hadoop : Map Reduce

a programming model and an associated implementation for

processing and generating large data sets

 

Programming model - 3 phases

- Map phase

- Sort phase

- Reduce phase
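The three phases can be sketched in plain Python with the usual word-count example. This runs on a single machine and is only meant to show the data flow; it does not use the Hadoop API, and the function names are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def sort_phase(pairs):
    """Sort: order the pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Reduce: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big cluster", "data moves slowly"]
counts = dict(reduce_phase(sort_phase(map_phase(lines))))
print(counts)
```

In a real cluster the map and reduce functions run as independent tasks on many nodes, and the sort step also moves each key to the node that reduces it.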

 

Advantages

- Distribute data and computation

- independent tasks

- linear scaling in the ideal case

- simple programming model : "end-user" only writes map-reduce tasks

 

Disadvantages

- still rough

- programming is very restrictive

- "Join" operations are tricky and slow

- cluster management is hard (debugging ...)

- limited scaling
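Why joins are awkward can be seen in a sketch of a reduce-side join, one common way to express a join in the map-reduce style. The two data sets and all names below are hypothetical. Every record has to be tagged with its source, grouped by the join key, and only then paired up in the reducer, which in a real cluster means sending both data sets across the network.

```python
from collections import defaultdict

def map_users(rows):
    """Tag each user record with its source and key it by user id."""
    for user_id, name in rows:
        yield user_id, ("user", name)

def map_orders(rows):
    """Tag each order record with its source and key it by user id."""
    for user_id, item in rows:
        yield user_id, ("order", item)

def shuffle(pairs):
    """Group all tagged values by key (the framework's sort/shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        items = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for item in items:
                yield (key, name, item)

users = [(1, "kim"), (2, "lee")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
pairs = list(map_users(users)) + list(map_orders(orders))
joined = list(reduce_join(shuffle(pairs)))
print(joined)
```

Keys that appear in only one data set (user 2 and order 3 here) produce no output, so this behaves like an inner join.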

 

Five new Big Data technologies

1. Storm and Kafka : streaming in real time

2. Drill and Dremel : ad-hoc querying

3. R : statistical programming language

4. Gremlin and Giraph : graph analysis

5. SAP HANA : in-memory analytics platform

6. Honorable mention : D3 : visualization (rendering data as charts)

 

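Big data analysis

- text mining

- web mining

- opinion mining (news, etc.)

- reality mining (mobile device usage)

- social network analysis

- classification

- clustering

- machine learning

- sentiment analysis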
