Three V's of Big Data
- volume : amount of data
- variety : range of data types and sources
- velocity : speed of data in/out
4th V : veracity (accuracy)
5th V : value
Database
DBMS : Database Management System
- Relational Model : SQL
- Query Processing
- Transaction Management : a transaction is a collection of operations that performs a single logical function
- Concurrency-control manager : coordinates transactions that occur at the same time so that the database stays consistent
- Recovery : Redo & Undo
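The transaction idea above (several operations acting as one logical unit, undone together on failure) can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration, not how a full DBMS implements recovery; the `account` table, the `transfer` and `balances` helpers, and the sample names are assumptions made up for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False  # the debit above has been undone by the rollback

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM account"))

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)         # both updates are committed
rejected = transfer(conn, "alice", "bob", 500)  # overdraft, so both are undone
print(ok, rejected, balances(conn))
```

The second call debits the account before noticing the overdraft, yet the stored balances are unchanged afterwards, which is the "undo" half of recovery in miniature.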
Big Data Technologies
NoSQL = Not Only SQL
Apache Hadoop : 4th big data
allows for the distributed processing of large datasets
across clusters of computers using simple programming models
Hadoop = HDFS(store) + MapReduce(process)
- Redundant, fault-tolerant data storage
- Parallel computation framework
- job coordination
Hadoop is Not ...
... relational database
... OLTP online transaction processing
... structured data store of any kind
Hadoop used for ...
- recommendation system
- Natural Language Processing
- Data warehousing
- Market research / forecasting
- Financial analysis
- Correlation engines
- Image/video processing
- log analysis
- social networking
- health
- government
- telecommunication
Hadoop : HDFS
a distributed file system designed to run on commodity hardware
Goals
- hardware failure -> detect fault quickly, automatic recovery
- streaming data access -> batch processing (not interactive use by users)
- large data sets
- simple coherency model: write-once-read-many access
- "Moving Computation is Cheaper than Moving Data"
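The hardware-failure goal can be made concrete with a toy model of block replication. This is a rough sketch under simplifying assumptions, not HDFS's actual placement policy: the 4-byte block size, the round-robin placement, and the names `place_blocks` and `read_file` are all invented for illustration (real HDFS uses far larger blocks and its own placement rules).

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose for the demo (assumption)
REPLICATION = 3   # copies kept of every block (assumption for the demo)

def place_blocks(data, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into blocks and store each block on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    storage = {node: {} for node in nodes}
    for idx, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(idx + r) % len(nodes)]  # simple round-robin placement
            storage[node][idx] = block
    return storage, len(blocks)

def read_file(storage, n_blocks, failed=()):
    """Reassemble the file, skipping nodes listed in `failed`."""
    parts = []
    for idx in range(n_blocks):
        replica = next(
            (held[idx] for node, held in storage.items()
             if node not in failed and idx in held),
            None,
        )
        if replica is None:
            raise IOError(f"block {idx} has no surviving replica")
        parts.append(replica)
    return b"".join(parts)

nodes = ["node0", "node1", "node2", "node3"]
data = b"write once, read many"
storage, n_blocks = place_blocks(data, nodes)

# Two of four nodes are down, but every block still has a live copy.
recovered = read_file(storage, n_blocks, failed={"node1", "node2"})
print(recovered == data)
```

With three replicas spread over four nodes, any two nodes can fail without losing data; losing a third node would leave some block without a copy, and `read_file` raises `IOError` in that case.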
Hadoop : MapReduce
a programming model and an associated implementation for
processing and generating large data sets
Programming model - 3 phases
- Map phase
- Sort phase
- Reduce phase
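The three phases can be sketched as a word count in plain Python. This runs on a single machine and only mimics the data flow; it does not use the Hadoop API, and the function names (`map_phase`, `sort_phase`, `reduce_phase`, `word_count`) are assumptions chosen for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def sort_phase(pairs):
    """Sort: order pairs by key so equal keys end up next to each other."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Reduce: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

def word_count(lines):
    """Chain the three phases and collect the result into a dict."""
    return dict(reduce_phase(sort_phase(map_phase(lines))))

counts = word_count(["big data is big", "data moves"])
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many machines and the sort step also moves data between them; here the sort is just `sorted`, which is enough to show why the reducer sees all values for a key together.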
Advantages
- Distribute data and computation
- independent tasks
- linear scaling in the ideal case
- simple programming model : "end-user" only writes map-reduce tasks
Disadvantages
- still rough
- programming is very restrictive
- "Join" operations are tricky and slow
- cluster management is hard (debugging ...)
- limited scaling
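To see why joins are awkward in this model, here is a sketch of one common approach, a reduce-side join, simulated in plain Python. It is an illustration under assumptions rather than Hadoop code: the `users` and `orders` records, the tags, and every function name are hypothetical.

```python
from collections import defaultdict

def map_users(users):
    """Map: tag each user record with its source, keyed by user id."""
    for user_id, name in users:
        yield user_id, ("user", name)

def map_orders(orders):
    """Map: tag each order record with its source, keyed by user id."""
    for user_id, item in orders:
        yield user_id, ("order", item)

def shuffle(*streams):
    """Group every emitted value by key, as the sort/shuffle step would."""
    groups = defaultdict(list)
    for stream in streams:
        for key, value in stream:
            groups[key].append(value)
    return sorted(groups.items(), key=lambda kv: kv[0])

def reduce_join(key, values):
    """Reduce: pair every user record with every order record for one key."""
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    for name in names:
        for item in items:
            yield (key, name, item)

def join(users, orders):
    """Inner join of users and orders on user id."""
    out = []
    for key, values in shuffle(map_users(users), map_orders(orders)):
        out.extend(reduce_join(key, values))
    return out

users = [(1, "kim"), (2, "lee")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
joined = join(users, orders)
print(joined)
```

Every record from both inputs has to pass through the shuffle before any pair can be matched, and the reducer must separate the tagged values by hand. That extra data movement and bookkeeping is a large part of what the note means by "tricky and slow".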
Five New Big Data Technologies
1. Storm and Kafka : real-time stream processing
2. Drill and Dremel : ad-hoc querying
3. R : statistical programming language
4. Gremlin and Giraph : graph analysis
5. SAP Hana : in-memory analytics platform
6. Honorable mention : D3 : visualization (rendering data as charts)
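Big Data Analysis
- text mining
- web mining
- opinion mining (news, etc.)
- reality mining (mobile device usage)
- social network analysis
- classification
- clustering
- machine learning
- sentiment analysis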