@karimkhanp
Last active April 6, 2022 13:48
Big data resources - Am I missing something? Add to the list and make it richer.
Big data is a combination of several subjects. It mainly requires programming, analysis, NLP, machine learning, and mathematics.
To see the links, go to: http://www.quora.com/What-are-some-good-sources-to-learn-big-data
Here are a bunch of courses I came across:
Introduction to CS Course
Notes: An introductory computer science course that teaches coding.
Online Resources:
Udacity - intro to CS course,
Coursera - Computer Science 101
Code in at least one object-oriented programming language: C++, Java, or Python
Beginner Online Resources:
Coursera - Learn to Program: The Fundamentals,
MIT Intro to Programming in Java,
Google's Python Class,
Coursera - Introduction to Python,
Python Open Source E-Book
Intermediate Online Resources:
Udacity's Design of Computer Programs,
Coursera - Learn to Program: Crafting Quality Code,
Coursera - Programming Languages,
Brown University - Introduction to Programming Languages
Learn other Programming Languages
Notes: Add to your repertoire - JavaScript, CSS, HTML, Ruby, PHP, C, Perl, Shell, Lisp, Scheme.
Online Resources: w3schools.com - HTML Tutorial, Learn to code
Test Your Code
Notes: Learn how to catch bugs, create tests, and break your software
Online Resources: Udacity - Software Testing Methods, Udacity - Software Debugging
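As a small illustration of creating tests and trying to break your own code, here is a minimal sketch using Python's built-in unittest module. The function median and the test names are made up for this example and do not come from any of the courses above.

```python
import unittest


def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median() requires at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even length: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


class MedianTest(unittest.TestCase):
    def test_odd_length(self):
        self.assertEqual(median([3, 1, 2]), 2)

    def test_even_length(self):
        self.assertEqual(median([4, 1, 3, 2]), 2.5)

    def test_empty_input_is_rejected(self):
        # Try to break the function with input it cannot handle.
        with self.assertRaises(ValueError):
            median([])
```

Saved to a file, the tests can be run with python -m unittest followed by the file name.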
Develop logical reasoning and knowledge of discrete math
Online Resources:
MIT Mathematics for Computer Science,
Coursera - Introduction to Logic,
Coursera - Linear and Discrete Optimization,
Coursera - Probabilistic Graphical Models,
Coursera - Game Theory.
Develop strong understanding of Algorithms and Data Structures
Notes: Learn about fundamental data types (stacks, queues, and bags), sorting algorithms (quicksort, mergesort, heapsort), data structures (binary search trees, red-black trees, hash tables), and Big O notation.
Online Resources:
MIT Introduction to Algorithms,
Coursera - Introduction to Algorithms Part 1 & Part 2,
Wikipedia - List of Algorithms,
Wikipedia - List of Data Structures,
Book: The Algorithm Design Manual
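To make the topics in the notes above a bit more concrete, here is a minimal Python sketch of a stack, a queue, and mergesort. It is only an illustration written for this list, not code from any of the listed courses; the name merge_sort is chosen here for the example.

```python
from collections import deque


def merge_sort(items):
    """Return a new sorted list using mergesort, O(n log n) comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Stack: last in, first out. A plain list gives O(1) push and pop at the end.
stack = [1, 2, 3]
top = stack.pop()

# Queue: first in, first out. deque gives O(1) removal from the front.
queue = deque([1, 2, 3])
front = queue.popleft()

sorted_values = merge_sort([5, 2, 9, 1, 5])
```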
Develop a strong knowledge of operating systems
Online Resources: UC Berkeley Computer Science 162
Learn Artificial Intelligence
Online Resources:
Stanford University - Introduction to Robotics, Natural Language Processing, Machine Learning
Learn how to build compilers
Online Resources: Coursera - Compilers
Learn cryptography
Online Resources: Coursera - Cryptography, Udacity - Applied Cryptography
Learn Parallel Programming
Online Resources: Coursera - Heterogeneous Parallel Programming
Tools and technologies for big data:
Apache Spark - Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
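The two-stage MapReduce paradigm mentioned above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Spark or Hadoop code; the names map_phase and reduce_phase and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Stage 1: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Stage 2: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["big data is big", "data pipelines move data"]
word_counts = reduce_phase(map_phase(lines))
```

On a real cluster the map output would be shuffled between machines before the reduce stage; here both stages simply run one after the other in memory.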
Database pipelining -
As you will notice, it's not just about processing the data; a lot of other components are involved. Collection, storage, exploration, ML, and visualization are critical to the project's success.
Solr - Solr can be used to build a highly scalable data analytics engine that lets customers engage in fast, real-time knowledge discovery.
Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is among the most popular enterprise search engines. Solr 4 adds NoSQL features.
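Full-text search of the kind listed above is commonly built on an inverted index, which maps each term to the documents that contain it. The following is a toy sketch in plain Python, not Solr's API or its actual implementation; build_index, search, and the sample documents are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "hadoop stores large data sets",
    2: "solr provides full-text search",
    3: "solr and hadoop handle large collections",
}
index = build_index(docs)
matches = search(index, "solr hadoop")
```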
S3 - Amazon S3 is an online file storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces.
Hadoop - Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
HBase - HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
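The idea of sparse data can be sketched in a few lines of plain Python: store only the non-zero entries, and pick the largest items without sorting everything. This is a conceptual illustration only, not HBase's storage model or API; sparse_row, get, and largest are hypothetical names.

```python
import heapq

# Only the non-zero cells are stored; every other index is implicitly 0.0.
sparse_row = {3: 7.5, 1_000_000: 2.0, 1_999_999_999: 9.25}


def get(row, index):
    """Read a cell, treating missing entries as zero."""
    return row.get(index, 0.0)


def largest(values, k):
    """Return the k largest values without sorting the whole input."""
    return heapq.nlargest(k, values)


top_three = largest(range(1000), 3)
non_zero_cells = len(sparse_row)
```

With 2 billion possible indexes and three stored cells, the dictionary holds only the entries that carry information.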
ZooKeeper - Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a subproject of Hadoop but is now a top-level project in its own right.
Hive - Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Mahout - Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.
NLTK - The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook.
NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.
For Python -
scikit-learn
NumPy
SciPy
Freebase - Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual 'wiki' contributions.
DBpedia - DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. DBpedia has been described by Tim Berners-Lee as one of the more famous parts of the decentralized Linked Data effort.
Visualization tools
ggplot in R
Tableau
QlikView
Mathematics :)
Calculus, statistics, probability, linear algebra, and coordinate geometry
NER - Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names.
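The labeling described above can be shown with a toy example. Real NER systems usually rely on statistical or neural models; this sketch only looks names up in a fixed list (a gazetteer), and GAZETTEER and tag_entities are names invented for the illustration.

```python
import re

# Hypothetical lookup table of known names; a real system would learn these.
GAZETTEER = {
    "PERSON": ["Tim Berners-Lee"],
    "ORG": ["Apache Software Foundation", "Facebook", "Netflix"],
}


def tag_entities(text):
    """Return (name, label) pairs in the order the names appear in the text."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    return [(name, label) for _, name, label in sorted(found)]


entities = tag_entities("Hive was developed by Facebook and is now used by Netflix.")
```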
Faceted search - Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.
Source : Wikipedia, the free encyclopedia
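The faceted search described above can be sketched in plain Python: each item is classified along several facets, filters can be combined, and counts per facet value can be shown to the user. This is an illustrative sketch, not how Solr or any other engine implements it; the ITEMS data, the category labels, and the function names are assumptions made for the example.

```python
from collections import Counter

# Each item is classified along two facets: "category" and "language".
# The category labels are illustrative only.
ITEMS = [
    {"name": "Hadoop", "category": "storage", "language": "Java"},
    {"name": "HBase", "category": "storage", "language": "Java"},
    {"name": "NLTK", "category": "nlp", "language": "Python"},
    {"name": "scikit-learn", "category": "ml", "language": "Python"},
]


def apply_filters(items, **filters):
    """Keep only the items that match every facet=value filter given."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in filters.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items)


python_tools = apply_filters(ITEMS, language="Python")
language_counts = facet_counts(ITEMS, "language")
```

Filters can be stacked, for example apply_filters(ITEMS, category="storage", language="Java"), which is what lets a user narrow a collection step by step.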