Skip to content

Instantly share code, notes, and snippets.

@GzuPark
Created March 6, 2018 15:32
Show Gist options
  • Save GzuPark/781b660ed3d5be7a09371526b7185076 to your computer and use it in GitHub Desktop.
Save GzuPark/781b660ed3d5be7a09371526b7185076 to your computer and use it in GitHub Desktop.
SQL 실습, Big Data (2018 spring), Postechx

1. Write a SQL query that is equivalent to the following relational albebra expression.

  • π_{term} (σ_{docid=2}(Documents)) U π_{term} (σ_{count=3}(Documents))
  • Try both union and union all to see the difference
  • Try "or" instead of "union" and compare results with using union

Answer

  • input
select Term from Documents where Docid=3
UNION
select Term from Documents where count=3;
  • output
base
ds
final
mid
network
sd
structure
  • input
select Term from Documents where Docid=3
UNION ALL
select Term from Documents where count=3;
  • output
ds
sd
final
mid
base
structure
sd
sd
network
  • input
select Term from Documents where Docid=3
EXCEPT
select Term
from (select * from Documents where Docid=3 EXCEPT select * from Documents where count=3) x
where count=3;
  • output
sd

2. Write a SQL query to count the number of documents containing the word "data".

Answer

  • input
select count(Docid)
from Documents
where Term="data";
  • output
4

3. Write a SQL query to find all documents that have more than 3 terms.

Answer

  • input
select Docid 
from Documents 
group by Docid 
having count(Term) > 3;
  • output
1
2
3
4
5

4. Write a SQL query to count the number of documents that contain both the word "data" and "base".

Answer

  • input
select count(a.Docid) 
from (select * from Documents where Term="data") a, 
     (select * from Documents where Term="base") b 
where a.Docid = b.Docid;
  • output
3

5. Write a SQL query to compute the similarity of every pair of documents in Documents.

The similarity here is computed by summing the same term counts of two documents. For example, the similarity of Doc1<'a':2, 'b':1, 'c':3> and Doc2<'b':2, 'c':1, 'd':4>is 1*2('b') + 3*1('c') = 5.

(Hint: to avoid computing the similarity of both (Doc1, Doc2) and (Doc2, Doc1), add a condition of the form a.DocID < b.DocID.)

Answer

  • input
select x.aID, x.bID, sum(x.value) as result 
from (select a.DocID as aID, b.DocID as bID, a.count*b.count as value 
      from Documents a JOIN Documents b on a.Term = b.Term 
      where a.DocID < b.DocID 
      order by a.DocID, b.DocID) x 
group by x.aID, x.bID;
  • output
1,2,10
1,4,13
1,5,30
1,6,20
2,4,2
2,5,22
2,6,10
3,4,17
4,6,4
5,6,8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment