Velotio Technologies velotiotech

View suggestions.txt
+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|runtimeMinutes|'runtimeMinutes' has less than 72% missing values
View Constraint_Suggestions.scala
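import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Profile the data and let Deequ propose constraints using its default rules
val constraintResult = {
  ConstraintSuggestionRunner()
    .onData(data)
    .addConstraintRules(Rules.DEFAULT)
    .run()
}

// Flatten the per-column suggestions into (column, description, code) tuples
val suggestionsDF = constraintResult.constraintSuggestions.flatMap {
  case (column, suggestions) =>
    suggestions.map { constraint =>
      (column, constraint.description, constraint.codeForConstraint)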
View Validation_metrics.scala
// Show the metrics that were computed while running the checks
VerificationResult.successMetricsAsDataFrame(spark, validationResult)
  .show(truncate = false)
View Validation_results.txt
+--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|constraint |constraint_status|constraint_message |
+--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|SizeConstraint(Size(None)) |Success | |
|MinimumConstraint(Minimum(averageRating,None)) |Success | |
|MaximumConstraint(Maximum(averageRating,None)) |Failure |Value: 10.0 does not meet the constraint requirement!|
|containsURL(titleType
View validate_results.scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}

val validationResult: VerificationResult = {
  VerificationSuite()
    .onData(data)
    .addCheck(
      Check(CheckLevel.Error, "Review Check")
        .hasSize(_ >= 100000) // the data must have at least 100k records
        .hasMin("averageRating", _ > 0.0) // minimum rating must be greater than 0
        .hasMax("averageRating", _ < 9.0) // maximum rating must be less than 9
        .containsURL("titleType") // verify that the titleType column has URLs
        .isComplete("primaryTitle") // primaryTitle should never be NULL
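Each lambda passed to a check such as hasMin or hasMax is a predicate on the computed metric value, which is why the MaximumConstraint row in Validation_results.txt above reports a Failure: the observed maximum of 10.0 does not satisfy `_ < 9.0`. The sketch below illustrates that idea in plain Scala. It is not Deequ's implementation, and the names AssertionSketch, Status, and evaluate are hypothetical, introduced only for illustration.

```scala
// Illustrative only: models a check as "metric value + user assertion".
// AssertionSketch, Status and evaluate are hypothetical names, not Deequ API.
object AssertionSketch {
  sealed trait Status
  case object Success extends Status
  case object Failure extends Status

  // Apply the user-supplied predicate to an already computed metric value.
  def evaluate(metric: Double, assertion: Double => Boolean): Status =
    if (assertion(metric)) Success else Failure
}
```

With these definitions, `AssertionSketch.evaluate(10.0, _ < 9.0)` yields `Failure`, while relaxing the predicate to `_ <= 10.0` yields `Success`.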
View validators.txt
hasSize, isComplete, hasCompleteness, isUnique, isPrimaryKey, hasUniqueness, hasDistinctness, hasUniqueValueRatio, hasNumberOfDistinctValues, hasHistogramValues, hasEntropy, hasMutualInformation, hasApproxQuantile, hasMinLength, hasMaxLength, hasMin, hasMax, hasMean, hasSum, hasStandardDeviation, hasApproxCountDistinct, hasCorrelation, satisfies, hasPattern, containsCreditCardNumber, containsEmail, containsURL, containsSocialSecurityNumber, hasDataType, isNonNegative, isPositive, isLessThan, isLessThanOrEqualTo, isGreaterThan, isGreaterThanOrEqualTo, isContainedIn
View Metrics_output.txt
+-----------+----------------------+-----------------+--------------------+
|entity     |instance              |name             |value               |
+-----------+----------------------+-----------------+--------------------+
|Mutlicolumn|numVotes,averageRating|Correlation      |0.013454113877394851|
|Column     |tconst                |Uniqueness       |1.0                 |
|Column     |tconst                |Distinctness     |1.0                 |
|Dataset    |*                     |Size             |7339583.0           |
|Column     |averageRating         |Completeness     |0.14858528066240276 |
|Column     |averageRating         |Mean             |6.886130810579155   |
|Column     |averageRating         |StandardDeviation|1.3982924856469208  |
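To read the output above: a Uniqueness and Distinctness of 1.0 for tconst means every identifier occurs exactly once, and a Completeness of about 0.149 for averageRating means only about 15% of rows carry a rating. The following is a minimal plain-Scala sketch of those three definitions on in-memory collections. It is an illustration under simplifying assumptions (null handling for uniqueness and distinctness is left out), not Deequ's Spark-based implementation, and ColumnMetricsSketch is a hypothetical name.

```scala
// Illustrative only: in-memory versions of three column metrics.
// ColumnMetricsSketch is a hypothetical helper, not part of Deequ.
object ColumnMetricsSketch {
  // Fraction of rows whose value is present (Some) rather than missing (None).
  def completeness[A](column: Seq[Option[A]]): Double =
    if (column.isEmpty) 0.0
    else column.count(_.isDefined).toDouble / column.size

  // Number of occurrences of each value.
  private def counts[A](values: Seq[A]): Map[A, Int] =
    values.groupBy(identity).map { case (value, group) => value -> group.size }

  // Fraction of rows whose value occurs exactly once in the column.
  def uniqueness[A](values: Seq[A]): Double =
    if (values.isEmpty) 0.0
    else counts(values).count { case (_, n) => n == 1 }.toDouble / values.size

  // Number of distinct values divided by the number of rows.
  def distinctness[A](values: Seq[A]): Double =
    if (values.isEmpty) 0.0
    else counts(values).size.toDouble / values.size
}
```

For example, for the made-up identifiers `Seq("tt1", "tt2", "tt2", "tt3")`, uniqueness is 0.5 (two of four rows hold a value seen once) and distinctness is 0.75 (three distinct values over four rows); for `Seq(Some(7.5), None, None, Some(6.0))`, completeness is 0.5.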
View Analysis_Metrics.scala
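import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.{Completeness, Compliance, Correlation, Distinctness, Mean, Size, StandardDeviation, Uniqueness}

val runAnalyzer: AnalyzerContext = {
  AnalysisRunner
    .onData(data)
    .addAnalyzer(Size()) // number of rows
    .addAnalyzer(Completeness("averageRating")) // fraction of non-null ratings
    .addAnalyzer(Uniqueness("tconst")) // fraction of values that occur exactly once
    .addAnalyzer(Mean("averageRating"))
    .addAnalyzer(StandardDeviation("averageRating"))
    .addAnalyzer(Compliance("top rating", "averageRating >= 7.0")) // fraction of rows matching the predicate
    .addAnalyzer(Correlation("numVotes", "averageRating"))
    .addAnalyzer(Distinctness("tconst")) // distinct values divided by number of rows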
View metrics.txt
ApproxCountDistinct, ApproxQuantile, ApproxQuantiles, Completeness, Compliance, Correlation, CountDistinct, DataType, Distance, Distinctness, Entropy, Histogram, Maximum, MaxLength, Mean, Minimum, MinLength, MutualInformation, PatternMatch, Size, StandardDeviation, Sum, UniqueValueRatio, Uniqueness
View IMDB_Dataset.txt
root
|-- tconst: string (nullable = true)
|-- titleType: string (nullable = true)
|-- primaryTitle: string (nullable = true)
|-- originalTitle: string (nullable = true)
|-- isAdult: integer (nullable = true)
|-- startYear: string (nullable = true)
|-- endYear: string (nullable = true)
|-- runtimeMinutes: string (nullable = true)
|-- genres: string (nullable = true)