We found that 5M of our 12M statistics records did not have a uid. The absence of this field caused the sharding process to fail.

  • To check how many of your stats records don't have a uid, run a query like so:

    curl --globoff 'http://localhost:8080/solr/statistics/select?q=-uid:[*+TO+*]&rows=0&indent=true' | grep numFound
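
    The output will include a line like the following; the numFound value is the number of records missing a uid (about 5 million in our case, shown here for illustration):

        <result name="response" numFound="5000000" start="0"/>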

  • Add the following core definition to solr.xml (in your Solr home directory):

      <core name="tstatistics" instanceDir="tstatistics" />
    
  • Actually create that instance directory and copy over the conf subdirectory from the statistics core
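
    For example (a sketch assuming your Solr home is [dspace]/solr; adjust to your installation):

        mkdir [dspace]/solr/tstatistics
        cp -r [dspace]/solr/statistics/conf [dspace]/solr/tstatistics/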

  • Double-check, and if necessary adjust, the Solr URLs in SolrTouch.java
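
    For reference, these are the URLs as defined in SolrTouch.java (full source below):

        String url = "http://localhost:8080/solr/statistics";
        String turl = "http://localhost:8080/solr/tstatistics";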

  • Build solrFix-2.0.jar using the pom file listed below

    • when cloning the gist, you need to create the appropriate directory structure to match the class's package (with the pom below, that is src/main/edu/georgetown/library/solrFix/SolrTouch.java)
    • run mvn install; the copy-dependencies execution in the pom is bound to the install phase, so all dependency jars will end up in the target directory
    • then, in the target directory, run the following to get the classpath for the command below:

        for i in *.jar; do echo -n `realpath $i` && echo -n ':'; done
  • Make sure Tomcat is running
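
    A quick check that Solr responds (any numFound output means it is up):

        curl -s 'http://localhost:8080/solr/statistics/select?q=*:*&rows=0' | grep numFound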

  • Run the solrFix jar repeatedly until all records have been copied from "statistics" to "tstatistics". This calls the SolrTouch class (source below), which reads each statistics record and copies it (excluding uid and _version_). This forces the re-initialization of these fields.

This process runs into heap or garbage collection constraints when processing large numbers of items. Tune the maximum number of records to process in one run via the loop bound in SolrTouch's main method (the total < 100_000 condition; recommended: 100,000 to 500,000).

    java -Xmx1000m -cp [classpath from above] edu.georgetown.library.solrFix.SolrTouch

Note that you can use a shell loop to run it multiple times in a row:

    for run in {0..10}; do java -Xmx1000m -cp [classpath from above] edu.georgetown.library.solrFix.SolrTouch; done

You can find out how many statistics hits are in your statistics core (i.e., how many need to be copied over to tstatistics) by running a query like so:

    curl --globoff 'http://localhost:8080/solr/statistics/select?q=*:*&rows=0&indent=true' | grep numFound

In the end, the result you get for that query should be the same as when querying the tstatistics core instead (same command as above, just use tstatistics rather than statistics).
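
A small convenience sketch to compare the two counts in one go:

    for core in statistics tstatistics; do
        echo -n "$core: "
        curl -s --globoff "http://localhost:8080/solr/$core/select?q=*:*&rows=0" | grep -o 'numFound="[0-9]*"'
    done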

  • Stop Solr (Tomcat) and swap the statistics and tstatistics cores.
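
    One way to do the swap (a sketch assuming the same Solr home as above; keep statistics-old as a backup until you are confident):

        cd [dspace]/solr
        mv statistics statistics-old
        mv tstatistics statistics
        # then remove the tstatistics <core> entry from solr.xml before restarting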

We found that 5M of our 12M statistics records did not have a proper _version_ attribute. The following code override will force the correction of the version during the sharding process.

  • Update SolrLogger.java in your DSpace code base to contain the method provided below. This method will exclude _version_ values from the CSV export used in the sharding process.
pom.xml:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>edu.georgetown.library</groupId>
      <artifactId>solrFix</artifactId>
      <version>2.0</version>
      <packaging>jar</packaging>
      <name>SolrFix</name>
      <url>http://maven.apache.org</url>
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>
      <dependencies>
        <dependency>
          <groupId>org.apache.solr</groupId>
          <artifactId>solr-solrj</artifactId>
          <version>4.7.2</version>
        </dependency>
        <dependency>
          <groupId>commons-io</groupId>
          <artifactId>commons-io</artifactId>
          <version>2.4</version>
        </dependency>
        <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.17</version>
        </dependency>
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
          <version>4.3.3</version>
        </dependency>
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
          <version>4.3.2</version>
        </dependency>
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpmime</artifactId>
          <version>4.3.3</version>
        </dependency>
        <dependency>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
          <version>3.4.6</version>
        </dependency>
        <dependency>
          <groupId>org.codehaus.woodstox</groupId>
          <artifactId>wstx-asl</artifactId>
          <version>4.0.6</version>
        </dependency>
        <dependency>
          <groupId>org.noggit</groupId>
          <artifactId>noggit</artifactId>
          <version>0.5</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>jcl-over-slf4j</artifactId>
          <version>1.7.7</version>
          <optional>true</optional>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>jul-to-slf4j</artifactId>
          <version>1.7.7</version>
          <optional>true</optional>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
          <version>1.7.7</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
          <version>1.7.7</version>
          <optional>true</optional>
        </dependency>
      </dependencies>
      <build>
        <resources>
          <resource>
            <directory>src/main/resources</directory>
          </resource>
        </resources>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
              <archive>
                <manifest>
                  <addClasspath>true</addClasspath>
                  <mainClass>edu.georgetown.library.solrFix.SolrTouch</mainClass>
                </manifest>
              </archive>
            </configuration>
          </plugin>
          <plugin>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
              <execution>
                <phase>install</phase>
                <goals>
                  <goal>copy-dependencies</goal>
                </goals>
                <configuration>
                  <outputDirectory>${project.build.directory}</outputDirectory>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
        <directory>target</directory>
        <outputDirectory>target/classes</outputDirectory>
        <finalName>${project.artifactId}-${project.version}</finalName>
        <sourceDirectory>src/main</sourceDirectory>
      </build>
    </project>
SolrLogger.java (replacement shardSolrIndex() method):

    public static void shardSolrIndex() throws IOException, SolrServerException {
        /*
        Start by faceting by year so we can include each year in a separate core!
        */
        SolrQuery yearRangeQuery = new SolrQuery();
        yearRangeQuery.setQuery("*:*");
        yearRangeQuery.setRows(0);
        yearRangeQuery.setFacet(true);
        yearRangeQuery.add(FacetParams.FACET_RANGE, "time");
        //We go back to the year 2000; this is a bit of overkill, but this way we ensure we have everything.
        //The alternative would be to sort, but that isn't recommended since it would be a very costly query!
        yearRangeQuery.add(FacetParams.FACET_RANGE_START, "NOW/YEAR-" + (Calendar.getInstance().get(Calendar.YEAR) - 2000) + "YEARS");
        //Add the +0year to ensure that we DO NOT include the current year
        yearRangeQuery.add(FacetParams.FACET_RANGE_END, "NOW/YEAR+0YEARS");
        yearRangeQuery.add(FacetParams.FACET_RANGE_GAP, "+1YEAR");
        yearRangeQuery.add(FacetParams.FACET_MINCOUNT, String.valueOf(1));
        //Create a temp directory to store our files in!
        File tempDirectory = new File(ConfigurationManager.getProperty("dspace.dir") + File.separator + "temp" + File.separator);
        tempDirectory.mkdirs();
        QueryResponse queryResponse = solr.query(yearRangeQuery);
        //We only have one range query!
        List<RangeFacet.Count> yearResults = queryResponse.getFacetRanges().get(0).getCounts();
        for (RangeFacet.Count count : yearResults) {
            long totalRecords = count.getCount();
            //Create a range query from this!
            //We start with our current year
            DCDate dcStart = new DCDate(count.getValue());
            Calendar endDate = Calendar.getInstance();
            //Advance one year for the start of the next one!
            endDate.setTime(dcStart.toDate());
            endDate.add(Calendar.YEAR, 1);
            DCDate dcEndDate = new DCDate(endDate.getTime());
            StringBuilder filterQuery = new StringBuilder();
            filterQuery.append("time:([");
            filterQuery.append(ClientUtils.escapeQueryChars(dcStart.toString()));
            filterQuery.append(" TO ");
            filterQuery.append(ClientUtils.escapeQueryChars(dcEndDate.toString()));
            filterQuery.append("]");
            //The next part of the filter query excludes the content from midnight of the next year!
            filterQuery.append(" NOT ").append(ClientUtils.escapeQueryChars(dcEndDate.toString()));
            filterQuery.append(")");
            Map<String, String> yearQueryParams = new HashMap<String, String>();
            yearQueryParams.put(CommonParams.Q, "*:*");
            yearQueryParams.put(CommonParams.ROWS, String.valueOf(10000));
            yearQueryParams.put(CommonParams.FQ, filterQuery.toString());
            yearQueryParams.put(CommonParams.WT, "csv");
            //Start by creating a new core
            String coreName = "statistics-" + dcStart.getYear();
            HttpSolrServer statisticsYearServer = createCore(solr, coreName);
            System.out.println("Moving: " + totalRecords + " into core " + coreName);
            log.info("Moving: " + totalRecords + " records into core " + coreName);
            List<File> filesToUpload = new ArrayList<File>();
            for (int i = 0; i < totalRecords; i += 10000) {
                String solrRequestUrl = solr.getBaseURL() + "/select";
                solrRequestUrl = generateURL(solrRequestUrl, yearQueryParams);
                GetMethod get = new GetMethod(solrRequestUrl);
                new HttpClient().executeMethod(get);
                InputStream csvInputstream = get.getResponseBodyAsStream();
                //Write the csv output to a file!
                File csvFile = new File(tempDirectory.getPath() + File.separatorChar + "temp." + dcStart.getYear() + "." + i + ".csv");
                CSVWriter bw = new CSVWriter(new FileWriter(csvFile));
                //Index of the _version_ column in the CSV header, or -1 if absent
                int excl = -1;
                try {
                    CSVReader reader = new CSVReader(new InputStreamReader(csvInputstream));
                    String[] nextLine;
                    String[] firstLine = new String[0];
                    if ((nextLine = reader.readNext()) != null) {
                        firstLine = nextLine;
                        for (int pi = 0; pi < firstLine.length; pi++) {
                            String s = firstLine[pi];
                            if (s == null) s = "";
                            if (s.equals("_version_")) {
                                excl = pi;
                                break;
                            }
                        }
                    }
                    for (; nextLine != null; nextLine = reader.readNext()) {
                        int sz = firstLine.length;
                        if (excl >= 0) sz--; //>= so a _version_ column in position 0 is also handled
                        String[] outLine = new String[sz];
                        int outIndex = 0;
                        for (int pi = 0; pi < firstLine.length; pi++) {
                            String s = (pi > nextLine.length - 1) ? "\"\"" : nextLine[pi];
                            if (pi == excl) continue;
                            if (s == null) s = "";
                            outLine[outIndex++] = s;
                        }
                        bw.writeNext(outLine);
                    }
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                bw.flush();
                bw.close();
                //FileUtils.copyInputStreamToFile(csvInputstream, csvFile);
                filesToUpload.add(csvFile);
                //Add 10000 & start over again
                yearQueryParams.put(CommonParams.START, String.valueOf(i + 10000));
            }
            for (File tempCsv : filesToUpload) {
                //Upload the data in the csv files to our new solr core
                try {
                    ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
                    contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
                    contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
                    contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");
                    statisticsYearServer.request(contentStreamUpdateRequest);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            statisticsYearServer.commit(true, true);
            //Delete contents of this year from our year query!
            solr.deleteByQuery(filterQuery.toString());
            solr.commit(true, true);
            log.info("Moved " + totalRecords + " records into core: " + coreName);
        }
        FileUtils.deleteDirectory(tempDirectory);
    }
SolrTouch.java:

    package edu.georgetown.library.solrFix;

    /*
     * java -mx2500M -cp commons-codec-1.6.jar:commons-io-2.4.jar:commons-logging-1.1.3.jar:httpclient-4.3.3.jar:httpcore-4.3.2.jar:httpmime-4.3.3.jar:jcl-over-slf4j-1.7.7.jar:jline-0.9.94.jar:jul-to-slf4j-1.7.7.jar:log4j-1.2.17.jar:netty-3.7.0.Final.jar:noggit-0.5.jar:slf4j-api-1.7.7.jar:slf4j-log4j12-1.7.7.jar:solrFix-2.0.jar:stax2-api-3.0.1.jar:stax-api-1.0.1.jar:woodstox-core-asl-4.0.6.jar:zookeeper-3.4.6.jar edu.georgetown.library.solrFix.SolrTouch
     */

    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrQuery.ORDER;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrTouch {
        static String conttype = "";

        public static void main(String[] args) {
            //Tomcat on Windows reports a different content type for XML responses
            boolean win = System.getProperty("os.name").startsWith("Windows");
            conttype = win ? "text/xml" : "application/xml";
            long stime = Calendar.getInstance().getTimeInMillis();
            int MAX = 50_000; //rows fetched per query
            String url = "http://localhost:8080/solr/statistics";
            String turl = "http://localhost:8080/solr/tstatistics";
            try {
                HttpSolrServer server = new HttpSolrServer(url);
                HttpSolrServer tserver = new HttpSolrServer(turl);
                //server.setRequestWriter(new BinaryRequestWriter());
                XMLResponseParser xrp = new XMLResponseParser() {
                    public String getContentType() { return conttype; }
                };
                //Count how many records are already in tstatistics and resume from there
                SolrQuery tsq = new SolrQuery();
                tsq.setQuery("*:*");
                tsq.setRows(0);
                tserver.setParser(xrp);
                QueryResponse tresp = tserver.query(tsq);
                int start = (int) tresp.getResults().getNumFound();
                String myQuery = "*:*";
                SolrQuery sq = new SolrQuery();
                sq.setQuery(myQuery);
                sq.setRows(MAX);
                sq.setSort("time", ORDER.asc);
                server.setParser(xrp);
                //Process at most 100,000 records per run; tune this bound as described above
                for (int total = 0; total < 100_000;) {
                    System.out.format("%,d%n", start);
                    sq.setStart(start);
                    QueryResponse resp = server.query(sq);
                    SolrDocumentList list = resp.getResults();
                    if (list.size() == 0) break;
                    ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
                    for (int i = 0; i < list.size(); i++) {
                        SolrDocument doc = list.get(i);
                        SolrInputDocument idoc = new SolrInputDocument();
                        Map<String, Object> m = doc.getFieldValueMap();
                        //Copy every field except uid and _version_ so they are re-initialized on add
                        for (String k : m.keySet()) {
                            if (k.equals("uid")) continue;
                            if (k.equals("_version_")) continue;
                            idoc.addField(k, m.get(k));
                        }
                        idocs.add(idoc);
                    }
                    tserver.add(idocs);
                    tserver.commit();
                    start += list.size();
                    long etime = Calendar.getInstance().getTimeInMillis();
                    total += idocs.size();
                    System.out.format("%,d updated in %,d sec%n", total, (etime - stime) / 1000);
                    System.gc();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }