Test setup

System setup for JIRA:

  • PC-class server, quad core; 8 GB RAM (physical hardware), plus 16 GB and 34 GB virtualized instances (Extra Large and High-Memory Double Extra Large) hosted at Amazon EC2; most of the tests were performed against the 16 GB instance
  • Linux: Ubuntu 10.04 server, 64-bit
  • Sun Java 1.6.20
  • JIRA 4.1.2 on PostgreSQL 8.4
  • A project with
    • 10 components, issues assigned to each with probability: 70%, 10%, 5%, 4%, 3%, 3%, 3%, 2%
    • 2 custom fields (a Text Field (< 255 characters) and a Multi Select field)
  • 15 business users: 14 regular workers (simulating call centre line personnel - agent-1 to agent-14) and one supervisor (simulating call centre manager - manager-1)
  • populated with 1,100,000 issues with reasonable but randomly generated descriptions and comments.

For the stress test we used:

  • a custom load generator simulating the work of 14 concurrent users, each repeatedly selecting (with equal probability) one of the following actions (a minimal sketch of such a worker is shown after this list):
    • create new issue,
    • add a comment to a randomly selected issue,
    • perform a JQL query:
      • simple (80% of the time): project = X AND components IN (Y, Z)
      • searching through comments (20% of the time): project = X AND components IN (Y, Z) AND comment ~ "random_word"
    • find all unresolved issues in the project assigned to a given user
    • open "view issue screen" of a randomly selected issue
    • assign a random issue to self
    • transition a random issue in the workflow, adding a comment.
  • web browsers with autorefresh plugins to continuously reload dashboards
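
For illustration, here is a minimal sketch of one such load-generator worker - not our actual load agent (see "Simulating load" below for that). The JiraClient interface and its methods are hypothetical placeholders for the remote calls a real agent would make; the action mix follows the list above.

    import java.util.Random;

    // Hypothetical wrapper around JIRA's remote API - a placeholder, not a real client.
    interface JiraClient {
        void createIssue();
        void addComment(String issueKey);
        void search(String jql);
        void viewIssue(String issueKey);
        void assignToSelf(String issueKey);
        void transition(String issueKey, String comment);
        String randomIssueKey();
    }

    // One load-generator worker; the real agent ran 14 such concurrent users.
    public class LoadWorker implements Runnable {

        private static final String[] WORDS = {"printer", "invoice", "timeout", "refund"};
        private final Random random = new Random();
        private final JiraClient jira;

        public LoadWorker(JiraClient jira) {
            this.jira = jira;
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                switch (random.nextInt(7)) {            // each action with equal probability
                    case 0: jira.createIssue(); break;
                    case 1: jira.addComment(jira.randomIssueKey()); break;
                    case 2:
                        if (random.nextInt(100) < 80) { // simple JQL 80% of the time
                            jira.search("project = X AND components IN (Y, Z)");
                        } else {                        // comment search 20% of the time
                            jira.search("project = X AND components IN (Y, Z) AND comment ~ \""
                                    + WORDS[random.nextInt(WORDS.length)] + "\"");
                        }
                        break;
                    case 3: jira.search("project = X AND resolution = Unresolved AND assignee = currentUser()"); break;
                    case 4: jira.viewIssue(jira.randomIssueKey()); break;
                    case 5: jira.assignToSelf(jira.randomIssueKey()); break;
                    case 6: jira.transition(jira.randomIssueKey(), "workflow transition comment"); break;
                }
            }
        }
    }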

The verdict

Please notice the big bold "But" in the message below and read the rest of the post before you get too enthusiastic (smile)

(thumbs up) Yes, it can. But...

Given enough RAM (12 GB for 1.1M issues), enough system tweaking, careful monitoring and setting the bar realistically, JIRA performance was satisfactory.

Conditions

We tested JIRA with specific requirements in mind. We don't know whether JIRA would scale well for instances exceeding these constraints:

  • One project with a few custom fields. Every additional project and custom field means more stress on JIRA indexes.
  • About 15 concurrent users. With a million issues JIRA cannot serve many concurrent requests, so performance becomes unacceptable once there are too many users.
  • Avoiding overly complicated reports and dashboards. JIRA will not survive the load of generating reports that involve all million issues for every user at every dashboard refresh.
  • Complicated reports done at night. If reports involving all million issues are required, they need to run off-hours so that they don't block interactive users. This most probably means custom plugins for such reports.

What does work really well

From a user's point of view, most of the time interacting with the Million Issue JIRA feels the same as with a smaller one (typical response time below 3s).

Single-issue operations (create, comment, workflow transitions, etc.) don't suffer at all from having a million issues in your JIRA.

Operations involving multiple issues, such as reports, filters and searches, work well as long as you use data available in the Lucene index.
This includes searching for your unresolved issues, creating a pie chart for components, viewing issue details or showing the activity stream.

What does not work that well

Anything involving:

  • full reindex - many administrative tasks fall into this category,
  • processing all or a large percentage of issues - fuzzy text searching, reports that need to touch every single issue, etc.,
  • many concurrent long running requests.

JIRA is vulnerable to a build-up of concurrent request-serving threads, which leads to out of memory errors.
Any system-imposed pause (garbage collection, choking on disk access, a reindex at the wrong time) makes this happen an order of magnitude faster.

The good thing is that JIRA can often recover from the out of memory condition - the long-running processes eventually finish, release their memory and things go back to normal.
The bad thing is that recovery can take a few hours - the backlog of started tasks may grow too big, the concurrency and GC overhead may be too great, and then there is no choice but to kill (and reindex) the instance.

Keeping the Million Issue JIRA instance alive requires:

  • Mandatory tweaking (see next post).
  • Constant monitoring
    • Java heap state - for out of memory symptoms,
    • JIRA logs - for queries becoming slow,
    • users - so that they don't kill JIRA with too-heavy gadgets (JIRA 4.x OpenSocial dashboards are wonderful, but they also put much more stress, measured per user, on a JIRA instance).
  • Limiting the killer requests
    • educate the users not to put long-running gadgets on their everyday dashboards (splitting a sophisticated dashboard into several simpler ones, easily accessible from dashboard tabs, does the trick),
    • calculate data for complicated reports (the ones that require processing all issues and cannot be based solely on the Lucene index) overnight and serve the results as custom gadgets to avoid recalculation (see the sketch after this list).
  • Limit the amount of concurrent heavy work:
    • decrease the number of HTTP server threads (the default of 200 is way too many). With 1M issues each thread can consume a lot of memory. When many threads serve many concurrent requests, the limited number of cores and the non-ideal scalability of the system mean that each request (and its thread) takes longer and longer to complete, which in turn increases the number of threads serving newly incoming requests. The maximum is reached very fast, and even with 16-20 GB an OOME is almost guaranteed sooner or later once several concurrent requests need to make heavy use of the index.
    • set up Apache in front of JIRA and throttle requests intelligently (if you are an HTTP wizard)
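
As a concrete illustration of the overnight pre-calculation idea mentioned above, here is a minimal sketch assuming a plain scheduled task. The ReportData type and computeOverAllIssues() are hypothetical placeholders - in a real deployment this logic would live in a JIRA plugin or service, and the custom gadget would only ever read the cached result.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch: recompute an expensive all-issues report once a day, off-peak, and cache
    // the result so that dashboard gadgets only ever read the cached copy.
    public class NightlyReportCache {

        // Hypothetical result holder - whatever the custom gadget needs to render.
        public static final class ReportData { /* aggregated counts, chart points, ... */ }

        private final AtomicReference<ReportData> latest = new AtomicReference<ReportData>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void start(long initialDelayHours) {
            // initialDelayHours should be chosen so that the first run lands in off-hours
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    // The expensive part: walks all issues. Kept strictly off-peak.
                    latest.set(computeOverAllIssues());
                }
            }, initialDelayHours, 24, TimeUnit.HOURS);
        }

        private ReportData computeOverAllIssues() {
            // Placeholder: iterate over issues with whatever API the plugin has access to.
            return new ReportData();
        }

        // Called by the custom gadget: a cheap in-memory read, never a recalculation.
        public ReportData getLatest() {
            return latest.get();
        }
    }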

Numbers & details

Take them with a grain of salt, as the timing values in particular depend on the performance of the hardware. We used Amazon servers, which are not advised for JIRA due to virtualization overhead, and a PostgreSQL instance sitting on the same machine - which may be good or bad (no network hop between JIRA and the DB vs. the DB eating up local resources).

We haven't repeated the tests to get statistically meaningful data either.
Having said that, without virtualization JIRA can only behave better, so if these figures look good to you, you can expect even better results in a non-virtualized environment.

Hardware recommendation

I suggest taking this one with a whole teaspoon of salt ;-)
We deny any responsibility if you buy hardware based on what we have written and it turns out to be insufficient in a production environment.

  • 32 GB of RAM. 24 GB should be fine, but 16 GB was barely enough for our tests. DDR3 (triple channel) is recommended, as JIRA does many memory-intensive operations - especially reading the whole index into memory.
  • Quad Core CPU. With Turbo Boost or whatever boosts single-thread performance nowadays. Two or more of them wouldn't hurt if you have the budget, but it is the RAM that is the bottleneck. The worst tasks in JIRA are single-threaded anyway (index optimization uses two threads: one for issues, one for comments; complicated search queries do not appear to be parallelized).
  • Fast disks. We tested on a RAID-0 (stripe) array of 7200 RPM hard drives, and it gave a significant boost (up to double) to index operations (effectively all operations) over a single-HDD setup. JIRA indexes are read very extensively, so we suggest trying an SSD (or a 10,000 RPM SCSI drive) for jira-home/caches - it doesn't need to be a big disk.
  • Linux. Well, we like Linux, so why not :-). If you don't want Linux, select an OS that supports the most current Sun Java 1.6 well - this is the Java version recommended by Atlassian.
  • PostgreSQL. As recommended by Atlassian. If you have a DBA who is a wizard in some other database, use that one - proper setup is more important, and JIRA doesn't really use any advanced database features.

Inserting issues

During bulk import from CSV we noticed the following:

  • The index update that happens every 50 issues took 300 ms - the same on an empty JIRA and at the 1,000,000th issue. It was 500 ms when the index was not on the RAID array.
  • JIRA was creating issues at a speed of 10/s. This also did not change over time. The limit is imposed by the speed of database inserts - we had set up a non-optimized PostgreSQL on a non-optimized ext4 file system, which is supposedly not the right solution. Issue creation speed was not our priority, so we didn't mind.
  • The index optimization that happens every 4,000 inserts started at 100 ms, but by the end it rose to 320 seconds for the fully loaded JIRA.

Disk usage

Indexes in our Million Issue JIRA occupy ~4.5 GB - this means that you can put them on some extra-fast-but-extra-expensive-per-MB media. That includes a ramdisk, but that would be too extreme (we tested it - it worked like a charm until the system killed our JIRA due to lack of system memory).

The same goes for the DB - there is relatively little data there, as attachments are stored on the file system.

During the tests we saw disk transfers reaching 30 MB/s and a lot of time spent waiting on IO, so the faster the disks you can organize, the sooner you will see the results of your complicated queries and the smaller the chance of overloading the system.

Data import

Make sure you have enough RAM. The full reindex operation performed during import never finishes if you run out of heap.

  • 40 minutes - SAX parser phase
  • 45 minutes - "storing generic values" phase
  • ~25 minutes - reindexing phase. Very disk IO intensive.

Resource usage during reindex operation:

26 minutes - not bad.

Bad smell

Notice the number of threads in the picture below and what happened to heap and CPU usage.
The new threads were HTTP server threads.

The first bump is the reindex; the second is JIRA slowly collapsing under its own load.

This happened with the Parallel garbage collector - we believe it might look a bit better with the Concurrent collector.
JIRA would have been overloaded anyway, so what needs to be done to let it survive is to limit the number of HTTP serving threads (fewer concurrent calculations).

Managerial dashboard

This one includes the killer Recently Created Chart, which took about a minute to render on our setup. We recommend you don't enable it on your production Million Issue JIRA :-)

In the real world this chart would not be that bad - you wouldn't have 1000000 issues on your "last 30 days" chart, would you?

The dashboard was refreshed every minute to simulate a manager who likes to be up-to-date with project status.

Agent dashboard

What we assumed a typical user dashboard would contain (no slow gadgets here; each one shows up in a few seconds).

One copy refreshed every 30 seconds - users should be busy doing something else, not refreshing their dashboards all the time :-)

Generating the initial data

We created a CSV file (see our CSV generator) to be imported by the built-in CSV importer.
Surprisingly, we were able to populate JIRA with more than 1 million issues in less than 2 days (at the ~10 issues/s insert rate noted above, 1.1 million inserts alone account for roughly 30 hours).
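
For illustration only, here is a minimal sketch of the kind of generator we mean - not our actual CSV generator (that one is linked above). The column names and the word list are assumptions, and only 8 components are shown for brevity; the component distribution follows the probabilities from the test setup.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Random;

    // Sketch of a CSV generator producing rows for JIRA's CSV importer.
    // Column names and word list are illustrative assumptions, not the real generator.
    public class IssueCsvGenerator {

        private static final String[] COMPONENTS =
                {"comp-1", "comp-2", "comp-3", "comp-4", "comp-5", "comp-6", "comp-7", "comp-8"};
        // Cumulative distribution matching the probabilities from the test setup:
        // 70%, 10%, 5%, 4%, 3%, 3%, 3%, 2%
        private static final int[] CUMULATIVE = {70, 80, 85, 89, 92, 95, 98, 100};

        private static final String[] WORDS = {"printer", "invoice", "timeout", "login", "refund"};
        private static final Random RANDOM = new Random();

        public static void main(String[] args) throws IOException {
            PrintWriter out = new PrintWriter(new FileWriter("issues.csv"));
            out.println("Summary,Description,Component,Assignee");   // assumed column mapping
            for (int i = 0; i < 1100000; i++) {
                out.println(randomWords(5) + "," + randomWords(20) + ","
                        + randomComponent() + ",agent-" + (1 + RANDOM.nextInt(14)));
            }
            out.close();
        }

        private static String randomComponent() {
            int roll = RANDOM.nextInt(100);
            for (int i = 0; i < CUMULATIVE.length; i++) {
                if (roll < CUMULATIVE[i]) {
                    return COMPONENTS[i];
                }
            }
            return COMPONENTS[COMPONENTS.length - 1];
        }

        private static String randomWords(int count) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < count; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(WORDS[RANDOM.nextInt(WORDS.length)]);
            }
            return sb.toString();
        }
    }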

Generating comments

We wrote a JIRA service that added comments to the existing issues.

The comments consisted of strings of random English words.

The parameters we used for the service:

  • comment count: 1 + (Gaussian distribution * 2). Effectively, 95% of issues got from 1 to 5 comments.
  • comment length in words: 10 + (Gaussian distribution * 20). 95% of comments were 10 to 50 words long.
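
A rough sketch of how these parameters could be applied - not our actual service (see the link below). Taking the absolute value of the Gaussian sample is an assumption here; with it, the quoted 95% ranges work out.

    import java.util.Random;

    // Sketch of the comment-generation parameters described above.
    // Using the absolute value of the Gaussian sample is an assumption; with it,
    // ~95% of issues get 1-5 comments and ~95% of comments are 10-50 words long.
    public class CommentParameters {

        private final Random random = new Random();

        public int commentsForIssue() {
            // comment count: 1 + (|Gaussian| * 2)
            return 1 + (int) Math.round(Math.abs(random.nextGaussian()) * 2);
        }

        public int wordsForComment() {
            // comment length in words: 10 + (|Gaussian| * 20)
            return 10 + (int) Math.round(Math.abs(random.nextGaussian()) * 20);
        }
    }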

See details of our comment generation code.

Simulating load

With JIRA populated with the initial data, it was time to perform the load tests. See the details of the load agent, including its source code.

Keeping JIRA alive

With so many issues (pun intended (wink)), JIRA becomes really sensitive to various configuration settings and usage patterns. To let it survive the load for many days, you will need to do your homework, as we did.