Understanding Artificial Documents #19545

EdWeller · 2024-10-31T22:59:00Z

I have a data structure that looks like the following:

public record Project(string Location, string CustomerName, string InvoiceId);
public record Team(string Location, string Name);

public record MilestoneKey(string StartWorkCenter, string EndWorkCenter);

public class MilestoneTracking
{
  public enum State
  {
    Added,
    Scheduled,
    Started,
    Finished
  }

  public MilestoneTracking(MilestoneKey milestoneDataKey)
  {
    MilestoneKey = milestoneDataKey;
  }

  public State CurrentState { get; set; } = State.Added;
  public MilestoneKey MilestoneKey { get; init; }

  public required DateTimeOffset PlanningStartDate { get; set; }
  public required DateTimeOffset PlanningEndDate { get; set; }
  public required DateTimeOffset? ActualStartDate { get; set; }
  public required DateTimeOffset? ActualEndDate { get; set; }

  public bool IsFinished => CurrentState == State.Finished;
  public bool IsStarted => CurrentState == State.Started;
  // IsScheduled is used by IsPending; assumed here to mirror the other state checks.
  public bool IsScheduled => CurrentState == State.Scheduled;
  public bool IsPlanned => CurrentState == State.Added;
  public bool IsPending => IsScheduled || IsPlanned;
}

public class ProjectTracking 
{
 
 public ProjectTracking (Project project, Team team) 
  {
    Project = project;
    Team = team;
  }

  public Project Project { get; } 
  public Team Team{ get; } 
 
  public DateTimeOffset RefreshDate { get; set; }     
  public Dictionary<MilestoneKey, MilestoneTracking> TrackedMilestones { get; set; } = new();
}

	public class MilestoneSummary
	{
		public int Started { get; set; }
		public int Finished { get; set; }
		public int LateStart { get; set; }
		public int PendingLateStart { get; set; }
		public int LateEnd { get; set; }
		public int PendingLateEnd { get; set; }
	}

public class MilestoneSummaryDaily : MilestoneSummary
{
  public required Team Team { get; set; }
  public required MilestoneKey MilestoneKey { get; set; }
  public int Year { get; set; }
  public int Month { get; set; }
  public int Day { get; set; }
}

I have an index that looks like this:

	public class ProjectTracking_MilestoneSummaryDaily : AbstractIndexCreationTask<ProjectTracking, MilestoneSummaryDaily>

	{

		private class MidTerms
		{
#nullable disable
			public Team Team { get; set; }
			public MilestoneKey MilestoneKey { get; set; }
			public MilestoneTracking Milestone { get; set; }
			public DateTimeOffset RefreshDate { get; set; }
#nullable enable

		}
		public ProjectTracking_MilestoneSummaryDaily()
		{
			MapConfiguration();

			ReduceConfiguration();

			StoreAllFields(FieldStorage.Yes);
			Index(i => i.Team, FieldIndexing.Exact);
			SearchEngineType = Raven.Client.Documents.Indexes.SearchEngineType.Lucene;

			// Configure artificial documents
			OutputReduceToCollection = "MilestoneSummaryDaily";
			PatternReferencesCollectionName = "MilestoneSummaryDaily/References";
			PatternForOutputReduceToCollectionReferences = crId => $"{crId.Team}/{crId.MilestoneKey}/{crId.Year}/{crId.Month}/{crId.Day}";
		}

		private void ReduceConfiguration()
		{
			Reduce = entries =>
				from entry in entries
				group entry by new
				{
					entry.Team,
					entry.MilestoneKey,
					entry.Year,
					entry.Month,
					entry.Day
				} into grp
				select new MilestoneSummaryDaily
				{
					Team = grp.Key.Team,
					MilestoneKey = grp.Key.MilestoneKey,
					Year = grp.Key.Year,
					Month = grp.Key.Month,
					Day = grp.Key.Day,
					LateEnd = grp.Sum(g => g.LateEnd),
					LateStart = grp.Sum(g => g.LateStart),
					PendingLateStart = grp.Sum(g => g.PendingLateStart),
					PendingLateEnd = grp.Sum(g => g.PendingLateEnd),
					Started = grp.Sum(g => g.Started),
					Finished = grp.Sum(g => g.Finished)
				};
		}

		private void MapConfiguration()
		{

			Map = (jobTrackings) => jobTrackings
				.SelectMany(jt => jt.TrackedMilestones, (jt, kvPair) => new MidTerms
				{
					Team = jt.Team,
					MilestoneKey = kvPair.Key,
					RefreshDate = jt.RefreshDate,
					Milestone = kvPair.Value
				})
				.Select(t => new MilestoneSummaryDaily
				{
					Team = t.Team,
					MilestoneKey = t.MilestoneKey,
					Year = t.RefreshDate.Year,
					Month = t.RefreshDate.Month,
					Day = t.RefreshDate.Day,
					// Late: the actual date is known and falls after the planned date.
					LateEnd = t.Milestone.ActualEndDate != null && t.Milestone.PlanningEndDate < t.Milestone.ActualEndDate ? 1 : 0,
					LateStart = t.Milestone.ActualStartDate != null && t.Milestone.PlanningStartDate < t.Milestone.ActualStartDate ? 1 : 0,
					// Pending late: not yet started (or finished), but the planned date has already passed.
					PendingLateStart = t.Milestone.IsPending && t.Milestone.PlanningStartDate < t.RefreshDate ? 1 : 0,
					PendingLateEnd = t.Milestone.IsStarted && t.Milestone.PlanningEndDate < t.RefreshDate ? 1 : 0,
					Started = t.Milestone.CurrentState == MilestoneTracking.State.Started ? 1 : 0,
					Finished = t.Milestone.CurrentState == MilestoneTracking.State.Finished ? 1 : 0,
				});
		}
	}

So my index works and calculates my current summary but I think my understanding of Artificial documents is off. I thought that the artificial documents would be written every time the index is updated. So I wrote a process that would update the project tracking document every day at midnight, setting the refresh date to the latest day. The index updates fine, but I was expecting my artificial documents from the previous day to stay. WRONG! :) I only have the daily summary for the last day. So I think I need a different plan.

What I am trying to do is have a rolling daily number for each milestone by team across projects. The idea is to show over time the daily summary, then roll that up into monthly and keep rolling up over time.

So what is the best approach to do this? Should I create a subscription that sees changes to the ProjectTracking documents and then just grab the index results and offload them into a daily summary document? Or do I keep my artificial documents and create a subscription on them to capture the latest summary?

Any guidance would be appreciated!

karmeli87 · 2024-11-03T09:08:32Z
Maintainer

The artificial documents are bound to the index and reflect the reduce result of that index. If the index results change, so do the relevant artificial documents.
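To illustrate why the previous day's summary disappears, here is a minimal sketch using plain in-memory LINQ rather than RavenDB itself. The `Reduce` helper and the tuple shape are hypothetical stand-ins for the index; the only point is that the grouped result is always recomputed from the documents as they are now, so moving `RefreshDate` forward empties the old day's group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the reduce step: one result per (team, day) group,
// always recomputed from the documents as they look right now.
static Dictionary<(string Team, DateTime Day), int> Reduce(
    IEnumerable<(string Team, DateTime RefreshDate)> docs) =>
    docs.GroupBy(d => (d.Team, Day: d.RefreshDate.Date))
        .ToDictionary(g => g.Key, g => g.Count());

var oct30 = new DateTime(2024, 10, 30);

// Two tracking documents for the same team, both refreshed on Oct 30.
var docs = new List<(string Team, DateTime RefreshDate)>
{
    ("TeamA", oct30),
    ("TeamA", oct30),
};
var before = Reduce(docs);

// The midnight job moves RefreshDate forward on the same documents.
docs = docs.Select(d => (d.Team, RefreshDate: d.RefreshDate.AddDays(1))).ToList();
var after = Reduce(docs);

Console.WriteLine(before[("TeamA", oct30)]);            // 2
Console.WriteLine(after.ContainsKey(("TeamA", oct30))); // False: the Oct 30 group no longer exists
Console.WriteLine(after[("TeamA", oct30.AddDays(1))]);  // 2
```

Since nothing maps to Oct 30 any more, there is no reduce result for that day, which matches the observation that only the latest day's summary remains.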

You can set up an ETL from the output collection to a different collection and filter out the deletions.

As a side note: if your ProjectTracking documents are expected to contain a lot of TrackedMilestones, those documents can end up being huge and cause performance issues down the line.

7 replies

EdWeller Nov 4, 2024
Author

Thank you for the idea of using an ETL process. I have not messed with them, so it might be a nice learning diversion. Is the ETL process faster than using a subscription and just changing the collection as needed?

On the side note: currently we are running about 30 milestones and I am seeing document sizes around 19k. Is that leaning towards too big?

I have thought about changing my project tracking document to hold an array of tracking references and then keeping the individual milestones in separate documents. The issue with that is the way the business wants to display the data. I can solve that by using includes on the tracking id array. We currently have around 500 projects active at any one time, and I load and display all of them and their milestones. I know it would be better to page the data, but convincing the company won't happen until they hit the load wall. :)

karmeli87 Nov 5, 2024
Maintainer

Using ETL over a subscription in this case is not about performance.
I simply think that ETL is a better fit here, since you don't need any client to be connected and everything can be done on the server side.
You can read here what options the Raven ETL has.

19k is perfectly fine. We consider documents huge at roughly >5MB, and even then everything will work, though you might pay some performance penalty.

Answer selected by EdWeller

EdWeller Nov 5, 2024
Author

The problem with ETL is that it is untestable without a full license, so running a DEV/QA test would not be possible.

Also, I already have a background service consolidating data from the project generation back end, so I could start a subscription service that does the work relatively easily.

ayende Nov 6, 2024
Maintainer

It is available in the developer license, which is usable for DEV / QA.

EdWeller Nov 11, 2024
Author

I don't understand what I am doing wrong then. My screen says it is not allowed:

EdWeller Nov 11, 2024
Author

Maybe this has something to do with it:
